---
title: RAG BITS Tutor
emoji: 🎓
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.45.1
app_file: app.py
pinned: false
---

# RAG Study Tutor for Business IT Strategy

**Author:** Laurel Mayer

**Module:** AI Applications (w.3KIA) - Project 3

## 1. Project Description

This project implements a Retrieval Augmented Generation (RAG) application designed to act as a "Study Tutor" for the subject "Business IT Strategy." The primary goal is to enable users to ask questions about specific course content and receive well-founded, context-based answers derived from the provided lecture materials and case studies. The application integrates a retrieval component for searching relevant text passages with a Large Language Model (LLM) for generating the final answers.

### Name & URL

| Name | URL |
|------|-----|
| Code | GitHub Repository |
| Embedding Model Page | Sahajtomar/German-semantic |
| LLM Provider (Groq) | Groq |
| Jupyter Notebook | `main_project.ipynb` |
| FAISS Index & Chunks | `/faiss_index_bits/` |

## 2. Data Sources

The knowledge base for the RAG Tutor consists of:

| Data Source | Description |
|-------------|-------------|
| Own course materials (lecture PDFs) | 13 PDF documents comprising lecture notes and case studies (including solutions) for the "Business IT Strategy" course. Total extracted text volume: approx. 221,049 characters. The files are located in the `data/` folder of this repository. |

## 3. RAG Improvements

To enhance the RAG system's performance, the following adaptation was implemented:

| Improvement | Description |
|-------------|-------------|
| Query expansion (using an LLM) | The original user query is sent to an LLM (`llama3-8b-8192` via Groq) to generate 2-3 alternative formulations or relevant keywords. These expanded queries are then additionally used for retrieval, creating a broader contextual base for final answer generation. The implementation and evaluation of this method are detailed in Section 5 of the Jupyter Notebook (`main_project.ipynb`). |
| Other potential improvements | For this project, the focus was on implementing and evaluating query expansion. Further adaptation mechanisms (e.g., re-ranking of search results, hybrid search) are discussed in the "Conclusion and Outlook" section of this document and in the notebook. |
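A minimal sketch of the expansion step (helper names and prompt wording are assumptions, not the notebook's exact code; the Groq call is shown commented out because it requires an API key):

```python
def build_expansion_prompt(query: str, n: int = 3) -> str:
    """Prompt asking the LLM for n reformulations, one per line."""
    return (
        f"Generate {n} alternative formulations or relevant keywords for the "
        f"following question. Return one per line, without numbering.\n\n"
        f"Question: {query}"
    )


def parse_expansions(raw: str) -> list[str]:
    """Split the LLM response into individual queries, dropping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


# With the Groq client (requires GROQ_API_KEY in the environment):
# from groq import Groq
# client = Groq()
# resp = client.chat.completions.create(
#     model="llama3-8b-8192",
#     messages=[{"role": "user",
#                "content": build_expansion_prompt("Was ist eine IT-Strategie?")}],
# )
# expanded_queries = parse_expansions(resp.choices[0].message.content)
```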

## 4. Chunking

### Data Chunking Method

The choice of chunking strategy is crucial for retrieval quality, as it determines how context is divided and fed to the embedding model. For this project, the text extracted from the PDFs was chunked as follows:

| Type of Chunking | Configuration | Result (Number of Chunks) |
|------------------|---------------|---------------------------|
| RecursiveCharacterTextSplitter (Langchain), chosen method | Chunk size: 1500 characters, overlap: 200 characters | 203 |

Reasoning for the chosen method: The RecursiveCharacterTextSplitter was selected because it attempts to maintain semantically coherent blocks by recursively splitting at various separators (like paragraphs, sentences, etc.). A chunk_size of 1500 characters with an overlap of 200 characters was chosen as a good starting point. The goal was to obtain chunks that contain sufficient context for understanding but are not so large as to exceed the maximum input length of embedding models or introduce too much noise for specific queries. The resulting 203 chunks represented a manageable quantity for further processing.
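The parameters above map directly onto the Langchain call; a pure-Python approximation of the sliding window (ignoring the recursive separator logic) illustrates how chunk size and overlap interact:

```python
# The call used in the project (Langchain):
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
# chunks = splitter.split_text(full_text)


def naive_chunk(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Fixed-window chunking with overlap. Unlike the recursive splitter,
    this ignores paragraph/sentence boundaries; illustration only."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so adjacent chunks share their last/first 200 characters and no sentence at a chunk boundary is lost from both sides.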

### Alternatively Considered Chunking Approaches

| Type of Chunking | Hypothetical Configuration | Potential Advantages/Disadvantages |
|------------------|----------------------------|------------------------------------|
| CharacterTextSplitter (Langchain) | Fixed chunk size (e.g., 1000), overlap (e.g., 150) | Simpler, but less regard for semantic boundaries; could split sentences or thoughts. |
| SentenceTransformersTokenTextSplitter | Based on the token limit of the embedding model (e.g., 256 tokens) | More precise adaptation to the embedding model, but requires knowledge of tokenizer specifics; would have led to a different number and granularity of chunks. |
| Smaller chunk_size with RecursiveCharacterTextSplitter | e.g., 500 characters, overlap 50 | More, but more specific, chunks. Could help with very detailed questions, but would also fragment context more and require more chunks per answer. |

Decision Process: Although other methods and configurations exist, the initial configuration of the RecursiveCharacterTextSplitter was retained for this project as it offered a good compromise between implementation effort, context preservation, and the resulting number of chunks for the chosen dataset. Deeper optimization of the chunking strategy would be a potential next step in further development to potentially enhance retrieval accuracy. The documentation of this project focuses on the overall process and the implementation of a core RAG pipeline with one form of adaptation.

## 5. Choice of LLM

LLMs accessed via the Groq API were used for this RAG application:

| LLM Name (Groq) | Used for | Link/Reference |
|-----------------|----------|----------------|
| llama3-70b-8192 | Final answer generation | Groq Models |
| llama3-8b-8192 | Query expansion (adaptation mechanism) | Groq Models |

Reasoning: llama3-70b-8192 was chosen for answer generation due to its strong performance in synthesizing information and generating coherent text. For query expansion, the smaller llama3-8b-8192 model was used to reduce the latency of this intermediate step, while still expecting good quality expansions.
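The generation step can be sketched as a context-stuffed prompt sent to `llama3-70b-8192`; the prompt wording and helper name below are assumptions for illustration, not the notebook's exact code:

```python
def build_tutor_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks into a grounded prompt for the tutor."""
    context = "\n\n---\n\n".join(context_chunks)
    return (
        "You are a study tutor for 'Business IT Strategy'. Answer the question "
        "using only the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Sent via the Groq client (requires GROQ_API_KEY):
# resp = client.chat.completions.create(
#     model="llama3-70b-8192",  # answer generation model
#     messages=[{"role": "user", "content": build_tutor_prompt(q, chunks)}],
# )
```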

## 6. Test Method

The evaluation of the RAG application and the query expansion mechanism was conducted qualitatively. Specific test questions regarding the content of the course materials were formulated. The procedure was as follows:

  1. Generate an answer based on the original user query and the chunks retrieved directly for it.
  2. Generate expanded search queries from the original user query using an LLM.
  3. Retrieve chunks based on these expanded queries, then collect and de-duplicate them to form an expanded context.
  4. Generate an answer based on the expanded context and the original user query.
  5. Conduct a qualitative comparison of the two generated answers in terms of depth of detail, correctness, and relevance to the context. The hypothesis was that query expansion could lead to more comprehensive and precise answers.

Detailed test cases and results are documented in the Jupyter Notebook (main_project.ipynb) in Sections 4.2 and 5.
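Steps 2-4 above can be sketched as a single function; `retrieve`, `expand`, and `generate` stand in for the project's own components (names hypothetical):

```python
def answer_with_expansion(query, retrieve, expand, generate, k: int = 4):
    """Retrieve for the original query plus its expansions, de-duplicate the
    chunks in retrieval order, and answer from the merged context."""
    seen, context = set(), []
    for q in [query] + expand(query):
        for chunk in retrieve(q, k):
            if chunk not in seen:  # de-duplicate across query variants
                seen.add(chunk)
                context.append(chunk)
    return generate(query, context)
```

The baseline answer of step 1 is simply `generate(query, retrieve(query, k))`, which makes the side-by-side comparison of step 5 straightforward.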

## 7. Results

As the evaluation was primarily qualitative, the main observations are summarized here. Detailed examples can be found in the notebook.

| Model/Method | Observation |
|--------------|-------------|
| Base RAG (original query) | Provides precise, good answers for direct questions (e.g., "Was ist eine IT-Strategie?"). |
| RAG with query expansion | For "Was ist eine IT-Strategie?", there was hardly any difference compared to the base RAG. For "Welche Rolle spielt IT-Governance?", query expansion led to a visibly more detailed and comprehensive answer that included additional relevant aspects. |

Conclusion of Results: Query expansion can improve answer quality by providing a broader and more relevant context for the LLM. However, the added value is highly dependent on the initial question and the quality of the generated expansions.

## 8. Setup and Execution

To run this project locally:

1. **Prerequisites:**
   - Python 3.10 or higher (Python 3.12 was used).
   - Git.
2. **Clone the repository:**
   ```bash
   git clone https://github.com/patronlaurel/RAG-BITS-Tutor.git
   cd RAG-BITS-Tutor
   ```
3. **Create and activate a virtual environment:**
   ```bash
   python -m venv .venv
   # Windows:
   .\.venv\Scripts\activate
   # macOS/Linux:
   source .venv/bin/activate
   ```
4. **Install the dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
   (The `requirements.txt` file was generated with `pip freeze > requirements.txt` and is included in the repository.)
5. **Set up the API key:**
   - Create a file named `.env` in the project's root directory.
   - Add your Groq API key: `GROQ_API_KEY=your_groq_api_key`
6. **Start Jupyter:**
   ```bash
   jupyter lab
   ```
   Then open the notebook `main_project.ipynb`. The PDF data must be placed in the `data/` folder. The FAISS index (`faiss_index_bits/bits_tutor.index`) and chunks (`faiss_index_bits/bits_chunks.pkl`) are created and saved during the first run of Section 3.4 in the notebook.

## 9. Technologies and Libraries Used

  • Python 3.12
  • Jupyter Lab
  • Langchain
  • Sentence Transformers (Sahajtomar/German-semantic)
  • FAISS (Facebook AI Similarity Search)
  • Groq API (llama3-70b-8192, llama3-8b-8192)
  • PyPDF2
  • NumPy
  • Dotenv
  • Tqdm

## 10. Conclusion and Outlook

This project successfully demonstrated the construction of a RAG application as a "Study Tutor." By implementing LLM-based query expansion, it was shown how the depth of detail and informational content of answers can be improved for certain queries. Key insights relate to the importance of data quality, appropriate model selection, and the potential of adaptation mechanisms. Future work could focus on extended evaluation methods, exploring further adaptation techniques such as re-ranking, or developing an interactive user interface.
