---
title: RAG BITS Tutor
emoji: 🎓
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.45.1"
app_file: app.py
pinned: false
---

# RAG Study Tutor for Business IT Strategy

**Author:** Laurel Mayer
**Module:** AI Applications (w.3KIA) - Project 3

## 1. Project Description

This project implements a Retrieval-Augmented Generation (RAG) application designed to act as a "Study Tutor" for the subject "Business IT Strategy." The primary goal is to enable users to ask questions about specific course content and receive well-founded, context-based answers derived from the provided lecture materials and case studies. The application combines a retrieval component, which searches for relevant text passages, with a Large Language Model (LLM) that generates the final answers.

### Name & URL

| Name                  | URL                                                                                      |
|-----------------------|------------------------------------------------------------------------------------------|
| Code                  | [GitHub Repository](https://github.com/patronlaurel/RAG-BITS-Tutor)                 |
| Embedding Model Page  | [Sahajtomar/German-semantic](https://huggingface.co/Sahajtomar/German-semantic)          |
| LLM Provider (Groq)   | [Groq](https://groq.com/)                                                                |
| Jupyter Notebook      | [main_project.ipynb](main_project.ipynb)                                                 |
| FAISS Index & Chunks  | [/faiss_index_bits/](faiss_index_bits/)                                                  |


## 2. Data Sources

The knowledge base for the RAG Tutor consists of:

| Data Source                          | Description                                                                                                |
|--------------------------------------|------------------------------------------------------------------------------------------------------------|
| Own course materials (lecture PDFs)  | 13 PDF documents comprising lecture notes and case studies (including solutions) for the "Business IT Strategy" course, located in the `data/` folder of this repository. Total extracted text: approx. 221,049 characters (see the extraction sketch below). |
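
As a rough illustration of this extraction step, here is a minimal sketch using PyPDF2 (listed under Technologies in Section 9). The loop and variable names are illustrative; the notebook's exact code may differ.

```python
from pathlib import Path
from PyPDF2 import PdfReader

full_text = ""
for pdf_path in sorted(Path("data").glob("*.pdf")):
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        full_text += page.extract_text() or ""  # guard against empty pages

print(f"Extracted {len(full_text):,} characters")  # approx. 221,049 here
```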

## 3. RAG Improvements

To enhance the RAG system's performance, the following adaptation was implemented:

| Improvement                                     | Description                                                                                                                                                                                                  |
|-------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Query Expansion` (using LLM)                 | The original user query is sent to an LLM (`llama3-8b-8192` via Groq) to generate two to three alternative formulations or relevant keywords. These expanded queries are then used for retrieval alongside the original query, creating a broader contextual base for final answer generation (see the sketch after this table). The implementation and evaluation of this method are detailed in Section 5 of the Jupyter Notebook (`main_project.ipynb`). |
| Other Potential Improvements                  | For this project, the focus was on implementing and evaluating Query Expansion. Further potential improvements and adaptation mechanisms (e.g., re-ranking of search results, hybrid search) are discussed in the "Conclusion and Outlook" section of this document and in the notebook. |
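
A minimal sketch of the query-expansion call, assuming the official `groq` Python client; the prompt wording and helper name are illustrative, not the notebook's exact code:

```python
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def expand_query(original_query: str, n_variants: int = 3) -> list[str]:
    """Ask the small LLM for alternative formulations of the user query."""
    prompt = (
        f"Generate {n_variants} alternative formulations or relevant "
        f"keywords for the following study question, one per line:\n"
        f"{original_query}"
    )
    response = client.chat.completions.create(
        model="llama3-8b-8192",  # the smaller model keeps latency low
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    variants = response.choices[0].message.content.strip().splitlines()
    # Retrieval then runs on the original query plus each non-empty variant.
    return [original_query] + [v.strip() for v in variants if v.strip()]
```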

## 4. Chunking

### Data Chunking Method

The choice of chunking strategy is crucial for retrieval quality, as it determines how context is divided and fed to the embedding model. For this project, the text extracted from the PDFs was chunked as follows:

| Type of Chunking                 | Configuration                                   | Result (Number of Chunks) |
|----------------------------------|-------------------------------------------------|---------------------------|
| **`RecursiveCharacterTextSplitter` (Langchain) - Chosen Method** | Chunk Size: 1500 characters, Overlap: 200 characters | 203                       |

**Reasoning for the chosen method:**
The `RecursiveCharacterTextSplitter` was selected because it attempts to maintain semantically coherent blocks by recursively splitting at various separators (like paragraphs, sentences, etc.). A `chunk_size` of 1500 characters with an `overlap` of 200 characters was chosen as a good starting point. The goal was to obtain chunks that contain sufficient context for understanding but are not so large as to exceed the maximum input length of embedding models or introduce too much noise for specific queries. The resulting 203 chunks represented a manageable quantity for further processing.
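
A minimal sketch of this configuration (depending on the Langchain version, the splitter may need to be imported from `langchain_text_splitters` instead):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # characters per chunk
    chunk_overlap=200,  # shared context between neighbouring chunks
)
chunks = splitter.split_text(full_text)  # `full_text`: concatenated PDF text
print(len(chunks))  # 203 for this corpus
```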

**Alternatively Considered Chunking Approaches:**

| Type of Chunking                           | Hypothetical Configuration/Consideration        | Potential Advantages/Disadvantages                                                                 |
|--------------------------------------------|-------------------------------------------------|----------------------------------------------------------------------------------------------------|
| `CharacterTextSplitter` (Langchain)        | Fixed Chunk Size (e.g., 1000), Overlap (e.g., 150)  | Simpler, but less regard for semantic boundaries; could split sentences/thoughts.                |
| `SentenceTransformersTokenTextSplitter`    | Based on token limits of the embedding model (e.g., 256 Tokens) | More precise adaptation to the embedding model, but requires knowledge of tokenizer specifics. Could have led to a different number and granularity of chunks. |
| Smaller `chunk_size` with `RecursiveCharacterTextSplitter` | e.g., 500 characters, Overlap 50                    | More, but more specific chunks. Could help with very detailed questions, but also fragment context more and require more chunks for an answer. |

*Decision Process:* Although other methods and configurations exist, the initial configuration of the `RecursiveCharacterTextSplitter` was retained for this project, as it offered a good compromise between implementation effort, context preservation, and the resulting number of chunks for the chosen dataset. Deeper optimization of the chunking strategy would be a natural next step in further development to enhance retrieval accuracy. The documentation of this project focuses on the overall process and the implementation of a core RAG pipeline with one form of adaptation.

## 5. Choice of LLM

LLMs accessed via the Groq API were used for this RAG application:

| LLM Name (Groq)       | Used for                                   | Link/Reference                     |
|-----------------------|--------------------------------------------|------------------------------------|
| `llama3-70b-8192`    | Final Answer Generation                  | [Groq Models](https://console.groq.com/docs/models) |
| `llama3-8b-8192`     | Query Expansion (Adaptation Mechanism)     | [Groq Models](https://console.groq.com/docs/models) |

*Reasoning:* `llama3-70b-8192` was chosen for answer generation due to its strong performance in synthesizing information and generating coherent text. For query expansion, the smaller `llama3-8b-8192` model was used to reduce the latency of this intermediate step, while still expecting good quality expansions.
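
A sketch of how the final answer generation might look with the Groq client; the system prompt is an illustrative assumption, not the notebook's exact wording:

```python
from groq import Groq

def generate_answer(client: Groq, query: str, context_chunks: list[str]) -> str:
    """Synthesize a final answer from the retrieved context."""
    context = "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="llama3-70b-8192",  # larger model for the final synthesis
        messages=[
            {
                "role": "system",
                "content": "Answer the study question using only the provided context.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```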

## 6. Test Method

The evaluation of the RAG application and the query expansion mechanism was conducted qualitatively. Specific test questions regarding the content of the course materials were formulated.
The procedure was as follows:
1.  Generate an answer based on the **original user query** and the chunks retrieved directly for it.
2.  Generate **expanded search queries** from the original user query using an LLM.
3.  Retrieve chunks based on these expanded queries, then collect and de-duplicate them to form an **expanded context** (see the sketch after this list).
4.  Generate an answer based on the expanded context and the original user query.
5.  Conduct a **qualitative comparison** of the two generated answers in terms of depth of detail, correctness, and relevance to the context.
The hypothesis was that query expansion could lead to more comprehensive and precise answers.
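
A sketch of the pooling and de-duplication in step 3; `retrieve` is passed in as an assumed top-k search function (e.g. a FAISS lookup), since the notebook's exact retrieval helper is not reproduced here:

```python
from typing import Callable

def build_expanded_context(
    queries: list[str],
    retrieve: Callable[[str, int], list[str]],  # assumed top-k chunk search
    k: int = 4,
) -> list[str]:
    """Pool chunks across all queries, keeping the first occurrence of each."""
    seen: set[str] = set()
    pooled: list[str] = []
    for q in queries:
        for chunk in retrieve(q, k):
            if chunk not in seen:  # de-duplicate identical chunks
                seen.add(chunk)
                pooled.append(chunk)
    return pooled
```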

Detailed test cases and results are documented in the Jupyter Notebook (`main_project.ipynb`) in Sections 4.2 and 5.

## 7. Results

As the evaluation was primarily qualitative, the main observations are summarized here. Detailed examples can be found in the notebook.

| Model/Method                               | Observation                                                                                                                                                                                                                                                          |
|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Base RAG (Original Query)                | Provides precise, solid answers for direct questions (e.g., "Was ist eine IT-Strategie?" – "What is an IT strategy?").                                                                                   |
| RAG with Query Expansion                    | For "Was ist eine IT-Strategie?" ("What is an IT strategy?"), there was hardly any difference compared to the base RAG. For "Welche Rolle spielt IT-Governance?" ("What role does IT governance play?"), query expansion led to a **visibly more detailed and comprehensive answer** that included additional relevant aspects. |

**Conclusion of Results**: Query expansion can improve answer quality by providing a broader and more relevant context for the LLM. However, the added value is highly dependent on the initial question and the quality of the generated expansions.

## 8. Setup and Execution

To run this project locally:

1.  **Prerequisites**:
    * Python 3.10 or higher (Python 3.12 was used).
    * Git.
2.  **Clone Repository**:
    ```bash
    git clone https://github.com/patronlaurel/RAG-BITS-Tutor.git
    cd RAG-BITS-Tutor
    ```
3.  **Create and Activate Virtual Environment**:
    ```bash
    python -m venv .venv
    # Windows:
    .\.venv\Scripts\activate
    # macOS/Linux:
    source .venv/bin/activate
    ```
4.  **Install Dependencies**:
    ```bash
    pip install -r requirements.txt
    ```
    (The `requirements.txt` file was generated with `pip freeze > requirements.txt` from the project environment and is included in the repository.)
5.  **Set up API Key**:
    * Create a file named `.env` in the project's root directory.
    * Add your Groq API key: `GROQ_API_KEY=your_groq_api_key`
6.  **Start Jupyter Notebook**:
    ```bash
    jupyter lab
    ```
    Then open the notebook `main_project.ipynb`. The PDF data must be placed in the `data/` folder. The FAISS index (`faiss_index_bits/bits_tutor.index`) and chunks (`faiss_index_bits/bits_chunks.pkl`) are created and saved during the first run of Section 3.4 in the notebook.
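
For orientation, the following is a minimal sketch of what this index build might look like. It assumes `chunks` comes from the splitting step in Section 4 and uses the standard sentence-transformers API; only the file names are taken from this README.

```python
import pickle

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Sahajtomar/German-semantic")
embeddings = model.encode(chunks, show_progress_bar=True)  # shape (n, dim)

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 nearest-neighbour search
index.add(embeddings.astype("float32"))

faiss.write_index(index, "faiss_index_bits/bits_tutor.index")
with open("faiss_index_bits/bits_chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)
```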

## 9. Technologies and Libraries Used
* Python 3.12
* Jupyter Lab
* Langchain
* Sentence Transformers (`Sahajtomar/German-semantic`)
* FAISS (Facebook AI Similarity Search)
* Groq API (`llama3-70b-8192`, `llama3-8b-8192`)
* PyPDF2
* NumPy
* python-dotenv
* Tqdm

## 10. Conclusion and Outlook
This project successfully demonstrated the construction of a RAG application as a "Study Tutor." By implementing LLM-based query expansion, it was shown how the depth of detail and informational content of answers can be improved for certain queries. Key insights relate to the importance of data quality, appropriate model selection, and the potential of adaptation mechanisms. Future work could focus on extended evaluation methods, exploring further adaptation techniques like re-ranking, or developing an interactive user interface.
