---
title: RAG BITS Tutor
emoji: 🎓
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.45.1
app_file: app.py
pinned: false
---

# RAG Study Tutor for Business IT Strategy

**Author:** Laurel Mayer

**Module:** AI Applications (w.3KIA) - Project 3

## 1. Project Description

This project implements a Retrieval Augmented Generation (RAG) application designed to act as a "Study Tutor" for the subject "Business IT Strategy." The primary goal is to enable users to ask questions about specific course content and receive well-founded, context-based answers derived from the provided lecture materials and case studies. The application integrates a retrieval component for searching relevant text passages with a Large Language Model (LLM) for generating the final answers.

### Name & URL

| Name | URL |
|------|-----|
| Code | GitHub Repository |
| Embedding Model Page | Sahajtomar/German-semantic |
| LLM Provider (Groq) | Groq |
| Jupyter Notebook | `main_project.ipynb` |
| FAISS Index & Chunks | `/faiss_index_bits/` |

## 2. Data Sources

The knowledge base for the RAG Tutor consists of:

| Data Source | Description |
|-------------|-------------|
| Own course materials (lecture PDFs) | 13 PDF documents comprising lecture notes and case studies (including solutions) for the "Business IT Strategy" course. Total extracted text volume: approx. 221,049 characters. The files are located in the `data/` folder of this repository. |

## 3. RAG Improvements

To enhance the RAG system's performance, the following adaptation was implemented:

| Improvement | Description |
|-------------|-------------|
| Query expansion (using an LLM) | The original user query is sent to an LLM (`llama3-8b-8192` via Groq) to generate 2-3 alternative formulations or relevant keywords. These expanded queries are then additionally used for retrieval, creating a broader contextual base for final answer generation. The implementation and evaluation of this method are detailed in Section 5 of the Jupyter Notebook (`main_project.ipynb`). |
| Other potential improvements | For this project, the focus was on implementing and evaluating query expansion. Further adaptation mechanisms (e.g., re-ranking of search results, hybrid search) are discussed in the "Conclusion and Outlook" section of this document and in the notebook. |
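A minimal sketch of the expansion step (helper names and prompt wording are assumptions, not the notebook's exact code; the Groq call is shown commented out because it requires an API key):

```python
def build_expansion_prompt(query: str, n: int = 3) -> str:
    """Prompt asking the LLM for n reformulations, one per line."""
    return (
        f"Generate {n} alternative formulations or relevant keywords for the "
        f"following question. Return one per line, without numbering.\n\n"
        f"Question: {query}"
    )


def parse_expansions(raw: str) -> list[str]:
    """Split the LLM response into individual queries, dropping blank lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


# With the Groq client (requires GROQ_API_KEY in the environment):
# from groq import Groq
# client = Groq()
# resp = client.chat.completions.create(
#     model="llama3-8b-8192",
#     messages=[{"role": "user",
#                "content": build_expansion_prompt("Was ist eine IT-Strategie?")}],
# )
# expanded_queries = parse_expansions(resp.choices[0].message.content)
```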

## 4. Chunking

### Data Chunking Method

The choice of chunking strategy is crucial for retrieval quality, as it determines how context is divided and fed to the embedding model. For this project, the text extracted from the PDFs was chunked as follows:

| Type of Chunking | Configuration | Result (Number of Chunks) |
|------------------|---------------|---------------------------|
| RecursiveCharacterTextSplitter (Langchain), chosen method | Chunk size: 1500 characters, overlap: 200 characters | 203 |

Reasoning for the chosen method: The RecursiveCharacterTextSplitter was selected because it attempts to maintain semantically coherent blocks by recursively splitting at various separators (like paragraphs, sentences, etc.). A chunk_size of 1500 characters with an overlap of 200 characters was chosen as a good starting point. The goal was to obtain chunks that contain sufficient context for understanding but are not so large as to exceed the maximum input length of embedding models or introduce too much noise for specific queries. The resulting 203 chunks represented a manageable quantity for further processing.
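The parameters above map directly onto the Langchain call; a pure-Python approximation of the sliding window (ignoring the recursive separator logic) illustrates how chunk size and overlap interact:

```python
# The call used in the project (Langchain):
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
# chunks = splitter.split_text(full_text)


def naive_chunk(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Fixed-window chunking with overlap. Unlike the recursive splitter,
    this ignores paragraph/sentence boundaries; illustration only."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk starts `chunk_size - overlap` characters after the previous one, so adjacent chunks share their last/first 200 characters and no sentence at a chunk boundary is lost from both sides.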

### Alternatively Considered Chunking Approaches

| Type of Chunking | Hypothetical Configuration | Potential Advantages/Disadvantages |
|------------------|----------------------------|------------------------------------|
| CharacterTextSplitter (Langchain) | Fixed chunk size (e.g., 1000), overlap (e.g., 150) | Simpler, but less regard for semantic boundaries; could split sentences or thoughts. |
| SentenceTransformersTokenTextSplitter | Based on the token limit of the embedding model (e.g., 256 tokens) | More precise adaptation to the embedding model, but requires knowledge of tokenizer specifics; would have led to a different number and granularity of chunks. |
| Smaller chunk_size with RecursiveCharacterTextSplitter | e.g., 500 characters, overlap 50 | More, but more specific, chunks. Could help with very detailed questions, but would also fragment context more and require more chunks per answer. |

Decision Process: Although other methods and configurations exist, the initial configuration of the RecursiveCharacterTextSplitter was retained for this project as it offered a good compromise between implementation effort, context preservation, and the resulting number of chunks for the chosen dataset. Deeper optimization of the chunking strategy would be a potential next step in further development to potentially enhance retrieval accuracy. The documentation of this project focuses on the overall process and the implementation of a core RAG pipeline with one form of adaptation.

## 5. Choice of LLM

LLMs accessed via the Groq API were used for this RAG application:

| LLM Name (Groq) | Used for | Link/Reference |
|-----------------|----------|----------------|
| llama3-70b-8192 | Final answer generation | Groq Models |
| llama3-8b-8192 | Query expansion (adaptation mechanism) | Groq Models |

Reasoning: llama3-70b-8192 was chosen for answer generation due to its strong performance in synthesizing information and generating coherent text. For query expansion, the smaller llama3-8b-8192 model was used to reduce the latency of this intermediate step, while still expecting good quality expansions.
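The generation step can be sketched as a context-stuffed prompt sent to `llama3-70b-8192`; the prompt wording and helper name below are assumptions for illustration, not the notebook's exact code:

```python
def build_tutor_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks into a grounded prompt for the tutor."""
    context = "\n\n---\n\n".join(context_chunks)
    return (
        "You are a study tutor for 'Business IT Strategy'. Answer the question "
        "using only the context below. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Sent via the Groq client (requires GROQ_API_KEY):
# resp = client.chat.completions.create(
#     model="llama3-70b-8192",  # answer generation model
#     messages=[{"role": "user", "content": build_tutor_prompt(q, chunks)}],
# )
```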

## 6. Test Method

The evaluation of the RAG application and the query expansion mechanism was conducted qualitatively. Specific test questions regarding the content of the course materials were formulated. The procedure was as follows:

  1. Generate an answer based on the original user query and the chunks retrieved directly for it.
  2. Generate expanded search queries from the original user query using an LLM.
  3. Retrieve chunks based on these expanded queries, then collect and de-duplicate them to form an expanded context.
  4. Generate an answer based on the expanded context and the original user query.
  5. Conduct a qualitative comparison of the two generated answers in terms of depth of detail, correctness, and relevance to the context. The hypothesis was that query expansion could lead to more comprehensive and precise answers.

Detailed test cases and results are documented in the Jupyter Notebook (main_project.ipynb) in Sections 4.2 and 5.
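Steps 2-4 above can be sketched as a single function; `retrieve`, `expand`, and `generate` stand in for the project's own components (names hypothetical):

```python
def answer_with_expansion(query, retrieve, expand, generate, k: int = 4):
    """Retrieve for the original query plus its expansions, de-duplicate the
    chunks in retrieval order, and answer from the merged context."""
    seen, context = set(), []
    for q in [query] + expand(query):
        for chunk in retrieve(q, k):
            if chunk not in seen:  # de-duplicate across query variants
                seen.add(chunk)
                context.append(chunk)
    return generate(query, context)
```

The baseline answer of step 1 is simply `generate(query, retrieve(query, k))`, which makes the side-by-side comparison of step 5 straightforward.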

## 7. Results

As the evaluation was primarily qualitative, the main observations are summarized here. Detailed examples can be found in the notebook.

| Model/Method | Observation |
|--------------|-------------|
| Base RAG (original query) | Provides precise, good answers for direct questions (e.g., "Was ist eine IT-Strategie?"). |
| RAG with query expansion | For "Was ist eine IT-Strategie?", there was hardly any difference compared to the base RAG. For "Welche Rolle spielt IT-Governance?", query expansion led to a visibly more detailed and comprehensive answer that included additional relevant aspects. |

Conclusion of Results: Query expansion can improve answer quality by providing a broader and more relevant context for the LLM. However, the added value is highly dependent on the initial question and the quality of the generated expansions.

## 8. Setup and Execution

To run this project locally:

1. **Prerequisites:**
   - Python 3.10 or higher (Python 3.12 was used).
   - Git.
2. **Clone the repository:**
   ```bash
   git clone https://github.com/patronlaurel/RAG-BITS-Tutor.git
   cd RAG-BITS-Tutor
   ```
3. **Create and activate a virtual environment:**
   ```bash
   python -m venv .venv
   # Windows:
   .\.venv\Scripts\activate
   # macOS/Linux:
   source .venv/bin/activate
   ```
4. **Install the dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
   (The `requirements.txt` file was generated with `pip freeze > requirements.txt` and is included in the repository.)
5. **Set up the API key:**
   - Create a file named `.env` in the project's root directory.
   - Add your Groq API key: `GROQ_API_KEY=your_groq_api_key`
6. **Start Jupyter:**
   ```bash
   jupyter lab
   ```
   Then open the notebook `main_project.ipynb`. The PDF data must be placed in the `data/` folder. The FAISS index (`faiss_index_bits/bits_tutor.index`) and chunks (`faiss_index_bits/bits_chunks.pkl`) are created and saved during the first run of Section 3.4 in the notebook.

## 9. Technologies and Libraries Used

  • Python 3.12
  • Jupyter Lab
  • Langchain
  • Sentence Transformers (Sahajtomar/German-semantic)
  • FAISS (Facebook AI Similarity Search)
  • Groq API (llama3-70b-8192, llama3-8b-8192)
  • PyPDF2
  • NumPy
  • Dotenv
  • Tqdm

## 10. Conclusion and Outlook

This project successfully demonstrated the construction of a RAG application as a "Study Tutor." By implementing LLM-based query expansion, it was shown how the depth of detail and informational content of answers can be improved for certain queries. Key insights relate to the importance of data quality, appropriate model selection, and the potential of adaptation mechanisms. Future work could focus on extended evaluation methods, exploring further adaptation techniques such as re-ranking, or developing an interactive user interface.
