	AI Tutor App Instructions for Claude
Project Overview
This is an AI tutor application that uses RAG (Retrieval Augmented Generation) to provide accurate responses about AI concepts by searching through multiple documentation sources. The application has a Gradio UI and uses ChromaDB for vector storage.
Key Repositories and URLs
- Main code: https://github.com/towardsai/ai-tutor-app
- Live demo: https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot
- Vector database: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
- Private JSONL repo: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data
Architecture Overview
- Frontend: Gradio-based UI in scripts/main.py
- Retrieval: Custom retriever using ChromaDB vector stores
- Embedding: Cohere embeddings for vector search
- LLM: OpenAI models (GPT-4o, etc.) for context addition and responses
- Storage: Individual JSONL files per source + combined file for retrieval
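The retrieval path can be exercised outside the UI. Below is a minimal sketch, assuming the Cohere Python SDK and a ChromaDB collection persisted in chroma-db-all_sources; the collection name, embedding model, and metadata keys are assumptions to verify against scripts/main.py:
# Minimal retrieval sketch: embed a query with Cohere, search the ChromaDB store.
# Collection name, model, and metadata keys are assumptions -- check scripts/main.py.
import os
import chromadb
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])
chroma = chromadb.PersistentClient(path="chroma-db-all_sources")
collection = chroma.get_collection("all_sources")  # hypothetical collection name

query_embedding = co.embed(
    texts=["What is retrieval augmented generation?"],
    model="embed-english-v3.0",  # assumed embedding model
    input_type="search_query",
).embeddings[0]

results = collection.query(query_embeddings=[query_embedding], n_results=5)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta.get("source", "unknown"), doc[:80])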
Data Update Workflows
1. Adding a New Course
python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
- This requires the course to be configured in process_md_files.py under SOURCE_CONFIGS
- The workflow will pause for manual URL addition after processing markdown files
- Only new content will have context added by default (efficient)
- Use --process-all-context if you need to regenerate context for all documents
- Both database and data files are uploaded to HuggingFace by default
- Use --skip-data-upload if you don't want to upload data files (see the example invocation after this list)
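For example, to rebuild context for every document while skipping the data-file upload (the course name is a placeholder and must match a SOURCE_CONFIGS key):
python data/scraping_scripts/add_course_workflow.py --course new_course --process-all-context --skip-data-upload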
2. Updating Documentation from GitHub
python data/scraping_scripts/update_docs_workflow.py
- Updates all supported documentation sources (or specify specific ones with --sources)
- Downloads fresh documentation from GitHub repositories
- Only new content will have context added by default (efficient)
- Use --process-all-context if you need to regenerate context for all documents
- Both database and data files are uploaded to HuggingFace by default
- Use --skip-data-upload if you don't want to upload data files
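For example, to refresh a single documentation source (the source name and the exact argument format for --sources are assumptions; check the script's argument parser):
python data/scraping_scripts/update_docs_workflow.py --sources transformers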
3. Data File Management
# Upload both JSONL and PKL files to private HuggingFace repository
python data/scraping_scripts/upload_data_to_hf.py
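Conceptually this is a huggingface_hub upload of the data files to the private dataset repo. A rough sketch (local file paths are assumptions; the actual script may upload additional files):
# Sketch of pushing data files to the private HF dataset repo; not the actual script.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
for path in ["data/all_sources_data.jsonl", "data/all_sources_contextual_nodes.pkl"]:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=os.path.basename(path),
        repo_id="towardsai-tutors/ai-tutor-data",
        repo_type="dataset",
    )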
Data Flow and File Relationships
Document Processing Pipeline
- Markdown Files → process_md_files.py → Individual JSONL files (e.g., transformers_data.jsonl)
- Individual JSONL files → combine_all_sources() → all_sources_data.jsonl
- all_sources_data.jsonl → add_context_to_nodes.py → all_sources_contextual_nodes.pkl
- all_sources_contextual_nodes.pkl → create_vector_stores.py → ChromaDB vector stores
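The combine step is essentially JSONL concatenation. An illustrative sketch of what combine_all_sources() does (file paths are assumptions, and the real function may do extra validation):
# Illustrative sketch of merging per-source JSONL files into all_sources_data.jsonl.
import json
from pathlib import Path

def combine_jsonl(source_files, output_file):
    with open(output_file, "w", encoding="utf-8") as out:
        for src in source_files:
            for line in Path(src).read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(json.dumps(json.loads(line)) + "\n")

combine_jsonl(["data/transformers_data.jsonl"], "data/all_sources_data.jsonl")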
Important Files and Their Purpose
- all_sources_data.jsonl - Combined raw document data without context
- Source-specific JSONL files (e.g., transformers_data.jsonl) - Raw data for individual sources
- all_sources_contextual_nodes.pkl - Processed nodes with added context
- chroma-db-all_sources - Vector database directory containing embeddings
- document_dict_all_sources.pkl - Dictionary mapping document IDs to full documents
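To sanity-check these artifacts locally, something like the following works (paths assume the repository layout; record fields are not guaranteed):
# Quick inspection of the combined JSONL file and the contextual-nodes pickle.
import json
import pickle

with open("data/all_sources_data.jsonl", encoding="utf-8") as f:
    docs = [json.loads(line) for line in f if line.strip()]
print(len(docs), "raw documents")

with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
    nodes = pickle.load(f)
print(len(nodes), "contextual nodes")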
Configuration Details
Adding a New Course Source
- Update SOURCE_CONFIGS in process_md_files.py:
"new_course": {
    "base_url": "",
    "input_directory": "data/new_course",
    "output_file": "data/new_course_data.jsonl",
    "source_name": "new_course",
    "use_include_list": False,
    "included_dirs": [],
    "excluded_dirs": [],
    "excluded_root_files": [],
    "included_root_files": [],
    "url_extension": "",
},
- Update UI configurations in:
  - setup.py: Add to AVAILABLE_SOURCES and AVAILABLE_SOURCES_UI
  - main.py: Add mapping in source_mapping dictionary
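For illustration, the additions look roughly like this (labels and keys are examples only, not the repo's actual values):
# setup.py -- the new source key must appear in both lists (values are examples).
AVAILABLE_SOURCES_UI = ["Transformers Docs", "New Course"]
AVAILABLE_SOURCES = ["transformers", "new_course"]

# scripts/main.py -- map the UI label to the source key.
source_mapping = {"Transformers Docs": "transformers", "New Course": "new_course"}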
 
Deployment and Publishing
GitHub Actions Workflow
The application is automatically deployed to HuggingFace Spaces when changes are pushed to the main branch (excluding documentation and scraping scripts).
Manual Deployment
git push --force https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot main:main
Development Environment Setup
Required Environment Variables
- OPENAI_API_KEY - For LLM processing
- COHERE_API_KEY - For embeddings
- HF_TOKEN - For HuggingFace uploads
- GITHUB_TOKEN - For accessing documentation via the GitHub API
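A small pre-flight check can catch missing variables before any script runs (convenience sketch, not part of the repo):
# Fail fast if a required environment variable is missing.
import os

REQUIRED = ["OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit("Missing environment variables: " + ", ".join(missing))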
Running the Application Locally
# Install dependencies
pip install -r requirements.txt
# Start the Gradio UI
python scripts/main.py
Important Notes
When adding new courses, make sure to:
- Place markdown files exported from Notion in the appropriate directory
- Add URLs manually from the live course platform
- Example URL format: https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure
- Configure the course in process_md_files.py
- Verify it appears in the UI after deployment
For updating documentation:
- The GitHub API is used to fetch the latest documentation
- The workflow handles updating existing sources without affecting course data
For efficient context addition:
- Only new content gets processed by default
- Old nodes for updated sources are removed from the PKL file
- This ensures no duplicate content in the vector database
Technical Details for Debugging
Node Removal Logic
- When adding context, the workflow now removes existing nodes for sources being updated
- This prevents duplication of content in the vector database
- The source of each node is extracted from either node.source_node.metadata or node.metadata
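A sketch of that filtering logic, assuming LlamaIndex-style node objects and a "source" metadata key (both are assumptions; the real logic lives in add_context_to_nodes.py):
# Drop nodes belonging to sources that are being re-processed so the refreshed
# nodes do not duplicate them. The "source" metadata key is an assumption.
def get_source(node):
    if getattr(node, "source_node", None) and node.source_node.metadata.get("source"):
        return node.source_node.metadata["source"]
    return node.metadata.get("source", "unknown")

def drop_updated_sources(nodes, updated_sources):
    return [n for n in nodes if get_source(n) not in updated_sources]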
Performance Considerations
- Context addition is the most time-consuming step (uses OpenAI API)
- The new default behavior only processes new content
- For large updates, consider running in batches (see the sketch after this list)
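A simple way to batch the context-addition calls (illustrative only; the batch size and the add_context helper are hypothetical):
# Process new nodes in fixed-size batches to keep OpenAI API usage manageable.
def batched(items, size=100):
    for start in range(0, len(items), size):
        yield items[start:start + size]

# for batch in batched(new_nodes):
#     add_context(batch)  # hypothetical stand-in for the real context-addition call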