Nullpointer-KK committed on
Commit
ed4ed40
·
unverified ·
1 Parent(s): 699928a

Add files via upload

Browse files
Files changed (3)
  1. .gitignore +179 -0
  2. CLAUDE.md +142 -0
  3. requirements.txt +21 -0
.gitignore ADDED
@@ -0,0 +1,179 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/#use-with-ide
+ .pdm.toml
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+ ai-tutor/
+ venv_ai_tutor/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
+
+ .vscode/
+ data/chroma-db**/
+ evaluation_data/chroma-db**/
+
+ .huggingface
+
+ .DS_Store
+
+ *.csv
+ *.json
+ *.jsonl
+ *.html
+ *.mdx
+ *.pkl
+ *.png
+ *.mov
CLAUDE.md ADDED
@@ -0,0 +1,142 @@
+ # AI Tutor App Instructions for Claude
+
+ ## Project Overview
+ This is an AI tutor application that uses RAG (Retrieval-Augmented Generation) to provide accurate responses about AI concepts by searching through multiple documentation sources. The application has a Gradio UI and uses ChromaDB for vector storage.
+
+ ## Key Repositories and URLs
+ - Main code: https://github.com/towardsai/ai-tutor-app
+ - Live demo: https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot
+ - Vector database: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
+ - Private JSONL repo: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data
+
+ ## Architecture Overview
+ - Frontend: Gradio-based UI in `scripts/main.py`
+ - Retrieval: Custom retriever using ChromaDB vector stores (see the sketch below)
+ - Embedding: Cohere embeddings for vector search
+ - LLM: OpenAI models (GPT-4o, etc.) for context addition and responses
+ - Storage: Individual JSONL files per source + combined file for retrieval
+
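+ A minimal sketch of how these pieces fit together at query time, assuming the llama-index, ChromaDB, and Cohere packages from `requirements.txt`; the collection name, top-k, and query below are illustrative, not taken from the repo:
+
+ ```python
+ import os
+
+ import chromadb
+ from llama_index.core import VectorStoreIndex
+ from llama_index.embeddings.cohere import CohereEmbedding
+ from llama_index.vector_stores.chroma import ChromaVectorStore
+
+ # Open the persisted vector database (directory name follows the files listed below).
+ client = chromadb.PersistentClient(path="data/chroma-db-all_sources")
+ collection = client.get_or_create_collection("all_sources")  # hypothetical collection name
+
+ # Cohere embeddings, per the Embedding bullet above.
+ embed_model = CohereEmbedding(
+     cohere_api_key=os.environ["COHERE_API_KEY"],
+     model_name="embed-english-v3.0",
+     input_type="search_query",
+ )
+
+ index = VectorStoreIndex.from_vector_store(
+     ChromaVectorStore(chroma_collection=collection), embed_model=embed_model
+ )
+ retriever = index.as_retriever(similarity_top_k=5)
+ nodes = retriever.retrieve("What is retrieval augmented generation?")
+ ```
+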
+ ## Data Update Workflows
+
+ ### 1. Adding a New Course
+ ```bash
+ python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
+ ```
+ - This requires the course to be configured in `process_md_files.py` under `SOURCE_CONFIGS`
+ - The workflow will pause for manual URL addition after processing markdown files
+ - Only new content will have context added by default (efficient)
+ - Use `--process-all-context` if you need to regenerate context for all documents
+ - Both database and data files are uploaded to HuggingFace by default
+ - Use `--skip-data-upload` if you don't want to upload data files
+
+ ### 2. Updating Documentation from GitHub
+ ```bash
+ python data/scraping_scripts/update_docs_workflow.py
+ ```
+ - Updates all supported documentation sources (or specify particular ones with `--sources`)
+ - Downloads fresh documentation from GitHub repositories
+ - Only new content will have context added by default (efficient)
+ - Use `--process-all-context` if you need to regenerate context for all documents
+ - Both database and data files are uploaded to HuggingFace by default
+ - Use `--skip-data-upload` if you don't want to upload data files
+
+ ### 3. Data File Management
+ ```bash
+ # Upload both JSONL and PKL files to the private HuggingFace repository
+ python data/scraping_scripts/upload_data_to_hf.py
+ ```
+
+ ## Data Flow and File Relationships
+
+ ### Document Processing Pipeline
+ 1. **Markdown Files** → `process_md_files.py` → **Individual JSONL files** (e.g., `transformers_data.jsonl`)
+ 2. Individual JSONL files → `combine_all_sources()` → `all_sources_data.jsonl` (see the sketch below)
+ 3. `all_sources_data.jsonl` → `add_context_to_nodes.py` → `all_sources_contextual_nodes.pkl`
+ 4. `all_sources_contextual_nodes.pkl` → `create_vector_stores.py` → ChromaDB vector stores
+
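+ An illustrative sketch of step 2, assuming each JSONL file holds one JSON object per line; the real `combine_all_sources()` lives in `process_md_files.py` and may differ in detail:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ def combine_all_sources(jsonl_paths, output_path="data/all_sources_data.jsonl"):
+     """Concatenate per-source JSONL files into the combined retrieval file."""
+     with open(output_path, "w", encoding="utf-8") as out:
+         for path in jsonl_paths:
+             for line in Path(path).read_text(encoding="utf-8").splitlines():
+                 if line.strip():  # skip blank lines
+                     out.write(json.dumps(json.loads(line)) + "\n")
+
+ combine_all_sources(["data/transformers_data.jsonl"])
+ ```
+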
+ ### Important Files and Their Purpose
+ - `all_sources_data.jsonl` - Combined raw document data without context
+ - Source-specific JSONL files (e.g., `transformers_data.jsonl`) - Raw data for individual sources
+ - `all_sources_contextual_nodes.pkl` - Processed nodes with added context
+ - `chroma-db-all_sources` - Vector database directory containing embeddings
+ - `document_dict_all_sources.pkl` - Dictionary mapping document IDs to full documents
+
+ ## Configuration Details
+
+ ### Adding a New Course Source
+ 1. Update `SOURCE_CONFIGS` in `process_md_files.py`:
+ ```python
+ "new_course": {
+     "base_url": "",
+     "input_directory": "data/new_course",
+     "output_file": "data/new_course_data.jsonl",
+     "source_name": "new_course",
+     "use_include_list": False,
+     "included_dirs": [],
+     "excluded_dirs": [],
+     "excluded_root_files": [],
+     "included_root_files": [],
+     "url_extension": "",
+ },
+ ```
+
+ 2. Update the UI configurations in the following files (see the sketch below):
+ - `setup.py`: Add to `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI`
+ - `main.py`: Add a mapping in the `source_mapping` dictionary
+
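+ A hypothetical shape for those step-2 updates; the real constants live in `setup.py` and `main.py` and may be structured differently:
+
+ ```python
+ # setup.py (hypothetical excerpt)
+ AVAILABLE_SOURCES = ["transformers", "new_course"]
+ AVAILABLE_SOURCES_UI = ["Transformers Docs", "New Course"]
+
+ # main.py (hypothetical excerpt): UI label -> source_name from SOURCE_CONFIGS
+ source_mapping = {
+     "Transformers Docs": "transformers",
+     "New Course": "new_course",
+ }
+ ```
+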
+ ## Deployment and Publishing
+
+ ### GitHub Actions Workflow
+ The application is automatically deployed to HuggingFace Spaces when changes are pushed to the main branch (excluding documentation and scraping scripts).
+
+ ### Manual Deployment
+ ```bash
+ git push --force https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot main:main
+ ```
+
+ ## Development Environment Setup
+
+ ### Required Environment Variables
+ - `OPENAI_API_KEY` - For LLM processing
+ - `COHERE_API_KEY` - For embeddings
+ - `HF_TOKEN` - For HuggingFace uploads
+ - `GITHUB_TOKEN` - For accessing documentation via the GitHub API
+
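+ A minimal sketch of loading these variables from a local `.env` file (python-dotenv is already in `requirements.txt`, and `.env` is gitignored above); the fail-fast check is a convention, not taken from the repo:
+
+ ```python
+ import os
+ from dotenv import load_dotenv
+
+ load_dotenv()  # reads key=value pairs from .env into os.environ
+
+ for key in ("OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"):
+     if not os.getenv(key):
+         raise RuntimeError(f"Missing required environment variable: {key}")
+ ```
+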
+ ### Running the Application Locally
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Start the Gradio UI
+ python scripts/main.py
+ ```
+
+ ## Important Notes
+
+ 1. When adding new courses, make sure to:
+    - Place markdown files exported from Notion in the appropriate directory
+    - Add URLs manually from the live course platform
+      - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
+    - Configure the course in `process_md_files.py`
+    - Verify it appears in the UI after deployment
+
+ 2. For updating documentation:
+    - The GitHub API is used to fetch the latest documentation
+    - The workflow handles updating existing sources without affecting course data
+
+ 3. For efficient context addition:
+    - Only new content gets processed by default
+    - Old nodes for updated sources are removed from the PKL file
+    - This ensures no duplicate content in the vector database
+
+ ## Technical Details for Debugging
+
+ ### Node Removal Logic
+ - When adding context, the workflow now removes existing nodes for the sources being updated (see the sketch below)
+ - This prevents duplication of content in the vector database
+ - The source of each node is extracted from either `node.source_node.metadata` or `node.metadata`
+
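+ A sketch of that node-removal idea, assuming the PKL file holds a pickled list of nodes and that the source name sits under a `"source"` metadata key (the key name is an assumption):
+
+ ```python
+ import pickle
+
+ def remove_nodes_for_sources(pkl_path, sources_to_update):
+     """Drop nodes whose source is about to be re-added, to avoid duplicates."""
+     with open(pkl_path, "rb") as f:
+         nodes = pickle.load(f)
+
+     def node_source(node):
+         # Source lives in node.source_node.metadata when present, else node.metadata.
+         if getattr(node, "source_node", None) is not None:
+             return node.source_node.metadata.get("source")
+         return node.metadata.get("source")
+
+     return [n for n in nodes if node_source(n) not in sources_to_update]
+ ```
+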
+ ### Performance Considerations
+ - Context addition is the most time-consuming step (it calls the OpenAI API)
+ - The default behavior only processes new content
+ - For large updates, consider running in batches
requirements.txt ADDED
@@ -0,0 +1,21 @@
+ modal
+ openai
+ anthropic
+ instructor
+ pydantic
+ logfire
+ chromadb
+ cohere
+ tiktoken
+ llama-index
+ llama-index-postprocessor-cohere-rerank
+ llama-index-embeddings-cohere
+ llama-index-vector-stores-chroma
+ python-dotenv
+ ipykernel
+ google-generativeai
+ llama-index-llms-gemini
+ gradio
+ pymongo
+ huggingface_hub
+ nbconvert