Nullpointer-KK committed on
Commit 61e2aff · unverified · 1 parent: 6d29fb0

Update README.md

Files changed (1): README.md (+61 −76)

README.md CHANGED
@@ -1,104 +1,89 @@
- # AI Tutor App Data Workflows

- This directory contains scripts for managing the AI Tutor App's data pipeline.

- ## Workflow Scripts

- ### 1. Adding a New Course

- To add a new course to the AI Tutor:

- ```bash
- python add_course_workflow.py --course [COURSE_NAME]
- ```

- This will guide you through the complete process:

- 1. Process markdown files from Notion exports
- 2. Prompt you to manually add URLs to the course content
- 3. Merge the course data into the main dataset
- 4. Add contextual information to document nodes
- 5. Create vector stores
- 6. Upload databases to HuggingFace
- 7. Update UI configuration

- **Requirements before running:**

- - The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
- - Course markdown files must be placed in the directory specified in the configuration
- - You must have access to the live course platform to add URLs
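The `SOURCE_CONFIGS` dictionary lives in `process_md_files.py`, but its exact schema is not shown in this README, so the entry below is a purely hypothetical sketch of what a new course's configuration might look like; the real keys may differ:

```python
# Hypothetical SOURCE_CONFIGS entry. Only the dictionary name comes from
# the README -- the keys and values here are illustrative, not the real schema.
SOURCE_CONFIGS = {
    "new_course_name": {
        # Directory holding the markdown files exported from Notion
        "input_directory": "data/new_course_name_md",
        # Where the processed JSONL output should be written
        "output_file": "new_course_name_data.jsonl",
    },
}
```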
- ### 2. Updating Documentation via GitHub API

- To update library documentation from GitHub repositories:

- ```bash
- python update_docs_workflow.py
- ```

- This will update all supported documentation sources. You can also restrict the run to specific sources:

- ```bash
- python update_docs_workflow.py --sources transformers peft
- ```

- The workflow includes:

- 1. Downloading documentation from GitHub using the API
- 2. Processing markdown files to create JSONL data
- 3. Adding contextual information to document nodes
- 4. Creating vector stores
- 5. Uploading databases to HuggingFace
 
- ### 3. Uploading JSONL to HuggingFace

- To upload the main JSONL file to a private HuggingFace repository:

- ```bash
- python upload_jsonl_to_hf.py
- ```

- This is useful for sharing the latest data with team members.

- ## Individual Components

- If you need to run specific steps individually:

- - **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- - **Process Markdown**: `process_md_files.py`
- - **Add Context**: `add_context_to_nodes.py`
- - **Create Vector Stores**: `create_vector_stores.py`
- - **Upload to HuggingFace**: `upload_dbs_to_hf.py`
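When chaining the component scripts by hand, they run in the order listed. The sketch below is an assumption based on the workflow steps described in this README (each script may also take arguments not shown here):

```python
import subprocess

# Component scripts in the order the full workflows appear to run them
# (an assumption inferred from the workflow descriptions in this README).
STEPS = [
    "github_to_markdown_ai_docs.py",
    "process_md_files.py",
    "add_context_to_nodes.py",
    "create_vector_stores.py",
    "upload_dbs_to_hf.py",
]

def pipeline_commands(steps=STEPS):
    """Build the commands without executing anything (dry run)."""
    return [["python", script] for script in steps]

def run_pipeline(steps=STEPS):
    """Execute each component script, stopping on the first failure."""
    for command in pipeline_commands(steps):
        subprocess.run(command, check=True)
```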
 
- ## Tips for New Team Members

- 1. To update the AI Tutor with new content:
-    - For new courses, use `add_course_workflow.py`
-    - For updated documentation, use `update_docs_workflow.py`

- 2. When adding URLs to course content:
-    - Get the URLs from the live course platform
-    - Add them to the generated JSONL file in the `url` field
-    - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
-    - Make sure every document has a valid URL
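One way to check that last point before merging is a quick validation pass over the JSONL file. This sketch is illustrative: it assumes one JSON object per line, and it defaults to the `all_sources_data.jsonl` filename mentioned in the setup notes; adjust the path for course-specific files:

```python
import json

def docs_missing_urls(path="all_sources_data.jsonl"):
    """Return 1-based line numbers of documents lacking a valid https URL."""
    missing = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            # Flag documents whose `url` field is absent, empty, or not https
            if not str(doc.get("url", "")).startswith("https://"):
                missing.append(lineno)
    return missing
```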
 
 
 
- 3. By default, only new content will have context added, to save time and resources:
-    - Use `--process-all-context` only if you need to regenerate context for all documents
-    - Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default)

- 4. When adding a new course, verify that it appears in the Gradio UI:
-    - The workflow automatically updates `main.py` and `setup.py` to include the new source
-    - Check that the new source appears in the dropdown menu in the UI
-    - Make sure it's properly included in the default selected sources
-    - Restart the Gradio app to see the changes

- 5. First-time setup or missing files:
-    - Both workflows automatically check for and download required data files:
-      - `all_sources_data.jsonl` - Contains the raw document data
-      - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
-    - If the PKL file exists, the `--new-context-only` flag will only process new content
-    - You must have proper HuggingFace credentials with access to the private repository

- 6. Make sure you have the required environment variables set:
-    - `OPENAI_API_KEY` for LLM processing
-    - `COHERE_API_KEY` for embeddings
-    - `HF_TOKEN` for HuggingFace uploads
-    - `GITHUB_TOKEN` for accessing documentation via the GitHub API
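A preflight check for these variables might look like the following sketch (the variable names are the ones listed above; the helper itself is hypothetical, not part of the workflow scripts):

```python
import os

REQUIRED_ENV_VARS = [
    "OPENAI_API_KEY",   # LLM processing
    "COHERE_API_KEY",   # embeddings
    "HF_TOKEN",         # HuggingFace uploads
    "GITHUB_TOKEN",     # GitHub API access
]

def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```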
 
+ ---
+ title: AI Tutor Chatbot
+ emoji: 🧑🏻‍🏫
+ colorFrom: gray
+ colorTo: pink
+ sdk: gradio
+ sdk_version: 5.20.1
+ app_file: scripts/main.py
+ pinned: false
+ ---

+ ### Gradio UI Chatbot

+ A Gradio UI for the chatbot is available in [scripts/main.py](./scripts/main.py).

+ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugging Face](https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot).

+ **Note:** A GitHub Action automatically deploys the Gradio demo when changes are pushed to the main branch (excluding documentation and scripts in the `data/scraping_scripts` directory).

+ ### Installation (for Gradio UI)

+ 1. **Create a new Python environment:**

+    ```bash
+    python -m venv .venv
+    ```

+ 2. **Activate the environment:**

+    For macOS and Linux:

+    ```bash
+    source .venv/bin/activate
+    ```

+    For Windows:

+    ```bash
+    .venv\Scripts\activate
+    ```

+ 3. **Install the dependencies:**

+    ```bash
+    pip install -r requirements.txt
+    ```

+ ### Usage (for Gradio UI)

+ 1. **Set environment variables:**

+    Before running the application, set up the required API keys:

+    For macOS and Linux:

+    ```bash
+    export OPENAI_API_KEY=your_openai_api_key_here
+    export COHERE_API_KEY=your_cohere_api_key_here
+    ```

+    For Windows:

+    ```bash
+    set OPENAI_API_KEY=your_openai_api_key_here
+    set COHERE_API_KEY=your_cohere_api_key_here
+    ```

+ 2. **Run the application:**

+    ```bash
+    python scripts/main.py
+    ```

+    This command starts the Gradio interface for the AI Tutor chatbot.

+ ### Updating Data Sources

+ This application uses a RAG (Retrieval Augmented Generation) system with multiple data sources, including documentation and courses. To update these sources:

+ 1. **For adding new courses or updating documentation:**
+    - See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)
+    - Automated workflows are available for both course addition and documentation updates

+ 2. **Available workflows:**
+    - `add_course_workflow.py` - For adding new course content
+    - `update_docs_workflow.py` - For updating documentation from GitHub repositories
+    - `upload_data_to_hf.py` - For uploading data files to HuggingFace

+ These scripts streamline the process of adding new content to the AI Tutor and ensure consistency across team members.
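For team members new to RAG, the retrieval step over a vector store can be pictured with this deliberately minimal, illustrative sketch (toy embeddings in plain Python; the actual app relies on the vector stores and embeddings built by the workflows, not on this code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_embedding, store, k=2):
    """Return the texts of the k documents most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]
```

The retrieved passages are then supplied to the LLM as context, which is what keeps answers grounded in the course and documentation sources.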