Nullpointer-KK committed on
Commit 61e2aff · unverified · 1 parent: 6d29fb0

Update README.md

Files changed (1): README.md (+61 −76)

README.md CHANGED
@@ -1,104 +1,89 @@
- # AI Tutor App Data Workflows

- This directory contains scripts for managing the AI Tutor App's data pipeline.

- ## Workflow Scripts

- ### 1. Adding a New Course

- To add a new course to the AI Tutor:

- ```bash
- python add_course_workflow.py --course [COURSE_NAME]
- ```

- This will guide you through the complete process:

- 1. Process markdown files from Notion exports
- 2. Prompt you to manually add URLs to the course content
- 3. Merge the course data into the main dataset
- 4. Add contextual information to document nodes
- 5. Create vector stores
- 6. Upload databases to HuggingFace
- 7. Update UI configuration

- **Requirements before running:**

- - The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
- - Course markdown files must be placed in the directory specified in the configuration
- - You must have access to the live course platform to add URLs
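The `SOURCE_CONFIGS` dictionary lives in `process_md_files.py`, but its exact schema is not shown in this README, so the entry below is a purely hypothetical sketch of what a new course's configuration might look like; the real keys may differ:

```python
# Hypothetical SOURCE_CONFIGS entry. Only the dictionary name comes from
# the README -- the keys and values here are illustrative, not the real schema.
SOURCE_CONFIGS = {
    "new_course_name": {
        # Directory holding the markdown files exported from Notion
        "input_directory": "data/new_course_name_md",
        # Where the processed JSONL output should be written
        "output_file": "new_course_name_data.jsonl",
    },
}
```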
- ### 2. Updating Documentation via GitHub API

- To update library documentation from GitHub repositories:

- ```bash
- python update_docs_workflow.py
- ```

- This will update all supported documentation sources. You can also restrict the run to specific sources:

- ```bash
- python update_docs_workflow.py --sources transformers peft
- ```

- The workflow includes:

- 1. Downloading documentation from GitHub using the API
- 2. Processing markdown files to create JSONL data
- 3. Adding contextual information to document nodes
- 4. Creating vector stores
- 5. Uploading databases to HuggingFace
 
- ### 3. Uploading JSONL to HuggingFace

- To upload the main JSONL file to a private HuggingFace repository:

- ```bash
- python upload_jsonl_to_hf.py
- ```

- This is useful for sharing the latest data with team members.

- ## Individual Components

- If you need to run specific steps individually:

- - **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- - **Process Markdown**: `process_md_files.py`
- - **Add Context**: `add_context_to_nodes.py`
- - **Create Vector Stores**: `create_vector_stores.py`
- - **Upload to HuggingFace**: `upload_dbs_to_hf.py`
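When chaining the component scripts by hand, they run in the order listed. The sketch below is an assumption based on the workflow steps described in this README (each script may also take arguments not shown here):

```python
import subprocess

# Component scripts in the order the full workflows appear to run them
# (an assumption inferred from the workflow descriptions in this README).
STEPS = [
    "github_to_markdown_ai_docs.py",
    "process_md_files.py",
    "add_context_to_nodes.py",
    "create_vector_stores.py",
    "upload_dbs_to_hf.py",
]

def pipeline_commands(steps=STEPS):
    """Build the commands without executing anything (dry run)."""
    return [["python", script] for script in steps]

def run_pipeline(steps=STEPS):
    """Execute each component script, stopping on the first failure."""
    for command in pipeline_commands(steps):
        subprocess.run(command, check=True)
```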
 
- ## Tips for New Team Members

- 1. To update the AI Tutor with new content:
-    - For new courses, use `add_course_workflow.py`
-    - For updated documentation, use `update_docs_workflow.py`

- 2. When adding URLs to course content:
-    - Get the URLs from the live course platform
-    - Add them to the generated JSONL file in the `url` field
-    - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
-    - Make sure every document has a valid URL
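One way to check that last point before merging is a quick validation pass over the JSONL file. This sketch is illustrative: it assumes one JSON object per line, and it defaults to the `all_sources_data.jsonl` filename mentioned in the setup notes; adjust the path for course-specific files:

```python
import json

def docs_missing_urls(path="all_sources_data.jsonl"):
    """Return 1-based line numbers of documents lacking a valid https URL."""
    missing = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            # Flag documents whose `url` field is absent, empty, or not https
            if not str(doc.get("url", "")).startswith("https://"):
                missing.append(lineno)
    return missing
```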
 
 
 
- 3. By default, only new content will have context added, to save time and resources:
-    - Use `--process-all-context` only if you need to regenerate context for all documents
-    - Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default)

- 4. When adding a new course, verify that it appears in the Gradio UI:
-    - The workflow automatically updates `main.py` and `setup.py` to include the new source
-    - Check that the new source appears in the dropdown menu in the UI
-    - Make sure it's properly included in the default selected sources
-    - Restart the Gradio app to see the changes

- 5. First-time setup or missing files:
-    - Both workflows automatically check for and download required data files:
-      - `all_sources_data.jsonl` - Contains the raw document data
-      - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
-    - If the PKL file exists, the `--new-context-only` flag will only process new content
-    - You must have proper HuggingFace credentials with access to the private repository

- 6. Make sure you have the required environment variables set:
-    - `OPENAI_API_KEY` for LLM processing
-    - `COHERE_API_KEY` for embeddings
-    - `HF_TOKEN` for HuggingFace uploads
-    - `GITHUB_TOKEN` for accessing documentation via the GitHub API
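A preflight check for these variables might look like the following sketch (the variable names are the ones listed above; the helper itself is hypothetical, not part of the workflow scripts):

```python
import os

REQUIRED_ENV_VARS = [
    "OPENAI_API_KEY",   # LLM processing
    "COHERE_API_KEY",   # embeddings
    "HF_TOKEN",         # HuggingFace uploads
    "GITHUB_TOKEN",     # GitHub API access
]

def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```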
 
+ ---
+ title: AI Tutor Chatbot
+ emoji: 🧑🏻‍🏫
+ colorFrom: gray
+ colorTo: pink
+ sdk: gradio
+ sdk_version: 5.20.1
+ app_file: scripts/main.py
+ pinned: false
+ ---

+ ### Gradio UI Chatbot

+ A Gradio UI for the chatbot is available in [scripts/main.py](./scripts/main.py).

+ The Gradio demo is deployed on Hugging Face Spaces at: [AI Tutor Chatbot on Hugging Face](https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot).

+ **Note:** A GitHub Action automatically deploys the Gradio demo when changes are pushed to the main branch (excluding documentation and scripts in the `data/scraping_scripts` directory).

+ ### Installation (for Gradio UI)

+ 1. **Create a new Python environment:**

+    ```bash
+    python -m venv .venv
+    ```

+ 2. **Activate the environment:**

+    For macOS and Linux:

+    ```bash
+    source .venv/bin/activate
+    ```

+    For Windows:

+    ```bash
+    .venv\Scripts\activate
+    ```

+ 3. **Install the dependencies:**

+    ```bash
+    pip install -r requirements.txt
+    ```

+ ### Usage (for Gradio UI)

+ 1. **Set environment variables:**

+    Before running the application, set up the required API keys:

+    For macOS and Linux:

+    ```bash
+    export OPENAI_API_KEY=your_openai_api_key_here
+    export COHERE_API_KEY=your_cohere_api_key_here
+    ```

+    For Windows:

+    ```bash
+    set OPENAI_API_KEY=your_openai_api_key_here
+    set COHERE_API_KEY=your_cohere_api_key_here
+    ```

+ 2. **Run the application:**

+    ```bash
+    python scripts/main.py
+    ```

+    This command starts the Gradio interface for the AI Tutor chatbot.

+ ### Updating Data Sources

+ This application uses a RAG (Retrieval Augmented Generation) system with multiple data sources, including documentation and courses. To update these sources:

+ 1. **For adding new courses or updating documentation:**
+    - See the detailed instructions in [data/scraping_scripts/README.md](./data/scraping_scripts/README.md)
+    - Automated workflows are available for both course addition and documentation updates

+ 2. **Available workflows:**
+    - `add_course_workflow.py` - For adding new course content
+    - `update_docs_workflow.py` - For updating documentation from GitHub repositories
+    - `upload_data_to_hf.py` - For uploading data files to HuggingFace

+ These scripts streamline the process of adding new content to the AI Tutor and ensure consistency across team members.
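For team members new to RAG, the retrieval step over a vector store can be pictured with this deliberately minimal, illustrative sketch (toy embeddings in plain Python; the actual app relies on the vector stores and embeddings built by the workflows, not on this code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_embedding, store, k=2):
    """Return the texts of the k documents most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]
```

The retrieved passages are then supplied to the LLM as context, which is what keeps answers grounded in the course and documentation sources.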