Add files via upload
- data/scraping_scripts/README.md +104 -0
- data/scraping_scripts/add_context_to_nodes.py +196 -0
- data/scraping_scripts/add_course_workflow.py +541 -0
- data/scraping_scripts/create_vector_stores.py +218 -0
- data/scraping_scripts/csv_to_jsonl.py +61 -0
- data/scraping_scripts/github_to_markdown_ai_docs.py +231 -0
- data/scraping_scripts/process_md_files.py +370 -0
- data/scraping_scripts/update_docs_workflow.py +409 -0
- data/scraping_scripts/upload_data_to_hf.py +129 -0
- data/scraping_scripts/upload_dbs_to_hf.py +38 -0
data/scraping_scripts/README.md
ADDED
@@ -0,0 +1,104 @@
# AI Tutor App Data Workflows

This directory contains scripts for managing the AI Tutor App's data pipeline.

## Workflow Scripts

### 1. Adding a New Course

To add a new course to the AI Tutor:

```bash
python add_course_workflow.py --course [COURSE_NAME]
```

This will guide you through the complete process:

1. Process markdown files from Notion exports
2. Prompt you to manually add URLs to the course content
3. Merge the course data into the main dataset
4. Add contextual information to document nodes
5. Create vector stores
6. Upload databases to HuggingFace
7. Update UI configuration

**Requirements before running:**

- The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS` (see the example entry after this list)
- Course markdown files must be placed in the directory specified in the configuration
- You must have access to the live course platform to add URLs
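A minimal sketch of what a `SOURCE_CONFIGS` entry for a new course might look like. Only the `output_file` key is confirmed by the workflow script (`add_course_workflow.py` reads it); the other key and its name are illustrative assumptions about how `process_md_files.py` locates the Notion export:

```python
# Hypothetical SOURCE_CONFIGS entry in data/scraping_scripts/process_md_files.py.
# "output_file" is the key read by add_course_workflow.py; the other key is an
# assumption standing in for however the script locates the markdown files.
SOURCE_CONFIGS = {
    # ... existing sources ...
    "new_course": {
        "base_dir": "data/new_course_md_files",       # assumed: Notion export directory
        "output_file": "data/new_course_data.jsonl",  # JSONL produced for this course
    },
}
```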
### 2. Updating Documentation via GitHub API

To update library documentation from GitHub repositories:

```bash
python update_docs_workflow.py
```

This will update all supported documentation sources. You can also specify specific sources:

```bash
python update_docs_workflow.py --sources transformers peft
```

The workflow includes:

1. Downloading documentation from GitHub using the API
2. Processing markdown files to create JSONL data
3. Adding contextual information to document nodes
4. Creating vector stores
5. Uploading databases to HuggingFace

### 3. Uploading JSONL to HuggingFace

To upload the main JSONL file to a private HuggingFace repository:

```bash
python upload_data_to_hf.py
```

This is useful for sharing the latest data with team members (a sketch of the underlying upload call follows).
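For reference, the upload step amounts to pushing the JSONL (and the contextual-nodes pickle) to a private dataset repo. A minimal sketch using `huggingface_hub`; the repo id mirrors the one the workflows download from and is an assumption here, not a guaranteed match for what `upload_data_to_hf.py` actually does:

```python
# Sketch: upload the main data files to a private HuggingFace dataset repo.
# The repo_id is assumed from the download step in add_course_workflow.py.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
for filename in ("all_sources_data.jsonl", "all_sources_contextual_nodes.pkl"):
    api.upload_file(
        path_or_fileobj=f"data/{filename}",
        path_in_repo=filename,
        repo_id="towardsai-tutors/ai-tutor-data",
        repo_type="dataset",
    )
```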
## Individual Components

If you need to run specific steps individually (the sketch after this list shows the order the workflows run them in):

- **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- **Process Markdown**: `process_md_files.py`
- **Add Context**: `add_context_to_nodes.py`
- **Create Vector Stores**: `create_vector_stores.py`
- **Upload to HuggingFace**: `upload_dbs_to_hf.py`
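A minimal sketch of chaining those components by hand, in the order the workflow scripts invoke them via `subprocess` (the `transformers` argument is just an example source):

```python
# Run the individual components in the order the workflow scripts use them.
import subprocess

steps = [
    ["python", "data/scraping_scripts/github_to_markdown_ai_docs.py", "transformers"],
    ["python", "data/scraping_scripts/process_md_files.py", "transformers"],
    ["python", "data/scraping_scripts/add_context_to_nodes.py"],
    ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"],
    ["python", "data/scraping_scripts/upload_dbs_to_hf.py"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop at the first step that fails
```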
## Tips for New Team Members

1. To update the AI Tutor with new content:
   - For new courses, use `add_course_workflow.py`
   - For updated documentation, use `update_docs_workflow.py`

2. When adding URLs to course content:
   - Get the URLs from the live course platform
   - Add them to the generated JSONL file in the `url` field
   - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
   - Make sure every document has a valid URL (the check sketched below can help)
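A quick way to confirm nothing is missing before moving on; this mirrors the check `add_course_workflow.py` runs, and the file path is just an example course output:

```python
# List documents whose "url" field is still empty in a generated course JSONL.
import json

missing = 0
with open("data/python_primer_data.jsonl", "r", encoding="utf-8") as f:  # example path
    for line in f:
        doc = json.loads(line)
        if not doc.get("url"):
            missing += 1
            print(f"Missing URL: {doc['name']} ({doc['doc_id']})")

print(f"{missing} documents still need URLs")
```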
3. By default, only new content will have context added, to save time and resources. Use `--process-all-context` only if you need to regenerate context for all documents. Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default).

4. When adding a new course, verify that it appears in the Gradio UI (a sketch of the resulting config edits follows this list):
   - The workflow automatically updates `main.py` and `setup.py` to include the new source
   - Check that the new source appears in the dropdown menu in the UI
   - Make sure it's properly included in the default selected sources
   - Restart the Gradio app to see the changes
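Roughly what the automatic edit produces. The appended entries match what `update_ui_files` in `add_course_workflow.py` writes; the surrounding list contents and the `python_primer` course key are illustrative:

```python
# scripts/setup.py (sketch) - the last entry in each list is the one appended
AVAILABLE_SOURCES_UI = [
    # ... existing display names ...
    "Python Primer",        # display name derived from the course key
]
AVAILABLE_SOURCES = [
    # ... existing source keys ...
    "python_primer",        # the course key from SOURCE_CONFIGS
]

# scripts/main.py (sketch) - display name mapped to the source key
source_mapping = {
    # ... existing entries ...
    "Python Primer": "python_primer",
}
```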
5. First time setup or missing files:
   - Both workflows automatically check for and download required data files:
     - `all_sources_data.jsonl` - Contains the raw document data
     - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
   - If the PKL file exists, the `--new-context-only` flag will only process new content
   - You must have proper HuggingFace credentials with access to the private repository

6. Make sure you have the required environment variables set (a quick check is sketched below):
   - `OPENAI_API_KEY` for LLM processing
   - `COHERE_API_KEY` for embeddings
   - `HF_TOKEN` for HuggingFace uploads
   - `GITHUB_TOKEN` for accessing documentation via the GitHub API
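A minimal sanity check before kicking off a workflow, assuming the variables live in a `.env` file as the scripts expect:

```python
# Verify the environment variables the pipeline relies on are set.
import os
from dotenv import load_dotenv

load_dotenv()

for var in ("OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"):
    status = "set" if os.getenv(var) else "MISSING"
    print(f"{var}: {status}")
```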
data/scraping_scripts/add_context_to_nodes.py
ADDED
@@ -0,0 +1,196 @@
import asyncio
import json
import pdb
import pickle
from typing import Dict, List

import instructor
import logfire
import tiktoken
from anthropic import AsyncAnthropic
from dotenv import load_dotenv
from jinja2 import Template
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential
from tqdm.asyncio import tqdm

load_dotenv(".env")

# logfire.configure()


def create_docs(input_file: str) -> List[Document]:
    with open(input_file, "r") as f:
        documents: list[Document] = []
        for line in f:
            data = json.loads(line)
            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "retrieve_doc": data["retrieve_doc"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[
                        "title",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[
                        "url",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                )
            )
    return documents


class SituatedContext(BaseModel):
    title: str = Field(..., description="The title of the document.")
    context: str = Field(
        ..., description="The context to situate the chunk within the document."
    )


# client = AsyncInstructor(
#     client=AsyncAnthropic(),
#     create=patch(
#         create=AsyncAnthropic().beta.prompt_caching.messages.create,
#         mode=Mode.ANTHROPIC_TOOLS,
#     ),
#     mode=Mode.ANTHROPIC_TOOLS,
# )
aclient = AsyncOpenAI()
# logfire.instrument_openai(aclient)
client: instructor.AsyncInstructor = instructor.from_openai(aclient)


@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=4, max=10))
async def situate_context(doc: str, chunk: str) -> str:
    template = Template(
        """
<document>
{{ doc }}
</document>

Here is the chunk we want to situate within the whole document above:

<chunk>
{{ chunk }}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
    )

    content = template.render(doc=doc, chunk=chunk)

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=1000,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        response_model=SituatedContext,
    )
    return response.context


async def process_chunk(node: TextNode, document_dict: dict) -> TextNode:
    doc_id: str = node.source_node.node_id  # type: ignore
    doc: Document = document_dict[doc_id]

    if doc.metadata["tokens"] > 120_000:
        # Tokenize the document text
        encoding = tiktoken.encoding_for_model("gpt-4o-mini")
        tokens = encoding.encode(doc.get_content())

        # Trim to 120,000 tokens
        trimmed_tokens = tokens[:120_000]

        # Decode back to text
        trimmed_text = encoding.decode(trimmed_tokens)

        # Update the document with trimmed text
        doc = Document(text=trimmed_text, metadata=doc.metadata)
        doc.metadata["tokens"] = 120_000

    context: str = await situate_context(doc.get_content(), node.text)
    node.text = f"{node.text}\n\n{context}"
    return node


async def process(
    documents: List[Document], semaphore_limit: int = 50
) -> List[TextNode]:

    # From the document, we create chunks
    pipeline = IngestionPipeline(
        transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
    )
    all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
    print(f"Number of nodes: {len(all_nodes)}")

    document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}

    semaphore = asyncio.Semaphore(semaphore_limit)

    async def process_with_semaphore(node):
        async with semaphore:
            result = await process_chunk(node, document_dict)
            await asyncio.sleep(0.1)
            return result

    tasks = [process_with_semaphore(node) for node in all_nodes]

    results: List[TextNode] = await tqdm.gather(*tasks, desc="Processing chunks")

    # pdb.set_trace()

    return results


async def main():
    documents: List[Document] = create_docs("data/all_sources_data.jsonl")
    enhanced_nodes: List[TextNode] = await process(documents)

    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(enhanced_nodes, f)

    # pipeline = IngestionPipeline(
    #     transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
    # )
    # all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
    # print(all_nodes[7933])
    # pdb.set_trace()

    with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
        enhanced_nodes: list[TextNode] = pickle.load(f)

    for i, node in enumerate(enhanced_nodes):
        print(f"Chunk {i + 1}:")
        print(f"Node: {node}")
        print(f"Text: {node.text}")
        # pdb.set_trace()
        break


if __name__ == "__main__":
    asyncio.run(main())
data/scraping_scripts/add_course_workflow.py
ADDED
@@ -0,0 +1,541 @@
#!/usr/bin/env python
"""
AI Tutor App - Course Addition Workflow

This script guides you through the complete process of adding a new course to the AI Tutor App:

1. Process course markdown files to create JSONL data
2. MANDATORY MANUAL STEP: Add URLs to course content in the generated JSONL
3. Merge course JSONL into all_sources_data.jsonl
4. Add contextual information to document nodes
5. Create vector stores
6. Upload databases to HuggingFace
7. Update UI configuration

Usage:
    python add_course_workflow.py --course [COURSE_NAME]

Additional flags to run specific steps (if you want to restart from a specific point):
    --skip-process-md      Skip the markdown processing step
    --skip-merge           Skip merging into all_sources_data.jsonl
    --process-all-context  Process all content when adding context (default: only new content)
    --skip-context         Skip the context addition step entirely
    --skip-vectors         Skip vector store creation
    --skip-upload          Skip uploading to HuggingFace
    --skip-data-upload     Skip uploading data files to the private HuggingFace repo
    --skip-ui-update       Skip updating the UI configuration
"""

import argparse
import json
import logging
import os
import pickle
import subprocess
import sys
import time
from pathlib import Path
from typing import Dict, List, Set

from dotenv import load_dotenv
from huggingface_hub import HfApi, hf_hub_download

# Load environment variables from .env file
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def ensure_required_files_exist():
    """Download required data files from HuggingFace if they don't exist locally."""
    # List of files to check and download
    required_files = {
        # Critical files
        "data/all_sources_data.jsonl": "all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",

        # Documentation source files
        "data/transformers_data.jsonl": "transformers_data.jsonl",
        "data/peft_data.jsonl": "peft_data.jsonl",
        "data/trl_data.jsonl": "trl_data.jsonl",
        "data/llama_index_data.jsonl": "llama_index_data.jsonl",
        "data/langchain_data.jsonl": "langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",

        # Course files
        "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
        "data/python_primer_data.jsonl": "python_primer_data.jsonl"
    }

    # Critical files that must be downloaded
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl"
    ]

    # Check and download each file
    for local_path, remote_filename in required_files.items():
        if not os.path.exists(local_path):
            logger.info(f"{remote_filename} not found. Attempting to download from HuggingFace...")
            try:
                hf_hub_download(
                    token=os.getenv("HF_TOKEN"),
                    repo_id="towardsai-tutors/ai-tutor-data",
                    filename=remote_filename,
                    repo_type="dataset",
                    local_dir="data",
                )
                logger.info(f"Successfully downloaded {remote_filename} from HuggingFace")
            except Exception as e:
                logger.warning(f"Could not download {remote_filename}: {e}")

                # Only create empty file for all_sources_data.jsonl if it's missing
                if local_path == "data/all_sources_data.jsonl":
                    logger.warning("Creating a new all_sources_data.jsonl file. This will not include previously existing data.")
                    with open(local_path, "w") as f:
                        pass

                # If critical file is missing, print a more serious warning
                if local_path in critical_files:
                    logger.warning(f"Critical file {remote_filename} is missing. The workflow may not function correctly.")

                    if local_path == "data/all_sources_contextual_nodes.pkl":
                        logger.warning("The context addition step will process all documents since no existing contexts were found.")


def load_jsonl(file_path: str) -> List[Dict]:
    """Load data from a JSONL file."""
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return data


def save_jsonl(data: List[Dict], file_path: str) -> None:
    """Save data to a JSONL file."""
    with open(file_path, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def process_markdown_files(course_name: str) -> str:
    """Process markdown files for a specific course. Returns path to output JSONL."""
    logger.info(f"Processing markdown files for course: {course_name}")
    cmd = ["python", "data/scraping_scripts/process_md_files.py", course_name]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error processing markdown files - check output above")
        sys.exit(1)

    logger.info(f"Successfully processed markdown files for {course_name}")

    # Determine the output file path from process_md_files.py
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        sys.exit(1)

    output_file = SOURCE_CONFIGS[course_name]["output_file"]
    return output_file


def manual_url_addition(jsonl_path: str) -> None:
    """Guide the user through manually adding URLs to the course JSONL."""
    logger.info(f"=== MANDATORY MANUAL STEP: URL ADDITION ===")
    logger.info(f"Please add the URLs to the course content in: {jsonl_path}")
    logger.info(f"For each document in the JSONL file:")
    logger.info(f"1. Open the file in a text editor")
    logger.info(f"2. Find the empty 'url' field for each document")
    logger.info(f"3. Add the appropriate URL from the live course platform")
    logger.info(f"   Example URL format: https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure")
    logger.info(f"4. Save the file when done")

    # Check if URLs are present
    data = load_jsonl(jsonl_path)
    missing_urls = sum(1 for item in data if not item.get("url"))

    if missing_urls > 0:
        logger.warning(f"Found {missing_urls} documents without URLs in {jsonl_path}")

        answer = input(
            f"\n{missing_urls} documents are missing URLs. Have you added all the URLs? (yes/no): "
        )
        if answer.lower() not in ["yes", "y"]:
            logger.info("Please add the URLs and run the script again.")
            sys.exit(0)
    else:
        logger.info("All documents have URLs. Continuing with the workflow.")


def merge_into_all_sources(course_jsonl_path: str) -> None:
    """Merge the course JSONL into all_sources_data.jsonl."""
    all_sources_path = "data/all_sources_data.jsonl"
    logger.info(f"Merging {course_jsonl_path} into {all_sources_path}")

    # Load course data
    course_data = load_jsonl(course_jsonl_path)

    # Load existing all_sources data if it exists
    all_data = []
    if os.path.exists(all_sources_path):
        all_data = load_jsonl(all_sources_path)

    # Get doc_ids from existing data
    existing_ids = {item["doc_id"] for item in all_data}

    # Add new course data (avoiding duplicates)
    new_items = 0
    for item in course_data:
        if item["doc_id"] not in existing_ids:
            all_data.append(item)
            existing_ids.add(item["doc_id"])
            new_items += 1

    # Save the combined data
    save_jsonl(all_data, all_sources_path)
    logger.info(f"Added {new_items} new documents to {all_sources_path}")


def get_processed_doc_ids() -> Set[str]:
    """Get set of doc_ids that have already been processed with context."""
    if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
        return set()

    try:
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            nodes = pickle.load(f)
            return {node.source_node.node_id for node in nodes}
    except Exception as e:
        logger.error(f"Error loading processed doc_ids: {e}")
        return set()


def add_context_to_nodes(new_only: bool = False) -> None:
    """Add context to document nodes, optionally processing only new content."""
    logger.info("Adding context to document nodes")

    if new_only:
        # Load all documents
        all_docs = load_jsonl("data/all_sources_data.jsonl")
        processed_ids = get_processed_doc_ids()

        # Filter for unprocessed documents
        new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]

        if not new_docs:
            logger.info("No new documents to process")
            return

        # Save temporary JSONL with only new documents
        temp_file = "data/new_docs_temp.jsonl"
        save_jsonl(new_docs, temp_file)

        # Temporarily modify the add_context_to_nodes.py script to use the temp file
        cmd = [
            "python",
            "-c",
            f"""
import asyncio
import os
import pickle
import json
from data.scraping_scripts.add_context_to_nodes import create_docs, process

async def main():
    # First, get the list of sources being updated from the temp file
    updated_sources = set()
    with open("{temp_file}", "r") as f:
        for line in f:
            data = json.loads(line)
            updated_sources.add(data["source"])

    print(f"Updating nodes for sources: {{updated_sources}}")

    # Process new documents
    documents = create_docs("{temp_file}")
    enhanced_nodes = await process(documents)
    print(f"Generated context for {{len(enhanced_nodes)}} new nodes")

    # Load existing nodes if they exist
    existing_nodes = []
    if os.path.exists("data/all_sources_contextual_nodes.pkl"):
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            existing_nodes = pickle.load(f)

        # Filter out existing nodes for sources we're updating
        filtered_nodes = []
        removed_count = 0

        for node in existing_nodes:
            # Try to extract source from node metadata
            try:
                source = None
                if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
                    source = node.source_node.metadata.get("source")
                elif hasattr(node, 'metadata'):
                    source = node.metadata.get("source")

                if source not in updated_sources:
                    filtered_nodes.append(node)
                else:
                    removed_count += 1
            except Exception:
                # Keep nodes where we can't determine the source
                filtered_nodes.append(node)

        print(f"Removed {{removed_count}} existing nodes for updated sources")
        existing_nodes = filtered_nodes

    # Combine filtered existing nodes with new nodes
    all_nodes = existing_nodes + enhanced_nodes

    # Save all nodes
    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(all_nodes, f)

    print(f"Total nodes in updated file: {{len(all_nodes)}}")

asyncio.run(main())
""",
        ]
    else:
        # Process all documents
        cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]

    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error adding context to nodes - check output above")
        sys.exit(1)

    logger.info("Successfully added context to nodes")

    # Clean up temp file if it exists
    if new_only and os.path.exists("data/new_docs_temp.jsonl"):
        os.remove("data/new_docs_temp.jsonl")


def create_vector_stores() -> None:
    """Create vector stores from processed documents."""
    logger.info("Creating vector stores")
    cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error creating vector stores - check output above")
        sys.exit(1)

    logger.info("Successfully created vector stores")


def upload_to_huggingface(upload_jsonl: bool = False) -> None:
    """Upload databases to HuggingFace."""
    logger.info("Uploading databases to HuggingFace")
    cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error uploading databases - check output above")
        sys.exit(1)

    logger.info("Successfully uploaded databases to HuggingFace")

    if upload_jsonl:
        logger.info("Uploading data files to HuggingFace")

        try:
            # Note: This uses a separate private repository
            cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
            result = subprocess.run(cmd)

            if result.returncode != 0:
                logger.error(f"Error uploading data files - check output above")
                sys.exit(1)

            logger.info("Successfully uploaded data files to HuggingFace")
        except Exception as e:
            logger.error(f"Error uploading JSONL file: {e}")
            sys.exit(1)


def update_ui_files(course_name: str) -> None:
    """Update main.py and setup.py with the new source."""
    logger.info(f"Updating UI files with new course: {course_name}")

    # Get the source configuration for display name
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        return

    # Get a readable display name for the UI
    display_name = course_name.replace("_", " ").title()

    # Update setup.py - add to AVAILABLE_SOURCES and AVAILABLE_SOURCES_UI
    setup_path = Path("scripts/setup.py")
    if setup_path.exists():
        setup_content = setup_path.read_text()

        # Check if already added
        if f'"{course_name}"' in setup_content:
            logger.info(f"Course {course_name} already in setup.py")
        else:
            # Add to AVAILABLE_SOURCES_UI
            ui_list_start = setup_content.find("AVAILABLE_SOURCES_UI = [")
            ui_list_end = setup_content.find("]", ui_list_start)
            new_ui_content = (
                setup_content[:ui_list_end]
                + f'    "{display_name}",\n'
                + setup_content[ui_list_end:]
            )

            # Add to AVAILABLE_SOURCES
            sources_list_start = new_ui_content.find("AVAILABLE_SOURCES = [")
            sources_list_end = new_ui_content.find("]", sources_list_start)
            new_content = (
                new_ui_content[:sources_list_end]
                + f'    "{course_name}",\n'
                + new_ui_content[sources_list_end:]
            )

            # Write updated content
            setup_path.write_text(new_content)
            logger.info(f"Updated setup.py with {course_name}")
    else:
        logger.warning(f"setup.py not found at {setup_path}")

    # Update main.py - add to source_mapping
    main_path = Path("scripts/main.py")
    if main_path.exists():
        main_content = main_path.read_text()

        # Check if already added
        if f'"{display_name}": "{course_name}"' in main_content:
            logger.info(f"Course {course_name} already in main.py")
        else:
            # Add to source_mapping
            mapping_start = main_content.find("source_mapping = {")
            mapping_end = main_content.find("}", mapping_start)
            new_main_content = (
                main_content[:mapping_end]
                + f'    "{display_name}": "{course_name}",\n'
                + main_content[mapping_end:]
            )

            # Add to default selected sources if not there
            value_start = new_main_content.find("value=[")
            value_end = new_main_content.find("]", value_start)

            if f'"{display_name}"' not in new_main_content[value_start:value_end]:
                new_main_content = (
                    new_main_content[: value_start + 7]
                    + f'    "{display_name}",\n'
                    + new_main_content[value_start + 7 :]
                )

            # Write updated content
            main_path.write_text(new_main_content)
            logger.info(f"Updated main.py with {course_name}")
    else:
        logger.warning(f"main.py not found at {main_path}")


def main():
    parser = argparse.ArgumentParser(
        description="AI Tutor App Course Addition Workflow"
    )
    parser.add_argument(
        "--course",
        required=True,
        help="Name of the course to process (must match SOURCE_CONFIGS)",
    )
    parser.add_argument(
        "--skip-process-md",
        action="store_true",
        help="Skip the markdown processing step",
    )
    parser.add_argument(
        "--skip-merge",
        action="store_true",
        help="Skip merging into all_sources_data.jsonl",
    )
    parser.add_argument(
        "--process-all-context",
        action="store_true",
        help="Process all content when adding context (default: only process new content)",
    )
    parser.add_argument(
        "--skip-context",
        action="store_true",
        help="Skip the context addition step entirely",
    )
    parser.add_argument(
        "--skip-vectors", action="store_true", help="Skip vector store creation"
    )
    parser.add_argument(
        "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
    )
    parser.add_argument(
        "--skip-ui-update",
        action="store_true",
        help="Skip updating the UI configuration",
    )
    parser.add_argument(
        "--skip-data-upload",
        action="store_true",
        help="Skip uploading data files to private HuggingFace repo (they are uploaded by default)",
    )

    args = parser.parse_args()
    course_name = args.course

    # Ensure required data files exist before proceeding
    ensure_required_files_exist()

    # Get the output file path
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        sys.exit(1)

    course_jsonl_path = SOURCE_CONFIGS[course_name]["output_file"]

    # Execute the workflow steps
    if not args.skip_process_md:
        course_jsonl_path = process_markdown_files(course_name)

    # Always do the manual URL addition step for courses
    manual_url_addition(course_jsonl_path)

    if not args.skip_merge:
        merge_into_all_sources(course_jsonl_path)

    if not args.skip_context:
        add_context_to_nodes(not args.process_all_context)

    if not args.skip_vectors:
        create_vector_stores()

    if not args.skip_upload:
        # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
        upload_to_huggingface(not args.skip_data_upload)

    if not args.skip_ui_update:
        update_ui_files(course_name)

    logger.info("Course addition workflow completed successfully")


if __name__ == "__main__":
    main()
data/scraping_scripts/create_vector_stores.py
ADDED
@@ -0,0 +1,218 @@
"""
Vector Store Creation Script

Purpose:
This script processes various data sources (e.g., transformers, peft, trl, llama_index, openai_cookbooks, langchain)
to create vector stores using Chroma and LlamaIndex. It reads data from JSONL files, creates document embeddings,
and stores them in persistent Chroma databases for efficient retrieval.

Usage:
    python script_name.py <source1> <source2> ...

Example:
    python script_name.py transformers peft llama_index

The script accepts one or more source names as command-line arguments. Valid source names are:
transformers, peft, trl, llama_index, openai_cookbooks, langchain

For each specified source, the script will:
1. Read data from the corresponding JSONL file
2. Create document embeddings
3. Store the embeddings in a Chroma vector database
4. Save a dictionary of documents for future reference

Note: Ensure that the input JSONL files are present in the 'data' directory.
"""

import argparse
import json
import os
import pdb
import pickle
import shutil

import chromadb
from dotenv import load_dotenv
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode, TextNode
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore

load_dotenv()

# Configuration for different sources
SOURCE_CONFIGS = {
    "transformers": {
        "input_file": "data/transformers_data.jsonl",
        "db_name": "chroma-db-transformers",
    },
    "peft": {"input_file": "data/peft_data.jsonl", "db_name": "chroma-db-peft"},
    "trl": {"input_file": "data/trl_data.jsonl", "db_name": "chroma-db-trl"},
    "llama_index": {
        "input_file": "data/llama_index_data.jsonl",
        "db_name": "chroma-db-llama_index",
    },
    "openai_cookbooks": {
        "input_file": "data/openai_cookbooks_data.jsonl",
        "db_name": "chroma-db-openai_cookbooks",
    },
    "langchain": {
        "input_file": "data/langchain_data.jsonl",
        "db_name": "chroma-db-langchain",
    },
    "tai_blog": {
        "input_file": "data/tai_blog_data.jsonl",
        "db_name": "chroma-db-tai_blog",
    },
    "all_sources": {
        "input_file": "data/all_sources_data.jsonl",
        "db_name": "chroma-db-all_sources",
    },
}


def create_docs(input_file: str) -> list[Document]:
    with open(input_file, "r") as f:
        documents = []
        for line in f:
            data = json.loads(line)
            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "retrieve_doc": data["retrieve_doc"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[  # url is included in LLM context
                        "title",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[  # title is embedded along the content
                        "url",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                )
            )
    return documents


def process_source(source: str):
    config = SOURCE_CONFIGS[source]

    input_file = config["input_file"]
    db_name = config["db_name"]
    db_path = f"data/{db_name}"

    print(f"Processing source: {source}")

    documents: list[Document] = create_docs(input_file)
    print(f"Created {len(documents)} documents")

    # Check if the folder exists and delete it
    if os.path.exists(db_path):
        print(f"Existing database found at {db_path}. Deleting...")
        shutil.rmtree(db_path)
        print(f"Deleted existing database at {db_path}")

    # Create Chroma client and collection
    chroma_client = chromadb.PersistentClient(path=f"data/{db_name}")
    chroma_collection = chroma_client.create_collection(db_name)

    # Create vector store and storage context
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Save document dictionary
    document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}
    document_dict_file = f"data/{db_name}/document_dict_{source}.pkl"
    with open(document_dict_file, "wb") as f:
        pickle.dump(document_dict, f)
    print(f"Saved document dictionary to {document_dict_file}")

    # Load nodes with context
    with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
        nodes_with_context: list[TextNode] = pickle.load(f)

    print(f"Loaded {len(nodes_with_context)} nodes with context")
    # pdb.set_trace()
    # exit()

    # Create vector store index
    index = VectorStoreIndex(
        nodes=nodes_with_context,
        # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
        embed_model=CohereEmbedding(
            api_key=os.environ["COHERE_API_KEY"],
            model_name="embed-english-v3.0",
            input_type="search_document",
        ),
        show_progress=True,
        use_async=True,
        storage_context=storage_context,
    )
    llm = OpenAI(
        temperature=1,
        model="gpt-4o-mini",
        # model="gpt-4o",
        max_tokens=5000,
        max_retries=3,
    )
    query_engine = index.as_query_engine(llm=llm)
    response = query_engine.query("How to fine-tune an llm?")
    print(response)
    for src in response.source_nodes:
        print("Node ID\t", src.node_id)
        print("Title\t", src.metadata["title"])
        print("Text\t", src.text)
        print("Score\t", src.score)
        print("-_" * 20)

    # # Create vector store index
    # index = VectorStoreIndex.from_documents(
    #     documents,
    #     # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
    #     embed_model=CohereEmbedding(
    #         api_key=os.environ["COHERE_API_KEY"],
    #         model_name="embed-english-v3.0",
    #         input_type="search_document",
    #     ),
    #     transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)],
    #     show_progress=True,
    #     use_async=True,
    #     storage_context=storage_context,
    # )
    print(f"Created vector store index for {source}")


def main(sources: list[str]):
    for source in sources:
        if source in SOURCE_CONFIGS:
            process_source(source)
        else:
            print(f"Unknown source: {source}. Skipping.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process sources and create vector stores."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
data/scraping_scripts/csv_to_jsonl.py
ADDED
@@ -0,0 +1,61 @@
import json
import uuid

import pandas as pd
import tiktoken


# Function to count tokens using tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(
        encoding.encode(
            string, disallowed_special=(encoding.special_tokens_set - {"<|endoftext|>"})
        )
    )
    return num_tokens


# Function to clean or remove specific content, e.g., copyright headers
def remove_copyright_header(content: str) -> str:
    # Implement any cleaning logic you need here
    return content


# Function to convert DataFrame to JSONL format with token counting
def convert_to_jsonl_with_conditions(df, encoding_name="cl100k_base"):
    jsonl_data = []
    for _, row in df.iterrows():
        token_count = num_tokens_from_string(row["text"], encoding_name)

        # Skip entries based on token count conditions
        if token_count < 100 or token_count > 200_000:
            print(f"Skipping {row['title']} due to token count {token_count}")
            continue

        cleaned_content = remove_copyright_header(row["text"])

        entry = {
            "tokens": token_count,  # Token count using tiktoken
            "doc_id": str(uuid.uuid4()),  # Generate a unique UUID
            "name": row["title"],
            "url": row["tai_url"],
            "retrieve_doc": (token_count <= 8000),  # retrieve_doc condition
            "source": "tai_blog",
            "content": cleaned_content,
        }
        jsonl_data.append(entry)
    return jsonl_data


# Load the CSV file
data = pd.read_csv("data/tai.csv")

# Convert the dataframe to JSONL format with token counting and conditions
jsonl_data_with_conditions = convert_to_jsonl_with_conditions(data)

# Save the output to a new JSONL file using json.dumps to ensure proper escaping
output_path = "data/tai_blog_data_conditions.jsonl"
with open(output_path, "w") as f:
    for entry in jsonl_data_with_conditions:
        f.write(json.dumps(entry) + "\n")
data/scraping_scripts/github_to_markdown_ai_docs.py
ADDED
@@ -0,0 +1,231 @@
1 |
+
"""
|
2 |
+
Fetch Markdown files from specified GitHub repositories.
|
3 |
+
|
4 |
+
This script fetches Markdown (.md), MDX (.mdx), and Jupyter Notebook (.ipynb) files
|
5 |
+
from specified GitHub repositories, particularly focusing on documentation sources
|
6 |
+
for various AI and machine learning libraries.
|
7 |
+
|
8 |
+
Key features:
|
9 |
+
1. Configurable for multiple documentation sources (e.g., Hugging Face Transformers, PEFT, TRL)
|
10 |
+
2. Command-line interface for specifying one or more sources to process
|
11 |
+
3. Automatic conversion of Jupyter Notebooks to Markdown
|
12 |
+
4. Rate limiting handling to comply with GitHub API restrictions
|
13 |
+
5. Retry mechanism for resilience against network issues
|
14 |
+
|
15 |
+
Usage:
|
16 |
+
python github_to_markdown_ai_docs.py <source1> [<source2> ...]
|
17 |
+
|
18 |
+
Where <sourceN> is one of the predefined sources in SOURCE_CONFIGS (e.g., 'transformers', 'peft', 'trl').
|
19 |
+
|
20 |
+
Example:
|
21 |
+
python github_to_markdown_ai_docs.py trl peft
|
22 |
+
|
23 |
+
This will download and process the documentation files for both TRL and PEFT libraries.
|
24 |
+
|
25 |
+
Note:
|
26 |
+
- Ensure you have set the GITHUB_TOKEN variable with your GitHub Personal Access Token.
|
27 |
+
- The script creates a 'data' directory in the current working directory to store the downloaded files.
|
28 |
+
- Each source's files are stored in a subdirectory named '<repo>_md_files'.
|
29 |
+
|
30 |
+
"""
|
31 |
+
|
32 |
+
import argparse
|
33 |
+
import json
|
34 |
+
import os
|
35 |
+
import random
|
36 |
+
import time
|
37 |
+
from typing import Dict, List
|
38 |
+
|
39 |
+
import nbformat
|
40 |
+
import requests
|
41 |
+
from dotenv import load_dotenv
|
42 |
+
from nbconvert import MarkdownExporter
|
43 |
+
|
44 |
+
load_dotenv()
|
45 |
+
|
46 |
+
# Configuration for different sources
|
47 |
+
SOURCE_CONFIGS = {
|
48 |
+
"transformers": {
|
49 |
+
"owner": "huggingface",
|
50 |
+
"repo": "transformers",
|
51 |
+
"path": "docs/source/en",
|
52 |
+
},
|
53 |
+
"peft": {
|
54 |
+
"owner": "huggingface",
|
55 |
+
"repo": "peft",
|
56 |
+
"path": "docs/source",
|
57 |
+
},
|
58 |
+
"trl": {
|
59 |
+
"owner": "huggingface",
|
60 |
+
"repo": "trl",
|
61 |
+
"path": "docs/source",
|
62 |
+
},
|
63 |
+
"llama_index": {
|
64 |
+
"owner": "run-llama",
|
65 |
+
"repo": "llama_index",
|
66 |
+
"path": "docs/docs",
|
67 |
+
},
|
68 |
+
"openai_cookbooks": {
|
69 |
+
"owner": "openai",
|
70 |
+
"repo": "openai-cookbook",
|
71 |
+
"path": "examples",
|
72 |
+
},
|
73 |
+
"langchain": {
|
74 |
+
"owner": "langchain-ai",
|
75 |
+
"repo": "langchain",
|
76 |
+
"path": "docs/docs",
|
77 |
+
},
|
78 |
+
}
|
79 |
+
|
80 |
+
# GitHub Personal Access Token (replace with your own token)
|
81 |
+
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
|
82 |
+
|
83 |
+
# Headers for authenticated requests
|
84 |
+
HEADERS = {
|
85 |
+
"Authorization": f"token {GITHUB_TOKEN}",
|
86 |
+
"Accept": "application/vnd.github.v3+json",
|
87 |
+
}
|
88 |
+
|
89 |
+
# Maximum number of retries
|
90 |
+
MAX_RETRIES = 5
|
91 |
+
|
92 |
+
|
93 |
+
def check_rate_limit():
|
94 |
+
rate_limit_url = "https://api.github.com/rate_limit"
|
95 |
+
response = requests.get(rate_limit_url, headers=HEADERS)
|
96 |
+
data = response.json()
|
97 |
+
remaining = data["resources"]["core"]["remaining"]
|
98 |
+
reset_time = data["resources"]["core"]["reset"]
|
99 |
+
|
100 |
+
if remaining < 10: # Adjust this threshold as needed
|
101 |
+
wait_time = reset_time - time.time()
|
102 |
+
print(f"Rate limit nearly exceeded. Waiting for {wait_time:.2f} seconds.")
|
103 |
+
        time.sleep(wait_time + 1)  # Add 1 second buffer


def get_files_in_directory(api_url: str, retries: int = 0) -> List[Dict]:
    try:
        check_rate_limit()
        response = requests.get(api_url, headers=HEADERS)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        if retries < MAX_RETRIES:
            wait_time = (2**retries) + random.random()
            print(
                f"Error fetching directory contents: {e}. Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
            return get_files_in_directory(api_url, retries + 1)
        else:
            print(
                f"Failed to fetch directory contents after {MAX_RETRIES} retries: {e}"
            )
            return []


def download_file(file_url: str, file_path: str, retries: int = 0):
    try:
        check_rate_limit()
        response = requests.get(file_url, headers=HEADERS)
        response.raise_for_status()
        with open(file_path, "wb") as file:
            file.write(response.content)
    except requests.exceptions.RequestException as e:
        if retries < MAX_RETRIES:
            wait_time = (2**retries) + random.random()
            print(
                f"Error downloading file: {e}. Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
            download_file(file_url, file_path, retries + 1)
        else:
            print(f"Failed to download file after {MAX_RETRIES} retries: {e}")


# def convert_ipynb_to_md(ipynb_path: str, md_path: str):
#     with open(ipynb_path, "r", encoding="utf-8") as f:
#         notebook = nbformat.read(f, as_version=4)

#     exporter = MarkdownExporter()
#     markdown, _ = exporter.from_notebook_node(notebook)

#     with open(md_path, "w", encoding="utf-8") as f:
#         f.write(markdown)


def convert_ipynb_to_md(ipynb_path: str, md_path: str):
    try:
        with open(ipynb_path, "r", encoding="utf-8") as f:
            notebook = nbformat.read(f, as_version=4)

        exporter = MarkdownExporter()
        markdown, _ = exporter.from_notebook_node(notebook)

        with open(md_path, "w", encoding="utf-8") as f:
            f.write(markdown)
    except (json.JSONDecodeError, nbformat.reader.NotJSONError) as e:
        print(f"Error converting notebook {ipynb_path}: {str(e)}")
        print("Skipping this file and continuing with others...")
    except Exception as e:
        print(f"Unexpected error converting notebook {ipynb_path}: {str(e)}")
        print("Skipping this file and continuing with others...")


def fetch_files(api_url: str, local_dir: str):
    files = get_files_in_directory(api_url)
    for file in files:
        if file["type"] == "file" and file["name"].endswith((".md", ".mdx", ".ipynb")):
            file_url = file["download_url"]
            file_name = file["name"]
            file_path = os.path.join(local_dir, file_name)
            print(f"Downloading {file_name}...")
            download_file(file_url, file_path)

            if file_name.endswith(".ipynb"):
                md_file_name = file_name.replace(".ipynb", ".md")
                md_file_path = os.path.join(local_dir, md_file_name)
                print(f"Converting {file_name} to markdown...")
                convert_ipynb_to_md(file_path, md_file_path)
                os.remove(file_path)  # Remove the .ipynb file after conversion
        elif file["type"] == "dir":
            subdir = os.path.join(local_dir, file["name"])
            os.makedirs(subdir, exist_ok=True)
            fetch_files(file["url"], subdir)


def process_source(source: str):
    if source not in SOURCE_CONFIGS:
        print(
            f"Error: Unknown source '{source}'. Available sources: {', '.join(SOURCE_CONFIGS.keys())}"
        )
        return

    config = SOURCE_CONFIGS[source]
    api_url = f"https://api.github.com/repos/{config['owner']}/{config['repo']}/contents/{config['path']}"
    local_dir = f"data/{config['repo']}_md_files"
    os.makedirs(local_dir, exist_ok=True)

    print(f"Processing source: {source}")
    fetch_files(api_url, local_dir)
    print(f"Finished processing {source}")


def main(sources: List[str]):
    for source in sources:
        process_source(source)
    print("All specified sources have been processed.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Fetch Markdown files from specified GitHub repositories."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
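As a usage sketch (not part of the committed script): with a GitHub token configured in `.env` and the repository root on `PYTHONPATH`, the downloader can also be driven programmatically instead of via the CLI. The source name below is only an example and must exist in this script's `SOURCE_CONFIGS`.

```python
# Sketch: download one source's docs programmatically (assumes the GitHub token env var is set).
from data.scraping_scripts.github_to_markdown_ai_docs import process_source

process_source("transformers")  # writes Markdown files under data/<repo>_md_files/
```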
data/scraping_scripts/process_md_files.py
ADDED
@@ -0,0 +1,370 @@
"""
Markdown Document Processor for Documentation Sources

This script processes Markdown (.md) and MDX (.mdx) files from various documentation sources
(such as Hugging Face Transformers, PEFT, TRL, LlamaIndex, and OpenAI Cookbook) and converts
them into a standardized JSONL format for further processing or indexing.

Key features:
1. Configurable for multiple documentation sources
2. Extracts titles, generates URLs, and counts tokens for each document
3. Supports inclusion/exclusion of specific directories and root files
4. Removes copyright headers from content
5. Generates a unique ID for each document
6. Determines if a whole document should be retrieved based on token count
7. Handles special cases like openai-cookbook repo by adding .ipynb extensions
8. Processes multiple sources in a single run

Usage:
    python process_md_files.py <source1> <source2> ...

Where <source1>, <source2>, etc. are one or more of the predefined sources in SOURCE_CONFIGS
(e.g., 'transformers', 'llama_index', 'openai_cookbooks').

The script processes all Markdown files in the specified input directories (and their subdirectories),
applies the configured filters, and saves the results in JSONL files. Each line in the output
files represents a single document with metadata and content.

To add or modify sources, update the SOURCE_CONFIGS dictionary at the top of the script.
"""

import argparse
import json
import logging
import os
import re
import uuid
from typing import Dict, List

import tiktoken

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration for different sources
SOURCE_CONFIGS = {
    "transformers": {
        "base_url": "https://huggingface.co/docs/transformers/",
        "input_directory": "data/transformers_md_files",
        "output_file": "data/transformers_data.jsonl",
        "source_name": "transformers",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": ["internal", "main_classes"],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "peft": {
        "base_url": "https://huggingface.co/docs/peft/",
        "input_directory": "data/peft_md_files",
        "output_file": "data/peft_data.jsonl",
        "source_name": "peft",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "trl": {
        "base_url": "https://huggingface.co/docs/trl/",
        "input_directory": "data/trl_md_files",
        "output_file": "data/trl_data.jsonl",
        "source_name": "trl",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "llama_index": {
        "base_url": "https://docs.llamaindex.ai/en/stable/",
        "input_directory": "data/llama_index_md_files",
        "output_file": "data/llama_index_data.jsonl",
        "source_name": "llama_index",
        "use_include_list": True,
        "included_dirs": [
            "getting_started",
            "understanding",
            "use_cases",
            "examples",
            "module_guides",
            "optimizing",
        ],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": ["index.md"],
        "url_extension": "",
    },
    "openai_cookbooks": {
        "base_url": "https://github.com/openai/openai-cookbook/blob/main/examples/",
        "input_directory": "data/openai-cookbook_md_files",
        "output_file": "data/openai_cookbooks_data.jsonl",
        "source_name": "openai_cookbooks",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": ".ipynb",
    },
    "langchain": {
        "base_url": "https://python.langchain.com/docs/",
        "input_directory": "data/langchain_md_files",
        "output_file": "data/langchain_data.jsonl",
        "source_name": "langchain",
        "use_include_list": True,
        "included_dirs": ["how_to", "versions", "tutorials", "integrations"],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": ["security.md", "concepts.mdx", "introduction.mdx"],
        "url_extension": "",
    },
    "tai_blog": {
        "base_url": "",
        "input_directory": "",
        "output_file": "data/tai_blog_data.jsonl",
        "source_name": "tai_blog",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "8-hour_primer": {
        "base_url": "",
        "input_directory": "data/8-hour_primer",  # Path to the directory that contains the Markdown files
        "output_file": "data/8-hour_primer_data.jsonl",  # 8-hour Generative AI Primer
        "source_name": "8-hour_primer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "llm_developer": {
        "base_url": "",
        "input_directory": "data/llm_developer",  # Path to the directory that contains the Markdown files
        "output_file": "data/llm_developer_data.jsonl",  # From Beginner to Advanced LLM Developer
        "source_name": "llm_developer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "python_primer": {
        "base_url": "",
        "input_directory": "data/python_primer",  # Path to the directory that contains the Markdown files
        "output_file": "data/python_primer_data.jsonl",  # Python Primer course
        "source_name": "python_primer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
}


def extract_title(content: str):
    title_match = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
    if title_match:
        return title_match.group(1).strip()

    lines = content.split("\n")
    for line in lines:
        if line.strip():
            return line.strip()

    return None


def generate_url(file_path: str, config: Dict) -> str:
    """
    Return an empty string if base_url is empty;
    otherwise return the constructed URL as before.
    """
    if not config["base_url"]:
        return ""

    path_without_extension = os.path.splitext(file_path)[0]
    path_with_forward_slashes = path_without_extension.replace("\\", "/")
    return config["base_url"] + path_with_forward_slashes + config["url_extension"]


def should_include_file(file_path: str, config: Dict) -> bool:
    if os.path.dirname(file_path) == "":
        if config["use_include_list"]:
            return os.path.basename(file_path) in config["included_root_files"]
        else:
            return os.path.basename(file_path) not in config["excluded_root_files"]

    if config["use_include_list"]:
        return any(file_path.startswith(dir) for dir in config["included_dirs"])
    else:
        return not any(file_path.startswith(dir) for dir in config["excluded_dirs"])


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens


def remove_copyright_header(content: str) -> str:
    header_pattern = re.compile(r"<!--Copyright.*?-->\s*", re.DOTALL)
    cleaned_content = header_pattern.sub("", content, count=1)
    return cleaned_content.strip()


def process_md_files(directory: str, config: Dict) -> List[Dict]:
    jsonl_data = []

    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".md") or file.endswith(".mdx"):
                file_path = os.path.join(root, file)
                relative_path = os.path.relpath(file_path, directory)

                if should_include_file(relative_path, config):
                    with open(file_path, "r", encoding="utf-8") as f:
                        content = f.read()

                    title = extract_title(content)
                    token_count = num_tokens_from_string(content, "cl100k_base")

                    # Skip very small or extremely large files
                    if token_count < 100 or token_count > 200_000:
                        logger.info(
                            f"Skipping {relative_path} due to token count {token_count}"
                        )
                        continue

                    cleaned_content = remove_copyright_header(content)

                    json_object = {
                        "tokens": token_count,
                        "doc_id": str(uuid.uuid4()),
                        "name": (title if title else file),
                        "url": generate_url(relative_path, config),
                        "retrieve_doc": (token_count <= 8000),
                        "source": config["source_name"],
                        "content": cleaned_content,
                    }

                    jsonl_data.append(json_object)

    return jsonl_data


def save_jsonl(data: List[Dict], output_file: str) -> None:
    with open(output_file, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def combine_all_sources(sources: List[str]) -> None:
    """
    Combine JSONL files from multiple sources, preserving existing sources not being processed.

    For example, if sources = ['transformers'], this will:
    1. Load data from transformers_data.jsonl
    2. Load data from all other source JSONL files that exist (course files, etc.)
    3. Combine them all into all_sources_data.jsonl
    """
    all_data = []
    output_file = "data/all_sources_data.jsonl"

    # Track which sources we're processing
    processed_sources = set()

    # First, add data from sources we're explicitly processing
    for source in sources:
        if source not in SOURCE_CONFIGS:
            logger.error(f"Unknown source '{source}'. Skipping.")
            continue

        processed_sources.add(source)
        input_file = SOURCE_CONFIGS[source]["output_file"]
        logger.info(f"Processing updated source: {source} from {input_file}")

        try:
            source_data = []
            with open(input_file, "r", encoding="utf-8") as f:
                for line in f:
                    source_data.append(json.loads(line))

            logger.info(f"Added {len(source_data)} documents from {source}")
            all_data.extend(source_data)
        except Exception as e:
            logger.error(f"Error loading {input_file}: {e}")

    # Now add data from all other sources not being processed
    for source_name, config in SOURCE_CONFIGS.items():
        # Skip sources we already processed
        if source_name in processed_sources:
            continue

        # Try to load the individual source file
        source_file = config["output_file"]
        if os.path.exists(source_file):
            logger.info(f"Preserving existing source: {source_name} from {source_file}")
            try:
                source_data = []
                with open(source_file, "r", encoding="utf-8") as f:
                    for line in f:
                        source_data.append(json.loads(line))

                logger.info(f"Preserved {len(source_data)} documents from {source_name}")
                all_data.extend(source_data)
            except Exception as e:
                logger.error(f"Error loading {source_file}: {e}")

    logger.info(f"Total documents combined: {len(all_data)}")
    save_jsonl(all_data, output_file)
    logger.info(f"Combined data saved to {output_file}")


def process_source(source: str) -> None:
    if source not in SOURCE_CONFIGS:
        logger.error(f"Unknown source '{source}'. Skipping.")
        return

    config = SOURCE_CONFIGS[source]
    logger.info(f"\n\nProcessing source: {source}")
    jsonl_data = process_md_files(config["input_directory"], config)
    save_jsonl(jsonl_data, config["output_file"])
    logger.info(
        f"Processed {len(jsonl_data)} files and saved to {config['output_file']}"
    )


def main(sources: List[str]) -> None:
    for source in sources:
        process_source(source)

    if len(sources) > 1:
        combine_all_sources(sources)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process Markdown files from specified sources."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
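For reference, each line in a JSONL output file is the `json_object` dict built above, serialized as JSON. An illustrative record is sketched below; all field values are hypothetical, not taken from a real run.

```python
# Hypothetical example of one output record written by process_md_files:
example_record = {
    "tokens": 1850,                      # tiktoken cl100k_base token count
    "doc_id": "0b1c2d3e-4f56-7890-abcd-ef0123456789",  # random UUID4
    "name": "Quickstart",                # first Markdown H1, or the filename as a fallback
    "url": "https://huggingface.co/docs/transformers/quicktour",  # empty string for course sources
    "retrieve_doc": True,                # True when tokens <= 8000
    "source": "transformers",
    "content": "# Quickstart\n...",      # Markdown body with the copyright header stripped
}
```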
data/scraping_scripts/update_docs_workflow.py
ADDED
@@ -0,0 +1,409 @@
#!/usr/bin/env python
"""
AI Tutor App - Documentation Update Workflow

This script automates the process of updating documentation from GitHub repositories:
1. Download documentation from GitHub using the API
2. Process markdown files to create JSONL data
3. Add contextual information to document nodes
4. Create vector stores
5. Upload databases to HuggingFace

This workflow is specific to updating library documentation (Transformers, PEFT, LlamaIndex, etc.).
For adding courses, use the add_course_workflow.py script instead.

Usage:
    python update_docs_workflow.py --sources [SOURCE1] [SOURCE2] ...

Additional flags to run specific steps (if you want to restart from a specific point):
    --skip-download          Skip the GitHub download step
    --skip-process           Skip the markdown processing step
    --process-all-context    Process all content when adding context (default: only new content)
    --skip-context           Skip the context addition step entirely
    --skip-vectors           Skip vector store creation
    --skip-upload            Skip uploading to HuggingFace
"""

import argparse
import json
import logging
import os
import pickle
import subprocess
import sys
from typing import Dict, List, Set

from dotenv import load_dotenv
from huggingface_hub import HfApi, hf_hub_download

# Load environment variables from .env file
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def ensure_required_files_exist():
    """Download required data files from HuggingFace if they don't exist locally."""
    # List of files to check and download
    required_files = {
        # Critical files
        "data/all_sources_data.jsonl": "all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",
        # Documentation source files
        "data/transformers_data.jsonl": "transformers_data.jsonl",
        "data/peft_data.jsonl": "peft_data.jsonl",
        "data/trl_data.jsonl": "trl_data.jsonl",
        "data/llama_index_data.jsonl": "llama_index_data.jsonl",
        "data/langchain_data.jsonl": "langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",
        # Course files
        "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
        "data/python_primer_data.jsonl": "python_primer_data.jsonl",
    }

    # Critical files that must be downloaded
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
    ]

    # Check and download each file
    for local_path, remote_filename in required_files.items():
        if not os.path.exists(local_path):
            logger.info(
                f"{remote_filename} not found. Attempting to download from HuggingFace..."
            )
            try:
                hf_hub_download(
                    token=os.getenv("HF_TOKEN"),
                    repo_id="towardsai-tutors/ai-tutor-data",
                    filename=remote_filename,
                    repo_type="dataset",
                    local_dir="data",
                )
                logger.info(
                    f"Successfully downloaded {remote_filename} from HuggingFace"
                )
            except Exception as e:
                logger.warning(f"Could not download {remote_filename}: {e}")

                # Only create empty file for all_sources_data.jsonl if it's missing
                if local_path == "data/all_sources_data.jsonl":
                    logger.warning(
                        "Creating a new all_sources_data.jsonl file. This will not include previously existing data."
                    )
                    with open(local_path, "w") as f:
                        pass

                # If critical file is missing, print a more serious warning
                if local_path in critical_files:
                    logger.warning(
                        f"Critical file {remote_filename} is missing. The workflow may not function correctly."
                    )

                    if local_path == "data/all_sources_contextual_nodes.pkl":
                        logger.warning(
                            "The context addition step will process all documents since no existing contexts were found."
                        )


# Documentation sources that can be updated via GitHub API
GITHUB_SOURCES = [
    "transformers",
    "peft",
    "trl",
    "llama_index",
    "openai_cookbooks",
    "langchain",
]


def load_jsonl(file_path: str) -> List[Dict]:
    """Load data from a JSONL file."""
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return data


def save_jsonl(data: List[Dict], file_path: str) -> None:
    """Save data to a JSONL file."""
    with open(file_path, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def download_from_github(sources: List[str]) -> None:
    """Download documentation from GitHub repositories."""
    logger.info(f"Downloading documentation from GitHub for sources: {sources}")

    for source in sources:
        if source not in GITHUB_SOURCES:
            logger.warning(f"Source {source} is not a GitHub source, skipping download")
            continue

        logger.info(f"Downloading {source} documentation")
        cmd = ["python", "data/scraping_scripts/github_to_markdown_ai_docs.py", source]
        result = subprocess.run(cmd)

        if result.returncode != 0:
            logger.error(
                f"Error downloading {source} documentation - check output above"
            )
            # Continue with other sources instead of exiting
            continue

        logger.info(f"Successfully downloaded {source} documentation")


def process_markdown_files(sources: List[str]) -> None:
    """Process markdown files for specific sources."""
    logger.info(f"Processing markdown files for sources: {sources}")

    cmd = ["python", "data/scraping_scripts/process_md_files.py"] + sources
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error processing markdown files - check output above")
        sys.exit(1)

    logger.info("Successfully processed markdown files")


def get_processed_doc_ids() -> Set[str]:
    """Get set of doc_ids that have already been processed with context."""
    if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
        return set()

    try:
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            nodes = pickle.load(f)
            return {node.source_node.node_id for node in nodes}
    except Exception as e:
        logger.error(f"Error loading processed doc_ids: {e}")
        return set()


def add_context_to_nodes(new_only: bool = False) -> None:
    """Add context to document nodes, optionally processing only new content."""
    logger.info("Adding context to document nodes")

    if new_only:
        # Load all documents
        all_docs = load_jsonl("data/all_sources_data.jsonl")
        processed_ids = get_processed_doc_ids()

        # Filter for unprocessed documents
        new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]

        if not new_docs:
            logger.info("No new documents to process")
            return

        # Save temporary JSONL with only new documents
        temp_file = "data/new_docs_temp.jsonl"
        save_jsonl(new_docs, temp_file)

        # Run a small inline script that reuses add_context_to_nodes.py against the temp file
        cmd = [
            "python",
            "-c",
            f"""
import asyncio
import os
import pickle
import json
from data.scraping_scripts.add_context_to_nodes import create_docs, process

async def main():
    # First, get the list of sources being updated from the temp file
    updated_sources = set()
    with open("{temp_file}", "r") as f:
        for line in f:
            data = json.loads(line)
            updated_sources.add(data["source"])

    print(f"Updating nodes for sources: {{updated_sources}}")

    # Process new documents
    documents = create_docs("{temp_file}")
    enhanced_nodes = await process(documents)
    print(f"Generated context for {{len(enhanced_nodes)}} new nodes")

    # Load existing nodes if they exist
    existing_nodes = []
    if os.path.exists("data/all_sources_contextual_nodes.pkl"):
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            existing_nodes = pickle.load(f)

    # Filter out existing nodes for sources we're updating
    filtered_nodes = []
    removed_count = 0

    for node in existing_nodes:
        # Try to extract source from node metadata
        try:
            source = None
            if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
                source = node.source_node.metadata.get("source")
            elif hasattr(node, 'metadata'):
                source = node.metadata.get("source")

            if source not in updated_sources:
                filtered_nodes.append(node)
            else:
                removed_count += 1
        except Exception:
            # Keep nodes where we can't determine the source
            filtered_nodes.append(node)

    print(f"Removed {{removed_count}} existing nodes for updated sources")
    existing_nodes = filtered_nodes

    # Combine filtered existing nodes with new nodes
    all_nodes = existing_nodes + enhanced_nodes

    # Save all nodes
    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(all_nodes, f)

    print(f"Total nodes in updated file: {{len(all_nodes)}}")

asyncio.run(main())
""",
        ]
    else:
        # Process all documents
        logger.info("Adding context to all nodes")
        cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]

    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error adding context to nodes - check output above")
        sys.exit(1)

    logger.info("Successfully added context to nodes")

    # Clean up temp file if it exists
    if new_only and os.path.exists("data/new_docs_temp.jsonl"):
        os.remove("data/new_docs_temp.jsonl")


def create_vector_stores() -> None:
    """Create vector stores from processed documents."""
    logger.info("Creating vector stores")
    cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error creating vector stores - check output above")
        sys.exit(1)

    logger.info("Successfully created vector stores")


def upload_to_huggingface(upload_jsonl: bool = False) -> None:
    """Upload databases to HuggingFace."""
    logger.info("Uploading databases to HuggingFace")
    cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error uploading databases - check output above")
        sys.exit(1)

    logger.info("Successfully uploaded databases to HuggingFace")

    if upload_jsonl:
        logger.info("Uploading data files to HuggingFace")

        try:
            # Note: This uses a separate private repository
            cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
            result = subprocess.run(cmd)

            if result.returncode != 0:
                logger.error("Error uploading data files - check output above")
                sys.exit(1)

            logger.info("Successfully uploaded data files to HuggingFace")
        except Exception as e:
            logger.error(f"Error uploading JSONL file: {e}")
            sys.exit(1)


def main():
    parser = argparse.ArgumentParser(
        description="AI Tutor App Documentation Update Workflow"
    )
    parser.add_argument(
        "--sources",
        nargs="+",
        choices=GITHUB_SOURCES,
        default=GITHUB_SOURCES,
        help="GitHub documentation sources to update",
    )
    parser.add_argument(
        "--skip-download", action="store_true", help="Skip downloading from GitHub"
    )
    parser.add_argument(
        "--skip-process", action="store_true", help="Skip processing markdown files"
    )
    parser.add_argument(
        "--process-all-context",
        action="store_true",
        help="Process all content when adding context (default: only process new content)",
    )
    parser.add_argument(
        "--skip-context",
        action="store_true",
        help="Skip the context addition step entirely",
    )
    parser.add_argument(
        "--skip-vectors", action="store_true", help="Skip vector store creation"
    )
    parser.add_argument(
        "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
    )
    parser.add_argument(
        "--skip-data-upload",
        action="store_true",
        help="Skip uploading data files (.jsonl and .pkl) to private HuggingFace repo (they are uploaded by default)",
    )

    args = parser.parse_args()

    # Ensure required data files exist before proceeding
    ensure_required_files_exist()

    # Execute the workflow steps
    if not args.skip_download:
        download_from_github(args.sources)

    if not args.skip_process:
        process_markdown_files(args.sources)

    if not args.skip_context:
        add_context_to_nodes(not args.process_all_context)

    if not args.skip_vectors:
        create_vector_stores()

    if not args.skip_upload:
        # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
        upload_to_huggingface(not args.skip_data_upload)

    logger.info("Documentation update workflow completed successfully")


if __name__ == "__main__":
    main()
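A sketch of a partial re-run, mirroring the subprocess pattern the workflow itself uses; the flags are the ones defined in `main()` above, and the choice of source is only an example.

```python
# Sketch: resume the workflow for a single source, skipping the download and
# markdown-processing steps that already completed in a previous invocation.
import subprocess

subprocess.run(
    [
        "python",
        "data/scraping_scripts/update_docs_workflow.py",
        "--sources", "transformers",
        "--skip-download",
        "--skip-process",
    ],
    check=True,
)
```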
data/scraping_scripts/upload_data_to_hf.py
ADDED
@@ -0,0 +1,129 @@
#!/usr/bin/env python
"""
Upload Data Files to HuggingFace

This script uploads key data files to a private HuggingFace dataset repository:
1. all_sources_data.jsonl - The raw document data
2. all_sources_contextual_nodes.pkl - The processed nodes with added context

This is useful for new team members who need the latest version of the data.

Usage:
    python upload_data_to_hf.py [--repo REPO_ID]

Arguments:
    --repo REPO_ID    HuggingFace dataset repository ID (default: towardsai-tutors/ai-tutor-data)
"""

import argparse
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()


def upload_files_to_huggingface(repo_id="towardsai-tutors/ai-tutor-data"):
    """Upload data files to a private HuggingFace repository."""
    # Main files to upload
    files_to_upload = [
        # Combined data and vector store
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
        # Individual source files
        "data/transformers_data.jsonl",
        "data/peft_data.jsonl",
        "data/trl_data.jsonl",
        "data/llama_index_data.jsonl",
        "data/langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl",
        # Course files
        "data/tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl",
        "data/python_primer_data.jsonl",
    ]

    # Filter to only include files that exist
    existing_files = []
    missing_files = []

    for file_path in files_to_upload:
        if os.path.exists(file_path):
            existing_files.append(file_path)
        else:
            missing_files.append(file_path)

    # Critical files must exist
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
    ]
    critical_missing = [f for f in critical_files if f in missing_files]

    if critical_missing:
        print(
            f"Error: The following critical files were not found: {', '.join(critical_missing)}"
        )
        # return False

    if missing_files:
        print(
            f"Warning: The following files were not found and will not be uploaded: {', '.join(missing_files)}"
        )
        print("This is normal if you're only updating certain sources.")

    try:
        api = HfApi(token=os.getenv("HF_TOKEN"))

        # Check that the repository exists; abort if it doesn't
        try:
            api.repo_info(repo_id=repo_id, repo_type="dataset")
            print(f"Repository {repo_id} exists")
        except Exception:
            print(
                f"Repository {repo_id} doesn't exist. Please create it first on the HuggingFace platform."
            )
            print("Make sure to set it as private if needed.")
            return False

        # Upload all existing files
        for file_path in existing_files:
            try:
                file_name = os.path.basename(file_path)
                print(f"Uploading {file_name}...")

                api.upload_file(
                    path_or_fileobj=file_path,
                    path_in_repo=file_name,
                    repo_id=repo_id,
                    repo_type="dataset",
                )
                print(
                    f"Successfully uploaded {file_name} to HuggingFace repository {repo_id}"
                )
            except Exception as e:
                print(f"Error uploading {file_path}: {e}")
                # Continue with other files even if one fails

        return True
    except Exception as e:
        print(f"Error uploading files: {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Upload Data Files to HuggingFace")
    parser.add_argument(
        "--repo",
        default="towardsai-tutors/ai-tutor-data",
        help="HuggingFace dataset repository ID",
    )

    args = parser.parse_args()
    upload_files_to_huggingface(args.repo)


if __name__ == "__main__":
    main()
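A minimal sketch of programmatic use, assuming `HF_TOKEN` is set in `.env` and the target dataset repository already exists on the Hub; the repository name below is hypothetical.

```python
# Sketch: push the data files to an alternative private dataset repo.
from data.scraping_scripts.upload_data_to_hf import upload_files_to_huggingface

upload_files_to_huggingface(repo_id="my-org/ai-tutor-data-backup")  # hypothetical repo ID
```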
data/scraping_scripts/upload_dbs_to_hf.py
ADDED
@@ -0,0 +1,38 @@
"""
Hugging Face Data Upload Script

Purpose:
This script uploads a local folder to a Hugging Face dataset repository. It's designed to
update or create a dataset on the Hugging Face Hub by uploading the contents of a specified
local folder.

Usage:
- Run the script: python data/scraping_scripts/upload_dbs_to_hf.py

The script will:
- Upload the contents of the 'data' folder to the specified Hugging Face dataset repository.
- https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db

Configuration:
- The script is set to upload to the "towardsai-tutors/ai-tutor-vector-db" dataset repository.
- It deletes all existing files in the repository before uploading (due to delete_patterns=["*"]).
- JSONL, Python, text, notebook, Markdown, and .pyc files are excluded via ignore_patterns.
"""

import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()

api = HfApi(token=os.getenv("HF_TOKEN"))

api.upload_folder(
    folder_path="data",
    repo_id="towardsai-tutors/ai-tutor-vector-db",
    repo_type="dataset",
    # multi_commits=True,
    # multi_commits_verbose=True,
    delete_patterns=["*"],
    ignore_patterns=["*.jsonl", "*.py", "*.txt", "*.ipynb", "*.md", "*.pyc"],
)
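As an optional sanity check (a sketch, not part of the committed script), `HfApi.list_repo_files` can confirm which files actually landed in the vector-db repository after the upload; it assumes `HF_TOKEN` grants read access.

```python
# Sketch: list the files now present in the dataset repo after upload_folder runs.
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()
api = HfApi(token=os.getenv("HF_TOKEN"))
for path in api.list_repo_files("towardsai-tutors/ai-tutor-vector-db", repo_type="dataset"):
    print(path)
```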