Nullpointer-KK committed
Commit c9da9e4 · unverified · 1 parent: 6225a0c

Add files via upload

data/scraping_scripts/README.md ADDED
@@ -0,0 +1,104 @@
1
+ # AI Tutor App Data Workflows
2
+
3
+ This directory contains scripts for managing the AI Tutor App's data pipeline.
4
+
5
+ ## Workflow Scripts
6
+
7
+ ### 1. Adding a New Course
8
+
9
+ To add a new course to the AI Tutor:
10
+
11
+ ```bash
12
+ python add_course_workflow.py --course [COURSE_NAME]
13
+ ```
14
+
15
+ This will guide you through the complete process:
16
+
17
+ 1. Process markdown files from Notion exports
18
+ 2. Prompt you to manually add URLs to the course content
19
+ 3. Merge the course data into the main dataset
20
+ 4. Add contextual information to document nodes
21
+ 5. Create vector stores
22
+ 6. Upload databases to HuggingFace
23
+ 7. Update UI configuration
24
+
25
+ **Requirements before running:**
26
+
27
+ - The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
28
+ - Course markdown files must be placed in the directory specified in the configuration
29
+ - You must have access to the live course platform to add URLs
30
+
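+ For reference, a new course entry in `SOURCE_CONFIGS` follows the same shape as the existing course entries in `process_md_files.py`. A minimal sketch (the `new_course` name and paths are placeholders, not a real configuration):
+
+ ```python
+ "new_course": {
+     "base_url": "",                         # courses have no public docs URL, so this stays empty
+     "input_directory": "data/new_course",   # directory containing the exported markdown files
+     "output_file": "data/new_course_data.jsonl",
+     "source_name": "new_course",
+     "use_include_list": False,
+     "included_dirs": [],
+     "excluded_dirs": [],
+     "excluded_root_files": [],
+     "included_root_files": [],
+     "url_extension": "",
+ },
+ ```
+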
31
+ ### 2. Updating Documentation via GitHub API
32
+
33
+ To update library documentation from GitHub repositories:
34
+
35
+ ```bash
36
+ python update_docs_workflow.py
37
+ ```
38
+
39
+ This updates all supported documentation sources. To restrict the run to specific sources, list them explicitly:
40
+
41
+ ```bash
42
+ python update_docs_workflow.py --sources transformers peft
43
+ ```
44
+
45
+ The workflow includes:
46
+
47
+ 1. Downloading documentation from GitHub using the API
48
+ 2. Processing markdown files to create JSONL data
49
+ 3. Adding contextual information to document nodes
50
+ 4. Creating vector stores
51
+ 5. Uploading databases to HuggingFace
52
+
53
+ ### 3. Uploading JSONL to HuggingFace
54
+
55
+ To upload the main JSONL file to a private HuggingFace repository:
56
+
57
+ ```bash
58
+ python upload_jsonl_to_hf.py
59
+ ```
60
+
61
+ This is useful for sharing the latest data with team members.
62
+
63
+ ## Individual Components
64
+
65
+ If you need to run specific steps individually:
66
+
67
+ - **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
68
+ - **Process Markdown**: `process_md_files.py`
69
+ - **Add Context**: `add_context_to_nodes.py`
70
+ - **Create Vector Stores**: `create_vector_stores.py`
71
+ - **Upload to HuggingFace**: `upload_dbs_to_hf.py`
72
+
73
+ ## Tips for New Team Members
74
+
75
+ 1. To update the AI Tutor with new content:
76
+ - For new courses, use `add_course_workflow.py`
77
+ - For updated documentation, use `update_docs_workflow.py`
78
+
79
+ 2. When adding URLs to course content:
80
+ - Get the URLs from the live course platform
81
+ - Add them to the generated JSONL file in the `url` field
82
+ - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
83
+ - Make sure every document has a valid URL (an example record is shown after these tips)
84
+
85
+ 3. By default, only new content will have context added to save time and resources. Use `--process-all-context` only if you need to regenerate context for all documents. Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default).
86
+
87
+ 4. When adding a new course, verify that it appears in the Gradio UI:
88
+ - The workflow automatically updates `main.py` and `setup.py` to include the new source
89
+ - Check that the new source appears in the dropdown menu in the UI
90
+ - Make sure it's properly included in the default selected sources
91
+ - Restart the Gradio app to see the changes
92
+
93
+ 5. First time setup or missing files:
94
+ - Both workflows automatically check for and download required data files:
95
+ - `all_sources_data.jsonl` - Contains the raw document data
96
+ - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
97
+ - If the PKL file exists, only new content is processed by default when context is added
98
+ - You must have proper HuggingFace credentials with access to the private repository
99
+
100
+ 6. Make sure you have the required environment variables set:
101
+ - `OPENAI_API_KEY` for LLM processing
102
+ - `COHERE_API_KEY` for embeddings
103
+ - `HF_TOKEN` for HuggingFace uploads
104
+ - `GITHUB_TOKEN` for accessing documentation via the GitHub API
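+
+ For reference (see tip 2 above), each line of a course JSONL file is one JSON document record. A hypothetical record with its `url` filled in, shown as the Python dict that `process_md_files.py` builds (the `doc_id` and token count here are made up):
+
+ ```python
+ {
+     "tokens": 1450,
+     "doc_id": "4f1c2a9e-7b3d-4e2a-9c0d-1a2b3c4d5e6f",  # random UUID, illustrative only
+     "name": "Course Structure",
+     "url": "https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure",
+     "retrieve_doc": True,
+     "source": "python_primer",
+     "content": "...",  # full markdown content of the lesson
+ }
+ ```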
data/scraping_scripts/add_context_to_nodes.py ADDED
@@ -0,0 +1,196 @@
1
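+ """
+ Add an LLM-generated situating context to every document chunk before indexing.
+
+ Reads data/all_sources_data.jsonl, splits each document into ~800-token chunks,
+ asks gpt-4o-mini (via instructor) for a short context that situates each chunk
+ within its full document, appends that context to the chunk text, and pickles
+ the enhanced nodes to data/all_sources_contextual_nodes.pkl for
+ create_vector_stores.py to consume.
+ """
+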
+ import asyncio
2
+ import json
3
+ import pdb
4
+ import pickle
5
+ from typing import Dict, List
6
+
7
+ import instructor
8
+ import logfire
9
+ import tiktoken
10
+ from anthropic import AsyncAnthropic
11
+ from dotenv import load_dotenv
12
+ from jinja2 import Template
13
+ from llama_index.core import Document
14
+ from llama_index.core.ingestion import IngestionPipeline
15
+ from llama_index.core.node_parser import SentenceSplitter
16
+ from llama_index.core.schema import TextNode
17
+ from openai import AsyncOpenAI
18
+ from pydantic import BaseModel, Field
19
+ from tenacity import retry, stop_after_attempt, wait_exponential
20
+ from tqdm.asyncio import tqdm
21
+
22
+ load_dotenv(".env")
23
+
24
+ # logfire.configure()
25
+
26
+
27
+ def create_docs(input_file: str) -> List[Document]:
28
+ with open(input_file, "r") as f:
29
+ documents: list[Document] = []
30
+ for line in f:
31
+ data = json.loads(line)
32
+ documents.append(
33
+ Document(
34
+ doc_id=data["doc_id"],
35
+ text=data["content"],
36
+ metadata={ # type: ignore
37
+ "url": data["url"],
38
+ "title": data["name"],
39
+ "tokens": data["tokens"],
40
+ "retrieve_doc": data["retrieve_doc"],
41
+ "source": data["source"],
42
+ },
43
+ excluded_llm_metadata_keys=[
44
+ "title",
45
+ "tokens",
46
+ "retrieve_doc",
47
+ "source",
48
+ ],
49
+ excluded_embed_metadata_keys=[
50
+ "url",
51
+ "tokens",
52
+ "retrieve_doc",
53
+ "source",
54
+ ],
55
+ )
56
+ )
57
+ return documents
58
+
59
+
60
+ class SituatedContext(BaseModel):
61
+ title: str = Field(..., description="The title of the document.")
62
+ context: str = Field(
63
+ ..., description="The context to situate the chunk within the document."
64
+ )
65
+
66
+
67
+ # client = AsyncInstructor(
68
+ # client=AsyncAnthropic(),
69
+ # create=patch(
70
+ # create=AsyncAnthropic().beta.prompt_caching.messages.create,
71
+ # mode=Mode.ANTHROPIC_TOOLS,
72
+ # ),
73
+ # mode=Mode.ANTHROPIC_TOOLS,
74
+ # )
75
+ aclient = AsyncOpenAI()
76
+ # logfire.instrument_openai(aclient)
77
+ client: instructor.AsyncInstructor = instructor.from_openai(aclient)
78
+
79
+
80
+ @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=4, max=10))
81
+ async def situate_context(doc: str, chunk: str) -> str:
82
+ template = Template(
83
+ """
84
+ <document>
85
+ {{ doc }}
86
+ </document>
87
+
88
+ Here is the chunk we want to situate within the whole document above:
89
+
90
+ <chunk>
91
+ {{ chunk }}
92
+ </chunk>
93
+
94
+ Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
95
+ Answer only with the succinct context and nothing else.
96
+ """
97
+ )
98
+
99
+ content = template.render(doc=doc, chunk=chunk)
100
+
101
+ response = await client.chat.completions.create(
102
+ model="gpt-4o-mini",
103
+ max_tokens=1000,
104
+ temperature=0,
105
+ messages=[
106
+ {
107
+ "role": "user",
108
+ "content": content,
109
+ }
110
+ ],
111
+ response_model=SituatedContext,
112
+ )
113
+ return response.context
114
+
115
+
116
+ async def process_chunk(node: TextNode, document_dict: dict) -> TextNode:
117
+ doc_id: str = node.source_node.node_id # type: ignore
118
+ doc: Document = document_dict[doc_id]
119
+
120
+ if doc.metadata["tokens"] > 120_000:
121
+ # Tokenize the document text
122
+ encoding = tiktoken.encoding_for_model("gpt-4o-mini")
123
+ tokens = encoding.encode(doc.get_content())
124
+
125
+ # Trim to 120,000 tokens
126
+ trimmed_tokens = tokens[:120_000]
127
+
128
+ # Decode back to text
129
+ trimmed_text = encoding.decode(trimmed_tokens)
130
+
131
+ # Update the document with trimmed text
132
+ doc = Document(text=trimmed_text, metadata=doc.metadata)
133
+ doc.metadata["tokens"] = 120_000
134
+
135
+ context: str = await situate_context(doc.get_content(), node.text)
136
+ node.text = f"{node.text}\n\n{context}"
137
+ return node
138
+
139
+
140
+ async def process(
141
+ documents: List[Document], semaphore_limit: int = 50
142
+ ) -> List[TextNode]:
143
+
144
+ # From the document, we create chunks
145
+ pipeline = IngestionPipeline(
146
+ transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
147
+ )
148
+ all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
149
+ print(f"Number of nodes: {len(all_nodes)}")
150
+
151
+ document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}
152
+
153
+ semaphore = asyncio.Semaphore(semaphore_limit)
154
+
155
+ async def process_with_semaphore(node):
156
+ async with semaphore:
157
+ result = await process_chunk(node, document_dict)
158
+ await asyncio.sleep(0.1)
159
+ return result
160
+
161
+ tasks = [process_with_semaphore(node) for node in all_nodes]
162
+
163
+ results: List[TextNode] = await tqdm.gather(*tasks, desc="Processing chunks")
164
+
165
+ # pdb.set_trace()
166
+
167
+ return results
168
+
169
+
170
+ async def main():
171
+ documents: List[Document] = create_docs("data/all_sources_data.jsonl")
172
+ enhanced_nodes: List[TextNode] = await process(documents)
173
+
174
+ with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
175
+ pickle.dump(enhanced_nodes, f)
176
+
177
+ # pipeline = IngestionPipeline(
178
+ # transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
179
+ # )
180
+ # all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
181
+ # print(all_nodes[7933])
182
+ # pdb.set_trace()
183
+
184
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
185
+ enhanced_nodes: list[TextNode] = pickle.load(f)
186
+
187
+ for i, node in enumerate(enhanced_nodes):
188
+ print(f"Chunk {i + 1}:")
189
+ print(f"Node: {node}")
190
+ print(f"Text: {node.text}")
191
+ # pdb.set_trace()
192
+ break
193
+
194
+
195
+ if __name__ == "__main__":
196
+ asyncio.run(main())
data/scraping_scripts/add_course_workflow.py ADDED
@@ -0,0 +1,541 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ AI Tutor App - Course Addition Workflow
4
+
5
+ This script guides you through the complete process of adding a new course to the AI Tutor App:
6
+
7
+ 1. Process course markdown files to create JSONL data
8
+ 2. MANDATORY MANUAL STEP: Add URLs to course content in the generated JSONL
9
+ 3. Merge course JSONL into all_sources_data.jsonl
10
+ 4. Add contextual information to document nodes
11
+ 5. Create vector stores
12
+ 6. Upload databases to HuggingFace
13
+ 7. Update UI configuration
14
+
15
+ Usage:
16
+ python add_course_workflow.py --course [COURSE_NAME]
17
+
18
+ Additional flags to run specific steps (if you want to restart from a specific point):
19
+ --skip-process-md Skip the markdown processing step
20
+ --skip-merge Skip merging into all_sources_data.jsonl
21
+ --process-all-context Process all content when adding context (default: only new content is processed)
22
+ --skip-context Skip the context addition step entirely
23
+ --skip-vectors Skip vector store creation
24
+ --skip-upload Skip uploading to HuggingFace
25
+ --skip-ui-update Skip updating the UI configuration
+ --skip-data-upload Skip uploading data files to the private HuggingFace repo (they are uploaded by default)
26
+ """
27
+
28
+ import argparse
29
+ import json
30
+ import logging
31
+ import os
32
+ import pickle
33
+ import subprocess
34
+ import sys
35
+ import time
36
+ from pathlib import Path
37
+ from typing import Dict, List, Set
38
+
39
+ from dotenv import load_dotenv
40
+ from huggingface_hub import HfApi, hf_hub_download
41
+
42
+ # Load environment variables from .env file
43
+ load_dotenv()
44
+
45
+ # Configure logging
46
+ logging.basicConfig(
47
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
48
+ )
49
+ logger = logging.getLogger(__name__)
50
+
51
+
52
+ def ensure_required_files_exist():
53
+ """Download required data files from HuggingFace if they don't exist locally."""
54
+ # List of files to check and download
55
+ required_files = {
56
+ # Critical files
57
+ "data/all_sources_data.jsonl": "all_sources_data.jsonl",
58
+ "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",
59
+
60
+ # Documentation source files
61
+ "data/transformers_data.jsonl": "transformers_data.jsonl",
62
+ "data/peft_data.jsonl": "peft_data.jsonl",
63
+ "data/trl_data.jsonl": "trl_data.jsonl",
64
+ "data/llama_index_data.jsonl": "llama_index_data.jsonl",
65
+ "data/langchain_data.jsonl": "langchain_data.jsonl",
66
+ "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",
67
+
68
+ # Course files
69
+ "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
70
+ "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
71
+ "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
72
+ "data/python_primer_data.jsonl": "python_primer_data.jsonl"
73
+ }
74
+
75
+ # Critical files that must be downloaded
76
+ critical_files = [
77
+ "data/all_sources_data.jsonl",
78
+ "data/all_sources_contextual_nodes.pkl"
79
+ ]
80
+
81
+ # Check and download each file
82
+ for local_path, remote_filename in required_files.items():
83
+ if not os.path.exists(local_path):
84
+ logger.info(f"{remote_filename} not found. Attempting to download from HuggingFace...")
85
+ try:
86
+ hf_hub_download(
87
+ token=os.getenv("HF_TOKEN"),
88
+ repo_id="towardsai-tutors/ai-tutor-data",
89
+ filename=remote_filename,
90
+ repo_type="dataset",
91
+ local_dir="data",
92
+ )
93
+ logger.info(f"Successfully downloaded {remote_filename} from HuggingFace")
94
+ except Exception as e:
95
+ logger.warning(f"Could not download {remote_filename}: {e}")
96
+
97
+ # Only create empty file for all_sources_data.jsonl if it's missing
98
+ if local_path == "data/all_sources_data.jsonl":
99
+ logger.warning("Creating a new all_sources_data.jsonl file. This will not include previously existing data.")
100
+ with open(local_path, "w") as f:
101
+ pass
102
+
103
+ # If critical file is missing, print a more serious warning
104
+ if local_path in critical_files:
105
+ logger.warning(f"Critical file {remote_filename} is missing. The workflow may not function correctly.")
106
+
107
+ if local_path == "data/all_sources_contextual_nodes.pkl":
108
+ logger.warning("The context addition step will process all documents since no existing contexts were found.")
109
+
110
+
111
+ def load_jsonl(file_path: str) -> List[Dict]:
112
+ """Load data from a JSONL file."""
113
+ data = []
114
+ with open(file_path, "r", encoding="utf-8") as f:
115
+ for line in f:
116
+ data.append(json.loads(line))
117
+ return data
118
+
119
+
120
+ def save_jsonl(data: List[Dict], file_path: str) -> None:
121
+ """Save data to a JSONL file."""
122
+ with open(file_path, "w", encoding="utf-8") as f:
123
+ for item in data:
124
+ json.dump(item, f, ensure_ascii=False)
125
+ f.write("\n")
126
+
127
+
128
+ def process_markdown_files(course_name: str) -> str:
129
+ """Process markdown files for a specific course. Returns path to output JSONL."""
130
+ logger.info(f"Processing markdown files for course: {course_name}")
131
+ cmd = ["python", "data/scraping_scripts/process_md_files.py", course_name]
132
+ result = subprocess.run(cmd)
133
+
134
+ if result.returncode != 0:
135
+ logger.error(f"Error processing markdown files - check output above")
136
+ sys.exit(1)
137
+
138
+ logger.info(f"Successfully processed markdown files for {course_name}")
139
+
140
+ # Determine the output file path from process_md_files.py
141
+ from data.scraping_scripts.process_md_files import SOURCE_CONFIGS
142
+
143
+ if course_name not in SOURCE_CONFIGS:
144
+ logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
145
+ sys.exit(1)
146
+
147
+ output_file = SOURCE_CONFIGS[course_name]["output_file"]
148
+ return output_file
149
+
150
+
151
+ def manual_url_addition(jsonl_path: str) -> None:
152
+ """Guide the user through manually adding URLs to the course JSONL."""
153
+ logger.info(f"=== MANDATORY MANUAL STEP: URL ADDITION ===")
154
+ logger.info(f"Please add the URLs to the course content in: {jsonl_path}")
155
+ logger.info(f"For each document in the JSONL file:")
156
+ logger.info(f"1. Open the file in a text editor")
157
+ logger.info(f"2. Find the empty 'url' field for each document")
158
+ logger.info(f"3. Add the appropriate URL from the live course platform")
159
+ logger.info(f" Example URL format: https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure")
160
+ logger.info(f"4. Save the file when done")
161
+
162
+ # Check if URLs are present
163
+ data = load_jsonl(jsonl_path)
164
+ missing_urls = sum(1 for item in data if not item.get("url"))
165
+
166
+ if missing_urls > 0:
167
+ logger.warning(f"Found {missing_urls} documents without URLs in {jsonl_path}")
168
+
169
+ answer = input(
170
+ f"\n{missing_urls} documents are missing URLs. Have you added all the URLs? (yes/no): "
171
+ )
172
+ if answer.lower() not in ["yes", "y"]:
173
+ logger.info("Please add the URLs and run the script again.")
174
+ sys.exit(0)
175
+ else:
176
+ logger.info("All documents have URLs. Continuing with the workflow.")
177
+
178
+
179
+ def merge_into_all_sources(course_jsonl_path: str) -> None:
180
+ """Merge the course JSONL into all_sources_data.jsonl."""
181
+ all_sources_path = "data/all_sources_data.jsonl"
182
+ logger.info(f"Merging {course_jsonl_path} into {all_sources_path}")
183
+
184
+ # Load course data
185
+ course_data = load_jsonl(course_jsonl_path)
186
+
187
+ # Load existing all_sources data if it exists
188
+ all_data = []
189
+ if os.path.exists(all_sources_path):
190
+ all_data = load_jsonl(all_sources_path)
191
+
192
+ # Get doc_ids from existing data
193
+ existing_ids = {item["doc_id"] for item in all_data}
194
+
195
+ # Add new course data (avoiding duplicates)
196
+ new_items = 0
197
+ for item in course_data:
198
+ if item["doc_id"] not in existing_ids:
199
+ all_data.append(item)
200
+ existing_ids.add(item["doc_id"])
201
+ new_items += 1
202
+
203
+ # Save the combined data
204
+ save_jsonl(all_data, all_sources_path)
205
+ logger.info(f"Added {new_items} new documents to {all_sources_path}")
206
+
207
+
208
+ def get_processed_doc_ids() -> Set[str]:
209
+ """Get set of doc_ids that have already been processed with context."""
210
+ if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
211
+ return set()
212
+
213
+ try:
214
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
215
+ nodes = pickle.load(f)
216
+ return {node.source_node.node_id for node in nodes}
217
+ except Exception as e:
218
+ logger.error(f"Error loading processed doc_ids: {e}")
219
+ return set()
220
+
221
+
222
+ def add_context_to_nodes(new_only: bool = False) -> None:
223
+ """Add context to document nodes, optionally processing only new content."""
224
+ logger.info("Adding context to document nodes")
225
+
226
+ if new_only:
227
+ # Load all documents
228
+ all_docs = load_jsonl("data/all_sources_data.jsonl")
229
+ processed_ids = get_processed_doc_ids()
230
+
231
+ # Filter for unprocessed documents
232
+ new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]
233
+
234
+ if not new_docs:
235
+ logger.info("No new documents to process")
236
+ return
237
+
238
+ # Save temporary JSONL with only new documents
239
+ temp_file = "data/new_docs_temp.jsonl"
240
+ save_jsonl(new_docs, temp_file)
241
+
242
+ # Run an inline script that reuses create_docs/process from add_context_to_nodes.py on the temp file
243
+ cmd = [
244
+ "python",
245
+ "-c",
246
+ f"""
247
+ import asyncio
248
+ import os
249
+ import pickle
250
+ import json
251
+ from data.scraping_scripts.add_context_to_nodes import create_docs, process
252
+
253
+ async def main():
254
+ # First, get the list of sources being updated from the temp file
255
+ updated_sources = set()
256
+ with open("{temp_file}", "r") as f:
257
+ for line in f:
258
+ data = json.loads(line)
259
+ updated_sources.add(data["source"])
260
+
261
+ print(f"Updating nodes for sources: {{updated_sources}}")
262
+
263
+ # Process new documents
264
+ documents = create_docs("{temp_file}")
265
+ enhanced_nodes = await process(documents)
266
+ print(f"Generated context for {{len(enhanced_nodes)}} new nodes")
267
+
268
+ # Load existing nodes if they exist
269
+ existing_nodes = []
270
+ if os.path.exists("data/all_sources_contextual_nodes.pkl"):
271
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
272
+ existing_nodes = pickle.load(f)
273
+
274
+ # Filter out existing nodes for sources we're updating
275
+ filtered_nodes = []
276
+ removed_count = 0
277
+
278
+ for node in existing_nodes:
279
+ # Try to extract source from node metadata
280
+ try:
281
+ source = None
282
+ if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
283
+ source = node.source_node.metadata.get("source")
284
+ elif hasattr(node, 'metadata'):
285
+ source = node.metadata.get("source")
286
+
287
+ if source not in updated_sources:
288
+ filtered_nodes.append(node)
289
+ else:
290
+ removed_count += 1
291
+ except Exception:
292
+ # Keep nodes where we can't determine the source
293
+ filtered_nodes.append(node)
294
+
295
+ print(f"Removed {{removed_count}} existing nodes for updated sources")
296
+ existing_nodes = filtered_nodes
297
+
298
+ # Combine filtered existing nodes with new nodes
299
+ all_nodes = existing_nodes + enhanced_nodes
300
+
301
+ # Save all nodes
302
+ with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
303
+ pickle.dump(all_nodes, f)
304
+
305
+ print(f"Total nodes in updated file: {{len(all_nodes)}}")
306
+
307
+ asyncio.run(main())
308
+ """,
309
+ ]
310
+ else:
311
+ # Process all documents
312
+ cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]
313
+
314
+ result = subprocess.run(cmd)
315
+
316
+ if result.returncode != 0:
317
+ logger.error(f"Error adding context to nodes - check output above")
318
+ sys.exit(1)
319
+
320
+ logger.info("Successfully added context to nodes")
321
+
322
+ # Clean up temp file if it exists
323
+ if new_only and os.path.exists("data/new_docs_temp.jsonl"):
324
+ os.remove("data/new_docs_temp.jsonl")
325
+
326
+
327
+ def create_vector_stores() -> None:
328
+ """Create vector stores from processed documents."""
329
+ logger.info("Creating vector stores")
330
+ cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
331
+ result = subprocess.run(cmd)
332
+
333
+ if result.returncode != 0:
334
+ logger.error(f"Error creating vector stores - check output above")
335
+ sys.exit(1)
336
+
337
+ logger.info("Successfully created vector stores")
338
+
339
+
340
+ def upload_to_huggingface(upload_jsonl: bool = False) -> None:
341
+ """Upload databases to HuggingFace."""
342
+ logger.info("Uploading databases to HuggingFace")
343
+ cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
344
+ result = subprocess.run(cmd)
345
+
346
+ if result.returncode != 0:
347
+ logger.error(f"Error uploading databases - check output above")
348
+ sys.exit(1)
349
+
350
+ logger.info("Successfully uploaded databases to HuggingFace")
351
+
352
+ if upload_jsonl:
353
+ logger.info("Uploading data files to HuggingFace")
354
+
355
+ try:
356
+ # Note: This uses a separate private repository
357
+ cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
358
+ result = subprocess.run(cmd)
359
+
360
+ if result.returncode != 0:
361
+ logger.error(f"Error uploading data files - check output above")
362
+ sys.exit(1)
363
+
364
+ logger.info("Successfully uploaded data files to HuggingFace")
365
+ except Exception as e:
366
+ logger.error(f"Error uploading JSONL file: {e}")
367
+ sys.exit(1)
368
+
369
+
370
+ def update_ui_files(course_name: str) -> None:
371
+ """Update main.py and setup.py with the new source."""
372
+ logger.info(f"Updating UI files with new course: {course_name}")
373
+
374
+ # Get the source configuration for display name
375
+ from data.scraping_scripts.process_md_files import SOURCE_CONFIGS
376
+
377
+ if course_name not in SOURCE_CONFIGS:
378
+ logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
379
+ return
380
+
381
+ # Get a readable display name for the UI
382
+ display_name = course_name.replace("_", " ").title()
383
+
384
+ # Update setup.py - add to AVAILABLE_SOURCES and AVAILABLE_SOURCES_UI
385
+ setup_path = Path("scripts/setup.py")
386
+ if setup_path.exists():
387
+ setup_content = setup_path.read_text()
388
+
389
+ # Check if already added
390
+ if f'"{course_name}"' in setup_content:
391
+ logger.info(f"Course {course_name} already in setup.py")
392
+ else:
393
+ # Add to AVAILABLE_SOURCES_UI
394
+ ui_list_start = setup_content.find("AVAILABLE_SOURCES_UI = [")
395
+ ui_list_end = setup_content.find("]", ui_list_start)
396
+ new_ui_content = (
397
+ setup_content[:ui_list_end]
398
+ + f' "{display_name}",\n'
399
+ + setup_content[ui_list_end:]
400
+ )
401
+
402
+ # Add to AVAILABLE_SOURCES
403
+ sources_list_start = new_ui_content.find("AVAILABLE_SOURCES = [")
404
+ sources_list_end = new_ui_content.find("]", sources_list_start)
405
+ new_content = (
406
+ new_ui_content[:sources_list_end]
407
+ + f' "{course_name}",\n'
408
+ + new_ui_content[sources_list_end:]
409
+ )
410
+
411
+ # Write updated content
412
+ setup_path.write_text(new_content)
413
+ logger.info(f"Updated setup.py with {course_name}")
414
+ else:
415
+ logger.warning(f"setup.py not found at {setup_path}")
416
+
417
+ # Update main.py - add to source_mapping
418
+ main_path = Path("scripts/main.py")
419
+ if main_path.exists():
420
+ main_content = main_path.read_text()
421
+
422
+ # Check if already added
423
+ if f'"{display_name}": "{course_name}"' in main_content:
424
+ logger.info(f"Course {course_name} already in main.py")
425
+ else:
426
+ # Add to source_mapping
427
+ mapping_start = main_content.find("source_mapping = {")
428
+ mapping_end = main_content.find("}", mapping_start)
429
+ new_main_content = (
430
+ main_content[:mapping_end]
431
+ + f' "{display_name}": "{course_name}",\n'
432
+ + main_content[mapping_end:]
433
+ )
434
+
435
+ # Add to default selected sources if not there
436
+ value_start = new_main_content.find("value=[")
437
+ value_end = new_main_content.find("]", value_start)
438
+
439
+ if f'"{display_name}"' not in new_main_content[value_start:value_end]:
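+ # "value=[" is 7 characters long, so slicing at value_start + 7 inserts the new
+ # display name right after the opening bracket, making it a default-selected source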
440
+ new_main_content = (
441
+ new_main_content[: value_start + 7]
442
+ + f' "{display_name}",\n'
443
+ + new_main_content[value_start + 7 :]
444
+ )
445
+
446
+ # Write updated content
447
+ main_path.write_text(new_main_content)
448
+ logger.info(f"Updated main.py with {course_name}")
449
+ else:
450
+ logger.warning(f"main.py not found at {main_path}")
451
+
452
+
453
+ def main():
454
+ parser = argparse.ArgumentParser(
455
+ description="AI Tutor App Course Addition Workflow"
456
+ )
457
+ parser.add_argument(
458
+ "--course",
459
+ required=True,
460
+ help="Name of the course to process (must match SOURCE_CONFIGS)",
461
+ )
462
+ parser.add_argument(
463
+ "--skip-process-md",
464
+ action="store_true",
465
+ help="Skip the markdown processing step",
466
+ )
467
+ parser.add_argument(
468
+ "--skip-merge",
469
+ action="store_true",
470
+ help="Skip merging into all_sources_data.jsonl",
471
+ )
472
+ parser.add_argument(
473
+ "--process-all-context",
474
+ action="store_true",
475
+ help="Process all content when adding context (default: only process new content)",
476
+ )
477
+ parser.add_argument(
478
+ "--skip-context",
479
+ action="store_true",
480
+ help="Skip the context addition step entirely",
481
+ )
482
+ parser.add_argument(
483
+ "--skip-vectors", action="store_true", help="Skip vector store creation"
484
+ )
485
+ parser.add_argument(
486
+ "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
487
+ )
488
+ parser.add_argument(
489
+ "--skip-ui-update",
490
+ action="store_true",
491
+ help="Skip updating the UI configuration",
492
+ )
493
+ parser.add_argument(
494
+ "--skip-data-upload",
495
+ action="store_true",
496
+ help="Skip uploading data files to private HuggingFace repo (they are uploaded by default)",
497
+ )
498
+
499
+ args = parser.parse_args()
500
+ course_name = args.course
501
+
502
+ # Ensure required data files exist before proceeding
503
+ ensure_required_files_exist()
504
+
505
+ # Get the output file path
506
+ from data.scraping_scripts.process_md_files import SOURCE_CONFIGS
507
+
508
+ if course_name not in SOURCE_CONFIGS:
509
+ logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
510
+ sys.exit(1)
511
+
512
+ course_jsonl_path = SOURCE_CONFIGS[course_name]["output_file"]
513
+
514
+ # Execute the workflow steps
515
+ if not args.skip_process_md:
516
+ course_jsonl_path = process_markdown_files(course_name)
517
+
518
+ # Always do the manual URL addition step for courses
519
+ manual_url_addition(course_jsonl_path)
520
+
521
+ if not args.skip_merge:
522
+ merge_into_all_sources(course_jsonl_path)
523
+
524
+ if not args.skip_context:
525
+ add_context_to_nodes(not args.process_all_context)
526
+
527
+ if not args.skip_vectors:
528
+ create_vector_stores()
529
+
530
+ if not args.skip_upload:
531
+ # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
532
+ upload_to_huggingface(not args.skip_data_upload)
533
+
534
+ if not args.skip_ui_update:
535
+ update_ui_files(course_name)
536
+
537
+ logger.info("Course addition workflow completed successfully")
538
+
539
+
540
+ if __name__ == "__main__":
541
+ main()
data/scraping_scripts/create_vector_stores.py ADDED
@@ -0,0 +1,218 @@
1
+ """
2
+ Vector Store Creation Script
3
+
4
+ Purpose:
5
+ This script processes various data sources (e.g., transformers, peft, trl, llama_index, openai_cookbooks, langchain)
6
+ to create vector stores using Chroma and LlamaIndex. It reads data from JSONL files, creates document embeddings,
7
+ and stores them in persistent Chroma databases for efficient retrieval.
8
+
9
+ Usage:
10
+ python create_vector_stores.py <source1> <source2> ...
11
+
12
+ Example:
13
+ python create_vector_stores.py transformers peft llama_index
14
+
15
+ The script accepts one or more source names as command-line arguments. Valid source names are:
16
+ transformers, peft, trl, llama_index, openai_cookbooks, langchain, tai_blog, all_sources
17
+
18
+ For each specified source, the script will:
19
+ 1. Read data from the corresponding JSONL file
20
+ 2. Create document embeddings
21
+ 3. Store the embeddings in a Chroma vector database
22
+ 4. Save a dictionary of documents for future reference
23
+
24
+ Note: Ensure that the input JSONL files are present in the 'data' directory.
25
+ """
26
+
27
+ import argparse
28
+ import json
29
+ import os
30
+ import pdb
31
+ import pickle
32
+ import shutil
33
+
34
+ import chromadb
35
+ from dotenv import load_dotenv
36
+ from llama_index.core import Document, StorageContext, VectorStoreIndex
37
+ from llama_index.core.node_parser import SentenceSplitter
38
+ from llama_index.core.schema import MetadataMode, TextNode
39
+ from llama_index.embeddings.cohere import CohereEmbedding
40
+ from llama_index.embeddings.openai import OpenAIEmbedding
41
+ from llama_index.llms.openai import OpenAI
42
+ from llama_index.vector_stores.chroma import ChromaVectorStore
43
+
44
+ load_dotenv()
45
+
46
+ # Configuration for different sources
47
+ SOURCE_CONFIGS = {
48
+ "transformers": {
49
+ "input_file": "data/transformers_data.jsonl",
50
+ "db_name": "chroma-db-transformers",
51
+ },
52
+ "peft": {"input_file": "data/peft_data.jsonl", "db_name": "chroma-db-peft"},
53
+ "trl": {"input_file": "data/trl_data.jsonl", "db_name": "chroma-db-trl"},
54
+ "llama_index": {
55
+ "input_file": "data/llama_index_data.jsonl",
56
+ "db_name": "chroma-db-llama_index",
57
+ },
58
+ "openai_cookbooks": {
59
+ "input_file": "data/openai_cookbooks_data.jsonl",
60
+ "db_name": "chroma-db-openai_cookbooks",
61
+ },
62
+ "langchain": {
63
+ "input_file": "data/langchain_data.jsonl",
64
+ "db_name": "chroma-db-langchain",
65
+ },
66
+ "tai_blog": {
67
+ "input_file": "data/tai_blog_data.jsonl",
68
+ "db_name": "chroma-db-tai_blog",
69
+ },
70
+ "all_sources": {
71
+ "input_file": "data/all_sources_data.jsonl",
72
+ "db_name": "chroma-db-all_sources",
73
+ },
74
+ }
75
+
76
+
77
+ def create_docs(input_file: str) -> list[Document]:
78
+ with open(input_file, "r") as f:
79
+ documents = []
80
+ for line in f:
81
+ data = json.loads(line)
82
+ documents.append(
83
+ Document(
84
+ doc_id=data["doc_id"],
85
+ text=data["content"],
86
+ metadata={ # type: ignore
87
+ "url": data["url"],
88
+ "title": data["name"],
89
+ "tokens": data["tokens"],
90
+ "retrieve_doc": data["retrieve_doc"],
91
+ "source": data["source"],
92
+ },
93
+ excluded_llm_metadata_keys=[ # url is included in LLM context
94
+ "title",
95
+ "tokens",
96
+ "retrieve_doc",
97
+ "source",
98
+ ],
99
+ excluded_embed_metadata_keys=[ # title is embedded along the content
100
+ "url",
101
+ "tokens",
102
+ "retrieve_doc",
103
+ "source",
104
+ ],
105
+ )
106
+ )
107
+ return documents
108
+
109
+
110
+ def process_source(source: str):
111
+ config = SOURCE_CONFIGS[source]
112
+
113
+ input_file = config["input_file"]
114
+ db_name = config["db_name"]
115
+ db_path = f"data/{db_name}"
116
+
117
+ print(f"Processing source: {source}")
118
+
119
+ documents: list[Document] = create_docs(input_file)
120
+ print(f"Created {len(documents)} documents")
121
+
122
+ # Check if the folder exists and delete it
123
+ if os.path.exists(db_path):
124
+ print(f"Existing database found at {db_path}. Deleting...")
125
+ shutil.rmtree(db_path)
126
+ print(f"Deleted existing database at {db_path}")
127
+
128
+ # Create Chroma client and collection
129
+ chroma_client = chromadb.PersistentClient(path=f"data/{db_name}")
130
+ chroma_collection = chroma_client.create_collection(db_name)
131
+
132
+ # Create vector store and storage context
133
+ vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
134
+ storage_context = StorageContext.from_defaults(vector_store=vector_store)
135
+
136
+ # Save document dictionary
137
+ document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}
138
+ document_dict_file = f"data/{db_name}/document_dict_{source}.pkl"
139
+ with open(document_dict_file, "wb") as f:
140
+ pickle.dump(document_dict, f)
141
+ print(f"Saved document dictionary to {document_dict_file}")
142
+
143
+ # Load nodes with context
144
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
145
+ nodes_with_context: list[TextNode] = pickle.load(f)
146
+
147
+ print(f"Loaded {len(nodes_with_context)} nodes with context")
148
+ # pdb.set_trace()
149
+ # exit()
150
+
151
+ # Create vector store index
152
+ index = VectorStoreIndex(
153
+ nodes=nodes_with_context,
154
+ # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
155
+ embed_model=CohereEmbedding(
156
+ api_key=os.environ["COHERE_API_KEY"],
157
+ model_name="embed-english-v3.0",
158
+ input_type="search_document",
159
+ ),
160
+ show_progress=True,
161
+ use_async=True,
162
+ storage_context=storage_context,
163
+ )
164
+ llm = OpenAI(
165
+ temperature=1,
166
+ model="gpt-4o-mini",
167
+ # model="gpt-4o",
168
+ max_tokens=5000,
169
+ max_retries=3,
170
+ )
171
+ query_engine = index.as_query_engine(llm=llm)
172
+ response = query_engine.query("How to fine-tune an llm?")
173
+ print(response)
174
+ for src in response.source_nodes:
175
+ print("Node ID\t", src.node_id)
176
+ print("Title\t", src.metadata["title"])
177
+ print("Text\t", src.text)
178
+ print("Score\t", src.score)
179
+ print("-_" * 20)
180
+
181
+ # # Create vector store index
182
+ # index = VectorStoreIndex.from_documents(
183
+ # documents,
184
+ # # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
185
+ # embed_model=CohereEmbedding(
186
+ # api_key=os.environ["COHERE_API_KEY"],
187
+ # model_name="embed-english-v3.0",
188
+ # input_type="search_document",
189
+ # ),
190
+ # transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)],
191
+ # show_progress=True,
192
+ # use_async=True,
193
+ # storage_context=storage_context,
194
+ # )
195
+ print(f"Created vector store index for {source}")
196
+
197
+
198
+ def main(sources: list[str]):
199
+ for source in sources:
200
+ if source in SOURCE_CONFIGS:
201
+ process_source(source)
202
+ else:
203
+ print(f"Unknown source: {source}. Skipping.")
204
+
205
+
206
+ if __name__ == "__main__":
207
+ parser = argparse.ArgumentParser(
208
+ description="Process sources and create vector stores."
209
+ )
210
+ parser.add_argument(
211
+ "sources",
212
+ nargs="+",
213
+ choices=SOURCE_CONFIGS.keys(),
214
+ help="Specify one or more sources to process",
215
+ )
216
+ args = parser.parse_args()
217
+
218
+ main(args.sources)
data/scraping_scripts/csv_to_jsonl.py ADDED
@@ -0,0 +1,61 @@
1
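+ """
+ Convert the tai_blog CSV export (data/tai.csv) into the JSONL format used by the other
+ scraping scripts: counts tokens with tiktoken, skips very short (<100 tokens) or very
+ long (>200k tokens) articles, and writes data/tai_blog_data_conditions.jsonl.
+ """
+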
+ import json
2
+ import uuid
3
+
4
+ import pandas as pd
5
+ import tiktoken
6
+
7
+
8
+ # Function to count tokens using tiktoken
9
+ def num_tokens_from_string(string: str, encoding_name: str) -> int:
10
+ encoding = tiktoken.get_encoding(encoding_name)
11
+ num_tokens = len(
12
+ encoding.encode(
13
+ string, disallowed_special=(encoding.special_tokens_set - {"<|endoftext|>"})
14
+ )
15
+ )
16
+ return num_tokens
17
+
18
+
19
+ # Function to clean or remove specific content, e.g., copyright headers
20
+ def remove_copyright_header(content: str) -> str:
21
+ # Implement any cleaning logic you need here
22
+ return content
23
+
24
+
25
+ # Function to convert DataFrame to JSONL format with token counting
26
+ def convert_to_jsonl_with_conditions(df, encoding_name="cl100k_base"):
27
+ jsonl_data = []
28
+ for _, row in df.iterrows():
29
+ token_count = num_tokens_from_string(row["text"], encoding_name)
30
+
31
+ # Skip entries based on token count conditions
32
+ if token_count < 100 or token_count > 200_000:
33
+ print(f"Skipping {row['title']} due to token count {token_count}")
34
+ continue
35
+
36
+ cleaned_content = remove_copyright_header(row["text"])
37
+
38
+ entry = {
39
+ "tokens": token_count, # Token count using tiktoken
40
+ "doc_id": str(uuid.uuid4()), # Generate a unique UUID
41
+ "name": row["title"],
42
+ "url": row["tai_url"],
43
+ "retrieve_doc": (token_count <= 8000), # retrieve_doc condition
44
+ "source": "tai_blog",
45
+ "content": cleaned_content,
46
+ }
47
+ jsonl_data.append(entry)
48
+ return jsonl_data
49
+
50
+
51
+ # Load the CSV file
52
+ data = pd.read_csv("data/tai.csv")
53
+
54
+ # Convert the dataframe to JSONL format with token counting and conditions
55
+ jsonl_data_with_conditions = convert_to_jsonl_with_conditions(data)
56
+
57
+ # Save the output to a new JSONL file using json.dumps to ensure proper escaping
58
+ output_path = "data/tai_blog_data_conditions.jsonl"
59
+ with open(output_path, "w") as f:
60
+ for entry in jsonl_data_with_conditions:
61
+ f.write(json.dumps(entry) + "\n")
data/scraping_scripts/github_to_markdown_ai_docs.py ADDED
@@ -0,0 +1,231 @@
1
+ """
2
+ Fetch Markdown files from specified GitHub repositories.
3
+
4
+ This script fetches Markdown (.md), MDX (.mdx), and Jupyter Notebook (.ipynb) files
5
+ from specified GitHub repositories, particularly focusing on documentation sources
6
+ for various AI and machine learning libraries.
7
+
8
+ Key features:
9
+ 1. Configurable for multiple documentation sources (e.g., Hugging Face Transformers, PEFT, TRL)
10
+ 2. Command-line interface for specifying one or more sources to process
11
+ 3. Automatic conversion of Jupyter Notebooks to Markdown
12
+ 4. Rate limiting handling to comply with GitHub API restrictions
13
+ 5. Retry mechanism for resilience against network issues
14
+
15
+ Usage:
16
+ python github_to_markdown_ai_docs.py <source1> [<source2> ...]
17
+
18
+ Where <sourceN> is one of the predefined sources in SOURCE_CONFIGS (e.g., 'transformers', 'peft', 'trl').
19
+
20
+ Example:
21
+ python github_to_markdown_ai_docs.py trl peft
22
+
23
+ This will download and process the documentation files for both TRL and PEFT libraries.
24
+
25
+ Note:
26
+ - Ensure you have set the GITHUB_TOKEN variable with your GitHub Personal Access Token.
27
+ - The script creates a 'data' directory in the current working directory to store the downloaded files.
28
+ - Each source's files are stored in a subdirectory named '<repo>_md_files'.
29
+
30
+ """
31
+
32
+ import argparse
33
+ import json
34
+ import os
35
+ import random
36
+ import time
37
+ from typing import Dict, List
38
+
39
+ import nbformat
40
+ import requests
41
+ from dotenv import load_dotenv
42
+ from nbconvert import MarkdownExporter
43
+
44
+ load_dotenv()
45
+
46
+ # Configuration for different sources
47
+ SOURCE_CONFIGS = {
48
+ "transformers": {
49
+ "owner": "huggingface",
50
+ "repo": "transformers",
51
+ "path": "docs/source/en",
52
+ },
53
+ "peft": {
54
+ "owner": "huggingface",
55
+ "repo": "peft",
56
+ "path": "docs/source",
57
+ },
58
+ "trl": {
59
+ "owner": "huggingface",
60
+ "repo": "trl",
61
+ "path": "docs/source",
62
+ },
63
+ "llama_index": {
64
+ "owner": "run-llama",
65
+ "repo": "llama_index",
66
+ "path": "docs/docs",
67
+ },
68
+ "openai_cookbooks": {
69
+ "owner": "openai",
70
+ "repo": "openai-cookbook",
71
+ "path": "examples",
72
+ },
73
+ "langchain": {
74
+ "owner": "langchain-ai",
75
+ "repo": "langchain",
76
+ "path": "docs/docs",
77
+ },
78
+ }
79
+
80
+ # GitHub Personal Access Token (read from the GITHUB_TOKEN environment variable)
81
+ GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
82
+
83
+ # Headers for authenticated requests
84
+ HEADERS = {
85
+ "Authorization": f"token {GITHUB_TOKEN}",
86
+ "Accept": "application/vnd.github.v3+json",
87
+ }
88
+
89
+ # Maximum number of retries
90
+ MAX_RETRIES = 5
91
+
92
+
93
+ def check_rate_limit():
94
+ rate_limit_url = "https://api.github.com/rate_limit"
95
+ response = requests.get(rate_limit_url, headers=HEADERS)
96
+ data = response.json()
97
+ remaining = data["resources"]["core"]["remaining"]
98
+ reset_time = data["resources"]["core"]["reset"]
99
+
100
+ if remaining < 10: # Adjust this threshold as needed
101
+ wait_time = reset_time - time.time()
102
+ print(f"Rate limit nearly exceeded. Waiting for {wait_time:.2f} seconds.")
103
+ time.sleep(wait_time + 1) # Add 1 second buffer
104
+
105
+
106
+ def get_files_in_directory(api_url: str, retries: int = 0) -> List[Dict]:
107
+ try:
108
+ check_rate_limit()
109
+ response = requests.get(api_url, headers=HEADERS)
110
+ response.raise_for_status()
111
+ return response.json()
112
+ except requests.exceptions.RequestException as e:
113
+ if retries < MAX_RETRIES:
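+ # exponential backoff with random jitter before retrying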
114
+ wait_time = (2**retries) + random.random()
115
+ print(
116
+ f"Error fetching directory contents: {e}. Retrying in {wait_time:.2f} seconds..."
117
+ )
118
+ time.sleep(wait_time)
119
+ return get_files_in_directory(api_url, retries + 1)
120
+ else:
121
+ print(
122
+ f"Failed to fetch directory contents after {MAX_RETRIES} retries: {e}"
123
+ )
124
+ return []
125
+
126
+
127
+ def download_file(file_url: str, file_path: str, retries: int = 0):
128
+ try:
129
+ check_rate_limit()
130
+ response = requests.get(file_url, headers=HEADERS)
131
+ response.raise_for_status()
132
+ with open(file_path, "wb") as file:
133
+ file.write(response.content)
134
+ except requests.exceptions.RequestException as e:
135
+ if retries < MAX_RETRIES:
136
+ wait_time = (2**retries) + random.random()
137
+ print(
138
+ f"Error downloading file: {e}. Retrying in {wait_time:.2f} seconds..."
139
+ )
140
+ time.sleep(wait_time)
141
+ download_file(file_url, file_path, retries + 1)
142
+ else:
143
+ print(f"Failed to download file after {MAX_RETRIES} retries: {e}")
144
+
145
+ # def convert_ipynb_to_md(ipynb_path: str, md_path: str):
146
+ # with open(ipynb_path, "r", encoding="utf-8") as f:
147
+ # notebook = nbformat.read(f, as_version=4)
148
+
149
+ # exporter = MarkdownExporter()
150
+ # markdown, _ = exporter.from_notebook_node(notebook)
151
+
152
+ # with open(md_path, "w", encoding="utf-8") as f:
153
+ # f.write(markdown)
154
+
155
+
156
+ def convert_ipynb_to_md(ipynb_path: str, md_path: str):
157
+ try:
158
+ with open(ipynb_path, "r", encoding="utf-8") as f:
159
+ notebook = nbformat.read(f, as_version=4)
160
+
161
+ exporter = MarkdownExporter()
162
+ markdown, _ = exporter.from_notebook_node(notebook)
163
+
164
+ with open(md_path, "w", encoding="utf-8") as f:
165
+ f.write(markdown)
166
+ except (json.JSONDecodeError, nbformat.reader.NotJSONError) as e:
167
+ print(f"Error converting notebook {ipynb_path}: {str(e)}")
168
+ print("Skipping this file and continuing with others...")
169
+ except Exception as e:
170
+ print(f"Unexpected error converting notebook {ipynb_path}: {str(e)}")
171
+ print("Skipping this file and continuing with others...")
172
+
173
+
174
+ def fetch_files(api_url: str, local_dir: str):
175
+ files = get_files_in_directory(api_url)
176
+ for file in files:
177
+ if file["type"] == "file" and file["name"].endswith((".md", ".mdx", ".ipynb")):
178
+ file_url = file["download_url"]
179
+ file_name = file["name"]
180
+ file_path = os.path.join(local_dir, file_name)
181
+ print(f"Downloading {file_name}...")
182
+ download_file(file_url, file_path)
183
+
184
+ if file_name.endswith(".ipynb"):
185
+ md_file_name = file_name.replace(".ipynb", ".md")
186
+ md_file_path = os.path.join(local_dir, md_file_name)
187
+ print(f"Converting {file_name} to markdown...")
188
+ convert_ipynb_to_md(file_path, md_file_path)
189
+ os.remove(file_path) # Remove the .ipynb file after conversion
190
+ elif file["type"] == "dir":
191
+ subdir = os.path.join(local_dir, file["name"])
192
+ os.makedirs(subdir, exist_ok=True)
193
+ fetch_files(file["url"], subdir)
194
+
195
+
196
+ def process_source(source: str):
197
+ if source not in SOURCE_CONFIGS:
198
+ print(
199
+ f"Error: Unknown source '{source}'. Available sources: {', '.join(SOURCE_CONFIGS.keys())}"
200
+ )
201
+ return
202
+
203
+ config = SOURCE_CONFIGS[source]
204
+ api_url = f"https://api.github.com/repos/{config['owner']}/{config['repo']}/contents/{config['path']}"
205
+ local_dir = f"data/{config['repo']}_md_files"
206
+ os.makedirs(local_dir, exist_ok=True)
207
+
208
+ print(f"Processing source: {source}")
209
+ fetch_files(api_url, local_dir)
210
+ print(f"Finished processing {source}")
211
+
212
+
213
+ def main(sources: List[str]):
214
+ for source in sources:
215
+ process_source(source)
216
+ print("All specified sources have been processed.")
217
+
218
+
219
+ if __name__ == "__main__":
220
+ parser = argparse.ArgumentParser(
221
+ description="Fetch Markdown files from specified GitHub repositories."
222
+ )
223
+ parser.add_argument(
224
+ "sources",
225
+ nargs="+",
226
+ choices=SOURCE_CONFIGS.keys(),
227
+ help="Specify one or more sources to process",
228
+ )
229
+ args = parser.parse_args()
230
+
231
+ main(args.sources)
data/scraping_scripts/process_md_files.py ADDED
@@ -0,0 +1,370 @@
1
+ """
2
+ Markdown Document Processor for Documentation Sources
3
+
4
+ This script processes Markdown (.md) and MDX (.mdx) files from various documentation sources
5
+ (such as Hugging Face Transformers, PEFT, TRL, LlamaIndex, and OpenAI Cookbook) and converts
6
+ them into a standardized JSONL format for further processing or indexing.
7
+
8
+ Key features:
9
+ 1. Configurable for multiple documentation sources
10
+ 2. Extracts titles, generates URLs, and counts tokens for each document
11
+ 3. Supports inclusion/exclusion of specific directories and root files
12
+ 4. Removes copyright headers from content
13
+ 5. Generates a unique ID for each document
14
+ 6. Determines if a whole document should be retrieved based on token count
15
+ 7. Handles special cases like openai-cookbook repo by adding .ipynb extensions
16
+ 8. Processes multiple sources in a single run
17
+
18
+ Usage:
19
+ python process_md_files.py <source1> <source2> ...
20
+
21
+ Where <source1>, <source2>, etc. are one or more of the predefined sources in SOURCE_CONFIGS
22
+ (e.g., 'transformers', 'llama_index', 'openai_cookbooks').
23
+
24
+ The script processes all Markdown files in the specified input directories (and their subdirectories),
25
+ applies the configured filters, and saves the results in JSONL files. Each line in the output
26
+ files represents a single document with metadata and content.
27
+
28
+ To add or modify sources, update the SOURCE_CONFIGS dictionary at the top of the script.
29
+ """
30
+
31
+ import argparse
32
+ import json
33
+ import logging
34
+ import os
35
+ import re
36
+ import uuid
37
+ from typing import Dict, List
38
+
39
+ import tiktoken
40
+
41
+ logging.basicConfig(level=logging.INFO)
42
+ logger = logging.getLogger(__name__)
43
+
44
+ # Configuration for different sources
45
+ SOURCE_CONFIGS = {
46
+ "transformers": {
47
+ "base_url": "https://huggingface.co/docs/transformers/",
48
+ "input_directory": "data/transformers_md_files",
49
+ "output_file": "data/transformers_data.jsonl",
50
+ "source_name": "transformers",
51
+ "use_include_list": False,
52
+ "included_dirs": [],
53
+ "excluded_dirs": ["internal", "main_classes"],
54
+ "excluded_root_files": [],
55
+ "included_root_files": [],
56
+ "url_extension": "",
57
+ },
58
+ "peft": {
59
+ "base_url": "https://huggingface.co/docs/peft/",
60
+ "input_directory": "data/peft_md_files",
61
+ "output_file": "data/peft_data.jsonl",
62
+ "source_name": "peft",
63
+ "use_include_list": False,
64
+ "included_dirs": [],
65
+ "excluded_dirs": [],
66
+ "excluded_root_files": [],
67
+ "included_root_files": [],
68
+ "url_extension": "",
69
+ },
70
+ "trl": {
71
+ "base_url": "https://huggingface.co/docs/trl/",
72
+ "input_directory": "data/trl_md_files",
73
+ "output_file": "data/trl_data.jsonl",
74
+ "source_name": "trl",
75
+ "use_include_list": False,
76
+ "included_dirs": [],
77
+ "excluded_dirs": [],
78
+ "excluded_root_files": [],
79
+ "included_root_files": [],
80
+ "url_extension": "",
81
+ },
82
+ "llama_index": {
83
+ "base_url": "https://docs.llamaindex.ai/en/stable/",
84
+ "input_directory": "data/llama_index_md_files",
85
+ "output_file": "data/llama_index_data.jsonl",
86
+ "source_name": "llama_index",
87
+ "use_include_list": True,
88
+ "included_dirs": [
89
+ "getting_started",
90
+ "understanding",
91
+ "use_cases",
92
+ "examples",
93
+ "module_guides",
94
+ "optimizing",
95
+ ],
96
+ "excluded_dirs": [],
97
+ "excluded_root_files": [],
98
+ "included_root_files": ["index.md"],
99
+ "url_extension": "",
100
+ },
101
+ "openai_cookbooks": {
102
+ "base_url": "https://github.com/openai/openai-cookbook/blob/main/examples/",
103
+ "input_directory": "data/openai-cookbook_md_files",
104
+ "output_file": "data/openai_cookbooks_data.jsonl",
105
+ "source_name": "openai_cookbooks",
106
+ "use_include_list": False,
107
+ "included_dirs": [],
108
+ "excluded_dirs": [],
109
+ "excluded_root_files": [],
110
+ "included_root_files": [],
111
+ "url_extension": ".ipynb",
112
+ },
113
+ "langchain": {
114
+ "base_url": "https://python.langchain.com/docs/",
115
+ "input_directory": "data/langchain_md_files",
116
+ "output_file": "data/langchain_data.jsonl",
117
+ "source_name": "langchain",
118
+ "use_include_list": True,
119
+ "included_dirs": ["how_to", "versions", "tutorials", "integrations"],
120
+ "excluded_dirs": [],
121
+ "excluded_root_files": [],
122
+ "included_root_files": ["security.md", "concepts.mdx", "introduction.mdx"],
123
+ "url_extension": "",
124
+ },
125
+ "tai_blog": {
126
+ "base_url": "",
127
+ "input_directory": "",
128
+ "output_file": "data/tai_blog_data.jsonl",
129
+ "source_name": "tai_blog",
130
+ "use_include_list": False,
131
+ "included_dirs": [],
132
+ "excluded_dirs": [],
133
+ "excluded_root_files": [],
134
+ "included_root_files": [],
135
+ "url_extension": "",
136
+ },
137
+ "8-hour_primer": {
138
+ "base_url": "",
139
+ "input_directory": "data/8-hour_primer", # Path to the directory that contains the Markdown files
140
+ "output_file": "data/8-hour_primer_data.jsonl", # 8-hour Generative AI Primer
141
+ "source_name": "8-hour_primer",
142
+ "use_include_list": False,
143
+ "included_dirs": [],
144
+ "excluded_dirs": [],
145
+ "excluded_root_files": [],
146
+ "included_root_files": [],
147
+ "url_extension": "",
148
+ },
149
+ "llm_developer": {
150
+ "base_url": "",
151
+ "input_directory": "data/llm_developer", # Path to the directory that contains the Markdown files
152
+ "output_file": "data/llm_developer_data.jsonl", # From Beginner to Advanced LLM Developer
153
+ "source_name": "llm_developer",
154
+ "use_include_list": False,
155
+ "included_dirs": [],
156
+ "excluded_dirs": [],
157
+ "excluded_root_files": [],
158
+ "included_root_files": [],
159
+ "url_extension": "",
160
+ },
161
+ "python_primer": {
162
+ "base_url": "",
163
+ "input_directory": "data/python_primer", # Path to the directory that contains the Markdown files
164
+ "output_file": "data/python_primer_data.jsonl", # Python Primer course
165
+ "source_name": "python_primer",
166
+ "use_include_list": False,
167
+ "included_dirs": [],
168
+ "excluded_dirs": [],
169
+ "excluded_root_files": [],
170
+ "included_root_files": [],
171
+ "url_extension": "",
172
+ },
173
+ }
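
Every entry in `SOURCE_CONFIGS` carries the same eleven keys. For orientation, a hypothetical new course entry would follow the sketch below; the name and paths are placeholders, not part of the repository:

```python
# Illustrative only: the shape of a SOURCE_CONFIGS entry for a newly exported course.
# "my_new_course" and its paths are placeholders.
my_new_course_config = {
    "base_url": "",                                     # courses have no public docs URL
    "input_directory": "data/my_new_course_md_files",   # directory with the exported Markdown files
    "output_file": "data/my_new_course_data.jsonl",
    "source_name": "my_new_course",
    "use_include_list": False,
    "included_dirs": [],
    "excluded_dirs": [],
    "excluded_root_files": [],
    "included_root_files": [],
    "url_extension": "",
}
print(sorted(my_new_course_config.keys()))
```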
174
+
175
+
176
+ def extract_title(content: str):
177
+ title_match = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
178
+ if title_match:
179
+ return title_match.group(1).strip()
180
+
181
+ lines = content.split("\n")
182
+ for line in lines:
183
+ if line.strip():
184
+ return line.strip()
185
+
186
+ return None
187
+
188
+
189
+ def generate_url(file_path: str, config: Dict) -> str:
190
+ """
191
+ Return an empty string if base_url is empty;
192
+ otherwise return the constructed URL as before.
193
+ """
194
+ if not config["base_url"]:
195
+ return ""
196
+
197
+ path_without_extension = os.path.splitext(file_path)[0]
198
+ path_with_forward_slashes = path_without_extension.replace("\\", "/")
199
+ return config["base_url"] + path_with_forward_slashes + config["url_extension"]
200
+
201
+
202
+ def should_include_file(file_path: str, config: Dict) -> bool:
203
+ if os.path.dirname(file_path) == "":
204
+ if config["use_include_list"]:
205
+ return os.path.basename(file_path) in config["included_root_files"]
206
+ else:
207
+ return os.path.basename(file_path) not in config["excluded_root_files"]
208
+
209
+ if config["use_include_list"]:
210
+ return any(file_path.startswith(dir) for dir in config["included_dirs"])
211
+ else:
212
+ return not any(file_path.startswith(dir) for dir in config["excluded_dirs"])
213
+
214
+
215
+ def num_tokens_from_string(string: str, encoding_name: str) -> int:
216
+ encoding = tiktoken.get_encoding(encoding_name)
217
+ num_tokens = len(encoding.encode(string, disallowed_special=()))
218
+ return num_tokens
219
+
220
+
221
+ def remove_copyright_header(content: str) -> str:
222
+ header_pattern = re.compile(r"<!--Copyright.*?-->\s*", re.DOTALL)
223
+ cleaned_content = header_pattern.sub("", content, count=1)
224
+ return cleaned_content.strip()
225
+
226
+
227
+ def process_md_files(directory: str, config: Dict) -> List[Dict]:
228
+ jsonl_data = []
229
+
230
+ for root, _, files in os.walk(directory):
231
+ for file in files:
232
+ if file.endswith(".md") or file.endswith(".mdx"):
233
+ file_path = os.path.join(root, file)
234
+ relative_path = os.path.relpath(file_path, directory)
235
+
236
+ if should_include_file(relative_path, config):
237
+ with open(file_path, "r", encoding="utf-8") as f:
238
+ content = f.read()
239
+
240
+ title = extract_title(content)
241
+ token_count = num_tokens_from_string(content, "cl100k_base")
242
+
243
+ # Skip very small or extremely large files
244
+ if token_count < 100 or token_count > 200_000:
245
+ logger.info(
246
+ f"Skipping {relative_path} due to token count {token_count}"
247
+ )
248
+ continue
249
+
250
+ cleaned_content = remove_copyright_header(content)
251
+
252
+ json_object = {
253
+ "tokens": token_count,
254
+ "doc_id": str(uuid.uuid4()),
255
+ "name": (title if title else file),
256
+ "url": generate_url(relative_path, config),
257
+ "retrieve_doc": (token_count <= 8000),
258
+ "source": config["source_name"],
259
+ "content": cleaned_content,
260
+ }
261
+
262
+ jsonl_data.append(json_object)
263
+
264
+ return jsonl_data
265
+
266
+
267
+ def save_jsonl(data: List[Dict], output_file: str) -> None:
268
+ with open(output_file, "w", encoding="utf-8") as f:
269
+ for item in data:
270
+ json.dump(item, f, ensure_ascii=False)
271
+ f.write("\n")
272
+
273
+
274
+ def combine_all_sources(sources: List[str]) -> None:
275
+ """
276
+ Combine JSONL files from multiple sources, preserving existing sources not being processed.
277
+
278
+ For example, if sources = ['transformers'], this will:
279
+ 1. Load data from transformers_data.jsonl
280
+ 2. Load data from all other source JSONL files that exist (course files, etc.)
281
+ 3. Combine them all into all_sources_data.jsonl
282
+ """
283
+ all_data = []
284
+ output_file = "data/all_sources_data.jsonl"
285
+
286
+ # Track which sources we're processing
287
+ processed_sources = set()
288
+
289
+ # First, add data from sources we're explicitly processing
290
+ for source in sources:
291
+ if source not in SOURCE_CONFIGS:
292
+ logger.error(f"Unknown source '{source}'. Skipping.")
293
+ continue
294
+
295
+ processed_sources.add(source)
296
+ input_file = SOURCE_CONFIGS[source]["output_file"]
297
+ logger.info(f"Processing updated source: {source} from {input_file}")
298
+
299
+ try:
300
+ source_data = []
301
+ with open(input_file, "r", encoding="utf-8") as f:
302
+ for line in f:
303
+ source_data.append(json.loads(line))
304
+
305
+ logger.info(f"Added {len(source_data)} documents from {source}")
306
+ all_data.extend(source_data)
307
+ except Exception as e:
308
+ logger.error(f"Error loading {input_file}: {e}")
309
+
310
+ # Now add data from all other sources not being processed
311
+ for source_name, config in SOURCE_CONFIGS.items():
312
+ # Skip sources we already processed
313
+ if source_name in processed_sources:
314
+ continue
315
+
316
+ # Try to load the individual source file
317
+ source_file = config["output_file"]
318
+ if os.path.exists(source_file):
319
+ logger.info(f"Preserving existing source: {source_name} from {source_file}")
320
+ try:
321
+ source_data = []
322
+ with open(source_file, "r", encoding="utf-8") as f:
323
+ for line in f:
324
+ source_data.append(json.loads(line))
325
+
326
+ logger.info(f"Preserved {len(source_data)} documents from {source_name}")
327
+ all_data.extend(source_data)
328
+ except Exception as e:
329
+ logger.error(f"Error loading {source_file}: {e}")
330
+
331
+ logger.info(f"Total documents combined: {len(all_data)}")
332
+ save_jsonl(all_data, output_file)
333
+ logger.info(f"Combined data saved to {output_file}")
334
+
335
+
336
+ def process_source(source: str) -> None:
337
+ if source not in SOURCE_CONFIGS:
338
+ logger.error(f"Unknown source '{source}'. Skipping.")
339
+ return
340
+
341
+ config = SOURCE_CONFIGS[source]
342
+ logger.info(f"\n\nProcessing source: {source}")
343
+ jsonl_data = process_md_files(config["input_directory"], config)
344
+ save_jsonl(jsonl_data, config["output_file"])
345
+ logger.info(
346
+ f"Processed {len(jsonl_data)} files and saved to {config['output_file']}"
347
+ )
348
+
349
+
350
+ def main(sources: List[str]) -> None:
351
+ for source in sources:
352
+ process_source(source)
353
+
354
+ if len(sources) > 1:
355
+ combine_all_sources(sources)
356
+
357
+
358
+ if __name__ == "__main__":
359
+ parser = argparse.ArgumentParser(
360
+ description="Process Markdown files from specified sources."
361
+ )
362
+ parser.add_argument(
363
+ "sources",
364
+ nargs="+",
365
+ choices=SOURCE_CONFIGS.keys(),
366
+ help="Specify one or more sources to process",
367
+ )
368
+ args = parser.parse_args()
369
+
370
+ main(args.sources)
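
Each record written by `process_md_files` carries the fields built in `json_object` above. A minimal sketch for inspecting one of the generated files, assuming the `trl` source has already been processed:

```python
import json

# Assumes `python data/scraping_scripts/process_md_files.py trl` has been run,
# so data/trl_data.jsonl exists.
with open("data/trl_data.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

# Fields produced per document: tokens, doc_id, name, url, retrieve_doc, source, content.
print(record["name"], record["tokens"], record["retrieve_doc"])
print(record["url"] or "(no URL - this source has an empty base_url)")
```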
data/scraping_scripts/update_docs_workflow.py ADDED
@@ -0,0 +1,409 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ AI Tutor App - Documentation Update Workflow
4
+
5
+ This script automates the process of updating documentation from GitHub repositories:
6
+ 1. Download documentation from GitHub using the API
7
+ 2. Process markdown files to create JSONL data
8
+ 3. Add contextual information to document nodes
9
+ 4. Create vector stores
10
+ 5. Upload databases to HuggingFace
11
+
12
+ This workflow is specific to updating library documentation (Transformers, PEFT, LlamaIndex, etc.).
13
+ For adding courses, use the add_course_workflow.py script instead.
14
+
15
+ Usage:
16
+ python update_docs_workflow.py --sources [SOURCE1] [SOURCE2] ...
17
+
18
+ Additional flags to run specific steps (if you want to restart from a specific point):
19
+ --skip-download Skip the GitHub download step
20
+ --skip-process Skip the markdown processing step
21
+ --process-all-context Regenerate context for all documents (default: only new content is processed)
22
+ --skip-context Skip the context addition step entirely
23
+ --skip-vectors Skip vector store creation
24
+ --skip-upload Skip uploading to HuggingFace
+ --skip-data-upload Skip uploading data files (.jsonl/.pkl) to the private HuggingFace repo
25
+ """
26
+
27
+ import argparse
28
+ import json
29
+ import logging
30
+ import os
31
+ import pickle
32
+ import subprocess
33
+ import sys
34
+ from typing import Dict, List, Set
35
+
36
+ from dotenv import load_dotenv
37
+ from huggingface_hub import HfApi, hf_hub_download
38
+
39
+ # Load environment variables from .env file
40
+ load_dotenv()
41
+
42
+ # Configure logging
43
+ logging.basicConfig(
44
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
45
+ )
46
+ logger = logging.getLogger(__name__)
47
+
48
+
49
+ def ensure_required_files_exist():
50
+ """Download required data files from HuggingFace if they don't exist locally."""
51
+ # List of files to check and download
52
+ required_files = {
53
+ # Critical files
54
+ "data/all_sources_data.jsonl": "all_sources_data.jsonl",
55
+ "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",
56
+ # Documentation source files
57
+ "data/transformers_data.jsonl": "transformers_data.jsonl",
58
+ "data/peft_data.jsonl": "peft_data.jsonl",
59
+ "data/trl_data.jsonl": "trl_data.jsonl",
60
+ "data/llama_index_data.jsonl": "llama_index_data.jsonl",
61
+ "data/langchain_data.jsonl": "langchain_data.jsonl",
62
+ "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",
63
+ # Course files
64
+ "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
65
+ "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
66
+ "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
67
+ "data/python_primer_data.jsonl": "python_primer_data.jsonl",
68
+ }
69
+
70
+ # Critical files that must be downloaded
71
+ critical_files = [
72
+ "data/all_sources_data.jsonl",
73
+ "data/all_sources_contextual_nodes.pkl",
74
+ ]
75
+
76
+ # Check and download each file
77
+ for local_path, remote_filename in required_files.items():
78
+ if not os.path.exists(local_path):
79
+ logger.info(
80
+ f"{remote_filename} not found. Attempting to download from HuggingFace..."
81
+ )
82
+ try:
83
+ hf_hub_download(
84
+ token=os.getenv("HF_TOKEN"),
85
+ repo_id="towardsai-tutors/ai-tutor-data",
86
+ filename=remote_filename,
87
+ repo_type="dataset",
88
+ local_dir="data",
89
+ )
90
+ logger.info(
91
+ f"Successfully downloaded {remote_filename} from HuggingFace"
92
+ )
93
+ except Exception as e:
94
+ logger.warning(f"Could not download {remote_filename}: {e}")
95
+
96
+ # Only create empty file for all_sources_data.jsonl if it's missing
97
+ if local_path == "data/all_sources_data.jsonl":
98
+ logger.warning(
99
+ "Creating a new all_sources_data.jsonl file. This will not include previously existing data."
100
+ )
101
+ with open(local_path, "w") as f:
102
+ pass
103
+
104
+ # If critical file is missing, print a more serious warning
105
+ if local_path in critical_files:
106
+ logger.warning(
107
+ f"Critical file {remote_filename} is missing. The workflow may not function correctly."
108
+ )
109
+
110
+ if local_path == "data/all_sources_contextual_nodes.pkl":
111
+ logger.warning(
112
+ "The context addition step will process all documents since no existing contexts were found."
113
+ )
114
+
115
+
116
+ # Documentation sources that can be updated via GitHub API
117
+ GITHUB_SOURCES = [
118
+ "transformers",
119
+ "peft",
120
+ "trl",
121
+ "llama_index",
122
+ "openai_cookbooks",
123
+ "langchain",
124
+ ]
125
+
126
+
127
+ def load_jsonl(file_path: str) -> List[Dict]:
128
+ """Load data from a JSONL file."""
129
+ data = []
130
+ with open(file_path, "r", encoding="utf-8") as f:
131
+ for line in f:
132
+ data.append(json.loads(line))
133
+ return data
134
+
135
+
136
+ def save_jsonl(data: List[Dict], file_path: str) -> None:
137
+ """Save data to a JSONL file."""
138
+ with open(file_path, "w", encoding="utf-8") as f:
139
+ for item in data:
140
+ json.dump(item, f, ensure_ascii=False)
141
+ f.write("\n")
142
+
143
+
144
+ def download_from_github(sources: List[str]) -> None:
145
+ """Download documentation from GitHub repositories."""
146
+ logger.info(f"Downloading documentation from GitHub for sources: {sources}")
147
+
148
+ for source in sources:
149
+ if source not in GITHUB_SOURCES:
150
+ logger.warning(f"Source {source} is not a GitHub source, skipping download")
151
+ continue
152
+
153
+ logger.info(f"Downloading {source} documentation")
154
+ cmd = ["python", "data/scraping_scripts/github_to_markdown_ai_docs.py", source]
155
+ result = subprocess.run(cmd)
156
+
157
+ if result.returncode != 0:
158
+ logger.error(
159
+ f"Error downloading {source} documentation - check output above"
160
+ )
161
+ # Continue with other sources instead of exiting
162
+ continue
163
+
164
+ logger.info(f"Successfully downloaded {source} documentation")
165
+
166
+
167
+ def process_markdown_files(sources: List[str]) -> None:
168
+ """Process markdown files for specific sources."""
169
+ logger.info(f"Processing markdown files for sources: {sources}")
170
+
171
+ cmd = ["python", "data/scraping_scripts/process_md_files.py"] + sources
172
+ result = subprocess.run(cmd)
173
+
174
+ if result.returncode != 0:
175
+ logger.error(f"Error processing markdown files - check output above")
176
+ sys.exit(1)
177
+
178
+ logger.info(f"Successfully processed markdown files")
179
+
180
+
181
+ def get_processed_doc_ids() -> Set[str]:
182
+ """Get set of doc_ids that have already been processed with context."""
183
+ if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
184
+ return set()
185
+
186
+ try:
187
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
188
+ nodes = pickle.load(f)
189
+ return {node.source_node.node_id for node in nodes}
190
+ except Exception as e:
191
+ logger.error(f"Error loading processed doc_ids: {e}")
192
+ return set()
193
+
194
+
195
+ def add_context_to_nodes(new_only: bool = False) -> None:
196
+ """Add context to document nodes, optionally processing only new content."""
197
+ logger.info("Adding context to document nodes")
198
+
199
+ if new_only:
200
+ # Load all documents
201
+ all_docs = load_jsonl("data/all_sources_data.jsonl")
202
+ processed_ids = get_processed_doc_ids()
203
+
204
+ # Filter for unprocessed documents
205
+ new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]
206
+
207
+ if not new_docs:
208
+ logger.info("No new documents to process")
209
+ return
210
+
211
+ # Save temporary JSONL with only new documents
212
+ temp_file = "data/new_docs_temp.jsonl"
213
+ save_jsonl(new_docs, temp_file)
214
+
215
+ # Temporarily modify the add_context_to_nodes.py script to use the temp file
216
+ cmd = [
217
+ "python",
218
+ "-c",
219
+ f"""
220
+ import asyncio
221
+ import os
222
+ import pickle
223
+ import json
224
+ from data.scraping_scripts.add_context_to_nodes import create_docs, process
225
+
226
+ async def main():
227
+ # First, get the list of sources being updated from the temp file
228
+ updated_sources = set()
229
+ with open("{temp_file}", "r") as f:
230
+ for line in f:
231
+ data = json.loads(line)
232
+ updated_sources.add(data["source"])
233
+
234
+ print(f"Updating nodes for sources: {{updated_sources}}")
235
+
236
+ # Process new documents
237
+ documents = create_docs("{temp_file}")
238
+ enhanced_nodes = await process(documents)
239
+ print(f"Generated context for {{len(enhanced_nodes)}} new nodes")
240
+
241
+ # Load existing nodes if they exist
242
+ existing_nodes = []
243
+ if os.path.exists("data/all_sources_contextual_nodes.pkl"):
244
+ with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
245
+ existing_nodes = pickle.load(f)
246
+
247
+ # Filter out existing nodes for sources we're updating
248
+ filtered_nodes = []
249
+ removed_count = 0
250
+
251
+ for node in existing_nodes:
252
+ # Try to extract source from node metadata
253
+ try:
254
+ source = None
255
+ if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
256
+ source = node.source_node.metadata.get("source")
257
+ elif hasattr(node, 'metadata'):
258
+ source = node.metadata.get("source")
259
+
260
+ if source not in updated_sources:
261
+ filtered_nodes.append(node)
262
+ else:
263
+ removed_count += 1
264
+ except Exception:
265
+ # Keep nodes where we can't determine the source
266
+ filtered_nodes.append(node)
267
+
268
+ print(f"Removed {{removed_count}} existing nodes for updated sources")
269
+ existing_nodes = filtered_nodes
270
+
271
+ # Combine filtered existing nodes with new nodes
272
+ all_nodes = existing_nodes + enhanced_nodes
273
+
274
+ # Save all nodes
275
+ with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
276
+ pickle.dump(all_nodes, f)
277
+
278
+ print(f"Total nodes in updated file: {{len(all_nodes)}}")
279
+
280
+ asyncio.run(main())
281
+ """,
282
+ ]
283
+ else:
284
+ # Process all documents
285
+ logger.info("Adding context to all nodes")
286
+ cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]
287
+
288
+ result = subprocess.run(cmd)
289
+
290
+ if result.returncode != 0:
291
+ logger.error(f"Error adding context to nodes - check output above")
292
+ sys.exit(1)
293
+
294
+ logger.info("Successfully added context to nodes")
295
+
296
+ # Clean up temp file if it exists
297
+ if new_only and os.path.exists("data/new_docs_temp.jsonl"):
298
+ os.remove("data/new_docs_temp.jsonl")
299
+
300
+
301
+ def create_vector_stores() -> None:
302
+ """Create vector stores from processed documents."""
303
+ logger.info("Creating vector stores")
304
+ cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
305
+ result = subprocess.run(cmd)
306
+
307
+ if result.returncode != 0:
308
+ logger.error(f"Error creating vector stores - check output above")
309
+ sys.exit(1)
310
+
311
+ logger.info("Successfully created vector stores")
312
+
313
+
314
+ def upload_to_huggingface(upload_jsonl: bool = False) -> None:
315
+ """Upload databases to HuggingFace."""
316
+ logger.info("Uploading databases to HuggingFace")
317
+ cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
318
+ result = subprocess.run(cmd)
319
+
320
+ if result.returncode != 0:
321
+ logger.error(f"Error uploading databases - check output above")
322
+ sys.exit(1)
323
+
324
+ logger.info("Successfully uploaded databases to HuggingFace")
325
+
326
+ if upload_jsonl:
327
+ logger.info("Uploading data files to HuggingFace")
328
+
329
+ try:
330
+ # Note: This uses a separate private repository
331
+ cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
332
+ result = subprocess.run(cmd)
333
+
334
+ if result.returncode != 0:
335
+ logger.error(f"Error uploading data files - check output above")
336
+ sys.exit(1)
337
+
338
+ logger.info("Successfully uploaded data files to HuggingFace")
339
+ except Exception as e:
340
+ logger.error(f"Error uploading JSONL file: {e}")
341
+ sys.exit(1)
342
+
343
+
344
+ def main():
345
+ parser = argparse.ArgumentParser(
346
+ description="AI Tutor App Documentation Update Workflow"
347
+ )
348
+ parser.add_argument(
349
+ "--sources",
350
+ nargs="+",
351
+ choices=GITHUB_SOURCES,
352
+ default=GITHUB_SOURCES,
353
+ help="GitHub documentation sources to update",
354
+ )
355
+ parser.add_argument(
356
+ "--skip-download", action="store_true", help="Skip downloading from GitHub"
357
+ )
358
+ parser.add_argument(
359
+ "--skip-process", action="store_true", help="Skip processing markdown files"
360
+ )
361
+ parser.add_argument(
362
+ "--process-all-context",
363
+ action="store_true",
364
+ help="Process all content when adding context (default: only process new content)",
365
+ )
366
+ parser.add_argument(
367
+ "--skip-context",
368
+ action="store_true",
369
+ help="Skip the context addition step entirely",
370
+ )
371
+ parser.add_argument(
372
+ "--skip-vectors", action="store_true", help="Skip vector store creation"
373
+ )
374
+ parser.add_argument(
375
+ "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
376
+ )
377
+ parser.add_argument(
378
+ "--skip-data-upload",
379
+ action="store_true",
380
+ help="Skip uploading data files (.jsonl and .pkl) to private HuggingFace repo (they are uploaded by default)",
381
+ )
382
+
383
+ args = parser.parse_args()
384
+
385
+ # Ensure required data files exist before proceeding
386
+ ensure_required_files_exist()
387
+
388
+ # Execute the workflow steps
389
+ if not args.skip_download:
390
+ download_from_github(args.sources)
391
+
392
+ if not args.skip_process:
393
+ process_markdown_files(args.sources)
394
+
395
+ if not args.skip_context:
396
+ add_context_to_nodes(not args.process_all_context)
397
+
398
+ if not args.skip_vectors:
399
+ create_vector_stores()
400
+
401
+ if not args.skip_upload:
402
+ # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
403
+ upload_to_huggingface(not args.skip_data_upload)
404
+
405
+ logger.info("Documentation update workflow completed successfully")
406
+
407
+
408
+ if __name__ == "__main__":
409
+ main()
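
The incremental path (`new_only=True`) hinges on comparing `doc_id` values in `all_sources_data.jsonl` against the node IDs stored in the pickle. A minimal sketch of that same check, assuming both files are present locally:

```python
import json
import pickle

# Mirrors get_processed_doc_ids() plus the filter in add_context_to_nodes(new_only=True).
with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
    nodes = pickle.load(f)
processed_ids = {node.source_node.node_id for node in nodes}

with open("data/all_sources_data.jsonl", "r", encoding="utf-8") as f:
    all_docs = [json.loads(line) for line in f]

new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]
print(f"{len(new_docs)} of {len(all_docs)} documents still need contextual nodes")
```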
data/scraping_scripts/upload_data_to_hf.py ADDED
@@ -0,0 +1,129 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ Upload Data Files to HuggingFace
4
+
5
+ This script uploads key data files to a private HuggingFace dataset repository:
6
+ 1. all_sources_data.jsonl - The raw document data
7
+ 2. all_sources_contextual_nodes.pkl - The processed nodes with added context
+ 3. The individual per-source and course JSONL files (when they exist locally)
8
+
9
+ This is useful for new team members who need the latest version of the data.
10
+
11
+ Usage:
12
+ python upload_data_to_hf.py [--repo REPO_ID]
13
+
14
+ Arguments:
15
+ --repo REPO_ID HuggingFace dataset repository ID (default: towardsai-tutors/ai-tutor-data)
16
+ """
17
+
18
+ import argparse
19
+ import os
20
+
21
+ from dotenv import load_dotenv
22
+ from huggingface_hub import HfApi
23
+
24
+ load_dotenv()
25
+
26
+
27
+ def upload_files_to_huggingface(repo_id="towardsai-tutors/ai-tutor-data"):
28
+ """Upload data files to a private HuggingFace repository."""
29
+ # Main files to upload
30
+ files_to_upload = [
31
+ # Combined data and vector store
32
+ "data/all_sources_data.jsonl",
33
+ "data/all_sources_contextual_nodes.pkl",
34
+ # Individual source files
35
+ "data/transformers_data.jsonl",
36
+ "data/peft_data.jsonl",
37
+ "data/trl_data.jsonl",
38
+ "data/llama_index_data.jsonl",
39
+ "data/langchain_data.jsonl",
40
+ "data/openai_cookbooks_data.jsonl",
41
+ # Course files
42
+ "data/tai_blog_data.jsonl",
43
+ "data/8-hour_primer_data.jsonl",
44
+ "data/llm_developer_data.jsonl",
45
+ "data/python_primer_data.jsonl",
46
+ ]
47
+
48
+ # Filter to only include files that exist
49
+ existing_files = []
50
+ missing_files = []
51
+
52
+ for file_path in files_to_upload:
53
+ if os.path.exists(file_path):
54
+ existing_files.append(file_path)
55
+ else:
56
+ missing_files.append(file_path)
57
+
58
+ # Critical files must exist
59
+ critical_files = [
60
+ "data/all_sources_data.jsonl",
61
+ "data/all_sources_contextual_nodes.pkl",
62
+ ]
63
+ critical_missing = [f for f in critical_files if f in missing_files]
64
+
65
+ if critical_missing:
66
+ print(
67
+ f"Error: The following critical files were not found: {', '.join(critical_missing)}"
68
+ )
69
+ # return False
70
+
71
+ if missing_files:
72
+ print(
73
+ f"Warning: The following files were not found and will not be uploaded: {', '.join(missing_files)}"
74
+ )
75
+ print("This is normal if you're only updating certain sources.")
76
+
77
+ try:
78
+ api = HfApi(token=os.getenv("HF_TOKEN"))
79
+
80
+ # Check if repository exists, create if it doesn't
81
+ try:
82
+ api.repo_info(repo_id=repo_id, repo_type="dataset")
83
+ print(f"Repository {repo_id} exists")
84
+ except Exception:
85
+ print(
86
+ f"Repository {repo_id} doesn't exist. Please create it first on the HuggingFace platform."
87
+ )
88
+ print("Make sure to set it as private if needed.")
89
+ return False
90
+
91
+ # Upload all existing files
92
+ for file_path in existing_files:
93
+ try:
94
+ file_name = os.path.basename(file_path)
95
+ print(f"Uploading {file_name}...")
96
+
97
+ api.upload_file(
98
+ path_or_fileobj=file_path,
99
+ path_in_repo=file_name,
100
+ repo_id=repo_id,
101
+ repo_type="dataset",
102
+ )
103
+ print(
104
+ f"Successfully uploaded {file_name} to HuggingFace repository {repo_id}"
105
+ )
106
+ except Exception as e:
107
+ print(f"Error uploading {file_path}: {e}")
108
+ # Continue with other files even if one fails
109
+
110
+ return True
111
+ except Exception as e:
112
+ print(f"Error uploading files: {e}")
113
+ return False
114
+
115
+
116
+ def main():
117
+ parser = argparse.ArgumentParser(description="Upload Data Files to HuggingFace")
118
+ parser.add_argument(
119
+ "--repo",
120
+ default="towardsai-tutors/ai-tutor-data",
121
+ help="HuggingFace dataset repository ID",
122
+ )
123
+
124
+ args = parser.parse_args()
125
+ upload_files_to_huggingface(args.repo)
126
+
127
+
128
+ if __name__ == "__main__":
129
+ main()
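
Team members who only need to pull the latest data (rather than upload it) can fetch individual files back from the same private repository. This sketch mirrors the `hf_hub_download` call used in `update_docs_workflow.py`:

```python
import os

from dotenv import load_dotenv
from huggingface_hub import hf_hub_download

load_dotenv()

# Pull the combined JSONL from the private dataset repo targeted by upload_data_to_hf.py.
hf_hub_download(
    token=os.getenv("HF_TOKEN"),
    repo_id="towardsai-tutors/ai-tutor-data",
    filename="all_sources_data.jsonl",
    repo_type="dataset",
    local_dir="data",
)
```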
data/scraping_scripts/upload_dbs_to_hf.py ADDED
@@ -0,0 +1,38 @@
1
+ """
2
+ Hugging Face Data Upload Script
3
+
4
+ Purpose:
5
+ This script uploads a local folder to a Hugging Face dataset repository. It's designed to
6
+ update or create a dataset on the Hugging Face Hub by uploading the contents of a specified
7
+ local folder.
8
+
9
+ Usage:
10
+ - Run the script: python data/scraping_scripts/upload_dbs_to_hf.py
11
+
12
+ The script will:
13
+ - Upload the contents of the 'data' folder to the specified Hugging Face dataset repository.
14
+ - https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
15
+
16
+ Configuration:
17
+ - The script is set to upload to the "towardsai-tutors/ai-tutor-vector-db" dataset repository.
18
+ - It deletes all existing files in the repository before uploading (delete_patterns=["*"]) and skips local files matching ignore_patterns (e.g. *.jsonl, *.py, *.md).
19
+ """
20
+
21
+ import os
22
+
23
+ from dotenv import load_dotenv
24
+ from huggingface_hub import HfApi
25
+
26
+ load_dotenv()
27
+
28
+ api = HfApi(token=os.getenv("HF_TOKEN"))
29
+
30
+ api.upload_folder(
31
+ folder_path="data",
32
+ repo_id="towardsai-tutors/ai-tutor-vector-db",
33
+ repo_type="dataset",
34
+ # multi_commits=True,
35
+ # multi_commits_verbose=True,
36
+ delete_patterns=["*"],
37
+ ignore_patterns=["*.jsonl", "*.py", "*.txt", "*.ipynb", "*.md", "*.pyc"],
38
+ )
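
To pull the uploaded vector stores down on another machine, the whole dataset repository can be mirrored locally. A minimal sketch using `huggingface_hub.snapshot_download` (not part of these scripts):

```python
import os

from dotenv import load_dotenv
from huggingface_hub import snapshot_download

load_dotenv()

# Mirror the vector-store dataset repo into the local `data` folder.
snapshot_download(
    repo_id="towardsai-tutors/ai-tutor-vector-db",
    repo_type="dataset",
    local_dir="data",
    token=os.getenv("HF_TOKEN"),
)
```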