Add files via upload
- data/scraping_scripts/README.md +104 -0
- data/scraping_scripts/add_context_to_nodes.py +196 -0
- data/scraping_scripts/add_course_workflow.py +541 -0
- data/scraping_scripts/create_vector_stores.py +218 -0
- data/scraping_scripts/csv_to_jsonl.py +61 -0
- data/scraping_scripts/github_to_markdown_ai_docs.py +231 -0
- data/scraping_scripts/process_md_files.py +370 -0
- data/scraping_scripts/update_docs_workflow.py +409 -0
- data/scraping_scripts/upload_data_to_hf.py +129 -0
- data/scraping_scripts/upload_dbs_to_hf.py +38 -0
data/scraping_scripts/README.md
ADDED
@@ -0,0 +1,104 @@
# AI Tutor App Data Workflows

This directory contains scripts for managing the AI Tutor App's data pipeline.

## Workflow Scripts

### 1. Adding a New Course

To add a new course to the AI Tutor:

```bash
python add_course_workflow.py --course [COURSE_NAME]
```

This will guide you through the complete process:

1. Process markdown files from Notion exports
2. Prompt you to manually add URLs to the course content
3. Merge the course data into the main dataset
4. Add contextual information to document nodes
5. Create vector stores
6. Upload databases to HuggingFace
7. Update UI configuration

**Requirements before running:**

- The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS` (see the example entry after this list)
- Course markdown files must be placed in the directory specified in the configuration
- You must have access to the live course platform to add URLs
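A minimal sketch of what a `SOURCE_CONFIGS` entry for a new course might look like. Only the `output_file` key is confirmed by the workflow script (`add_course_workflow.py` reads it); the other key and its name are illustrative assumptions about how `process_md_files.py` locates the Notion export:

```python
# Hypothetical SOURCE_CONFIGS entry in data/scraping_scripts/process_md_files.py.
# "output_file" is the key read by add_course_workflow.py; the other key is an
# assumption standing in for however the script locates the markdown files.
SOURCE_CONFIGS = {
    # ... existing sources ...
    "new_course": {
        "base_dir": "data/new_course_md_files",       # assumed: Notion export directory
        "output_file": "data/new_course_data.jsonl",  # JSONL produced for this course
    },
}
```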
### 2. Updating Documentation via GitHub API

To update library documentation from GitHub repositories:

```bash
python update_docs_workflow.py
```

This will update all supported documentation sources. You can also specify specific sources:

```bash
python update_docs_workflow.py --sources transformers peft
```

The workflow includes:

1. Downloading documentation from GitHub using the API
2. Processing markdown files to create JSONL data
3. Adding contextual information to document nodes
4. Creating vector stores
5. Uploading databases to HuggingFace

### 3. Uploading JSONL to HuggingFace

To upload the main JSONL file to a private HuggingFace repository:

```bash
python upload_data_to_hf.py
```

This is useful for sharing the latest data with team members (a sketch of the underlying upload call follows).
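For reference, the upload step amounts to pushing the JSONL (and the contextual-nodes pickle) to a private dataset repo. A minimal sketch using `huggingface_hub`; the repo id mirrors the one the workflows download from and is an assumption here, not a guaranteed match for what `upload_data_to_hf.py` actually does:

```python
# Sketch: upload the main data files to a private HuggingFace dataset repo.
# The repo_id is assumed from the download step in add_course_workflow.py.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
for filename in ("all_sources_data.jsonl", "all_sources_contextual_nodes.pkl"):
    api.upload_file(
        path_or_fileobj=f"data/{filename}",
        path_in_repo=filename,
        repo_id="towardsai-tutors/ai-tutor-data",
        repo_type="dataset",
    )
```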
## Individual Components

If you need to run specific steps individually (the sketch after this list shows the order the workflows run them in):

- **GitHub to Markdown**: `github_to_markdown_ai_docs.py`
- **Process Markdown**: `process_md_files.py`
- **Add Context**: `add_context_to_nodes.py`
- **Create Vector Stores**: `create_vector_stores.py`
- **Upload to HuggingFace**: `upload_dbs_to_hf.py`
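A minimal sketch of chaining those components by hand, in the order the workflow scripts invoke them via `subprocess` (the `transformers` argument is just an example source):

```python
# Run the individual components in the order the workflow scripts use them.
import subprocess

steps = [
    ["python", "data/scraping_scripts/github_to_markdown_ai_docs.py", "transformers"],
    ["python", "data/scraping_scripts/process_md_files.py", "transformers"],
    ["python", "data/scraping_scripts/add_context_to_nodes.py"],
    ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"],
    ["python", "data/scraping_scripts/upload_dbs_to_hf.py"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop at the first step that fails
```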
## Tips for New Team Members

1. To update the AI Tutor with new content:
   - For new courses, use `add_course_workflow.py`
   - For updated documentation, use `update_docs_workflow.py`

2. When adding URLs to course content:
   - Get the URLs from the live course platform
   - Add them to the generated JSONL file in the `url` field
   - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
   - Make sure every document has a valid URL (the check sketched below can help)
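A quick way to confirm nothing is missing before moving on; this mirrors the check `add_course_workflow.py` runs, and the file path is just an example course output:

```python
# List documents whose "url" field is still empty in a generated course JSONL.
import json

missing = 0
with open("data/python_primer_data.jsonl", "r", encoding="utf-8") as f:  # example path
    for line in f:
        doc = json.loads(line)
        if not doc.get("url"):
            missing += 1
            print(f"Missing URL: {doc['name']} ({doc['doc_id']})")

print(f"{missing} documents still need URLs")
```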
3. By default, only new content will have context added, to save time and resources. Use `--process-all-context` only if you need to regenerate context for all documents. Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default).

4. When adding a new course, verify that it appears in the Gradio UI (a sketch of the resulting config edits follows this list):
   - The workflow automatically updates `main.py` and `setup.py` to include the new source
   - Check that the new source appears in the dropdown menu in the UI
   - Make sure it's properly included in the default selected sources
   - Restart the Gradio app to see the changes
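Roughly what the automatic edit produces. The appended entries match what `update_ui_files` in `add_course_workflow.py` writes; the surrounding list contents and the `python_primer` course key are illustrative:

```python
# scripts/setup.py (sketch) - the last entry in each list is the one appended
AVAILABLE_SOURCES_UI = [
    # ... existing display names ...
    "Python Primer",        # display name derived from the course key
]
AVAILABLE_SOURCES = [
    # ... existing source keys ...
    "python_primer",        # the course key from SOURCE_CONFIGS
]

# scripts/main.py (sketch) - display name mapped to the source key
source_mapping = {
    # ... existing entries ...
    "Python Primer": "python_primer",
}
```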
5. First time setup or missing files:
   - Both workflows automatically check for and download required data files:
     - `all_sources_data.jsonl` - Contains the raw document data
     - `all_sources_contextual_nodes.pkl` - Contains the processed nodes with added context
   - If the PKL file exists, the `--new-context-only` flag will only process new content
   - You must have proper HuggingFace credentials with access to the private repository

6. Make sure you have the required environment variables set (a quick check is sketched below):
   - `OPENAI_API_KEY` for LLM processing
   - `COHERE_API_KEY` for embeddings
   - `HF_TOKEN` for HuggingFace uploads
   - `GITHUB_TOKEN` for accessing documentation via the GitHub API
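A minimal sanity check before kicking off a workflow, assuming the variables live in a `.env` file as the scripts expect:

```python
# Verify the environment variables the pipeline relies on are set.
import os
from dotenv import load_dotenv

load_dotenv()

for var in ("OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"):
    status = "set" if os.getenv(var) else "MISSING"
    print(f"{var}: {status}")
```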
data/scraping_scripts/add_context_to_nodes.py
ADDED
@@ -0,0 +1,196 @@
import asyncio
import json
import pdb
import pickle
from typing import Dict, List

import instructor
import logfire
import tiktoken
from anthropic import AsyncAnthropic
from dotenv import load_dotenv
from jinja2 import Template
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential
from tqdm.asyncio import tqdm

load_dotenv(".env")

# logfire.configure()


def create_docs(input_file: str) -> List[Document]:
    with open(input_file, "r") as f:
        documents: list[Document] = []
        for line in f:
            data = json.loads(line)
            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "retrieve_doc": data["retrieve_doc"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[
                        "title",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[
                        "url",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                )
            )
    return documents


class SituatedContext(BaseModel):
    title: str = Field(..., description="The title of the document.")
    context: str = Field(
        ..., description="The context to situate the chunk within the document."
    )


# client = AsyncInstructor(
#     client=AsyncAnthropic(),
#     create=patch(
#         create=AsyncAnthropic().beta.prompt_caching.messages.create,
#         mode=Mode.ANTHROPIC_TOOLS,
#     ),
#     mode=Mode.ANTHROPIC_TOOLS,
# )
aclient = AsyncOpenAI()
# logfire.instrument_openai(aclient)
client: instructor.AsyncInstructor = instructor.from_openai(aclient)


@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=4, max=10))
async def situate_context(doc: str, chunk: str) -> str:
    template = Template(
        """
<document>
{{ doc }}
</document>

Here is the chunk we want to situate within the whole document above:

<chunk>
{{ chunk }}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
    )

    content = template.render(doc=doc, chunk=chunk)

    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=1000,
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": content,
            }
        ],
        response_model=SituatedContext,
    )
    return response.context


async def process_chunk(node: TextNode, document_dict: dict) -> TextNode:
    doc_id: str = node.source_node.node_id  # type: ignore
    doc: Document = document_dict[doc_id]

    if doc.metadata["tokens"] > 120_000:
        # Tokenize the document text
        encoding = tiktoken.encoding_for_model("gpt-4o-mini")
        tokens = encoding.encode(doc.get_content())

        # Trim to 120,000 tokens
        trimmed_tokens = tokens[:120_000]

        # Decode back to text
        trimmed_text = encoding.decode(trimmed_tokens)

        # Update the document with trimmed text
        doc = Document(text=trimmed_text, metadata=doc.metadata)
        doc.metadata["tokens"] = 120_000

    context: str = await situate_context(doc.get_content(), node.text)
    node.text = f"{node.text}\n\n{context}"
    return node


async def process(
    documents: List[Document], semaphore_limit: int = 50
) -> List[TextNode]:

    # From the document, we create chunks
    pipeline = IngestionPipeline(
        transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
    )
    all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
    print(f"Number of nodes: {len(all_nodes)}")

    document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}

    semaphore = asyncio.Semaphore(semaphore_limit)

    async def process_with_semaphore(node):
        async with semaphore:
            result = await process_chunk(node, document_dict)
            await asyncio.sleep(0.1)
            return result

    tasks = [process_with_semaphore(node) for node in all_nodes]

    results: List[TextNode] = await tqdm.gather(*tasks, desc="Processing chunks")

    # pdb.set_trace()

    return results


async def main():
    documents: List[Document] = create_docs("data/all_sources_data.jsonl")
    enhanced_nodes: List[TextNode] = await process(documents)

    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(enhanced_nodes, f)

    # pipeline = IngestionPipeline(
    #     transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)]
    # )
    # all_nodes: list[TextNode] = pipeline.run(documents=documents, show_progress=True)
    # print(all_nodes[7933])
    # pdb.set_trace()

    with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
        enhanced_nodes: list[TextNode] = pickle.load(f)

    for i, node in enumerate(enhanced_nodes):
        print(f"Chunk {i + 1}:")
        print(f"Node: {node}")
        print(f"Text: {node.text}")
        # pdb.set_trace()
        break


if __name__ == "__main__":
    asyncio.run(main())
data/scraping_scripts/add_course_workflow.py
ADDED
@@ -0,0 +1,541 @@
#!/usr/bin/env python
"""
AI Tutor App - Course Addition Workflow

This script guides you through the complete process of adding a new course to the AI Tutor App:

1. Process course markdown files to create JSONL data
2. MANDATORY MANUAL STEP: Add URLs to course content in the generated JSONL
3. Merge course JSONL into all_sources_data.jsonl
4. Add contextual information to document nodes
5. Create vector stores
6. Upload databases to HuggingFace
7. Update UI configuration

Usage:
    python add_course_workflow.py --course [COURSE_NAME]

Additional flags to run specific steps (if you want to restart from a specific point):
    --skip-process-md      Skip the markdown processing step
    --skip-merge           Skip merging into all_sources_data.jsonl
    --process-all-context  Process all content when adding context (default: only new content)
    --skip-context         Skip the context addition step entirely
    --skip-vectors         Skip vector store creation
    --skip-upload          Skip uploading to HuggingFace
    --skip-data-upload     Skip uploading data files to the private HuggingFace repo
    --skip-ui-update       Skip updating the UI configuration
"""

import argparse
import json
import logging
import os
import pickle
import subprocess
import sys
import time
from pathlib import Path
from typing import Dict, List, Set

from dotenv import load_dotenv
from huggingface_hub import HfApi, hf_hub_download

# Load environment variables from .env file
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def ensure_required_files_exist():
    """Download required data files from HuggingFace if they don't exist locally."""
    # List of files to check and download
    required_files = {
        # Critical files
        "data/all_sources_data.jsonl": "all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",

        # Documentation source files
        "data/transformers_data.jsonl": "transformers_data.jsonl",
        "data/peft_data.jsonl": "peft_data.jsonl",
        "data/trl_data.jsonl": "trl_data.jsonl",
        "data/llama_index_data.jsonl": "llama_index_data.jsonl",
        "data/langchain_data.jsonl": "langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",

        # Course files
        "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
        "data/python_primer_data.jsonl": "python_primer_data.jsonl"
    }

    # Critical files that must be downloaded
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl"
    ]

    # Check and download each file
    for local_path, remote_filename in required_files.items():
        if not os.path.exists(local_path):
            logger.info(f"{remote_filename} not found. Attempting to download from HuggingFace...")
            try:
                hf_hub_download(
                    token=os.getenv("HF_TOKEN"),
                    repo_id="towardsai-tutors/ai-tutor-data",
                    filename=remote_filename,
                    repo_type="dataset",
                    local_dir="data",
                )
                logger.info(f"Successfully downloaded {remote_filename} from HuggingFace")
            except Exception as e:
                logger.warning(f"Could not download {remote_filename}: {e}")

                # Only create empty file for all_sources_data.jsonl if it's missing
                if local_path == "data/all_sources_data.jsonl":
                    logger.warning("Creating a new all_sources_data.jsonl file. This will not include previously existing data.")
                    with open(local_path, "w") as f:
                        pass

                # If critical file is missing, print a more serious warning
                if local_path in critical_files:
                    logger.warning(f"Critical file {remote_filename} is missing. The workflow may not function correctly.")

                    if local_path == "data/all_sources_contextual_nodes.pkl":
                        logger.warning("The context addition step will process all documents since no existing contexts were found.")


def load_jsonl(file_path: str) -> List[Dict]:
    """Load data from a JSONL file."""
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return data


def save_jsonl(data: List[Dict], file_path: str) -> None:
    """Save data to a JSONL file."""
    with open(file_path, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def process_markdown_files(course_name: str) -> str:
    """Process markdown files for a specific course. Returns path to output JSONL."""
    logger.info(f"Processing markdown files for course: {course_name}")
    cmd = ["python", "data/scraping_scripts/process_md_files.py", course_name]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error processing markdown files - check output above")
        sys.exit(1)

    logger.info(f"Successfully processed markdown files for {course_name}")

    # Determine the output file path from process_md_files.py
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        sys.exit(1)

    output_file = SOURCE_CONFIGS[course_name]["output_file"]
    return output_file


def manual_url_addition(jsonl_path: str) -> None:
    """Guide the user through manually adding URLs to the course JSONL."""
    logger.info(f"=== MANDATORY MANUAL STEP: URL ADDITION ===")
    logger.info(f"Please add the URLs to the course content in: {jsonl_path}")
    logger.info(f"For each document in the JSONL file:")
    logger.info(f"1. Open the file in a text editor")
    logger.info(f"2. Find the empty 'url' field for each document")
    logger.info(f"3. Add the appropriate URL from the live course platform")
    logger.info(f"   Example URL format: https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure")
    logger.info(f"4. Save the file when done")

    # Check if URLs are present
    data = load_jsonl(jsonl_path)
    missing_urls = sum(1 for item in data if not item.get("url"))

    if missing_urls > 0:
        logger.warning(f"Found {missing_urls} documents without URLs in {jsonl_path}")

        answer = input(
            f"\n{missing_urls} documents are missing URLs. Have you added all the URLs? (yes/no): "
        )
        if answer.lower() not in ["yes", "y"]:
            logger.info("Please add the URLs and run the script again.")
            sys.exit(0)
    else:
        logger.info("All documents have URLs. Continuing with the workflow.")


def merge_into_all_sources(course_jsonl_path: str) -> None:
    """Merge the course JSONL into all_sources_data.jsonl."""
    all_sources_path = "data/all_sources_data.jsonl"
    logger.info(f"Merging {course_jsonl_path} into {all_sources_path}")

    # Load course data
    course_data = load_jsonl(course_jsonl_path)

    # Load existing all_sources data if it exists
    all_data = []
    if os.path.exists(all_sources_path):
        all_data = load_jsonl(all_sources_path)

    # Get doc_ids from existing data
    existing_ids = {item["doc_id"] for item in all_data}

    # Add new course data (avoiding duplicates)
    new_items = 0
    for item in course_data:
        if item["doc_id"] not in existing_ids:
            all_data.append(item)
            existing_ids.add(item["doc_id"])
            new_items += 1

    # Save the combined data
    save_jsonl(all_data, all_sources_path)
    logger.info(f"Added {new_items} new documents to {all_sources_path}")


def get_processed_doc_ids() -> Set[str]:
    """Get set of doc_ids that have already been processed with context."""
    if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
        return set()

    try:
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            nodes = pickle.load(f)
            return {node.source_node.node_id for node in nodes}
    except Exception as e:
        logger.error(f"Error loading processed doc_ids: {e}")
        return set()


def add_context_to_nodes(new_only: bool = False) -> None:
    """Add context to document nodes, optionally processing only new content."""
    logger.info("Adding context to document nodes")

    if new_only:
        # Load all documents
        all_docs = load_jsonl("data/all_sources_data.jsonl")
        processed_ids = get_processed_doc_ids()

        # Filter for unprocessed documents
        new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]

        if not new_docs:
            logger.info("No new documents to process")
            return

        # Save temporary JSONL with only new documents
        temp_file = "data/new_docs_temp.jsonl"
        save_jsonl(new_docs, temp_file)

        # Temporarily modify the add_context_to_nodes.py script to use the temp file
        cmd = [
            "python",
            "-c",
            f"""
import asyncio
import os
import pickle
import json
from data.scraping_scripts.add_context_to_nodes import create_docs, process

async def main():
    # First, get the list of sources being updated from the temp file
    updated_sources = set()
    with open("{temp_file}", "r") as f:
        for line in f:
            data = json.loads(line)
            updated_sources.add(data["source"])

    print(f"Updating nodes for sources: {{updated_sources}}")

    # Process new documents
    documents = create_docs("{temp_file}")
    enhanced_nodes = await process(documents)
    print(f"Generated context for {{len(enhanced_nodes)}} new nodes")

    # Load existing nodes if they exist
    existing_nodes = []
    if os.path.exists("data/all_sources_contextual_nodes.pkl"):
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            existing_nodes = pickle.load(f)

        # Filter out existing nodes for sources we're updating
        filtered_nodes = []
        removed_count = 0

        for node in existing_nodes:
            # Try to extract source from node metadata
            try:
                source = None
                if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
                    source = node.source_node.metadata.get("source")
                elif hasattr(node, 'metadata'):
                    source = node.metadata.get("source")

                if source not in updated_sources:
                    filtered_nodes.append(node)
                else:
                    removed_count += 1
            except Exception:
                # Keep nodes where we can't determine the source
                filtered_nodes.append(node)

        print(f"Removed {{removed_count}} existing nodes for updated sources")
        existing_nodes = filtered_nodes

    # Combine filtered existing nodes with new nodes
    all_nodes = existing_nodes + enhanced_nodes

    # Save all nodes
    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(all_nodes, f)

    print(f"Total nodes in updated file: {{len(all_nodes)}}")

asyncio.run(main())
""",
        ]
    else:
        # Process all documents
        cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]

    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error adding context to nodes - check output above")
        sys.exit(1)

    logger.info("Successfully added context to nodes")

    # Clean up temp file if it exists
    if new_only and os.path.exists("data/new_docs_temp.jsonl"):
        os.remove("data/new_docs_temp.jsonl")


def create_vector_stores() -> None:
    """Create vector stores from processed documents."""
    logger.info("Creating vector stores")
    cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error creating vector stores - check output above")
        sys.exit(1)

    logger.info("Successfully created vector stores")


def upload_to_huggingface(upload_jsonl: bool = False) -> None:
    """Upload databases to HuggingFace."""
    logger.info("Uploading databases to HuggingFace")
    cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error(f"Error uploading databases - check output above")
        sys.exit(1)

    logger.info("Successfully uploaded databases to HuggingFace")

    if upload_jsonl:
        logger.info("Uploading data files to HuggingFace")

        try:
            # Note: This uses a separate private repository
            cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
            result = subprocess.run(cmd)

            if result.returncode != 0:
                logger.error(f"Error uploading data files - check output above")
                sys.exit(1)

            logger.info("Successfully uploaded data files to HuggingFace")
        except Exception as e:
            logger.error(f"Error uploading JSONL file: {e}")
            sys.exit(1)


def update_ui_files(course_name: str) -> None:
    """Update main.py and setup.py with the new source."""
    logger.info(f"Updating UI files with new course: {course_name}")

    # Get the source configuration for display name
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        return

    # Get a readable display name for the UI
    display_name = course_name.replace("_", " ").title()

    # Update setup.py - add to AVAILABLE_SOURCES and AVAILABLE_SOURCES_UI
    setup_path = Path("scripts/setup.py")
    if setup_path.exists():
        setup_content = setup_path.read_text()

        # Check if already added
        if f'"{course_name}"' in setup_content:
            logger.info(f"Course {course_name} already in setup.py")
        else:
            # Add to AVAILABLE_SOURCES_UI
            ui_list_start = setup_content.find("AVAILABLE_SOURCES_UI = [")
            ui_list_end = setup_content.find("]", ui_list_start)
            new_ui_content = (
                setup_content[:ui_list_end]
                + f'    "{display_name}",\n'
                + setup_content[ui_list_end:]
            )

            # Add to AVAILABLE_SOURCES
            sources_list_start = new_ui_content.find("AVAILABLE_SOURCES = [")
            sources_list_end = new_ui_content.find("]", sources_list_start)
            new_content = (
                new_ui_content[:sources_list_end]
                + f'    "{course_name}",\n'
                + new_ui_content[sources_list_end:]
            )

            # Write updated content
            setup_path.write_text(new_content)
            logger.info(f"Updated setup.py with {course_name}")
    else:
        logger.warning(f"setup.py not found at {setup_path}")

    # Update main.py - add to source_mapping
    main_path = Path("scripts/main.py")
    if main_path.exists():
        main_content = main_path.read_text()

        # Check if already added
        if f'"{display_name}": "{course_name}"' in main_content:
            logger.info(f"Course {course_name} already in main.py")
        else:
            # Add to source_mapping
            mapping_start = main_content.find("source_mapping = {")
            mapping_end = main_content.find("}", mapping_start)
            new_main_content = (
                main_content[:mapping_end]
                + f'    "{display_name}": "{course_name}",\n'
                + main_content[mapping_end:]
            )

            # Add to default selected sources if not there
            value_start = new_main_content.find("value=[")
            value_end = new_main_content.find("]", value_start)

            if f'"{display_name}"' not in new_main_content[value_start:value_end]:
                new_main_content = (
                    new_main_content[: value_start + 7]
                    + f'    "{display_name}",\n'
                    + new_main_content[value_start + 7 :]
                )

            # Write updated content
            main_path.write_text(new_main_content)
            logger.info(f"Updated main.py with {course_name}")
    else:
        logger.warning(f"main.py not found at {main_path}")


def main():
    parser = argparse.ArgumentParser(
        description="AI Tutor App Course Addition Workflow"
    )
    parser.add_argument(
        "--course",
        required=True,
        help="Name of the course to process (must match SOURCE_CONFIGS)",
    )
    parser.add_argument(
        "--skip-process-md",
        action="store_true",
        help="Skip the markdown processing step",
    )
    parser.add_argument(
        "--skip-merge",
        action="store_true",
        help="Skip merging into all_sources_data.jsonl",
    )
    parser.add_argument(
        "--process-all-context",
        action="store_true",
        help="Process all content when adding context (default: only process new content)",
    )
    parser.add_argument(
        "--skip-context",
        action="store_true",
        help="Skip the context addition step entirely",
    )
    parser.add_argument(
        "--skip-vectors", action="store_true", help="Skip vector store creation"
    )
    parser.add_argument(
        "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
    )
    parser.add_argument(
        "--skip-ui-update",
        action="store_true",
        help="Skip updating the UI configuration",
    )
    parser.add_argument(
        "--skip-data-upload",
        action="store_true",
        help="Skip uploading data files to private HuggingFace repo (they are uploaded by default)",
    )

    args = parser.parse_args()
    course_name = args.course

    # Ensure required data files exist before proceeding
    ensure_required_files_exist()

    # Get the output file path
    from data.scraping_scripts.process_md_files import SOURCE_CONFIGS

    if course_name not in SOURCE_CONFIGS:
        logger.error(f"Course {course_name} not found in SOURCE_CONFIGS")
        sys.exit(1)

    course_jsonl_path = SOURCE_CONFIGS[course_name]["output_file"]

    # Execute the workflow steps
    if not args.skip_process_md:
        course_jsonl_path = process_markdown_files(course_name)

    # Always do the manual URL addition step for courses
    manual_url_addition(course_jsonl_path)

    if not args.skip_merge:
        merge_into_all_sources(course_jsonl_path)

    if not args.skip_context:
        add_context_to_nodes(not args.process_all_context)

    if not args.skip_vectors:
        create_vector_stores()

    if not args.skip_upload:
        # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
        upload_to_huggingface(not args.skip_data_upload)

    if not args.skip_ui_update:
        update_ui_files(course_name)

    logger.info("Course addition workflow completed successfully")


if __name__ == "__main__":
    main()
data/scraping_scripts/create_vector_stores.py
ADDED
@@ -0,0 +1,218 @@
"""
Vector Store Creation Script

Purpose:
This script processes various data sources (e.g., transformers, peft, trl, llama_index, openai_cookbooks, langchain)
to create vector stores using Chroma and LlamaIndex. It reads data from JSONL files, creates document embeddings,
and stores them in persistent Chroma databases for efficient retrieval.

Usage:
    python script_name.py <source1> <source2> ...

Example:
    python script_name.py transformers peft llama_index

The script accepts one or more source names as command-line arguments. Valid source names are:
transformers, peft, trl, llama_index, openai_cookbooks, langchain

For each specified source, the script will:
1. Read data from the corresponding JSONL file
2. Create document embeddings
3. Store the embeddings in a Chroma vector database
4. Save a dictionary of documents for future reference

Note: Ensure that the input JSONL files are present in the 'data' directory.
"""

import argparse
import json
import os
import pdb
import pickle
import shutil

import chromadb
from dotenv import load_dotenv
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode, TextNode
from llama_index.embeddings.cohere import CohereEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore

load_dotenv()

# Configuration for different sources
SOURCE_CONFIGS = {
    "transformers": {
        "input_file": "data/transformers_data.jsonl",
        "db_name": "chroma-db-transformers",
    },
    "peft": {"input_file": "data/peft_data.jsonl", "db_name": "chroma-db-peft"},
    "trl": {"input_file": "data/trl_data.jsonl", "db_name": "chroma-db-trl"},
    "llama_index": {
        "input_file": "data/llama_index_data.jsonl",
        "db_name": "chroma-db-llama_index",
    },
    "openai_cookbooks": {
        "input_file": "data/openai_cookbooks_data.jsonl",
        "db_name": "chroma-db-openai_cookbooks",
    },
    "langchain": {
        "input_file": "data/langchain_data.jsonl",
        "db_name": "chroma-db-langchain",
    },
    "tai_blog": {
        "input_file": "data/tai_blog_data.jsonl",
        "db_name": "chroma-db-tai_blog",
    },
    "all_sources": {
        "input_file": "data/all_sources_data.jsonl",
        "db_name": "chroma-db-all_sources",
    },
}


def create_docs(input_file: str) -> list[Document]:
    with open(input_file, "r") as f:
        documents = []
        for line in f:
            data = json.loads(line)
            documents.append(
                Document(
                    doc_id=data["doc_id"],
                    text=data["content"],
                    metadata={  # type: ignore
                        "url": data["url"],
                        "title": data["name"],
                        "tokens": data["tokens"],
                        "retrieve_doc": data["retrieve_doc"],
                        "source": data["source"],
                    },
                    excluded_llm_metadata_keys=[  # url is included in LLM context
                        "title",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                    excluded_embed_metadata_keys=[  # title is embedded along the content
                        "url",
                        "tokens",
                        "retrieve_doc",
                        "source",
                    ],
                )
            )
    return documents


def process_source(source: str):
    config = SOURCE_CONFIGS[source]

    input_file = config["input_file"]
    db_name = config["db_name"]
    db_path = f"data/{db_name}"

    print(f"Processing source: {source}")

    documents: list[Document] = create_docs(input_file)
    print(f"Created {len(documents)} documents")

    # Check if the folder exists and delete it
    if os.path.exists(db_path):
        print(f"Existing database found at {db_path}. Deleting...")
        shutil.rmtree(db_path)
        print(f"Deleted existing database at {db_path}")

    # Create Chroma client and collection
    chroma_client = chromadb.PersistentClient(path=f"data/{db_name}")
    chroma_collection = chroma_client.create_collection(db_name)

    # Create vector store and storage context
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Save document dictionary
    document_dict: dict[str, Document] = {doc.doc_id: doc for doc in documents}
    document_dict_file = f"data/{db_name}/document_dict_{source}.pkl"
    with open(document_dict_file, "wb") as f:
        pickle.dump(document_dict, f)
    print(f"Saved document dictionary to {document_dict_file}")

    # Load nodes with context
    with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
        nodes_with_context: list[TextNode] = pickle.load(f)

    print(f"Loaded {len(nodes_with_context)} nodes with context")
    # pdb.set_trace()
    # exit()

    # Create vector store index
    index = VectorStoreIndex(
        nodes=nodes_with_context,
        # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
        embed_model=CohereEmbedding(
            api_key=os.environ["COHERE_API_KEY"],
            model_name="embed-english-v3.0",
            input_type="search_document",
        ),
        show_progress=True,
        use_async=True,
        storage_context=storage_context,
    )
    llm = OpenAI(
        temperature=1,
        model="gpt-4o-mini",
        # model="gpt-4o",
        max_tokens=5000,
        max_retries=3,
    )
    query_engine = index.as_query_engine(llm=llm)
    response = query_engine.query("How to fine-tune an llm?")
    print(response)
    for src in response.source_nodes:
        print("Node ID\t", src.node_id)
        print("Title\t", src.metadata["title"])
        print("Text\t", src.text)
        print("Score\t", src.score)
        print("-_" * 20)

    # # Create vector store index
    # index = VectorStoreIndex.from_documents(
    #     documents,
    #     # embed_model=OpenAIEmbedding(model="text-embedding-3-large", mode="similarity"),
    #     embed_model=CohereEmbedding(
    #         api_key=os.environ["COHERE_API_KEY"],
    #         model_name="embed-english-v3.0",
    #         input_type="search_document",
    #     ),
    #     transformations=[SentenceSplitter(chunk_size=800, chunk_overlap=0)],
    #     show_progress=True,
    #     use_async=True,
    #     storage_context=storage_context,
    # )
    print(f"Created vector store index for {source}")


def main(sources: list[str]):
    for source in sources:
        if source in SOURCE_CONFIGS:
            process_source(source)
        else:
            print(f"Unknown source: {source}. Skipping.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process sources and create vector stores."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
data/scraping_scripts/csv_to_jsonl.py
ADDED
@@ -0,0 +1,61 @@
import json
import uuid

import pandas as pd
import tiktoken


# Function to count tokens using tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(
        encoding.encode(
            string, disallowed_special=(encoding.special_tokens_set - {"<|endoftext|>"})
        )
    )
    return num_tokens


# Function to clean or remove specific content, e.g., copyright headers
def remove_copyright_header(content: str) -> str:
    # Implement any cleaning logic you need here
    return content


# Function to convert DataFrame to JSONL format with token counting
def convert_to_jsonl_with_conditions(df, encoding_name="cl100k_base"):
    jsonl_data = []
    for _, row in df.iterrows():
        token_count = num_tokens_from_string(row["text"], encoding_name)

        # Skip entries based on token count conditions
        if token_count < 100 or token_count > 200_000:
            print(f"Skipping {row['title']} due to token count {token_count}")
            continue

        cleaned_content = remove_copyright_header(row["text"])

        entry = {
            "tokens": token_count,  # Token count using tiktoken
            "doc_id": str(uuid.uuid4()),  # Generate a unique UUID
            "name": row["title"],
            "url": row["tai_url"],
            "retrieve_doc": (token_count <= 8000),  # retrieve_doc condition
            "source": "tai_blog",
            "content": cleaned_content,
        }
        jsonl_data.append(entry)
    return jsonl_data


# Load the CSV file
data = pd.read_csv("data/tai.csv")

# Convert the dataframe to JSONL format with token counting and conditions
jsonl_data_with_conditions = convert_to_jsonl_with_conditions(data)

# Save the output to a new JSONL file using json.dumps to ensure proper escaping
output_path = "data/tai_blog_data_conditions.jsonl"
with open(output_path, "w") as f:
    for entry in jsonl_data_with_conditions:
        f.write(json.dumps(entry) + "\n")
data/scraping_scripts/github_to_markdown_ai_docs.py
ADDED
@@ -0,0 +1,231 @@
1 |
+
"""
|
2 |
+
Fetch Markdown files from specified GitHub repositories.
|
3 |
+
|
4 |
+
This script fetches Markdown (.md), MDX (.mdx), and Jupyter Notebook (.ipynb) files
|
5 |
+
from specified GitHub repositories, particularly focusing on documentation sources
|
6 |
+
for various AI and machine learning libraries.
|
7 |
+
|
8 |
+
Key features:
|
9 |
+
1. Configurable for multiple documentation sources (e.g., Hugging Face Transformers, PEFT, TRL)
|
10 |
+
2. Command-line interface for specifying one or more sources to process
|
11 |
+
3. Automatic conversion of Jupyter Notebooks to Markdown
|
12 |
+
4. Rate limiting handling to comply with GitHub API restrictions
|
13 |
+
5. Retry mechanism for resilience against network issues
|
14 |
+
|
15 |
+
Usage:
|
16 |
+
python github_to_markdown_ai_docs.py <source1> [<source2> ...]
|
17 |
+
|
18 |
+
Where <sourceN> is one of the predefined sources in SOURCE_CONFIGS (e.g., 'transformers', 'peft', 'trl').
|
19 |
+
|
20 |
+
Example:
|
21 |
+
python github_to_markdown_ai_docs.py trl peft
|
22 |
+
|
23 |
+
This will download and process the documentation files for both TRL and PEFT libraries.
|
24 |
+
|
25 |
+
Note:
|
26 |
+
- Ensure you have set the GITHUB_TOKEN variable with your GitHub Personal Access Token.
|
27 |
+
- The script creates a 'data' directory in the current working directory to store the downloaded files.
|
28 |
+
- Each source's files are stored in a subdirectory named '<repo>_md_files'.
|
29 |
+
|
30 |
+
"""
|
31 |
+
|
32 |
+
import argparse
|
33 |
+
import json
|
34 |
+
import os
|
35 |
+
import random
|
36 |
+
import time
|
37 |
+
from typing import Dict, List
|
38 |
+
|
39 |
+
import nbformat
|
40 |
+
import requests
|
41 |
+
from dotenv import load_dotenv
|
42 |
+
from nbconvert import MarkdownExporter
|
43 |
+
|
44 |
+
load_dotenv()
|
45 |
+
|
46 |
+
# Configuration for different sources
|
47 |
+
SOURCE_CONFIGS = {
|
48 |
+
"transformers": {
|
49 |
+
"owner": "huggingface",
|
50 |
+
"repo": "transformers",
|
51 |
+
"path": "docs/source/en",
|
52 |
+
},
|
53 |
+
"peft": {
|
54 |
+
"owner": "huggingface",
|
55 |
+
"repo": "peft",
|
56 |
+
"path": "docs/source",
|
57 |
+
},
|
58 |
+
"trl": {
|
59 |
+
"owner": "huggingface",
|
60 |
+
"repo": "trl",
|
61 |
+
"path": "docs/source",
|
62 |
+
},
|
63 |
+
"llama_index": {
|
64 |
+
"owner": "run-llama",
|
65 |
+
"repo": "llama_index",
|
66 |
+
"path": "docs/docs",
|
67 |
+
},
|
68 |
+
"openai_cookbooks": {
|
69 |
+
"owner": "openai",
|
70 |
+
"repo": "openai-cookbook",
|
71 |
+
"path": "examples",
|
72 |
+
},
|
73 |
+
"langchain": {
|
74 |
+
"owner": "langchain-ai",
|
75 |
+
"repo": "langchain",
|
76 |
+
"path": "docs/docs",
|
77 |
+
},
|
78 |
+
}
|
79 |
+
|
80 |
+
# GitHub Personal Access Token (replace with your own token)
|
81 |
+
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
|
82 |
+
|
83 |
+
# Headers for authenticated requests
|
84 |
+
HEADERS = {
|
85 |
+
"Authorization": f"token {GITHUB_TOKEN}",
|
86 |
+
"Accept": "application/vnd.github.v3+json",
|
87 |
+
}
|
88 |
+
|
89 |
+
# Maximum number of retries
|
90 |
+
MAX_RETRIES = 5
|
91 |
+
|
92 |
+
|
93 |
+
def check_rate_limit():
|
94 |
+
rate_limit_url = "https://api.github.com/rate_limit"
|
95 |
+
response = requests.get(rate_limit_url, headers=HEADERS)
|
96 |
+
data = response.json()
|
97 |
+
remaining = data["resources"]["core"]["remaining"]
|
98 |
+
reset_time = data["resources"]["core"]["reset"]
|
99 |
+
|
100 |
+
if remaining < 10: # Adjust this threshold as needed
|
101 |
+
wait_time = reset_time - time.time()
|
102 |
+
print(f"Rate limit nearly exceeded. Waiting for {wait_time:.2f} seconds.")
|
103 |
+
        time.sleep(wait_time + 1)  # Add 1 second buffer


def get_files_in_directory(api_url: str, retries: int = 0) -> List[Dict]:
    try:
        check_rate_limit()
        response = requests.get(api_url, headers=HEADERS)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        if retries < MAX_RETRIES:
            wait_time = (2**retries) + random.random()
            print(
                f"Error fetching directory contents: {e}. Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
            return get_files_in_directory(api_url, retries + 1)
        else:
            print(
                f"Failed to fetch directory contents after {MAX_RETRIES} retries: {e}"
            )
            return []


def download_file(file_url: str, file_path: str, retries: int = 0):
    try:
        check_rate_limit()
        response = requests.get(file_url, headers=HEADERS)
        response.raise_for_status()
        with open(file_path, "wb") as file:
            file.write(response.content)
    except requests.exceptions.RequestException as e:
        if retries < MAX_RETRIES:
            wait_time = (2**retries) + random.random()
            print(
                f"Error downloading file: {e}. Retrying in {wait_time:.2f} seconds..."
            )
            time.sleep(wait_time)
            download_file(file_url, file_path, retries + 1)
        else:
            print(f"Failed to download file after {MAX_RETRIES} retries: {e}")


# def convert_ipynb_to_md(ipynb_path: str, md_path: str):
#     with open(ipynb_path, "r", encoding="utf-8") as f:
#         notebook = nbformat.read(f, as_version=4)

#     exporter = MarkdownExporter()
#     markdown, _ = exporter.from_notebook_node(notebook)

#     with open(md_path, "w", encoding="utf-8") as f:
#         f.write(markdown)


def convert_ipynb_to_md(ipynb_path: str, md_path: str):
    try:
        with open(ipynb_path, "r", encoding="utf-8") as f:
            notebook = nbformat.read(f, as_version=4)

        exporter = MarkdownExporter()
        markdown, _ = exporter.from_notebook_node(notebook)

        with open(md_path, "w", encoding="utf-8") as f:
            f.write(markdown)
    except (json.JSONDecodeError, nbformat.reader.NotJSONError) as e:
        print(f"Error converting notebook {ipynb_path}: {str(e)}")
        print("Skipping this file and continuing with others...")
    except Exception as e:
        print(f"Unexpected error converting notebook {ipynb_path}: {str(e)}")
        print("Skipping this file and continuing with others...")


def fetch_files(api_url: str, local_dir: str):
    files = get_files_in_directory(api_url)
    for file in files:
        if file["type"] == "file" and file["name"].endswith((".md", ".mdx", ".ipynb")):
            file_url = file["download_url"]
            file_name = file["name"]
            file_path = os.path.join(local_dir, file_name)
            print(f"Downloading {file_name}...")
            download_file(file_url, file_path)

            if file_name.endswith(".ipynb"):
                md_file_name = file_name.replace(".ipynb", ".md")
                md_file_path = os.path.join(local_dir, md_file_name)
                print(f"Converting {file_name} to markdown...")
                convert_ipynb_to_md(file_path, md_file_path)
                os.remove(file_path)  # Remove the .ipynb file after conversion
        elif file["type"] == "dir":
            subdir = os.path.join(local_dir, file["name"])
            os.makedirs(subdir, exist_ok=True)
            fetch_files(file["url"], subdir)


def process_source(source: str):
    if source not in SOURCE_CONFIGS:
        print(
            f"Error: Unknown source '{source}'. Available sources: {', '.join(SOURCE_CONFIGS.keys())}"
        )
        return

    config = SOURCE_CONFIGS[source]
    api_url = f"https://api.github.com/repos/{config['owner']}/{config['repo']}/contents/{config['path']}"
    local_dir = f"data/{config['repo']}_md_files"
    os.makedirs(local_dir, exist_ok=True)

    print(f"Processing source: {source}")
    fetch_files(api_url, local_dir)
    print(f"Finished processing {source}")


def main(sources: List[str]):
    for source in sources:
        process_source(source)
    print("All specified sources have been processed.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Fetch Markdown files from specified GitHub repositories."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
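As a usage sketch (not part of the committed script): with a GitHub token configured in `.env` and the repository root on `PYTHONPATH`, the downloader can also be driven programmatically instead of via the CLI. The source name below is only an example and must exist in this script's `SOURCE_CONFIGS`.

```python
# Sketch: download one source's docs programmatically (assumes the GitHub token env var is set).
from data.scraping_scripts.github_to_markdown_ai_docs import process_source

process_source("transformers")  # writes Markdown files under data/<repo>_md_files/
```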
data/scraping_scripts/process_md_files.py
ADDED
@@ -0,0 +1,370 @@
"""
Markdown Document Processor for Documentation Sources

This script processes Markdown (.md) and MDX (.mdx) files from various documentation sources
(such as Hugging Face Transformers, PEFT, TRL, LlamaIndex, and OpenAI Cookbook) and converts
them into a standardized JSONL format for further processing or indexing.

Key features:
1. Configurable for multiple documentation sources
2. Extracts titles, generates URLs, and counts tokens for each document
3. Supports inclusion/exclusion of specific directories and root files
4. Removes copyright headers from content
5. Generates a unique ID for each document
6. Determines if a whole document should be retrieved based on token count
7. Handles special cases like openai-cookbook repo by adding .ipynb extensions
8. Processes multiple sources in a single run

Usage:
    python process_md_files.py <source1> <source2> ...

Where <source1>, <source2>, etc. are one or more of the predefined sources in SOURCE_CONFIGS
(e.g., 'transformers', 'llama_index', 'openai_cookbooks').

The script processes all Markdown files in the specified input directories (and their subdirectories),
applies the configured filters, and saves the results in JSONL files. Each line in the output
files represents a single document with metadata and content.

To add or modify sources, update the SOURCE_CONFIGS dictionary at the top of the script.
"""

import argparse
import json
import logging
import os
import re
import uuid
from typing import Dict, List

import tiktoken

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration for different sources
SOURCE_CONFIGS = {
    "transformers": {
        "base_url": "https://huggingface.co/docs/transformers/",
        "input_directory": "data/transformers_md_files",
        "output_file": "data/transformers_data.jsonl",
        "source_name": "transformers",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": ["internal", "main_classes"],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "peft": {
        "base_url": "https://huggingface.co/docs/peft/",
        "input_directory": "data/peft_md_files",
        "output_file": "data/peft_data.jsonl",
        "source_name": "peft",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "trl": {
        "base_url": "https://huggingface.co/docs/trl/",
        "input_directory": "data/trl_md_files",
        "output_file": "data/trl_data.jsonl",
        "source_name": "trl",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "llama_index": {
        "base_url": "https://docs.llamaindex.ai/en/stable/",
        "input_directory": "data/llama_index_md_files",
        "output_file": "data/llama_index_data.jsonl",
        "source_name": "llama_index",
        "use_include_list": True,
        "included_dirs": [
            "getting_started",
            "understanding",
            "use_cases",
            "examples",
            "module_guides",
            "optimizing",
        ],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": ["index.md"],
        "url_extension": "",
    },
    "openai_cookbooks": {
        "base_url": "https://github.com/openai/openai-cookbook/blob/main/examples/",
        "input_directory": "data/openai-cookbook_md_files",
        "output_file": "data/openai_cookbooks_data.jsonl",
        "source_name": "openai_cookbooks",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": ".ipynb",
    },
    "langchain": {
        "base_url": "https://python.langchain.com/docs/",
        "input_directory": "data/langchain_md_files",
        "output_file": "data/langchain_data.jsonl",
        "source_name": "langchain",
        "use_include_list": True,
        "included_dirs": ["how_to", "versions", "tutorials", "integrations"],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": ["security.md", "concepts.mdx", "introduction.mdx"],
        "url_extension": "",
    },
    "tai_blog": {
        "base_url": "",
        "input_directory": "",
        "output_file": "data/tai_blog_data.jsonl",
        "source_name": "tai_blog",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "8-hour_primer": {
        "base_url": "",
        "input_directory": "data/8-hour_primer",  # Path to the directory that contains the Markdown files
        "output_file": "data/8-hour_primer_data.jsonl",  # 8-hour Generative AI Primer
        "source_name": "8-hour_primer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "llm_developer": {
        "base_url": "",
        "input_directory": "data/llm_developer",  # Path to the directory that contains the Markdown files
        "output_file": "data/llm_developer_data.jsonl",  # From Beginner to Advanced LLM Developer
        "source_name": "llm_developer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
    "python_primer": {
        "base_url": "",
        "input_directory": "data/python_primer",  # Path to the directory that contains the Markdown files
        "output_file": "data/python_primer_data.jsonl",  # Python Primer course
        "source_name": "python_primer",
        "use_include_list": False,
        "included_dirs": [],
        "excluded_dirs": [],
        "excluded_root_files": [],
        "included_root_files": [],
        "url_extension": "",
    },
}


def extract_title(content: str):
    title_match = re.search(r"^#\s+(.+)$", content, re.MULTILINE)
    if title_match:
        return title_match.group(1).strip()

    lines = content.split("\n")
    for line in lines:
        if line.strip():
            return line.strip()

    return None


def generate_url(file_path: str, config: Dict) -> str:
    """
    Return an empty string if base_url is empty;
    otherwise return the constructed URL as before.
    """
    if not config["base_url"]:
        return ""

    path_without_extension = os.path.splitext(file_path)[0]
    path_with_forward_slashes = path_without_extension.replace("\\", "/")
    return config["base_url"] + path_with_forward_slashes + config["url_extension"]


def should_include_file(file_path: str, config: Dict) -> bool:
    if os.path.dirname(file_path) == "":
        if config["use_include_list"]:
            return os.path.basename(file_path) in config["included_root_files"]
        else:
            return os.path.basename(file_path) not in config["excluded_root_files"]

    if config["use_include_list"]:
        return any(file_path.startswith(dir) for dir in config["included_dirs"])
    else:
        return not any(file_path.startswith(dir) for dir in config["excluded_dirs"])


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens


def remove_copyright_header(content: str) -> str:
    header_pattern = re.compile(r"<!--Copyright.*?-->\s*", re.DOTALL)
    cleaned_content = header_pattern.sub("", content, count=1)
    return cleaned_content.strip()


def process_md_files(directory: str, config: Dict) -> List[Dict]:
    jsonl_data = []

    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".md") or file.endswith(".mdx"):
                file_path = os.path.join(root, file)
                relative_path = os.path.relpath(file_path, directory)

                if should_include_file(relative_path, config):
                    with open(file_path, "r", encoding="utf-8") as f:
                        content = f.read()

                    title = extract_title(content)
                    token_count = num_tokens_from_string(content, "cl100k_base")

                    # Skip very small or extremely large files
                    if token_count < 100 or token_count > 200_000:
                        logger.info(
                            f"Skipping {relative_path} due to token count {token_count}"
                        )
                        continue

                    cleaned_content = remove_copyright_header(content)

                    json_object = {
                        "tokens": token_count,
                        "doc_id": str(uuid.uuid4()),
                        "name": (title if title else file),
                        "url": generate_url(relative_path, config),
                        "retrieve_doc": (token_count <= 8000),
                        "source": config["source_name"],
                        "content": cleaned_content,
                    }

                    jsonl_data.append(json_object)

    return jsonl_data


def save_jsonl(data: List[Dict], output_file: str) -> None:
    with open(output_file, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def combine_all_sources(sources: List[str]) -> None:
    """
    Combine JSONL files from multiple sources, preserving existing sources not being processed.

    For example, if sources = ['transformers'], this will:
    1. Load data from transformers_data.jsonl
    2. Load data from all other source JSONL files that exist (course files, etc.)
    3. Combine them all into all_sources_data.jsonl
    """
    all_data = []
    output_file = "data/all_sources_data.jsonl"

    # Track which sources we're processing
    processed_sources = set()

    # First, add data from sources we're explicitly processing
    for source in sources:
        if source not in SOURCE_CONFIGS:
            logger.error(f"Unknown source '{source}'. Skipping.")
            continue

        processed_sources.add(source)
        input_file = SOURCE_CONFIGS[source]["output_file"]
        logger.info(f"Processing updated source: {source} from {input_file}")

        try:
            source_data = []
            with open(input_file, "r", encoding="utf-8") as f:
                for line in f:
                    source_data.append(json.loads(line))

            logger.info(f"Added {len(source_data)} documents from {source}")
            all_data.extend(source_data)
        except Exception as e:
            logger.error(f"Error loading {input_file}: {e}")

    # Now add data from all other sources not being processed
    for source_name, config in SOURCE_CONFIGS.items():
        # Skip sources we already processed
        if source_name in processed_sources:
            continue

        # Try to load the individual source file
        source_file = config["output_file"]
        if os.path.exists(source_file):
            logger.info(f"Preserving existing source: {source_name} from {source_file}")
            try:
                source_data = []
                with open(source_file, "r", encoding="utf-8") as f:
                    for line in f:
                        source_data.append(json.loads(line))

                logger.info(f"Preserved {len(source_data)} documents from {source_name}")
                all_data.extend(source_data)
            except Exception as e:
                logger.error(f"Error loading {source_file}: {e}")

    logger.info(f"Total documents combined: {len(all_data)}")
    save_jsonl(all_data, output_file)
    logger.info(f"Combined data saved to {output_file}")


def process_source(source: str) -> None:
    if source not in SOURCE_CONFIGS:
        logger.error(f"Unknown source '{source}'. Skipping.")
        return

    config = SOURCE_CONFIGS[source]
    logger.info(f"\n\nProcessing source: {source}")
    jsonl_data = process_md_files(config["input_directory"], config)
    save_jsonl(jsonl_data, config["output_file"])
    logger.info(
        f"Processed {len(jsonl_data)} files and saved to {config['output_file']}"
    )


def main(sources: List[str]) -> None:
    for source in sources:
        process_source(source)

    if len(sources) > 1:
        combine_all_sources(sources)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process Markdown files from specified sources."
    )
    parser.add_argument(
        "sources",
        nargs="+",
        choices=SOURCE_CONFIGS.keys(),
        help="Specify one or more sources to process",
    )
    args = parser.parse_args()

    main(args.sources)
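For reference, each line in a JSONL output file is the `json_object` dict built above, serialized as JSON. An illustrative record is sketched below; all field values are hypothetical, not taken from a real run.

```python
# Hypothetical example of one output record written by process_md_files:
example_record = {
    "tokens": 1850,                      # tiktoken cl100k_base token count
    "doc_id": "0b1c2d3e-4f56-7890-abcd-ef0123456789",  # random UUID4
    "name": "Quickstart",                # first Markdown H1, or the filename as a fallback
    "url": "https://huggingface.co/docs/transformers/quicktour",  # empty string for course sources
    "retrieve_doc": True,                # True when tokens <= 8000
    "source": "transformers",
    "content": "# Quickstart\n...",      # Markdown body with the copyright header stripped
}
```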
data/scraping_scripts/update_docs_workflow.py
ADDED
@@ -0,0 +1,409 @@
#!/usr/bin/env python
"""
AI Tutor App - Documentation Update Workflow

This script automates the process of updating documentation from GitHub repositories:
1. Download documentation from GitHub using the API
2. Process markdown files to create JSONL data
3. Add contextual information to document nodes
4. Create vector stores
5. Upload databases to HuggingFace

This workflow is specific to updating library documentation (Transformers, PEFT, LlamaIndex, etc.).
For adding courses, use the add_course_workflow.py script instead.

Usage:
    python update_docs_workflow.py --sources [SOURCE1] [SOURCE2] ...

Additional flags to run specific steps (if you want to restart from a specific point):
    --skip-download          Skip the GitHub download step
    --skip-process           Skip the markdown processing step
    --process-all-context    Process all content when adding context (default: only new content)
    --skip-context           Skip the context addition step entirely
    --skip-vectors           Skip vector store creation
    --skip-upload            Skip uploading to HuggingFace
"""

import argparse
import json
import logging
import os
import pickle
import subprocess
import sys
from typing import Dict, List, Set

from dotenv import load_dotenv
from huggingface_hub import HfApi, hf_hub_download

# Load environment variables from .env file
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def ensure_required_files_exist():
    """Download required data files from HuggingFace if they don't exist locally."""
    # List of files to check and download
    required_files = {
        # Critical files
        "data/all_sources_data.jsonl": "all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl": "all_sources_contextual_nodes.pkl",
        # Documentation source files
        "data/transformers_data.jsonl": "transformers_data.jsonl",
        "data/peft_data.jsonl": "peft_data.jsonl",
        "data/trl_data.jsonl": "trl_data.jsonl",
        "data/llama_index_data.jsonl": "llama_index_data.jsonl",
        "data/langchain_data.jsonl": "langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl": "openai_cookbooks_data.jsonl",
        # Course files
        "data/tai_blog_data.jsonl": "tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl": "8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl": "llm_developer_data.jsonl",
        "data/python_primer_data.jsonl": "python_primer_data.jsonl",
    }

    # Critical files that must be downloaded
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
    ]

    # Check and download each file
    for local_path, remote_filename in required_files.items():
        if not os.path.exists(local_path):
            logger.info(
                f"{remote_filename} not found. Attempting to download from HuggingFace..."
            )
            try:
                hf_hub_download(
                    token=os.getenv("HF_TOKEN"),
                    repo_id="towardsai-tutors/ai-tutor-data",
                    filename=remote_filename,
                    repo_type="dataset",
                    local_dir="data",
                )
                logger.info(
                    f"Successfully downloaded {remote_filename} from HuggingFace"
                )
            except Exception as e:
                logger.warning(f"Could not download {remote_filename}: {e}")

                # Only create empty file for all_sources_data.jsonl if it's missing
                if local_path == "data/all_sources_data.jsonl":
                    logger.warning(
                        "Creating a new all_sources_data.jsonl file. This will not include previously existing data."
                    )
                    with open(local_path, "w") as f:
                        pass

                # If critical file is missing, print a more serious warning
                if local_path in critical_files:
                    logger.warning(
                        f"Critical file {remote_filename} is missing. The workflow may not function correctly."
                    )

                    if local_path == "data/all_sources_contextual_nodes.pkl":
                        logger.warning(
                            "The context addition step will process all documents since no existing contexts were found."
                        )


# Documentation sources that can be updated via GitHub API
GITHUB_SOURCES = [
    "transformers",
    "peft",
    "trl",
    "llama_index",
    "openai_cookbooks",
    "langchain",
]


def load_jsonl(file_path: str) -> List[Dict]:
    """Load data from a JSONL file."""
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return data


def save_jsonl(data: List[Dict], file_path: str) -> None:
    """Save data to a JSONL file."""
    with open(file_path, "w", encoding="utf-8") as f:
        for item in data:
            json.dump(item, f, ensure_ascii=False)
            f.write("\n")


def download_from_github(sources: List[str]) -> None:
    """Download documentation from GitHub repositories."""
    logger.info(f"Downloading documentation from GitHub for sources: {sources}")

    for source in sources:
        if source not in GITHUB_SOURCES:
            logger.warning(f"Source {source} is not a GitHub source, skipping download")
            continue

        logger.info(f"Downloading {source} documentation")
        cmd = ["python", "data/scraping_scripts/github_to_markdown_ai_docs.py", source]
        result = subprocess.run(cmd)

        if result.returncode != 0:
            logger.error(
                f"Error downloading {source} documentation - check output above"
            )
            # Continue with other sources instead of exiting
            continue

        logger.info(f"Successfully downloaded {source} documentation")


def process_markdown_files(sources: List[str]) -> None:
    """Process markdown files for specific sources."""
    logger.info(f"Processing markdown files for sources: {sources}")

    cmd = ["python", "data/scraping_scripts/process_md_files.py"] + sources
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error processing markdown files - check output above")
        sys.exit(1)

    logger.info("Successfully processed markdown files")


def get_processed_doc_ids() -> Set[str]:
    """Get set of doc_ids that have already been processed with context."""
    if not os.path.exists("data/all_sources_contextual_nodes.pkl"):
        return set()

    try:
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            nodes = pickle.load(f)
            return {node.source_node.node_id for node in nodes}
    except Exception as e:
        logger.error(f"Error loading processed doc_ids: {e}")
        return set()


def add_context_to_nodes(new_only: bool = False) -> None:
    """Add context to document nodes, optionally processing only new content."""
    logger.info("Adding context to document nodes")

    if new_only:
        # Load all documents
        all_docs = load_jsonl("data/all_sources_data.jsonl")
        processed_ids = get_processed_doc_ids()

        # Filter for unprocessed documents
        new_docs = [doc for doc in all_docs if doc["doc_id"] not in processed_ids]

        if not new_docs:
            logger.info("No new documents to process")
            return

        # Save temporary JSONL with only new documents
        temp_file = "data/new_docs_temp.jsonl"
        save_jsonl(new_docs, temp_file)

        # Run a small inline script that reuses add_context_to_nodes.py against the temp file
        cmd = [
            "python",
            "-c",
            f"""
import asyncio
import os
import pickle
import json
from data.scraping_scripts.add_context_to_nodes import create_docs, process

async def main():
    # First, get the list of sources being updated from the temp file
    updated_sources = set()
    with open("{temp_file}", "r") as f:
        for line in f:
            data = json.loads(line)
            updated_sources.add(data["source"])

    print(f"Updating nodes for sources: {{updated_sources}}")

    # Process new documents
    documents = create_docs("{temp_file}")
    enhanced_nodes = await process(documents)
    print(f"Generated context for {{len(enhanced_nodes)}} new nodes")

    # Load existing nodes if they exist
    existing_nodes = []
    if os.path.exists("data/all_sources_contextual_nodes.pkl"):
        with open("data/all_sources_contextual_nodes.pkl", "rb") as f:
            existing_nodes = pickle.load(f)

    # Filter out existing nodes for sources we're updating
    filtered_nodes = []
    removed_count = 0

    for node in existing_nodes:
        # Try to extract source from node metadata
        try:
            source = None
            if hasattr(node, 'source_node') and hasattr(node.source_node, 'metadata'):
                source = node.source_node.metadata.get("source")
            elif hasattr(node, 'metadata'):
                source = node.metadata.get("source")

            if source not in updated_sources:
                filtered_nodes.append(node)
            else:
                removed_count += 1
        except Exception:
            # Keep nodes where we can't determine the source
            filtered_nodes.append(node)

    print(f"Removed {{removed_count}} existing nodes for updated sources")
    existing_nodes = filtered_nodes

    # Combine filtered existing nodes with new nodes
    all_nodes = existing_nodes + enhanced_nodes

    # Save all nodes
    with open("data/all_sources_contextual_nodes.pkl", "wb") as f:
        pickle.dump(all_nodes, f)

    print(f"Total nodes in updated file: {{len(all_nodes)}}")

asyncio.run(main())
""",
        ]
    else:
        # Process all documents
        logger.info("Adding context to all nodes")
        cmd = ["python", "data/scraping_scripts/add_context_to_nodes.py"]

    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error adding context to nodes - check output above")
        sys.exit(1)

    logger.info("Successfully added context to nodes")

    # Clean up temp file if it exists
    if new_only and os.path.exists("data/new_docs_temp.jsonl"):
        os.remove("data/new_docs_temp.jsonl")


def create_vector_stores() -> None:
    """Create vector stores from processed documents."""
    logger.info("Creating vector stores")
    cmd = ["python", "data/scraping_scripts/create_vector_stores.py", "all_sources"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error creating vector stores - check output above")
        sys.exit(1)

    logger.info("Successfully created vector stores")


def upload_to_huggingface(upload_jsonl: bool = False) -> None:
    """Upload databases to HuggingFace."""
    logger.info("Uploading databases to HuggingFace")
    cmd = ["python", "data/scraping_scripts/upload_dbs_to_hf.py"]
    result = subprocess.run(cmd)

    if result.returncode != 0:
        logger.error("Error uploading databases - check output above")
        sys.exit(1)

    logger.info("Successfully uploaded databases to HuggingFace")

    if upload_jsonl:
        logger.info("Uploading data files to HuggingFace")

        try:
            # Note: This uses a separate private repository
            cmd = ["python", "data/scraping_scripts/upload_data_to_hf.py"]
            result = subprocess.run(cmd)

            if result.returncode != 0:
                logger.error("Error uploading data files - check output above")
                sys.exit(1)

            logger.info("Successfully uploaded data files to HuggingFace")
        except Exception as e:
            logger.error(f"Error uploading JSONL file: {e}")
            sys.exit(1)


def main():
    parser = argparse.ArgumentParser(
        description="AI Tutor App Documentation Update Workflow"
    )
    parser.add_argument(
        "--sources",
        nargs="+",
        choices=GITHUB_SOURCES,
        default=GITHUB_SOURCES,
        help="GitHub documentation sources to update",
    )
    parser.add_argument(
        "--skip-download", action="store_true", help="Skip downloading from GitHub"
    )
    parser.add_argument(
        "--skip-process", action="store_true", help="Skip processing markdown files"
    )
    parser.add_argument(
        "--process-all-context",
        action="store_true",
        help="Process all content when adding context (default: only process new content)",
    )
    parser.add_argument(
        "--skip-context",
        action="store_true",
        help="Skip the context addition step entirely",
    )
    parser.add_argument(
        "--skip-vectors", action="store_true", help="Skip vector store creation"
    )
    parser.add_argument(
        "--skip-upload", action="store_true", help="Skip uploading to HuggingFace"
    )
    parser.add_argument(
        "--skip-data-upload",
        action="store_true",
        help="Skip uploading data files (.jsonl and .pkl) to private HuggingFace repo (they are uploaded by default)",
    )

    args = parser.parse_args()

    # Ensure required data files exist before proceeding
    ensure_required_files_exist()

    # Execute the workflow steps
    if not args.skip_download:
        download_from_github(args.sources)

    if not args.skip_process:
        process_markdown_files(args.sources)

    if not args.skip_context:
        add_context_to_nodes(not args.process_all_context)

    if not args.skip_vectors:
        create_vector_stores()

    if not args.skip_upload:
        # By default, also upload the data files (JSONL and PKL) unless explicitly skipped
        upload_to_huggingface(not args.skip_data_upload)

    logger.info("Documentation update workflow completed successfully")


if __name__ == "__main__":
    main()
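A sketch of a partial re-run, mirroring the subprocess pattern the workflow itself uses; the flags are the ones defined in `main()` above, and the choice of source is only an example.

```python
# Sketch: resume the workflow for a single source, skipping the download and
# markdown-processing steps that already completed in a previous invocation.
import subprocess

subprocess.run(
    [
        "python",
        "data/scraping_scripts/update_docs_workflow.py",
        "--sources", "transformers",
        "--skip-download",
        "--skip-process",
    ],
    check=True,
)
```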
data/scraping_scripts/upload_data_to_hf.py
ADDED
@@ -0,0 +1,129 @@
#!/usr/bin/env python
"""
Upload Data Files to HuggingFace

This script uploads key data files to a private HuggingFace dataset repository:
1. all_sources_data.jsonl - The raw document data
2. all_sources_contextual_nodes.pkl - The processed nodes with added context

This is useful for new team members who need the latest version of the data.

Usage:
    python upload_data_to_hf.py [--repo REPO_ID]

Arguments:
    --repo REPO_ID    HuggingFace dataset repository ID (default: towardsai-tutors/ai-tutor-data)
"""

import argparse
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()


def upload_files_to_huggingface(repo_id="towardsai-tutors/ai-tutor-data"):
    """Upload data files to a private HuggingFace repository."""
    # Main files to upload
    files_to_upload = [
        # Combined data and vector store
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
        # Individual source files
        "data/transformers_data.jsonl",
        "data/peft_data.jsonl",
        "data/trl_data.jsonl",
        "data/llama_index_data.jsonl",
        "data/langchain_data.jsonl",
        "data/openai_cookbooks_data.jsonl",
        # Course files
        "data/tai_blog_data.jsonl",
        "data/8-hour_primer_data.jsonl",
        "data/llm_developer_data.jsonl",
        "data/python_primer_data.jsonl",
    ]

    # Filter to only include files that exist
    existing_files = []
    missing_files = []

    for file_path in files_to_upload:
        if os.path.exists(file_path):
            existing_files.append(file_path)
        else:
            missing_files.append(file_path)

    # Critical files must exist
    critical_files = [
        "data/all_sources_data.jsonl",
        "data/all_sources_contextual_nodes.pkl",
    ]
    critical_missing = [f for f in critical_files if f in missing_files]

    if critical_missing:
        print(
            f"Error: The following critical files were not found: {', '.join(critical_missing)}"
        )
        # return False

    if missing_files:
        print(
            f"Warning: The following files were not found and will not be uploaded: {', '.join(missing_files)}"
        )
        print("This is normal if you're only updating certain sources.")

    try:
        api = HfApi(token=os.getenv("HF_TOKEN"))

        # Check that the repository exists; abort if it doesn't
        try:
            api.repo_info(repo_id=repo_id, repo_type="dataset")
            print(f"Repository {repo_id} exists")
        except Exception:
            print(
                f"Repository {repo_id} doesn't exist. Please create it first on the HuggingFace platform."
            )
            print("Make sure to set it as private if needed.")
            return False

        # Upload all existing files
        for file_path in existing_files:
            try:
                file_name = os.path.basename(file_path)
                print(f"Uploading {file_name}...")

                api.upload_file(
                    path_or_fileobj=file_path,
                    path_in_repo=file_name,
                    repo_id=repo_id,
                    repo_type="dataset",
                )
                print(
                    f"Successfully uploaded {file_name} to HuggingFace repository {repo_id}"
                )
            except Exception as e:
                print(f"Error uploading {file_path}: {e}")
                # Continue with other files even if one fails

        return True
    except Exception as e:
        print(f"Error uploading files: {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Upload Data Files to HuggingFace")
    parser.add_argument(
        "--repo",
        default="towardsai-tutors/ai-tutor-data",
        help="HuggingFace dataset repository ID",
    )

    args = parser.parse_args()
    upload_files_to_huggingface(args.repo)


if __name__ == "__main__":
    main()
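A minimal sketch of programmatic use, assuming `HF_TOKEN` is set in `.env` and the target dataset repository already exists on the Hub; the repository name below is hypothetical.

```python
# Sketch: push the data files to an alternative private dataset repo.
from data.scraping_scripts.upload_data_to_hf import upload_files_to_huggingface

upload_files_to_huggingface(repo_id="my-org/ai-tutor-data-backup")  # hypothetical repo ID
```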
data/scraping_scripts/upload_dbs_to_hf.py
ADDED
@@ -0,0 +1,38 @@
"""
Hugging Face Data Upload Script

Purpose:
This script uploads a local folder to a Hugging Face dataset repository. It's designed to
update or create a dataset on the Hugging Face Hub by uploading the contents of a specified
local folder.

Usage:
- Run the script: python data/scraping_scripts/upload_dbs_to_hf.py

The script will:
- Upload the contents of the 'data' folder to the specified Hugging Face dataset repository.
- https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db

Configuration:
- The script is set to upload to the "towardsai-tutors/ai-tutor-vector-db" dataset repository.
- It deletes all existing files in the repository before uploading (due to delete_patterns=["*"]).
- JSONL, Python, text, notebook, Markdown, and .pyc files are excluded via ignore_patterns.
"""

import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()

api = HfApi(token=os.getenv("HF_TOKEN"))

api.upload_folder(
    folder_path="data",
    repo_id="towardsai-tutors/ai-tutor-vector-db",
    repo_type="dataset",
    # multi_commits=True,
    # multi_commits_verbose=True,
    delete_patterns=["*"],
    ignore_patterns=["*.jsonl", "*.py", "*.txt", "*.ipynb", "*.md", "*.pyc"],
)
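As an optional sanity check (a sketch, not part of the committed script), `HfApi.list_repo_files` can confirm which files actually landed in the vector-db repository after the upload; it assumes `HF_TOKEN` grants read access.

```python
# Sketch: list the files now present in the dataset repo after upload_folder runs.
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

load_dotenv()
api = HfApi(token=os.getenv("HF_TOKEN"))
for path in api.list_repo_files("towardsai-tutors/ai-tutor-vector-db", repo_type="dataset"):
    print(path)
```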