Update app.py
app.py
CHANGED
@@ -227,10 +227,120 @@ iface = gr.Interface(
     flagging_mode="never"
 )

+# The code remains the same as in the previous artifact, so I'll omit it here for brevity.
+# The changes will be in the tutorial_md variable.
+
 tutorial_md = """
 # Embedding Comparison Tool Tutorial

-This tool allows you to compare different embedding models and retrieval strategies for document search.
+This tool allows you to compare different embedding models and retrieval strategies for document search. Before we dive into how to use the tool, let's cover some important concepts.
+
+## What is RAG?
+
+RAG stands for Retrieval-Augmented Generation. It's a technique that combines the strengths of large language models with the ability to access and use external knowledge. RAG is particularly useful for:
+
+- Providing up-to-date information
+- Answering questions based on specific documents or data sources
+- Reducing hallucinations in AI responses
+- Customizing AI outputs for specific domains or use cases
+
+RAG is well suited to applications where you need accurate, context-specific information retrieval combined with natural language generation, including chatbots, question-answering systems, and document analysis tools.
+
+## Key Components of RAG
+
+### 1. Document Loading
+
+This is the process of ingesting documents from various sources (PDFs, web pages, databases, etc.) into a format that can be processed by the RAG system. Efficient document loading is crucial for handling large volumes of data.
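+
+As a minimal sketch of this step (assuming the `langchain-community` loaders and `pypdf` are installed; the file names are hypothetical and the tool's own loading code may differ):
+
+```python
+from langchain_community.document_loaders import PyPDFLoader, TextLoader
+
+# Each loader returns a list of Document objects (page_content plus metadata).
+docs = PyPDFLoader("./files/report.pdf").load()   # hypothetical PDF
+docs += TextLoader("./files/notes.txt").load()    # hypothetical text file
+print(len(docs), docs[0].metadata)
+```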
+
+### 2. Document Splitting
+
+Large documents are often split into smaller chunks for more efficient processing and retrieval. The choice of splitting method can significantly impact the quality of retrieval results.
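+
+A sketch using LangChain's `RecursiveCharacterTextSplitter` (the sizes are illustrative, not the tool's defaults):
+
+```python
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+splitter = RecursiveCharacterTextSplitter(
+    chunk_size=500,    # maximum characters per chunk
+    chunk_overlap=50,  # characters shared between neighboring chunks
+)
+chunks = splitter.split_documents(docs)  # docs from the loading step above
+```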
+
+### 3. Vector Store and Embeddings
+
+Embeddings are dense vector representations of text that capture semantic meaning. A vector store is a database optimized for storing and querying these high-dimensional vectors. Together, they allow for efficient semantic search.
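+
+Continuing the sketch, an index can be built like this (the model name is only an example, and `sentence-transformers` is assumed to be installed):
+
+```python
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.vectorstores import FAISS
+
+embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+vector_store = FAISS.from_documents(chunks, embeddings)  # embeds every chunk and indexes the vectors
+```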
+
+### 4. Retrieval
+
+This is the process of finding the most relevant documents or chunks based on a query. The quality of retrieval directly impacts the final output of the RAG system.
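+
+With the index above, retrieval is a single call (the query is hypothetical):
+
+```python
+# Return the 4 chunks whose embeddings are closest to the query embedding.
+results = vector_store.similarity_search("What does the document say about chunk overlap?", k=4)
+for doc in results:
+    print(doc.page_content[:80])
+```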
+
+## Why is this important?
+
+Understanding and optimizing each component of the RAG pipeline is crucial because:
+
+1. It affects the accuracy and relevance of the information retrieved.
+2. It impacts the speed and efficiency of the system.
+3. It determines the scalability of your solution.
+4. It influences the overall quality of the generated responses.
+
+## Impact of Parameter Changes
+
+Changes in various parameters can have significant effects:
+
+- **Chunk Size**: Larger chunks provide more context but may reduce precision. Smaller chunks increase precision but may lose context.
+- **Overlap**: More overlap can help maintain context between chunks but increases computational load.
+- **Embedding Model**: Different models have varying performance across languages and domains.
+- **Vector Store**: Affects query speed and the types of searches you can perform.
+- **Retrieval Method**: Impacts the diversity and relevance of retrieved documents.
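+
+You can observe the chunk-size trade-off directly by splitting the same text at two illustrative sizes:
+
+```python
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+text = open("./files/notes.txt").read()  # hypothetical file
+for size in (200, 1000):
+    parts = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=0).split_text(text)
+    print(f"chunk_size={size}: {len(parts)} chunks")
+```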
+
+## Detailed Parameter Explanations
+
+### Embedding Model
+
+The embedding model translates text into numerical vectors. The choice of model affects:
+
+- **Language Coverage**: Some models are monolingual, others are multilingual.
+- **Domain Specificity**: Models can be general or trained on specific domains (e.g., legal, medical).
+- **Vector Dimensions**: Higher dimensions can capture more information but require more storage and computation.
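+
+A quick way to check a model's output dimensionality (assuming `sentence-transformers` is installed):
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+print(model.get_sentence_embedding_dimension())  # 384 for this model
+```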
+
+#### Vocabulary Size
+
+The vocabulary size refers to the number of unique tokens the model recognizes. It's important because:
+
+- It affects the model's ability to handle rare words or specialized terminology.
+- Larger vocabularies can lead to better performance but require more memory.
+- It impacts the model's performance across different languages (larger vocabularies are often better for multilingual models).
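+
+The vocabulary size can be read from the model's tokenizer (assuming `transformers` is installed):
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
+print(tokenizer.vocab_size)  # 30522 for this BERT-based model
+```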
+
+### Split Strategy
+
+- **Token**: Splits based on a fixed number of tokens. Good for maintaining consistent chunk sizes.
+- **Recursive**: Splits based on content, trying to maintain semantic coherence. Better for preserving context.
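+
+For comparison with the recursive splitter shown earlier, a token-based sketch (LangChain's `TokenTextSplitter` uses a tiktoken encoding by default; the sizes are illustrative):
+
+```python
+from langchain_text_splitters import TokenTextSplitter
+
+token_splitter = TokenTextSplitter(chunk_size=256, chunk_overlap=20)  # sizes counted in tokens
+token_chunks = token_splitter.split_documents(docs)
+```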
+
+### Vector Store Type
+
+- **FAISS**: Fast, memory-efficient. Good for large-scale similarity search.
+- **Chroma**: Offers additional features like metadata filtering. Good for more complex querying needs.
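+
+Both stores expose the same LangChain interface, so swapping them is roughly a one-line change (a sketch, assuming `chromadb` is installed):
+
+```python
+from langchain_community.vectorstores import Chroma
+
+chroma_store = Chroma.from_documents(chunks, embeddings)  # same documents, different backend
+```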
+
+### Search Type
+
+- **Similarity**: Returns the most similar documents. Fast and straightforward.
+- **MMR (Maximum Marginal Relevance)**: Balances relevance with diversity in results. Useful for getting a broader perspective.
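+
+In LangChain, the search type is set on the retriever (the parameter values are illustrative):
+
+```python
+mmr_retriever = vector_store.as_retriever(
+    search_type="mmr",
+    search_kwargs={"k": 4, "fetch_k": 20},  # re-rank 20 candidates down to 4 diverse results
+)
+```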
+
+## MTEB (Massive Text Embedding Benchmark)
+
+MTEB is a comprehensive benchmark for evaluating text embedding models across a wide range of tasks and languages. It's useful for:
+
+- Comparing the performance of different embedding models.
+- Understanding how models perform on specific tasks (e.g., classification, clustering, retrieval).
+- Selecting the best model for your specific use case.
+
+### Finding Embeddings on the MTEB Leaderboard
+
+To find suitable embeddings using the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard):
+
+1. Look at the "Avg" column for overall performance across all tasks.
+2. Check performance on the task types relevant to your use case (e.g., Retrieval, Classification).
+3. Consider the model size and inference speed for your deployment constraints.
+4. Look at language-specific scores if you're working with non-English text.
+5. Click on a model name to get more details and a link to its Hugging Face model page.
+
+When selecting a model, balance performance with practical considerations like model size, inference speed, and performance on the tasks relevant to your application.
+
+By understanding these concepts and parameters, you can make informed decisions when using the Embedding Comparison Tool and optimize your RAG system for your specific needs.
+
+## Using the Embedding Comparison Tool
+
+Now that you understand the underlying concepts, here's how to use the tool:

 1. **File Upload**: Optionally upload a file (PDF, DOCX, or TXT) or leave it empty to use files in the `./files` directory.

@@ -264,6 +374,7 @@ You can download the results as CSV files for further analysis.
 Experiment with different settings to find the best combination for your specific use case!
 """

+# The rest of the code remains the same
 iface = gr.TabbedInterface(
     [iface, gr.Markdown(tutorial_md)],
     ["Embedding Comparison", "Tutorial"]