Chris4K committed · verified
Commit 78e1a2e · 1 Parent(s): e01a00a

Update app.py
Files changed (1)
  1. app.py +112 -1
app.py CHANGED
@@ -227,10 +227,120 @@ iface = gr.Interface(
  flagging_mode="never"
  )
 
+ # The code remains the same as in the previous artifact, so I'll omit it here for brevity.
+ # The changes will be in the tutorial_md variable.
+
  tutorial_md = """
  # Embedding Comparison Tool Tutorial
 
- This tool allows you to compare different embedding models and retrieval strategies for document search. Here's how to use it:
+ This tool allows you to compare different embedding models and retrieval strategies for document search. Before we dive into how to use the tool, let's cover some important concepts.
+
+ ## What is RAG?
+
+ RAG stands for Retrieval-Augmented Generation. It's a technique that combines the strengths of large language models with the ability to access and use external knowledge. RAG is particularly useful for:
+
+ - Providing up-to-date information
+ - Answering questions based on specific documents or data sources
+ - Reducing hallucinations in AI responses
+ - Customizing AI outputs for specific domains or use cases
+
+ RAG is well suited for applications where you need accurate, context-specific information retrieval combined with natural language generation. This includes chatbots, question-answering systems, and document analysis tools.
+
+ ## Key Components of RAG
+
+ ### 1. Document Loading
+
+ This is the process of ingesting documents from various sources (PDFs, web pages, databases, etc.) into a format that can be processed by the RAG system. Efficient document loading is crucial for handling large volumes of data.
+
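+ As a minimal sketch (not taken from this Space's code; it assumes the `langchain-community` package and uses placeholder paths), loading might look like this:
+
+ ```python
+ # Choose a loader by file extension and return a list of LangChain Documents.
+ from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader
+
+ LOADERS = {".pdf": PyPDFLoader, ".docx": Docx2txtLoader, ".txt": TextLoader}
+
+ def load_file(path: str):
+     ext = path[path.rfind("."):].lower()
+     return LOADERS[ext](path).load()
+
+ docs = load_file("./files/example.pdf")  # placeholder path
+ print(len(docs), "documents loaded")
+ ```
+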
+ ### 2. Document Splitting
+
+ Large documents are often split into smaller chunks for more efficient processing and retrieval. The choice of splitting method can significantly impact the quality of retrieval results.
+
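+ A minimal chunking sketch (the chunk size and overlap values are arbitrary examples; `docs` comes from the loading sketch above):
+
+ ```python
+ # Split the loaded documents into overlapping chunks.
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+ splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
+ chunks = splitter.split_documents(docs)
+ print(len(chunks), "chunks")
+ ```
+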
+ ### 3. Vector Store and Embeddings
+
+ Embeddings are dense vector representations of text that capture semantic meaning. A vector store is a database optimized for storing and querying these high-dimensional vectors. Together, they allow for efficient semantic search.
+
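+ A minimal indexing sketch (the model name is only an example; `chunks` comes from the splitting sketch above, and the import paths may differ depending on your LangChain version):
+
+ ```python
+ # Embed the chunks and store the vectors in a FAISS index.
+ from langchain_community.embeddings import HuggingFaceEmbeddings
+ from langchain_community.vectorstores import FAISS
+
+ embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+ vector_store = FAISS.from_documents(chunks, embeddings)
+ ```
+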
+ ### 4. Retrieval
+
+ This is the process of finding the most relevant documents or chunks based on a query. The quality of retrieval directly impacts the final output of the RAG system.
+
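+ Continuing the sketch above, a basic retrieval call could look like this (the query string is a placeholder):
+
+ ```python
+ # Fetch the chunks most similar to a query.
+ query = "How do I choose an embedding model?"
+ results = vector_store.similarity_search(query, k=3)
+ for doc in results:
+     print(doc.page_content[:80])
+ ```
+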
+ ## Why is this important?
+
+ Understanding and optimizing each component of the RAG pipeline is crucial because:
+
+ 1. It affects the accuracy and relevance of the information retrieved.
+ 2. It impacts the speed and efficiency of the system.
+ 3. It determines the scalability of your solution.
+ 4. It influences the overall quality of the generated responses.
+
+ ## Impact of Parameter Changes
+
+ Changes in various parameters can have significant effects:
+
+ - **Chunk Size**: Larger chunks provide more context but may reduce precision. Smaller chunks increase precision but may lose context.
+ - **Overlap**: More overlap can help maintain context between chunks but increases computational load.
+ - **Embedding Model**: Different models have varying performance across languages and domains.
+ - **Vector Store**: Affects query speed and the types of searches you can perform.
+ - **Retrieval Method**: Impacts the diversity and relevance of retrieved documents.
+
+ ## Detailed Parameter Explanations
+
+ ### Embedding Model
+
+ The embedding model translates text into numerical vectors. The choice of model affects:
+
+ - **Language Coverage**: Some models are monolingual, others are multilingual.
+ - **Domain Specificity**: Models can be general or trained on specific domains (e.g., legal, medical).
+ - **Vector Dimensions**: Higher dimensions can capture more information but require more storage and computation.
+
+ #### Vocabulary Size
+
+ The vocabulary size refers to the number of unique tokens the model recognizes. It is important because:
+
+ - It affects the model's ability to handle rare words or specialized terminology.
+ - Larger vocabularies can lead to better performance but require more memory.
+ - It impacts the model's performance across different languages (larger vocabularies are often better for multilingual models).
+
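+ A quick way to inspect these properties for a candidate model (the model name is only an example; requires the `transformers` and `sentence-transformers` packages):
+
+ ```python
+ # Inspect the vocabulary size and embedding dimension of a model.
+ from transformers import AutoTokenizer
+ from sentence_transformers import SentenceTransformer
+
+ name = "sentence-transformers/all-MiniLM-L6-v2"
+ tokenizer = AutoTokenizer.from_pretrained(name)
+ model = SentenceTransformer(name)
+
+ print("vocabulary size:", tokenizer.vocab_size)
+ print("embedding dimension:", model.get_sentence_embedding_dimension())
+ ```
+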
+ ### Split Strategy
+
+ - **Token**: Splits based on a fixed number of tokens. Good for maintaining consistent chunk sizes.
+ - **Recursive**: Splits based on content, trying to maintain semantic coherence. Better for preserving context.
+
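+ To see the difference, you can split the same text with both strategies. A rough sketch (chunk sizes are examples; `TokenTextSplitter` also needs the `tiktoken` package, and the file path is a placeholder):
+
+ ```python
+ # Compare token-based and recursive splitting on the same text.
+ from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
+
+ text = open("./files/example.txt").read()  # placeholder path
+
+ token_chunks = TokenTextSplitter(chunk_size=256, chunk_overlap=32).split_text(text)
+ recursive_chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_text(text)
+
+ print(len(token_chunks), "token-based chunks")
+ print(len(recursive_chunks), "recursive chunks")
+ ```
+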
+ ### Vector Store Type
+
+ - **FAISS**: Fast, memory-efficient. Good for large-scale similarity search.
+ - **Chroma**: Offers additional features like metadata filtering. Good for more complex querying needs.
+
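+ Both stores are built the same way from documents and an embedding function; Chroma additionally accepts a metadata filter at query time. A minimal sketch (reusing `chunks` and `embeddings` from the sketches above; the metadata key is hypothetical):
+
+ ```python
+ # Build the same index with FAISS and with Chroma.
+ from langchain_community.vectorstores import Chroma, FAISS
+
+ faiss_store = FAISS.from_documents(chunks, embeddings)
+ chroma_store = Chroma.from_documents(chunks, embeddings)
+
+ # Chroma can restrict the search to chunks whose metadata matches a filter.
+ hits = chroma_store.similarity_search("example query", k=3, filter={"source": "./files/example.pdf"})
+ ```
+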
+ ### Search Type
+
+ - **Similarity**: Returns the most similar documents. Fast and straightforward.
+ - **MMR (Maximum Marginal Relevance)**: Balances relevance with diversity in results. Useful for getting a broader perspective.
+
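+ With LangChain-style vector stores, both search types are available through the retriever interface. A minimal sketch (reusing `vector_store` from the sketches above; the query and `k` values are examples):
+
+ ```python
+ # Plain similarity search vs. MMR through the retriever interface.
+ similarity_retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})
+ mmr_retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})
+
+ similar_docs = similarity_retriever.invoke("example query")  # older versions: get_relevant_documents(...)
+ diverse_docs = mmr_retriever.invoke("example query")
+ ```
+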
+ ## MTEB (Massive Text Embedding Benchmark)
+
+ MTEB is a comprehensive benchmark for evaluating text embedding models across a wide range of tasks and languages. It's useful for:
+
+ - Comparing the performance of different embedding models.
+ - Understanding how models perform on specific tasks (e.g., classification, clustering, retrieval).
+ - Selecting the best model for your specific use case.
+
+ ### Finding Embeddings on the MTEB Leaderboard
+
+ To find suitable embeddings using the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard):
+
+ 1. Look at the "Avg" column for overall performance across all tasks.
+ 2. Check performance on specific task types relevant to your use case (e.g., Retrieval, Classification).
+ 3. Consider the model size and inference speed for your deployment constraints.
+ 4. Look at language-specific scores if you're working with non-English text.
+ 5. Click on model names to get more details and links to the model pages on Hugging Face.
+
+ When selecting a model, balance performance against practical considerations such as model size, inference speed, and performance on the tasks relevant to your application.
+
+ By understanding these concepts and parameters, you can make informed decisions when using the Embedding Comparison Tool and optimize your RAG system for your specific needs.
+
+ ## Using the Embedding Comparison Tool
+
+ Now that you understand the underlying concepts, here's how to use the tool:
 
  1. **File Upload**: Optionally upload a file (PDF, DOCX, or TXT) or leave it empty to use files in the `./files` directory.
 
@@ -264,6 +374,7 @@ You can download the results as CSV files for further analysis.
  Experiment with different settings to find the best combination for your specific use case!
  """
 
+ # The rest of the code remains the same
  iface = gr.TabbedInterface(
      [iface, gr.Markdown(tutorial_md)],
      ["Embedding Comparison", "Tutorial"]