MohamedRashad committed
Commit fe19b0b · 1 Parent(s): da222cf

Enhance evaluation sections and streamline JSON result loading for retrieval and reranking

Files changed (2):
  1. app.py +56 -12
  2. utils.py +32 -30
app.py CHANGED
@@ -8,22 +8,66 @@ HEADER = """<div style="text-align: center; margin-bottom: 20px;">
  </div>
  """

- ABOUT_SECTION = """
- ## About

- The Arabic RAG Leaderboard is designed to evaluate and compare the performance of Retrieval-Augmented Generation (RAG) models on a set of retrieval and generative tasks. By leveraging a comprehensive evaluation framework, the leaderboard provides a detailed assessment of a model's ability to retrieve relevant information and generate accurate, coherent, and contextually appropriate responses.

- ### Why Focus on RAG Models?

- The Arabic RAG Leaderboard is specifically designed to assess **RAG models**, which combine retrieval mechanisms with generative capabilities to enhance the quality and relevance of generated content. These models are particularly useful in scenarios where access to up-to-date and contextually relevant information is crucial. While foundational models can be evaluated, the primary focus is on RAG models that excel in both retrieval and generation tasks.

- ### How to Submit Your Model?

- Navigate to the submission section below to submit your RAG model from the HuggingFace Hub for evaluation. Ensure that your model is public and the submitted metadata (precision, revision, #params) is accurate.

- ### Contact

- For any inquiries or assistance, feel free to reach out through the community tab at [Navid-AI Community](https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard/discussions) or via [email](mailto:[email protected]).
  """

  CITATION_BUTTON_LABEL = """
@@ -31,7 +75,7 @@ Copy the following snippet to cite these results
  """

  CITATION_BUTTON_TEXT = """
- @misc{AraGen,
  author = {Mohaned A. Rashad, Hamza Shahid},
  title = {The Arabic RAG Leaderboard},
  year = {2025},
@@ -119,7 +163,7 @@ def main():
  submit_gradio_module("Retriever")

  with gr.Tab("ℹ️ About"):
-     gr.Markdown(ABOUT_SECTION)

  with gr.Tab("📊 Reranking"):
      with gr.Tabs():
@@ -161,7 +205,7 @@ def main():
  submit_gradio_module("Reranker")

  with gr.Tab("ℹ️ About"):
-     gr.Markdown(ABOUT_SECTION)

  # with gr.Tab("🧠 LLM Context Answering"):
  #     with gr.Tabs():
 
  </div>
  """

+ RETRIEVAL_ABOUT_SECTION = """
+ ## About Retrieval Evaluation
+
+ The retrieval evaluation assesses a model's ability to find and retrieve relevant information from a large corpus of Arabic text. Models are evaluated on:
+
+ ### Web Search Dataset Metrics
+ - **MRR (Mean Reciprocal Rank)**: Measures ranking quality by focusing on the position of the first relevant result
+ - **nDCG (Normalized Discounted Cumulative Gain)**: Evaluates ranking quality considering all relevant results
+ - **Recall@5**: Measures the proportion of relevant documents found in the top 5 results
+ - **Overall Score**: Combined score calculated as the average of MRR, nDCG, and Recall@5
+
+ ### Model Requirements
+ - Must support Arabic text embeddings
+ - Should handle queries of at least 512 tokens
+ - Must work with the `sentence-transformers` library
+
+ ### Evaluation Process
+ 1. Models process Arabic web search queries
+ 2. Retrieved documents are evaluated using:
+    - MRR for the position of the first relevant result
+    - nDCG for overall ranking quality
+    - Recall@5 for top-results accuracy
+ 3. Metrics are averaged to calculate the overall score
+ 4. Models are ranked by their overall performance
+
+ ### How to Prepare Your Model
+ - Ensure your model is publicly available on the HuggingFace Hub (we don't support private model evaluations yet)
+ - The model should output fixed-dimension embeddings for text
+ - Support batch processing for efficient evaluation (this is the default if you use `sentence-transformers`)
+ """
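Note for reviewers: the metrics named in the new RETRIEVAL_ABOUT_SECTION combine by simple averaging into the "Overall Score". Below is a minimal, illustrative sketch of that scoring step for a single query with one relevant document, assuming a `sentence-transformers` bi-encoder. The model name and the `score_single_query` helper are placeholders, not the leaderboard's actual evaluation harness.

```python
# Illustrative only: assumes one relevant document per query and a bi-encoder
# loaded with sentence-transformers; not the leaderboard's real evaluation code.
import numpy as np
from sentence_transformers import SentenceTransformer, util

def score_single_query(model, query, docs, relevant_idx, k=5):
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(docs, convert_to_tensor=True)
    ranking = util.cos_sim(q_emb, d_emb)[0].argsort(descending=True).tolist()
    rank = ranking.index(relevant_idx) + 1      # 1-based rank of the relevant doc
    mrr = 1.0 / rank                            # reciprocal rank
    ndcg = 1.0 / np.log2(rank + 1)              # binary-relevance nDCG (ideal DCG = 1)
    recall_at_k = 1.0 if rank <= k else 0.0     # Recall@5
    overall = (mrr + ndcg + recall_at_k) / 3    # "Overall Score" = average of the three
    return {"MRR": mrr, "nDCG": ndcg, f"Recall@{k}": recall_at_k, "Overall": overall}

# Placeholder Arabic-capable encoder; any sentence-transformers model works here.
model = SentenceTransformer("intfloat/multilingual-e5-base")
docs = ["القاهرة هي عاصمة مصر.", "الرياض مدينة كبيرة.", "هذا نص غير ذي صلة."]
print(score_single_query(model, "ما هي عاصمة مصر؟", docs, relevant_idx=0))
# In the full evaluation, these per-query values are averaged over the query set.
```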

+ RERANKER_ABOUT_SECTION = """
+ ## About Reranking Evaluation

+ The reranking evaluation assesses a model's ability to improve search quality by reordering initially retrieved results. Models are evaluated across multiple unseen Arabic datasets to ensure robust performance.

+ ### Evaluation Metrics
+ - **MRR@10 (Mean Reciprocal Rank at 10)**: Measures ranking quality, focusing on the position of the first relevant result within the top 10
+ - **NDCG@10 (Normalized DCG at 10)**: Evaluates the ranking quality of all relevant results within the top 10
+ - **MAP (Mean Average Precision)**: Measures the overall precision across all relevant documents

+ All metrics are averaged across multiple evaluation datasets to provide a comprehensive assessment of model performance.

+ ### Model Requirements
+ - Must accept query-document pairs as input
+ - Should output relevance scores for reranking (i.e., use cross-attention or a similar mechanism for query-document matching)
+ - Must support Arabic text processing

+ ### Evaluation Process
+ 1. Models are tested on multiple unseen Arabic datasets
+ 2. For each dataset:
+    - Initial candidate documents are provided
+    - The model reranks the candidates
+    - MRR@10, NDCG@10, and MAP are calculated
+ 3. Final scores are averaged across all datasets
+ 4. Models are ranked based on overall performance

+ ### How to Prepare Your Model
+ - The model should be public on the HuggingFace Hub (private models are not supported yet)
+ - Make sure it works correctly with the `sentence-transformers` library
  """
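Similarly, here is a minimal sketch of the query-document scoring that RERANKER_ABOUT_SECTION describes, using `sentence_transformers.CrossEncoder`. The checkpoint name and candidate passages are placeholders; the unseen evaluation datasets and the exact MRR@10/NDCG@10/MAP computation are not part of this commit.

```python
# Illustrative only: reranks a handful of retrieved candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # placeholder multilingual reranker

query = "ما هي عاصمة مصر؟"
candidates = [
    "الرياض هي عاصمة المملكة العربية السعودية.",
    "القاهرة هي عاصمة جمهورية مصر العربية.",
    "بيروت مدينة ساحلية على البحر المتوسط.",
]

# One relevance score per (query, document) pair; higher means more relevant.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

for doc, score in reranked:
    print(f"{score:.3f}  {doc}")

# MRR@10, NDCG@10, and MAP are then computed from the reranked order against
# gold labels and averaged across the evaluation datasets.
```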
 
  CITATION_BUTTON_LABEL = """
  """

  CITATION_BUTTON_TEXT = """
+ @misc{TARL,
  author = {Mohaned A. Rashad, Hamza Shahid},
  title = {The Arabic RAG Leaderboard},
  year = {2025},
  submit_gradio_module("Retriever")

  with gr.Tab("ℹ️ About"):
+     gr.Markdown(RETRIEVAL_ABOUT_SECTION)

  with gr.Tab("📊 Reranking"):
      with gr.Tabs():

  submit_gradio_module("Reranker")

  with gr.Tab("ℹ️ About"):
+     gr.Markdown(RERANKER_ABOUT_SECTION)

  # with gr.Tab("🧠 LLM Context Answering"):
  #     with gr.Tabs():
utils.py CHANGED
@@ -12,35 +12,40 @@ DATASET_REPO_ID = f"{OWNER}/requests-dataset"

  results_dir = Path(__file__).parent / "results"

- def load_retrieval_results(prepare_for_display=False):
-     # Load the retrieval results
-     dataframe_path = results_dir / "retrieval_results.json"
-     if dataframe_path.exists():
-         df = pd.read_json(dataframe_path)
-     else:
-         raise FileNotFoundError(f"File '{dataframe_path}' not found.")

      if prepare_for_display:
-         df[["Model"]] = df[["Model"]].map(lambda x: f'<a href="https://huggingface.co/{x}" target="_blank">{x}</a>')
-         df.drop(columns=["Revision", "Precision", "Task"], inplace=True)
-         df.sort_values("Web Search Dataset (Overall Score)", ascending=False, inplace=True)
-
      return df

  def load_reranking_results(prepare_for_display=False):
-     # Load the reranking results
      dataframe_path = results_dir / "reranking_results.json"
-     if dataframe_path.exists():
-         df = pd.read_json(dataframe_path)
-     else:
-         raise FileNotFoundError(f"File '{dataframe_path}' not found.")
-
-     if prepare_for_display:
-         df[["Model"]] = df[["Model"]].map(lambda x: f'<a href="https://huggingface.co/{x}" target="_blank">{x}</a>')
-         df.sort_values("Overall Score", ascending=False, inplace=True)
-
-     return df
-

  def get_model_info(model_id, verbose=False):
      model_info = api.model_info(model_id)
@@ -148,13 +153,12 @@ def submit_model(model_name, revision, precision, params, license, task):

      # Upload the submission to the dataset repository
      try:
-         hf_api_token = os.environ.get('HF_TOKEN', None)
          api.upload_file(
              path_or_fileobj=submission_json.encode('utf-8'),
              path_in_repo=file_path_in_repo,
              repo_id=DATASET_REPO_ID,
              repo_type="dataset",
-             token=hf_api_token
          )
      except Exception as e:
          print(f"Error uploading file: {e}")
@@ -167,14 +171,12 @@ def load_requests(status_folder):
      requests_data = []
      folder_path_in_repo = status_folder  # 'pending', 'finished', or 'failed'

-     hf_api_token = os.environ.get('HF_TOKEN', None)
-
      try:
-         # List files in the dataset repository
          files_info = api.list_repo_files(
              repo_id=DATASET_REPO_ID,
              repo_type="dataset",
-             token=hf_api_token
          )
      except Exception as e:
          print(f"Error accessing dataset repository: {e}")
@@ -190,7 +192,7 @@ def load_requests(status_folder):
              repo_id=DATASET_REPO_ID,
              filename=file_path,
              repo_type="dataset",
-             token=hf_api_token
          )
          # Load JSON data
          with open(local_file_path, 'r') as f:
 

  results_dir = Path(__file__).parent / "results"

+ # Cache the HF token to avoid multiple os.environ lookups.
+ HF_TOKEN = os.environ.get('HF_TOKEN', None)

+ # Add a helper to load JSON results with optional formatting.
+ def load_json_results(file_path: Path, prepare_for_display=False, sort_col=None, drop_cols=None):
+     if file_path.exists():
+         df = pd.read_json(file_path)
+     else:
+         raise FileNotFoundError(f"File '{file_path}' not found.")
      if prepare_for_display:
+         # Apply common mapping for model link formatting.
+         df[["Model"]] = df[["Model"]].applymap(lambda x: f'<a href="https://huggingface.co/{x}" target="_blank">{x}</a>')
+         if drop_cols is not None:
+             df.drop(columns=drop_cols, inplace=True)
+         if sort_col is not None:
+             df.sort_values(sort_col, ascending=False, inplace=True)
      return df

+ def load_retrieval_results(prepare_for_display=False):
+     dataframe_path = results_dir / "retrieval_results.json"
+     return load_json_results(
+         dataframe_path,
+         prepare_for_display=prepare_for_display,
+         sort_col="Web Search Dataset (Overall Score)",
+         drop_cols=["Revision", "Precision", "Task"]
+     )
+
+ def load_reranking_results(prepare_for_display=False):
      dataframe_path = results_dir / "reranking_results.json"
+     return load_json_results(
+         dataframe_path,
+         prepare_for_display=prepare_for_display,
+         sort_col="Overall Score"
+     )
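A quick usage note on the refactored loaders: both now delegate to `load_json_results`, so display behaviour (HTML model links, column drops, sorting) stays identical to the previous per-function implementations. A sketch of how they would be called from app.py, assuming the existing `from utils import ...` pattern:

```python
# Behaviour is unchanged: prepare_for_display=True adds the HF model links
# and sorts by the leaderboard's overall-score column for each task.
from utils import load_retrieval_results, load_reranking_results

retrieval_df = load_retrieval_results(prepare_for_display=True)  # sorted by "Web Search Dataset (Overall Score)"
reranking_df = load_reranking_results(prepare_for_display=True)  # sorted by "Overall Score"
print(retrieval_df.head(), reranking_df.head(), sep="\n\n")
```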

  def get_model_info(model_id, verbose=False):
      model_info = api.model_info(model_id)
 

      # Upload the submission to the dataset repository
      try:
          api.upload_file(
              path_or_fileobj=submission_json.encode('utf-8'),
              path_in_repo=file_path_in_repo,
              repo_id=DATASET_REPO_ID,
              repo_type="dataset",
+             token=HF_TOKEN
          )
      except Exception as e:
          print(f"Error uploading file: {e}")
 
      requests_data = []
      folder_path_in_repo = status_folder  # 'pending', 'finished', or 'failed'

      try:
+         # Use the cached token
          files_info = api.list_repo_files(
              repo_id=DATASET_REPO_ID,
              repo_type="dataset",
+             token=HF_TOKEN
          )
      except Exception as e:
          print(f"Error accessing dataset repository: {e}")
 
              repo_id=DATASET_REPO_ID,
              filename=file_path,
              repo_type="dataset",
+             token=HF_TOKEN
          )
          # Load JSON data
          with open(local_file_path, 'r') as f: