Signed-off-by: Jonathan Bnayahu <[email protected]>
- app.py +0 -12
- src/about.py +10 -7
app.py
CHANGED
@@ -4,8 +4,6 @@ from gradio_leaderboard import Leaderboard
 from apscheduler.schedulers.background import BackgroundScheduler
 
 from src.about import (
-    CITATION_BUTTON_LABEL,
-    CITATION_BUTTON_TEXT,
     INTRODUCTION_TEXT,
     LLM_BENCHMARKS_TEXT,
     TITLE,
@@ -70,16 +68,6 @@ with gui:
 
     download_button.click(fn=generate_csv_file, outputs=csv_output)
 
-    with gr.Row():
-        with gr.Accordion("📙 Citation", open=False):
-            citation_button = gr.Textbox(
-                value=CITATION_BUTTON_TEXT,
-                label=CITATION_BUTTON_LABEL,
-                lines=20,
-                elem_id="citation-button",
-                show_copy_button=True,
-            )
-
 scheduler = BackgroundScheduler()
 scheduler.add_job(restart_space, "interval", seconds=1800)
 scheduler.start()
src/about.py
CHANGED
@@ -85,15 +85,18 @@ table th:nth-of-type(3) {
 | QA Finance | <pre><p><b>FinQA</b></p>[Dataset](https://huggingface.co/datasets/ibm/finqa), [Paper](https://arxiv.org/abs/2109.00122), [Unitxt Card](https://www.unitxt.ai/en/latest/catalog/catalog.cards.fin_qa.html)</pre> | <p>A large-scale dataset with 2.8k financial reports for 8k Q&A pairs to study numerical reasoning with structured and unstructured evidence.</p>The FinQA dataset is designed to facilitate research and development in the area of question answering (QA) using financial texts. It consists of a subset of QA pairs from a larger dataset, originally created through a collaboration between researchers from the University of Pennsylvania, J.P. Morgan, and Amazon.The original dataset includes 8,281 QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019 (FinQA: A Dataset of Numerical Reasoning over Financial Data.). This subset, specifically curated by Aiera, consists of 91 QA pairs. Each entry in the dataset includes a context, a question, and an answer, with each component manually verified for accuracy and formatting consistency. |
 
 ## Reproducibility
-To reproduce our results,
-
+BlueBench is powered by the <a href="https://www.unitxt.ai">unitxt</a> library. To reproduce our results, start by installing Unitxt in a clean Python 3.10 virtual environment, along with the required dependencies:
 ```
+conda create -n bluebench python=3.10
+conda activate bluebench
 pip install unitxt[bluebench]
-
+```
+To perform the evaluation, run the following, replacing MODEL_FULL_NAME with the name of the provider and model you wish to evaluate, in LiteLLM format. Consult the LiteLLM <a href="https://docs.litellm.ai/docs/providers">providers catalog</a> for details. Make sure you set the required environment variables (e.g., API keys and credentials).
+```
+unitxt-evaluate --tasks "benchmarks.bluebench" --model cross_provider --model_args "model_name=MODEL_FULL_NAME,max_tokens=1024" --output_path ./results/bluebench --log_samples --trust_remote_code --batch_size 8
+```
+A successful run will result in two json files in the ./results/bluebench folder. To view a summary of the results, run the following:
+```
 unitxt-summarize ./results/bluebench
 ```
 """
-
-CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
-CITATION_BUTTON_TEXT = r"""
-"""
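For reference, the reproduction flow introduced in this hunk can be run end to end as in the sketch below. The model string (openai/gpt-4o-mini) and the OPENAI_API_KEY export are illustrative assumptions, not part of this change; substitute the LiteLLM provider/model you actually want to evaluate and whatever credentials that provider requires.

```
# Minimal sketch of the steps above, assuming conda and an OpenAI-hosted model.
conda create -n bluebench python=3.10
conda activate bluebench
pip install unitxt[bluebench]

# Hypothetical model choice in LiteLLM provider/model format; replace as needed,
# along with the environment variables your provider expects.
export OPENAI_API_KEY="..."
unitxt-evaluate --tasks "benchmarks.bluebench" \
    --model cross_provider \
    --model_args "model_name=openai/gpt-4o-mini,max_tokens=1024" \
    --output_path ./results/bluebench \
    --log_samples --trust_remote_code --batch_size 8

# Summarize the json result files written to ./results/bluebench
unitxt-summarize ./results/bluebench
```

Note that the directory passed to --output_path is the same one unitxt-summarize reads from, so the two commands must agree on it.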