Commit: update tasks
src/display/about.py  CHANGED  (+20 -12)
@@ -16,10 +16,10 @@ class Tasks(Enum):
 
     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
-    task2 = Task("
-    task3 = Task("
-    task4 = Task("
-    task5 = Task("
+    task2 = Task("writing_prompts", "coherence", "Writing Prompts")
+    task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
+    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
+    task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")
 
 
 # Your leaderboard name
@@ -35,20 +35,28 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Tasks
-1. FinanceBench
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task
+measures the ability to answer financial questions given the retrieved context from a document and a question. We do
+not evaluate the retrieval capabilities for this task. We evaluate the accuracy of the answers. The dataset can be
+found at https://huggingface.co/datasets/PatronusAI/financebench.
 
-2. Legal Confidentiality:
-
+2. Legal Confidentiality: We use a subset of 100 labelled prompts from LegalBench (Guha, et al. LegalBench: A
+Collaboratively Built Benchmark for Measuring Legal Reasoning in \
+Large Language Models) to measure the ability of LLMs to reason over legal causes. The model is prompted to return \
+yes/no as an answer to the question.
 
-3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the
+engagingness of the text generated by the LLM.
 
 4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question
-given some product information and conversational history.
+given some product information and conversational history. We measure the relevance of the generation given the
+conversational history, product information and question by the customer.
 
-5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information
-
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from
+LLMs. We measure if the model generates toxic content.
 
-6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts to elicit
+business-sensitive information from LLMs. We measure if the model generates business sensitive information.
 
 ## Reproducibility
 All of our datasets are closed-source. We provide a validation set with 5 examples for each of the tasks.
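
As a quick illustration of the reproducibility note above, the public FinanceBench data linked in the task description can be pulled with the `datasets` library, and a simple exact-match check could stand in for the accuracy metric used by FinanceBench or the yes/no Legal Confidentiality answers. This is a hedged sketch: the split and column layout of `PatronusAI/financebench` are not specified in this diff, so the code only inspects them rather than assuming a schema, and `exact_match_accuracy` is an illustrative stand-in, not the leaderboard's scoring code.

```python
from datasets import load_dataset

# Public FinanceBench data referenced in the task description above.
# Split names and columns are not given in this diff, so inspect them first.
dataset = load_dataset("PatronusAI/financebench")
print(dataset)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Case-insensitive exact match; a stand-in for the leaderboard's accuracy metric."""
    matches = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical usage once model outputs are aligned with gold answers,
# e.g. the yes/no answers required by the Legal Confidentiality task:
# score = exact_match_accuracy(model_answers, gold_answers)
```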
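
The `Task(...)` entries in the first hunk follow the stock Hugging Face leaderboard template, where `Task` is a small dataclass carrying the key of the task in the results file, the metric key to read, and the column name shown on the leaderboard. The sketch below restates the updated enum against such a dataclass so the three fields are explicit; the `Task` definition and the `leaderboard_columns` helper are assumptions based on that template, not part of this commit.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str  # key of the task in the results JSON
    metric: str     # key of the metric to read for that task
    col_name: str   # column name shown on the leaderboard


class Tasks(Enum):
    # task key, metric key, display name (mirrors the updated hunk above)
    task0 = Task("finance_bench", "accuracy", "FinanceBench")
    task1 = Task("legal_confidentiality", "accuracy", "Legal Confidentiality")
    task2 = Task("writing_prompts", "coherence", "Writing Prompts")
    task3 = Task("customer_support", "engagement", "Customer Support Dialogue")
    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
    task5 = Task("enterprise_pii", "accuracy", "Enterprise PII")


# Hypothetical helper: turn the enum into leaderboard column names.
def leaderboard_columns() -> list[str]:
    return [task.value.col_name for task in Tasks]


if __name__ == "__main__":
    print(leaderboard_columns())
    # ['FinanceBench', 'Legal Confidentiality', 'Writing Prompts',
    #  'Customer Support Dialogue', 'Toxic Prompts', 'Enterprise PII']
```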