update about
src/display/about.py CHANGED (+7 -7)
@@ -17,7 +17,7 @@ class Tasks(Enum):
     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
     task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
-
+    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
     task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
     task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")
 
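The hunk above registers a sixth task, task3, in the Tasks enum. The Task container itself is not part of this diff; in the stock Hugging Face leaderboard template it is a small dataclass carrying the benchmark key, the metric to read from the results files, and the column title shown in the table. A minimal sketch under that assumption, showing the enum as it stands after this commit:

# Sketch only; assumes the standard leaderboard-template layout. The Task
# dataclass is defined elsewhere in the Space, not in this hunk.
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str   # key used to look the task up in the results files
    metric: str      # metric reported for that task
    col_name: str    # column header shown in the leaderboard table

class Tasks(Enum):
    task0 = Task("finance_bench", "accuracy", "FinanceBench")
    task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
    task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")  # added in this commit
    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
    task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")

# Leaderboard columns are typically derived by iterating the enum,
# so adding task3 here is what surfaces the new column in the table:
cols = [task.value.col_name for task in Tasks]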
@@ -35,19 +35,19 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Tasks
-1.
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task measures the ability to answer financial questions given the retrieved context from a document and a question. We do not evaluate the retrieval capabilities for this task; we only evaluate the accuracy of the answers. The dataset can be
 found at https://huggingface.co/datasets/PatronusAI/financebench.
 
-2.
+2. Legal Confidentiality: We use a subset of 100 labeled prompts from LegalBench (Guha, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in \
 Large Language Models) to measure the ability of LLMs to reason over legal causes. The model is prompted to return yes/no as an answer to the question.
 
-3.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the engagingness of the text generated by the LLM. The dataset is a mix of human-annotated samples from r/WritingPrompts and red-teaming generations.
 
-4.
+4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question given some product information and conversational history. We measure the relevance of the generation given the conversational history, the product information, and the customer's question.
 
-5.
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from LLMs. We measure whether the model generates toxic content.
 
-6.
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts that try to elicit business-sensitive information from LLMs. We measure whether the model generates business-sensitive information.
 
 ## What is Patronus AI?
 
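For context, the LLM_BENCHMARKS_TEXT f-string edited in this hunk is the Markdown the Space shows on its About tab. A hypothetical sketch of how an app.py might consume it, assuming the usual Gradio leaderboard layout; none of this app code appears in the commit:

import gradio as gr

from src.display.about import LLM_BENCHMARKS_TEXT  # the f-string edited in this commit

demo = gr.Blocks()
with demo:
    with gr.Tab("About"):
        # The task list above is rendered verbatim as Markdown.
        gr.Markdown(LLM_BENCHMARKS_TEXT)

if __name__ == "__main__":
    demo.launch()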