update about
src/display/about.py CHANGED (+7 -7)
@@ -17,7 +17,7 @@ class Tasks(Enum):
     task0 = Task("finance_bench", "accuracy", "FinanceBench")
     task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
     task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
-
+    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")
     task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
     task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")
 
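The hunk above registers a sixth task, task3, in the Tasks enum. The Task container itself is not part of this diff; in the stock Hugging Face leaderboard template it is a small dataclass carrying the benchmark key, the metric to read from the results files, and the column title shown in the table. A minimal sketch under that assumption, showing the enum as it stands after this commit:

# Sketch only; assumes the standard leaderboard-template layout. The Task
# dataclass is defined elsewhere in the Space, not in this hunk.
from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str   # key used to look the task up in the results files
    metric: str      # metric reported for that task
    col_name: str    # column header shown in the leaderboard table

class Tasks(Enum):
    task0 = Task("finance_bench", "accuracy", "FinanceBench")
    task1 = Task("legal_confidentiality", "exact_match", "Legal Confidentiality")
    task2 = Task("writing_prompts", "engagingness", "Writing Prompts")
    task3 = Task("customer_support_dialogue", "relevance", "Customer Support Dialogue")  # added in this commit
    task4 = Task("toxic_prompts", "toxicity", "Toxic Prompts")
    task5 = Task("enterprise_pii", "enterprise_pii", "Enterprise PII")

# Leaderboard columns are typically derived by iterating the enum,
# so adding task3 here is what surfaces the new column in the table:
cols = [task.value.col_name for task in Tasks]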
@@ -35,19 +35,19 @@ LLM_BENCHMARKS_TEXT = f"""
 ## How it works
 
 ## Tasks
-1.
+1. FinanceBench (Islam, Pranab, et al. "FinanceBench: A New Benchmark for Financial Question Answering."): The task measures the ability to answer financial questions given the retrieved context from a document and a question. We do not evaluate the retrieval capabilities for this task; we only evaluate the accuracy of the answers. The dataset can be
 found at https://huggingface.co/datasets/PatronusAI/financebench.
 
-2.
+2. Legal Confidentiality: We use a subset of 100 labeled prompts from LegalBench (Guha, et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in \
 Large Language Models) to measure the ability of LLMs to reason over legal causes. The model is prompted to return yes/no as an answer to the question.
 
-3.
+3. Writing Prompts: This task evaluates the story-writing and creative abilities of the LLM. We measure the engagingness of the text generated by the LLM. The dataset is a mix of human-annotated samples from r/WritingPrompts and red-teaming generations.
 
-4.
+4. Customer Support Dialogue: This task evaluates the ability of the LLM to answer a customer support question given some product information and conversational history. We measure the relevance of the generation given the conversational history, the product information, and the customer's question.
 
-5.
+5. Toxic Prompts: This task evaluates the safety of the model by using prompts that can elicit harmful information from LLMs. We measure whether the model generates toxic content.
 
-6.
+6. Enterprise PII: This task evaluates the business safety of the model by using prompts that try to elicit business-sensitive information from LLMs. We measure whether the model generates business-sensitive information.
 
 ## What is Patronus AI?
 
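For context, the LLM_BENCHMARKS_TEXT f-string edited in this hunk is the Markdown the Space shows on its About tab. A hypothetical sketch of how an app.py might consume it, assuming the usual Gradio leaderboard layout; none of this app code appears in the commit:

import gradio as gr

from src.display.about import LLM_BENCHMARKS_TEXT  # the f-string edited in this commit

demo = gr.Blocks()
with demo:
    with gr.Tab("About"):
        # The task list above is rendered verbatim as Markdown.
        gr.Markdown(LLM_BENCHMARKS_TEXT)

if __name__ == "__main__":
    demo.launch()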