Commit 7f8ca6a
1 Parent(s): fd4f913

Adjust description for TruthfulQA

content.py +6 -2
content.py CHANGED

@@ -1,4 +1,7 @@
 CHANGELOG_TEXT = f"""
+## [2023-06-13]
+- Adjust description for TruthfulQA
+
 ## [2023-06-12]
 - Add Human & GPT-4 Evaluations

@@ -34,7 +37,8 @@ CHANGELOG_TEXT = f"""
 - Display different queues for jobs that are RUNNING, PENDING, FINISHED status

 ## [2023-05-15]
-- Fix a typo: from "TruthQA" to "TruthfulQA"
+- Fix a typo: from "TruthQA" to "Truthful
+QA"

 ## [2023-05-10]
 - Fix a bug that prevented auto-refresh

@@ -58,7 +62,7 @@ Evaluation is performed against 4 popular benchmarks:
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
-- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a
+- <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online.

 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 """
|