src/about.py CHANGED (+13 -2)
@@ -41,9 +41,11 @@ TITLE = """<h1 align="center" id="space-title">Open PL LLM Leaderboard (0-shot a
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-
+The leaderboard evaluates language models on a set of Polish tasks. The tasks are designed to test the models' ability to understand and generate Polish text. The leaderboard serves as a benchmark for the Polish language model community and helps researchers and practitioners understand the capabilities of different models.
 
-
+Almost every task has two versions: regex and multiple choice. The regex version is scored by exact match, while the multiple choice version is scored by accuracy.
+* _g suffix means that a model needs to generate an answer (only suitable for instruction-based models)
+* _mc suffix means that a model is scored against every possible class (also suitable for base models)
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
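The _g / _mc distinction added in this hunk describes two scoring modes. As a minimal sketch only (the diff shows neither the leaderboard's evaluation harness nor its answer-extraction regexes, so the function names and the `[A-D]` pattern below are assumptions), the two modes could look like this:

```python
import re

def score_generative(prediction: str, gold: str, pattern: str = r"[A-D]") -> bool:
    # _g tasks: the model generates free text; an answer is extracted with a
    # regex and compared to the gold label by exact match.
    # NOTE: r"[A-D]" is an assumed placeholder, not the leaderboard's actual regex.
    match = re.search(pattern, prediction)
    return match is not None and match.group(0) == gold

def score_multiple_choice(class_logprobs: dict[str, float], gold: str) -> bool:
    # _mc tasks: every candidate class is scored (e.g. by the log-likelihood of
    # the class continuation) and the best-scoring class is compared to the gold
    # label, so no free-form generation is required and base models can be scored.
    predicted = max(class_logprobs, key=class_logprobs.get)
    return predicted == gold

# Toy usage with made-up model outputs:
print(score_generative("The answer is B.", "B"))                      # True
print(score_multiple_choice({"A": -4.2, "B": -1.3, "C": -5.0}, "B"))  # True
```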
@@ -54,6 +56,15 @@ Contact with me: [LinkedIn](https://www.linkedin.com/in/wrobelkrzysztof/)
 
 or join our [Discord SpeakLeash](https://discord.gg/3G9DVM39)
 
+## TODO
+
+* change metrics for DYK, PSC, CBD(?)
+* fix names of our models
+* add inference time
+* add metadata for models (e.g. #Params)
+* add more tasks
+* add baselines
+
 ## Evaluation metrics
 
 - **belebele_pol_Latn**: accuracy
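The metric reported for belebele_pol_Latn is plain accuracy. Under the same toy assumptions as the sketch above (the helper name is hypothetical), the aggregation is simply:

```python
def accuracy(per_example_correct: list[bool]) -> float:
    # Accuracy = correct examples / all examples (0.0 guards the empty case).
    return sum(per_example_correct) / len(per_example_correct) if per_example_correct else 0.0
```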