mfajcik committed on
Commit 5f940b0 · verified · 1 Parent(s): f09348d

Update content.py

Files changed (1)
  1. content.py +2 -1
content.py CHANGED
@@ -19,7 +19,8 @@ Here, you can compare models on tasks in the Czech language or submit your own m
  - On the submission page, __you can view your model's results on the leaderboard without publishing them__.
  - The first step is "pre-submission." After this is complete (significance tests may take up to 2 hours), you can choose to submit the results if you wish.
  - NEWS:
- - 16.04.2025: We changed the way statistical significance checking works in BCM! From now on, our statistical evaluation guarantees at most 5% [False Discovery Rate](https://en.wikipedia.org/wiki/False_discovery_rate) when counting number of discoveries (wins) in Duel Win Score.
+ - 02.05.2025: Our work was accepted into [TACL](https://transacl.org/index.php/tacl) journal! Arxiv now contains TACL camera-ready version.
+ - 16.04.2025: We added the way statistical significance checking works in BCM! From now on, our statistical evaluation provides guarantees at most 5% [False Discovery Rate](https://en.wikipedia.org/wiki/False_discovery_rate) when counting number of discoveries (wins) in Duel Win Score (check FDR Guarantees button).
  - 19.02.2025: We added a performance-size plot under the Table for better overview! Scroll down to find out, which model works the best for it's size!
  - 23.12.2024: We released [a preprint](http://arxiv.org/abs/2412.17933) detailing our work.
  - 7.11.2024: We acknowledge that one of the Qwen2.5 models correctly predicted our (& Bigbench's) canary string. This confirms the contamination, it was trained on benchmark data. Other [studies](https://arxiv.org/pdf/2409.01790) also suggest the contamination issues of the Qwen family.
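
The 16.04.2025 entry refers to a guarantee of at most 5% False Discovery Rate over the counted wins. The diff does not say which procedure the leaderboard uses to achieve this. As a minimal sketch of what FDR control can look like, the snippet below implements the Benjamini-Hochberg step-up procedure, a standard method that bounds the FDR at `alpha` for independent or positively dependent tests. Using this procedure here is an assumption, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Hypothetical helper for illustration only; not taken from content.py.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per pairwise model comparison.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(example_p_values, alpha=0.05)
```

Only comparisons flagged `True` would count as discoveries. Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the ranking meets its threshold.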