[FLAG] contamination MATH `bond005/meno-tiny-0.1`
#1094, opened by fblgit
@bond005
@clefourrier
I believe that OpenMathInstruct-2 is part of the training data for this model, and that dataset unfortunately seems to be contaminated.
Regardless of which dataset was used in training, the contamination of the model weights is a fact. TBH, the numbers & samples involved are the same as in the previous case.
According to the contamination benchmarks:
- ~200 MATH tests were EXTRA contaminated
- ~35 MATH_HARD tests were EXTRA contaminated
Contamination tests for the base model (Qwen2.5-1.5B-Instruct):
MATH_rewritten-test-1 5_gram_accuracy: 0.25320000000000004
MATH_rewritten-test-2 5_gram_accuracy: 0.2690666666666667
MATH_rewritten-test-3 5_gram_accuracy: 0.2692
orgn-MATH-test 5_gram_accuracy: 0.27053333333333335
ngram acc of Qwen2.5-1.5B-Instruct
MATH_rewritten-test-1: 0.25320000000000004
MATH_rewritten-test-2: 0.2690666666666667
MATH_rewritten-test-3: 0.2692
orgn-MATH-test: 0.27053333333333335
...
GSM8K_rewritten-test-1 5_gram_accuracy: 0.21971190295678544
GSM8K_rewritten-test-2 5_gram_accuracy: 0.2227445034116755
GSM8K_rewritten-test-3 5_gram_accuracy: 0.2172858225928734
orgn-GSM8K-test 5_gram_accuracy: 0.23290371493555728
GSM8K_rewritten-test-1: 0.21971190295678544
GSM8K_rewritten-test-2: 0.2227445034116755
GSM8K_rewritten-test-3: 0.2172858225928734
orgn-GSM8K-test: 0.23290371493555728
Contamination tests for this model:
MATH_rewritten-test-1 5_gram_accuracy: 0.3384666666666667
MATH_rewritten-test-2 5_gram_accuracy: 0.3502666666666667
MATH_rewritten-test-3 5_gram_accuracy: 0.3504666666666667
orgn-MATH-test 5_gram_accuracy: 0.3519333333333334
ngram acc of meno-tiny-0.1
MATH_rewritten-test-1: 0.3384666666666667
MATH_rewritten-test-2: 0.3502666666666667
MATH_rewritten-test-3: 0.3504666666666667
orgn-MATH-test: 0.3519333333333334
...
GSM8K_rewritten-test-1 5_gram_accuracy: 0.23320697498104626
GSM8K_rewritten-test-2 5_gram_accuracy: 0.2400303260045489
GSM8K_rewritten-test-3 5_gram_accuracy: 0.23290371493555728
orgn-GSM8K-test 5_gram_accuracy: 0.26277482941622443
GSM8K_rewritten-test-1: 0.23320697498104626
GSM8K_rewritten-test-2: 0.2400303260045489
GSM8K_rewritten-test-3: 0.23290371493555728
orgn-GSM8K-test: 0.26277482941622443
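To make the comparison between the two runs easier to read, here is a small sketch that subtracts the base-model scores from the meno-tiny-0.1 scores. The numbers are copied from the logs above; the variable names are my own and are not part of benbench.

```python
# 5-gram accuracy values copied from the logs above.
base = {  # Qwen2.5-1.5B-Instruct
    "MATH_rewritten-test-1": 0.25320000000000004,
    "MATH_rewritten-test-2": 0.2690666666666667,
    "MATH_rewritten-test-3": 0.2692,
    "orgn-MATH-test": 0.27053333333333335,
    "GSM8K_rewritten-test-1": 0.21971190295678544,
    "GSM8K_rewritten-test-2": 0.2227445034116755,
    "GSM8K_rewritten-test-3": 0.2172858225928734,
    "orgn-GSM8K-test": 0.23290371493555728,
}
model = {  # meno-tiny-0.1
    "MATH_rewritten-test-1": 0.3384666666666667,
    "MATH_rewritten-test-2": 0.3502666666666667,
    "MATH_rewritten-test-3": 0.3504666666666667,
    "orgn-MATH-test": 0.3519333333333334,
    "GSM8K_rewritten-test-1": 0.23320697498104626,
    "GSM8K_rewritten-test-2": 0.2400303260045489,
    "GSM8K_rewritten-test-3": 0.23290371493555728,
    "orgn-GSM8K-test": 0.26277482941622443,
}

# Absolute increase of the fine-tuned model over the base model, per split.
deltas = {name: model[name] - base[name] for name in base}

for name, delta in deltas.items():
    print(f"{name:24s} base={base[name]:.4f} model={model[name]:.4f} delta={delta:+.4f}")
```

On these numbers, every MATH split rises by roughly 0.08, while the GSM8K splits rise by roughly 0.01 to 0.03.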
The reproduction is simple:
- https://github.com/GAIR-NLP/benbench
- modify the src/script to use the model, and set the test to `math` or `gsm8k`
- run and get the results
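For readers unfamiliar with the `5_gram_accuracy` figures above, here is a simplified sketch of the idea as I understand it: give the model a prefix of a benchmark sample and check whether it reproduces the next 5 tokens exactly. This is not benbench's actual code. `ngram_accuracy`, `predict_next`, and the way start positions are chosen are all assumptions made for illustration, and the toy predictors stand in for a real model.

```python
from typing import Callable, List, Sequence

def ngram_accuracy(
    samples: Sequence[Sequence[str]],
    predict_next: Callable[[Sequence[str], int], Sequence[str]],  # hypothetical model hook
    n: int = 5,
    num_starts: int = 5,
) -> float:
    """Fraction of probes where the predicted n tokens exactly match the sample."""
    hits = 0
    total = 0
    for tokens in samples:
        if len(tokens) <= n:
            continue  # too short to hold a non-empty prefix plus an n-token target
        max_start = len(tokens) - n
        # Evenly spaced start positions (an assumption, not benbench's scheme).
        starts = sorted(
            {1 + (i * (max_start - 1)) // max(num_starts - 1, 1) for i in range(num_starts)}
        )
        for start in starts:
            prefix = tokens[:start]
            target = list(tokens[start:start + n])
            if list(predict_next(prefix, n)) == target:
                hits += 1
            total += 1
    return hits / total if total else 0.0

# Toy usage: one "benchmark sample" and two stand-in predictors.
sample: List[str] = list("abcdefghijklmnop")

def memorizer(prefix: Sequence[str], n: int) -> Sequence[str]:
    # Behaves like a model that has memorized the sample verbatim.
    return sample[len(prefix):len(prefix) + n]

def clueless(prefix: Sequence[str], n: int) -> Sequence[str]:
    # Behaves like a model that never reproduces the sample.
    return ["?"] * n

memorized_score = ngram_accuracy([sample], memorizer)
clueless_score = ngram_accuracy([sample], clueless)
```

In this toy setup the memorizing predictor scores 1.0 and the other scores 0.0; a real model falls somewhere in between, which is why the scores above are compared against the base model rather than read in isolation.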