open-llm-leaderboard/open_llm_leaderboard · [FLAG] contamination MATH `bond005/meno-tiny-0.1`

@bond005 @clefourrier
I believe OpenMathInstruct-2 is part of the training data for this model, a dataset that unfortunately seems to be contaminated.

Regardless of which dataset was used in training, the contamination of the model weights is a fact. But TBH, the numbers and samples involved are the same as in the previous case.

According to the contamination benchmarks:

~200 MATH tests were EXTRA contaminated
~35 MATH_HARD tests were EXTRA contaminated

Contamination tests for the base model (Qwen2.5-1.5B-Instruct):

MATH_rewritten-test-1 5_gram_accuracy:  0.25320000000000004
MATH_rewritten-test-2 5_gram_accuracy:  0.2690666666666667
MATH_rewritten-test-3 5_gram_accuracy:  0.2692
orgn-MATH-test 5_gram_accuracy:  0.27053333333333335
ngram acc of Qwen2.5-1.5B-Instruct
MATH_rewritten-test-1: 0.25320000000000004
MATH_rewritten-test-2: 0.2690666666666667
MATH_rewritten-test-3: 0.2692
orgn-MATH-test: 0.27053333333333335
...
GSM8K_rewritten-test-1 5_gram_accuracy:  0.21971190295678544
GSM8K_rewritten-test-2 5_gram_accuracy:  0.2227445034116755
GSM8K_rewritten-test-3 5_gram_accuracy:  0.2172858225928734
orgn-GSM8K-test 5_gram_accuracy:  0.23290371493555728
GSM8K_rewritten-test-1: 0.21971190295678544
GSM8K_rewritten-test-2: 0.2227445034116755
GSM8K_rewritten-test-3: 0.2172858225928734
orgn-GSM8K-test: 0.23290371493555728

Contamination tests for this model:

MATH_rewritten-test-1 5_gram_accuracy:  0.3384666666666667
MATH_rewritten-test-2 5_gram_accuracy:  0.3502666666666667
MATH_rewritten-test-3 5_gram_accuracy:  0.3504666666666667
orgn-MATH-test 5_gram_accuracy:  0.3519333333333334
ngram acc of meno-tiny-0.1
MATH_rewritten-test-1: 0.3384666666666667
MATH_rewritten-test-2: 0.3502666666666667
MATH_rewritten-test-3: 0.3504666666666667
orgn-MATH-test: 0.3519333333333334
...
GSM8K_rewritten-test-1 5_gram_accuracy:  0.23320697498104626
GSM8K_rewritten-test-2 5_gram_accuracy:  0.2400303260045489
GSM8K_rewritten-test-3 5_gram_accuracy:  0.23290371493555728
orgn-GSM8K-test 5_gram_accuracy:  0.26277482941622443
GSM8K_rewritten-test-1: 0.23320697498104626
GSM8K_rewritten-test-2: 0.2400303260045489
GSM8K_rewritten-test-3: 0.23290371493555728
orgn-GSM8K-test: 0.26277482941622443
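
To make the comparison easier to read, here is a small sketch that puts the two runs side by side. Only the numbers come from the logs above; the dictionary layout and the names `scores`, `mean_rewritten`, `gap` and `lift` are mine, made up for illustration. It computes two things: how far each model's accuracy on the original test set sits above its average on the rewritten sets, and how much this model gains over the base model on the original test set.

```python
# 5-gram accuracies copied from the logs above.
# Layout: scores[model][benchmark] = (rewritten-1, rewritten-2, rewritten-3, original)
scores = {
    "Qwen2.5-1.5B-Instruct": {
        "MATH": (0.25320000000000004, 0.2690666666666667, 0.2692, 0.27053333333333335),
        "GSM8K": (0.21971190295678544, 0.2227445034116755, 0.2172858225928734, 0.23290371493555728),
    },
    "meno-tiny-0.1": {
        "MATH": (0.3384666666666667, 0.3502666666666667, 0.3504666666666667, 0.3519333333333334),
        "GSM8K": (0.23320697498104626, 0.2400303260045489, 0.23290371493555728, 0.26277482941622443),
    },
}


def mean_rewritten(model: str, bench: str) -> float:
    """Average 5-gram accuracy over the three rewritten test sets."""
    return sum(scores[model][bench][:3]) / 3


def gap(model: str, bench: str) -> float:
    """Original-test accuracy minus the rewritten-set average."""
    return scores[model][bench][3] - mean_rewritten(model, bench)


def lift(bench: str) -> float:
    """Gain of meno-tiny-0.1 over the base model on the original test set."""
    return scores["meno-tiny-0.1"][bench][3] - scores["Qwen2.5-1.5B-Instruct"][bench][3]


for bench in ("MATH", "GSM8K"):
    print(f"{bench}: lift over base = {lift(bench):.4f}")
    for model in scores:
        print(f"  {model}: original minus rewritten mean = {gap(model, bench):.4f}")
```

This is only arithmetic on the reported values, not a significance test. It shows where the differences sit, nothing more.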

Reproducing this is simple:

Clone https://github.com/GAIR-NLP/benbench
Modify the src/script to use this model, and set the test to math or gsm8k
Run it and collect the results
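
For anyone unfamiliar with what the `5_gram_accuracy` numbers measure, here is a rough sketch of the idea as I understand it: cut each test sample at a few positions, ask the model to continue from the prefix, and count how often the next n tokens match the original text exactly. This is not the benbench code. The function `ngram_accuracy`, the `predict` callable, the defaults `n=5` and `k=5`, and the way start positions are spaced are all assumptions for illustration, and the toy "models" below stand in for a real LLM with greedy decoding.

```python
from typing import Callable, List, Sequence


def ngram_accuracy(
    samples: Sequence[Sequence[str]],
    predict: Callable[[Sequence[str], int], Sequence[str]],
    n: int = 5,
    k: int = 5,
) -> float:
    """Fraction of probed n-grams that `predict` reproduces exactly.

    `predict(prefix, n)` is a hypothetical hook: it should return the
    model's next `n` tokens given `prefix`.
    """
    hits = 0
    total = 0
    for tokens in samples:
        if len(tokens) <= n:
            continue  # too short to leave both a prefix and an n-gram
        last_start = len(tokens) - n
        # Up to k roughly evenly spaced cut points; the set removes
        # duplicates that appear when the sample is short.
        starts = sorted({max(1, round(i * last_start / k)) for i in range(1, k + 1)})
        for s in starts:
            predicted = list(predict(tokens[:s], n))
            hits += predicted == list(tokens[s:s + n])
            total += 1
    return hits / total if total else 0.0


# Toy demonstration with whitespace tokens instead of a real tokenizer.
memorized: List[str] = "the answer is forty two because six times seven".split()


def parrot(prefix: Sequence[str], n: int) -> Sequence[str]:
    """Stand-in for a model that has memorized the sample verbatim."""
    return memorized[len(prefix):len(prefix) + n]


def clueless(prefix: Sequence[str], n: int) -> Sequence[str]:
    """Stand-in for a model that has never seen the sample."""
    return ["?"] * n


acc_parrot = ngram_accuracy([memorized], parrot)
acc_clueless = ngram_accuracy([memorized], clueless)
print(acc_parrot, acc_clueless)
```

The higher this score is on the original test set, the more of that text the model can reproduce verbatim, which is why it is used as a contamination signal.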