ControlLLM
/

Control-LLM-Llama3.1-8B-Math16-Instruct

Text Generation

Model card Files Files and versions Community

Control-LLM-Llama3.1-8B-Math16-Instruct / README.md

hawei_LinkedIn

update explanation of benchmark result table

63e078d 3 months ago

|

4 kB

	---
	license: llama3.1
	datasets:
	- nvidia/OpenMathInstruct-2
	language:
	- en
	base_model:
	- meta-llama/Llama-3.1-8B-Instruct
	model-index:
	- name: Control-LLM-Llama3.1-8B-Math16
	results:
	- task:
	type: math-evaluation
	dataset:
	type: parquet
	name: Math, Math Hard, GSM8K
	dataset_kwargs:
	data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
	metrics:
	- name: exact_match,none
	type: exact_match
	value: 0.6205678398534606
	stderr: 0.005249520342473376
	verified: false
	- name: exact_match,none (gsm8k_0shot_instruct)
	type: exact_match
	value: 0.8968915845337376
	stderr: 0.008376436987507811
	verified: false
	- name: exact_match,none (meta_math_0shot_instruct)
	type: exact_match
	value: 0.6166
	stderr: 0.006876797660918556
	verified: false
	- name: exact_match,none (meta_math_hard_0shot_instruct)
	type: exact_match
	value: 0.36027190332326287
	stderr: 0.013198755610252931
	verified: false
	- task:
	type: original-capability
	dataset:
	type: meta/Llama-3.1-8B-Instruct-evals
	name: Llama-3.1-8B-Instruct-evals Dataset
	dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
	dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
	metrics:
	- name: exact_match,strict-match
	type: exact_match
	value: 0.6001372485281902
	stderr: 0.002821514831773572
	verified: false
	- name: exact_match,strict-match (meta_arc_0shot_instruct)
	type: exact_match
	value: 0.8248927038626609
	stderr: 0.011139722235859526
	verified: false
	- name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
	type: exact_match
	value: 0.3080357142857143
	stderr: 0.021836780796366417
	verified: false
	- name: exact_match,strict-match (meta_mmlu_0shot_instruct)
	type: exact_match
	value: 0.7159948725252813
	stderr: 0.00380556397209409
	verified: false
	- name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
	type: exact_match
	value: 0.45403922872340424
	stderr: 0.004539171007529716
	verified: false
	---
	# Control-LLM-Llama3.1-8B-Math16
	This is a fine-tuned model of Llama-3.1-8B-Instruct for mathematical tasks on OpenMath2 dataset.

	## Evaluation Results
	Here is an overview of the evaluation results and findings:

	### Benchmark Results Table
	The table below summarizes evaluation results across mathematical tasks and original capabilities.

	\| Model \| MH \| M \| G8K \| M-Avg \| ARC \| GPQA \| MLU \| MLUP \| O-Avg \| Overall \|
	\|-------------------\|--------\|--------\|---------\|-----------\|---------\|----------\|---------\|----------\|-----------\|-------------\|
	\| Llama3.1-8B-Inst \| 23.7 \| 50.9 \| 85.6 \| 52.1 \| 83.4 \| 29.9 \| 72.4 \| 46.7 \| 60.5 \| 56.3 \|
	\| Control LLM* \| 36.0 \| 61.7 \| 89.7\| 62.5 \| 82.5 \| 30.8 \| 71.6\| 45.4 \| 57.6 \| 60.0 \|

	---
	### Explanation:
	- MH: MathHard
	- M: Math
	- G8K: GSM8K
	- M-Avg: Math - Average across MathHard, Math, and GSM8K
	- ARC: ARC benchmark
	- GPQA: General knowledge QA
	- MLU: MMLU (Massive Multitask Language Understanding)
	- MLUP: MMLU Pro
	- O-Avg: Original Capability - Average across ARC, GPQA, MMLU, and MLUP
	- Overall: Combined average across all tasks

	### Catastrophic Forgetting on OpenMath
	The following plot illustrates and compares catastrophic forgetting mitigation during training

	![Catastrophic Forgetting](plots/ControlLLM_CF_Plot_Math.png)

	### Alignment Result
	The plot below highlights the alignment result of the model trained with Control LLM.

	![Alignment](plots/alignment_best.png)