---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
pipeline_tag: text-generation
---
# Overview
This model card presents the evaluation results of `DeepSeek-R1-Distill-Llama-70B` **quantized to 4-bit with GPTQ**, measured with the **Language Model Evaluation Harness** on the **ARC-Challenge** benchmark.
---
## 📊 Evaluation Summary
| **Metric** | **4-bit (this model)** | **Description** | **8-bit (comparison)** |
|----------------------|-----------|-----------------|-----------|
| **Accuracy (acc,none)** | `21.2%` | Raw accuracy - percentage of correct answers. | `21.2%` |
| **Standard Error (acc_stderr,none)** | `1.19%` | Uncertainty in the accuracy estimate. | `1.2%` |
| **Normalized Accuracy (acc_norm,none)** | `25.4%` | Accuracy after dataset-specific normalization. | `25.2%` |
| **Standard Error (acc_norm_stderr,none)** | `1.27%` | Uncertainty for normalized accuracy. | `1.3%` |
📌 **Interpretation:**
- The model correctly answered **21.2% of the questions**.
- After **normalization**, accuracy improves to **25.4%**.
- The **standard error (~1.27%)** indicates a small margin of uncertainty; a quick confidence-interval check is sketched below.
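For context, the reported standard error can be turned into an approximate 95% confidence interval. A minimal sketch (the interval below is computed here, not taken from the original evaluation run):

```python
# Approximate 95% confidence interval: estimate ± 1.96 * standard error.
acc_norm = 0.254   # normalized accuracy reported above
stderr = 0.0127    # reported standard error for acc_norm

low, high = acc_norm - 1.96 * stderr, acc_norm + 1.96 * stderr
print(f"acc_norm ~ {acc_norm:.1%}, 95% CI ~ [{low:.1%}, {high:.1%}]")
# -> acc_norm ~ 25.4%, 95% CI ~ [22.9%, 27.9%]
```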
---
## ⚙️ Model Configuration
- **Model:** `DeepSeek-R1-Distill-Llama-70B`
- **Parameters:** `70 billion`
- **Quantization:** `4-bit GPTQ`
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
- **Evaluation Time:** `365.89 seconds (~6 minutes)`
📌 **Interpretation:**
- The evaluation was performed on a **high-performance GPU (A100 80GB)**.
- The model is significantly larger than the 8B distilled variant, with **GPTQ 4-bit quantization reducing the memory footprint** (a loading sketch follows this list).
- A **single-sample batch size** was used, which may slow down evaluation.
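The configuration above corresponds to a standard Hugging Face setup. A minimal loading sketch, assuming the 4-bit GPTQ weights are published on the Hub (the repo id below is a placeholder, not the actual checkpoint) and that `optimum` plus a GPTQ backend such as `auto-gptq` are installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual 4-bit GPTQ checkpoint.
model_id = "your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the GPTQ quantization config stored with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the precision reported above
    device_map="auto",          # place layers on the available GPU(s)
)
```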
---
## 📂 Dataset Information
- **Dataset:** `AI2 ARC-Challenge`
- **Task Type:** `Multiple Choice`
- **Number of Samples Evaluated:** `1,172`
- **Few-shot Examples Used:** `0` (Zero-shot setting)
📌 **Interpretation:**
- This benchmark assesses **grade-school-level scientific reasoning**.
- Since **no few-shot examples** were provided, the model was evaluated in a **pure zero-shot setting** (a reproduction sketch follows this list).
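A reproduction sketch using the EleutherAI Language Model Evaluation Harness (`lm-eval`); the repo id is a placeholder, and the exact API may differ slightly between harness versions:

```python
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face backend, as reported in the configuration above
    model_args="pretrained=your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,  # zero-shot, as in the run reported here
    batch_size=1,   # single-sample batches, as in the run reported here
)
print(results["results"]["arc_challenge"])  # acc, acc_stderr, acc_norm, ...
```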
---
## 📈 Performance Insights
- The `"higher_is_better"` flag confirms that **higher accuracy is preferred**.
- The model's **raw accuracy (21.2%)** is significantly lower than that of state-of-the-art models (**60–80%** on ARC-Challenge).
- **Quantization Impact:** The **4-bit GPTQ quantization** reduces memory usage but may also impact accuracy slightly.
- **Zero-shot Limitation:** Performance could improve with **few-shot prompting** (providing examples before testing), as in the sketch after this list.
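For example, the zero-shot call above could be rerun with in-context examples; 25-shot is a common ARC-Challenge setting (used, for instance, by the Open LLM Leaderboard). A hypothetical variant:

```python
# Same setup as the zero-shot sketch above, with 25 in-context examples.
results_25shot = simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/DeepSeek-R1-Distill-Llama-70B-GPTQ-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=1,
)
```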
---
## 📊 Detailed Evaluation on MMLU Challenges
| **Metric** | **Value** | **Description** |
|----------------------|-----------|-----------------|
| **MMLU** | `37.88%` | Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other |
| **MMLU-Humanities** | `31.83%` | Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence |
| **MMLU-Social-Sciences** | `45.43%` | Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology |
| **MMLU-Stem** | `33.01%` | Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics |
| **MMLU-Other** | `44.48%` | Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management |
📌 Let us know if you need further analysis or model tuning! 🚀