Overview

This document presents evaluation results for a 4-bit GPTQ-quantized build of DeepSeek-R1-Distill-Qwen-32B, evaluated with the Language Model Evaluation Harness on the ARC-Challenge and MMLU benchmarks.


📊 Evaluation Summary

| Metric | Value | Description |
|---|---|---|
| ARC-Challenge | 41.04% | Raw |
| MMLU | 29.74% | Averaged over MMLU-Stem, MMLU-Social-Sciences, MMLU-Humanities, MMLU-Other |
| MMLU-Humanities | 32.05% | Averaged over MMLU-Formal-Logic, MMLU-Prehistory, MMLU-World-Religions, MMLU-Philosophy, MMLU-High-School-World-History, MMLU-Professional-Law, MMLU-High-School-US-History, MMLU-Logical-Fallacies, MMLU-International-Law, MMLU-High-School-European-History, MMLU-Moral-Disputes, MMLU-Moral-Scenarios, MMLU-Jurisprudence |
| MMLU-Social-Sciences | 30.32% | Averaged over MMLU-Public-Relations, MMLU-Sociology, MMLU-Security-Studies, MMLU-High-School-Government-and-Politics, MMLU-High-School-Psychology, MMLU-Human-Sexuality, MMLU-US-Foreign-Policy, MMLU-High-School-Microeconomics, MMLU-Econometrics, MMLU-High-School-Macroeconomics, MMLU-High-School-Geography, MMLU-Professional-Psychology |
| MMLU-Stem | 27.50% | Averaged over MMLU-Conceptual-Physics, MMLU-High-School-Chemistry, MMLU-College-Biology, MMLU-College-Chemistry, MMLU-Machine-Learning, MMLU-Elementary-Mathematics, MMLU-Abstract-Algebra, MMLU-Astronomy, MMLU-High-School-Statistics, MMLU-Anatomy, MMLU-College-Mathematics, MMLU-Computer-Security, MMLU-College-Computer-Science, MMLU-Electrical-Engineering, MMLU-College-Physics, MMLU-High-School-Computer-Science, MMLU-High-School-Physics, MMLU-High-School-Biology, MMLU-High-School-Mathematics |
| MMLU-Other | 27.94% | Averaged over MMLU-Medical-Genetics, MMLU-Global-Facts, MMLU-Marketing, MMLU-College-Medicine, MMLU-Human-Aging, MMLU-Virology, MMLU-Business-Ethics, MMLU-Clinical-Knowledge, MMLU-Professional-Medicine, MMLU-Nutrition, MMLU-Miscellaneous, MMLU-Professional-Accounting, MMLU-Management |
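
The overall MMLU score is described as an average over the four category scores. A plain (unweighted) mean of the four reported values comes to about 29.45%, not the reported 29.74%, which suggests the average is weighted by the number of questions in each category. The sketch below compares the two. The per-category question counts are assumptions that do not appear in this document, so check them against your own harness output before relying on the result.

```python
# Category accuracies (%) as reported in the summary table.
category_acc = {
    "humanities": 32.05,
    "social_sciences": 30.32,
    "stem": 27.50,
    "other": 27.94,
}

# ASSUMPTION: number of test questions per category. These counts are not
# stated in this document; replace them with the values your harness reports.
assumed_counts = {
    "humanities": 4705,
    "social_sciences": 3077,
    "stem": 3153,
    "other": 3107,
}


def unweighted_mean(acc):
    """Plain arithmetic mean of the category accuracies."""
    return sum(acc.values()) / len(acc)


def weighted_mean(acc, counts):
    """Mean of the category accuracies, weighted by question count."""
    total = sum(counts[name] for name in acc)
    return sum(acc[name] * counts[name] for name in acc) / total


print(f"unweighted:    {unweighted_mean(category_acc):.2f}%")
print(f"size-weighted: {weighted_mean(category_acc, assumed_counts):.2f}%")
```

Under the assumed counts, the size-weighted figure lands close to the reported MMLU value, while the unweighted mean does not.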

⚙️ Model Configuration

  • Model: DeepSeek-R1-Distill-Qwen-32B
  • Parameters: 32 billion (per the model name)
  • Quantization: 4-bit GPTQ
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 1780.502 seconds (~29.7 minutes)
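
A run with a comparable configuration could be launched through the harness's `lm_eval` command-line interface. The helper below is a hypothetical sketch that only assembles the command. The task names `arc_challenge` and `mmlu` and the zero-shot setting are assumptions, since this document does not record the exact invocation, and loading a GPTQ checkpoint may need additional model arguments depending on your installed backend.

```python
import shlex


def build_lm_eval_command(model_id, tasks, batch_size=1, num_fewshot=0,
                          dtype="float16"):
    """Return an argv list for the lm-evaluation-harness CLI.

    Hypothetical helper for illustration; it does not run the evaluation.
    """
    return [
        "lm_eval",
        "--model", "hf",  # Hugging Face backend, as listed under "Source"
        "--model_args", f"pretrained={model_id},dtype={dtype}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


cmd = build_lm_eval_command(
    "empirischtech/DeepSeek-R1-Distill-Qwen-32B-gptq-4bit",
    ["arc_challenge", "mmlu"],  # ASSUMPTION: harness task names
)
print(shlex.join(cmd))
```

Raising `num_fewshot` or `batch_size` in the call gives a quick way to try few-shot prompting or larger batches on the same setup.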

📌 Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is significantly larger than the previous 8B version, with GPTQ 4-bit quantization reducing memory footprint.
  • A batch size of 1 was used, which likely lengthened the evaluation compared with larger batches.
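
The memory saving from 4-bit quantization can be estimated with simple arithmetic. The sketch below is a rough lower bound for the weights only, assuming a nominal 32 billion parameters (taken from the "32B" in the model name). It ignores GPTQ's per-group scales and zero points, any layers left unquantized, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# ASSUMPTION: nominal parameter count from the "32B" in the model name.
params = 32e9

fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)

print(f"FP16 weights:  ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB")
print(f"Reduction:     {fp16_gb / int4_gb:.0f}x")
```

By this estimate the 4-bit weights fit comfortably on a single 80 GB A100, whereas the FP16 weights alone would take up most of the card.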

📈 Performance Insights

  • The harness marks accuracy with "higher_is_better", so larger values indicate better performance.
  • Quantization Impact: 4-bit GPTQ quantization reduces memory usage but may slightly lower accuracy relative to the unquantized model.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).

📌 Let us know if you need further analysis or model tuning! 🚀

