Neural Magic
Company • Verified

AI & ML interests: LLMs, optimization, compression, sparsification, quantization, pruning, distillation, NLP, CV
2:4 sparse versions of Llama-3.1, including transfer learning

- RedHatAI/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-FP8-dynamic (Text Generation • 8B • Updated • 4 • 1)
- RedHatAI/Sparse-Llama-3.1-8B-gsm8k-2of4 (Text Generation • 8B • Updated • 4 • 1)
- RedHatAI/Sparse-Llama-3.1-8B-2of4 (Text Generation • 8B • Updated • 68 • 62)
- RedHatAI/Sparse-Llama-3.1-8B-ultrachat_200k-2of4 (Text Generation • 8B • Updated • 5 • 1)
Accurate FP8 quantized models by Neural Magic, ready for use with vLLM! (A minimal usage sketch follows the list.)

- RedHatAI/Meta-Llama-3.1-405B-Instruct-FP8 (Text Generation • 406B • Updated • 710 • 31)
- RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 (Text Generation • 8B • Updated • 147k • 42)
- RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 (Text Generation • 71B • Updated • 4.7k • 50)
- RedHatAI/Phi-3-medium-128k-instruct-FP8 (Text Generation • 14B • Updated • 2 • 5)
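As a quick start for the FP8 checkpoints above (the same pattern applies to the INT8 and INT4 checkpoints further down), here is a minimal offline-inference sketch with vLLM. It assumes a recent vLLM release and an FP8-capable GPU; the prompt and sampling settings are illustrative.

```python
# Minimal vLLM offline-inference sketch for one of the FP8 checkpoints above.
# Assumes a recent vLLM install and an FP8-capable GPU; settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize FP8 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For serving, recent vLLM releases expose the same checkpoint through `vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8`, which provides an OpenAI-compatible endpoint.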
Neural Magic quantized Llama-3.1 models

- RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 (Text Generation • 71B • Updated • 4.7k • 50)
- RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 (Text Generation • 8B • Updated • 147k • 42)
- RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 (Text Generation • 11B • Updated • 204k • 32)
- RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (Text Generation • 2B • Updated • 41.6k • 30)
Accurate INT4 quantized models by Neural Magic, ready for use with vLLM!

- RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 (Text Generation • 58B • Updated • 96 • 12)
- RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 (Text Generation • 11B • Updated • 204k • 32)
- RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 (Text Generation • 2B • Updated • 41.6k • 30)
- RedHatAI/Mistral-Nemo-Instruct-2407-quantized.w4a16 (Text Generation • 3B • Updated • 247 • 4)
Papers that we're proud to integrate into our libraries

- Sparse Finetuning for Inference Acceleration of Large Language Models (Paper • 2310.06927 • Published • 15)
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (Paper • 2301.00774 • Published • 3)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models (Paper • 2203.07259 • Published • 4)
- How Well Do Sparse Imagenet Models Transfer? (Paper • 2111.13445 • Published • 1)
Explore our breakthrough in sparse fine-tuning LLMs! Our novel method maintains downstream accuracy even with >70% sparsity.

- Sparse Finetuning for Inference Acceleration of Large Language Models (Paper • 2310.06927 • Published • 15)
- Sparse Llama Gsm8k (Space 📚 • Solve math problems with chat-based guidance • 16)
- RedHatAI/mpt-7b-gsm8k-pruned40-quant-ds (Text Generation • Updated • 2)
- RedHatAI/mpt-7b-gsm8k-pruned50-quant-ds (Text Generation • Updated • 10)
Vision Language Models (VLMs) quantized by Neural Magic

- RedHatAI/Llama-3.2-11B-Vision-Instruct-FP8-dynamic (Text Generation • 11B • Updated • 5.79k • 24)
- RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic (Text Generation • 89B • Updated • 261k • 10)
- RedHatAI/pixtral-12b-FP8-dynamic (Text Generation • 13B • Updated • 52 • 10)
- RedHatAI/Phi-3-vision-128k-instruct-W4A16-G128 (Text Generation • 1B • Updated • 29 • 1)
Llama 3.2 models quantized by Neural Magic

- RedHatAI/Llama-3.2-11B-Vision-Instruct-FP8-dynamic (Text Generation • 11B • Updated • 5.79k • 24)
- RedHatAI/Llama-3.2-90B-Vision-Instruct-FP8-dynamic (Text Generation • 89B • Updated • 261k • 10)
- RedHatAI/Llama-3.2-1B-Instruct-FP8-dynamic (Text Generation • 1B • Updated • 102k • 3)
- RedHatAI/Llama-3.2-3B-Instruct-FP8-dynamic (Text Generation • 4B • Updated • 266 • 3)
Accurate INT8 quantized models by Neural Magic, ready for use with vLLM!

- RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 (Text Generation • 71B • Updated • 9.23k • 21)
- RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (Text Generation • 8B • Updated • 33.7k • 17)
- RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 (Text Generation • 406B • Updated • 19 • 1)
- RedHatAI/Phi-3-medium-128k-instruct-quantized.w8a8 (Text Generation • 14B • Updated • 3 • 2)
Sparse pre-trained and fine-tuned Llama models made by Neural Magic + Cerebras

- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment (Paper • 2405.03594 • Published • 7)
- RedHatAI/Llama-2-7b-pruned50-retrained (Text Generation • 7B • Updated • 2)
- RedHatAI/Llama-2-7b-pruned70-retrained (Text Generation • 7B • Updated • 222)
- RedHatAI/Llama-2-7b-ultrachat200k-pruned_50 (Text Generation • 7B • Updated • 7)
Useful LLMs for DeepSparse, where we've pruned at least 50% of the weights! (A minimal usage sketch follows the list.)

- RedHatAI/OpenHermes-2.5-Mistral-7B-pruned50-quant-ds (Text Generation • Updated • 1 • 2)
- RedHatAI/Nous-Hermes-2-SOLAR-10.7B-pruned50-quant-ds (Text Generation • Updated • 1 • 7)
- RedHatAI/SOLAR-10.7B-Instruct-v1.0-pruned50-quant-ds (Text Generation • Updated • 5 • 5)
- RedHatAI/Llama2-7b-chat-pruned50-quant-ds (Text Generation • Updated • 1 • 9)
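These `-ds` exports are packaged for DeepSparse's text-generation pipeline, which runs them on CPU. A minimal sketch follows; the `hf:` model-stub syntax for pulling a deployment directly from the Hub is an assumption based on DeepSparse's Hub integration, so check the deepsparse docs for the exact form in your version.

```python
# Minimal DeepSparse text-generation sketch for a pruned+quantized export above.
# The "hf:" stub for loading a deployment from the Hugging Face Hub is an
# assumption; consult the deepsparse docs for the exact syntax in your version.
from deepsparse import TextGeneration

pipeline = TextGeneration(model="hf:RedHatAI/Llama2-7b-chat-pruned50-quant-ds")
result = pipeline("Explain weight pruning in one sentence.", max_new_tokens=64)
print(result.generations[0].text)
```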
LLMs optimized by the community using Neural Magic's LLM Compressor for efficient deployment in vLLM. Contribute and help advance efficient AI! (A quantization sketch follows the list.)

- RedHatAI/DeepSeek-R1-Distill-Llama-8B-FP8-dynamic (Text Generation • 8B • Updated • 269 • 4)
- RedHatAI/DeepSeek-R1-Distill-Llama-70B-FP8-dynamic (Text Generation • 71B • Updated • 499 • 9)
- RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic (Text Generation • 33B • Updated • 1.27k • 8)
- RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8 (Text Generation • 71B • Updated • 3.64k • 2)
- RedHatAI/granite-3.1-2b-instruct-quantized.w4a16 (Text Generation • 0.5B • Updated • 128)
- RedHatAI/granite-3.1-2b-instruct-quantized.w8a8 (Text Generation • 3B • Updated • 20)
- RedHatAI/granite-3.1-8b-instruct-quantized.w4a16 (Text Generation • 1B • Updated • 731 • 1)
- RedHatAI/granite-3.1-8b-instruct-quantized.w8a8 (Text Generation • 8B • Updated • 94 • 2)
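For contributors, here is a minimal sketch of how an FP8-dynamic checkpoint like the ones above can be produced with LLM Compressor. Import paths and the `oneshot` signature have shifted between llm-compressor releases, so treat this as a sketch of the flow rather than a pinned recipe; the base model ID is illustrative.

```python
# Sketch of one-shot FP8 dynamic quantization with LLM Compressor.
# Module paths follow the llm-compressor examples at the time of writing and
# may differ in newer releases; the base model ID is just an illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# FP8 dynamic quantization needs no calibration data: weights are quantized
# statically, activations are quantized on the fly at inference time.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained("Meta-Llama-3.1-8B-Instruct-FP8-dynamic")
tokenizer.save_pretrained("Meta-Llama-3.1-8B-Instruct-FP8-dynamic")
```

The resulting directory can be loaded by vLLM exactly like the sketch shown earlier, just with the local path in place of the Hub ID.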