---
license: apache-2.0
datasets:
- HuggingFaceH4/ultrachat_200k
- BAAI/Infinity-Instruct
- HuggingFaceH4/ultrafeedback_binarized
- Intel/orca_dpo_pairs
- argilla/OpenHermesPreferences
base_model:
- Zyphra/Zamba2-1.2B
library_name: transformers
---

# Model Card for Zamba2-1.2B-Instruct

Zamba2-1.2B-Instruct is obtained from Zamba2-1.2B by fine-tuning on instruction-following and chat datasets. Specifically:

1. SFT of the base [Zamba2-1.2B](https://huggingface.co/Zyphra/Zamba2-1.2B) model on [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) and [Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
2. DPO of the SFT checkpoint on [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), [orca_dpo_pairs](https://huggingface.co/datasets/Intel/orca_dpo_pairs), and [OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences)

Zamba2-1.2B-Instruct is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks.

## Quick start

### Prerequisites

To use Zamba2-1.2B-Instruct, install `transformers` from source:
1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies needed to run the Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch) as well as `causal-conv1d`:
1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is **not** recommended, as it results in significantly higher latency and memory usage.

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)

# Format the input as a chat template
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
```

## Performance

Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size and is competitive with significantly larger models. For instance, Zamba2-1.2B-Instruct outperforms Gemma2-2B-Instruct, a very strong model over 2x its size.
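
Since multi-turn chat is highlighted above, here is a minimal sketch extending the Quick start to a multi-turn conversation. It reuses the exact calls shown in the Inference section; the conversation contents, including the assistant turn, are illustrative placeholders rather than actual model output.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same model and tokenizer setup as in the Quick start
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)

# Multi-turn conversation; the assistant turn below is an illustrative placeholder,
# in practice it would be the model's previous reply.
chat = [
    {'role': 'user', 'content': 'Name one factor behind the fall of the Roman Empire.'},
    {'role': 'assistant', 'content': 'One major factor was sustained political instability.'},
    {'role': 'user', 'content': 'Can you expand on that point?'},
]
chat_sample = tokenizer.apply_chat_template(chat, tokenize=False)

# Tokenize the formatted conversation and generate the next assistant turn
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, use_cache=True, do_sample=False)
print(tokenizer.decode(outputs[0]))
```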