Hiber-Multi-10B-Instruct
Architecture Overview
Hiber-Multi-10B-Instruct is a 10B-parameter multilingual, instruction-tuned language model built on a decoder-only transformer architecture, with the following specification:
MODEL_SPECS = {
    "architecture": "Decoder-only Transformer",
    "params": "10B",
    "context_length": 4096,
    "hidden_size": 4096,
    "attention_heads": 32,
    "kv_heads": 8,
    "intermediate_size": 14336,
    "num_layers": 48,
    "vocab_size": 32000,
    "position_encoding": "Rotary",
    "activation": "SwiGLU",
    "norm_type": "RMSNorm",
}
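As a rough sanity check, the specification above is consistent with the stated 10B parameter count. The sketch below is a back-of-the-envelope estimate only; the exact projection layout (untied embeddings, no biases) is an assumption, not taken from the released weights.
import math

HIDDEN, LAYERS, HEADS, KV_HEADS = 4096, 48, 32, 8
INTERMEDIATE, VOCAB = 14336, 32000

head_dim = HIDDEN // HEADS                      # 128
attn = HIDDEN * HIDDEN                          # query projection
attn += 2 * HIDDEN * (KV_HEADS * head_dim)      # key/value projections (grouped-query)
attn += HIDDEN * HIDDEN                         # output projection
mlp = 3 * HIDDEN * INTERMEDIATE                 # gate, up, down projections (SwiGLU)
embeddings = VOCAB * HIDDEN

total = LAYERS * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")        # prints ~10.6B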
Key Components
Advanced Attention Mechanism
- Multi-head attention with 32 query heads
- Grouped-query attention with 8 shared KV heads (see the sketch after this list)
- Flash Attention 2.0 optimization
- Sliding window attention for long sequences
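The sketch below illustrates how grouped-query attention maps the 32 query heads onto 8 shared KV heads. Tensor names and shapes are illustrative only, not the model's actual module layout.
import torch

batch, seq, head_dim = 1, 16, 128
n_heads, n_kv_heads = 32, 8
group = n_heads // n_kv_heads                      # 4 query heads per KV head

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each KV head is repeated so that 4 query heads attend to the same keys/values,
# shrinking the KV cache by 4x relative to full multi-head attention.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = scores @ v                                   # shape (1, 32, 16, 128)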
Architectural Innovations
- SwiGLU activation function (sketched together with RMSNorm after this list)
- RMSNorm layer normalization
- Rotary position embeddings (RoPE)
- Adaptive KV caching
- Mixture of Experts routing
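Minimal reference implementations of the RMSNorm and SwiGLU blocks named above, with hidden sizes taken from the spec; the epsilon value and bias-free projections are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square; no mean-centering as in LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    def __init__(self, dim: int = 4096, hidden: int = 14336):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate(x)) * up(x), projected back down to the model width.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))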
Implementation Example
from dataclasses import dataclass, asdict
from typing import Optional, List, Dict
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer


@dataclass
class GenerationConfig:
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.1
    max_new_tokens: int = 512
    do_sample: bool = True
    num_beams: int = 1


class HiberMultiPipeline:
    def __init__(
        self,
        model_name: str = "Hiber-Multi-10B-Instruct",
        device_map: str = "auto",
        torch_dtype: Optional[torch.dtype] = torch.bfloat16,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
    ):
        self.config = AutoConfig.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            padding_side="left",
            truncation_side="left",
        )
        # Fall back to the EOS token when no dedicated pad token is defined.
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        quantization_config = None
        if load_in_8bit or load_in_4bit:
            from transformers import BitsAndBytesConfig
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=load_in_8bit,
                load_in_4bit=load_in_4bit,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4",
            )

        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device_map,
            torch_dtype=torch_dtype,
            quantization_config=quantization_config,
            trust_remote_code=True,
        )

    def generate(
        self,
        messages: List[Dict[str, str]],
        generation_config: Optional[GenerationConfig] = None,
    ) -> str:
        if generation_config is None:
            generation_config = GenerationConfig()

        # Render the chat messages with the model's chat template.
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=self.config.max_position_embeddings,
        ).to(self.model.device)

        with torch.inference_mode():
            outputs = self.model.generate(
                **inputs,
                pad_token_id=self.tokenizer.pad_token_id,
                bos_token_id=self.tokenizer.bos_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                **asdict(generation_config),
            )

        # Decode only the newly generated tokens, skipping the prompt.
        response = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        return response.strip()

    @torch.inference_mode()
    def batch_generate(
        self,
        batch_messages: List[List[Dict[str, str]]],
        generation_config: Optional[GenerationConfig] = None,
        batch_size: int = 8,
    ) -> List[str]:
        responses = []
        for i in range(0, len(batch_messages), batch_size):
            batch = batch_messages[i:i + batch_size]
            responses.extend([
                self.generate(msgs, generation_config)
                for msgs in batch
            ])
        return responses
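A minimal usage sketch of the pipeline above; the chat contents are placeholders, and 4-bit loading assumes bitsandbytes is installed.
pipeline = HiberMultiPipeline(load_in_4bit=True)  # 4-bit weights for low-VRAM GPUs
messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Summarize the benefits of grouped-query attention."},
]
print(pipeline.generate(messages, GenerationConfig(max_new_tokens=256)))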
Performance Characteristics
Memory Usage
- FP16: 20GB VRAM
- INT8: 12GB VRAM
- INT4: 8GB VRAM
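The figures above follow from the parameter count and the bytes stored per weight, plus KV-cache and activation overhead; the sketch below is a rough estimate only, and the overhead is an assumption rather than a measured value.
PARAMS = 10e9
for name, bytes_per_weight in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = PARAMS * bytes_per_weight / 1024 ** 3
    print(f"{name}: ~{weights_gb:.0f} GB weights + KV cache / activation overhead")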
Throughput (A100 GPU)
- Batch Size 1: 32 tokens/sec
- Batch Size 8: 180 tokens/sec
- Batch Size 32: 420 tokens/sec
Latency (ms)
LATENCY_PROFILE = {
    "first_token": 42,              # time to first token, ms
    "token_throughput": {           # steady-state latency per generated token, ms
        "batch_1": 31.25,
        "batch_8": 5.56,
        "batch_32": 2.38,
    },
    "context_scaling": {            # relative slowdown vs. a 1024-token context
        "1024_tokens": 1.0,
        "2048_tokens": 1.2,
        "4096_tokens": 1.8,
    },
}
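The per-token latencies above are simply the reciprocals of the throughput figures in the previous table, as the quick check below shows.
for batch, tokens_per_sec in [(1, 32), (8, 180), (32, 420)]:
    print(f"batch {batch}: {1000 / tokens_per_sec:.2f} ms/token")
# batch 1: 31.25 ms/token, batch 8: 5.56 ms/token, batch 32: 2.38 ms/token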
System Requirements
Minimum Configuration
- CUDA 11.8+
- PyTorch 2.0+
- 16GB VRAM (INT8)
- 64GB RAM
- AVX2 support
Recommended Configuration
- CUDA 12.0+
- PyTorch 2.1+
- 24GB+ VRAM
- 128GB RAM
- NVIDIA Ampere GPU
- NVMe SSD
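A quick environment check against the requirements above; this is a hedged sketch, and the 16 GB threshold mirrors the minimum configuration rather than a hard limit enforced by the model.
import torch

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024 ** 3
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"GPU: {props.name}, {vram_gb:.0f} GB VRAM")
print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
if vram_gb < 16:
    print("Warning: below the 16 GB minimum; consider 4-bit quantization.")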
Citation
@software{hiber_multi_2024,
  title        = {Hiber-Multi-10B-Instruct: Advanced Multilingual Language Model},
  author       = {{Hibernates + UCLA Research Team}},
  year         = {2024},
  publisher    = {HuggingFace},
  version      = {1.0.0},
  architecture = {Transformer},
  parameters   = {10B},
  license      = {LLaMA 3.1}
}