Hiber-Multi-10B-Instruct

Architecture Overview

A state-of-the-art multilingual language model built on a decoder-only transformer architecture:

MODEL_SPECS = {
    "architecture": "Decoder-only Transformer",
    "params": "10B",
    "context_length": 4096,
    "hidden_size": 4096,
    "attention_heads": 32,
    "kv_heads": 8,
    "intermediate_size": 14336,
    "num_layers": 48,
    "vocab_size": 32000,
    "position_encoding": "Rotary",
    "activation": "SwiGLU",
    "norm_type": "RMSNorm"
}

Key Components

  • Advanced Attention Mechanism

    • 32 attention (query) heads
    • Grouped-query attention with 8 KV heads (see the sketch after this list)
    • Flash Attention 2.0 optimization
    • Sliding window attention for long sequences
  • Architectural Innovations

    • SwiGLU activation function
    • RMSNorm layer normalization
    • Rotary position embeddings (RoPE)
    • Adaptive KV caching
    • Mixture of Experts routing
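
With grouped-query attention, the 32 query heads share 8 key/value heads (4 queries per KV head), shrinking the KV cache roughly 4x relative to full multi-head attention. A minimal illustrative sketch of the grouping (not the model's actual attention kernel):

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads: int = 8):
    # q: (batch, 32, seq, head_dim); k, v: (batch, 8, seq, head_dim)
    group_size = q.shape[1] // num_kv_heads
    # Each KV head serves group_size query heads; expand K/V to line up with Q.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)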

Implementation Example

from dataclasses import dataclass, asdict
from typing import Optional, List, Dict
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

@dataclass
class GenerationConfig:
    temperature: float = 0.7
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.1
    max_new_tokens: int = 512
    do_sample: bool = True
    num_beams: int = 1

class HiberMultiPipeline:
    def __init__(
        self,
        model_name: str = "Hiber-Multi-10B-Instruct",
        device_map: str = "auto",
        torch_dtype: Optional[torch.dtype] = torch.bfloat16,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
    ):
        self.config = AutoConfig.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            padding_side="left",
            truncation_side="left",
        )
        # Some checkpoints ship without a pad token; fall back to EOS so padding works.
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        quantization_config = None
        if load_in_8bit or load_in_4bit:
            from transformers import BitsAndBytesConfig
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=load_in_8bit,
                load_in_4bit=load_in_4bit,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4",
            )

        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map=device_map,
            torch_dtype=torch_dtype,
            quantization_config=quantization_config,
            trust_remote_code=True,
        )
        
    def generate(
        self,
        messages: List[Dict[str, str]],
        generation_config: Optional[GenerationConfig] = None,
    ) -> str:
        if generation_config is None:
            generation_config = GenerationConfig()
            
        prompt = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=self.config.max_position_embeddings,
        ).to(self.model.device)
        
        with torch.inference_mode():
            outputs = self.model.generate(
                **inputs,
                pad_token_id=self.tokenizer.pad_token_id,
                bos_token_id=self.tokenizer.bos_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
                **asdict(generation_config),
            )
            
        response = self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        return response.strip()

    @torch.inference_mode()
    def batch_generate(
        self,
        batch_messages: List[List[Dict[str, str]]],
        generation_config: Optional[GenerationConfig] = None,
        batch_size: int = 8,
    ) -> List[str]:
        # Processes conversations sequentially in chunks of `batch_size`;
        # true batched decoding would pad and run each chunk in one forward pass.
        responses = []
        for i in range(0, len(batch_messages), batch_size):
            batch = batch_messages[i:i + batch_size]
            responses.extend([
                self.generate(msgs, generation_config)
                for msgs in batch
            ])
        return responses
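
A minimal usage sketch of the pipeline above (the example conversation and sampling settings are illustrative):

pipeline = HiberMultiPipeline(torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Explain rotary position embeddings in two sentences."},
]

print(pipeline.generate(messages, GenerationConfig(temperature=0.5, max_new_tokens=256)))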

Performance Characteristics

Memory Usage

  • FP16: 20GB VRAM
  • INT8: 12GB VRAM
  • INT4: 8GB VRAM
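
The INT8 and INT4 figures assume bitsandbytes quantization, which the pipeline above already exposes through its loading flags; a minimal sketch:

# ~12 GB VRAM: 8-bit weights
pipeline_int8 = HiberMultiPipeline(load_in_8bit=True)

# ~8 GB VRAM: 4-bit NF4 weights
pipeline_int4 = HiberMultiPipeline(load_in_4bit=True)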

Throughput (A100 GPU)

  • Batch Size 1: 32 tokens/sec
  • Batch Size 8: 180 tokens/sec
  • Batch Size 32: 420 tokens/sec

Latency (ms)

LATENCY_PROFILE = {
    "first_token_ms": 42,
    "per_token_latency_ms": {   # 1000 / throughput at each batch size
        "batch_1": 31.25,
        "batch_8": 5.56,
        "batch_32": 2.38
    },
    "context_scaling": {        # relative latency vs. a 1024-token context
        "1024_tokens": 1.0,
        "2048_tokens": 1.2,
        "4096_tokens": 1.8
    }
}
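
A rough way to reproduce the per-token figures is to time a fixed-length generation with the pipeline above; the helper below is an illustrative sketch, and results vary with hardware, drivers, and context length:

import time

def measure_decode_speed(pipeline, messages, max_new_tokens: int = 256) -> float:
    config = GenerationConfig(max_new_tokens=max_new_tokens, do_sample=False)
    start = time.perf_counter()
    output = pipeline.generate(messages, config)
    elapsed = time.perf_counter() - start
    # Count generated tokens (may be fewer than max_new_tokens if EOS is hit early).
    n_tokens = len(pipeline.tokenizer(output)["input_ids"])
    return n_tokens / elapsed  # tokens per second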

System Requirements

Minimum Configuration

  • CUDA 11.8+
  • PyTorch 2.0+
  • 16GB VRAM (INT8)
  • 64GB RAM
  • AVX2 support

Recommended Configuration

  • CUDA 12.0+
  • PyTorch 2.1+
  • 24GB+ VRAM
  • 128GB RAM
  • NVIDIA Ampere GPU
  • NVMe SSD
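
A quick way to check a machine against these requirements (the VRAM threshold below mirrors the minimum configuration):

import torch

def check_environment(min_vram_gb: float = 16.0) -> None:
    assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} | VRAM: {vram_gb:.1f} GB | "
          f"CUDA: {torch.version.cuda} | PyTorch: {torch.__version__}")
    if vram_gb < min_vram_gb:
        print("Warning: below the 16 GB minimum recommended for INT8 inference")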

Citation

@software{hiber_multi_2024,
    title = {Hiber-Multi-10B-Instruct: Advanced Multilingual Language Model},
    author = {{Hibernates + UCLA Research Team}},
    year = {2024},
    publisher = {HuggingFace},
    version = {1.0.0},
    architecture = {Transformer},
    parameters = {10B},
    license = {LLaMA 3.1}
}