Porting MobileLLM-R1-950M to MLX and mlx-lm: Architectural Challenges and Solutions

I spent some time pairing with Gemini 2.5 Pro and later OpenAI Codex to drag the brand-new facebook/MobileLLM-R1-950M weights onto Apple Silicon. This write-up is the “why it wasn’t copy-paste” story, plus the gotchas that bit us until the model finally spoke clean English and quantized without drama.

Goal

Enable facebook/MobileLLM-R1-950M to run natively on Apple Silicon using MLX, then create quantized versions compatible with the mlx-lm ecosystem.


1. Why a Direct "Llama-4 Drop-In" Failed

Although the Hugging Face repo presents MobileLLM-R1-950M as a Llama-4-style dense model, its config and weights don't align cleanly with a stock Llama block. The deviations aren't quirks of MLX—they reflect this model's specific architecture:

  • MLP ambiguity
    Config advertises both intermediate_size and intermediate_size_mlp, suggesting a dual-branch feed-forward.
    Actual weights contain only a SwiGLU branch (gate_proj, up_proj, down_proj).
    → Solution: auto-detect the MLP variant from the weight names at load time (see the sketch after this list).

  • Grouped-Query Attention (GQA)
    num_attention_heads=24, num_key_value_heads=6.
    K/V tensors must be repeated to full head count for attention shapes to align correctly.

  • QK-norm and scaling
    Config includes use_qk_norm=True and attn_scale=0.1.
    We add the RMSNorm on Q/K as specified, but drop the extra 0.1 multiplier: applying it in MLX's scaled_dot_product_attention collapses the logits into gibberish.

  • RoPE gating
    Config lists all layers under no_rope_layers.
    Disabling RoPE everywhere would eliminate positional encoding entirely.
    → Treat "all layers disabled" as a config artifact and apply RoPE everywhere.
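
Below is a minimal sketch of how these points can fit together in an MLX attention module, assuming a SwiGLU-only checkpoint. The head counts come from the config; the hidden size and helper names are placeholders, and the actual implementation in this repo differs in detail.

```python
import mlx.core as mx
import mlx.nn as nn


def detect_mlp_variant(weights: dict) -> str:
    # Trust the weight names over the config's intermediate_size / intermediate_size_mlp pair.
    if any(k.endswith("mlp.gate_proj.weight") for k in weights):
        return "swiglu"  # only a gate_proj / up_proj / down_proj branch exists
    return "dense"       # fallback if no SwiGLU projections are found


class Attention(nn.Module):
    # dim=1536 is a placeholder; 24 query heads and 6 KV heads are from the config.
    def __init__(self, dim: int = 1536, n_heads: int = 24, n_kv_heads: int = 6):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # QK-norm from use_qk_norm=True: RMSNorm over the head dimension.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)
        # RoPE on every layer, treating no_rope_layers as a config artifact.
        self.rope = nn.RoPE(self.head_dim)

    def __call__(self, x: mx.array, mask=None) -> mx.array:
        B, L, _ = x.shape
        q = self.q_proj(x).reshape(B, L, self.n_heads, -1).transpose(0, 2, 1, 3)
        k = self.k_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        v = self.v_proj(x).reshape(B, L, self.n_kv_heads, -1).transpose(0, 2, 1, 3)
        q, k = self.q_norm(q), self.k_norm(k)
        q, k = self.rope(q), self.rope(k)
        # GQA: repeat each KV head so the 6 KV heads line up with the 24 query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = mx.repeat(k, rep, axis=1), mx.repeat(v, rep, axis=1)
        # Standard 1/sqrt(head_dim) scaling only; the config's attn_scale=0.1 is dropped.
        out = mx.fast.scaled_dot_product_attention(
            q, k, v, scale=self.head_dim ** -0.5, mask=mask
        )
        return self.o_proj(out.transpose(0, 2, 1, 3).reshape(B, L, -1))
```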


2. Prompt-Level Deviations

Even after the weights loaded correctly, default inference was disrupted by tokenizer and chat-template settings:

  • Chat template
    Default system prompt: "Please reason step-by-step and put your final answer within \boxed{}."
    Without overrides, the model produces verbose "reasoning" outputs.
    → Added CLI controls: --system, --disable-chat-template, --final-only.

  • Double BOS
    Both the tokenizer and the chat template inserted a BOS token, doubling it at the start of every prompt.
    → Fixed by encoding the templated prompt with add_special_tokens=False.

  • Premature EOS
    Template header tokens such as <|eot_id|> were treated as stop tokens, ending generation prematurely.
    → Limited the stopping criteria to the true EOS token only (see the prompt-handling sketch after this list).
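
A minimal sketch of the resulting prompt path, assuming the standard Hugging Face tokenizer API (the repo's scripts wrap this behind the CLI flags above; the message contents are just examples):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")

messages = [
    # --system equivalent: override the default "\boxed{}" reasoning prompt.
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is 12 * 7?"},
]

# The chat template already emits a BOS token, so the encode step must not add
# special tokens again (this was the double-BOS bug).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids

# Stop generation only on the true EOS id; template headers such as <|eot_id|>
# are not added to the stopping criteria.
stop_ids = {tokenizer.eos_token_id}
```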


3. Sampling Stability

Sampling issues stemmed from API mismatches rather than model problems:

  • Applying top-p to normalized probabilities and then feeding the result to mx.random.categorical (which expects logits) produced repetition loops.
  • Solution: apply penalties → scale logits by temperature → top-p mask (filling rejected logits with float('-inf')) → mx.random.categorical over the masked logits.
  • Added CLI controls for temperature, repetition penalty, and frequency penalty (a sampling sketch follows this list).
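
A sketch of that sampling order, assuming repetition and frequency penalties have already been applied to the logits; the helper name is illustrative:

```python
import mlx.core as mx


def sample_next_token(logits: mx.array, temperature: float = 0.8, top_p: float = 0.95) -> mx.array:
    # logits: shape (vocab_size,) for the next position, penalties already applied.
    logits = logits / max(temperature, 1e-6)

    # Top-p in logit space: sort descending, keep the smallest prefix whose
    # probability mass reaches top_p, and mask the rest with -inf.
    order = mx.argsort(-logits)
    sorted_logits = logits[order]
    probs = mx.softmax(sorted_logits, axis=-1)
    cumulative = mx.cumsum(probs, axis=-1)
    keep = (cumulative - probs) < top_p            # the top token is always kept
    masked = mx.where(keep, sorted_logits, -float("inf"))

    # mx.random.categorical expects (unnormalized) logits, not probabilities.
    choice = mx.random.categorical(masked)
    return mx.take(order, choice)                  # map back to the vocabulary id
```

In the generation loop this runs once per step on the final position's logits.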

4. Quantization in mlx-lm: Why Custom Metadata Was Required

mlx-lm provides quantization hooks, but MobileLLM's architecture exposed several challenges:

  1. Frozen weights during sensitivity analysis → no gradients, hence empty sensitivity lists.
    → Avoid freezing weights during gradient computation.

  2. Re-quantizing already-quantized layers → type errors on the second pass.
    → Skip layers that are already QuantizedLinear (see the predicate sketch after this list).

  3. Embedding/norm dtype crashes
    Standard quantization re-quantized everything, but embeddings and norm layers must remain in floating point.
    → Introduced a metadata-driven approach: config.json records per-layer bit-widths, and only the listed layers are instantiated as QuantizedLinear.
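
A minimal sketch of the second and third fixes at quantization time, using mlx.nn.quantize's class_predicate hook (the path checks are illustrative, not the repo's exact pipeline):

```python
import mlx.nn as nn


def quant_predicate(path: str, module: nn.Module) -> bool:
    # Never touch layers that are already quantized (avoids the second-pass type errors).
    if isinstance(module, nn.QuantizedLinear):
        return False
    # Keep embeddings and norm layers in floating point.
    if "embed" in path or "norm" in path:
        return False
    # Quantize the remaining Linear projections.
    return isinstance(module, nn.Linear)


def quantize_model(model: nn.Module, bits: int = 4, group_size: int = 64) -> None:
    nn.quantize(model, group_size=group_size, bits=bits, class_predicate=quant_predicate)
```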

This metadata contract allows the 4-bit mixed-precision MobileLLM to be loaded cleanly by our metadata-aware custom_loader.py, making it compatible with the mlx-lm ecosystem (sketched below).
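
As a sketch of that contract (the metadata key and layout below are illustrative; custom_loader.py defines the actual format), the loader reads per-layer bit-widths from config.json and swaps only the listed layers to QuantizedLinear before loading the quantized weights. This relies on class_predicate being allowed to return a per-layer parameter dict, which recent MLX releases support:

```python
import json

import mlx.nn as nn


def build_quantized_structure(model: nn.Module, config_path: str) -> None:
    # Hypothetical metadata excerpt from config.json:
    # "quantization_per_layer": {
    #     "model.layers.0.self_attn.q_proj": {"bits": 4, "group_size": 64},
    #     "model.layers.0.mlp.gate_proj":    {"bits": 4, "group_size": 64}
    # }
    with open(config_path) as f:
        per_layer = json.load(f).get("quantization_per_layer", {})

    def predicate(path: str, module: nn.Module):
        spec = per_layer.get(path)
        if spec is None or not isinstance(module, nn.Linear):
            return False  # unlisted layers (embeddings, norms, ...) stay in float
        return {"bits": spec["bits"], "group_size": spec["group_size"]}

    # Only the listed layers become QuantizedLinear, each with its own bit-width;
    # the quantized weights are then loaded into this structure via model.load_weights(...).
    nn.quantize(model, class_predicate=predicate)
```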


5. End State

  • MLX path:
    Structural fixes (GQA, MLP detection), numerical fixes (QK-norm, RoPE, attn_scale), and prompt controls together yield fluent, stable inference.

  • mlx-lm path:
    Custom quantization pipeline produces FP16 and 4-bit models. These can be loaded with our metadata-aware custom_loader.py and used for inference with our provided scripts.
    Performance: measurable speedup and reduced memory usage on Apple Silicon, with minimal quality degradation.


Takeaway

The MobileLLM-R1-950M port required systematically addressing architectural mismatches (MLP variant detection, GQA handling, QK-norm implementation, RoPE configuration) and developing a metadata-driven quantization approach. Once these were resolved, the model became fully functional in MLX with both float and quantized inference paths.