Porting MobileLLM-R1-950M to MLX and mlx-lm: Architectural Challenges and Solutions
I spent some time pairing with Gemini 2.5 Pro and later OpenAI Codex to drag the brand-new facebook/MobileLLM-R1-950M weights onto Apple Silicon. This write-up is the “why it wasn’t copy-paste” story, plus the gotchas that bit us until the model finally spoke clean English and quantized without drama.
Goal
Enable facebook/MobileLLM-R1-950M to run natively on Apple Silicon using MLX, then create quantized versions compatible with the mlx-lm ecosystem.
1. Why a Direct "Llama-4 Drop-In" Failed
Although the Hugging Face repo presents MobileLLM-R1-950M as a Llama-4-style dense model, its config and weights don't align cleanly with a stock Llama block. The deviations aren't quirks of MLX—they reflect this model's specific architecture:
MLP ambiguity
- Config advertises both `intermediate_size` and `intermediate_size_mlp`, suggesting a dual-branch feed-forward.
- Actual weights contain only a SwiGLU branch (`gate_proj`, `up_proj`, `down_proj`).
- → Solution: auto-detect the MLP variant from weight names at load time.
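A minimal sketch of that detection, assuming the checkpoint has already been flattened into a dict of parameter name → array (the helper names here are illustrative, not the exact functions in the port):

```python
# Decide which MLP to build by inspecting the checkpoint itself rather than
# trusting the ambiguous intermediate_size / intermediate_size_mlp fields.
# `weights` is a flat dict of parameter name -> array, e.g. from the safetensors shards.

def detect_mlp_variant(weights: dict) -> str:
    """Return "swiglu" when the checkpoint only carries gate/up/down projections."""
    swiglu_keys = ("gate_proj", "up_proj", "down_proj")
    if all(any(key in name for name in weights) for key in swiglu_keys):
        return "swiglu"
    return "dual_branch"  # fall back to whatever the dual-branch config implies

def detect_ffn_dim(weights: dict) -> int:
    """Read the true FFN width off a tensor instead of the config."""
    for name, tensor in weights.items():
        if name.endswith("mlp.gate_proj.weight"):
            return tensor.shape[0]  # gate_proj.weight is (ffn_dim, hidden_size)
    raise ValueError("no SwiGLU projections found in checkpoint")
```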
Grouped-Query Attention (GQA)
- `num_attention_heads=24`, `num_key_value_heads=6`.
- K/V tensors must be repeated to the full head count for attention shapes to align correctly.
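The head bookkeeping amounts to something like the sketch below, assuming a (batch, heads, seq_len, head_dim) layout; if the attention kernel in use already understands grouped K/V, the explicit repeat can be skipped.

```python
import mlx.core as mx

N_HEADS, N_KV_HEADS = 24, 6
REPEATS = N_HEADS // N_KV_HEADS  # 4 query heads share each KV head

def repeat_kv(x: mx.array) -> mx.array:
    """Tile each KV head so K/V line up with the 24 query heads."""
    # x: (batch, n_kv_heads, seq_len, head_dim) -> (batch, n_heads, seq_len, head_dim)
    return mx.repeat(x, REPEATS, axis=1)
```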
QK-norm and scaling
- Config includes `use_qk_norm=True` and `attn_scale=0.1`.
- We add the RMSNorm on Q/K as specified, but drop the extra 0.1 multiplier; applying it in MLX's `scaled_dot_product_attention` collapses the logits into gibberish.
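In code the fix looks roughly like this; the head dimension and exact norm placement are assumptions for illustration rather than lines from the actual modeling file:

```python
import mlx.core as mx
import mlx.nn as nn

HEAD_DIM = 64                  # hidden_size // num_attention_heads (illustrative)
q_norm = nn.RMSNorm(HEAD_DIM)  # per-head RMSNorm, as use_qk_norm=True requests
k_norm = nn.RMSNorm(HEAD_DIM)

def attend(q, k, v, mask=None):
    # q, k, v: (batch, n_heads, seq_len, head_dim), K/V already repeated to n_heads
    q, k = q_norm(q), k_norm(k)
    scale = HEAD_DIM ** -0.5   # standard scaling only: no extra attn_scale=0.1
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale, mask=mask)
```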
RoPE gating
- Config lists all layers under `no_rope_layers`.
- Disabling RoPE everywhere would eliminate positional encoding entirely.
- → Treat "all layers disabled" as a config artifact and apply RoPE everywhere.
2. Prompt-Level Deviations
Even after weights loaded correctly, default inference was disrupted by tokenizer settings:
Chat template
- Default system prompt: "Please reason step-by-step and put your final answer within \boxed{}."
- Without overrides, the model produces verbose "reasoning" outputs.
- → Added CLI controls: `--system`, `--disable-chat-template`, `--final-only`.
Double BOS
- Both the tokenizer and the template inserted BOS tokens.
- → Fixed with `add_special_tokens=False` (see the sketch after this list).
Premature EOS
- Template headers (`<|eot_id|>`) were treated as stop tokens.
- → Limited the stopping criterion to the true EOS token only.
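Put together, the prompt path after these fixes looks roughly like the following, using the standard Hugging Face tokenizer API (the CLI flags above simply toggle these arguments; the messages are made up):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")

messages = [
    {"role": "system", "content": "You are a concise assistant."},  # --system override
    {"role": "user", "content": "What is 12 * 17?"},
]

# The chat template already emits BOS, so don't let encode() add a second one.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tok.encode(prompt, add_special_tokens=False)

# Stop only on the true EOS token, not on template headers like <|eot_id|>.
stop_ids = {tok.eos_token_id}
```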
3. Sampling Stability
Sampling issues stemmed from API mismatches rather than model problems:
- Applying top-p to probabilities and then feeding the result to `mx.random.categorical` (which expects logits) produced repetition loops.
- Solution: apply penalties → scale logits → top-p mask (with `float('-inf')`) → `categorical(logits)`.
- Added controls for temperature, repetition penalty, and frequency penalty.
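A sketch of that ordering, staying in logit space throughout; the penalty handling is simplified and the function is ours, not an mlx-lm API:

```python
import mlx.core as mx

def sample_next(logits: mx.array, prev_tokens: list[int],
                temp: float = 0.8, top_p: float = 0.95,
                repetition_penalty: float = 1.1) -> int:
    """Penalties -> temperature -> top-p mask in logit space -> categorical(logits)."""
    # logits: (vocab_size,) for the last generated position.
    # 1. Repetition penalty on previously generated tokens (sign-aware, CTRL-style).
    if prev_tokens:
        vocab_ids = mx.expand_dims(mx.arange(logits.shape[-1]), 1)          # (vocab, 1)
        prev_ids = mx.expand_dims(mx.array(sorted(set(prev_tokens))), 0)    # (1, n_prev)
        seen = mx.any(vocab_ids == prev_ids, axis=-1)                       # (vocab,)
        penalized = mx.where(logits > 0, logits / repetition_penalty,
                             logits * repetition_penalty)
        logits = mx.where(seen, penalized, logits)

    # 2. Temperature scaling, still on logits.
    logits = logits / max(temp, 1e-6)

    # 3. Top-p: mask everything outside the nucleus with -inf instead of
    #    renormalizing probabilities.
    order = mx.argsort(-logits)                  # descending by logit
    sorted_logits = logits[order]
    probs = mx.softmax(sorted_logits, axis=-1)
    cum = mx.cumsum(probs, axis=-1)
    keep = (cum - probs) < top_p                 # always keeps the top token
    masked = mx.where(keep, sorted_logits, mx.array(-float("inf")))

    # 4. Sample from logits, then map back to vocabulary ids.
    choice = int(mx.random.categorical(masked).item())
    return int(order[choice].item())
```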
4. Quantization in mlx-lm: Why Custom Metadata Was Required
mlx-lm provides quantization hooks, but MobileLLM's architecture exposed several challenges:
Frozen gradients during sensitivity analysis
- Freezing weights during the gradient pass yielded empty sensitivity lists.
- → Avoid freezing weights during gradient computation.
Re-quantizing quantized layers
- A second quantization pass over already-quantized layers raised type errors.
- → Skip layers that are already `QuantizedLinear`.
Embedding/norm dtype crashes
- Standard quantization re-quantized everything, but embeddings must remain float.
- → Introduced a metadata-driven approach: config.json records per-layer bit-widths; only the specified layers are instantiated as `QuantizedLinear`.
This metadata contract allows 4-bit mixed-precision MobileLLM to be loaded cleanly by our metadata-aware custom_loader.py, making it compatible with the mlx-lm ecosystem.
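The gist of that contract, sketched with MLX's built-in quantization helper; the metadata key names are placeholders rather than the real schema read by custom_loader.py:

```python
import json
import mlx.nn as nn

def quantize_from_metadata(model: nn.Module, config_path: str) -> nn.Module:
    """Quantize only the layers named in the metadata, leaving embeddings/norms in float."""
    with open(config_path) as f:
        config = json.load(f)
    # Hypothetical layout: {"model.layers.0.self_attn.q_proj": 4, ...}
    per_layer_bits = config.get("quantization_per_layer_bits", {})
    group_size = config.get("quantization_group_size", 64)

    # One pass per bit-width; the predicate leaves everything else untouched,
    # including layers that were already converted to QuantizedLinear.
    for bits in sorted(set(per_layer_bits.values())):
        chosen = {name for name, b in per_layer_bits.items() if b == bits}
        nn.quantize(
            model,
            group_size=group_size,
            bits=bits,
            class_predicate=lambda path, module, chosen=chosen: (
                path in chosen and isinstance(module, nn.Linear)
            ),
        )
    return model
```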
5. End State
MLX path:
- Structural fixes (GQA, MLP detection), numerical fixes (QK-norm, RoPE, attn_scale), and prompt controls together yield fluent, stable inference.
mlx-lm path:
- The custom quantization pipeline produces FP16 and 4-bit models. These can be loaded with our metadata-aware custom_loader.py and used for inference with our provided scripts.
Performance: measurable speedup and reduced memory usage on Apple Silicon, with minimal quality degradation.
Takeaway
The MobileLLM-R1-950M port required systematically addressing architectural mismatches (MLP variant detection, GQA handling, QK-norm implementation, RoPE configuration) and developing a metadata-driven quantization approach. Once these were resolved, the model became fully functional in MLX with both float and quantized inference paths.