Wrong configs

#5
by pere

There are a few issues loading the Gemma3 model with AutoModelForCausalLM. The core problem is that the current config.json is set up for multi-modal usage (with "text_config" and "vision_config") but is missing key text fields at the top level (like "vocab_size" and "hidden_size") that the text-only classes look for. Specifically:
• There is no "vocab_size" field, yet the checkpoint’s embedding matrix is sized [262208, hidden_size] (because it has extra tokens for images).
• The text fields are nested under "text_config", but Gemma3ForCausalLM expects them at the top level (like config.hidden_size, config.num_hidden_layers, etc.).
• The uploaded config references "Gemma3ForConditionalGeneration", implying multi-modal usage. But for text-only usage, we must patch the config ourselves to match the real embedding dimension and top-level text fields.

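A quick way to see the nesting from Python (a sketch; the comments describe what the uploaded config looks like and the printed values are illustrative):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("google/gemma-3-4b-pt")
print(type(cfg).__name__)             # the multi-modal wrapper config class
print(cfg.text_config.hidden_size)    # 2560: text fields live one level down
print("vocab_size" in cfg.to_dict())  # False: missing at the top level
```
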
Potential fixes:
1. Add the text fields at the top level (e.g. "hidden_size": 2560, "vocab_size": 262208, etc.) so that AutoModelForCausalLM can read them directly without error (a patch sketch follows this list).
2. Use a multi-modal class such as Gemma3ForConditionalGeneration, which explicitly handles both text_config and vision_config, if that’s the intended usage (second sketch below).
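
For fix 1, the text fields can be hoisted in a local copy of config.json. A minimal sketch, assuming the checkpoint has been downloaded to gemma-3-4b-pt/ (the path and the exact set of fields to copy are assumptions):

```python
import json

with open("gemma-3-4b-pt/config.json") as f:
    cfg = json.load(f)

# Hoist the nested text fields (hidden_size, num_hidden_layers, ...) to the
# top level so the text-only classes can find them.
cfg.update(cfg.pop("text_config"))
cfg["vocab_size"] = 262208  # match the checkpoint's embedding matrix
cfg["architectures"] = ["Gemma3ForCausalLM"]

with open("gemma-3-4b-pt/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```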

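For fix 2, a minimal multi-modal load would look roughly like this (the class names come from the transformers Gemma 3 integration; the text-only prompt and generation settings are assumptions for illustration):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_name = "google/gemma-3-4b-pt"

processor = AutoProcessor.from_pretrained(model_name)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

# The processor also accepts images; here we pass text only.
inputs = processor(text="Eiffel tower is located in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
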
Fixing this manually shows that the model loads fine once the config is addressed. The script below rebuilds a text-only config in memory and loads the checkpoint with it:

```python
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    pipeline
)
from transformers.models.gemma3.configuration_gemma3 import Gemma3TextConfig
from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM

# Name or local path of the Gemma3 model checkpoint
model_name = "google/gemma-3-4b-pt"

# Load the multi-modal config
multi_config = AutoConfig.from_pretrained(model_name)

# Extract the text-specific config to a dict
text_cfg_dict = multi_config.text_config.to_dict()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Ensure the vocab size matches the checkpoint's embedding shape
# (the checkpoint's embed_tokens.weight is [262208, 2560], so we set 262208).
text_cfg_dict["vocab_size"] = 262208

# Add any special token IDs from the tokenizer
if tokenizer.pad_token_id is not None:
    text_cfg_dict["pad_token_id"] = tokenizer.pad_token_id
text_cfg_dict["bos_token_id"] = tokenizer.bos_token_id
text_cfg_dict["eos_token_id"] = tokenizer.eos_token_id

# Build a text-only config
text_config = Gemma3TextConfig(**text_cfg_dict)

# Load the model using that text config
model = Gemma3ForCausalLM.from_pretrained(
    model_name,
    config=text_config,
    torch_dtype=torch.bfloat16,
    device_map=None,
    low_cpu_mem_usage=False,
)

# Create a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

prompt = "Eiffel tower is located in"
output = pipe(prompt, max_new_tokens=50)
print("Generated text:", output[0]["generated_text"])
```