RuModernBERT-small

The Russian version of the modernized bidirectional encoder-only Transformer model, ModernBERT. RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, English, and code data with a context length of up to 8,192 tokens, using data from the internet, books, scientific sources, and social media.

| Model | Size | Hidden Dim | Num Layers | Vocab Size | Context Length | Task |
|-------|------|------------|------------|------------|----------------|------|
| deepvk/RuModernBERT-small [this] | 35M | 384 | 12 | 50368 | 8192 | Masked LM |
| deepvk/RuModernBERT-base | 150M | 768 | 22 | 50368 | 8192 | Masked LM |

Usage

Don't forget to update transformers and install flash-attn if your GPU supports it.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Prepare model (remove attn_implementation if flash-attn is not installed)
model_id = "deepvk/RuModernBERT-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
model = model.eval()

# Prepare input
text = "Мама мыла [MASK]."
inputs = tokenizer(text, return_tensors="pt")
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)

# Make prediction
outputs = model(**inputs)

# Show prediction
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  посуду
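
The same prediction can also be obtained with the fill-mask pipeline, which wraps the tokenization and decoding steps and does not require flash-attn. A minimal sketch (top_k is an illustrative choice, not taken from this card):

from transformers import pipeline

# Load the checkpoint into a fill-mask pipeline (default attention implementation)
fill_mask = pipeline("fill-mask", model="deepvk/RuModernBERT-small")

# Print the three highest-scoring completions for the masked position
for prediction in fill_mask("Мама мыла [MASK].", top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))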

Training Details

This is the small version with 35 million parameters.
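
As a quick sanity check of that figure, the parameter count can be read directly from the loaded checkpoint. A minimal sketch, assuming the model id from the usage example above:

from transformers import AutoModelForMaskedLM

# Count the parameters of the released checkpoint
model = AutoModelForMaskedLM.from_pretrained("deepvk/RuModernBERT-small")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 35M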

Tokenizer

We trained a new tokenizer following the original configuration. We maintained the size of the vocabulary and added the same special tokens. The tokenizer was trained on a mixture of Russian and English from FineWeb.
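
A quick way to verify that configuration is to inspect the loaded tokenizer. A minimal sketch; the values in the comments are expectations based on the table above, not guaranteed outputs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small")
print(len(tokenizer))                # vocabulary size, 50368 per the table above
print(tokenizer.mask_token)          # the [MASK] token used in the usage example
print(tokenizer.all_special_tokens)  # special tokens carried over from the original configuration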

Dataset

Pre-training includes three main stages: massive pre-training, context extension, and cooldown. Unlike the original model, we did not use the same data for all stages. For the second and third stages, we used cleaner data sources.

| Data Source | Stage 1 | Stage 2 | Stage 3 |
|-------------|---------|---------|---------|
| FineWeb (En+Ru) | | | |
| CulturaX-Ru-Edu (Ru) | | | |
| Wiki (En+Ru) | | | |
| ArXiv (En) | | | |
| Book (En+Ru) | | | |
| Code | | | |
| StackExchange (En+Ru) | | | |
| Social (Ru) | | | |
| Total Tokens | 1.3T | 250B | 50B |

Context length

In the first stage, the model was trained with a context length of 1,024. In the second and third stages, it was extended to 8,192.
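
Using the extended context only requires letting the tokenizer truncate at the model limit. A minimal sketch with a placeholder document; the 8,192 limit is the only value taken from this card:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small")
long_text = " ".join(["word"] * 20000)  # placeholder for a long document
inputs = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 8192)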

Evaluation

To evaluate the model, we measure quality on the Encodechka and Russian SuperGLUE (RSG) benchmarks. For RSG, we perform a grid search for optimal hyperparameters and report metrics from the dev split.

To keep the comparison fair, we compare RuModernBERT only with raw encoders that were not trained on retrieval or sentence-embedding tasks.
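
For reference, a fine-tuning setup for one RSG task (TERRa) could look like the sketch below. The dataset identifier, field names, and hyperparameters are illustrative assumptions, not the grid-search settings behind the reported numbers:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "deepvk/RuModernBERT-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Assumed dataset id and field names for TERRa (textual entailment, two labels)
raw = load_dataset("RussianNLP/russian_super_glue", "terra")

def preprocess(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

encoded = raw.map(preprocess, batched=True)

# Placeholder hyperparameters; the card reports results after a grid search
args = TrainingArguments(
    output_dir="rumodernbert-small-terra",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],  # metrics in the card are reported on the dev split
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
print(trainer.evaluate())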

Russian SuperGLUE

| Model | RCB | PARus | MuSeRC | TERRa | RUSSE | RWSD | DaNetQA | Score |
|-------|-----|-------|--------|-------|-------|------|---------|-------|
| deepvk/deberta-v1-distill | 0.433 | 0.56 | 0.625 | 0.590 | 0.943 | 0.569 | 0.726 | 0.635 |
| deepvk/deberta-v1-base | 0.450 | 0.61 | 0.722 | 0.704 | 0.948 | 0.578 | 0.760 | 0.682 |
| ai-forever/ruBert-base | 0.491 | 0.61 | 0.663 | 0.769 | 0.962 | 0.574 | 0.678 | 0.678 |
| deepvk/RuModernBERT-small [this] | 0.555 | 0.64 | 0.746 | 0.593 | 0.930 | 0.574 | 0.743 | 0.683 |
| deepvk/RuModernBERT-base | 0.556 | 0.61 | 0.857 | 0.818 | 0.977 | 0.583 | 0.758 | 0.737 |

Encodechka

| Model | Size | STS-B | Paraphraser | XNLI | Sentiment | Toxicity | Inappropriateness | Intents | IntentsX | FactRu | RuDReC | Avg. S | Avg. S+W |
|-------|------|-------|-------------|------|-----------|----------|-------------------|---------|----------|--------|--------|--------|----------|
| cointegrated/rubert-tiny | 11.9M | 0.66 | 0.53 | 0.40 | 0.71 | 0.89 | 0.68 | 0.70 | 0.58 | 0.24 | 0.34 | 0.645 | 0.575 |
| deepvk/deberta-v1-distill | 81.5M | 0.70 | 0.57 | 0.38 | 0.77 | 0.98 | 0.79 | 0.77 | 0.36 | 0.36 | 0.44 | 0.665 | 0.612 |
| deepvk/deberta-v1-base | 124M | 0.68 | 0.54 | 0.38 | 0.76 | 0.98 | 0.80 | 0.78 | 0.29 | 0.29 | 0.40 | 0.653 | 0.591 |
| answerdotai/ModernBERT-base | 150M | 0.50 | 0.29 | 0.36 | 0.64 | 0.79 | 0.62 | 0.59 | 0.10 | 0.22 | 0.20 | 0.486 | 0.431 |
| ai-forever/ruBert-base | 178M | 0.67 | 0.53 | 0.39 | 0.77 | 0.98 | 0.78 | 0.77 | 0.38 | 🥴 | 🥴 | 0.659 | 🥴 |
| DeepPavlov/rubert-base-cased | 180M | 0.63 | 0.50 | 0.38 | 0.73 | 0.94 | 0.74 | 0.74 | 0.31 | 🥴 | 🥴 | 0.621 | 🥴 |
| deepvk/RuModernBERT-small [this] | 35M | 0.64 | 0.50 | 0.36 | 0.72 | 0.95 | 0.73 | 0.72 | 0.47 | 0.28 | 0.26 | 0.636 | 0.563 |
| deepvk/RuModernBERT-base | 150M | 0.67 | 0.54 | 0.35 | 0.75 | 0.97 | 0.76 | 0.76 | 0.58 | 0.37 | 0.36 | 0.673 | 0.611 |

Citation

@misc{deepvk2025rumodernbert,
    title={RuModernBERT: Modernized BERT for Russian},
    author={Spirin, Egor and Malashenko, Boris and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/rumodernbert-base},
    publisher={Hugging Face},
    year={2025},
}