OpenSeek-Small-v1 / README.md
ldwang's picture
Update README.md
2920192 verified
# OpenSeek-Small v1 Model Documentation
## Overview
OpenSeek-Small-v1 is the initial production model of the OpenSeek project.
- Utilizes DeepSeek-V3-like MoE architecture.
- Comprises 1.4 billion total parameters, with 0.4 billion activated parameters.
- Trained on 720 billion tokens.
- Demonstrates superior efficiency compared to 1-billion-parameter models.
## Training Data
- 0.72TB tokens of high-quality pretraining data and the ratio for each domain is as follows:
| Name | Ratio |
|-------------------------------------------|---------|
| Nemotron-CC-high-actual-actual-high | 1.26 |
| Nemotron-CC-high-actual-actual-low | 0.67 |
| Nemotron-CC-high-actual-actual-mid | 2.05 |
| Nemotron-CC-high-synthetic-distill-high | 1.59 |
| Nemotron-CC-high-synthetic-distill-low | 0.64 |
| Nemotron-CC-high-synthetic-distill-mid | 2.32 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
| Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
| Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
| Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
| Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
| Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
| Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
| Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
| Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
| Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
| Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
| Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
| Nemotron-CC-medium-actual-actual-high | 2.20 |
| Nemotron-CC-medium-actual-actual-low | 4.48 |
| Nemotron-CC-medium-actual-actual-mid | 7.76 |
| arxiv | 0.32 |
| books | 1.98 |
| code | 3.43 |
| cot_synthesis_CC | 9.82 |
| cot_synthesis_OpenSource | 0.46 |
| cot_synthesis_arxiv | 4.15 |
| cot_synthesis_code | 1.32 |
| cot_synthesis_math | 2.19 |
| cot_synthesis_wiki | 0.83 |
| math | 0.83 |
| pes2o | 0.31 |
| stack | 0.19 |
| wiki | 0.29 |
| zh_cc | 9.65 |
## Wandb
Our training curves have been recorded in Weights & Biases [wandb](https://wandb.ai/openseek-team/OpenSeek-Small-v1).
## Evaluation
| Category | Metrics (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
|------------------------------|-------------------|--------------|--------------|--------------|---------------|-------------------|
| **English-Commonsense Reasoning** | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
| | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
| | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
| | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
| | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
| | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
| | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
| **English-Problem-Solving** | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
| | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
| | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
| **English-Mathematics** | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
| | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
| **Chinese** | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
| | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
| **Average Metrics** | **Average-English(w/o Math)** | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
| | **Average-English** | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
| | **Average-Chinese** | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
| | **Average** | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
| | **Average(w/o Math)** | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |
OpenSeek-Small-v1 demonstrates superior efficiency compared to 1-billion-parameter models.
- <img src="logC_vs_Metric_Average_scatter_plot.png" alt="logC_vs_Metric_Average" width="400"/>
## Usage Instructions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("BAAI/OpenSeek-Small-v1",trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("BAAI/OpenSeek-Small-v1",trust_remote_code=True)
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```