	# OpenSeek-Small-v1 Model Documentation

	## Overview
	OpenSeek-Small-v1 is the initial production model of the OpenSeek project.
	- Uses a DeepSeek-V3-like Mixture-of-Experts (MoE) architecture.
	- Has 1.4 billion total parameters, of which 0.4 billion are activated.
	- Trained on 720 billion tokens.
	- Demonstrates better compute efficiency than comparable 1-billion-parameter models (see the Evaluation section).
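The parameter and token counts above allow a rough back-of-the-envelope calculation. The sketch below computes the fraction of parameters that are activated and a training-compute estimate. It assumes the common `C ≈ 6 * N * D` rule of thumb, with `N` taken as the activated parameter count; that approximation is an assumption made here for illustration, not a figure reported by the OpenSeek project.

```python
# Figures taken from the Overview list.
TOTAL_PARAMS = 1.4e9       # total parameters
ACTIVATED_PARAMS = 0.4e9   # activated parameters
TRAINING_TOKENS = 720e9    # pretraining tokens


def activated_fraction(total: float, activated: float) -> float:
    """Share of the model's parameters that are activated."""
    return activated / total


def approx_training_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb estimate C ~= 6 * N * D (an assumption, not a reported number)."""
    return 6 * active_params * tokens


fraction = activated_fraction(TOTAL_PARAMS, ACTIVATED_PARAMS)
flops = approx_training_flops(ACTIVATED_PARAMS, TRAINING_TOKENS)

print(f"activated fraction: {fraction:.1%}")
print(f"approx. training compute: {flops:.3e} FLOPs")
```

Under these assumptions, roughly 29% of the parameters are active and the estimate comes to about 1.7e21 FLOPs.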

	## Training Data
	The model was trained on 0.72T (720 billion) tokens of high-quality pretraining data. The ratio for each domain is as follows:
	| Name | Ratio (%) |
	|-------------------------------------------|---------|
	| Nemotron-CC-high-actual-actual-high | 1.26 |
	| Nemotron-CC-high-actual-actual-low | 0.67 |
	| Nemotron-CC-high-actual-actual-mid | 2.05 |
	| Nemotron-CC-high-synthetic-distill-high | 1.59 |
	| Nemotron-CC-high-synthetic-distill-low | 0.64 |
	| Nemotron-CC-high-synthetic-distill-mid | 2.32 |
	| Nemotron-CC-high-synthetic-diverse_qa_pairs-high | 4.67 |
	| Nemotron-CC-high-synthetic-diverse_qa_pairs-low | 2.16 |
	| Nemotron-CC-high-synthetic-diverse_qa_pairs-mid | 7.58 |
	| Nemotron-CC-high-synthetic-extract_knowledge-high | 6.43 |
	| Nemotron-CC-high-synthetic-extract_knowledge-low | 0.07 |
	| Nemotron-CC-high-synthetic-extract_knowledge-mid | 2.22 |
	| Nemotron-CC-high-synthetic-knowledge_list-high | 1.88 |
	| Nemotron-CC-high-synthetic-knowledge_list-low | 0.74 |
	| Nemotron-CC-high-synthetic-knowledge_list-mid | 3.20 |
	| Nemotron-CC-high-synthetic-wrap_medium-high | 3.89 |
	| Nemotron-CC-high-synthetic-wrap_medium-low | 0.65 |
	| Nemotron-CC-high-synthetic-wrap_medium-mid | 6.18 |
	| Nemotron-CC-low-synthetic-wrap_medium-high | 0.17 |
	| Nemotron-CC-low-synthetic-wrap_medium-low | 0.30 |
	| Nemotron-CC-low-synthetic-wrap_medium-mid | 1.08 |
	| Nemotron-CC-medium-actual-actual-high | 2.20 |
	| Nemotron-CC-medium-actual-actual-low | 4.48 |
	| Nemotron-CC-medium-actual-actual-mid | 7.76 |
	| arxiv | 0.32 |
	| books | 1.98 |
	| code | 3.43 |
	| cot_synthesis_CC | 9.82 |
	| cot_synthesis_OpenSource | 0.46 |
	| cot_synthesis_arxiv | 4.15 |
	| cot_synthesis_code | 1.32 |
	| cot_synthesis_math | 2.19 |
	| cot_synthesis_wiki | 0.83 |
	| math | 0.83 |
	| pes2o | 0.31 |
	| stack | 0.19 |
	| wiki | 0.29 |
	| zh_cc | 9.65 |
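The ratios in the table sum to roughly 100, which suggests they are percentages of the training mix. As a minimal sketch under that assumption, the snippet below converts a few of the ratios into approximate token counts out of the 720 billion total. The helper name `tokens_for` is hypothetical, and whether the ratios are measured in tokens or in sampling probability is not stated in the table.

```python
TOTAL_TOKENS = 720e9  # total pretraining tokens stated in the Overview

# A few rows copied from the table above; values are assumed to be percentages.
DOMAIN_RATIOS = {
    "cot_synthesis_CC": 9.82,
    "zh_cc": 9.65,
    "code": 3.43,
    "math": 0.83,
    "wiki": 0.29,
}


def tokens_for(ratio_percent: float, total_tokens: float = TOTAL_TOKENS) -> float:
    """Approximate token count for a domain given its percentage of the mix."""
    return ratio_percent / 100.0 * total_tokens


for name, ratio in DOMAIN_RATIOS.items():
    billions = tokens_for(ratio) / 1e9
    print(f"{name:<20} {ratio:>5.2f}%  ~{billions:.1f}B tokens")
```

Under this reading, `zh_cc` at 9.65% corresponds to about 69.5 billion tokens.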

	## Wandb
	Our training curves are logged on [Weights & Biases](https://wandb.ai/openseek-team/OpenSeek-Small-v1).

	## Evaluation
	| Category | Metrics (shots) | Llama-3.2-1B | Qwen2.5-1.5B | Qwen2.5-0.5B | OLMo-1B-0724 | OpenSeek-Small-v1 |
	|------------------------------|-------------------|--------------|--------------|--------------|---------------|-------------------|
	| English-Commonsense Reasoning | HellaSwag (5-shot) | 0.4830 | 0.5007 | 0.4007 | 0.4909 | 0.3893 |
	| | TruthfulQA (0-shot) | 0.3773 | 0.4663 | 0.3986 | 0.4029 | 0.3990 |
	| | Winogrande (5-shot) | 0.6212 | 0.6448 | 0.5683 | 0.6290 | 0.5541 |
	| | CommonsenseQA (5-shot) | 0.3120 | 0.7445 | 0.5487 | 0.1949 | 0.2048 |
	| | PIQA (5-shot) | 0.7514 | 0.7612 | 0.7111 | 0.7459 | 0.7203 |
	| | OpenBookQA (5-shot) | 0.2960 | 0.3340 | 0.2720 | 0.3080 | 0.2560 |
	| | BoolQ (5-shot) | 0.6590 | 0.7774 | 0.6572 | 0.6508 | 0.6165 |
	| English-Problem-Solving | ARC Easy (5-shot) | 0.6940 | 0.8043 | 0.6780 | 0.6111 | 0.6237 |
	| | ARC Challenge (5-shot) | 0.3532 | 0.4846 | 0.3370 | 0.3063 | 0.3157 |
	| | MMLU (5-shot) | 0.3124 | 0.6165 | 0.4818 | 0.2869 | 0.2654 |
	| English-Mathematics | GSM8K (5-shot) | 0.0637 | 0.6194 | 0.3495 | 0.0159 | 0.0182 |
	| | Minerva Math (4-shot) | 0.0180 | 0.2876 | 0.1160 | 0.0182 | 0.0010 |
	| Chinese | CEval (5-shot) | 0.2779 | 0.6954 | 0.5423 | 0.2340 | 0.2422 |
	| | CMMLU (5-shot) | 0.2687 | 0.6882 | 0.5300 | 0.2570 | 0.2468 |
	| Average Metrics | Average-English(w/o Math) | 0.4859 | 0.6134 | 0.5053 | 0.4627 | 0.4345 |
	| | Average-English | 0.4118 | 0.5868 | 0.4599 | 0.3884 | 0.3637 |
	| | Average-Chinese | 0.2733 | 0.6918 | 0.5362 | 0.2455 | 0.2445 |
	| | Average | 0.3920 | 0.6018 | 0.4708 | 0.3680 | 0.3466 |
	| | Average(w/o Math) | 0.4505 | 0.6265 | 0.5105 | 0.4265 | 0.4028 |
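The average rows appear to be unweighted means over the individual benchmarks in each group. The sketch below recomputes them for the OpenSeek-Small-v1 column under that assumption; the grouping into dictionaries and the helper name `average` are illustrative choices, not part of the project's evaluation code.

```python
from statistics import mean

# OpenSeek-Small-v1 scores copied from the table above.
english_general = {
    "HellaSwag": 0.3893,
    "TruthfulQA": 0.3990,
    "Winogrande": 0.5541,
    "CommonsenseQA": 0.2048,
    "PIQA": 0.7203,
    "OpenBookQA": 0.2560,
    "BoolQ": 0.6165,
    "ARC Easy": 0.6237,
    "ARC Challenge": 0.3157,
    "MMLU": 0.2654,
}
english_math = {"GSM8K": 0.0182, "Minerva Math": 0.0010}
chinese = {"CEval": 0.2422, "CMMLU": 0.2468}


def average(*groups: dict) -> float:
    """Unweighted mean over every benchmark in the given groups, rounded to 4 places."""
    scores = [score for group in groups for score in group.values()]
    return round(mean(scores), 4)


averages = {
    "Average-English(w/o Math)": average(english_general),
    "Average-English": average(english_general, english_math),
    "Average-Chinese": average(chinese),
    "Average": average(english_general, english_math, chinese),
    "Average(w/o Math)": average(english_general, chinese),
}

for name, value in averages.items():
    print(f"{name:<28} {value:.4f}")
```

For this column the recomputed values match the five average rows in the table.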

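	Measured by average score relative to compute, OpenSeek-Small-v1 demonstrates superior efficiency compared to 1-billion-parameter models, as shown in the scatter plot of the average metric against log compute: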

	- <img src="logC_vs_Metric_Average_scatter_plot.png" alt="logC_vs_Metric_Average" width="400"/>

	## Usage Instructions
	```python
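	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "BAAI/OpenSeek-Small-v1"

	# trust_remote_code=True lets transformers run the custom model code shipped with the repository.
	model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

	# Tokenize the prompt and generate up to 50 tokens in total (prompt included).
	inputs = tokenizer("The future of AI is", return_tensors="pt")
	outputs = model.generate(**inputs, max_length=50)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))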
	```
