---
language:
- zh
- en
- ja
- ko
pipeline_tag: fill-mask
license: apache-2.0
---

### Overview
- ModernBertMultilingual is a multilingual model trained from scratch.
- Uses the [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) architecture.
- Supports four languages and their variants: `Chinese (Simplified and Traditional)`, `English`, `Japanese`, and `Korean`.
- Performs well on tasks involving mixed East Asian text.
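
A minimal usage sketch, assuming the checkpoint loads with the standard `transformers` fill-mask pipeline (the example sentence is illustrative):

```python
from transformers import pipeline

# Load the base checkpoint through the standard fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="neavo/modern_bert_multilingual")

# Use the tokenizer's actual mask token: the vocabulary is adapted from
# Qwen2.5, so the mask string may differ from the usual "[MASK]".
mask = fill_mask.tokenizer.mask_token

# Mixed East Asian input is the intended use case.
for pred in fill_mask(f"今日はとても{mask}天気ですね。"):
    print(pred["token_str"], round(pred["score"], 4))
```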

### Technical Specifications
- Uses a slightly adjusted version of the `Qwen2.5` series vocabulary to support all four languages.
- Trained for approximately `100` hours on `7x L40` GPUs, processing about `60B` tokens in total.
- Key training parameters (see the sketch below):
  - Batch Size: 1792
  - Learning Rate: 5e-04
  - Maximum Sequence Length: 512
  - Optimizer: adamw_torch
  - LR Scheduler: warmup_stable_decay
  - Training Precision: bf16 mixed
- For other technical specifications, please refer to the original release information and paper of [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base).
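
For illustration only, the parameters above map roughly onto a `transformers` `TrainingArguments` configuration like this sketch; the per-device batch split and output directory are assumptions, and the authoritative recipe is the training script linked under Other:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="modern_bert_multilingual",    # assumed
    per_device_train_batch_size=256,          # assumed split: 256 x 7 L40s = 1792 effective
    learning_rate=5e-4,
    optim="adamw_torch",
    lr_scheduler_type="warmup_stable_decay",  # WSD schedule (recent transformers versions)
    bf16=True,                                # bf16 mixed precision
)
# The 512-token maximum sequence length is applied at tokenization time
# rather than through TrainingArguments.
```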

### Released Versions
- Three weight versions are provided:
  - base - fully trained on a general corpus; suitable for a wide range of text domains.
  - nodecay - the checkpoint taken just before the annealing stage; you can continue training it on domain-specific data to better adapt it to a target domain (see the sketch after the table below).
  - keyword_gacha_multilingual - a version fine-tuned (annealed) on ACGN-style text such as `light novels`, `game text`, and `manga text`.

| Model | Version | Description |
| :--: | :--: | :--: |
| [modern_bert_multilingual](https://huggingface.co/neavo/modern_bert_multilingual) | 20250128 | base |
| [modern_bert_multilingual_nodecay](https://huggingface.co/neavo/modern_bert_multilingual_nodecay) | 20250128 | nodecay |
| [keyword_gacha_multilingual_base](https://huggingface.co/neavo/keyword_gacha_multilingual_base) | 20250128 | keyword_gacha_multilingual |
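
Continued pretraining from the nodecay checkpoint might look like the following sketch, using a standard `transformers` masked-language-modeling loop. The corpus file, batch size, learning rate, and scheduler here are illustrative assumptions, not the authors' recipe:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "neavo/modern_bert_multilingual_nodecay"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# "domain_corpus.txt" is a placeholder for your own in-domain text file.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]

def tokenize(batch):
    # 512 matches the maximum sequence length used in pretraining.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="nodecay_domain_anneal",  # assumed
        per_device_train_batch_size=32,      # illustrative
        learning_rate=5e-5,                  # illustrative decay-phase LR
        lr_scheduler_type="cosine",          # stand-in for the decay schedule
        bf16=True,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```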

### Other
- Training script: [Github](https://github.com/neavo/KeywordGachaModel)
