---
license: mit
datasets:
- iryneko571/CCMatrix-v1-Ja_Zh-fused
language:
- ja
- zh
library_name: transformers
pipeline_tag: translation
widget:
- text: <-ja2zh-> フェルディナント・ラッサール \n は、プロイセンの政治学者、哲学者、法学者、社会主義者、労働運動指導者。ドイツ社会民主党の母体となる全ドイツ労働者同盟の創設者である。社会主義共和政の統一ドイツを目指しつつも、……
---
# Colab notebook, no environment needed
A simple GPU Colab notebook you can use for testing; no local environment setup or installation is needed. <br>
In short, it works once you switch to GPU mode; translating a whole novel is still not very readable, though (awkward). <br>
https://colab.research.google.com/drive/19rQG4ryrue-0g8KH4ATT0_o2-8tHLcIT?usp=sharing

# Release Notes
* this model is fine-tuned from mt5-base; the training method and dataset follow larryvrh/mt5-translation-ja_zh
* trained on a trimmed and fused dataset, CCMatrix-v1-Ja_Zh, at a 1e-4 learning rate for 1 epoch with no weight decay; it arrived at about 1.5 validation loss, pretty decent for this behemoth tokenizer (a training sketch follows this list)
* training took about 26 h on a modified 2080 Ti 22 GB graphics card, but size-wise this model is safe to train on much smaller cards

* reason for making this model<br>
  There are some issues in the original model by larryvrh, including:
  * long sentences repeat themselves, and line breaks are not recognized
  * numbers and punctuation get garbled
  * it translates to or from English only "sometimes"
  * it is a bit too big for smaller cards <br>
  These are all-parameter problems that an all-parameter fine-tune can only partially fix,
  and I generally prefer to make a base model that doesn't have these issues to begin with. So here it is...
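
Below is a minimal sketch of what the fine-tuning setup described above could look like with the `transformers` Seq2SeqTrainer, using the stated hyperparameters (mt5-base, 1e-4 learning rate, 1 epoch, no weight decay). The dataset column names, batch size, and preprocessing are assumptions for illustration only; the actual training script is not published here.

```python
# Hedged sketch of the fine-tune described in the release notes.
# Assumptions: dataset column names ("source"/"target"), batch size, max length.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

# dataset listed in the model card metadata
raw = load_dataset("iryneko571/CCMatrix-v1-Ja_Zh-fused")

def preprocess(example):
    # assumed columns: tagged source text (e.g. "<-ja2zh-> ...") and target translation
    model_inputs = tokenizer(example["source"], truncation=True, max_length=256)
    labels = tokenizer(text_target=example["target"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_set = raw["train"].map(preprocess, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-base-translation-ja_zh",
    learning_rate=1e-4,             # as stated above
    weight_decay=0.0,               # no weight decay
    num_train_epochs=1,             # single epoch
    per_device_train_batch_size=8,  # assumed; adjust to the card's memory
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```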

# Model release statement
* this model was inspired by mt5-translation-ja_zh (in fact it was built on top of it); it uses mt5-base and is somewhat smaller than the original model
* trained on CCMatrix-v1-Ja_Zh at a 1e-4 learning rate for 1 epoch
* ran for about 26 hours on my own 2080 Ti 22 GB card; a more modern small card would be faster

* reason for making this model
  larryvrh's original model is already quite good, but it has a few small issues:
  * long sentences loop and repeat, and line breaks are not recognized
  * numbers and punctuation get scrambled
  * sometimes it translates into or out of English, and sometimes it doesn't translate at all
  * it is a bit too big for small machines <br>
  There are other problems too, but the ones above involve every parameter shape; adding a LoRA on top still leaves the model skewed and doesn't solve them, and fine-tuning the whole model as before is too delicate to control,
  so I decided to retrain a fresh model that fixes all of the above

# A simple backend application
Not yet stably debugged; use with caution
* https://github.com/IryNeko/RabbitCafe
  
# A more precise example of using it
```python
from transformers import pipeline

model_name = "iryneko571/mt5-base-translation-ja_zh"
# pipe = pipeline("translation", model=model_name, tokenizer=model_name, repetition_penalty=1.4, batch_size=1, max_length=256)
pipe = pipeline(
    "translation",
    model=model_name,
    repetition_penalty=1.4,
    batch_size=1,
    max_length=256,
)

def translate_batch(batch, language='<-ja2zh->'):  # batch is a list of strings
    # prefix every sentence with the translation-direction tag
    batch = [f'{language} {text}' for text in batch]
    translated = pipe(batch)
    # each pipeline result is a dict holding the output under 'translation_text'
    return [item['translation_text'] for item in translated]

inputs = []  # put the Japanese sentences to translate here

print(translate_batch(inputs))
```
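
If a GPU is available (as in the Colab notebook above), the pipeline can be placed on it explicitly. A minimal sketch, assuming CUDA device 0 is visible:

```python
# assumes a CUDA-capable GPU visible as device 0; use device=-1 to stay on CPU
pipe = pipeline(
    "translation",
    model=model_name,
    repetition_penalty=1.4,
    batch_size=1,
    max_length=256,
    device=0,
)
```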
# Roadmap
* want some LoRAs?
* build out the platform further

# How to find me
Discord Server:<br>
https://discord.gg/JmjPmJjA<br>
If you need any help, want to try the latest version on a test server, or just want to chat, feel free to join the channel (am I even allowed to post this here?)<br>