|
---
license: llama2
datasets:
- snow_simplified_japanese_corpus
- khalidalt/tydiqa-goldp
- csebuetnlp/xlsum
language:
- ja
---
|
# About |
|
This model is Lightblue's QLoRA fine-tune of OpenOrca's [Open-Orca/OpenOrcaxOpenChat-Preview2-13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B) model on Japanese fine-tuning datasets.
|
|
|
We trained on equal samples of the following three datasets: |
|
* [SNOW](https://huggingface.co/datasets/snow_simplified_japanese_corpus) |
|
* [TyDiQA (Ja)](https://huggingface.co/datasets/khalidalt/tydiqa-goldp) |
|
* [XLSUM (Ja)](https://huggingface.co/datasets/csebuetnlp/xlsum) |
|
|
|
which resulted in a dataset of 13,167 samples in total.
|
|
|
These three datasets were chosen as they represent three distinct fine-tuning tasks (text simplification, question answering, and text summarization, respectively), which we hypothesize can help to improve the language model's suitability for dealing with Japanese data.
|
The initial letters of these three datasets (SNOW, TyDiQA, XLSUM) make up the model name: STX.
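For reference, a data mixture along these lines can be assembled with the Hugging Face `datasets` library. The sketch below is a minimal illustration, not the exact preprocessing used for training: the config names for the Japanese subsets and the per-source sample count are assumptions, and the task-specific prompt formatting is omitted.

```python
from datasets import load_dataset

# Hub IDs from the list above; the config names for the Japanese subsets are
# assumptions - check each dataset card before running.
sources = {
    "snow":   ("snow_simplified_japanese_corpus", "snow_t15"),
    "tydiqa": ("khalidalt/tydiqa-goldp", "japanese"),
    "xlsum":  ("csebuetnlp/xlsum", "japanese"),
}

raw = {name: load_dataset(repo, config, split="train")
       for name, (repo, config) in sources.items()}

# Take the same number of examples from each source (13,167 / 3 = 4,389 each).
n_per_source = 13167 // 3
subsets = {name: ds.shuffle(seed=42).select(range(min(n_per_source, len(ds))))
           for name, ds in raw.items()}

for name, ds in subsets.items():
    print(name, len(ds))
```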
|
|
|
# How to use |
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_dir = "lightblue/openorca_stx"  # this model's Hub repo, or a local path

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map='auto',
)
|
|
|
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) |
|
|
|
def do_closed_qa(context, question):
    # Closed QA prompt: the source article followed by the question.
    return context + "\n\n" + question
|
|
|
test_article = """ใใขใใใใฎใฌใใผใใชใผใซใใชใผใใปใใคใฑใซ้ธๆใใใใใฌใคใถใผใฉใขใณRGใใใๆฌไบบๅ
ฌ่ชใฎใขใใใใงใใใใฉใฐใใผใใกใณใฎๅๅฟใซๅฐใ้ฉใใใใใงใใ |
|
ใใชใผใใปใใคใฑใซ้ธๆใฎใขใใใใฏใไฝใใใฃใใใงใใใ |
|
ใ2015ๅนดใฎใฏใผใซใใซใใ๏ผWๆฏ๏ผใคใณใฐใฉใณใๅคงไผใงๆฅๆฌใๅใขใใชใซใๅใใๆฌกใฎๆฅใใไบฌ้ฝใงใฎ็ช็ตใญใฑใงใใใๅฝๆใฏใใขใใใซใฎๅ
ฑๅๅตๆฅญ่
ในใใฃใผใใปใธใงใใบใฎใขใใใใฐใใใงใใใใไธ็ทใซใญใฑใใใฆใใใธใฃใณใฐใซใใฑใใใใใใชใผใใปใใคใฑใซใซไผผใฆใพใใใใธใงใใบใฎใพใพใใใใใใใใชใใงใใ๏ผใใจ่จใใใใฎใๅงใพใใงใใ |
|
ใใใ ใใฟใใช็ฅ่ญใใชใใใฉใฐใใผใทใงใใใๆขใใๆฅๆฌไปฃ่กจใฎใฆใใใผใ ใๅฃฒใๅใใ ใฃใใฎใงใ่ตคใฃใฝใใฆใใใผใ ใจใใใใใฎ็ญใใณใใฏใใฆใใจใใใใSNSใงใใชใผใใปใใคใฑใซใงใใใฃใฆใใฃใฑใๅ็ใ่ผใใพใใใ |
|
ใใใใจใใใใ่ฆใใชใผใใใๆฌไบบใใDM๏ผใใคใฌใฏใใกใใปใผใธ๏ผใๅฑใใพใใใใใขใใใใใใใจใใใใใพใใใใใขใใใใใใใชใใๅใฎใฆใใใผใ ใ้ใใพใใฎใง็ใฆใใ ใใใใจใWๆฏๅพใซใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉใใปใใพใซ้ใฃใฆใใฆใใใพใใใไป็ใฆใใใฎใใใใงใใ |
|
ใใใพใงใๆฐใ
ใฎ่ๅไบบใใขใใใใใฆใใใใพใใใใชใผใ้ธๆใฎใใฟใฎๅ้ฟใฏใใใใงใใใใ |
|
ใใๅใฏใฉใฐใใผ็ต้จใใชใใงใใใใฉใฐใใผใๅ
จ็ถ็ฅใใชใใฃใใใฉใใใฃใฑใๆฌไบบใใใฆใใใผใ ใ้ ใใฆใใฃใฆใใโๅฐ็ฑ ๏ผใใใใ๏ผโใฟใใใชใฎใใใฃใฆใใใใใคใฏใชใผใใใๆฌไบบใซ่ชใใใใฆใใใจใไธ็ฎ็ฝฎใใใฆใใใฎใใชใจๆใใพใใ |
|
ใใใใฃใฆใใใใจใฏใ่ฆใ็ฎใๆฌไบบใซๅฏใใฆใฏใณใใผใ ใฃใฆ่จใใ ใใชใใงใใใฉใญใใใใงใใใใใใชใผใใใใ ใใจ่จใฃใฆใใใใพใใ |
|
ใใใชใผใใใใจๅฎ้ใซไผใใใจใชใใฆใ็ฐกๅใซใฏใงใใชใใใใชใใงใใใใงใใใชใผใใใใฎใพใญใใใฆใใRGใซใฏไผใใใใใฟใใใช๏ผ็ฌ๏ผใไฝใ ใใใชใๆๅใช็ฅ็คพใฎๆฏ็คพใฎใใใชๅญๅจใงใใใญใใใใใใใใใใจใใๆๅณใงใฏไปใฎใขใใใใจใฏใใใ้ใใพใใญใ |
|
""" |
|
|
|
test_question = "ใใชใผใใปใใคใฑใซใฏไฝใ้ใฃใฆใใพใใใ๏ผ" |
|
|
|
pipe(do_closed_qa(test_article, test_question), max_new_tokens=128, temperature=0)[0]["generated_text"]
|
# "ใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉ" |
|
``` |
|
|
|
|
|
# Training details |
|
|
|
This model was trained for 1,000 steps (roughly 1.2 epochs), with an evaluation every 50 steps. We then chose the checkpoint with the lowest validation loss from these evaluations.
|
We used the [qlora](https://github.com/artidoro/qlora) package from artidoro. |
|
We trained with the following hyperparameters: |
|
|
|
```
Per device evaluation batch size: 16
Per device train batch size: 8
LoRA rank (lora_r): 64
LoRA alpha (lora_alpha): 16
LoRA modules: all
Double quantization: Enabled
Quantization type: nf4
BF16: Enabled
Bits: 4
Warmup ratio: 0.03
Learning rate scheduler type: Constant
Gradient checkpointing: Enabled
Gradient accumulation steps: 2
Learning rate: 0.0002
Adam beta2: 0.999
Maximum gradient norm: 0.3
LoRA dropout: 0.05
Weight decay: 0.0
```
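For anyone re-creating the run with the Hugging Face `Trainer` instead of the qlora script, the settings above map roughly onto the following `peft`/`transformers` configuration. This is a minimal sketch under that assumption: the `target_modules` list and `output_dir` are illustrative and were not taken from the original training run.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 base weights with double quantization and bf16 compute (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "LoRA modules: all" puts adapters on every linear layer; the module names
# below are an assumption and depend on the base model's architecture.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./openorca_stx_qlora",   # illustrative
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    max_steps=1000,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,          # keep the checkpoint with the lowest
    metric_for_best_model="eval_loss",    # validation loss, as described above
    greater_is_better=False,
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    weight_decay=0.0,
    adam_beta2=0.999,
    bf16=True,
)
```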
|
|
|
 |
|
|
|
 |