File size: 8,187 Bytes
8da1560 f503b95 f72c6f1 f503b95 608ede5 c2fc322 f503b95 51ca017 f503b95 f72c6f1 5ab010b db29ec0 5ab010b f72c6f1 f503b95 7b833f2 f503b95 26a395b f503b95 d33dcfe f503b95 8c1ce9a f6d7433 8c1ce9a f6d7433 8c1ce9a f6d7433 8c1ce9a 26abdeb 8c1ce9a f503b95 3cefbdb f503b95 3cefbdb f503b95 8da1560 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 |
---
license: llama2
datasets:
- snow_simplified_japanese_corpus
- khalidalt/tydiqa-goldp
- csebuetnlp/xlsum
language:
- ja
---
# About
This model is Lightblue's QLoRA finetune of OpenOrca's [Open-Orca/OpenOrcaxOpenChat-Preview2-13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B) model on Japanese fine-tuning datasets.
This model specialises on answering **Closed Question Answering** in Japanese. Input a piece of reference text, ask a question, and see the model answer based on the reference text.
We trained on equal samples of the following three datasets:
* [SNOW](https://huggingface.co/datasets/snow_simplified_japanese_corpus)
* [TyDiQA (Ja)](https://huggingface.co/datasets/khalidalt/tydiqa-goldp)
* [XLSUM (Ja)](https://huggingface.co/datasets/csebuetnlp/xlsum)
which resulted in a dataset of 13,167 samples total.
These three datasets were chosen as they represent three distinct fine-tuning tasks (Text simplification, question answering, and text summarization, respectively) which we hypothesize can help to improve the language models suitability for dealing with Japanese data.
These three datasets make up the model name: STX.
With these datasets, we achieve the following scores on the JGLUE benchmark:
| Model Name | Open-Orca/OpenOrcaxOpenChat-Preview2-13B | lightblue/openorca_stx |
|------------------------|------------------------------------------|------------------------|
| jsquad-1.1-0.3 | 0.692 | 0.836 |
| jcommonsenseqa-1.1-0.3 | 0.831 | 0.782 |
| jnli-1.1-0.3 | 0.504 | 0.48 |
| marc_ja-1.1-0.3 | 0.936 | 0.959 |
We achieved these scores by using the [lm-evaluation-harness](https://github.com/Stability-AI/lm-evaluation-harness) from Stability AI using the below commands:
```bash
MODEL_ARGS=pretrained=lightblue/openorca_stx,use_accelerate=True
TASK="jsquad-1.1-0.3,jcommonsenseqa-1.1-0.3,jnli-1.1-0.3,marc_ja-1.1-0.3"
export JGLUE_OUTPUT_DIR=../jglue_results/$MODEL_NAME/$DATSET_NAME/$DATASET_SIZE
mkdir -p $JGLUE_OUTPUT_DIR
python main.py --model hf-causal-experimental --model_args $MODEL_ARGS --tasks $TASK --num_fewshot "2,3,3,3" --device "cuda" --output_path $JGLUE_OUTPUT_DIR/result.json --batch_size 4 > $JGLUE_OUTPUT_DIR/harness.out 2> $JGLUE_OUTPUT_DIR/harness.err
```
Our model achieves much better results on the question answering benchmark (JSQuAD) than the base checkpoint without monstrous degradation of performance on multi-choice question benchmarks (JCommonSense, JNLI, MARC-Ja) purely through QLoRA training.
This shows the potential for applying strong language models such as [Open-Orca/OpenOrcaxOpenChat-Preview2-13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B) to minimal QLoRA fine-tuning using Japanese fine-tuning datasets to achieve better results at narrow NLP tasks.
# How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
model_dir = "lightblue/openorca_stx"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
model_dir, torch_dtype=torch.bfloat16, device_map='auto',
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
def do_closed_qa(context, question):
return context + "\n\n" + question
test_article = """ใใขใใใใฎใฌใใผใใชใผใซใใชใผใใปใใคใฑใซ้ธๆใใใใใฌใคใถใผใฉใขใณRGใใใๆฌไบบๅ
ฌ่ชใฎใขใใใใงใใใใฉใฐใใผใใกใณใฎๅๅฟใซๅฐใ้ฉใใใใใงใใ
ใใชใผใใปใใคใฑใซ้ธๆใฎใขใใใใฏใไฝใใใฃใใใงใใใ
ใ2015ๅนดใฎใฏใผใซใใซใใ๏ผWๆฏ๏ผใคใณใฐใฉใณใๅคงไผใงๆฅๆฌใๅใขใใชใซใๅใใๆฌกใฎๆฅใใไบฌ้ฝใงใฎ็ช็ตใญใฑใงใใใๅฝๆใฏใใขใใใซใฎๅ
ฑๅๅตๆฅญ่
ในใใฃใผใใปใธใงใใบใฎใขใใใใฐใใใงใใใใไธ็ทใซใญใฑใใใฆใใใธใฃใณใฐใซใใฑใใใใใใชใผใใปใใคใฑใซใซไผผใฆใพใใใใธใงใใบใฎใพใพใใใใใใใใชใใงใใ๏ผใใจ่จใใใใฎใๅงใพใใงใใ
ใใใ ใใฟใใช็ฅ่ญใใชใใใฉใฐใใผใทใงใใใๆขใใๆฅๆฌไปฃ่กจใฎใฆใใใผใ ใๅฃฒใๅใใ ใฃใใฎใงใ่ตคใฃใฝใใฆใใใผใ ใจใใใใใฎ็ญใใณใใฏใใฆใใจใใใใSNSใงใใชใผใใปใใคใฑใซใงใใใฃใฆใใฃใฑใๅ็ใ่ผใใพใใใ
ใใใใจใใใใ่ฆใใชใผใใใๆฌไบบใใDM๏ผใใคใฌใฏใใกใใปใผใธ๏ผใๅฑใใพใใใใใขใใใใใใใจใใใใใพใใใใใขใใใใใใใชใใๅใฎใฆใใใผใ ใ้ใใพใใฎใง็ใฆใใ ใใใใจใWๆฏๅพใซใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉใใปใใพใซ้ใฃใฆใใฆใใใพใใใไป็ใฆใใใฎใใใใงใใ
ใใใพใงใๆฐใ
ใฎ่ๅไบบใใขใใใใใฆใใใใพใใใใชใผใ้ธๆใฎใใฟใฎๅ้ฟใฏใใใใงใใใใ
ใใๅใฏใฉใฐใใผ็ต้จใใชใใงใใใใฉใฐใใผใๅ
จ็ถ็ฅใใชใใฃใใใฉใใใฃใฑใๆฌไบบใใใฆใใใผใ ใ้ ใใฆใใฃใฆใใโๅฐ็ฑ ๏ผใใใใ๏ผโใฟใใใชใฎใใใฃใฆใใใใใคใฏใชใผใใใๆฌไบบใซ่ชใใใใฆใใใจใไธ็ฎ็ฝฎใใใฆใใใฎใใชใจๆใใพใใ
ใใใใฃใฆใใใใจใฏใ่ฆใ็ฎใๆฌไบบใซๅฏใใฆใฏใณใใผใ ใฃใฆ่จใใ ใใชใใงใใใฉใญใใใใงใใใใใใชใผใใใใ ใใจ่จใฃใฆใใใใพใใ
ใใใชใผใใใใจๅฎ้ใซไผใใใจใชใใฆใ็ฐกๅใซใฏใงใใชใใใใชใใงใใใใงใใใชใผใใใใฎใพใญใใใฆใใRGใซใฏไผใใใใใฟใใใช๏ผ็ฌ๏ผใไฝใ ใใใชใๆๅใช็ฅ็คพใฎๆฏ็คพใฎใใใชๅญๅจใงใใใญใใใใใใใใใใจใใๆๅณใงใฏไปใฎใขใใใใจใฏใใใ้ใใพใใญใ
"""
test_question = "ใใชใผใใปใใคใฑใซใฏไฝใ้ใฃใฆใใพใใใ๏ผ"
pipe(do_closed_qa(test_article, question), max_new_tokens=128, temperature=0)[0]["generated_text"]
# "ใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉ"
```
### Prompting
We have found that this model is able to work well using a variety of prompts, including the Alpaca style templated prompts:
```python
f"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}
### Response:
"""
```
We have found that having a newline at the end of the prompt can be important for signalling that the model must respond and not continue the inputs.
# Training details
We trained using the following three minimalistic prompt templates for the three tasks in STX:
* SNOW
```python
f"""ๅ
ใฎๆฅๆฌ่ช๏ผ
{original_ja}
ใทใณใใซใชๆฅๆฌ่ช๏ผ"""
```
* TyDiQA
```python
f"""{passage_text}
{question_text}"""
```
* XLSum
```python
f"""่จไบ๏ผ
{article_text}
่ฆ็ด๏ผ"""
```
This model was trained for 1000 steps (1.2 epochs) with the model being evaluated every 50 steps. We then chose the best model from these evaluations based on validation loss.
We used the [qlora](https://github.com/artidoro/qlora) package from artidoro.
We trained with the following hyperparameters:
```
Per device evaluation batch size: 16
Per device train batch size: 8
LoRA (lora_r): 64
LoRA alpha (lora_alpha): 16
LoRA modules: all
Double quantization: Enabled
Quantization type: nf4
BF16: Enabled
Bits: 4
Warmup ratio: 0.03
Learning rate scheduler type: Constant
Gradient checkpointing: Enabled
Gradient accumulation steps: 2
Learning rate: 0.0002
Adam beta2: 0.999
Maximum gradient norm: 0.3
LoRA dropout: 0.05
Weight decay: 0.0
```

 |