|
--- |
|
language: |
|
- en |
|
- ja |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
model_type: mistral |
|
license: apache-2.0 |
|
--- |
|
|
|
# Swallow-MS-7b-v0.1 |
|
|
|
Our Swallow-MS-7b-v0.1 model was continually pre-trained from Mistral-7B-v0.1, primarily with the addition of Japanese language data.
|
|
|
# Model Release Updates |
|
|
|
We are excited to share the release updates for our latest models:
|
- **April 26, 2024**: Released the [Swallow-MS-7b-instruct-v0.1](https://huggingface.co/tokyotech-llm/Swallow-MS-7b-instruct-v0.1) |
|
- **March 11, 2024**: Released the [Swallow-MS-7b-v0.1](https://huggingface.co/tokyotech-llm/Swallow-MS-7b-v0.1) |
|
 |
|
|
|
This repository provides large language models developed by [TokyoTech-LLM](https://tokyotech-llm.github.io/). |
|
|
|
## Model Details |
|
|
|
* **Model type**: Please refer to the Mistral technical report for details on the model architecture.
|
* **Language(s)**: Japanese, English
|
* **Tokenizer**: This model employs a tokenizer whose vocabulary was expanded with Japanese data. As a result, Japanese text is represented with fewer tokens, which makes inference notably faster (see the sketch after this list).
|
* **Contact**: swallow[at]nlp.c.titech.ac.jp |
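As a rough illustration of the tokenizer's effect, the sketch below compares token counts for the same Japanese sentence under this model's tokenizer and the original Mistral-7B-v0.1 tokenizer. The sample sentence is arbitrary, and the exact counts depend on the tokenizer versions you have installed.

```python
from transformers import AutoTokenizer

# Arbitrary Japanese sample sentence for the comparison.
text = "東京工業大学の主なキャンパスについて教えてください。"

# Tokenizer with the vocabulary expanded on Japanese data.
swallow_tokenizer = AutoTokenizer.from_pretrained("tokyotech-llm/Swallow-MS-7b-v0.1")
# Original Mistral-7B-v0.1 tokenizer for reference.
mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print("Swallow-MS tokens:", len(swallow_tokenizer.tokenize(text)))
print("Mistral tokens:   ", len(mistral_tokenizer.tokenize(text)))
```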
|
|
|
## Instruct Model Performance |
|
|
|
### MT-Bench JA |
|
|
|
#### Turn-Wise Performance |
|
|
|
We report the overall score (i.e., the average of the first- and second-turn scores), together with the first- and second-turn scores.
|
|
|
##### Overall |
|
|
|
|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities| |
|
|---|---|---|---|---|---|---|---|---|---| |
|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750| |
|
|
|
##### First Turn |
|
|
|
|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities| |
|
|---|---|---|---|---|---|---|---|---|---| |
|
| Swallow-MS-7b-instruct-v0.1 |0.3699|0.4880|0.4260|0.3900|0.1080|0.2364|0.3780|0.4500|0.4800| |
|
|
|
##### Second Turn |
|
|
|
|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities| |
|
|---|---|---|---|---|---|---|---|---|---| |
|
| Swallow-MS-7b-instruct-v0.1 |0.3130|0.2624|0.4320|0.2996|0.1000|0.2430|0.3564|0.3291|0.4700| |
|
|
|
#### Comparison with other models
|
|
|
We provide only the overall scores in this section.
|
|
|
|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities| |
|
|---|---|---|---|---|---|---|---|---|---| |
|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750| |
|
| ELYZA-japanese-Llama-2-7b-fast-instruct |0.2827|0.3289|0.3907|0.2424|0.1480|0.1584|0.3511|0.3053|0.3365| |
|
| calm2-7b-chat |0.3204|0.4657|0.4898|0.1837|0.1005|0.1414|0.3927|0.3601|0.4293| |
|
| calm2-7b-chat-dpo-experimental |0.3493|0.5312|0.5237|0.1857|0.1000|0.1813|0.3355|0.4320|0.5051| |
|
| RakutenAI-7B-instruct |0.2994|0.3623|0.3711|0.3333|0.1763|0.1581|0.4215|0.2824|0.2901| |
|
| RakutenAI-7B-chat |0.3667|0.4229|0.4644|0.3990|0.2161|0.2390|0.3416|0.3904|0.4601| |
|
|
|
|
|
## Evaluation Benchmarks |
|
|
|
### MT-Bench JA |
|
|
|
We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models. |
|
We utilized the following settings: |
|
|
|
- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
|
- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3) |
|
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1) |
|
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
|
- Judge: `gpt-4-1106-preview` |
|
- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs. |
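As a minimal sketch of how such a normalized score could be computed, assuming the judge's raw ratings are on a 1-10 scale and are divided by 10 (the description above only states that scores are normalized to a 0-1 range), with illustrative numbers only:

```python
# Illustrative raw judge ratings (1-10 scale) for one category across five runs.
raw_scores_per_run = [4.1, 3.8, 4.3, 4.0, 3.9]

# Normalize each run to the 0-1 range (assumption: divide by 10), then average.
normalized = [score / 10 for score in raw_scores_per_run]
reported = sum(normalized) / len(normalized)
print(f"reported score: {reported:.4f}")
```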
|
|
|
|
|
## Usage |
|
|
|
First, install the additional dependencies listed in [requirements.txt](./requirements.txt):
|
|
|
```sh |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### Instruction format Ver0.1 |
|
This format must be adhered to strictly; deviations may result in suboptimal outputs from the model.
|
|
|
The template used to construct a prompt for the Instruct model is specified as follows: |
|
|
|
``` |
|
<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{USER_MESSAGE_1} [/INST] {BOT_MESSAGE_1}</s>[INST] {USER_MESSAGE_2} [/INST] |
|
``` |
|
|
|
|
|
Please be aware that ``<s>`` and ``</s>`` are special tokens for the beginning of string (BOS) and end of string (EOS), respectively, while ``[INST]`` and ``[/INST]`` are treated as regular strings.
|
|
|
For the "{SYSTEM_PROMPT}" part, We recommend using "あなたは誠実で優秀な日本人のアシスタントです。" |
|
|
|
For the "{USER_MESSAGE_1}" part, We recommend using {instruction}\n{input} |
|
|
|
In other words, we recommend the following:
|
|
|
``` |
|
<s>[INST] <<SYS>>\nあなたは誠実で優秀な日本人のアシスタントです。\n<</SYS>>\n\n{instruction1}\n{input1} [/INST] {BOT_MESSAGE_1}</s>[INST] {instruction2}\n{input2} [/INST] |
|
``` |
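For reference, a first-turn prompt in this format can be assembled by hand as in the sketch below. The `build_prompt` helper is illustrative only and not part of any library; note that the BOS token ``<s>`` is normally added by the tokenizer at encoding time, so it is not written into the string itself. In practice, `tokenizer.apply_chat_template` (used in the next section) applies the model's chat template for you.

```python
# Illustrative helper (not part of the library) that builds a first-turn prompt
# following the Ver0.1 template above. The BOS token <s> is added by the
# tokenizer at encoding time, so it is not included in the string.
def build_prompt(system_prompt: str, instruction: str, user_input: str = "") -> str:
    user_message = f"{instruction}\n{user_input}" if user_input else instruction
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"

prompt = build_prompt(
    "あなたは誠実で優秀な日本人のアシスタントです。",
    "東京工業大学の主なキャンパスについて教えてください",
)
print(prompt)
```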
|
|
|
### Use the instruct model Ver0.1 |
|
|
|
```python |
|
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-MS-7b-instruct-v0.1"

# device_map="auto" places the model on the available GPU(s) automatically,
# so there is no need to call model.to(device) afterwards.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    {"role": "user", "content": "東京工業大学の主なキャンパスについて教えてください"}
]

# apply_chat_template builds the [INST]-style prompt described above.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(model.device)

generated_ids = model.generate(model_inputs, max_new_tokens=128, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
|
``` |
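The decoded string above contains the prompt as well as the model's reply. If you only want the newly generated text, one option (a sketch, continuing from the variables defined above) is to decode just the tokens produced after the prompt:

```python
# Keep only the tokens generated after the prompt and drop special tokens.
new_tokens = generated_ids[0][model_inputs.shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
```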
|
|
|
## Training Datasets |
|
|
|
### Instruction Tuning Ver0.1 |
|
|
|
The following datasets were used for the instruction tuning. |
|
|
|
- [OpenAssistant Conversations Dataset](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja): only the human utterances were used; the original responses were discarded and replaced with responses generated by the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model.
|
- [OpenAssistant Conversations Dataset 21k Ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja) |
|
- [OpenAssistant Conversations Dataset 21k En](https://huggingface.co/datasets/llm-jp/oasst1-21k-en) |
|
- [Databricks Dolly 15k Ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja) |
|
- [Databricks Dolly 15k En](https://huggingface.co/datasets/databricks/databricks-dolly-15k) |
|
|
|
Please note that some of the data had issues with quality or format, so not all of it was used. |
|
|
|
## Risks and Limitations |
|
|
|
The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations. |
|
|
|
## Acknowledgements |
|
|
|
We thank Mistral AI for releasing Mistral 7B v0.1 under an open license for others to build on. |
|
|
|
Our project is supported by the [ABCI Large-scale Language Model Building Support Program](https://abci.ai/en/link/llm_support_program.html) of the National Institute of Advanced Industrial Science and Technology. |
|
|
|
## License |
|
|
|
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
|
## Authors |
|
|
|
Here are the team members: |
|
- From [Okazaki Laboratory](https://www.nlp.c.titech.ac.jp/index.en.html), the following members: |
|
- [Naoaki Okazaki](https://www.chokkan.org/index.ja.html) |
|
- [Sakae Mizuki](https://s-mizuki-nlp.github.io/) |
|
- [Hiroki Iida](https://meshidenn.github.io/) |
|
- [Mengsay Loem](https://loem-ms.github.io/) |
|
- [Shota Hirai](https://huggingface.co/Kotemo428) |
|
- [Kakeru Hattori](https://aya-se.vercel.app/) |
|
- [Masanari Ohi](https://twitter.com/stjohn2007) |
|
- From [YOKOTA Laboratory](https://www.rio.gsic.titech.ac.jp/en/index.html), the following members: |
|
- [Rio Yokota](https://twitter.com/rioyokota) |
|
- [Kazuki Fujii](https://twitter.com/okoge_kaz) |
|
- [Taishi Nakamura](https://twitter.com/Setuna7777_2) |
|
- [Takumi Okamoto](https://www.linkedin.com/in/takumi-okamoto) |
|
- [Ishida Shigeki](https://www.wantedly.com/id/reborn27) |
|
|
|
## How to cite |
|
|
|
If you find our work helpful, please feel free to cite us. |
|
|
|
``` |
|
@inproceedings{Fujii:COLM2024,
  title={Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities},
  author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae Mizuki and Rio Yokota and Naoaki Okazaki},
  booktitle={Proceedings of the First Conference on Language Modeling},
  series={COLM},
  pages={(to appear)},
  year={2024},
  month=oct,
  address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
  title={Building a Large Japanese Web Corpus for Large Language Models},
  author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Rio Yokota and Sakae Mizuki},
  booktitle={Proceedings of the First Conference on Language Modeling},
  series={COLM},
  pages={(to appear)},
  year={2024},
  month=oct,
  address={University of Pennsylvania, USA},
}
|
``` |