Update README.md
README.md (changed)
@@ -32,27 +32,41 @@ This repository provides large language models developed by [TokyoTech-LLM](http
### MT-Bench JA

#### Turn-Wise Performance

We report the overall score (i.e., the average of the first- and second-turn scores) as well as the first-turn and second-turn scores separately.
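The turn averaging described above can be sketched as follows; this is a minimal illustration with made-up numbers, and the function name is hypothetical (not part of the benchmark's tooling):

```python
# Minimal sketch: an overall MT-Bench score for a category is the
# average of its first- and second-turn scores.
# Function name and sample inputs are hypothetical, for illustration only.
def overall_score(first_turn: float, second_turn: float) -> float:
    return (first_turn + second_turn) / 2

# Example with made-up per-category turn scores:
print(round(overall_score(0.48, 0.26), 4))  # 0.37
```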

##### Overall

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
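For this row, the Average cell coincides with the arithmetic mean of the eight category scores; a quick sanity check (the simple-mean rule is our assumption, not stated in the source):

```python
# Sanity check: the overall row's Average (0.3411) matches the
# arithmetic mean of its eight category scores.
# Simple (unweighted) mean is an assumption, for illustration only.
categories = [0.3770, 0.4290, 0.3454, 0.1040, 0.2400, 0.3677, 0.3907, 0.4750]
mean = sum(categories) / len(categories)
print(round(mean, 4))  # 0.3411
```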

##### First Turn

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3699|0.4880|0.4260|0.3900|0.1080|0.2364|0.3780|0.4500|0.4800|

##### Second Turn

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3130|0.2624|0.4320|0.2996|0.1000|0.2430|0.3564|0.3291|0.4700|

#### Comparison to Past Models

We provide only the overall scores in this section.

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
| Swallow-MS-7b-instruct-v0.1 |0.3411|0.3770|0.4290|0.3454|0.1040|0.2400|0.3677|0.3907|0.4750|
| ELYZA-japanese-Llama-2-7b-fast-instruct |0.2827|0.3289|0.3907|0.2424|0.1480|0.1584|0.3511|0.3053|0.3365|
| calm2-7b-chat |0.3204|0.4657|0.4898|0.1837|0.1005|0.1414|0.3927|0.3601|0.4293|
| calm2-7b-chat-dpo-experimental |0.3493|0.5312|0.5237|0.1857|0.1000|0.1813|0.3355|0.4320|0.5051|
| RakutenAI-7B-instruct |0.2994|0.3623|0.3711|0.3333|0.1763|0.1581|0.4215|0.2824|0.2901|
| RakutenAI-7B-chat |0.3667|0.4229|0.4644|0.3990|0.2161|0.2390|0.3416|0.3904|0.4601|

## Evaluation Benchmarks