Update README.md
README.md
CHANGED
@@ -113,7 +113,7 @@ The following datasets were used for continual pre-training.
 - [English Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
 - [Japanese Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)
 - [Laboro ParaCorpus](https://github.com/laboroai/Laboro-ParaCorpus)
-- [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733)
+- [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733) (filtered using [Swallow Education Classifier(Wiki-based)](https://huggingface.co/tokyotech-llm/edu-classifier))
 - [The-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids)
 
 ## Risks and Limitations