Upload yanomami_dataset/README.md with huggingface_hub
yanomami_dataset/README.md +52 -0
ADDED
@@ -0,0 +1,52 @@
# Yanomami Language Dataset

This dataset was created to train a Yanomami-English translation model. It combines several types of data so that the model can learn different aspects of the Yanomami language and how they translate into English.

## Dataset Contents

The dataset consists of the following six files, with 29,011 examples in total:

| File | Description | Number of examples |
|------|-------------|--------------------|
| `translations.jsonl` | General translations between Yanomami and English | 17,009 |
| `yanomami-to-english.jsonl` | Specific Yanomami to English translations | 1,822 |
| `phrases.jsonl` | Common phrases and expressions | 2,322 |
| `grammar.jsonl` | Grammatical structures and rules | 200 |
| `comparison.jsonl` | Comparative phrases and structures | 2,072 |
| `how-to.jsonl` | Instructional content and procedural language | 5,586 |

## Data Format

Each file is in JSONL (JSON Lines) format: every line is a valid JSON object with an `input` field holding the prompt and an `output` field holding the expected response.

Example format:

```json
{"input": "English: What does 'aheprariyo' mean in Yanomami? => Yanomami:", "output": "aheprariyo means 'to be happy' in Yanomami."}
{"input": "Yanomami: ahetoimi => English:", "output": "ahetoimi means 'to be sick' in English."}
```

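
A minimal sketch of reading this format in Python, assuming every record carries exactly the `input` and `output` keys shown in the example. The function name `parse_jsonl` and the strict check for missing keys are illustrative choices, not part of the dataset's own tooling.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of records.

    Assumes each non-empty line is a JSON object with "input" and
    "output" keys, as in the example records (an assumption about the
    files, not a documented schema).
    """
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = {"input", "output"} - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        records.append(record)
    return records


sample = (
    '{"input": "Yanomami: ahetoimi => English:", '
    '"output": "ahetoimi means \'to be sick\' in English."}\n'
)
records = parse_jsonl(sample)
print(len(records), records[0]["input"])
```

For a file on disk, the text can be read with `pathlib.Path(...).read_text(encoding="utf-8")` and passed to the same function; `len(parse_jsonl(...))` then gives a per-file count to compare against the table of files.
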
## Dataset Generation

This dataset was generated with the [AI Dataset Generator](https://www.npmjs.com/package/ai-dataset-generator) npm package, which helps create synthetic training data for language models.

## Usage

This dataset is used to fine-tune a GPT-2 model for Yanomami-English translation. The variety of content helps the model learn several aspects of the language, including vocabulary, grammar, and common expressions.

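
One way such records could be turned into training text for a causal language model like GPT-2 is to join the prompt and the answer into a single string and end it with GPT-2's end-of-text token. This is only a sketch under assumptions: the single-space separator and the helper name `to_training_text` are hypothetical, and the project's actual preprocessing may differ.

```python
# GPT-2's end-of-text token, used here to mark where one example stops.
EOS_TOKEN = "<|endoftext|>"


def to_training_text(record, eos_token=EOS_TOKEN):
    """Join a record's prompt and answer into one training string.

    Hypothetical helper: the single-space separator is an assumption,
    not taken from the dataset's own training code.
    """
    prompt = record["input"].strip()
    answer = record["output"].strip()
    return f"{prompt} {answer}{eos_token}"


record = {
    "input": "Yanomami: ahetoimi => English:",
    "output": "ahetoimi means 'to be sick' in English.",
}
print(to_training_text(record))
```

Keeping the prompt format identical at training and inference time matters more than the particular separator, since the model learns to continue text that ends in `=> English:` or `=> Yanomami:`.
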
## License

This dataset is provided for research and educational purposes. Please respect the Yanomami culture and language when using this data.

## Citation

If you use this dataset in your research or applications, please cite:

```
@misc{yanomami-english-dataset,
  author = {Renan Serrano},
  title = {Yanomami Language Dataset},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/renantrendt/yanomami-finetuning}}
}
```