Commit d9579ea (verified) by renanserrano · 1 parent: 7b85f6d

Upload yanomami_dataset/README.md with huggingface_hub

Files changed (1)
  1. yanomami_dataset/README.md +52 -0
yanomami_dataset/README.md ADDED
@@ -0,0 +1,52 @@
+ # Yanomami Language Dataset
+
+ This dataset was created to train a Yanomami-English translation model. It combines several types of data (general translations, common phrases, grammar, comparisons, and instructional text) so the model can learn different aspects of the Yanomami language and its translation into English.
+
+ ## Dataset Contents
+
+ The dataset consists of the following files:
+
+ | File | Description | Number of examples |
+ |------|-------------|--------------------|
+ | translations.jsonl | General translations between Yanomami and English | 17,009 |
+ | yanomami-to-english.jsonl | Specific Yanomami to English translations | 1,822 |
+ | phrases.jsonl | Common phrases and expressions | 2,322 |
+ | grammar.jsonl | Grammatical structures and rules | 200 |
+ | comparison.jsonl | Comparative phrases and structures | 2,072 |
+ | how-to.jsonl | Instructional content and procedural language | 5,586 |
+
+ ## Data Format
+
+ Each file is in JSON Lines (JSONL) format: one valid JSON object per line, with an `input` field holding the prompt (source text plus the requested direction) and an `output` field holding the expected response.
+
+ Example format:
+ ```json
+ {"input": "English: What does 'aheprariyo' mean in Yanomami? => Yanomami:", "output": "aheprariyo means 'to be happy' in Yanomami."}
+ {"input": "Yanomami: ahetoimi => English:", "output": "ahetoimi means 'to be sick' in English."}
+ ```
+
+ ## Dataset Generation
+
+ This dataset was generated using the [AI Dataset Generator](https://www.npmjs.com/package/ai-dataset-generator) npm package, which helps create synthetic training data for language models.
+
+ ## Usage
+
+ This dataset is used to fine-tune a GPT-2 model for Yanomami-English translation. The mix of file types exposes the model to several aspects of the language, including vocabulary, grammar, and common expressions.
+
+ ## License
+
+ This dataset is provided for research and educational purposes. Please respect the Yanomami culture and language when using this data.
+
+ ## Citation
+
+ If you use this dataset in your research or applications, please cite:
+
+ ```bibtex
+ @misc{yanomami-english-dataset,
+   author = {Renan Serrano},
+   title = {Yanomami Language Dataset},
+   year = {2025},
+   publisher = {GitHub},
+   howpublished = {\url{https://github.com/renantrendt/yanomami-finetuning}}
+ }
+ ```
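
The "Data Format" and "Usage" sections of the README added in this commit describe JSONL records with `input` and `output` fields that feed a GPT-2 fine-tune. A minimal sketch of reading such records and turning them into training strings might look like the following. The helper names `parse_jsonl` and `to_training_text`, the single-space join, and the use of GPT-2's `<|endoftext|>` marker as a record separator are assumptions made for illustration; the README does not say how the author's training script formats its examples.

```python
import json
from typing import Iterable, Iterator

# Two sample records copied from the README's "Example format" block.
SAMPLE_JSONL = """\
{"input": "English: What does 'aheprariyo' mean in Yanomami? => Yanomami:", "output": "aheprariyo means 'to be happy' in Yanomami."}
{"input": "Yanomami: ahetoimi => English:", "output": "ahetoimi means 'to be sick' in English."}
"""

# GPT-2's end-of-text marker; using it as a record separator is an assumption.
EOS = "<|endoftext|>"


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, checking the two expected keys."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = {"input", "output"} - set(record)
        if missing:
            raise ValueError(f"line {number}: missing keys {sorted(missing)}")
        yield record


def to_training_text(record: dict, eos: str = EOS) -> str:
    """Join prompt and response into one string (hypothetical layout)."""
    return f"{record['input']} {record['output']}{eos}"


records = list(parse_jsonl(SAMPLE_JSONL.splitlines()))
texts = [to_training_text(r) for r in records]
print(len(records))
print(texts[1])
```

Because `parse_jsonl` accepts any iterable of lines, an open file handle works as well, for example `with open("translations.jsonl", encoding="utf-8") as f: records = list(parse_jsonl(f))`, assuming the file has been downloaded to the working directory.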