Could you elaborate a bit more on the details of the corpora you used to develop it?

#1
by abxda - opened

Could you please share more details about the corpora you used to develop it, and let us know if it would be possible to gain access to them in order to replicate the results and learn more about your process?

Hi! You can find the dataset here: Axolotl-Spanish-Nahuatl. The code is a standard SFT job on a quantized version of Mistral-7B. The point was to make a demo for this use case: an instruct-style model fine-tuned on an indigenous-language dataset. I can also share the code if you want, but I'd recommend using a more recent SFT codebase. My script dates back to ancient times (the summer of 2023, when Llama 2 came out).
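A minimal sketch of the data-preparation step such an SFT job implies: turning each Spanish–Nahuatl pair into an instruct-style prompt/response record that a trainer (e.g. TRL's SFTTrainer) can consume. The field names `sp` and `nah`, the prompt wording, and the example sentences are assumptions for illustration, not taken from the actual dataset card or the author's script.

```python
# Sketch: format a Spanish -> Nahuatl translation pair into an
# instruct-style SFT record. Field names "sp" and "nah" are assumed,
# not confirmed against the Axolotl-Spanish-Nahuatl schema.

def to_instruct_example(pair: dict) -> dict:
    """Turn a {"sp": ..., "nah": ...} pair into a single
    prompt/response training record for supervised fine-tuning."""
    prompt = (
        "Translate the following sentence from Spanish to Nahuatl.\n\n"
        f"Spanish: {pair['sp']}\nNahuatl:"
    )
    # Leading space so the response tokenizes cleanly after the colon.
    return {"prompt": prompt, "response": " " + pair["nah"]}

# Hypothetical example pair, for illustration only.
example = to_instruct_example({"sp": "buenos días", "nah": "cualli tonalli"})
print(example["prompt"])
print(example["response"])
```

From records like these, the rest of the job is standard: tokenize prompt + response, mask the prompt tokens in the loss, and fine-tune the quantized base model (typically via LoRA adapters).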
