Update README.md
README.md CHANGED
@@ -71,7 +71,7 @@ The 9,692,996 sentence pairs of synthetic parallel data were created by translat
#### Preprocessing

-After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer)
+After concatenation, all datasets are cleaned and deduplicated using [bifixer](https://github.com/bitextor/bifixer) [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/) to identify repetitions and fix encoding problems, and LaBSE embeddings to filter out misaligned sentences. Any sentence pair with a LaBSE similarity score below 0.5 is removed. The filtered corpus is composed of 10,582,279 parallel sentences.
#### Tokenization
All data is tokenized using SentencePiece, with a 32,000-token model learned from the combination of all filtered training data. This model is included.
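For reference, the LaBSE filtering step described in the new paragraph could look roughly like the sketch below. This is not part of the commit or the repository's actual pipeline; it is a minimal illustration assuming the `sentence-transformers` package, with a hypothetical `filter_misaligned` helper. Only the 0.5 similarity threshold comes from the README text.

```python
# Minimal sketch of LaBSE-based filtering of misaligned sentence pairs.
# Assumes the sentence-transformers package; not the repository's code.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

def filter_misaligned(src_sentences, tgt_sentences, threshold=0.5):
    """Keep only pairs whose LaBSE cosine similarity is >= threshold."""
    src_emb = model.encode(src_sentences, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)
    # With L2-normalized embeddings, the dot product equals cosine similarity.
    scores = np.sum(src_emb * tgt_emb, axis=1)
    return [
        (src, tgt)
        for src, tgt, score in zip(src_sentences, tgt_sentences, scores)
        if score >= threshold
    ]
```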
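Similarly, learning and applying the 32,000-token SentencePiece model might look like the following sketch. The file names (`train.all.txt`, `spm32k`) are hypothetical placeholders, not paths from this repository.

```python
# Sketch of training a 32k SentencePiece model on the combined filtered data,
# then tokenizing with it. File names are illustrative placeholders.
import sentencepiece as spm

# Learn a single model from the combination of all filtered training data.
spm.SentencePieceTrainer.train(
    input="train.all.txt",    # concatenated, filtered training data
    model_prefix="spm32k",    # writes spm32k.model / spm32k.vocab
    vocab_size=32000,
)

# Tokenize text with the learned model (the README ships such a model).
sp = spm.SentencePieceProcessor(model_file="spm32k.model")
pieces = sp.encode("Hello world", out_type=str)
print(pieces)
```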