lfsm committed on
Commit 6f882d7 · 1 Parent(s): 7387f96

Create README.md

Files changed (1)
  1. README.md +12 -0
README.md ADDED
@@ -0,0 +1,12 @@
+ ## CC_FILTER
+ This is a ja CC filter for pages referenced from ja Wikipedia vs. random ja mC4 pages. It was built with the following procedure.
+ 1. Get the ja Wikipedia dump file and extract all URLs in it, which yields about 4M URLs.
+ 2. Crawl 300K of the 4M web pages from those URLs.
+ 3. Extract the plain text and remove pages whose content length is less than 1k.
+ 4. Use langdetect to detect the language of each page.
+ This leaves **16K** pages in total: **10K** ja pages, **5K** en pages, and **1K** pages in other languages.
+ 5. Randomly sample 16K pages from ja mC4 and concatenate them with all 16K pages to get lang_all.txt.
+ 6. Randomly sample 10K pages from ja mC4 and concatenate them with the 10K ja pages to get lang_ja.txt.
+ 7. Tokenize all text with "cl-tohoku/bert-base-japanese".
+ 8. Feed lang_all.txt to fastText to get model_all.bin.
+ 9. Feed lang_ja.txt to fastText to get model_ja.bin.
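
The data-preparation steps (3, 5, 6 and 7) could be sketched roughly as follows. This is a minimal illustration, not the author's script. It assumes that "1k" means 1,000 characters and that fastText is trained in supervised mode on `__label__` lines. The label names `wiki_ref` and `mc4_random`, the helper names, and the whitespace tokenizer used as a stand-in are all hypothetical. It also tokenizes while building each line rather than as a separate pass.

```python
import random
from typing import Callable, Iterable, List

MIN_CHARS = 1000  # assumption: "1k" in step 3 is read as 1,000 characters


def keep_long_pages(pages: Iterable[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Step 3: keep only pages whose stripped text has at least `min_chars` characters."""
    return [p for p in pages if len(p.strip()) >= min_chars]


def to_fasttext_line(label: str, text: str, tokenize: Callable[[str], List[str]]) -> str:
    """Step 7: tokenize one page and render it as a single labelled line."""
    tokens = tokenize(text.replace("\n", " "))
    return f"__label__{label} " + " ".join(tokens)


def build_training_lines(
    wiki_pages: Iterable[str],
    mc4_pages: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # placeholder tokenizer
    seed: int = 0,
) -> List[str]:
    """Steps 5/6: sample as many mC4 pages as there are crawled pages, then concatenate."""
    wiki = list(wiki_pages)
    pool = list(mc4_pages)
    rng = random.Random(seed)
    negatives = rng.sample(pool, k=len(wiki))  # needs len(pool) >= len(wiki)
    lines = [to_fasttext_line("wiki_ref", p, tokenize) for p in wiki]
    lines += [to_fasttext_line("mc4_random", p, tokenize) for p in negatives]
    rng.shuffle(lines)
    return lines


# With the real tools installed, the placeholder could be swapped for something like:
#   tok = transformers.AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
#   lines = build_training_lines(wiki, mc4, tokenize=tok.tokenize)
# and, after writing the lines to lang_all.txt (steps 8/9, assuming supervised mode):
#   model = fasttext.train_supervised(input="lang_all.txt")
#   model.save_model("model_all.bin")
```

Running the same function on only the 10K ja pages would give the lines for lang_ja.txt.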