Spaces: atiwari751/Hindi-tokenizer

atiwari751 committed on Jan 10

Commit b0f8dcf

1 Parent(s): dea5ea1

english trial long final

Files changed (5)

BPE.py CHANGED

@@ -3,7 +3,7 @@ import regex as re
 from tqdm import tqdm
 # Read text from a file
-with open('text_file_eng.txt', 'r', encoding='utf-8') as file:
+with open('text_file_eng_long.txt', 'r', encoding='utf-8') as file:
     text = file.read()
 # Define the GPT-2 regex pattern
@@ -50,7 +50,7 @@ def merge(token_list, pair, idx):
     return newids
 def perform_bpe():
-    vocab_size = 1500  # the desired final vocabulary size
+    vocab_size = 3500  # the desired final vocabulary size
     num_merges = vocab_size - 256
     token_list = list(tokens)  # copy so we don't destroy the original list
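
For context, the `vocab_size` change raises `num_merges` from 1500 - 256 = 1244 to 3500 - 256 = 3244. Subtracting 256 suggests that IDs 0 to 255 are reserved for raw byte values and each merge adds one new ID. The block below is a minimal sketch of a byte-level BPE training loop under that assumption. `get_stats`, `train_bpe`, and the toy input are hypothetical names for illustration and are not taken from BPE.py; the body of `merge` is a guess based only on its signature and return value in the hunk header. The GPT-2 regex pre-split that BPE.py applies is omitted here.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token IDs occurs (hypothetical helper)."""
    return Counter(zip(ids, ids[1:]))


def merge(token_list, pair, idx):
    """Replace every occurrence of `pair` in token_list with the new ID `idx`."""
    newids = []
    i = 0
    while i < len(token_list):
        if i < len(token_list) - 1 and (token_list[i], token_list[i + 1]) == pair:
            newids.append(idx)
            i += 2  # skip both members of the merged pair
        else:
            newids.append(token_list[i])
            i += 1
    return newids


def train_bpe(text, vocab_size):
    """Hypothetical trainer: learn up to vocab_size - 256 merges over UTF-8 bytes."""
    num_merges = vocab_size - 256
    token_list = list(text.encode("utf-8"))  # start from raw byte values 0..255
    merges = {}
    for i in range(num_merges):
        stats = get_stats(token_list)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        idx = 256 + i  # new IDs start right after the byte range
        merges[pair] = idx
        token_list = merge(token_list, pair, idx)
    return merges, token_list


# Toy run: three merges on a short string (assumed input, not the repo's text file)
merges, ids = train_bpe("aaabdaaabac", vocab_size=259)
print(len(merges), ids)
```

A larger merge budget only helps if the training text has enough repeated pairs to fill it, which is presumably why the commit switches to a longer input file at the same time.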

decoded_output.txt CHANGED

@@ -1 +1 @@
-There 's a chan ce this is not wor king , is n 't it ? The re ' re many p a p ers , why will this work ? I ' ve g ot to make su re . I ' m now think ing s om eth ing 's w r ong . I t 'll be s ad if there 's s om eth ing w r ong and I mis s it , I 'll be s or ry . I t 'd bet ter be re v ie w ed well , I 'd want to be c er tain .
+There 's a chance this is not working , isn 't it ? There ' re many p ap ers , why will this work ? I ' ve got to make su re . I ' m now th in king something 's wr ong . It 'll be s ad if there 's something wr ong and I miss it , I 'll be s or ry . It 'd better be re vi ew ed well , I 'd want to be certain .

encode_decode.py CHANGED

@@ -27,7 +27,7 @@ def decode(ids):
     return text
 # Example: Decode a list of IDs
-set_of_ids = [1072, 342, 259, 1406, 338, 409, 332, 330, 545, 733, 44, 332, 110, 642, 336, 63, 1044, 263, 39, 263, 1496, 285, 97, 112, 410, 44, 1045, 421, 409, 915, 63, 301, 39, 299, 288, 305, 290, 496, 578, 263, 46, 301, 39, 109, 753, 619, 312, 261, 292, 879, 312, 342, 262, 114, 546, 46, 301, 116, 415, 308, 261, 329, 538, 1129, 342, 261, 292, 879, 312, 262, 114, 546, 304, 301, 1238, 115, 336, 44, 301, 415, 308, 261, 274, 567, 46, 301, 116, 373, 1179, 450, 308, 348, 118, 362, 119, 310, 500, 44, 301, 373, 1282, 290, 308, 282, 271, 1167, 46]
+set_of_ids = [2532, 522, 258, 3103, 425, 332, 374, 2797, 44, 2391, 1508, 369, 63, 1375, 39, 261, 972, 277, 641, 385, 44, 2208, 553, 425, 1592, 63, 330, 39, 318, 1088, 285, 843, 405, 261, 46, 330, 39, 109, 1070, 325, 259, 888, 2913, 522, 1796, 524, 46, 966, 824, 306, 262, 354, 820, 726, 522, 2913, 1796, 524, 294, 330, 2827, 369, 44, 330, 824, 306, 262, 279, 551, 46, 966, 672, 2988, 306, 301, 3188, 451, 270, 814, 44, 330, 672, 1726, 285, 306, 1475, 46]
 decoded_text = decode(set_of_ids)  # Pass the list of IDs
 print(decoded_text)
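
The ID lists in this hunk are the input to `decode`. As a rough illustration of how byte-level BPE decoding typically works, the sketch below rebuilds a vocabulary from a merge table and concatenates each token's bytes before decoding as UTF-8. `build_vocab`, the two-argument form of `decode`, and the toy `merges` table are assumptions for illustration and do not reflect the actual contents of encode_decode.py.

```python
def build_vocab(merges):
    """Map each token ID to its byte string; IDs 0-255 are single bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Apply merges in the order they were learned (ascending new ID),
    # so both halves of a pair are always defined before the pair itself.
    for (left, right), idx in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[idx] = vocab[left] + vocab[right]
    return vocab


def decode(ids, vocab):
    """Concatenate token bytes and decode as UTF-8, replacing invalid sequences."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Toy merge table (hypothetical): 256 = "th", 257 = "the"
merges = {(116, 104): 256, (256, 101): 257}
vocab = build_vocab(merges)
text = decode([257, 32, 256, 105, 110], vocab)
print(text)
```

Under this scheme, IDs below 256 decode to single bytes, which fits the small IDs in the lists above (for example 44, 63, and 46 would be the ASCII codes for a comma, a question mark, and a period).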

text_file_eng_long.txt ADDED

The diff for this file is too large to render. See raw diff

text_file_eng_short.txt DELETED

@@ -1 +0,0 @@
-There's a chance this is not working, isn't it? There're many papers, why will this work? I've got to make sure. I'm now thinking something's wrong. It'll be sad if there's something wrong and I miss it, I'll be sorry. It'd better be reviewed well, I'd want to be certain.