atiwari751 committed
Commit b0f8dcf · 1 parent: dea5ea1

english trial long final

BPE.py CHANGED
@@ -3,7 +3,7 @@ import regex as re
 from tqdm import tqdm
 
 # Read text from a file
-with open('text_file_eng.txt', 'r', encoding='utf-8') as file:
+with open('text_file_eng_long.txt', 'r', encoding='utf-8') as file:
     text = file.read()
 
 # Define the GPT-2 regex pattern
@@ -50,7 +50,7 @@ def merge(token_list, pair, idx):
     return newids
 
 def perform_bpe():
-    vocab_size = 1500 # the desired final vocabulary size
+    vocab_size = 3500 # the desired final vocabulary size
     num_merges = vocab_size - 256
     token_list = list(tokens) # copy so we don't destroy the original list
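Note on the change above: with `num_merges = vocab_size - 256`, raising `vocab_size` from 1500 to 3500 raises the number of merges from 1244 to 3244. The following is a minimal sketch of how such a merge loop could look. It is not the code in BPE.py: `get_stats` is a hypothetical helper name, `perform_bpe` here takes its inputs as arguments, and the GPT-2 regex pre-splitting is left out for brevity.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of ids occurs (hypothetical helper)."""
    return Counter(zip(ids, ids[1:]))


def merge(token_list, pair, idx):
    """Replace every occurrence of `pair` in token_list with the new id `idx`."""
    newids = []
    i = 0
    while i < len(token_list):
        if i < len(token_list) - 1 and (token_list[i], token_list[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(token_list[i])
            i += 1
    return newids


def perform_bpe(tokens, vocab_size):
    """Run up to vocab_size - 256 merges; ids 0..255 are the raw byte values."""
    num_merges = vocab_size - 256
    token_list = list(tokens)  # copy so the original list is not modified
    merges = {}
    for i in range(num_merges):
        stats = get_stats(token_list)
        if not stats:
            break  # nothing left to merge
        pair = max(stats, key=stats.get)  # most frequent adjacent pair
        idx = 256 + i
        token_list = merge(token_list, pair, idx)
        merges[pair] = idx
    return token_list, merges


ids, merges = perform_bpe(list("aaabdaaabac".encode("utf-8")), vocab_size=259)
```

In this toy run, three merges shrink the 11-byte input to 5 tokens. A larger `vocab_size` only helps if the training text is long enough to supply that many distinct frequent pairs, which is presumably why this commit also switches to a longer input file.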
 
decoded_output.txt CHANGED
@@ -1 +1 @@
-There 's a chan ce this is not wor king , is n 't it ? The re ' re many p a p ers , why will this work ? I ' ve g ot to make su re . I ' m now think ing s om eth ing 's w r ong . I t 'll be s ad if there 's s om eth ing w r ong and I mis s it , I 'll be s or ry . I t 'd bet ter be re v ie w ed well , I 'd want to be c er tain .
+There 's a chance this is not working , isn 't it ? There ' re many p ap ers , why will this work ? I ' ve got to make su re . I ' m now th in king something 's wr ong . It 'll be s ad if there 's something wr ong and I miss it , I 'll be s or ry . It 'd better be re vi ew ed well , I 'd want to be certain .
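Note on the output above: the larger vocabulary splits the same sentence into fewer pieces. A rough way to quantify that is to count whitespace-separated pieces in the two outputs. The sketch below does this for one excerpt copied from each side of the diff. It assumes the decoded output separates tokens with single spaces, which is how the text appears here but is not confirmed by the code shown. `count_pieces` is a hypothetical helper.

```python
def count_pieces(decoded: str) -> int:
    """Count whitespace-separated pieces as a rough proxy for token count."""
    return len(decoded.split())


# Excerpts copied from the old and new decoded_output.txt lines.
old_excerpt = "s om eth ing 's w r ong"
new_excerpt = "something 's wr ong"

old_count = count_pieces(old_excerpt)
new_count = count_pieces(new_excerpt)
print(old_count, new_count)
```

For this excerpt the count drops from 8 pieces to 4, mostly because "something" becomes a single token after the additional merges.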
encode_decode.py CHANGED
@@ -27,7 +27,7 @@ def decode(ids):
     return text
 
 # Example: Decode a list of IDs
-set_of_ids = [1072, 342, 259, 1406, 338, 409, 332, 330, 545, 733, 44, 332, 110, 642, 336, 63, 1044, 263, 39, 263, 1496, 285, 97, 112, 410, 44, 1045, 421, 409, 915, 63, 301, 39, 299, 288, 305, 290, 496, 578, 263, 46, 301, 39, 109, 753, 619, 312, 261, 292, 879, 312, 342, 262, 114, 546, 46, 301, 116, 415, 308, 261, 329, 538, 1129, 342, 261, 292, 879, 312, 262, 114, 546, 304, 301, 1238, 115, 336, 44, 301, 415, 308, 261, 274, 567, 46, 301, 116, 373, 1179, 450, 308, 348, 118, 362, 119, 310, 500, 44, 301, 373, 1282, 290, 308, 282, 271, 1167, 46]
+set_of_ids = [2532, 522, 258, 3103, 425, 332, 374, 2797, 44, 2391, 1508, 369, 63, 1375, 39, 261, 972, 277, 641, 385, 44, 2208, 553, 425, 1592, 63, 330, 39, 318, 1088, 285, 843, 405, 261, 46, 330, 39, 109, 1070, 325, 259, 888, 2913, 522, 1796, 524, 46, 966, 824, 306, 262, 354, 820, 726, 522, 2913, 1796, 524, 294, 330, 2827, 369, 44, 330, 824, 306, 262, 279, 551, 46, 966, 672, 2988, 306, 301, 3188, 451, 270, 814, 44, 330, 672, 1726, 285, 306, 1475, 46]
 decoded_text = decode(set_of_ids) # Pass the list of IDs
 print(decoded_text)
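Note on the change above: the example IDs had to be regenerated because merge IDs are assigned during training, so IDs produced with the 1500-token vocabulary are not meaningful under the retrained 3500-token one. Only IDs below 256, such as 44 and 46, keep their meaning, assuming they are raw byte values as `num_merges = vocab_size - 256` suggests. The following is a minimal sketch of byte-level decoding under that assumption. The `merges` mapping, `build_vocab`, and `decode_pieces` are hypothetical names, and the real `decode` in encode_decode.py may differ.

```python
def build_vocab(merges):
    """Map every id to its bytes: 0..255 are single bytes, merged ids concatenate."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Process merges in id order so both halves already exist in vocab.
    for (left, right), idx in sorted(merges.items(), key=lambda item: item[1]):
        vocab[idx] = vocab[left] + vocab[right]
    return vocab


def decode(ids, vocab):
    """Join the bytes of all tokens, then decode once as UTF-8."""
    raw = b"".join(vocab[i] for i in ids)
    return raw.decode("utf-8", errors="replace")


def decode_pieces(ids, vocab):
    """Decode each token separately, useful for inspecting token boundaries."""
    return [vocab[i].decode("utf-8", errors="replace") for i in ids]


# Toy merges table: 256 = "aa", 257 = "aaa".
merges = {(97, 97): 256, (256, 97): 257}
vocab = build_vocab(merges)
text = decode([257, 98, 44], vocab)
pieces = decode_pieces([257, 98, 44], vocab)
```

Decoding the joined bytes in one pass avoids replacement characters when a multi-byte UTF-8 character is split across tokens. The per-piece variant is only meant for inspection, such as producing space-separated output like decoded_output.txt.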
 
text_file_eng_long.txt ADDED
The diff for this file is too large to render.
 
text_file_eng_short.txt DELETED
@@ -1 +0,0 @@
-There's a chance this is not working, isn't it? There're many papers, why will this work? I've got to make sure. I'm now thinking something's wrong. It'll be sad if there's something wrong and I miss it, I'll be sorry. It'd better be reviewed well, I'd want to be certain.