{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../FinNLP\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "import shutil\n",
    "import pandas as pd\n",
    "from finnlp.data_engineering.data_cleaning import * "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Downloading sample data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_parquet_files(url_list, local_dir):\n",
    "    for url in url_list:\n",
    "        file_name = url.split('/')[-1]\n",
    "        local_file = os.path.join(local_dir, file_name)\n",
    "        if not os.path.exists(local_dir):\n",
    "            os.makedirs(local_dir)\n",
    "\n",
    "        r = requests.get(url, stream=True)\n",
    "        if r.status_code == 200:\n",
    "            with open(local_file, 'wb+') as f:\n",
    "                r.raw.decode_content = True\n",
    "                shutil.copyfileobj(r.raw, f)\n",
    "        else:\n",
    "            print('download failed: ', url)\n",
    "\n",
    "def web_data_prepare(name):\n",
    "    r = requests.get(\"https://datasets-server.huggingface.co/parquet?dataset=\"+name)\n",
    "    j = r.json()\n",
    "    urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']\n",
    "    train_urls = [f['url'] for f in j['parquet_files'] if f['config'] == name and f['split'] == 'train']\n",
    "    test_urls = [f['url'] for f in j['parquet_files'] if f['config'] == name and f['split'] == 'validation']\n",
    "    download_parquet_files(train_urls, 'train_dataset')\n",
    "    download_parquet_files(test_urls, 'test_dataset')\n",
    "\n",
    "    train_dataset = pd.read_parquet('./train_dataset', engine='pyarrow')\n",
    "    test_dataset = pd.read_parquet('./test_dataset', engine='pyarrow')\n",
    "\n",
    "    # train_dataset.rebalance()\n",
    "    # test_dataset.rebalance()\n",
    "    return train_dataset, test_dataset\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>date</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>horoscope</th>\n",
       "      <th>job</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>395587</th>\n",
       "      <td>it has just hit me that in a matter of a few s...</td>\n",
       "      <td>04,August,2004</td>\n",
       "      <td>female</td>\n",
       "      <td>27</td>\n",
       "      <td>Virgo</td>\n",
       "      <td>indUnk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>275435</th>\n",
       "      <td>Ok, Dear Dopugie/ Ben/ Matt/ anyone who can he...</td>\n",
       "      <td>27,June,2004</td>\n",
       "      <td>male</td>\n",
       "      <td>14</td>\n",
       "      <td>Sagittarius</td>\n",
       "      <td>Student</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>634637</th>\n",
       "      <td>ooyeah/Happy Roctober = Miriam?  Anyway, just ...</td>\n",
       "      <td>03,August,2004</td>\n",
       "      <td>female</td>\n",
       "      <td>24</td>\n",
       "      <td>Libra</td>\n",
       "      <td>Arts</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>264675</th>\n",
       "      <td>Election season is in the air! And I know that...</td>\n",
       "      <td>19,February,2004</td>\n",
       "      <td>male</td>\n",
       "      <td>24</td>\n",
       "      <td>Gemini</td>\n",
       "      <td>indUnk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>628907</th>\n",
       "      <td>His face bore the the signs of the many battle...</td>\n",
       "      <td>03,February,2004</td>\n",
       "      <td>male</td>\n",
       "      <td>24</td>\n",
       "      <td>Libra</td>\n",
       "      <td>indUnk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41392</th>\n",
       "      <td>Everything You Wanted to Know About Oscillatin...</td>\n",
       "      <td>04,October,2003</td>\n",
       "      <td>female</td>\n",
       "      <td>40</td>\n",
       "      <td>Gemini</td>\n",
       "      <td>Law</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71956</th>\n",
       "      <td>Sick Chickens   March has not been a good mont...</td>\n",
       "      <td>08,August,2004</td>\n",
       "      <td>male</td>\n",
       "      <td>26</td>\n",
       "      <td>Capricorn</td>\n",
       "      <td>Communications-Media</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21415</th>\n",
       "      <td>Gunshot as you are more likely to die quickly ...</td>\n",
       "      <td>26,March,2003</td>\n",
       "      <td>male</td>\n",
       "      <td>17</td>\n",
       "      <td>Taurus</td>\n",
       "      <td>Technology</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>255686</th>\n",
       "      <td>Well I am sad to report that my Uncle has pass...</td>\n",
       "      <td>07,March,2004</td>\n",
       "      <td>female</td>\n",
       "      <td>27</td>\n",
       "      <td>Capricorn</td>\n",
       "      <td>indUnk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>401130</th>\n",
       "      <td>urlLink    A picture from the 'Precious Moment...</td>\n",
       "      <td>10,June,2004</td>\n",
       "      <td>male</td>\n",
       "      <td>33</td>\n",
       "      <td>Gemini</td>\n",
       "      <td>Technology</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6898 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                     text              date  \\\n",
       "395587  it has just hit me that in a matter of a few s...    04,August,2004   \n",
       "275435  Ok, Dear Dopugie/ Ben/ Matt/ anyone who can he...      27,June,2004   \n",
       "634637  ooyeah/Happy Roctober = Miriam?  Anyway, just ...    03,August,2004   \n",
       "264675  Election season is in the air! And I know that...  19,February,2004   \n",
       "628907  His face bore the the signs of the many battle...  03,February,2004   \n",
       "...                                                   ...               ...   \n",
       "41392   Everything You Wanted to Know About Oscillatin...   04,October,2003   \n",
       "71956   Sick Chickens   March has not been a good mont...    08,August,2004   \n",
       "21415   Gunshot as you are more likely to die quickly ...     26,March,2003   \n",
       "255686  Well I am sad to report that my Uncle has pass...     07,March,2004   \n",
       "401130  urlLink    A picture from the 'Precious Moment...      10,June,2004   \n",
       "\n",
       "        gender  age    horoscope                   job  \n",
       "395587  female   27        Virgo                indUnk  \n",
       "275435    male   14  Sagittarius               Student  \n",
       "634637  female   24        Libra                  Arts  \n",
       "264675    male   24       Gemini                indUnk  \n",
       "628907    male   24        Libra                indUnk  \n",
       "...        ...  ...          ...                   ...  \n",
       "41392   female   40       Gemini                   Law  \n",
       "71956     male   26    Capricorn  Communications-Media  \n",
       "21415     male   17       Taurus            Technology  \n",
       "255686  female   27    Capricorn                indUnk  \n",
       "401130    male   33       Gemini            Technology  \n",
       "\n",
       "[6898 rows x 6 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "name = 'blog_authorship_corpus'\n",
    "sample_ratio = 0.01\n",
    "# train_dataset, test_dataset=web_data_prepare(name = name)\n",
    "train_dataset = pd.read_parquet('./train_dataset', engine='pyarrow')\n",
    "test_dataset = pd.read_parquet('./test_dataset', engine='pyarrow')\n",
    "train_dataset=train_dataset.sample(frac=sample_ratio)\n",
    "test_dataset=test_dataset.sample(frac=sample_ratio)\n",
    "train_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Junk Data\n",
    "Large-scale datasets often contain an uneven distribution of text quality, including a significant amount of nonsensical and boilerplate text, such as HTML tags.\n",
    "\n",
    "The presence of such \"noise\" or irrelevant content in the dataset is detrimental to the training of predictive models, specifically those that operate by predicting the next token based on all previous ones. Therefore, it's crucial to clean the dataset and remove these undesired elements prior to the training phase.\n",
    "\n",
    "The Python code below calculates a measure of \"impurity\" in text documents and then computes the proportion of documents that exceed a certain impurity threshold. It defines a compiled regular expression that matches any of the following suspicious characters: &, #, <, >, {, }, [, ]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.049724557842853"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re_e = r'[&#<>{}\\[\\]\\\\]'\n",
    "threshold = 0.01\n",
    "df, impurity_ratio = junk_eliminate(train_dataset, re_e, threshold)\n",
    "impurity_ratio"
   ]
  },
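  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal, self-contained sketch of the impurity measure described above (an illustration, not the `junk_eliminate` implementation from FinNLP). It reuses `re_e`, `threshold`, and the `text` column defined earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "junk_re = re.compile(re_e)  # suspicious characters defined above\n",
    "\n",
    "def impurity(text, min_len=10):\n",
    "    # Share of suspicious characters in a document (0 if too short to judge).\n",
    "    if not text or len(text) < min_len:\n",
    "        return 0.0\n",
    "    return len(junk_re.findall(text)) / len(text)\n",
    "\n",
    "impurity_scores = train_dataset['text'].apply(impurity)\n",
    "(impurity_scores > threshold).mean()  # proportion of \"junk\" documents"
   ]
  },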
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Biased Content\n",
    "When training language models, it is crucial to be vigilant and, where possible, apply tools to exclude toxic content from the pre-training datasets. This practice helps prevent the models from exhibiting bias or generating harmful content in downstream applications.\n",
    "\n",
    "One approach to addressing this issue is to scan the text for offensive words. For instance, the creators of the C4 dataset implemented such a filtering mechanism and open-sourced the word list they used.\n",
    "\n",
    "The following code uses that word list to quantify the \"biased content ratio\" in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1284430269643375"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "non_toxic_df, biased_content_ratio = toxic_eliminate(df = train_dataset, l_kind = 'en')\n",
    "biased_content_ratio"
   ]
  },
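  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of word-list filtering (an illustration, not the `toxic_eliminate` implementation from FinNLP). The offensive-word list is assumed to be available locally; `bad_words_en.txt` is a hypothetical path, so substitute your own copy of the list (for example, the English list referenced by C4)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load an offensive-word list into a set (hypothetical local file path).\n",
    "with open('bad_words_en.txt') as f:\n",
    "    bad_words = set(w.strip().lower() for w in f if w.strip())\n",
    "\n",
    "def contains_bad_word(text):\n",
    "    # Flag a document if any whitespace-separated token is on the list.\n",
    "    return not set(text.lower().split()).isdisjoint(bad_words)\n",
    "\n",
    "flagged = train_dataset['text'].apply(contains_bad_word)\n",
    "flagged.mean()  # share of documents containing at least one listed word"
   ]
  },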
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Too-Short Documents\n",
    "The aim of language modeling is to learn to generate text conditioned on the preceding tokens. In this scenario, removing extremely short documents (fewer than roughly 100 tokens) from the corpus can reduce noise and provide enough contiguous text for the model to learn dependencies within it.\n",
    "\n",
    "The code below uses the Hugging Face Transformers library to tokenize the text and then calculates the proportion of documents in the dataset that are \"too short\". This example converts text into tokens that the BERT model can understand; choose the tokenizer that matches your own model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.40895911858509715"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n",
    "df = train_dataset\n",
    "not_short_df, too_short_doc_ratio = short_eliminate(df, tokenizer, min_len=100)\n",
    "too_short_doc_ratio"
   ]
  },
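  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the length filter (an illustration, not the `short_eliminate` implementation from FinNLP), reusing the BERT tokenizer loaded above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count tokens per document and measure the share below the 100-token cutoff.\n",
    "doc_lengths = train_dataset['text'].apply(lambda t: len(tokenizer.tokenize(t)))\n",
    "(doc_lengths < 100).mean()"
   ]
  },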
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Contamination\n",
    "Typically, ensuring the segregation of training and testing data is rather straightforward in machine learning. However, things become complicated in the context of large language models, where both the training and benchmarking datasets are collected from the internet.\n",
    "\n",
    "For instance, the performance evaluation of a large language model using benchmark data (like question-answer pairs) can be significantly affected if the benchmark data also features in the model's training set. The procedure of eliminating instances from the training datasets that intersect with the existing benchmarking datasets is called \"decontamination\".\n",
    "\n",
    "The Python code below quantifies the contamination problem in the datasets, i.e., the proportion of documents in the test set that also appear (by N-gram overlap) in the training set.\n",
    "\n",
    "The approach here follows the GPT-3 paper: OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document (they used a range of N values between 8 and 13, depending on the dataset). When constructing the WebText dataset, OpenAI researchers decontaminated the data by eliminating all Wikipedia content from the training set. This was necessary because Wikipedia data was heavily used in their benchmark datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.021108179419525065"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "contamination_ratio = contamination_eliminate(train_dataset, test_dataset)\n",
    "contamination_ratio"
   ]
  },
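  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of N-gram decontamination in the spirit of the GPT-3 procedure (an illustration, not the `contamination_eliminate` implementation from FinNLP): a test document is flagged as contaminated if any of its 8-grams also appears in some training document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ngrams(text, n=8):\n",
    "    # Set of word-level n-grams for one document.\n",
    "    words = text.lower().split()\n",
    "    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}\n",
    "\n",
    "# Index every n-gram that occurs anywhere in the training set.\n",
    "train_ngrams = set()\n",
    "for doc in train_dataset['text']:\n",
    "    train_ngrams |= ngrams(doc)\n",
    "\n",
    "contaminated = test_dataset['text'].apply(lambda t: not ngrams(t).isdisjoint(train_ngrams))\n",
    "contaminated.mean()  # share of test documents overlapping the training set"
   ]
  },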
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Duplication\n",
    "When datasets are created by scraping raw text from the Internet, the same sequences often end up repeated many times. Prior work on deduplication reports a single 50-word sequence that is repeated 60,000 times in the C4 dataset.\n",
    "\n",
    "Deduplication helps prevent models from outputting verbatim training data when there are many duplicates, and makes models less vulnerable to privacy attacks. Deduplication can also improve model training efficiency and prevent benchmark contamination.\n",
    "\n",
    "**Tools & Tutorials**\n",
    "\n",
    "The GPT-3 paper mentions they fuzzily deduplicated documents within each dataset using Spark’s MinHashLSH implementation with 10 hashes.\n",
    "\n",
    "deduplicate-text-datasets is an ExactSubstr deduplication implementation (written in Rust) along with the scripts to perform ExactSubstr deduplication and inspect the results (written in Python).\n",
    "\n",
    "datasketch gives you probabilistic data structures that can process and search very large amounts of data very quickly, with little loss of accuracy.\n",
    "\n",
    "This article provides a MinHash walkthrough to demonstrate how to implement parallel deduplication.\n",
    "\n",
    "The following code uses the datasketch library and LSH (Locality Sensitive Hashing) to deduplicate the dataset. For each text in the DataFrame, it creates a query MinHash object and performs a query on the LSH index to find similar documents.\n",
    "\n",
    "It is worth mentioning that the deduplication process usually requires substantial computational resources (CPU and RAM) due to the size of web-crawl datasets, so it is recommended to run such computations in a distributed setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_documents, duplication_ratio = duplication_eliminate(train_dataset.sample(frac=sample_ratio))\n",
    "duplication_ratio"
   ]
  },
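  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of MinHash-LSH deduplication with `datasketch` (an illustration, not the `duplication_eliminate` implementation from FinNLP): a document counts as a near-duplicate if the LSH index already holds a document whose estimated Jaccard similarity exceeds the threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasketch import MinHash, MinHashLSH\n",
    "\n",
    "lsh = MinHashLSH(threshold=0.85, num_perm=128)\n",
    "duplicates = 0\n",
    "for idx, text in train_dataset['text'].items():\n",
    "    m = MinHash(num_perm=128)\n",
    "    for token in set(text.lower().split()):\n",
    "        m.update(token.encode('utf8'))\n",
    "    if lsh.query(m):  # a similar document has already been indexed\n",
    "        duplicates += 1\n",
    "    else:\n",
    "        lsh.insert(str(idx), m)\n",
    "\n",
    "duplicates / len(train_dataset)  # estimated duplication ratio"
   ]
  },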
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thanks to [HuggingFace-Datasets-Text-Quality-Analysis](https://huggingface.co/spaces/Dreamsome/HuggingFace-Datasets-Text-Quality-Analysis)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}