{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"../../FinNLP\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "import requests\n", "import shutil\n", "import pandas as pd\n", "from finnlp.data_engineering.data_cleaning import * " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Downloading sample data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def download_parquet_files(url_list, local_dir):\n", " for url in url_list:\n", " file_name = url.split('/')[-1]\n", " local_file = os.path.join(local_dir, file_name)\n", " if not os.path.exists(local_dir):\n", " os.makedirs(local_dir)\n", "\n", " r = requests.get(url, stream=True)\n", " if r.status_code == 200:\n", " with open(local_file, 'wb+') as f:\n", " r.raw.decode_content = True\n", " shutil.copyfileobj(r.raw, f)\n", " else:\n", " print('download failed: ', url)\n", "\n", "def web_data_prepare(name):\n", " r = requests.get(\"https://datasets-server.huggingface.co/parquet?dataset=\"+name)\n", " j = r.json()\n", " urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']\n", " train_urls = [f['url'] for f in j['parquet_files'] if f['config'] == name and f['split'] == 'train']\n", " test_urls = [f['url'] for f in j['parquet_files'] if f['config'] == name and f['split'] == 'validation']\n", " download_parquet_files(train_urls, 'train_dataset')\n", " download_parquet_files(test_urls, 'test_dataset')\n", "\n", " train_dataset = pd.read_parquet('./train_dataset', engine='pyarrow')\n", " test_dataset = pd.read_parquet('./test_dataset', engine='pyarrow')\n", "\n", " # train_dataset.rebalance()\n", " # test_dataset.rebalance()\n", " return train_dataset, test_dataset\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textdategenderagehoroscopejob
395587it has just hit me that in a matter of a few s...04,August,2004female27VirgoindUnk
275435Ok, Dear Dopugie/ Ben/ Matt/ anyone who can he...27,June,2004male14SagittariusStudent
634637ooyeah/Happy Roctober = Miriam? Anyway, just ...03,August,2004female24LibraArts
264675Election season is in the air! And I know that...19,February,2004male24GeminiindUnk
628907His face bore the the signs of the many battle...03,February,2004male24LibraindUnk
.....................
41392Everything You Wanted to Know About Oscillatin...04,October,2003female40GeminiLaw
71956Sick Chickens March has not been a good mont...08,August,2004male26CapricornCommunications-Media
21415Gunshot as you are more likely to die quickly ...26,March,2003male17TaurusTechnology
255686Well I am sad to report that my Uncle has pass...07,March,2004female27CapricornindUnk
401130urlLink A picture from the 'Precious Moment...10,June,2004male33GeminiTechnology
\n", "

6898 rows × 6 columns

\n", "
" ], "text/plain": [ " text date \\\n", "395587 it has just hit me that in a matter of a few s... 04,August,2004 \n", "275435 Ok, Dear Dopugie/ Ben/ Matt/ anyone who can he... 27,June,2004 \n", "634637 ooyeah/Happy Roctober = Miriam? Anyway, just ... 03,August,2004 \n", "264675 Election season is in the air! And I know that... 19,February,2004 \n", "628907 His face bore the the signs of the many battle... 03,February,2004 \n", "... ... ... \n", "41392 Everything You Wanted to Know About Oscillatin... 04,October,2003 \n", "71956 Sick Chickens March has not been a good mont... 08,August,2004 \n", "21415 Gunshot as you are more likely to die quickly ... 26,March,2003 \n", "255686 Well I am sad to report that my Uncle has pass... 07,March,2004 \n", "401130 urlLink A picture from the 'Precious Moment... 10,June,2004 \n", "\n", " gender age horoscope job \n", "395587 female 27 Virgo indUnk \n", "275435 male 14 Sagittarius Student \n", "634637 female 24 Libra Arts \n", "264675 male 24 Gemini indUnk \n", "628907 male 24 Libra indUnk \n", "... ... ... ... ... \n", "41392 female 40 Gemini Law \n", "71956 male 26 Capricorn Communications-Media \n", "21415 male 17 Taurus Technology \n", "255686 female 27 Capricorn indUnk \n", "401130 male 33 Gemini Technology \n", "\n", "[6898 rows x 6 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name = 'blog_authorship_corpus'\n", "sample_ratio = 0.01\n", "# train_dataset, test_dataset=web_data_prepare(name = name)\n", "train_dataset = pd.read_parquet('./train_dataset', engine='pyarrow')\n", "test_dataset = pd.read_parquet('./test_dataset', engine='pyarrow')\n", "train_dataset=train_dataset.sample(frac=sample_ratio)\n", "test_dataset=test_dataset.sample(frac=sample_ratio)\n", "train_dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Junk Data\n", "Large-scale datasets often contain an uneven distribution of text representation, which includes a significant amount of nonsensical and boilerplate text - such as HTML tags.\n", "\n", "The presence of such \"noise\" or irrelevant content in the dataset is detrimental to the training of predictive models, specifically those that operate by predicting the next token based on all previous ones. Therefore, it's crucial to clean the dataset and remove these undesired elements prior to the training phase.\n", "\n", "This piece of Python code calculated a measure of \"impurity\" in text documents, and then computing the proportion of documents that exceed a certain impurity threshold. It defines a compiled regular expression that matches any of the following suspicious characters: &, #, <, >, {, }, [, ]." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.049724557842853" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re_e = r'[&#<>{}\\[\\]\\\\]'\n", "threshold = 0.01\n", "df, impurity_ratio = junk_eliminate(train_dataset, re_e, threshold)\n", "impurity_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Biased Content\n", "It is crucial in the training of language models to be vigilant and potentially apply tools to exclude toxic content from the pre-training datasets. This practice helps to prevent the models from demonstrating bias or generating detrimental content in subsequent applications.\n", "\n", "One approach to address this issue is by scanning the text for offensive words. 
For instance, the creators of the C4 dataset have implemented such a filtering mechanism, based on a word list that they open-sourced.\n",
"\n",
"The following code utilizes that word list to quantify the \"biased content ratio\" in the dataset." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1284430269643375" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [
"non_toxic_df, biased_content_ratio = toxic_eliminate(df=train_dataset, l_kind='en')\n",
"biased_content_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"# Too Short Document\n",
"The aim of language modeling is to learn to generate text based on preceding tokens. Eliminating extremely brief documents (text consisting of fewer than approximately 100 tokens) from the corpus helps reduce noise, keeping only contiguous text that is long enough for the model to learn dependencies within it.\n",
"\n",
"The code below uses the Hugging Face Transformers library to tokenize the text and then calculates the proportion of documents that are \"too short\". This example converts text into tokens that the BERT model can understand; choose a tokenizer that matches your model." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.40895911858509715" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [
"from transformers import BertTokenizer\n",
"\n",
"tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n",
"df = train_dataset\n",
"not_short_df, too_short_doc_ratio = short_eliminate(df, tokenizer, min_len=100)\n",
"too_short_doc_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"# Contamination\n",
"Typically, ensuring the segregation of training and testing data is rather straightforward in machine learning. However, things become complicated in the context of large language models, where both the training and benchmarking datasets are collected from the internet.\n",
"\n",
"For instance, the performance evaluation of a large language model using benchmark data (like question-answer pairs) can be significantly affected if the benchmark data also features in the model's training set. The procedure of eliminating instances from the training datasets that intersect with the existing benchmarking datasets is called \"decontamination\".\n",
"\n",
"The Python code below quantifies the contamination problem in the datasets, i.e., the proportion of documents in the test set that also appear in the training set, using N-grams.\n",
"\n",
"The approach here is from the GPT-3 paper: OpenAI defined a test document as contaminated if any N-gram overlap existed with any training document (they used a range of N values between 8 and 13, depending on the dataset). When constructing the WebText dataset, OpenAI researchers decontaminated the data by eliminating all Wikipedia content from the training set. This was necessary as Wikipedia data was heavily used in their benchmark datasets."
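,
"\n",
"\n",
"To make the idea concrete, the snippet below is a minimal sketch of such an N-gram overlap check (an illustration only, not the `finnlp` implementation; the helper names are hypothetical):\n",
"\n",
"```python\n",
"# A test document counts as contaminated if any of its N-grams also\n",
"# appears in any training document (the GPT-3 paper used N from 8 to 13).\n",
"def ngrams(text, n=8):\n",
"    tokens = text.lower().split()\n",
"    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}\n",
"\n",
"def contamination_ratio_sketch(train_texts, test_texts, n=8):\n",
"    train_ngrams = set()\n",
"    for doc in train_texts:\n",
"        train_ngrams |= ngrams(doc, n)\n",
"    contaminated = sum(1 for doc in test_texts if ngrams(doc, n) & train_ngrams)\n",
"    return contaminated / max(len(test_texts), 1)\n",
"```"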
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.021108179419525065" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contamination_ratio = contamination_eliminate(train_dataset, test_dataset)\n", "contamination_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Duplication\n", "When datasets are created by scraping raw text from the Internet, this will often result in the same sequences being repeated multiple times. This paper mentions a single 50 word sequence that is repeated in the C4 dataset 60,000 times.\n", "\n", "Deduplication helps prevent models from outputting verbatim training data when there are many duplicates, and makes models less vulnerable to privacy attacks. Deduplication can also improve model training efficiency and prevent benchmark contamination.\n", "\n", "Tools & Tutorials\n", "The GPT-3 paper mentions they fuzzily deduplicated documents within each dataset using Spark’s MinHashLSH implementation with 10 hashes.\n", "\n", "deduplicate-text-datasets is an ExactSubstr deduplication implementation (written in Rust) along with the scripts to perform ExactSubstr deduplication and inspect the results (written in Python).\n", "\n", "datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy.\n", "\n", "This article provides a MinHash walkthrough to demonstrate how to implement a parallelel deduplication.\n", "\n", "The following code uses the datasketch library and LSH (Locality Sensitive Hashing) to deduplicate the dataset. For each text in the DataFrame, it creates a query MinHash object and performs a query on the LSH index to find similar documents.\n", "\n", "It worths to mention that the de-duplication process usually requires a lot of computational resources (CPU and RAM) due to the size of web crawl datasets and it's therefore recommended to run such computations in distributed settings." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unique_documents, duplication_ratio = duplication_eliminate(train_dataset.sample(frac=sample_ratio))\n", "duplication_ratio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to [HuggingFace-Datasets-Text-Quality-Analysis](https://huggingface.co/spaces/Dreamsome/HuggingFace-Datasets-Text-Quality-Analysis)" ] } ], "metadata": { "kernelspec": { "display_name": "langchain", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }