<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
## Language model training

Fine-tuning (or training from scratch) the library models for language modeling on a text dataset for GPT, GPT-2,
ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tuned using a causal language modeling
(CLM) loss while ALBERT, BERT, DistilBERT and RoBERTa are trained or fine-tuned using a masked language modeling (MLM)
loss. XLNet uses permutation language modeling (PLM); you can find more information about the differences between those
objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).

There are two sets of scripts provided. The first set leverages the Trainer API. The second set, with `no_trainer` in the suffix, uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/main/examples/legacy/run_language_modeling.py).

The following examples will run on datasets hosted on our [hub](https://huggingface.co/datasets) or with your own
text files for training and validation. We give examples of both below.
### GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We're using the raw WikiText-2 (no tokens were replaced before
the tokenization). The loss here is that of causal language modeling.

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```
This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run. It reaches
a score of ~20 perplexity once fine-tuned on the dataset.

To run on your own training and validation files, use the following command:

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```
This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_clm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path gpt2 \
    --output_dir /tmp/test-clm
```
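Since `run_clm_no_trainer.py` is built on 🤗 Accelerate, it can also be launched through the Accelerate CLI. The sketch below assumes you have already run `accelerate config` once to describe your hardware (e.g. multi-GPU):

```bash
# Describe your hardware once, then launch the script with that configuration.
accelerate config
accelerate launch run_clm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path gpt2 \
    --output_dir /tmp/test-clm
```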
### RoBERTa/BERT/DistilBERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2. Here too, we're using the raw WikiText-2. The loss is different
as BERT/RoBERTa have a bidirectional mechanism; we're therefore using the same loss that was used during their
pre-training: masked language modeling.

In accordance with the RoBERTa paper, we use dynamic masking rather than static masking. The model may, therefore,
converge slightly more slowly (over-fitting takes more epochs).

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```
To run on your own training and validation files, use the following command:

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).
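For example, reusing the placeholder file paths from the command above (a minimal sketch; batch sizes and other options are left at their defaults):

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --line_by_line \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```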
This uses the built-in HuggingFace `Trainer` for training. If you want to use a custom training loop, you can utilize or adapt the `run_mlm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path roberta-base \
    --output_dir /tmp/test-mlm
```
**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
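A minimal sketch of that flag combination is shown below; how the script itself is launched on TPU (e.g. through an XLA spawner) depends on your setup and is not shown:

```bash
python run_mlm.py \
    --model_name_or_path roberta-base \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --line_by_line \
    --pad_to_max_length \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-mlm
```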
### Whole word masking

This part was moved to `examples/research_projects/mlm_wwm`.
### XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method
to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input
sequence factorization order.

We use the `--plm_probability` flag to define the ratio of the length of a span of masked tokens to the surrounding
context length for permutation language modeling.

The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
for permutation language modeling. An illustrative command combining both flags follows the first example below.
Here is how to fine-tune XLNet on wikitext-2:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```
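For instance, the span-control flags described above can be added to that command as follows (the values here are purely illustrative, not recommendations):

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --plm_probability 0.16 \
    --max_span_length 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```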
To fine-tune it on your own training and validation file, run:

```bash
python run_plm.py \
    --model_name_or_path=xlnet-base-cased \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-plm
```
If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them in blocks of the same length).

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
## Streaming

To use the streaming dataset mode, which can be very useful for large datasets, add `--streaming` to the command line. This is currently supported by `run_mlm.py` and `run_clm.py`.
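For example, with the WikiText-2 configuration used above (a minimal sketch; streaming is most useful for datasets too large to download in full):

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --streaming \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```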
## Low CPU Memory Usage

To use low CPU memory mode, which can be very useful for large language models, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`, `run_mlm.py`, `run_plm.py`, `run_mlm_no_trainer.py` and `run_clm_no_trainer.py`.
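For instance, adding the flag to the earlier causal language modeling command (a minimal sketch; the flag mainly matters when the checkpoint being loaded is much larger than `gpt2`):

```bash
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --low_cpu_mem_usage \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```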
## Creating a model on the fly

When training a model from scratch, configuration values may be overridden with the help of `--config_overrides`:

```bash
python run_clm.py --model_type gpt2 --tokenizer_name gpt2 \
    --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=1024" \
    [...]
```

This feature is only available in `run_clm.py`, `run_plm.py` and `run_mlm.py`.