---
tags:
- summarization
- summary
- booksum
- long-document
- long-form
license:
- apache-2.0
- bsd-3-clause
datasets:
- kmfoda/booksum
metrics:
- rouge
inference: False
---

# long-t5-tglobal-xl + BookSum

- summarize long text and get a SparkNotes-esque summary of arbitrary topics!
- generalizes reasonably well to academic & narrative text. This is the XL checkpoint, which, **from a human-evaluation perspective, produces even better summaries**.
- A simple example/use case with the `base` model on ASR is [here](https://longt5-booksum-example.netlify.app/).

## Model description

A fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the `kmfoda/booksum` dataset.

Read the paper by Guo et al.: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)

## How-To in Python

> `LLM.int8()` appears to be compatible with summarization and does not degrade the quality of the outputs; this is a crucial enabler for using this model on standard GPUs. A PR for this is in progress [here](https://github.com/huggingface/transformers/pull/20341), and this model card will be updated with instructions once done :)

Install/update transformers: `pip install -U transformers`

Summarize text with the pipeline:

```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])
```

Pass [other parameters related to beam search text generation](https://huggingface.co/blog/how-to-generate) when calling `summarizer` to get even higher-quality results.
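For example, continuing from the snippet above, generation parameters can be passed directly to the `summarizer` call and are forwarded to `model.generate()`. The values below are illustrative only, not tuned recommendations:

```python
# Continuing from the pipeline example above -- example values only, tune for your data.
result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=4,
    repetition_penalty=2.5,
    num_beams=4,
    early_stopping=True,
)
print(result[0]["summary_text"])
```

Any other keyword argument from the linked blog post (e.g., `length_penalty`, `do_sample`) can generally be passed the same way.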
## Intended uses & limitations

- while this model seems to improve factual consistency, **do not take summaries to be foolproof; check anything that seems odd**.
- specifically: watch for negation statements (i.e., the model says _this thing does not have ..._ when it should have said _this thing has a lot of ..._).
  - I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check such a claim by paying attention to the sentences surrounding it.

## Training and evaluation data

- the `kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).
- **Initial fine-tuning** only used examples with an input of 12288 tokens or less and an output of 1024 tokens or less, for memory reasons. Per a brief analysis, examples in the 12288-16384 token range are a **small** minority of this dataset.
  - In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. **Therefore, take the validation set results with a grain of salt; the primary metrics should (always) come from the test set.**
- the **final phases of fine-tuning** used the standard 16384-token input / 1024-token output convention, keeping all examples (longer sequences were truncated). This did not appear to change the loss/performance much.

## Eval Results

Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.

**Please read the note above: due to the training methods used, the evaluation results below look better than the test set results will.**

The model achieves the following results on the evaluation set:

- eval_loss: 1.2756
- eval_rouge1: 41.8013
- eval_rouge2: 12.0895
- eval_rougeL: 21.6007
- eval_rougeLsum: 39.5382
- eval_gen_len: 387.2945
- eval_runtime: 13908.4995
- eval_samples_per_second: 0.107
- eval_steps_per_second: 0.027

---

## FAQ

### How can I run inference with this on CPU?

lol

---

## Training procedure

### Updates

Updates to this model/model card will be posted here as relevant. The model seems fairly converged, but if improvements can be made using `kmfoda/booksum`, this repo will be updated.

### Training hyperparameters

The following hyperparameters were used during training (a rough code equivalent is sketched at the end of this card):

- learning_rate: 0.0006
- train_batch_size: 1
- eval_batch_size: 1
- seed: 10350
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0

\* _Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required, as this takes eons to train._

### Framework versions

- Transformers 4.25.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.1
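### Hyperparameters as code (sketch)

For reference, the hyperparameters listed above map roughly onto the following `Seq2SeqTrainingArguments`. This is an illustrative sketch, not the actual training script: `output_dir` is a placeholder, and anything not listed above is left at the library defaults.

```python
from transformers import Seq2SeqTrainingArguments

# Approximate translation of the hyperparameters listed above.
# Sketch only -- this is not the script that produced the checkpoint.
training_args = Seq2SeqTrainingArguments(
    output_dir="./long-t5-tglobal-xl-booksum",  # placeholder path
    learning_rate=6e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,  # 4 GPUs x batch 1 x 32 steps = total train batch size 128
    num_train_epochs=1.0,
    lr_scheduler_type="constant",
    seed=10350,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```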