---
library_name: transformers
license: mit
language:
- en
metrics:
- rouge
pipeline_tag: summarization
base_model:
- google/pegasus-large
---

# Pegasus Large Privacy Policy Summarization V2

This is Google's PEGASUS Large model, fine-tuned on privacy policy documents and their corresponding summaries.

## Model Details

- **Model Type**: Transformer-based abstractive summarization model
- **Architecture**: Google PEGASUS Large
- **Fine-tuning Dataset**: A curated dataset of privacy policy documents and their corresponding summaries.
- **Intended Use**: Summarizing long and complex privacy policies into concise and readable summaries.
- **Limitations**: May miss critical nuances, misrepresent legal jargon, or drop context-dependent details in privacy policies.

## Uses

### Direct Use

This model can be used to summarize lengthy privacy policy documents into concise summaries.
It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

### Downstream Use

This model can be fine-tuned further for domain-specific summarization tasks related to legal, business, or government policy documents.

### Out-of-Scope Use

- **Legal Advice**: The model is not a replacement for professional legal consultation.
- **Summarization of Non-Privacy-Related Texts**: Performance may degrade on general texts that are not privacy policies.
- **High-Stakes Decision-Making**: The model should not be used for critical legal or compliance decisions without human oversight.

## Bias, Risks, and Limitations

### Risks

- **Summarization Bias**: The model may overemphasize certain parts of a privacy policy while omitting crucial information.
- **Misinterpretation**: Legal terms might not be accurately represented in plain-language summaries.
- **Data Sensitivity**: Summaries could be misleading if the model is applied to incomplete or biased input documents.

### Recommendations

- Human verification of summaries is advised, especially for legal and compliance use cases.
- Users should be aware of the potential biases in the training data.
- Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Run on a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)


def summarize(text):
    """Return an abstractive summary of a privacy policy given as plain text."""
    # Inputs longer than 1024 tokens are truncated, so later text is ignored.
    inputs = tokenizer(
        f"Summarize the following document: {text}\nSummary: ",
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    ).to(device)

    # Generate with the checkpoint's default generation settings.
    outputs = model.generate(**inputs)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Training Details

### Training and Evaluation Data

The documents and summaries were retrieved through the ToS;DR website's API. Only website documents that had been comprehensively reviewed and had a rating were used.

### Training Procedure

#### Preprocessing

The TextRank algorithm was used to extract the top n sentences from both the documents and the summaries, with a maximum of 30 sentences per document and 20 per summary.
The BeautifulSoup library was used to parse the HTML text, and regular expressions were applied to remove excess whitespace.
The dataset was then split into training and validation sets, with a test size of 0.2 (20% held out for validation) and a random seed of 42.
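The sentence-selection step can be illustrated with a small, dependency-free sketch. This is not the code used to build the dataset: the model card does not say which TextRank implementation was used, so the function names (`split_sentences`, `textrank_top_n`, `clean_whitespace`), the naive sentence splitter, the word-overlap similarity, and the damping factor of 0.85 are all assumptions made for illustration. HTML parsing with BeautifulSoup is not shown.

```python
import math
import re


def clean_whitespace(text):
    """Collapse runs of whitespace into single spaces (illustrative cleanup)."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def similarity(a, b):
    """Word overlap normalized by log sentence sizes, in the spirit of TextRank."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # log(1) == 0 would make the normalizer vanish
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_top_n(text, n, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, kept in document order."""
    sentences = split_sentences(text)
    size = len(sentences)
    if size <= n:
        return sentences

    # Weighted, undirected sentence graph stored as a dense matrix.
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(size)]
        for i in range(size)
    ]
    out_sum = [sum(row) for row in weights]

    # PageRank-style power iteration over the sentence graph.
    scores = [1.0] * size
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(size) if out_sum[j] > 0
            )
            for i in range(size)
        ]

    top = sorted(range(size), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


# Per the card, documents would be capped at 30 sentences and summaries at 20.
policy = clean_whitespace(
    "We collect your email address.   We share your email address with partners. "
    "We store your email address securely.\n\nCookies expire yearly."
)
print(textrank_top_n(policy, 3))
```

Sentences that share no vocabulary with the rest of the text receive only the baseline score, so they are the first to be dropped when the cap is applied.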

#### Training Hyperparameters

- Epochs: 10
- Weight decay: 0.01
- Batch size: 2 (training and evaluation)
- Logging steps: 10
- Warmup steps: 500
- Evaluation strategy: epoch
- Save strategy: epoch
- Metric for best model: ROUGE-1
- Load best model at end: True
- Prediction mode: `predict_with_generate=True`
- Optimizer: Adam with learning rate 0.001
- Scheduler: Linear schedule with warmup (`num_warmup_steps=500`, `num_training_steps=1500`)
- Reporting: MLflow
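To make the scheduler settings concrete, the sketch below computes the learning rate a linear schedule with warmup would produce at a given step, using the values listed above. It assumes the standard shape of such a schedule (linear ramp from zero to the base rate, then linear decay to zero); the function name `linear_warmup_lr` is hypothetical and the training script itself is not shown in this card.

```python
def linear_warmup_lr(step, base_lr=0.001, warmup_steps=500, total_steps=1500):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1000, 1500):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the rate peaks at 0.001 at step 500 and reaches zero at step 1500, after which it stays at zero.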

## Evaluation

### Metrics

- ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum) were used to measure summarization quality.
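For readers unfamiliar with the metric, here is a simplified sketch of how a ROUGE-N F1 score is computed from n-gram overlap. It is an illustration only: the card does not state which ROUGE package produced the reported numbers, and standard implementations differ in tokenization and may apply stemming, so this function (with the hypothetical names `ngram_counts` and `rouge_n_f1`) will not reproduce them exactly.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """F1 of clipped n-gram overlap between a reference and a candidate summary."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    if not ref or not cand:
        return 0.0

    # Counter intersection clips each n-gram to its minimum count in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "we collect your email address"
candidate = "we collect your email"
print("ROUGE-1 F1:", rouge_n_f1(reference, candidate, n=1))
print("ROUGE-2 F1:", rouge_n_f1(reference, candidate, n=2))
```

ROUGE-L, also reported in this card, is based on the longest common subsequence rather than fixed-length n-grams and is not covered by this sketch.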

### Results

- **rouge1**: 0.5141839409652631
- **rouge2**: 0.2895850459169673
- **rougeL**: 0.27764589200709305
- **rougeLsum**: 0.2776501244969102