	---
	library_name: transformers
	license: mit
	language:
	- en
	metrics:
	- rouge
	pipeline_tag: summarization
	base_model:
	- google/pegasus-large
	---

	# Pegasus Large Privacy Policy Summarization V2

	A Google PEGASUS Large model fine-tuned on privacy policy documents paired with their summaries.

	## Model Details

	- Model Type: Transformer-based abstractive summarization model
	- Architecture: Google PEGASUS Large
	- Fine-tuning Dataset: A curated dataset of privacy policy documents and their corresponding summaries.
	- Intended Use: Summarizing long and complex privacy policies into concise and readable summaries.
	- Limitations: May miss critical nuances, legal jargon, or context-dependent details in privacy policies.

	## Uses

	### Direct Use

	This model can be used for summarizing lengthy privacy policy documents into concise summaries.
	It is designed for applications that require automated document summarization, such as compliance analysis and legal document processing.

	### Downstream Use

	This model can be fine-tuned further for domain-specific summarization tasks related to legal, business, or government policy documents.

	### Out-of-Scope Use

	- Legal Advice: The model is not a replacement for professional legal consultation.
	- Summarization of Non-Privacy-Related Texts: Performance may degrade on general texts outside privacy policies.
	- High-Stakes Decision-Making: Should not be used in critical legal or compliance decisions without human oversight.

	## Bias, Risks, and Limitations

	### Risks

	- Summarization Bias: The model may overemphasize certain parts of privacy policies while omitting crucial information.
	- Misinterpretation: Legal terms might not be accurately represented in layman's summaries.
	- Data Sensitivity: Summarization results could be misleading if applied to incomplete or biased datasets.

	### Recommendations

	- Human verification of summaries is advised, especially for legal and compliance use cases.
	- Users, both direct and downstream, should be made aware of the model's risks, biases, and limitations, including potential biases in the training data.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	import torch
	from transformers import PegasusTokenizer, PegasusForConditionalGeneration

	# Use a GPU when one is available, otherwise fall back to the CPU.
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model_checkpoint = "AryehRotberg/Pegasus-Large-Privacy-Policy-Summarization-V2"
	model = PegasusForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
	tokenizer = PegasusTokenizer.from_pretrained(model_checkpoint)

	def summarize(text):
	    # Prefix the document with the prompt; inputs beyond 1024 tokens are truncated.
	    inputs = tokenizer(
	        f"Summarize the following document: {text}\nSummary: ",
	        padding="max_length",
	        truncation=True,
	        max_length=1024,
	        return_tensors="pt",
	    ).to(device)

	    # Generate with the checkpoint's default generation settings.
	    outputs = model.generate(**inputs)

	    return tokenizer.decode(outputs[0], skip_special_tokens=True)
	```

	## Training Details

	### Training and Evaluation Data

	The documents and summaries were extracted from the ToS;DR website's API. Only comprehensively reviewed website documents with a rating were used.

	### Training Procedure

	#### Preprocessing

	The TextRank algorithm was used to extract the top n sentences from both the documents and the summaries, capped at 30 sentences per document and 20 per summary.
	The BeautifulSoup library was used to parse the HTML text, and regular expressions were applied to collapse excess whitespace.
	The dataset was then split into training and validation sets, with a test size of 0.2 and a random seed of 42.
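
The cleaning and sentence-selection steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the author's code: it uses the standard library's `html.parser` in place of BeautifulSoup, a naive regex sentence splitter, and a small hand-written TextRank (word-overlap similarity, fixed number of iterations). The names `clean_html`, `split_sentences`, and `textrank_top_n` are hypothetical, and keeping the selected sentences in document order is an assumption.

```python
import math
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes from HTML (a stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_html(raw):
    """Strip tags, then collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Text nodes are joined with a space so adjacent block elements do not fuse.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def similarity(a, b):
    """Word-overlap similarity normalised by the log of the sentence sizes."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_top_n(sentences, n, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, kept in document order."""
    count = len(sentences)
    if count <= n:
        return list(sentences)
    # Weighted, undirected sentence graph without self-loops.
    weights = [
        [similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * count
    # Weighted PageRank updates for a fixed number of iterations.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(count) if out_sum[j] > 0
            )
            for i in range(count)
        ]
    top = sorted(range(count), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Under these assumptions, `textrank_top_n(split_sentences(clean_html(raw)), 30)` would apply the 30-sentence cap to a document, and the same call with `20` would apply the cap to a summary.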

	#### Training Hyperparameters

	- Epochs: 10
	- Weight decay: 0.01
	- Batch size: 2 (train & eval)
	- Logging steps: 10
	- Warmup steps: 500
	- Evaluation strategy: epoch
	- Save strategy: epoch
	- Metric for best model: ROUGE-1
	- Load best model at end: True
	- Prediction mode: predict_with_generate=True
	- Optimizer: Adam with learning rate 0.001
	- Scheduler: Linear scheduler with warmup: num_warmup_steps=500, num_training_steps=1500
	- Reporting: MLflow
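
To make the scheduler entry concrete, the sketch below computes the learning rate a linear schedule with warmup would produce at a given step, using the values listed above (base rate 0.001, 500 warmup steps, 1500 total steps). It is written as a plain function for illustration; the function name `linear_warmup_lr` is hypothetical, and the shape follows the usual linear-warmup, linear-decay rule rather than code taken from the training script.

```python
def linear_warmup_lr(step, base_lr=1e-3, num_warmup_steps=500, num_training_steps=1500):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup phase: scale grows linearly from 0 to 1.
        scale = step / max(1, num_warmup_steps)
    else:
        # Decay phase: scale shrinks linearly from 1 to 0, clamped at 0.
        remaining = num_training_steps - step
        scale = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * scale


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 1500):
        print(step, linear_warmup_lr(step))
```

With these settings the rate peaks at step 500 and reaches zero at step 1500; any steps taken beyond that point would run at a learning rate of zero.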

	## Evaluation

	### Metrics

	- ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) were used to measure summarization quality.
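
As a rough illustration of what these metrics measure, the sketch below computes ROUGE-N and ROUGE-L F1 scores on lowercased, whitespace-split tokens. It is a simplified version for intuition only: it does no stemming or special tokenization, so its numbers will not match a full ROUGE implementation exactly, and it is not the evaluation code used for this model. The function names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    a, b = prediction.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)
    # Row-by-row dynamic programme for the LCS length.
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(a), len(b))


if __name__ == "__main__":
    pred = "we share your data"
    ref = "we sell your data"
    print(rouge_n(pred, ref, 1), rouge_n(pred, ref, 2), rouge_l(pred, ref))
```

ROUGE-1 rewards shared words, ROUGE-2 rewards shared word pairs, and ROUGE-L rewards words that appear in the same order, which is why the three scores can differ noticeably for the same summary.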

	### Results

	- rouge1: 0.5141839409652631
	- rouge2: 0.2895850459169673
	- rougeL: 0.27764589200709305
	- rougeLsum: 0.2776501244969102