---
title: Summvis
emoji: 📚
colorFrom: yellow
colorTo: green
sdk: streamlit
app_file: app.py
pinned: false
---
# SummVis
SummVis is an open-source visualization tool that supports fine-grained analysis of summarization models, data, and evaluation
metrics. Through its lexical and semantic visualizations, SummVis enables in-depth exploration across important dimensions such as factual consistency and abstractiveness.
Authors: [Jesse Vig](https://twitter.com/jesse_vig)<sup>1</sup>,
[Wojciech Kryściński](https://twitter.com/iam_wkr)<sup>1</sup>,
[Karan Goel](https://twitter.com/krandiash)<sup>2</sup>,
[Nazneen Fatema Rajani](https://twitter.com/nazneenrajani)<sup>1</sup><br/>
<sup>1</sup>[Salesforce Research](https://einstein.ai/) <sup>2</sup>[Stanford Hazy Research](https://hazyresearch.stanford.edu/)
📖 [Paper](https://arxiv.org/abs/2104.07605)
🎥 [Demo](https://vimeo.com/540429745)
<p>
<img src="website/demo.gif" alt="Demo gif"/>
</p>
_Note: SummVis is under active development, so expect continued updates in the coming weeks and months.
Feel free to raise issues for questions, suggestions, requests or bug reports._
## Table of Contents
- [User guide](#user-guide)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [Running with pre-loaded datasets](#running-with-pre-loaded-datasets)
- [Get your data into SummVis](#get-your-data-into-summvis)
- [Citation](#citation)
- [Acknowledgements](#acknowledgements)
## User guide
### Overview
SummVis is a tool for analyzing abstractive summarization systems. It provides fine-grained insights on summarization
models, data, and evaluation metrics by visualizing the relationships between source documents, reference summaries,
and generated summaries, as illustrated in the figure below.<br/>
![Relations between source, reference, and generated summaries](website/triangle.png)
### Interface
The SummVis interface is shown below. The example displayed is the first record from the
[CNN / Daily Mail](https://huggingface.co/datasets/cnn_dailymail) validation set.
![Main interface](website/main-vis.jpg)
#### Components
**(a)** Configuration panel<br/>
**(b)** Source document (or reference summary, depending on configuration)<br/>
**(c)** Generated summaries (and/or reference summary, depending on configuration)<br/>
**(d)** Scroll bar with global view of annotations<br/>
#### Annotations
<img src="website/annotations.png" width="548" height="39" alt="Annotations"/>
**N-gram overlap:** Word sequences that overlap between the document on the left and
the selected summary on the right. Underlines are color-coded by index of summary sentence. <br/>
**Semantic overlap**: Words in the summary that are semantically close to one or more words in the document on the left.<br/>
**Novel words**: Words in the summary that do not appear in the document on the left.<br/>
**Novel entities**: Entity words in the summary that do not appear in the document on the left.<br/>
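As a rough illustration of the kind of comparison these annotations encode (not the tool's actual implementation, which relies on spaCy tokenization and dedicated lexical/semantic aligners), "novel words" can be thought of as summary tokens that never occur in the source document:
```python
# Conceptual sketch only: SummVis computes these annotations with spaCy and
# lexical/semantic aligners, not with this simple whitespace tokenization.
document = "The quick brown fox jumps over the lazy dog."
summary = "A swift fox leaps over a dog."

doc_tokens = {tok.strip(".,").lower() for tok in document.split()}
novel_words = [tok for tok in summary.split()
               if tok.strip(".,").lower() not in doc_tokens]

print(novel_words)  # ['A', 'swift', 'leaps', 'a'] are absent from the document
```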
### Limitations
Currently only English text is supported.
## Installation
**IMPORTANT**: Please use `python>=3.8`, since some dependencies require it for installation.
```shell
# Requires python>=3.8
git clone https://github.com/robustness-gym/summvis.git
cd summvis
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
Installation takes around 2 minutes on a MacBook Pro.
## Quickstart
Follow the steps below to start using SummVis immediately.
### 1. Download and extract data
Download our pre-cached dataset that contains predictions for state-of-the-art models such as PEGASUS and BART on
1000 examples taken from the CNN / Daily Mail validation set.
```shell
mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/
```
### 2. Deanonymize data
Next, we'll need to add the original examples from the CNN / Daily Mail dataset to deanonymize the data (this information
is omitted for copyright reasons). The `preprocessing.py` script can be used for this with the `--deanonymize` flag.
#### Deanonymize 10 examples:
```shell
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/10:cnn_dailymail_1000.validation \
--n_samples 10
```
This will take either a few seconds or a few minutes depending on whether you've previously loaded CNN/DailyMail from
the Datasets library.
### 3. Run SummVis
Finally, we're ready to run the Streamlit app. Once the app loads, make sure it's pointing to the right `File` at the top
of the interface.
```shell
streamlit run summvis.py
```
## Running with pre-loaded datasets
In this section we extend the approach described in [Quickstart](#quickstart) to other pre-loaded datasets.
### 1. Download one of the pre-loaded datasets:
##### CNN / Daily Mail (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip
##### CNN / Daily Mail (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail.validation.anonymized.zip
##### XSum (1000 examples from validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip
##### XSum (full validation set): https://storage.googleapis.com/sfr-summvis-data-research/xsum.validation.anonymized.zip
We recommend that you choose the smallest dataset that fits your needs in order to minimize download and preprocessing time.
#### Example: Download and unzip CNN / Daily Mail
```shell
mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/cnn_dailymail_1000.validation.anonymized.zip --output preprocessing/cnn_dailymail_1000.validation.anonymized.zip
unzip preprocessing/cnn_dailymail_1000.validation.anonymized.zip -d preprocessing/
```
#### Example: Download and unzip XSum
```shell
mkdir data
mkdir preprocessing
curl https://storage.googleapis.com/sfr-summvis-data-research/xsum_1000.validation.anonymized.zip --output preprocessing/xsum_1000.validation.anonymized.zip
unzip preprocessing/xsum_1000.validation.anonymized.zip -d preprocessing/
```
### 2. Deanonymize *n* examples:
Set the `--n_samples` argument and name the `--processed_dataset_path` output file accordingly.
#### Example: Deanonymize 100 examples from CNN / Daily Mail:
```shell
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/100:cnn_dailymail_1000.validation \
--n_samples 100
```
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (1000 examples dataset):
```shell
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail_1000.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail_1000.validation \
--n_samples 1000
```
#### Example: Deanonymize all pre-loaded examples from CNN / Daily Mail (full dataset):
```shell
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/cnn_dailymail.validation.anonymized \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--processed_dataset_path data/full:cnn_dailymail.validation
```
#### Example: Deanonymize all pre-loaded examples from XSum (1000 examples dataset):
```shell
python preprocessing.py \
--deanonymize \
--dataset_rg preprocessing/xsum_1000.validation.anonymized \
--dataset xsum \
--split validation \
--processed_dataset_path data/full:xsum_1000.validation \
--n_samples 1000
```
### 3. Run SummVis
Once the app loads, make sure it's pointing to the right `File` at the top
of the interface.
```shell
streamlit run summvis.py
```
Alternatively, you may point SummVis to a folder where your data is stored:
```shell
streamlit run summvis.py -- --path your/path/to/data
```
Note that the additional `--` is not a mistake; it is required for passing command-line arguments to Streamlit.
## Get your data into SummVis
The simplest way to use SummVis with your own data is to create a jsonl file of the following format:
```
{"document": "This is the first source document", "summary:reference": "This is the reference summary", "summary:testmodel1": "This is the summary for testmodel1", "summary:testmodel2": "This is the summary for testmodel2"}
{"document": "This is the second source document", "summary:reference": "This is the reference summary", "summary:testmodel1": "This is the summary for testmodel1", "summary:testmodel2": "This is the summary for testmodel2"}
```
The key for the reference summary must equal `summary:reference` and the key for any other summary must be of the form
`summary:<summary_name>`, e.g. `summary:BART`. The document and at least one summary (reference, other, or both) are required.
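For example, a file in this format could be written with Python's `json` module (the file name and summary text below are just placeholders):
```python
# Minimal sketch for producing a SummVis-compatible jsonl file.
# The records and the file name "my_dataset.jsonl" are placeholders; keys must
# follow the "document" / "summary:reference" / "summary:<summary_name>" convention.
import json

records = [
    {
        "document": "This is the first source document",
        "summary:reference": "This is the reference summary",
        "summary:testmodel1": "This is the summary for testmodel1",
    },
    {
        "document": "This is the second source document",
        "summary:reference": "This is the reference summary",
        "summary:testmodel1": "This is the summary for testmodel1",
    },
]

with open("my_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```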
The following additional installation step is also required:
```
python -m spacy download en_core_web_lg
```
You have two options to load this jsonl file into the tool:
#### Option 1: Load the jsonl file directly
The disadvantage of this approach is that all computations are performed in real time. This is particularly expensive for
semantic similarity, which uses a Transformer model. As a result, each example will be slow to load (~5-15 seconds on a MacBook Pro).
1. Place the jsonl file in the `data` directory. Note that the file must be named with a `.jsonl` extension.
2. Start SummVis: `streamlit run summvis.py`
3. Select your jsonl file from the `File` dropdown at the top of the interface.
#### Option 2: Preprocess jsonl file (recommended)
You may run `preprocessing.py` to precompute all data required in the interface (running `spaCy`, lexical and semantic
aligners) and save a cache file, which can be read directly into the tool. Note that this script may run for a while
(~5-15 seconds per example on a MacBook Pro for
documents of typical length found in CNN/DailyMail or XSum), and will be greatly expedited by running on a GPU.
1. Run preprocessing script to generate cache file
```shell
python preprocessing.py \
--workflow \
--dataset_jsonl path/to/my_dataset.jsonl \
--processed_dataset_path path/to/my_cache_file
```
You may wish to first try it with a subset of your data by adding the following argument: `--n_samples <number_of_samples>`.
2. Copy output cache file to the `data` directory
3. Start SummVis: `streamlit run summvis.py`
4. Select your file from the `File` dropdown at the top of the interface.
As an alternative to steps 2-3, you may point SummVis to a folder in which the cache file is stored:
```shell
streamlit run summvis.py -- --path <parent_directory_of_cache_file>
```
### Generating predictions
The instructions in the previous section assume access to model predictions. We also provide tools to load predictions,
either by downloading datasets with precomputed predictions or running
a script to generate predictions for HuggingFace-compatible models. In this section we describe an end-to-end pipeline
for using these tools.
Prior to running the following, an additional install step is required:
```
python -m spacy download en_core_web_lg
```
#### 1. Standardize and save dataset to disk.
This step loads a dataset from the Hugging Face Hub, or any dataset that you provide, and stores it in a
standardized format with columns for `document` and `summary:reference`.
##### Example: Save CNN / Daily Mail validation split to disk as a jsonl file.
```shell
python preprocessing.py \
--standardize \
--dataset cnn_dailymail \
--version 3.0.0 \
--split validation \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```
##### Example: Load custom `my_dataset.jsonl`, standardize, and save.
```shell
python preprocessing.py \
--standardize \
--dataset_jsonl path/to/my_dataset.jsonl \
--save_jsonl_path preprocessing/my_dataset.jsonl
```
Expected format of `my_dataset.jsonl`:
```
{"document": "This is the first source document", "summary:reference": "This is the reference summary"}
{"document": "This is the second source document", "summary:reference": "This is the reference summary"}
```
If you wish to use column names other than `document` and `summary:reference`, you may specify custom column names
using the `doc_column` and `reference_column` command-line arguments.
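If you prefer to rename the fields yourself instead of passing those arguments, here is a minimal sketch of the conversion (the input field names `text` and `target` below are purely hypothetical):
```python
# Hypothetical sketch: rewrite a jsonl file whose records use custom field names
# ("text" and "target" here) into the standard "document" / "summary:reference"
# fields expected by preprocessing.py.
import json

with open("my_raw_dataset.jsonl") as src, open("my_dataset.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        dst.write(json.dumps({
            "document": record["text"],
            "summary:reference": record["target"],
        }) + "\n")
```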
#### 2. Add predictions to the saved dataset.
This step takes a saved dataset that has already been standardized and adds predictions to it
from prediction jsonl files. Cached predictions for several models are available here:
https://storage.googleapis.com/sfr-summvis-data-research/predictions.zip
You may also generate your own predictions using [this script](generation.py).
##### Example: Add 6 prediction files for PEGASUS and BART to the dataset.
```shell
python preprocessing.py \
--join_predictions \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--prediction_jsonls \
predictions/bart-cnndm.cnndm.validation.results.anonymized \
predictions/bart-xsum.cnndm.validation.results.anonymized \
predictions/pegasus-cnndm.cnndm.validation.results.anonymized \
predictions/pegasus-multinews.cnndm.validation.results.anonymized \
predictions/pegasus-newsroom.cnndm.validation.results.anonymized \
predictions/pegasus-xsum.cnndm.validation.results.anonymized \
--save_jsonl_path preprocessing/cnn_dailymail.validation.jsonl
```
#### 3. Run the preprocessing workflow and save the dataset.
This step takes a saved dataset that has been standardized and has predictions already added,
applies all the preprocessing steps to it (running `spaCy` and the lexical and semantic aligners),
and stores the processed dataset back to disk.
##### Example: Autorun with default settings on a few examples to try it.
```shell
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail.validation \
--try_it
```
##### Example: Autorun with default settings on all examples.
```shell
python preprocessing.py \
--workflow \
--dataset_jsonl preprocessing/cnn_dailymail.validation.jsonl \
--processed_dataset_path data/cnn_dailymail
```
## Citation
When referencing this repository, please cite [this paper](https://arxiv.org/abs/2104.07605):
```
@misc{vig2021summvis,
title={SummVis: Interactive Visual Analysis of Models, Data, and Evaluation for Text Summarization},
author={Jesse Vig and Wojciech Kryscinski and Karan Goel and Nazneen Fatema Rajani},
year={2021},
eprint={2104.07605},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2104.07605}
}
```
## Acknowledgements
We thank [Michael Correll](http://correll.io) for his valuable feedback.