---
tags:
- summarization
- mT5
datasets:
- csebuetnlp/xlsum
language:
- am
- ar
- az
- bn
- my
- zh
- en
- fr
- gu
- ha
- hi
- ig
- id
- ja
- rn
- ko
- ky
- mr
- ne
- om
- ps
- fa
- pcm
- pt
- pa
- ru
- gd
- sr
- si
- so
- es
- sw
- ta
- te
- th
- ti
- tr
- uk
- ur
- uz
- vi
- cy
- yo
license: cc-by-nc-sa-4.0
widget:
- text: >-
Yahoo's patents suggest users could weigh the type of ads against the
sizes of discount before purchase. It says in two US patent applications
that ads for digital book readers have been "less than optimal" to date.
The filings suggest that users could be offered titles at a variety of
prices depending on the ads' prominence. They add that the products shown
could be determined by the type of book being read, or even the contents
of a specific chapter, phrase or word. The paperwork was published by the
US Patent and Trademark Office late last week and relates to work carried
out at the firm's headquarters in Sunnyvale, California. "Greater levels
of advertising, which may be more valuable to an advertiser and
potentially more distracting to an e-book reader, may warrant higher
discounts," it states. Free books It suggests users could be offered ads
as hyperlinks based within the book's text, in-laid text or even "dynamic
content" such as video. Another idea suggests boxes at the bottom of a
page could trail later chapters or quotes saying "brought to you by
Company A". It adds that the more willing the customer is to see the ads,
the greater the potential discount. "Higher frequencies... may even be
great enough to allow the e-book to be obtained for free," it states. The
authors write that the type of ad could influence the value of the
discount, with "lower class advertising... such as teeth whitener
advertisements" offering a cheaper price than "high" or "middle class"
adverts, for things like pizza. The inventors also suggest that ads could
be linked to the mood or emotional state the reader is in as they
progress through a title. For example, they say if characters fall in love
or show affection during a chapter, then ads for flowers or entertainment
could be triggered. The patents also suggest this could be applied to
children's books - giving the Tom Hanks animated film Polar Express as an
example. It says a scene showing a waiter giving the protagonists hot
drinks "may be an excellent opportunity to show an advertisement for hot
cocoa, or a branded chocolate bar". Another example states: "If the
setting includes young characters, a Coke advertisement could be provided,
inviting the reader to enjoy a glass of Coke with his book, and providing
a graphic of a cool glass." It adds that such targeting could be further
enhanced by taking account of previous titles the owner has bought.
'Advertising-free zone' At present, several Amazon and Kobo e-book readers
offer full-screen adverts when the device is switched off and show smaller
ads on their menu screens, but the main text of the titles remains free of
marketing. Yahoo does not currently provide ads to these devices, and a
move into the area could boost its shrinking revenues. However, Philip
Jones, deputy editor of the Bookseller magazine, said that the internet
firm might struggle to get some of its ideas adopted. "This has been
mooted before and was fairly well decried," he said. "Perhaps in a limited
context it could work if the merchandise was strongly related to the title
and was kept away from the text. "But readers - particularly parents -
like the fact that reading is an advertising-free zone. Authors would also
want something to say about ads interrupting their narrative flow."
inference:
  parameters:
    max_length: 84
    num_beams: 4
---
# mT5-multilingual-XLSum
This repository contains the mT5 checkpoint fine-tuned on the 45 languages of the XL-Sum dataset. For fine-tuning details and scripts, see the [paper](https://aclanthology.org/2021.findings-acl.413) and the [official repository](https://github.com/csebuetnlp/xl-sum).
## Using this model in `transformers` (tested on 4.11.0.dev0)
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

article_text = """Input article text"""

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Collapse newlines and tokenize; articles longer than 512 tokens are truncated.
input_ids = tokenizer(
    [article_text.replace("\n", " ").strip()],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

# Beam-search decoding; no_repeat_ngram_size=2 blocks repeated bigrams.
# These are the generation settings used for the benchmark scores below.
output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)
```
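The same checkpoint also works with the high-level `pipeline` API. Below is a minimal sketch, assuming the generic summarization pipeline forwards the generation arguments used above; it reuses the `article_text` placeholder from the previous example:

```python
from transformers import pipeline

# Minimal sketch: the summarization pipeline wraps tokenization, generation
# and decoding; generation kwargs are forwarded to model.generate().
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

result = summarizer(
    article_text.replace("\n", " ").strip(),
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)
print(result[0]["summary_text"])
```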
## Benchmarks

Scores on the XL-Sum test sets are given below.
Language | ROUGE-1 / ROUGE-2 / ROUGE-L |
---|---|
Amharic | 20.0485 / 7.4111 / 18.0753 |
Arabic | 34.9107 / 14.7937 / 29.1623 |
Azerbaijani | 21.4227 / 9.5214 / 19.3331 |
Bengali | 29.5653 / 12.1095 / 25.1315 |
Burmese | 15.9626 / 5.1477 / 14.1819 |
Chinese (Simplified) | 39.4071 / 17.7913 / 33.406 |
Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184 |
English | 37.601 / 15.1536 / 29.8817 |
French | 35.3398 / 16.1739 / 28.2041 |
Gujarati | 21.9619 / 7.7417 / 19.86 |
Hausa | 39.4375 / 17.6786 / 31.6667 |
Hindi | 38.5882 / 16.8802 / 32.0132 |
Igbo | 31.6148 / 10.1605 / 24.5309 |
Indonesian | 37.0049 / 17.0181 / 30.7561 |
Japanese | 48.1544 / 23.8482 / 37.3636 |
Kirundi | 31.9907 / 14.3685 / 25.8305 |
Korean | 23.6745 / 11.4478 / 22.3619 |
Kyrgyz | 18.3751 / 7.9608 / 16.5033 |
Marathi | 22.0141 / 9.5439 / 19.9208 |
Nepali | 26.6547 / 10.2479 / 24.2847 |
Oromo | 18.7025 / 6.1694 / 16.1862 |
Pashto | 38.4743 / 15.5475 / 31.9065 |
Persian | 36.9425 / 16.1934 / 30.0701 |
Pidgin | 37.9574 / 15.1234 / 29.872 |
Portuguese | 37.1676 / 15.9022 / 28.5586 |
Punjabi | 30.6973 / 12.2058 / 25.515 |
Russian | 32.2164 / 13.6386 / 26.1689 |
Scottish Gaelic | 29.0231 / 10.9893 / 22.8814 |
Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379 |
Serbian (Latin) | 21.6443 / 6.6573 / 18.2336 |
Sinhala | 27.2901 / 13.3815 / 23.4699 |
Somali | 31.5563 / 11.5818 / 24.2232 |
Spanish | 31.5071 / 11.8767 / 24.0746 |
Swahili | 37.6673 / 17.8534 / 30.9146 |
Tamil | 24.3326 / 11.0553 / 22.0741 |
Telugu | 19.8571 / 7.0337 / 17.6101 |
Thai | 37.3951 / 17.275 / 28.8796 |
Tigrinya | 25.321 / 8.0157 / 21.1729 |
Turkish | 32.9304 / 15.5709 / 29.2622 |
Ukrainian | 23.9908 / 10.1431 / 20.9199 |
Urdu | 39.5579 / 18.3733 / 32.8442 |
Uzbek | 16.8281 / 6.3406 / 15.4055 |
Vietnamese | 32.8826 / 16.2247 / 26.0844 |
Welsh | 32.6599 / 11.596 / 26.1164 |
Yoruba | 31.6595 / 11.6599 / 25.0898 |
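For a quick sanity check against these numbers, the test splits can be loaded from the Hub and scored with a generic ROUGE implementation. The sketch below assumes the `datasets`, `evaluate`, and `rouge_score` packages are installed and reuses the `tokenizer` and `model` loaded earlier; note that the official scores were produced with the multilingual ROUGE setup from the XL-Sum repository, so `evaluate`'s default English tokenization will not reproduce the non-English rows exactly:

```python
import evaluate
from datasets import load_dataset

# Small sanity-check run on a slice of the English test split.
dataset = load_dataset("csebuetnlp/xlsum", "english", split="test[:32]")
rouge = evaluate.load("rouge")  # needs the rouge_score package

predictions = []
for example in dataset:
    input_ids = tokenizer(
        [example["text"].replace("\n", " ").strip()],
        return_tensors="pt",
        truncation=True,
        max_length=512
    )["input_ids"]
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]
    predictions.append(tokenizer.decode(output_ids, skip_special_tokens=True))

# F-scores for rouge1 / rouge2 / rougeL over the sampled slice.
print(rouge.compute(predictions=predictions, references=dataset["summary"]))
```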
## Citation

If you use this model, please cite the following paper:
```bibtex
@inproceedings{hasan-etal-2021-xl,
title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
author = "Hasan, Tahmid and
Bhattacharjee, Abhik and
Islam, Md. Saiful and
Mubasshir, Kazi and
Li, Yuan-Fang and
Kang, Yong-Bin and
Rahman, M. Sohel and
Shahriyar, Rifat",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.413",
pages = "4693--4703",
}
```