---
tags:
- summarization
- mT5
datasets:
- csebuetnlp/xlsum
language:
- am
- ar
- az
- bn
- my
- zh
- en
- fr
- gu
- ha
- hi
- ig
- id
- ja
- rn
- ko
- ky
- mr
- ne
- om
- ps
- fa
- pcm
- pt
- pa
- ru
- gd
- sr
- si
- so
- es
- sw
- ta
- te
- th
- ti
- tr
- uk
- ur
- uz
- vi
- cy
- yo
license: cc-by-nc-sa-4.0
widget:
- text: >-
Yahoo's patents suggest users could weigh the type of ads against the
sizes of discount before purchase. It says in two US patent applications
that ads for digital book readers have been "less than optimal" to date.
The filings suggest that users could be offered titles at a variety of
prices depending on the ads' prominence. They add that the products shown
could be determined by the type of book being read, or even the contents
of a specific chapter, phrase or word. The paperwork was published by the
US Patent and Trademark Office late last week and relates to work carried
out at the firm's headquarters in Sunnyvale, California. "Greater levels
of advertising, which may be more valuable to an advertiser and
potentially more distracting to an e-book reader, may warrant higher
discounts," it states. Free books It suggests users could be offered ads
as hyperlinks based within the book's text, in-laid text or even "dynamic
content" such as video. Another idea suggests boxes at the bottom of a
page could trail later chapters or quotes saying "brought to you by
Company A". It adds that the more willing the customer is to see the ads,
the greater the potential discount. "Higher frequencies... may even be
great enough to allow the e-book to be obtained for free," it states. The
authors write that the type of ad could influence the value of the
discount, with "lower class advertising... such as teeth whitener
advertisements" offering a cheaper price than "high" or "middle class"
adverts, for things like pizza. The inventors also suggest that ads could
be linked to the mood or emotional state the reader is in as they
progress through a title. For example, they say if characters fall in love
or show affection during a chapter, then ads for flowers or entertainment
could be triggered. The patents also suggest this could be applied to
children's books - giving the Tom Hanks animated film Polar Express as an
example. It says a scene showing a waiter giving the protagonists hot
drinks "may be an excellent opportunity to show an advertisement for hot
cocoa, or a branded chocolate bar". Another example states: "If the
setting includes young characters, a Coke advertisement could be provided,
inviting the reader to enjoy a glass of Coke with his book, and providing
a graphic of a cool glass." It adds that such targeting could be further
enhanced by taking account of previous titles the owner has bought.
'Advertising-free zone' At present, several Amazon and Kobo e-book readers
offer full-screen adverts when the device is switched off and show smaller
ads on their menu screens, but the main text of the titles remains free of
marketing. Yahoo does not currently provide ads to these devices, and a
move into the area could boost its shrinking revenues. However, Philip
Jones, deputy editor of the Bookseller magazine, said that the internet
firm might struggle to get some of its ideas adopted. "This has been
mooted before and was fairly well decried," he said. "Perhaps in a limited
context it could work if the merchandise was strongly related to the title
and was kept away from the text. "But readers - particularly parents -
like the fact that reading is an advertising-free zone. Authors would also
want something to say about ads interrupting their narrative flow."
inference:
  parameters:
    max_length: 84
    num_beams: 4
---
# mT5-multilingual-XLSum
This repository contains the mT5 checkpoint fine-tuned on the 45 languages of the XL-Sum dataset. For fine-tuning details and scripts, see the [paper](https://aclanthology.org/2021.findings-acl.413) and the [official repository](https://github.com/csebuetnlp/xl-sum).
## Using this model in `transformers` (tested on 4.11.0.dev0)
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

article_text = """Input article text"""

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Collapse newlines and tokenize; articles longer than 512 tokens are truncated.
input_ids = tokenizer(
    [article_text.replace("\n", " ").strip()],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

# Beam-search decoding; no_repeat_ngram_size=2 blocks repeated bigrams.
# These are the generation settings used for the benchmark scores below.
output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)
```
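The same checkpoint also works with the high-level `pipeline` API. Below is a minimal sketch, assuming the generic summarization pipeline forwards the generation arguments used above; it reuses the `article_text` placeholder from the previous example:

```python
from transformers import pipeline

# Minimal sketch: the summarization pipeline wraps tokenization, generation
# and decoding; generation kwargs are forwarded to model.generate().
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

result = summarizer(
    article_text.replace("\n", " ").strip(),
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)
print(result[0]["summary_text"])
```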
## Benchmarks

Scores on the XL-Sum test sets are given below.
Language | ROUGE-1 / ROUGE-2 / ROUGE-L |
---|---|
Amharic | 20.0485 / 7.4111 / 18.0753 |
Arabic | 34.9107 / 14.7937 / 29.1623 |
Azerbaijani | 21.4227 / 9.5214 / 19.3331 |
Bengali | 29.5653 / 12.1095 / 25.1315 |
Burmese | 15.9626 / 5.1477 / 14.1819 |
Chinese (Simplified) | 39.4071 / 17.7913 / 33.406 |
Chinese (Traditional) | 37.1866 / 17.1432 / 31.6184 |
English | 37.601 / 15.1536 / 29.8817 |
French | 35.3398 / 16.1739 / 28.2041 |
Gujarati | 21.9619 / 7.7417 / 19.86 |
Hausa | 39.4375 / 17.6786 / 31.6667 |
Hindi | 38.5882 / 16.8802 / 32.0132 |
Igbo | 31.6148 / 10.1605 / 24.5309 |
Indonesian | 37.0049 / 17.0181 / 30.7561 |
Japanese | 48.1544 / 23.8482 / 37.3636 |
Kirundi | 31.9907 / 14.3685 / 25.8305 |
Korean | 23.6745 / 11.4478 / 22.3619 |
Kyrgyz | 18.3751 / 7.9608 / 16.5033 |
Marathi | 22.0141 / 9.5439 / 19.9208 |
Nepali | 26.6547 / 10.2479 / 24.2847 |
Oromo | 18.7025 / 6.1694 / 16.1862 |
Pashto | 38.4743 / 15.5475 / 31.9065 |
Persian | 36.9425 / 16.1934 / 30.0701 |
Pidgin | 37.9574 / 15.1234 / 29.872 |
Portuguese | 37.1676 / 15.9022 / 28.5586 |
Punjabi | 30.6973 / 12.2058 / 25.515 |
Russian | 32.2164 / 13.6386 / 26.1689 |
Scottish Gaelic | 29.0231 / 10.9893 / 22.8814 |
Serbian (Cyrillic) | 23.7841 / 7.9816 / 20.1379 |
Serbian (Latin) | 21.6443 / 6.6573 / 18.2336 |
Sinhala | 27.2901 / 13.3815 / 23.4699 |
Somali | 31.5563 / 11.5818 / 24.2232 |
Spanish | 31.5071 / 11.8767 / 24.0746 |
Swahili | 37.6673 / 17.8534 / 30.9146 |
Tamil | 24.3326 / 11.0553 / 22.0741 |
Telugu | 19.8571 / 7.0337 / 17.6101 |
Thai | 37.3951 / 17.275 / 28.8796 |
Tigrinya | 25.321 / 8.0157 / 21.1729 |
Turkish | 32.9304 / 15.5709 / 29.2622 |
Ukrainian | 23.9908 / 10.1431 / 20.9199 |
Urdu | 39.5579 / 18.3733 / 32.8442 |
Uzbek | 16.8281 / 6.3406 / 15.4055 |
Vietnamese | 32.8826 / 16.2247 / 26.0844 |
Welsh | 32.6599 / 11.596 / 26.1164 |
Yoruba | 31.6595 / 11.6599 / 25.0898 |
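For a quick sanity check against these numbers, the test splits can be loaded from the Hub and scored with a generic ROUGE implementation. The sketch below assumes the `datasets`, `evaluate`, and `rouge_score` packages are installed and reuses the `tokenizer` and `model` loaded earlier; note that the official scores were produced with the multilingual ROUGE setup from the XL-Sum repository, so `evaluate`'s default English tokenization will not reproduce the non-English rows exactly:

```python
import evaluate
from datasets import load_dataset

# Small sanity-check run on a slice of the English test split.
dataset = load_dataset("csebuetnlp/xlsum", "english", split="test[:32]")
rouge = evaluate.load("rouge")  # needs the rouge_score package

predictions = []
for example in dataset:
    input_ids = tokenizer(
        [example["text"].replace("\n", " ").strip()],
        return_tensors="pt",
        truncation=True,
        max_length=512
    )["input_ids"]
    output_ids = model.generate(
        input_ids=input_ids,
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]
    predictions.append(tokenizer.decode(output_ids, skip_special_tokens=True))

# F-scores for rouge1 / rouge2 / rougeL over the sampled slice.
print(rouge.compute(predictions=predictions, references=dataset["summary"]))
```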
## Citation

If you use this model, please cite the following paper:
```bibtex
@inproceedings{hasan-etal-2021-xl,
title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
author = "Hasan, Tahmid and
Bhattacharjee, Abhik and
Islam, Md. Saiful and
Mubasshir, Kazi and
Li, Yuan-Fang and
Kang, Yong-Bin and
Rahman, M. Sohel and
Shahriyar, Rifat",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.413",
pages = "4693--4703",
}
```