---
datasets:
- csebuetnlp/xlsum
language:
  - am
  - ar
  - az
  - bn
  - my
  - zh
  - en
  - fr
  - gu
  - ha
  - hi
  - ig
  - id
  - ja
  - rn
  - ko
  - ky
  - mr
  - ne
  - om
  - ps
  - fa
  - pcm
  - pt
  - pa
  - ru
  - gd
  - sr
  - si
  - so
  - es
  - sw
  - ta
  - te
  - th
  - ti
  - tr
  - uk
  - ur
  - uz
  - vi
  - cy
  - yo
multilinguality:
  - multilingual
pipeline_tag: summarization
---
# Model Card for deltalm-base-xlsum

This model is a fine-tuned version of [DeltaLM-base](https://huggingface.co/nguyenvulebinh/deltalm-base) on the [XLSum dataset](https://huggingface.co/datasets/csebuetnlp/xlsum), intended for abstractive multilingual summarization.

It achieves the following results on the evaluation set:
- rouge-1: 18.2
- rouge-2: 7.6
- rouge-l: 14.9
- rouge-lsum: 14.7
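
For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-N F1 score is computed from n-gram overlap. It is an illustration only, not the evaluation script behind the numbers above: the function names `ngrams` and `rouge_n` are hypothetical, whitespace tokenization is an assumption that does not work for languages without word delimiters (for example Chinese, Japanese, or Thai), and it is assumed that the reported scores are F1 values scaled by 100.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1 between two strings, using whitespace tokens (an assumption)."""
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "amazon adds a black friday nfl game"
    candidate = "amazon gets an extra nfl game on black friday"
    print(f"rouge-1: {100 * rouge_n(reference, candidate, 1):.1f}")
    print(f"rouge-2: {100 * rouge_n(reference, candidate, 2):.1f}")
```

A production evaluation would normally rely on an established ROUGE implementation with language-aware tokenization rather than this sketch.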

## Dataset description
[XLSum dataset](https://huggingface.co/datasets/csebuetnlp/xlsum) is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

## Languages
- amharic
- arabic
- azerbaijani
- bengali
- burmese
- chinese_simplified
- chinese_traditional
- english
- french
- gujarati
- hausa
- hindi
- igbo
- indonesian
- japanese
- kirundi
- korean
- kyrgyz
- marathi
- nepali
- oromo
- pashto
- persian
- pidgin
- portuguese
- punjabi
- russian
- scottish_gaelic
- serbian_cyrillic
- serbian_latin
- sinhala
- somali
- spanish
- swahili
- tamil
- telugu
- thai
- tigrinya
- turkish
- ukrainian
- urdu
- uzbek
- vietnamese
- welsh
- yoruba

## Training hyperparameters

The model was trained on a p4d.24xlarge instance on AWS SageMaker with the following configuration:
- model: deltalm base
- batch size: 8
- learning rate: 1e-5
- number of epochs: 3
- warmup steps: 500
- weight decay: 0.01
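
The settings above can be collected in a small config object, sketched below together with what the warmup setting implies for the learning rate. This is a hedged illustration, not the actual training script: `FineTuneConfig` and `warmup_lr` are hypothetical names, and the schedule shape (linear ramp from 0 to the peak rate, then constant) is an assumption, since the card does not state which scheduler was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed in this card; the class itself is hypothetical."""
    model: str = "deltalm-base"
    batch_size: int = 8
    learning_rate: float = 1e-5
    num_epochs: int = 3
    warmup_steps: int = 500
    weight_decay: float = 0.01


def warmup_lr(step, config):
    """Learning rate at a given optimizer step.

    Assumes a linear ramp during warmup and a constant rate afterwards;
    the real run may have decayed the rate after warmup.
    """
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps
    return config.learning_rate


if __name__ == "__main__":
    config = FineTuneConfig()
    for step in (0, 100, 250, 500, 1000):
        print(step, warmup_lr(step, config))
```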

## Inference example
```python
# Download these two files from the model repo and place them next to this script:
#   https://huggingface.co/hhhhzy/deltalm-base-xlsum/blob/main/modeling_deltalm.py
#   https://huggingface.co/hhhhzy/deltalm-base-xlsum/blob/main/configuration_deltalm.py
from modeling_deltalm import DeltalmForConditionalGeneration
from configuration_deltalm import DeltalmConfig
from transformers import AutoTokenizer

model = DeltalmForConditionalGeneration.from_pretrained("hhhhzy/deltalm-base-xlsum")
tokenizer = AutoTokenizer.from_pretrained("hhhhzy/deltalm-base-xlsum")

text = "The USA’s biggest sports league, the NFL, has extended its partnership with Amazon Prime, granting the streaming platform an additional live game on ‘black Friday’, the day after Thanksgiving. The additional game, added from 2023, builds on Amazon Prime’s package of ‘Thursday night football’ live rights (secured in an 11-year deal).\\nOn the surface, the deal makes sense because it gives Amazon Prime additional game time during the holiday season. But there is a deeper motivation at play. Black Friday is also regarded as the starting point of the pre-Christmas shopping season. Amazon has worked hard to leverage its sports rights in a way that benefits its ecommerce platform, so the addition of this fixture will boost that strategic goal.\\nIt’s unusual for sports rights holders to utilise their inventory in such a granular way – but it does suggest a shift towards a more data-driven approach to negotiations. For NFL, the deal means it now has partnerships with NBC, CBS, Fox and Amazon across the Thanksgiving period. Amazon Prime is currently in the NFL’s good books, helping revitalise the Thursday night slot through its marketing support and onscreen investment. Around 10 million people in the US are watching live fixtures each week."
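# truncation=True is required for max_length to actually cap the input at 512 tokens
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")

# Generate a summary of 32 to 128 tokens and decode it back to text
generate_ids = model.generate(inputs["input_ids"], min_length=32, max_length=128)
summary = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(summary)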
```