Ihor committed da85c44 (verified) · parent: 9a402b3

Update README.md

Files changed (1): README.md (+384, −3)
---
license: apache-2.0
language:
- en
metrics:
- f1
- precision
- recall
tags:
- NER
- information extraction
- relation extraction
- summarization
- sentiment extraction
- question-answering
pipeline_tag: token-classification
library_name: gliner
datasets:
- knowledgator/GLINER-multi-task-synthetic-data
---
🚀 Meet the first multi-task prompt-tunable GLiNER model 🚀

**GLiNER-Multitask** is a model designed to extract various pieces of information from plain text based on a user-provided custom prompt. This versatile model leverages a bidirectional transformer encoder, similar to BERT, which ensures both high generalization and compute efficiency despite its compact size.

The `gliner-multitask-large` variant achieves state-of-the-art performance on zero-shot NER benchmarks, demonstrating its robustness and flexibility. It excels not only in named entity recognition but also in handling various other information extraction tasks, making it a powerful tool for diverse natural language processing applications.

### Supported tasks:
* **Named Entity Recognition (NER)**: Identifies and categorizes entities such as names, organizations, dates, and other specific items in the text.
* **Relation Extraction**: Detects and classifies relationships between entities within the text.
* **Summarization**: Extracts the most important sentences that summarize the input text, capturing the essential information.
* **Sentiment Extraction**: Identifies parts of the text that signal a positive, negative, or neutral sentiment.
* **Key-Phrase Extraction**: Identifies and extracts important phrases and keywords from the text.
* **Question-answering**: Finds an answer in the text given a question.
* **Open Information Extraction**: Extracts pieces of text given an open prompt from the user, for example, extracting product descriptions.
* **Text classification**: Classifies text by matching it against labels specified in the prompt.

### Installation
To use this model, you must install the [GLiNER Python library](https://github.com/urchade/GLiNER):

```bash
pip install gliner
```

You also need to install the LLM2Vec package:
```bash
pip install llm2vec
```

Once you've installed the GLiNER library, you can import the GLiNER class and load this model using `GLiNER.from_pretrained`.

**How to use for NER:**

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-llama-multitask-1B-v1.0")

text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["founder", "computer", "software", "position", "date"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

If you want to use flash attention or increase the sequence length, check the following code:
```python
from gliner import GLiNER
import torch

model = GLiNER.from_pretrained("knowledgator/gliner-llama-multitask-1B-v1.0",
                               _attn_implementation = 'flash_attention_2',
                               max_length = 2048).to('cuda:0', dtype=torch.float16)
```

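Note that `flash_attention_2` assumes the separate `flash-attn` package is available and a compatible CUDA GPU is present; this is an assumption based on standard Hugging Face attention-implementation behavior rather than something stated on this card:

```bash
# Assumed prerequisite for _attn_implementation='flash_attention_2' (not stated on this card)
pip install flash-attn
```
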
### Performance:

| Model | Dataset | Precision | Recall | F1 Score | F1 Score (Decimal) |
|------------------------------------|--------------------|-----------|--------|----------|--------------------|
| knowledgator/gliner-multitask-v0.5 | CrossNER_AI | 51.00% | 51.11% | 51.05% | 0.5105 |
| | CrossNER_literature | 72.65% | 65.62% | 68.96% | 0.6896 |
| | CrossNER_music | 74.91% | 73.70% | 74.30% | 0.7430 |
| | CrossNER_politics | 78.84% | 77.71% | 78.27% | 0.7827 |
| | CrossNER_science | 69.20% | 65.48% | 67.29% | 0.6729 |
| | mit-movie | 61.29% | 52.59% | 56.60% | 0.5660 |
| | mit-restaurant | 50.65% | 38.13% | 43.51% | 0.4351 |
| | **Average** | | | | **0.6276** |
| knowledgator/gliner-multitask-v1.0 | CrossNER_AI | 51.00% | 51.11% | 51.05% | 0.5105 |
| | CrossNER_literature | 72.65% | 65.62% | 68.96% | 0.6896 |
| | CrossNER_music | 74.91% | 73.70% | 74.30% | 0.7430 |
| | CrossNER_politics | 78.84% | 77.71% | 78.27% | 0.7827 |
| | CrossNER_science | 69.20% | 65.48% | 67.29% | 0.6729 |
| | mit-movie | 61.29% | 52.59% | 56.60% | 0.5660 |
| | mit-restaurant | 50.65% | 38.13% | 43.51% | 0.4351 |
| | **Average** | | | | **0.6276** |

---
**How to use for relation extraction:**

```python
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["Microsoft <> founder", "Microsoft <> inception date", "Bill Gates <> held position"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["label"], "=>", entity["text"])
```
### Construct a relation extraction pipeline with [utca](https://github.com/Knowledgator/utca)
First of all, we need to import the necessary components of the library, initialize the predictor (the GLiNER model), and construct a pipeline that combines NER and relation extraction:
```python
from utca.core import RenameAttribute
from utca.implementation.predictors import (
    GLiNERPredictor,
    GLiNERPredictorConfig
)
from utca.implementation.tasks import (
    GLiNER,
    GLiNERPreprocessor,
    GLiNERRelationExtraction,
    GLiNERRelationExtractionPreprocessor,
)

predictor = GLiNERPredictor( # Predictor manages the model that will be used by tasks
    GLiNERPredictorConfig(
        model_name = "knowledgator/gliner-llama-multitask-1B-v1.0", # Model to use
        device = "cuda:0", # Device to use
    )
)

pipe = (
    GLiNER( # GLiNER task produces classified entities that will be at the "output" key.
        predictor=predictor,
        preprocess=GLiNERPreprocessor(threshold=0.7) # Entities threshold
    )
    | RenameAttribute("output", "entities") # Rename output entities from GLiNER task to use them as inputs in GLiNERRelationExtraction
    | GLiNERRelationExtraction( # GLiNERRelationExtraction is used for relation extraction.
        predictor=predictor,
        preprocess=(
            GLiNERPreprocessor(threshold=0.5) # Relations threshold
            | GLiNERRelationExtractionPreprocessor()
        )
    )
)
```

To run the pipeline, we need to specify entity types and relations with their parameters:

```python
r = pipe.run({
    "text": text, # Text to process
    "labels": ["organisation", "founder", "position", "date"],
    "relations": [{ # Relation parameters
        "relation": "founder", # Relation label. Required parameter.
        "pairs_filter": [("organisation", "founder")], # Optional parameter. It specifies possible members of relations by their entity labels.
        "distance_threshold": 100, # Optional parameter. It specifies the max distance between spans in the text (i.e., the end of the span that is closer to the start of the text and the start of the next one).
    }, {
        "relation": "inception date",
        "pairs_filter": [("organisation", "date")],
    }, {
        "relation": "held position",
        "pairs_filter": [("founder", "position")],
    }]
})

print(r["output"])
```

### Performance:
| Model | Dataset | Precision | Recall | F1 Score |
|:-----------------------|------------:|---------:|-----------:|-----------:|
| knowledgator/gliner-llama-multitask-1B-v1.0 | CrossRe | 0.606472 | 0.511444 | 0.554919 |
| | DocRed | 0.707483 | 0.589355 | 0.643039 |
| knowledgator/gliner-multitask-v0.5 | CrossRe | 0.585319 | 0.800176 | 0.676088 |
| | DocRed | 0.713392 | 0.772826 | 0.74192 |
| knowledgator/gliner-multitask-v1.0 | CrossRe | 0.760653 | 0.738556 | 0.749442 |
| | DocRed | 0.770644 | 0.761373 | 0.76598 |

---

**How to use for open information extraction:**

```python
prompt = """Find all positive aspects about the product:\n"""
text = """
I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.

The headphones themselves are remarkable. The noise-canceling feature works like a charm in the bustling city environment, and the 30-hour battery life means I don't have to charge them every day. Connecting them to my Samsung Galaxy S21 was a breeze, and the sound quality is second to none.

I also appreciated the customer service from Amazon when I had a question about the warranty. They responded within an hour and provided all the information I needed.

However, the headphones did not come with a hard case, which was listed in the product description. I contacted Amazon, and they offered a 10% discount on my next purchase as an apology.

Overall, I'd give these headphones a 4.5/5 rating and highly recommend them to anyone looking for top-notch quality in both product and service.
"""

input_ = prompt + text

labels = ["match"]

matches = model.predict_entities(input_, labels)

for match in matches:
    print(match["text"], "=>", match["score"])
```

### Performance:

*Dataset: WiRe57_343-manual-oie*
| Model | Precision | Recall | F1 Score |
|:-----------------------|------------:|---------:|-----------:|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.914894 | 0.200466 | 0.328872 |
| knowledgator/gliner-multitask-v0.5 | 0.848485 | 0.140351 | 0.24086 |
| knowledgator/gliner-multitask-v1.0 | 0.9 | 0.155172 | 0.264706 |

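---

**How to use for key-phrase extraction:**

Key-phrase extraction is listed among the supported tasks but has no example on this card, so below is a minimal sketch that reuses the prompt-plus-`match`-label pattern shown for the other prompt-based tasks; the exact prompt wording and threshold are assumptions rather than officially recommended values.

```python
# Hypothetical prompt; adjust the wording and threshold for your own data.
prompt = "Extract the most important key-phrases from the text:\n"
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800.
"""

input_ = prompt + text
labels = ["match"]

key_phrases = model.predict_entities(input_, labels, threshold=0.5)

for kp in key_phrases:
    print(kp["text"], "=>", kp["score"])
```
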
---

**How to use for question-answering:**

```python
question = "Who was the CEO of Microsoft?"
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014.
"""

labels = ["answer"]

input_ = question + text
answers = model.predict_entities(input_, labels)

for answer in answers:
    print(answer["text"], "=>", answer["score"])
```

### Performance:
*Dataset: SQuAD 2.0*

| Model | Precision | Recall | F1 Score |
|:-----------------------|------------:|---------:|-----------:|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 0.578296 | 0.795821 | 0.669841 |
| knowledgator/gliner-multitask-v0.5 | 0.429213 | 0.94378 | 0.590072 |
| knowledgator/gliner-multitask-v1.0 | 0.601354 | 0.874784 | 0.712745 |

---

**How to use for summarization:**

With the threshold parameter, you can control how much information you want to extract.

```python
prompt = "Summarize the given text, highlighting the most important information:\n"

text = """
Several studies have reported its pharmacological activities, including anti-inflammatory, antimicrobial, and antitumoral effects.
The effect of E-anethole was studied in the osteosarcoma MG-63 cell line, and the antiproliferative activity was evaluated by an MTT assay.
It showed a GI50 value of 60.25 μM with apoptosis induction through the mitochondrial-mediated pathway. Additionally, it induced cell cycle arrest at the G0/G1 phase, up-regulated the expression of p53, caspase-3, and caspase-9, and down-regulated Bcl-xL expression.
Moreover, the antitumoral activity of anethole was assessed against oral tumor Ca9-22 cells, and the cytotoxic effects were evaluated by MTT and LDH assays.
It demonstrated a LD50 value of 8 μM, and cellular proliferation was 42.7% and 5.2% at anethole concentrations of 3 μM and 30 μM, respectively.
It was reported that it could selectively and in a dose-dependent manner decrease cell proliferation and induce apoptosis, as well as induce autophagy, decrease ROS production, and increase glutathione activity. The cytotoxic effect was mediated through NF-kB, MAP kinases, Wnt, caspase-3 and -9, and PARP1 pathways. Additionally, treatment with anethole inhibited cyclin D1 oncogene expression, increased cyclin-dependent kinase inhibitor p21WAF1, up-regulated p53 expression, and inhibited the EMT markers.
"""

labels = ["summary"]

input_ = prompt + text

threshold = 0.5
summaries = model.predict_entities(input_, labels, threshold=threshold)

for summary in summaries:
    print(summary["text"], "=>", summary["score"])
```

### Performance:
*Dataset: SQuAD 2.0*

| Model | BLEU | ROUGE1 | ROUGE2 | ROUGEL | Cosine Similarity |
|:-----------------------|------------:|----------:|-----------:|----------:|--------------------:|
| knowledgator/gliner-llama-multitask-1B-v1.0 | 7.9728e-157 | 0.0955005 | 0.00236265 | 0.0738533 | 0.0515591 |
| knowledgator/gliner-multitask-v0.5 | 1.70326e-06 | 0.0627964 | 0.00203505 | 0.0482932 | 0.0532316 |
| knowledgator/gliner-multitask-v1.0 | 5.78799e-06 | 0.0878883 | 0.0030312 | 0.0657152 | 0.060342 |

---

**How to use for text classification:**

With the threshold parameter, you can control the recall and precision of text classification.

```python
prompt = "Classify text into the following classes: positive review, negative review"

text = """
I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.
"""

labels = ["match"]

input_ = prompt + text

threshold = 0.5
classes = model.predict_entities(input_, labels, threshold=threshold)

for label in classes:
    print(label["text"], "=>", label["score"])
```

### Performance:

| Model Name | Dataset | Micro F1 Score |
|-----------------------|-----------|----------------|
| knowledgator/gliner-multitask-v1.0 | Emotion | 0.322 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |
| knowledgator/gliner-llama-multitask-1B-v1.0 | Emotion | 0.3475 |
| | AG News | 0.7436 |
| | IMDb | 0.7907 |

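---

**How to use for sentiment extraction:**

Sentiment extraction is also listed among the supported tasks without an example on this card, so here is a minimal sketch following the same prompt-based pattern; the prompt text, sample review, and threshold are assumptions you may need to tune.

```python
# Hypothetical prompt; the model is asked to mark sentiment-bearing spans.
prompt = "Find all positive and negative aspects mentioned in the text:\n"
text = """
The food was delicious and the staff were friendly, but the waiting time was far too long.
"""

input_ = prompt + text
labels = ["match"]

sentiments = model.predict_entities(input_, labels, threshold=0.5)

for s in sentiments:
    print(s["text"], "=>", s["score"])
```
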
---

### Extensive NER Benchmarks:

![Model Performance](gliner_multitask_performance.png)

Our multitask model demonstrates performance comparable to models dedicated to the NER task across different zero-shot benchmarks (all labels were lowercased in this testing):

| Model | Dataset | Precision | Recall | F1 Score | F1 Score (Decimal) |
|------------------------------------|--------------------|-----------|--------|----------|--------------------|
| numind/NuNER_Zero-span | CrossNER_AI | 63.82% | 56.82% | 60.12% | 0.6012 |
| | CrossNER_literature| 73.53% | 58.06% | 64.89% | 0.6489 |
| | CrossNER_music | 72.69% | 67.40% | 69.95% | 0.6995 |
| | CrossNER_politics | 77.28% | 68.69% | 72.73% | 0.7273 |
| | CrossNER_science | 70.08% | 63.12% | 66.42% | 0.6642 |
| | mit-movie | 63.00% | 48.88% | 55.05% | 0.5505 |
| | mit-restaurant | 54.81% | 37.62% | 44.62% | 0.4462 |
| | **Average** | | | | **0.6196** |
| knowledgator/gliner-multitask-v0.5 | CrossNER_AI | 51.00% | 51.11% | 51.05% | 0.5105 |
| | CrossNER_literature | 72.65% | 65.62% | 68.96% | 0.6896 |
| | CrossNER_music | 74.91% | 73.70% | 74.30% | 0.7430 |
| | CrossNER_politics | 78.84% | 77.71% | 78.27% | 0.7827 |
| | CrossNER_science | 69.20% | 65.48% | 67.29% | 0.6729 |
| | mit-movie | 61.29% | 52.59% | 56.60% | 0.5660 |
| | mit-restaurant | 50.65% | 38.13% | 43.51% | 0.4351 |
| | **Average** | | | | **0.6276** |
| urchade/gliner_large-v2.1 | CrossNER_AI | 54.98% | 52.00% | 53.45% | 0.5345 |
| | CrossNER_literature| 59.33% | 56.47% | 57.87% | 0.5787 |
| | CrossNER_music | 67.39% | 66.77% | 67.08% | 0.6708 |
| | CrossNER_politics | 66.07% | 63.76% | 64.90% | 0.6490 |
| | CrossNER_science | 61.45% | 62.56% | 62.00% | 0.6200 |
| | mit-movie | 55.94% | 47.36% | 51.29% | 0.5129 |
| | mit-restaurant | 53.34% | 40.83% | 46.25% | 0.4625 |
| | **Average** | | | | **0.5754** |
| EmergentMethods/gliner_large_news-v2.1| CrossNER_AI | 59.60% | 54.55% | 56.96% | 0.5696 |
| | CrossNER_literature| 65.41% | 56.16% | 60.44% | 0.6044 |
| | CrossNER_music | 67.47% | 63.08% | 65.20% | 0.6520 |
| | CrossNER_politics | 66.05% | 60.07% | 62.92% | 0.6292 |
| | CrossNER_science | 68.44% | 63.57% | 65.92% | 0.6592 |
| | mit-movie | 65.85% | 49.59% | 56.57% | 0.5657 |
| | mit-restaurant | 54.71% | 35.94% | 43.38% | 0.4338 |
| | **Average** | | | | **0.5876** |

### Join Our Discord

Connect with our community on Discord for news, support, and discussion about our models. Join [Discord](https://discord.gg/dkyeAgs9DG).

### Citation:
```
@misc{stepanov2024gliner,
      title={GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks},
      author={Ihor Stepanov and Mykhailo Shtopko},
      year={2024},
      eprint={2406.12925},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```