Commit b8b62e2 · Create README.md
Parent: 265aaaf

README.md (added, 78 lines)

<span align="center">
  <a href="https://huggingface.co/SajjadAyoubi/"><img src="https://img.shields.io/static/v1?label=%F0%9F%A4%97%20Hugging%20Face&message=SajjadAyoubi&color=yellow"></a>
  <a href="https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/Demo.ipynb"><img src="https://img.shields.io/static/v1?label=Colab&message=Fine-tuning Example&logo=Google%20Colab&color=f9ab00"></a>
</span>

# ParsBigBird: Persian Bert For **Long-Range** Sequences

The [BERT](https://arxiv.org/abs/1810.04805) and [ParsBERT](https://arxiv.org/abs/2005.12515) models can handle texts with token lengths of up to 512; however, many tasks such as summarization and question answering require longer texts. In this work, we have trained the [BigBird](https://arxiv.org/abs/2007.14062) model for the Farsi (Persian) language, which uses sparse attention to process texts of up to 4096 tokens.

## Evaluation: 🌡️

We have evaluated the model on three tasks with different sequence lengths.

| Name | Params | SnappFood (F1) | Digikala Magazine | PersianQA (F1) |
| :---: | :---: | :---: | :---: | :---: |
| [distil-bigbird-fa-zwnj](https://github.com/sajjjadayobi/ParsBigBird) | 78M | 85.43% | **94.05%** | **73.34%** |
| [bert-base-fa](https://github.com/hooshvare/parsbert) | 118M | **87.98%** | 93.65% | 70.06% |

- Despite being only the size of DistilBERT (78M parameters), the model performs on par with ParsBERT and does much better on PersianQA, which requires much longer context.
- This evaluation was run with `max_length=2048` (it can be raised up to 4096); a tokenization sketch is shown below.
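
As a rough illustration (not the exact evaluation script), capping inputs at 2048 tokens with the 🤗 tokenizer looks like this; the document string below is a placeholder:

```python
from transformers import AutoTokenizer

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# cap each document at 2048 tokens; max_length can be raised up to 4096
batch = tokenizer(
    "یک سند بلند فارسی ...",  # placeholder document
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
```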

## How to use❓

### As Contextualized Word Embedding
```python
from transformers import BigBirdModel, AutoTokenizer

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
# by default the attention type is `block_sparse` with block_size=32
model = BigBirdModel.from_pretrained(MODEL_NAME, block_size=32)
# you can also use full attention; do this when the input is no longer than 512 tokens
model = BigBirdModel.from_pretrained(MODEL_NAME, attention_type="original_full")

# "I hope the model turns out to be useful, because it took a long time to train"
text = "😃 امیدوارم مدل بدردبخوری باشه چون خیلی طول کشید تا ترین بشه"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)  # contextualized embeddings
```
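
Here `output.last_hidden_state` holds one contextualized vector per input token, with shape `(batch_size, sequence_length, hidden_size)`.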

### As a Fill-in-the-Blank Model
```python
from transformers import pipeline

MODEL_NAME = 'SajjadAyoubi/distil-bigbird-fa-zwnj'
fill = pipeline('fill-mask', model=MODEL_NAME, tokenizer=MODEL_NAME)
# "Tehran is the capital of [MASK]."
results = fill('تهران پایتخت [MASK] است.')
print(results[0]['token_str'])  # >>> 'ایران' (Iran)
```
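
Each entry in `results` is a dictionary containing the predicted `token_str`, its probability `score`, and the completed `sequence`.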

## Pretraining details: 🔭

This model was pretrained with a masked language modeling (MLM) objective on the Persian section of the OSCAR dataset. Following the original BERT recipe, 15% of tokens were masked. BigBird was first described in this [paper](https://arxiv.org/abs/2007.14062) and released in this [repository](https://github.com/google-research/bigbird). Documents longer than 4096 tokens were split into multiple documents, while documents much shorter than 4096 tokens were merged using the `[SEP]` token (a rough sketch of this step is shown below). The model was warm-started from the `distilbert-fa` [checkpoint](https://huggingface.co/HooshvareLab/distilbert-fa-zwnj-base).

- For more details, take a look at `config.json` on the model card in the 🤗 Model Hub.
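
A minimal sketch of this split-and-merge step, assuming documents are already lists of token IDs (the function name and the `max_len`/`sep_id` defaults are illustrative, not the actual pretraining code):

```python
def split_and_merge(docs, max_len=4096, sep_id=4):
    """Split over-long token lists into max_len pieces and pack short ones together with a separator."""
    chunks, buffer = [], []
    for doc in docs:
        for i in range(0, len(doc), max_len):            # split long documents into max_len pieces
            piece = doc[i:i + max_len]
            if buffer and len(buffer) + 1 + len(piece) > max_len:
                chunks.append(buffer)                    # flush when the next piece would not fit
                buffer = []
            buffer = buffer + [sep_id] + piece if buffer else list(piece)
    if buffer:
        chunks.append(buffer)
    return chunks

# toy example with three "documents" of token IDs: the first is split, the short ones are packed together
print([len(c) for c in split_and_merge([[1] * 5000, [2] * 100, [3] * 200])])
```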

## Fine Tuning Recommendations: 🐤

Due to the model's memory requirements, `gradient_checkpointing` and `gradient_accumulation` should be used to maintain a reasonable batch size (a sketch is given below). Since the model isn't very large, it's a good idea to first fine-tune it on your dataset with a masked LM objective (also known as intermediate fine-tuning) before training on the main task. In `block_sparse` mode, the cost does not depend on how many tokens are in the input; each position attends to only 256 tokens. For sequence lengths up to 512, use `original_full` attention instead of `block_sparse`.
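
A minimal sketch of these memory-saving settings with the 🤗 `Trainer` API; the hyperparameter values here are illustrative, not recommendations from the authors:

```python
from transformers import TrainingArguments

# illustrative values: effective batch size = 2 x 8 = 16
training_args = TrainingArguments(
    output_dir="parsbigbird-finetuned",   # placeholder output path
    per_device_train_batch_size=2,        # small per-step batch to fit long sequences in memory
    gradient_accumulation_steps=8,        # accumulate gradients to recover a larger effective batch
    gradient_checkpointing=True,          # recompute activations to save memory
    learning_rate=3e-5,
    num_train_epochs=3,
)
```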

### Fine Tuning Examples 👷♂️👷♀️

| Dataset | Fine Tuning Example |
| --- | --- |
| Digikala Magazine Text Classification | <a href="https://colab.research.google.com/github/sajjjadayobi/PersianQA/blob/main/notebooks/Demo.ipynb"><img src="https://img.shields.io/static/v1?label=Colab&message=Fine-tuning Example&logo=Google%20Colab&color=f9ab00"></a> |
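
For reference, a sketch of loading the model for a text classification task like the one above (the label count is a placeholder; the `Trainer` settings from the previous section apply here as well):

```python
from transformers import AutoTokenizer, BigBirdForSequenceClassification

MODEL_NAME = "SajjadAyoubi/distil-bigbird-fa-zwnj"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels is a placeholder: set it to the number of classes in your dataset
model = BigBirdForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=7)
model.gradient_checkpointing_enable()  # enable gradient checkpointing on the model side
```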

## Contact us: 🤝

If you have a technical question regarding the model, pretraining, code or publication, please create an issue in the repository. This is the fastest way to reach us.

## Citation: ↩️

We haven't published any paper on this work yet. However, if you use it, please cite us properly with an entry like the one below.

```bibtex
@misc{ParsBigBird,
  author = {Ayoubi, Sajjad},
  title = {ParsBigBird: Persian Bert For Long-Range Sequences},
  year = 2021,
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SajjjadAyobi/ParsBigBird}},
}
```