---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
base_model:
- intfloat/multilingual-e5-large
---

# Knesset-multi-e5-large

This is a [sentence-transformers](https://www.sbert.net) model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.

**Knesset-multi-e5-large** is based on the [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) model.
The transformer encoder has been fine-tuned on [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), a corpus of Israeli parliamentary proceedings, to better capture legislative and parliamentary language.

## Usage (Sentence-Transformers)

Using this model is straightforward if you have [sentence-transformers](https://www.sbert.net) installed:

```bash
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
# Hebrew examples: "This is a first example sentence", "This is the second sentence"
sentences = ["זה משפט ראשון לדוגמה", "זה המשפט השני"]

model = SentenceTransformer('GiliGold/Knesset-multi-e5-large')
embeddings = model.encode(sentences)
print(embeddings)
```
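
Because the model ends with a `Normalize()` module, its embeddings have unit length, so the dot product of two embeddings equals their cosine similarity. That is enough for a simple semantic search. The sketch below is a minimal, dependency-free illustration: `dot` and `rank_by_similarity` are hypothetical helper names (not part of sentence-transformers), and the toy vectors stand in for real `model.encode(...)` output.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return float(sum(x * y for x, y in zip(a, b)))


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate_index, score) pairs, most similar first."""
    scores = [(i, dot(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors used as placeholders for real embeddings (hypothetical data).
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With the real model, the same helper could be called as `rank_by_similarity(embeddings[0], embeddings)`, since the array returned by `model.encode` can be iterated row by row.
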


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
```
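
To make the `Pooling` (mean over tokens) and `Normalize` modules above concrete, here is a small sketch of what they compute for a single sentence. It is an illustration in plain Python, not the library's implementation, which works on batched tensors. The function names `mean_pool` and `l2_normalize` and the toy two-dimensional token vectors are assumptions made for the example; the real model uses 1024 dimensions.

```python
import math
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


# Two real tokens followed by one padding token (toy values).
tokens = [[6.0, 0.0], [0.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # padding row is ignored
sentence_vec = l2_normalize(pooled)   # unit-length sentence embedding
print(pooled, sentence_vec)
```

The padding token is excluded from the average, and the final vector has length 1, which is why dot products between the model's embeddings behave as cosine similarities.
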
## Additional Details
- Base Model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- Fine-Tuning Data: [Knesset data](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus)
- Key Modifications:
  - The encoder has been fine-tuned on the Knesset data to improve performance on tasks involving legislative and parliamentary content.
  - The original pooling and normalization layers have been retained, so the model's embeddings remain consistent with the architecture of the base model.

## Citing & Authors
TBD