tags:
- modernbert
- m-bert
---

# **MBERT Context Specifier**
*MBERT Context Specifier* is a 150M-parameter context labeler (text classifier) built on a modernized bidirectional encoder-only Transformer (BERT-style). The model is pre-trained on 2 trillion tokens of English and code data, with a native context length of up to 8,192 tokens. It incorporates the following features:

1. **Rotary Positional Embeddings (RoPE):** Enables long-context support.
2. **Local-Global Alternating Attention:** Improves efficiency when processing long inputs.
3. **Unpadding and Flash Attention:** Optimizes inference efficiency.

ModernBERT's native long context makes it well suited to tasks that require processing lengthy documents, such as retrieval, classification, and semantic search within large corpora. Because the model was trained on a vast dataset of text and code, it is also suitable for a wide range of downstream tasks, including code retrieval and hybrid (text + code) semantic search.
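A quick way to confirm the advertised 8,192-token window from the checkpoint itself; the `max_position_embeddings` attribute below assumes a ModernBERT-style config, and other configs may expose the limit differently:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "prithivMLmods/MBERT-Context-Specifier"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Expected to report 8192 for a ModernBERT-style checkpoint (assumption, not verified here).
print(config.max_position_embeddings)
# Tokenizer-side limit, if one is set in the repository.
print(tokenizer.model_max_length)
```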
# **Run inference**
```python
from transformers import pipeline

# load model from huggingface.co/models using our repository id
classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,
)

sample = "The global market for sustainable technologies has seen rapid growth over the past decade as businesses increasingly prioritize environmental sustainability."

classifier(sample)
```
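The pipeline returns a list of dictionaries, one per input, each with a predicted label and a confidence score; `device=0` places the model on the first GPU, while `device=-1` (or omitting the argument) runs it on CPU. A minimal sketch of inspecting the result, with a placeholder label name rather than one taken from the model card:

```python
result = classifier(sample)
# e.g. [{'label': 'business', 'score': 0.97}]  -- label names depend on the model's config
print(result[0]["label"], result[0]["score"])

# On recent transformers versions, top_k=None returns scores for every class.
all_scores = classifier(sample, top_k=None)
```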
# **Intended Use**
The MBERT Context Specifier is designed for the following purposes:

1. **Text and Code Classification:**
   - Assigning contextual labels to long text or code inputs.
   - Suitable for tasks requiring semantic understanding of both text and code.

2. **Long-Document Processing:**
   - Ideal for tasks like document retrieval, summarization, and classification within lengthy documents (up to 8,192 tokens); see the sketch after this list for a long-input example.

3. **Semantic Search:**
   - Enables semantic understanding and hybrid (text + code) search across large corpora.
   - Applicable in industries with domain-specific retrieval needs (e.g., legal, healthcare, and finance).

4. **Code Retrieval and Documentation:**
   - Retrieving relevant code snippets or understanding context in large codebases and technical documentation.

5. **Language Understanding and Analysis:**
   - General-purpose tasks like question answering, summarization, and sentiment analysis over long text inputs.

6. **Efficient Inference with Long Contexts:**
   - Optimized for efficient processing of long inputs with minimal computational overhead, thanks to Flash Attention and RoPE.
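As referenced in item 2 above, here is a minimal sketch of classifying a long document with the full 8,192-token window, using the same repository id as the pipeline example; that the checkpoint exposes `id2label` names in its config is an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

long_document = "..."  # a lengthy report, article, or source file

# Truncate to the model's native 8,192-token context window.
inputs = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = int(logits.argmax(dim=-1))
print(model.config.id2label[predicted_id])
```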
# **Limitations**
1. **Domain-Specific Performance:**
   - Although pre-trained on a large corpus of text and code, MBERT may require fine-tuning for niche or highly specialized domains to achieve optimal performance.

2. **Tokenization Constraints:**
   - Inputs exceeding the 8,192-token limit need truncation or intelligent preprocessing to avoid losing critical information; a chunking sketch follows this list.

3. **Bias in Training Data:**
   - The pre-training data (text + code) may carry biases from the source corpora, which can lead to biased classifications or retrievals in certain contexts.

4. **Code-Specific Challenges:**
   - Although MBERT supports code understanding, it may struggle with niche programming languages or highly domain-specific coding standards without fine-tuning.

5. **Inference Costs on Resource-Constrained Devices:**
   - Processing long-context inputs can be computationally expensive, making MBERT less suitable for edge devices or environments with limited compute.

6. **Multilingual Support:**
   - While optimized for English and code, MBERT may perform sub-optimally in other languages unless fine-tuned on multilingual datasets.
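As noted in limitation 2, inputs longer than 8,192 tokens must be split before classification. Below is a minimal chunking sketch; the window size, overlap, and majority-vote aggregation are illustrative choices rather than part of the model card:

```python
from collections import Counter

from transformers import AutoTokenizer, pipeline

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model_id, tokenizer=tokenizer)

def classify_long_text(text: str, max_tokens: int = 8192, overlap: int = 256) -> str:
    """Split an over-length input into overlapping token windows,
    classify each window, and return the majority label."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - overlap
    labels = []
    for start in range(0, len(ids), step):
        chunk = tokenizer.decode(ids[start:start + max_tokens])
        # truncation guards against special tokens pushing a window past the limit
        labels.append(classifier(chunk, truncation=True, max_length=max_tokens)[0]["label"])
    return Counter(labels).most_common(1)[0][0]
```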