prithivMLmods committed on
Commit 1467da5 · verified · 1 Parent(s): 99a74e2

Update README.md

Files changed (1): README.md (+70 -1)
library_name: transformers
tags:
- modernbert
- m-bert
---
# **MBERT Context Specifier**

*MBERT Context Specifier* is a 150M-parameter context labeler (text classifier) built on ModernBERT, a modernized bidirectional encoder-only Transformer (BERT-style). The base model was pre-trained on 2 trillion tokens of English text and code, with a native context length of up to 8,192 tokens. It incorporates the following features:

1. **Rotary Positional Embeddings (RoPE):** Enables long-context support by encoding positions as rotations of the query/key channels (see the sketch after this list).
2. **Local-Global Alternating Attention:** Improves efficiency on long inputs by alternating cheap local attention with full global attention.
3. **Unpadding and Flash Attention:** Speeds up inference by skipping padding tokens and using fused attention kernels.

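For intuition, here is a minimal, self-contained sketch of the rotation behind feature 1. This is illustrative only, not ModernBERT's actual implementation, which lives inside `transformers` and adds caching, head reshaping, and separate local/global frequency bases:

```python
import torch

def rope(x: torch.Tensor, theta_base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (seq_len, dim) by position-dependent angles.

    Pair (2i, 2i+1) at position p is rotated by angle p * theta_base^(-2i/dim),
    so dot products between rotated queries and keys depend only on relative
    position, which is the property that enables long-context support.
    """
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # (seq_len, 1)
    inv_freq = theta_base ** (-torch.arange(dim // 2, dtype=torch.float32) * 2 / dim)
    angles = pos * inv_freq                                              # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(8192, 64))  # works at any sequence length
```
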
ModernBERT’s native long context makes it well suited to tasks that process lengthy documents, such as retrieval, classification, and semantic search within large corpora. Because it was trained on a vast dataset of text and code, it also suits a wide range of downstream tasks, including code retrieval and hybrid (text + code) semantic search.

# **Run inference**

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub using the repository id
classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,  # first CUDA device; use device=-1 to run on CPU
)

sample = "The global market for sustainable technologies has seen rapid growth over the past decade as businesses increasingly prioritize environmental sustainability."

classifier(sample)
```
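
Equivalently, here is a minimal sketch without the pipeline wrapper, assuming the checkpoint exposes a standard sequence-classification head and an `id2label` mapping in its config (it reuses `sample` from above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Truncate at the model's native 8,192-token context
inputs = tokenizer(sample, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])
```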

# **Intended Use**

The MBERT Context Specifier is designed for the following purposes:

1. **Text and Code Classification:**
   - Assigning contextual labels to large text or code inputs.
   - Suitable for tasks requiring semantic understanding of both text and code.

2. **Long-Document Processing:**
   - Ideal for tasks such as document retrieval, summarization, and classification over lengthy documents (up to 8,192 tokens).

3. **Semantic Search:**
   - Enables semantic understanding and hybrid (text + code) search across large corpora (see the embedding sketch after this list).
   - Applicable in industries with domain-specific retrieval needs (e.g., legal, healthcare, and finance).

4. **Code Retrieval and Documentation:**
   - Retrieving relevant code snippets and resolving context across large codebases and technical documentation.

5. **Language Understanding and Analysis:**
   - General-purpose tasks such as question answering, summarization, and sentiment analysis over long text inputs.

6. **Efficient Inference with Long Contexts:**
   - Optimized for processing large inputs with minimal computational overhead, thanks to Flash Attention and RoPE.

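The embedding sketch referenced in item 3 compares mean-pooled encoder states by cosine similarity. It assumes that mean pooling over the base encoder gives a reasonable similarity signal for this checkpoint; a dedicated embedding fine-tune would normally perform better:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # base encoder, no classification head

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over real tokens
    return F.normalize(pooled, dim=-1)

docs = ["def quicksort(xs): ...", "Contract law governs enforceable agreements."]
scores = embed(["sorting algorithm in python"]) @ embed(docs).T  # cosine similarities
print(scores)  # the code snippet should score higher for this query
```
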
# **Limitations**

1. **Domain-Specific Performance:**
   - Although pre-trained on a large corpus of text and code, MBERT may require fine-tuning for niche or highly specialized domains to achieve optimal performance.

2. **Tokenization Constraints:**
   - Inputs exceeding the 8,192-token limit must be truncated or chunked intelligently to avoid losing critical information (see the chunking sketch after this list).

3. **Bias in Training Data:**
   - The pre-training data (text + code) may carry biases from its source corpora, which can lead to biased classifications or retrievals in certain contexts.

4. **Code-Specific Challenges:**
   - While MBERT supports code understanding, it may struggle with niche programming languages or highly domain-specific coding conventions without fine-tuning.

5. **Inference Costs on Resource-Constrained Devices:**
   - Processing long-context inputs remains computationally expensive, making the model less suitable for edge devices or environments with limited compute.

6. **Multilingual Support:**
   - Although optimized for English and code, MBERT may perform sub-optimally on other languages unless explicitly fine-tuned on multilingual datasets.
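
The chunking sketch referenced in limitation 2 uses a simple sliding window with majority voting over chunk-level labels. It reuses `tokenizer` and `classifier` from the inference examples above; the window and stride values are illustrative, not tuned, and the window stays below 8,192 to leave room for the special tokens the pipeline re-adds:

```python
from collections import Counter

def classify_long(text: str, window: int = 8000, stride: int = 6000) -> str:
    # Tokenize once, slide a window over the token ids, classify each
    # decoded chunk, and return the majority label across chunks.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    votes = Counter()
    for start in range(0, len(ids), stride):
        chunk = tokenizer.decode(ids[start:start + window])
        votes[classifier(chunk)[0]["label"]] += 1
        if start + window >= len(ids):
            break
    return votes.most_common(1)[0][0]
```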