Ihor committed
Commit 56e3372 (verified)
1 Parent(s): b05e92e

Update README.md

Files changed (1):
  1. README.md +126 -50
README.md CHANGED
@@ -19,6 +19,17 @@ tags:
 - **Repository:** https://github.com/McGill-NLP/llm2vec
 - **Paper:** https://arxiv.org/abs/2404.05961

 ## Installation
 ```bash
@@ -27,65 +38,130 @@ pip install llm2vec

 ## Usage
 ```python
- from llm2vec import LLM2Vec

 import torch
- from transformers import AutoTokenizer, AutoModel, AutoConfig
- from peft import PeftModel

 # Loading base Mistral model, along with custom code that enables bidirectional connections in decoder-only LLMs. MNTP LoRA weights are merged into the base model.
 tokenizer = AutoTokenizer.from_pretrained(
-     "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp"
 )
- config = AutoConfig.from_pretrained(
-     "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp", trust_remote_code=True
- )
- model = AutoModel.from_pretrained(
-     "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
-     trust_remote_code=True,
-     config=config,
-     torch_dtype=torch.bfloat16,
-     device_map="cuda" if torch.cuda.is_available() else "cpu",
 )
- model = PeftModel.from_pretrained(
-     model,
-     "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
 )
- model = model.merge_and_unload() # This can take several minutes on cpu

- # Loading supervised model. This loads the trained LoRA weights on top of MNTP model. Hence the final weights are -- Base model + MNTP (LoRA) + supervised (LoRA).
- model = PeftModel.from_pretrained(
-     model, "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-supervised"
 )

- # Wrapper for encoding and pooling operations
- l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

- # Encoding queries using instructions
- instruction = (
-     "Given a web search query, retrieve relevant passages that answer the query:"
- )
- queries = [
-     [instruction, "how much protein should a female eat"],
-     [instruction, "summit define"],
- ]
- q_reps = l2v.encode(queries)
-
- # Encoding documents. Instruction are not required for documents
- documents = [
-     "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
-     "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
- ]
- d_reps = l2v.encode(documents)
-
- # Compute cosine similarity
- q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
- d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
- cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
-
- print(cos_sim)
- """
- tensor([[0.6500, 0.1291],
-         [0.0916, 0.4733]])
- """
- ```
 
 - **Repository:** https://github.com/McGill-NLP/llm2vec
 - **Paper:** https://arxiv.org/abs/2404.05961

+ ## Overview
+ This is a bi-directional version of Sheared-LLaMA-1.3B trained with masked token prediction on the Wikipedia dataset. Modern decoder models offer several advantages over classical encoders like BERT:
+
+ - They are pre-trained on more recent textual corpora
+ - They are trained on larger and more diverse datasets
+ - Modern decoders have better support for long-context windows
+ - Flash-attention support is available for these models
+
+ Considering these benefits, we are excited to release a series of decoder models tuned to work in a bi-directional setting. This approach combines the strengths of modern decoder architectures with the versatility of bi-directional context understanding, potentially opening up new possibilities for natural language processing tasks such as named entity recognition (NER).
+
+ In comparison to the original LLM2Vec, we trained all weights of the LLaMA model, which potentially improves the model's bi-directional abilities.

 ## Installation
 ```bash
 
 ## Usage
 ```python
+ from llm2vec.models import LlamaBiModel

 import torch
+ from transformers import AutoTokenizer

+ # Load the tokenizer and the bi-directional Sheared-LLaMA encoder.
 tokenizer = AutoTokenizer.from_pretrained(
+     "knowledgator/Sheared-LLaMA-encoder-1.3B"
 )
+
+ model = LlamaBiModel.from_pretrained("knowledgator/Sheared-LLaMA-encoder-1.3B")
+
+ text = "The quick brown fox jumps over the lazy dog."
+
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
+
+ # Move the model and inputs to GPU if one is available.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device)
+ inputs = {k: v.to(device) for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Token-level embeddings with shape (batch_size, sequence_length, hidden_size)
+ last_hidden_states = outputs.last_hidden_state
+ ```
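The `last_hidden_states` tensor holds one vector per token. If a single sentence-level embedding is needed, one common option is mean pooling over the non-padding tokens; the snippet below is a minimal sketch of that approach (the pooling strategy is a suggestion, not something fixed by the model card) and reuses `inputs` and `last_hidden_states` from the example above.

```python
import torch

# Mean pooling over non-padding tokens (a common, but not the only, choice).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (last_hidden_states * mask).sum(dim=1)         # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)                # (batch, 1)
sentence_embedding = summed / counts

# L2-normalize if the embeddings will be compared with cosine similarity.
sentence_embedding = torch.nn.functional.normalize(sentence_embedding, p=2, dim=1)
```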
+
+ ## Adapting for Different Discriminative Tasks
+
+ Our bi-directional LLaMA model can be easily adapted for various discriminative tasks such as text classification, question answering, and token classification.
+ To use these specialized versions, we provide a [fork of LLM2Vec](https://github.com/Knowledgator/llm2vec) with additional functionality.
+
+ ### Installation
+
+ To get started, clone our fork of LLM2Vec and install it:
+
+ ```bash
+ git clone https://github.com/Knowledgator/llm2vec.git
+ cd llm2vec
+ pip install -e .
+ ```
+
+ Using the `-e` flag installs the package in editable mode, which is useful for development.
+
+ ### Usage
+
+ Here's how to import and use the models for different tasks:
+
+ ```python
+ from llm2vec import (
+     AutoLLMEncoderForSequenceClassification,
+     AutoLLMEncoderForQuestionAnswering,
+     AutoLLMEncoderForTokenClassification
 )
+
+ # Load models for different tasks
+ classification_model = AutoLLMEncoderForSequenceClassification.from_pretrained('knowledgator/Sheared-LLaMA-encoder-1.3B')
+ question_answering_model = AutoLLMEncoderForQuestionAnswering.from_pretrained('knowledgator/Sheared-LLaMA-encoder-1.3B')
+ token_classification_model = AutoLLMEncoderForTokenClassification.from_pretrained('knowledgator/Sheared-LLaMA-encoder-1.3B')
+ ```
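These classes load the shared encoder weights, but the task-specific heads will typically be newly initialized, so expect to fine-tune them before relying on the predictions (see the fine-tuning section below). As an illustration only, a token-classification call could look like the sketch below; it assumes the fork follows the usual `transformers`-style interface where the output carries a per-token `logits` tensor.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('knowledgator/Sheared-LLaMA-encoder-1.3B')

# Token classification (e.g. NER-style tagging). The head is newly initialized
# here, so the predictions only become meaningful after fine-tuning.
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

outputs = token_classification_model(**inputs)
token_logits = outputs.logits  # assumed shape: (batch_size, sequence_length, num_labels)
```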
+
+ ### Example: Text Classification
+
+ Here's a basic example of how to use the model for text classification:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained('knowledgator/Sheared-LLaMA-encoder-1.3B')
+
+ # Prepare input
+ text = "This movie is great!"
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Get classification logits
+ outputs = classification_model(**inputs)
+ logits = outputs.logits
+
+ # The logits can be used with a softmax function to get probabilities
+ # or you can use torch.argmax(logits, dim=1) to get the predicted class
+ ```
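Following the comments above, converting the logits into probabilities or a predicted class id looks like this (the class ids only become meaningful once the classification head has been fine-tuned):

```python
import torch

# Probabilities over classes and the most likely class id for each input.
probabilities = torch.softmax(logits, dim=1)
predicted_class = torch.argmax(logits, dim=1)

print(probabilities)
print(predicted_class)
```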
+
+ ### Fine-tuning
+
+ To fine-tune these models on your specific task:
+
+ 1. Prepare your dataset in a format compatible with Hugging Face's `datasets` library.
+ 2. Use the `Trainer` class from Hugging Face's `transformers` library to fine-tune the model.
+
+ Here's a basic example:
+
+ ```python
+ from transformers import Trainer, TrainingArguments
+ from datasets import load_dataset
+
+ # Load your dataset
+ dataset = load_dataset("your_dataset")
+
+ # Tokenize the dataset (this assumes "text" and "label" columns; adjust the
+ # column names to match your data)
+ def tokenize(batch):
+     return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)
+
+ tokenized_dataset = dataset.map(tokenize, batched=True)
+
+ # Define training arguments
+ training_args = TrainingArguments(
+     output_dir="./results",
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     warmup_steps=500,
+     weight_decay=0.01,
+     logging_dir="./logs",
 )

+ # Initialize Trainer
+ trainer = Trainer(
+     model=classification_model,
+     args=training_args,
+     train_dataset=tokenized_dataset["train"],
+     eval_dataset=tokenized_dataset["test"],
 )

+ # Fine-tune the model
+ trainer.train()
+ ```
+
164
+ ### Contributing
165
+
166
+ We welcome contributions! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request on our [GitHub repository](https://github.com/Knowledgator/llm2vec).
167