Text Generation · PEFT · Safetensors · mistral · conversational · Eval Results
dfurman committed · Commit b80a593 · 1 Parent(s): 77624a7

Update README.md

Files changed (1)
  1. README.md +45 -86
README.md CHANGED
@@ -14,7 +14,7 @@ base_model: meta-llama/Llama-2-13b-hf

This instruction model was built via parameter-efficient QLoRA finetuning of [llama-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf) on the first 100k rows of [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) (an open-source implementation of [Microsoft's Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/)). Finetuning was executed on a single A6000 (48 GB) for roughly 18 hours on the [Lambda Labs](https://cloud.lambdalabs.com/instances) platform.

- ### Benchmark metrics
+ ## Benchmark metrics

| Metric                | Value |
|-----------------------|-------|
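
The card above describes parameter-efficient QLoRA finetuning of the 13B base model on the dolphin rows. As a rough illustration of that recipe, the minimal sketch below loads the base model in 4-bit and attaches LoRA adapters with `peft`; the rank, alpha, target modules, and other hyperparameters are assumptions for illustration, not the values used to train this checkpoint.

```python
# Minimal QLoRA setup sketch (hyperparameters are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "meta-llama/Llama-2-13b-hf"

# Load the frozen base model in 4-bit (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=True,
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable low-rank adapters (the "LoRA" in QLoRA).
lora_config = LoraConfig(
    r=16,  # assumed rank
    lora_alpha=32,  # assumed scaling
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training would then run over the formatted instruction rows with a standard `transformers` Trainer or an SFT-style trainer, producing an adapter like the one in this repo.
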
@@ -26,7 +26,7 @@ This instruction model was built via parameter-efficient QLoRA finetuning of [ll

We use state-of-the-art [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, using the same version as Hugging Face's [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

- ### Helpful Links
+ ## Helpful Links

* Model license: Llama 2 Community License Agreement
* Basic usage: [notebook](assets/basic_inference_llama_2_13b_dolphin.ipynb)
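
For reference, scores like those in the table above come from pointing the harness at a checkpoint and a task list. The notebook-style sketch below is an assumption about the harness's older `main.py` CLI (the entry point, model type, and task names vary by harness version), not the exact leaderboard invocation.

```python
# Hedged sketch: clone and install the harness, then run a single task.
!git clone -q https://github.com/EleutherAI/lm-evaluation-harness
%cd lm-evaluation-harness
!pip install -q -e .

# Example: 10-shot HellaSwag; swap `pretrained=` for the finetuned weights to reproduce the card's numbers.
!python main.py --model hf-causal --model_args pretrained=meta-llama/Llama-2-13b-hf --tasks hellaswag --num_fewshot 10 --device cuda:0
```
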
@@ -34,7 +34,7 @@ We use state-of-the-art [Language Model Evaluation Harness](https://github.com/E
* Loss curves: [plot](https://huggingface.co/dfurman/llama-2-13b-dolphin-peft#finetuning-description)
* Runtime stats: [table](https://huggingface.co/dfurman/llama-2-13b-dolphin-peft#runtime-tests)

- ### Example prompts and responses
+ ## Example prompts and responses

Example 1:

@@ -117,117 +117,76 @@ While great efforts have been taken to clean the pretraining data, it is possibl

## How to Use

- Basic usage: [notebook](assets/basic_inference_llama_2_13b_dolphin.ipynb)
-
- Install and import the package dependencies:
+ * [notebook](assets/basic_inference_llama_2_dolphin.ipynb)

```python
!pip install -q -U huggingface_hub peft transformers torch accelerate
```

```python
+ from huggingface_hub import notebook_login
import torch
from peft import PeftModel, PeftConfig
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- ```
-
- Sign into a HF account with access to Llama-2:
+ from transformers import (
+     AutoModelForCausalLM,
+     AutoTokenizer,
+     BitsAndBytesConfig,
+     pipeline,
+ )

- ```python
- from huggingface_hub import notebook_login
notebook_login()
```

- Basic model loading:
-
```python
peft_model_id = "dfurman/llama-2-13b-dolphin-peft"
config = PeftConfig.from_pretrained(peft_model_id)

- tokenizer = AutoTokenizer.from_pretrained(
-     config.base_model_name_or_path,
-     use_auth_token=True
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
)
- tokenizer.pad_token = tokenizer.eos_token
+
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
+     quantization_config=bnb_config,
    use_auth_token=True,
+     device_map="auto",
)

- # Load the Lora model
- model = PeftModel.from_pretrained(model, peft_model_id)
- ```
+ tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_fast=True)
+ tokenizer.pad_token = tokenizer.eos_token

- Once loaded, the model and tokenizer can be used with the following code:
+ model = PeftModel.from_pretrained(model, peft_model_id)

- ```python
- def llama_generate(
-     model: AutoModelForCausalLM,
-     tokenizer: AutoTokenizer,
-     prompt: str,
-     max_new_tokens: int = 128,
-     temperature: float = 0.92,
- ) -> str:
-     """
-     Initialize the pipeline
-     Uses Hugging Face GenerationConfig defaults
-     https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
-     Args:
-         model (transformers.AutoModelForCausalLM): Falcon model for text generation
-         tokenizer (transformers.AutoTokenizer): Tokenizer for model
-         prompt (str): Prompt for text generation
-         max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
-         temperature (float, optional): The value used to modulate the next token probabilities.
-             Defaults to 1.0
-     """
-     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-     inputs = tokenizer(
-         [prompt],
-         return_tensors="pt",
-         return_token_type_ids=False,
-     ).to(
-         device
-     )  # tokenize inputs, load on device
-
-     # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
-     with torch.autocast("cuda", dtype=torch.bfloat16):
-         response = model.generate(
-             **inputs,
-             max_new_tokens=max_new_tokens,
-             temperature=temperature,
-             return_dict_in_generate=True,
-             eos_token_id=tokenizer.eos_token_id,
-             pad_token_id=tokenizer.pad_token_id,
-         )
-
-     decoded_output = tokenizer.decode(
-         response["sequences"][0],
-         skip_special_tokens=True,
-     )  # grab output in natural language
-
-     return decoded_output[len(prompt) :]  # remove prompt from output
+ format_template = "You are a helpful assistant. {query}\n"
```

- We can now generate text! For example:
-
```python
- prompt = "### Human: Write me a numbered list of things to do in New York City.### Assistant: "
-
- response = llama_generate(
-     model,
-     tokenizer,
-     prompt,
-     max_new_tokens=250,
-     temperature=0.92,
- )
-
- print(response)
+ # First, format the prompt
+ query = "Tell me a recipe for vegan banana bread."
+ prompt = format_template.format(query=query)
+
+ # Inference can be done using model.generate
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
+ with torch.autocast("cuda", dtype=torch.bfloat16):
+     output = model.generate(
+         input_ids=input_ids,
+         max_new_tokens=512,
+         do_sample=True,
+         temperature=0.7,
+         return_dict_in_generate=True,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.pad_token_id,
+         repetition_penalty=1.2,
+     )
+
+ print(tokenizer.decode(output["sequences"][0], skip_special_tokens=True))
```

- ### Runtime tests
+ ## Runtime tests


| runtime / 50 tokens (sec) | GPU | attn | torch dtype | VRAM (GB) |
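
The updated loading code imports `pipeline` from `transformers` but never calls it. As a minimal sketch, the loaded model and tokenizer could also be wrapped in a text-generation pipeline; the generation settings below are illustrative, and some `peft`/`transformers` version combinations will warn that `PeftModelForCausalLM` is not an officially supported pipeline model before generating anyway.

```python
# Hypothetical follow-on to the loading code above: wrap the PEFT model in a pipeline.
from transformers import pipeline

generate_text = pipeline(
    task="text-generation",
    model=model,  # the PeftModel loaded above
    tokenizer=tokenizer,
)

prompt = format_template.format(query="Tell me a recipe for vegan banana bread.")
out = generate_text(
    prompt,
    max_new_tokens=256,  # illustrative settings
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,
)
print(out[0]["generated_text"])
```
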
@@ -260,7 +219,7 @@ The license on this model does not constitute legal advice. We are not responsib

---

- ### Framework versions
+ ## Framework versions


- PEFT 0.5.0.dev0