Update README.md
## How to Use

The easiest way is to use the GPTQ weights (linked above) with [oobabooga text-generation-webui](https://github.com/oobabooga/text-generation-webui) and ExLlama. You'll need to set `max_seq_len` to 8192 and `compress_pos_emb` to 4.
**IMPORTANT: To use these weights you'll need to patch in the appropriate RoPE scaling module. See: [replace_llama_rope_with_scaled_rope](https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/scaledllama/llama_rope_scaled_monkey_patch.py)**
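If you're loading the weights with plain `transformers` rather than the webui, a minimal sketch of applying the patch might look like the following. This is an illustration under assumptions: the import path and weight path are placeholders, and the exact signature of `replace_llama_rope_with_scaled_rope` (including whether the 4x scaling factor is hardcoded or passed in) should be checked against the linked file.

```python
# Illustrative sketch only -- the exact function signature lives in the linked
# llama_rope_scaled_monkey_patch.py; the scaling factor (4x, matching
# compress_pos_emb = 4) may be hardcoded there rather than passed as an argument.
# Assumes the repo's scaledllama/ directory is on PYTHONPATH and that
# "path/to/these-weights" is a placeholder for wherever the weights live.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope
from transformers import AutoModelForCausalLM, AutoTokenizer

# Patch transformers' LLaMA rotary embeddings *before* instantiating the model,
# so the model is built with the scaled RoPE rather than the stock one.
replace_llama_rope_with_scaled_rope()

tokenizer = AutoTokenizer.from_pretrained("path/to/these-weights")
model = AutoModelForCausalLM.from_pretrained("path/to/these-weights")
```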
## Motivation
Recent advances in extending context via RoPE scaling ([kaiokendev](https://kaiokendev.github.io/til#extending-context-to-8k) and [Meta AI](https://arxiv.org/abs/2306.15595)) demonstrate that the context window can be extended without full retraining. Finetuning has proven necessary, however, to properly leverage the longer context. The SuperHOT LoRA is an adapter finetuned on longer contexts (8192 tokens); even when applied to models trained on dissimilar datasets, it successfully extends the context window the model can attend to. While it's impressive that the adapter is so flexible, how much does performance suffer relative to a model finetuned with the scaled embeddings from the start? This experiment explores that question.
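For intuition, linear RoPE scaling (position interpolation) simply divides the position indices by a scale factor before computing the rotary angles, so an 8192-token input occupies the same positional range (0–2047) the base model saw during pretraining. A minimal illustrative sketch of the idea, not the implementation used here:

```python
import torch

def rope_angles(seq_len: int, dim: int, base: float = 10000.0, scale: float = 1.0):
    # Standard RoPE inverse frequencies, one per pair of hidden dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Position interpolation: dividing positions by `scale` compresses them,
    # e.g. scale=4 maps positions 0..8191 into the pretrained 0..2047 range.
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)  # (seq_len, dim // 2) rotation angles

# scale=4 corresponds to compress_pos_emb = 4 in the webui settings above.
angles_8k = rope_angles(8192, 128, scale=4.0)
```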