---
datasets:
- jondurbin/airoboros-2.1
- kmfoda/booksum
---

# Extended Context (via YaRN) Llama-2-13b with airoboros-2.1 (fp16)

## Overview

This is a finetune of [NousResearch/Yarn-Llama-2-13b-64k](https://huggingface.co/NousResearch/Yarn-Llama-2-13b-64k). The starting point is Llama-2-13b with additional pretraining in which YaRN scaling was applied to RoPE, extending the useful context length to 64k tokens. Starting from that model, I performed instruction tuning with [Jon Durbin's Airoboros 2.1 dataset](https://huggingface.co/datasets/jondurbin/airoboros-2.1), with the same scaling approach applied.

**This is a (merged) QLoRA fine-tune (rank 64).**

The finetune was performed with 1x RTX 6000 Ada.
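
For reference, here is a minimal sketch of a rank-64 QLoRA adapter configuration consistent with the description above. It is not the training config actually used; everything other than `r=64` (alpha, dropout, target modules) is an assumption.

```python
from peft import LoraConfig

# Sketch of a rank-64 QLoRA adapter config; only r=64 comes from the card,
# the remaining hyperparameters are assumptions, not the author's settings.
lora_config = LoraConfig(
    r=64,                      # rank stated in the card
    lora_alpha=16,             # assumed
    lora_dropout=0.05,         # assumed
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
```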

## How to Use

YaRN is not implemented natively in `Transformers`. The YaRN pretrained model [NousResearch/Yarn-Llama-2-13b-64k](https://huggingface.co/NousResearch/Yarn-Llama-2-13b-64k) contains a drop-in llama architecture replacement that interfaces with the included configuration file. **To maximize compatibility, I have included the version that omits flash attention.** To run using `Transformers`, you will therefore need to pass `trust_remote_code=True`.
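
For example, loading with `Transformers` might look like the following. This is a sketch, not from the original card; the repo id below is a placeholder for this model's actual Hugging Face path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute this model's actual Hugging Face path.
model_id = "bhenrym14/airoboros-l2-13b-2.1-YaRN-64k-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # required: the YaRN llama replacement ships as remote code
    device_map="auto",
    torch_dtype="auto",
)
```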

The PNTK method employed in my other model [bhenrym14/airophin-13b-pntk-16k-fp16](https://huggingface.co/bhenrym14/airophin-13b-pntk-16k-fp16) is very similar to YaRN. For GPTQ, I have an exllama patch that I may adapt for YaRN, but the community appears motivated to implement YaRN natively soon, so I may not bother.

Please comment with any questions and feedback on how this model performs, especially at long context lengths!

Oobabooga (text-generation-webui) use: Be sure to increase the `Truncate the prompt up to this length` parameter to 16384 to utilize the full context capabilities. Again, `trust_remote_code=True` is imperative.

## Motivation

Llama-2's native context window is 4096 tokens. The aim of this model is to extend the usable context well beyond that (here via YaRN-scaled RoPE) while retaining the instruction-following behavior imparted by the airoboros-2.1 dataset.

## Relative Performance (wikitext perplexity)

| Context (tokens) | **bhenrym14/airoboros-l2-13b-PI-16k-fp16** | bhenrym14/airophin-v2-13b-PI-8k-fp16 | bhenrym14/airophin-13b-pntk-16k-fp16 | bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-fp16 | bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 | jondurbin/airoboros-l2-13b-gpt4-1.4.1 |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 7.67 | 7.38 | 7.62 | 8.24 | 7.90 | **7.23** |
| 1024 | 6.15 | 5.99 | 6.20 | 6.71 | 6.17 | **5.85** |
| 2048 | 5.29 | 5.22 | 5.38 | 5.87 | 5.23 | **5.07** |
| 4096 | 4.94 | 4.90 | 5.08 | 5.50 | 4.91 | **4.77** |
| 8192 | **4.71** | **4.71** | 4.90 | 5.32 | Not Tested | 57.1 |
| 12000 | **4.54** | 55 | 4.82 | 56.1 | Not Tested | Not Tested |

- Larger PI scaling factors cause greater degradation of short-context performance. If you don't require 16k context, you're better off using a model with a different context extension method, or a smaller (or no) PI scaling factor. Given this, don't expect anything special from this model on the HF leaderboard. Whether or not this is relevant to you will depend on your intended use case.
- Beyond 8k, this model has lower perplexity than all other models tested here (a sketch of how such measurements can be made follows this list).
- I'm actively exploring/implementing other context extension methods that may ameliorate the tendency of PI methods to impair the model's ability to attend to the full context space equally.
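
The figures above are wikitext perplexities evaluated at fixed context lengths. Below is a minimal sketch of how such numbers could be reproduced; it is not the author's evaluation script, and the repo id and chunking scheme are assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute this model's actual Hugging Face path.
model_id = "bhenrym14/airoboros-l2-13b-2.1-YaRN-64k-fp16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto", torch_dtype="auto"
)
model.eval()

# Tokenize the wikitext test split as one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

def perplexity_at(context_len: int, max_chunks: int = 40) -> float:
    """Average perplexity over non-overlapping windows of `context_len` tokens."""
    nlls = []
    for start in range(0, min(len(ids), context_len * max_chunks), context_len):
        chunk = ids[start : start + context_len]
        if len(chunk) < context_len:
            break
        chunk = chunk.unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(input_ids=chunk, labels=chunk)
        nlls.append(out.loss)
    return torch.exp(torch.stack(nlls).mean()).item()

for n in (512, 2048, 8192):
    print(n, perplexity_at(n))
```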

## Prompting

Prompting differs for the airoboros 2.1 models; see [jondurbin/airoboros-l2-13b-2.1](https://huggingface.co/jondurbin/airoboros-l2-13b-2.1) for the format.
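
For illustration, a sketch of what an airoboros-2.1-style prompt might look like. The exact template is defined in the linked card; the system prompt and spacing below are assumptions and should be verified against that card.

```python
# Hypothetical prompt builder -- check the exact template against
# the jondurbin/airoboros-l2-13b-2.1 card linked above.
def build_prompt(user_message: str, system_prompt: str = "A chat.") -> str:
    # Assumed airoboros-2.1 style: short system prompt, then USER/ASSISTANT turns.
    return f"{system_prompt} USER: {user_message} ASSISTANT: "

print(build_prompt("Summarize the plot of Moby Dick in two sentences."))
```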