arxiv:2311.00684

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

Published on Nov 1, 2023

Authors:

Abstract

An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate the issue, we propose two attention alignment strategies via temperature scaling. Our results show improved long-context utilization for T5 on language modeling, retrieval, multi-document question answering, and code completion tasks, all without any fine-tuning. This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation.
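
The dispersed attention issue and the effect of temperature scaling can be illustrated with a toy softmax. The sketch below is not the paper's method: the logits, the sequence lengths, the temperature value of 0.5, and the helper names (`softmax`, `entropy`, `toy_logits`) are all assumptions chosen for illustration. It only shows that, with fixed logits, the softmax flattens as the number of keys grows, and that dividing the logits by a temperature below 1 sharpens it again.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a temperature below 1 sharpens the output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_logits(length):
    # Hypothetical attention logits: one relevant key scores 4.0,
    # every other key scores 0.0. Real logits are not this simple.
    return [4.0] + [0.0] * (length - 1)


# Same logits, different sequence lengths: the relevant key's weight shrinks.
short = softmax(toy_logits(16))
long_plain = softmax(toy_logits(1024))

# Illustrative temperature (an assumption, not a value from the paper).
long_sharp = softmax(toy_logits(1024), temperature=0.5)

print(f"weight on relevant key, length 16:          {short[0]:.3f}")
print(f"weight on relevant key, length 1024:        {long_plain[0]:.3f}")
print(f"weight on relevant key, length 1024, T=0.5: {long_sharp[0]:.3f}")
print(f"entropy, length 1024:        {entropy(long_plain):.3f}")
print(f"entropy, length 1024, T=0.5: {entropy(long_sharp):.3f}")
```

In this toy setting the temperature is fixed by hand; how the two proposed alignment strategies actually choose it is described in the paper itself and is not reproduced here.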
