Thinking Outside the Attention Box: Introducing Gated Associative Memory (GAM)

Community Article Published September 3, 2025

Today, I’m excited to share that our paper, "Gated Associative Memory (GAM): A Parallel O(N) Architecture for Efficient Sequence Modeling," is now available on arXiv!

Paper: https://arxiv.org/abs/2509.00605
Code: https://github.com/rishiraj/gam

This blog post is for you if you're curious about the core ideas. We're going to skip the dense tables and charts (they're all in the paper!) and instead take a little journey to understand the intuition behind GAM.

Let's start with the king of the hill: the Transformer.

The Transformer architecture, and specifically its self-attention mechanism, is incredible. It works by letting every token in a sequence look at every other token. This all-to-all comparison is what gives it such a rich understanding of context, but it's also its Achilles' heel. This process takes $O(N^2)$ time, where N is the length of your sequence. Double the sequence length, and you quadruple the computation and memory. This is the bottleneck that makes processing long documents, high-res images, or long audio clips so painfully expensive.
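
To make that cost concrete, here's roughly what vanilla (single-head, unmasked) self-attention boils down to in PyTorch. I'm leaving out multi-head projections and the causal mask to keep it short; the point is the line that materializes the (N, N) score matrix:

```python
import torch
import torch.nn.functional as F

def naive_self_attention(x, w_q, w_k, w_v):
    # x: (batch, N, d); w_*: (d, d) projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, N, N) -- the O(N^2) part
    return F.softmax(scores, dim=-1) @ v                     # (batch, N, d)
```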

A lot of great research has tried to fix this. Some methods approximate the attention matrix to make it faster (like Linformer or Performers), while others have brought back a form of recurrence in a very modern, parallel way (like Mamba).

With GAM, we asked a slightly different question: What if we didn't try to fix attention, but instead, replaced it with something built from the ground up to be parallel and linear-time?

Let’s Brainstorm: How Do We Understand a Sentence?

Think about how you read this sentence. Your brain is doing (at least) two things at once:

  1. Looking at nearby words: To understand the word "words" in this sentence, you’re looking at "nearby" right before it. This is the local context. It’s all about grammar, syntax, and word order. It’s a very small, local window of information.
  2. Remembering the big picture: You also remember that this blog post is about a new model called "GAM" and the "Transformer." This is the global context. It’s not about word order, but about the core concepts and themes floating around in the document.

Self-attention tries to do both of these jobs with one hammer: a massive pairwise comparison. GAM’s core idea is to give each job to a specialized tool.

The GAM Block: A Team of Two Specialists

Instead of a self-attention layer, GAM uses a block with two parallel pathways that process the input simultaneously.

1. The Local Expert: A Causal Convolution

To handle the local context, we use a simple 1D causal convolution. You can think of it as a super-efficient sliding window that looks at the last few tokens to figure out what’s going on right here.

  • Its Job: Capture word order, n-grams, and local syntax.
  • Why it’s great: Convolutions are massively parallelizable on GPUs and have a complexity of $O(N)$. They are incredibly fast and efficient at their one job.
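
Here's a minimal PyTorch sketch of that local pathway. The kernel size of 4 is just an illustrative choice on my part; the actual layer in the repo may be configured differently:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Sliding window over the past: each position sees itself plus the
    previous kernel_size - 1 tokens, never the future."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, x):                    # x: (batch, N, dim)
        x = x.transpose(1, 2)                # Conv1d expects (batch, dim, N)
        x = F.pad(x, (self.left_pad, 0))     # pad on the left only -> causality
        return self.conv(x).transpose(1, 2)  # back to (batch, N, dim), length unchanged
```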

2. The Global Librarian: An Associative Memory

This is where things get interesting. To handle global context, we give the model a learnable Memory Bank. This is just a matrix of vectors, where each vector (or "memory slot") learns to represent a useful, high-level concept or pattern from the data.

The process is simple and, importantly, fully parallel for all tokens:

  • Each token in your input sequence asks the memory bank: "Which of your memory slots are most relevant to me?"
  • It computes a similarity score between itself and every memory slot.
  • It uses these scores to retrieve a weighted mix of the most relevant "memories."

This gives each token a summary of the global concepts it relates to. Because every token does this lookup independently, the whole operation is just a couple of efficient matrix multiplications. Its complexity is $O(N)$, not $O(N^2)$, because we're not comparing tokens to each other, but rather each token to a fixed-size memory bank.
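
In code, the lookup could look something like this. The number of slots (64 here), the query projection, and the softmax normalization are my assumptions for the sketch, not necessarily the exact choices in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociativeMemory(nn.Module):
    """Each token queries a fixed, learnable bank of num_slots memory vectors
    and retrieves a weighted mix of them: an (N x M) lookup, not an (N x N) one."""
    def __init__(self, dim, num_slots=64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # (M, dim)
        self.query = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, N, dim)
        q = self.query(x)                                     # (batch, N, dim)
        scores = q @ self.memory.t() / (q.shape[-1] ** 0.5)   # (batch, N, M)
        weights = F.softmax(scores, dim=-1)                   # how relevant each slot is
        return weights @ self.memory                          # (batch, N, dim)
```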

The Manager: A Gate to Fuse Them Together

So now we have two streams of information: the local context from the convolution and the global context from the memory bank. How do we combine them?

We don't just add them up. We use a gating mechanism.

This is a tiny neural network that, for each token, produces a "knob" that decides how much of the local context and how much of the global context to let through.

For a function word like "the," the gate might learn to crank up the volume on the local convolution. For a content word like "Transformer," it might decide the global memory is more important. This dynamic fusion lets the model intelligently combine the best of both worlds for every single token.
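
One simple way to implement that gate is a sigmoid over the concatenated streams, mixing them convexly. Treat this as a sketch; the exact parameterization in the paper may differ:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-token, per-channel knob deciding how much local vs. global context gets through."""
    def __init__(self, dim):
        super().__init__()
        self.to_gate = nn.Linear(2 * dim, dim)

    def forward(self, local, global_ctx):            # both (batch, N, dim)
        g = torch.sigmoid(self.to_gate(torch.cat([local, global_ctx], dim=-1)))
        return g * local + (1.0 - g) * global_ctx    # g near 1 favors local, near 0 favors global
```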

The Takeaway?

By breaking down context modeling into these two specialized, parallel pathways, GAM achieves linear-time complexity without any recurrence.
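
Reusing the three sketches above, a toy version of the whole block could be wired up like this. The residual connection and the absence of normalization or a feed-forward layer are my simplifications, not necessarily the paper's layout:

```python
class ToyGAMBlock(nn.Module):
    """Local conv and memory lookup run in parallel; the gate fuses them. Every step is O(N)."""
    def __init__(self, dim, kernel_size=4, num_slots=64):
        super().__init__()
        self.local_path = CausalConv1d(dim, kernel_size)      # the local expert
        self.global_path = AssociativeMemory(dim, num_slots)  # the global librarian
        self.fusion = GatedFusion(dim)                        # the manager

    def forward(self, x):                                     # x: (batch, N, dim)
        fused = self.fusion(self.local_path(x), self.global_path(x))
        return x + fused                                      # residual connection (assumed)

# Quick shape check:
# x = torch.randn(2, 256, 128)
# ToyGAMBlock(128)(x).shape  # -> torch.Size([2, 256, 128])
```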

So, does it work?

Our experiments on WikiText-2 and TinyStories show that GAM not only trains consistently faster than a standard Transformer (and even Mamba in our WikiText-2 benchmark), but also achieves better or competitive final perplexity.

This suggests that the quadratic scaling of self-attention might not be a necessary evil for strong performance. There are other, more efficient ways to build powerful sequence models.

We're just getting started, and we're excited to see how GAM performs on much longer sequences and larger datasets.

If this sounds interesting to you, I’d love for you to check out the paper for the full details, including the scaling benchmarks, ablation studies, and training specifics. And if you want to play with the code yourself, it's all on GitHub.

Thanks for reading, and I look forward to hearing what you think.
