
An Evolved Universal Transformer Memory

I stumbled over this very interesting article on HN about a new kind of context memory system that is able to remove "unhelpful or redundant details" from the context.

Thinking about it further, I believe this would be super helpful for semantic search, which currently doesn't perform well because there are no filters that extract what is actually important. Until now I have tried to counter this problem with summarization through small LLMs, but as one might guess, that turns out to be neither very precise nor cheap. There are other ideas for post-processing text with LLMs, but they aren't very efficient either.
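
To make that workaround concrete, here is a minimal sketch of the summarize-then-embed idea; `small_llm_summarize` and `embed` are hypothetical placeholders for whatever small model and embedding model you plug in, not calls from any real library.

```python
from typing import Callable

def build_index(
    chunks: list[str],
    small_llm_summarize: Callable[[str], str],  # hypothetical: any small LLM call
    embed: Callable[[str], list[float]],        # hypothetical: any embedding model
) -> list[tuple[str, list[float]]]:
    """Summarize each chunk to strip unimportant details, then embed the summary.

    This is the expensive and lossy step that a learned importance filter
    (NAMM-style token pruning) could replace: instead of paying for an LLM
    pass per chunk, a small scorer would decide which tokens matter.
    """
    index = []
    for chunk in chunks:
        summary = small_llm_summarize(chunk)   # costly and imprecise in practice
        index.append((chunk, embed(summary)))  # store original text, search on the summary
    return index
```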

Paper: https://arxiv.org/abs/2410.13166

TLDR by GPT-4o

(the article is very good, you might prefer to read it over this TLDR, but here it is anyway)

Sakana AI introduces Neural Attention Memory Models (NAMMs), a novel memory system for transformers inspired by human selective memory. NAMMs optimize how transformers store and retrieve information, enabling them to “remember” important tokens and “forget” redundant ones, significantly improving efficiency and performance, particularly for long-context tasks.

Key Technical Highlights:

  1. Evolutionary Optimization: NAMMs use evolutionary algorithms to train a neural classifier that decides which tokens to keep or discard, bypassing the non-differentiability of that decision (see the first sketch after this list).
  2. Execution Steps (see the second sketch after this list):
    • Convert attention sequences into spectrograms.
    • Compress data using an exponential moving average (EMA).
    • Use a classifier to score and selectively prune tokens.
  3. Generalization:
    • NAMMs can zero-shot transfer to other transformers (e.g., vision, reinforcement learning) without retraining.
    • They adapt to tasks differently, retaining global information in early layers and focusing on local details in later ones.
  4. Performance: Tested on LongBench, InfiniteBench, and their Japanese benchmark ChouBun, NAMMs reduce memory usage and outperform prior hand-designed memory strategies (H₂O, L₂).
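
On point 1: because the keep/discard decision is discrete, you cannot backpropagate through it, so the scorer's weights are evolved instead. Below is a minimal sketch of such an outer loop with CMA-ES from the `cma` package; the parameter count and the fitness function are stand-ins, the real fitness would be the frozen transformer's task performance with the pruned KV cache.

```python
import cma
import numpy as np

def evaluate_fitness(weights: np.ndarray) -> float:
    """Stand-in cost to minimize. In the real setup this would be the (negative)
    benchmark score of the frozen transformer when a NAMM with these weights
    prunes its KV cache."""
    return float(np.sum(weights ** 2))  # placeholder objective

n_params = 64  # assumed size of the tiny scorer network, not from the paper
es = cma.CMAEvolutionStrategy(n_params * [0.0], 0.5)
while not es.stop():
    candidates = es.ask()                                    # sample a population of weight vectors
    fitnesses = [evaluate_fitness(np.asarray(c)) for c in candidates]
    es.tell(candidates, fitnesses)                           # update the search distribution; no gradients needed
```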
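
On point 2, here is how I picture the three execution steps in a rough NumPy sketch; the window size, EMA decay, and the random linear scorer are illustrative assumptions, not the paper's actual values or architecture.

```python
import numpy as np

def namm_style_scores(attn: np.ndarray, win: int = 32, gamma: float = 0.95) -> np.ndarray:
    """Rough sketch of the three execution steps (illustrative, not the exact recipe).

    attn: (n_tokens, n_queries) attention each cached token received from recent queries.
    Returns one score per token; tokens scoring below zero would be evicted.
    """
    rng = np.random.default_rng(0)
    n_tokens, n_queries = attn.shape
    window = np.hanning(win)
    hop = win // 2

    # 1) Spectrogram: windowed FFT of each token's attention values over time.
    frames = [attn[:, s:s + win] * window for s in range(0, n_queries - win + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames, axis=1), axis=-1))   # (tokens, frames, freqs)

    # 2) Compress the frames with an exponential moving average (EMA).
    ema = spec[:, 0]
    for t in range(1, spec.shape[1]):
        ema = gamma * ema + (1.0 - gamma) * spec[:, t]              # (tokens, freqs)

    # 3) Score each token with a small classifier (here: a random linear stand-in
    #    for the evolved network) and keep only non-negative scores.
    w = rng.standard_normal(ema.shape[1])
    return ema @ w

# Example: 100 cached tokens, 256 recent queries of synthetic attention values.
scores = namm_style_scores(np.random.rand(100, 256))
keep_mask = scores >= 0.0
```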

NAMMs demonstrate cross-domain mastery and efficiency gains across diverse tasks and input modalities, paving the way for future research into learning transformers directly atop evolved memory systems.
