Written by Ethan Smith

“Every concept, every notion, has a counterpart. Yet, every idea’s opposite is just the absence of the idea itself. Absence is the absence of presence. Presence is the absence of absence. Any perceived difference is, in fact, also a deep connection. In this way, opposing forces are unified, and the world becomes a cooperative web of interdependent pieces” - lyrics to Restraint by Haywyre

Table of Contents

I’ve put the code I used for my experiments up here: https://github.com/ethansmith2000/DipoleAttention. Please feel free to submit a PR if you run your own experiments.

Intro

This exploration originally started as an attempt to find a linear attention method, which did not go so well… but some other things caught my eye along the way.

For years now, the ML community (including myself!) has been searching for more efficient ways to implement attention, trying to circumvent the curse of $O(n^2)$ complexity.

Non-global attention methods are at this point super powerful and criminally underrated (kernels like NATTEN have made efficient implementations feasible, and Hourglass Diffusion as well as Mistral are great PoCs), but the quest for a method that linearizes attention and avoids ever constructing the QK matrix goes on.
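To make "never constructing the QK matrix" concrete, here is a minimal sketch (not the method from this post, and not any specific paper's implementation) of the standard kernel-trick formulation of linear attention: with a feature map phi, softmax(QK^T)V is approximated by phi(Q) @ (phi(K)^T @ V), so the attention computation contracts K with V first and never materializes the n × n score matrix. The phi choice below (elu + 1) is just one common assumption.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # O(n^2) in sequence length: builds the full (n, n) score matrix.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def linear_attention(q, k, v, phi=lambda x: F.elu(x) + 1):
    # O(n * d^2): associativity lets us contract K with V first,
    # producing a (d, d) summary instead of an (n, n) matrix.
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                                     # (d, d)
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # (n, 1)
    return (q @ kv) / (normalizer + 1e-6)

if __name__ == "__main__":
    q, k, v = (torch.randn(1, 128, 64) for _ in range(3))
    print(naive_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

The two functions return different values (linear attention is an approximation), but the shapes match and the memory footprint of the second never grows quadratically with sequence length.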

There are several interesting methods, but they are generally not competitive outside of a few benchmarks that only cover part of the big picture.

If you’re only interested in the Dipole attention method, feel free to collapse and skip this part.

Primer on Linearizing Attention

Readings on Linear attention that provoked some other ideas

The Philosophy Behind Dipole Attention