Written by Ethan Smith

“Every concept, every notion, has a counterpart. Yet, every idea’s opposite is just the absence of the idea itself. Absence is the absence of presence. Presence is the absence of absence. Any perceived difference is, in fact, also a deep connection. In this way, opposing forces are unified, and the world becomes a cooperative web of interdependent pieces” - lyrics to Restraint by Haywyre

Update 02/12/25: While my methods ultimately didn’t perform so well, a performant realization of the motivations presented here appeared in this very interesting paper:

More Expressive Attention with Negative Weights

Intro

This exploration originally started as an attempt to find a linear attention method, which did not go so well… but some other things caught my eye along the way.

For years now, the ML community (including myself!) has been searching for more efficient ways to implement attention, trying to circumvent the curse of $O(n^2)$ complexity.

Non-global attention methods are at this point super powerful and criminally underrated (kernels like NATTEN have made feasible implementations, and Hourglass Diffusion as well as Mistral are great PoC’s), but the quest to find a method that linearizes attention and avoids ever constructing the QK matrix goes on.
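To make the idea concrete, here is a minimal sketch (my own illustration, not the method explored in this post) of the kernel trick behind most linear attention work: swap the softmax for a positive feature map $\phi$ so that associativity lets you compute $\phi(Q)\,(\phi(K)^\top V)$ without ever forming the $n \times n$ score matrix. The `elu + 1` feature map and the toy shapes are assumptions on my part, in the spirit of Katharopoulos et al.’s “Transformers are RNNs”.

```python
import torch

def softmax_attention(q, k, v):
    # Standard attention: materializes the full (n x n) score matrix -> O(n^2) cost.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: replace exp(q . k) with phi(q) . phi(k) for a positive
    # feature map phi, then use associativity, (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V),
    # so the n x n matrix is never formed and the cost drops to O(n * d * d_v).
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # one common positive feature map
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                        # (d, d_v) summary of keys/values
    z = q @ k.sum(dim=-2).unsqueeze(-1) + eps           # per-query normalizer, shape (n, 1)
    return (q @ kv) / z

# Toy check (shapes only; the two mechanisms give different outputs by design).
q, k, v = (torch.randn(1, 128, 64) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```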

There are several methods that are interesting but generally not competitive, except maybe on a few benchmarks that only cover part of the big picture.

If you’re only interested in the Dipole attention method, feel free to skip this part.

Primer on Linearizing Attention

Readings on Linear attention that provoked some other ideas

The Philosophy Behind Dipole Attention

As we know from the curse of dimensionality, the space of nearly orthogonal vectors grows rapidly as we increase dimensions.

Here in this 3D visualization, with the vector pointing in the positive y direction as our reference, we have one vector that achieves a perfect 1.0 cosine similarity and one vector that achieves a perfect -1.0 cosine similarity. But then we have a whole flat disk of possible vectors with 0.0 similarity.

created with https://www.math3d.org/


In 4D, the 2D disk becomes a 3D ball. The size of this subspace trails closely behind the size of the full space.

So given how spacious something like 64D space is (and that’s actually relatively low-dimensional for many neural network operations), we can see how rare it is to observe a well-sized similarity value, positive or negative!
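As a quick numerical sanity check of that claim (a sketch added for intuition; the dimensions and sample counts are arbitrary), you can sample random pairs of directions and look at how their cosine similarities spread out. For isotropic Gaussian vectors the standard deviation falls off like $1/\sqrt{d}$, so at 64D it is already only about 0.125:

```python
import numpy as np

# Rough numerical check: cosine similarity between random directions concentrates
# around 0 as dimensionality grows (std ~ 1/sqrt(d) for isotropic Gaussian vectors).
rng = np.random.default_rng(0)
for d in (3, 16, 64, 256):
    a = rng.standard_normal((100_000, d))
    b = rng.standard_normal((100_000, d))
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    print(f"d={d:4d}  std={cos.std():.3f}  frac(|cos| > 0.5)={np.mean(np.abs(cos) > 0.5):.5f}")
```

In other words, a similarity of large magnitude, positive or negative, is a several-sigma event rather than something that happens by chance.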

It’s one of those things that is so rare that I like to think of it through the lens of “it’s not a coincidence”; it’s a signal.

For a more intuitive analogy:

If someone reads off the following words to you:

“Man”, “Table”, “Rooster”

There’s really not too much to think about here.

Now if you heard the words

“Man”, “Woman”

There’s something of a pattern here.

Granted, the words mentioned aren’t necessarily latent opposites, and you’re not necessarily guaranteed nicely interpretable opposite vectors. Nonetheless, it feels worth trying.


Looking again at our activation function for attention, $\exp(x)$, we have the following properties for attention scores prior to normalization: