Written by Ethan Smith
“Every concept, every notion, has a counterpart. Yet, every idea’s opposite is just the absence of the idea itself. Absence is the absence of presence. Presence is the absence of absence. Any perceived difference is, in fact, also a deep connection. In this way, opposing forces are unified, and the world becomes a cooperative web of interdependent pieces” - lyrics to Restraint by Haywyre
Update 02/12/25: While my methods ultimately didn’t perform so well, a performant implementation of the motivations presented here appeared in this very interesting paper:
More Expressive Attention with Negative Weights
This exploration started originally as an attempt to find a linear attention method, which did not go so well… but some other things caught my eye.
For years now, the ML community (including myself!) has been searching for more efficient ways to implement attention, trying to circumvent the curse of $O(n^2)$ complexity.
Non-global attention methods are at this point super powerful and criminally underrated (kernels like NATTEN have made them feasible to implement, and Hourglass Diffusion as well as Mistral are great PoCs), but the quest to find a method that linearizes attention and avoids ever constructing the full QK matrix goes on.
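For context, here’s a minimal sketch of the kernel-feature-map trick that most linear attention methods build on (in the spirit of Katharopoulos et al.’s “Transformers are RNNs”), not the method from the paper linked above: replace the softmax with a feature map $\phi$, then use associativity to compute $\phi(Q)\,(\phi(K)^\top V)$ so the $n \times n$ score matrix is never materialized. The `elu + 1` feature map and the function names here are illustrative choices on my part.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # One common feature map choice (elu(x) + 1); keeps values positive.
    return F.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, dim); v: (batch, seq, dim_v)
    # Standard attention computes softmax(q @ k.T) @ v, which requires a
    # (seq x seq) score matrix. Using associativity instead:
    #   phi(q) @ (phi(k).T @ v)  -- memory and compute linear in seq length.
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                    # sum_n phi(k_n) v_n^T
    norm = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps   # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / norm.unsqueeze(-1)
```

Calling this with, say, `q = k = v = torch.randn(1, 4096, 64)` runs without ever allocating a 4096×4096 matrix, which is the whole appeal.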
There are several methods along these lines that are interesting, but they generally aren’t competitive except maybe on a few benchmarks that only cover part of the big picture.
If you’re only interested in the Dipole attention method, feel free to skip this part.
As we know from the curse of dimensionality, the space of (nearly) orthogonal vectors grows rapidly as we increase dimensions.
Here in this 3D visualization, with the vector pointing in the positive y direction as our reference, we have one vector that achieves a perfect 1.0 cosine similarity and one vector that achieves a perfect -1.0 cosine similarity. But then we have a whole flat disk of possible vectors with 0.0 similarity.
created with https://www.math3d.org/
In 4D, the 2D disk becomes a 3D ball. The dimension of this orthogonal subspace is only one less than that of the full space, so it makes up nearly all of it.
So given how spacious something like 64D space is (and that’s actually relatively low-dimensional for many neural network operations), we can see how rare it is to get a sizable similarity value, positive or negative!
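To put a rough number on that, here’s a quick sanity check of my own (an illustrative snippet, not from any paper): sample random unit vectors in 64D and look at their cosine similarities. They pile up around zero, and a large magnitude in either direction is genuinely rare.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100_000  # 64D, as in a typical attention head; 100k random pairs

# Normalizing isotropic Gaussian samples gives uniform directions on the sphere.
a = rng.standard_normal((n, d))
b = rng.standard_normal((n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

sims = np.sum(a * b, axis=1)  # cosine similarity of each random pair
print(f"std of cosine sim: {sims.std():.3f}")                          # ~1/sqrt(64) = 0.125
print(f"fraction with |cos| > 0.5: {(np.abs(sims) > 0.5).mean():.1e}")  # essentially zero
```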
It’s one of those things that is so rare that I like to think of it through the lens of “it’s not a coincidence, it’s a signal.”
For a more intuitive analogy:
If someone reads off the following words to you:
“Man”, “Table”, “Rooster”
There’s really not too much to think about here
Now if you heard the words
“Man”, “Woman”
There’s something of a pattern here.
Granted, the words mentioned aren’t necessarily latent opposites, and you’re not guaranteed nicely interpretable opposite vectors. Nonetheless, it feels worth trying.
Looking again at our activation function for attention, $\exp(x)$, we have the following properties for attention scores prior to normalization: