Written by Ethan Smith
https://github.com/ethansmith2000/clip-text-directions
I frequently geek out about the representational capacity of image embeddings. They can effectively represent our entire visual world in fewer than 1,000 dimensions, relatively small compared to many modern large neural networks, yet still an unfathomably large space to the human mind.
With that in mind, how can we effectively explore this space and better understand it?
It’s common to use classification and clustering methods, but I’m interested in something that gives us a more intuitive feel for what this space offers us.
Borrowing from classic data science, we can use Principal Component Analysis (PCA) to explore the directions in which this data varies. We can then decode the resulting embeddings with unCLIP models to visualize the concepts these directions cover.
PCA is a common data science technique for understanding the main “directions” along which data varies. It can surface interesting features of the data, or enable compression by isolating the most important components while filtering out noise and less important features.
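As a quick sketch of the compression idea, here's what that looks like with scikit-learn; the random matrix is just a stand-in for a real batch of embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 5,000 samples of 768-dim vectors. With real data this
# would be a matrix of CLIP embeddings, one row per image or caption.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 768))

# Keep only the top 20 principal components...
pca = PCA(n_components=20)
X_low = pca.fit_transform(X)          # (5000, 20)

# ...and map back to the original space: a low-rank approximation.
X_hat = pca.inverse_transform(X_low)  # (5000, 768)

print("variance explained:", pca.explained_variance_ratio_.sum())
print("reconstruction error:", np.mean((X - X_hat) ** 2))
```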
Similarly, we can apply this technique to the representation spaces of neural networks to learn more about them (see https://arxiv.org/abs/2208.06894 and https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight). The value of the interpretability these methods offer is a bit subjective, but it's really cool to see nonetheless.
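To make that concrete for CLIP, here's a minimal sketch of fitting PCA to text embeddings using the Hugging Face transformers CLIP model; the tiny prompt list is just an illustrative stand-in for a larger corpus:

```python
import torch
from sklearn.decomposition import PCA
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# A tiny stand-in corpus; in practice you'd embed thousands of captions.
prompts = [
    "a photo of a dog",
    "a watercolor painting of a city",
    "a portrait of an old man",
    "a red sports car at sunset",
]

with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    embeds = model.get_text_features(**tokens)  # (n_prompts, 768)

# Principal directions of variation in the embedding space.
pca = PCA(n_components=3)
pca.fit(embeds.numpy())
print(pca.components_.shape)          # (3, 768): one direction per row
print(pca.explained_variance_ratio_)  # variance captured by each direction
```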
The idea first came to me when the DALLE-2 paper showed that you can do approximate decodings of these embeddings, even compressing down to just the first 20 principal components and still capturing the general concept.
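Here's a sketch of that approximate-decoding experiment, assuming diffusers' StableUnCLIPImg2ImgPipeline, which can condition on precomputed image_embeds. The `pca` object is assumed to have been fit on embeddings from this pipeline's own CLIP image encoder, and `embed` is a (1, 1024) numpy array holding one such embedding; both are hypothetical placeholders:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline

# Placeholder assumptions: `pca` was fit on CLIP image embeddings matching
# this pipeline's image encoder (OpenCLIP ViT-H for sd-2-1-unclip), and
# `embed` is one such embedding as a (1, 1024) numpy array.
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
).to("cuda")

# Project onto the first 20 principal components and back: a low-rank
# approximation of the original embedding.
approx = pca.inverse_transform(pca.transform(embed))  # (1, 1024), numpy

# Decode the approximate embedding; the prompt is left empty so the
# image embedding carries the conditioning.
image = pipe(
    prompt="",
    image_embeds=torch.from_numpy(approx).half().to("cuda"),
).images[0]
image.save("approx_decode.png")
```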