Written by Ethan Smith
I started this exploration by comparing CLIP to T5 as text encoders on their own, prior to investigating how they affect diffusion models.
There were some qualities of CLIP that made it feel like a poor fit for use as conditioning, the two main ones being its causal attention mask and its reliance on a single pooled summary token.
A prompt like: “A bat flew through the air”
Because of the causal attention mask, the “bat” token cannot attend to the text that comes after it, so it remains a superposition between the animal and the sports stick.
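We can see this mechanically with a toy single-head causal self-attention layer (a minimal NumPy sketch, not CLIP's actual architecture or weights): with a lower-triangular mask, the output at the “bat” position is bit-for-bit identical no matter what continuation follows it.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal (lower-triangular) mask,
    as in CLIP's text encoder. Toy sketch: no learned projections."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # attention logits
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)      # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
d = 8
prefix = rng.normal(size=(2, d))    # toy embeddings for "A bat"
suffix_a = rng.normal(size=(3, d))  # e.g. "flew through the air"
suffix_b = rng.normal(size=(3, d))  # e.g. "hit the ball"

out_a = causal_self_attention(np.vstack([prefix, suffix_a]))
out_b = causal_self_attention(np.vstack([prefix, suffix_b]))

# The "bat" position (index 1) cannot see the suffix, so its output
# is identical under both continuations:
print(np.allclose(out_a[1], out_b[1]))  # True
```

Only positions at or after the point where the prompts diverge get different representations, which is exactly why disambiguating context that arrives later in the sentence cannot flow back into the “bat” token.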
I hypothesize that this results in the model becoming dependent on later tokens, especially the class/pooler token at the end.
This may also complicate fine-grained attributions, like color binding and object relations, since we may be relying on the spatial map from a single summary token rather than properly from each individual word.
However, we’ll find things get a lot weirder.
You may have seen some nice pictures like this in research papers:
https://prompt-to-prompt.github.io/
What’s happening here is we have