Written by Ethan Smith

Intro

I started this exploration by comparing CLIP to T5 as text encoders on their own, prior to investigating how they affect diffusion models.

There were some qualities of CLIP that made it feel like a poor fit for use as conditioning, the two main ones being:

  1. Output tokens are not well isolated from each other; instead we see a kind of smear of similarity across them. I hypothesize this is because CLIP’s training objective does not act directly on the full set of output tokens, only on the pooled token (see the sketch after this list).
  2. CLIP uses causal masking, meaning that each token in the final output sequence is only influenced by the tokens preceding it.
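
As a rough way to probe the first point, we can compare the pairwise cosine similarity between output tokens of CLIP’s text encoder and those of a T5 encoder. Below is a minimal sketch using Hugging Face transformers; the checkpoint names are illustrative choices on my part, not necessarily the exact encoders used by any particular diffusion model.

```python
# Minimal sketch: pairwise cosine similarity between the output tokens of
# CLIP's text encoder vs. a T5 encoder. Checkpoints are illustrative choices.
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "A bat flew through the air"

def token_similarity(tokenizer, encoder, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]  # (seq_len, dim)
    hidden = F.normalize(hidden, dim=-1)
    return hidden @ hidden.T  # (seq_len, seq_len) cosine similarity matrix

def mean_offdiag(sim):
    # Average similarity between *different* tokens (ignore the diagonal).
    mask = ~torch.eye(sim.shape[0], dtype=torch.bool)
    return sim[mask].mean().item()

clip_sim = token_similarity(
    CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14"),
    CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14"),
    prompt,
)
t5_sim = token_similarity(
    T5Tokenizer.from_pretrained("google/t5-v1_1-base"),
    T5EncoderModel.from_pretrained("google/t5-v1_1-base"),
    prompt,
)

# The claim above is that CLIP's tokens "smear" into each other, so this
# off-diagonal average should come out noticeably higher for CLIP than T5.
print("CLIP mean off-diagonal similarity:", mean_offdiag(clip_sim))
print("T5   mean off-diagonal similarity:", mean_offdiag(t5_sim))
```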

A prompt like: “A bat flew through the air”

Because “bat” cannot attend to the text that comes after it, the “bat” token remains a superposition between the animal and the baseball bat.
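
To make that concrete, here is a small sketch (the checkpoint name and token index are my own assumptions): because CLIP’s text transformer only attends leftward, the hidden state at the “bat” position comes out identical whether the rest of the sentence describes the animal or the baseball bat.

```python
# Minimal sketch: under CLIP's causal masking, the hidden state at the "bat"
# position only depends on the tokens before it, so two prompts that diverge
# after "bat" produce the same "bat" embedding. Checkpoint is illustrative.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def hidden(prompt):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        return enc(**inputs).last_hidden_state[0]  # (seq_len, dim)

h_animal = hidden("A bat flew through the air")
h_wooden = hidden("A bat was swung at the baseball")

# Token layout is <|startoftext|>, "a", "bat", ..., so index 2 is "bat".
print(torch.allclose(h_animal[2], h_wooden[2], atol=1e-5))  # expected: True
```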

I hypothesize that this results in the model becoming dependent on later tokens, especially the class/pooler token at the end.

This may also complicate fine-grained attributes like color and object relations, as the model may end up relying on the spatial map from a single summary token rather than properly attending to each individual word.
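
For reference, the “class/pooler token at the end” isn’t a separate learned token: in the Hugging Face CLIP implementation, the pooled text embedding is simply the hidden state at the EOS position, which is also what feeds the contrastive objective (after the text projection). A quick check, with the checkpoint again assumed:

```python
# Minimal sketch: CLIP's pooled text embedding is the hidden state at the EOS
# position, i.e. the "summary" token at the end. Checkpoint is illustrative.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tok("A bat flew through the air", return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs)

eos_pos = inputs.input_ids[0].argmax()  # EOS has the largest token id in CLIP's vocab
print(torch.allclose(out.pooler_output[0], out.last_hidden_state[0, eos_pos]))  # True
```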

However, we’ll find things get a lot weirder.

___

Cross Attention

You may have seen some nice pictures like this from research papers,

https://prompt-to-prompt.github.io/

What’s happening here is we have