Written by Ethan Smith

https://github.com/ethansmith2000/SGDImagePrompt

https://github.com/ethansmith2000/QuickEmbedding

Intro


A long while back, I made a tweak to textual inversion training that avoids involving the diffusion model in training at all. Instead, we simply leverage CLIP’s inherent ability to match text and images to produce textual embeddings that yield content similar to a set of reference images.
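
To make that concrete, here is a minimal sketch of such a CLIP-only optimization loop in PyTorch. This is an illustration of the idea rather than the code in the repos linked above; the checkpoint name, placeholder token, prompt template, image paths, learning rate, and step count are all placeholder assumptions. We learn a single new token embedding by maximizing the cosine similarity between the CLIP text embedding of a prompt containing that token and the CLIP image embeddings of the reference images.

```python
import torch
from PIL import Image
from transformers import (CLIPImageProcessor, CLIPTextModelWithProjection,
                          CLIPTokenizer, CLIPVisionModelWithProjection)

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "openai/clip-vit-large-patch14"  # assumed checkpoint (SD v1.x text encoder)
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(name).to(device)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(name).to(device)
image_processor = CLIPImageProcessor.from_pretrained(name)

# Register a placeholder token and initialize its embedding row from a related word.
placeholder, initializer = "<my-concept>", "dog"  # hypothetical names
tokenizer.add_tokens([placeholder])
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)
init_id = tokenizer.encode(initializer, add_special_tokens=False)[0]
text_encoder.resize_token_embeddings(len(tokenizer))
embeds = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    embeds[placeholder_id] = embeds[init_id].clone()
orig_embeds = embeds.detach().clone()

# The reference images' normalized CLIP embeddings are the fixed optimization targets.
images = [Image.open(p).convert("RGB") for p in ["ref_0.jpg", "ref_1.jpg"]]  # hypothetical paths
pixels = image_processor(images=images, return_tensors="pt").pixel_values.to(device)
with torch.no_grad():
    img_feats = image_encoder(pixel_values=pixels).image_embeds
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

# Freeze everything except the token-embedding matrix; no UNet, no diffusion loss.
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().requires_grad_(True)
optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=5e-3)

prompt = f"a photo of a {placeholder}"
tokens = tokenizer(prompt, padding="max_length", truncation=True,
                   max_length=tokenizer.model_max_length, return_tensors="pt").to(device)
keep = torch.ones(len(tokenizer), dtype=torch.bool, device=device)
keep[placeholder_id] = False  # every row except the placeholder gets restored each step

for step in range(500):
    txt_feats = text_encoder(**tokens).text_embeds
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    loss = 1.0 - (txt_feats @ img_feats.T).mean()  # maximize mean cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        embeds[keep] = orig_embeds[keep]  # only the new token's row actually moves

# The learned row can then be loaded into a diffusion model's text encoder,
# or used to warm-start standard textual inversion.
learned_embedding = embeds[placeholder_id].detach().cpu()
```

Because only a single embedding row is updated and nothing touches the UNet, each step is just one forward/backward pass through the CLIP text encoder.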

At this point, there are a number of methods that allow for image prompting, so I decided to release it as open source.

This approach is appealing for a couple of reasons:

  1. Textual inversion is notorious for overfitting.
    1. The embedding’s weight norm is typically unbounded apart from a small weight-decay term, so to minimize reconstruction error the trained embedding can simply learn a very large norm, “stealing” all the attention from the other text tokens and essentially encoding the training images themselves into the embedding.
    2. Because we never operate in pixel/image space here, only in CLIP’s semantic space, I believe this acts as a form of regularization.
  2. It is dramatically faster.
    1. Computing the loss through the diffusion model comes with several downsides, namely:
      1. backpropagating through the entire UNet and text encoder is costly and relatively slow
      2. the random sampling of noise and noise levels leads to noisier updates
    2. On an A100, we can get a useful text embedding in as little as 15 seconds.
    3. Admittedly, the results are a bit less adherent than those of standard textual inversion; however, we can also use this method as a “warm start” for textual inversion training to decrease the total time needed while retaining performance.

From this original observation, I then built a related method I call SGDImagePrompt, which uses the same optimization procedure but instead optimizes a given prompt to relate to an image.
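
As a rough sketch of what that could look like, the loop below reuses the setup from the snippet above (models, reference image features, optimizer) but makes the embedding rows of the tokens in an existing prompt trainable instead of a single new token. This is only my reading of “optimizes a given prompt to relate to an image”, not verified against the SGDImagePrompt repo, and the prompt and step count are placeholders.

```python
# Continuation of the sketch above, run as an independent experiment; assumes the
# variant tunes the embeddings of an existing prompt's tokens instead of one new token.
prompt = "a painting of a castle at sunset"  # hypothetical starting prompt
tokens = tokenizer(prompt, padding="max_length", truncation=True,
                   max_length=tokenizer.model_max_length, return_tensors="pt").to(device)

# Let every (non-special) token that appears in the prompt move; freeze the rest.
special_ids = set(tokenizer.all_special_ids)
prompt_ids = [i for i in tokens.input_ids.unique().tolist() if i not in special_ids]
keep = torch.ones(len(tokenizer), dtype=torch.bool, device=device)
keep[prompt_ids] = False

for step in range(300):
    txt_feats = text_encoder(**tokens).text_embeds
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    loss = 1.0 - (txt_feats @ img_feats.T).mean()  # same CLIP-space objective as before
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        embeds[keep] = orig_embeds[keep]  # restore all rows outside the prompt
```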

QuickEmbedding Method


The method is extremely simple.

  1. Acquire a small (~4+ images) dataset of captioned images of the concept you’d like to personalize