Written by Ethan Smith


Intro

Given that DPO is all the rage for learning from preference data right now, I’ve been thinking about some use cases for it aside from the vanilla implementation. Specifically:

<aside> 📌 Are there ways we can automate or assume preference data?

  1. Either through natural patterns in the data, or by presenting the same example side-by-side with the losing sample perturbed in some way.
  2. Can we extend it to choose between multiple samples? Does it even make sense to do that? </aside>

Background on DPO

DPO (Direct Preference Optimization) is a neat method that’s ~kinda RL, kinda not~ and has been pretty superb at learning from preference datasets for LLMs and diffusion models, among other things.

It’s a pretty easy setup that doesn’t come with a lot of the pains and complexity of RL, and it also generally converges more quickly than RL methods.

The reason I say it’s ~kinda~ RL is that the model still aims to maximize a reward; however, we never actually have a reward model in the typical sense of a classifier that outputs a score or something similar.

I’ll walk through the paper linked above on DPO for diffusion and explain how we can get away with that.


We start off with the Bradley-Terry model, which says the probability that X is preferred over Y can be modeled as the difference between their reward scores passed through the sigmoid function.
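Written out (my notation here, not necessarily the paper’s: r(·) is the reward score of a sample and σ is the sigmoid), the Bradley-Terry model says:

$$
p(X \succ Y) = \sigma\big(r(X) - r(Y)\big) = \frac{e^{r(X)}}{e^{r(X)} + e^{r(Y)}}
$$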

For some intuition on this, let’s look at the sigmoid function.

[Figure: plot of the sigmoid function]

To reiterate, we’re modeling the probability that item1 is preferred to item2.

When they have the same score (a difference of 0 on the x-axis), the probability is 0.5, which is as uncertain as we can get about whether one is better than the other. That makes sense!

As the score of item1 surpasses that of item2 (a positive difference), the probability increases, and we can be more certain that item1 is preferred over item2.
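To make this concrete, here’s a tiny sketch in plain Python (the reward numbers are made up for illustration, and preference_prob is just a helper name I’m using, not something from the paper):

```python
import math

def preference_prob(reward_1: float, reward_2: float) -> float:
    """Bradley-Terry probability that item1 is preferred over item2:
    sigmoid of the difference in their reward scores."""
    return 1.0 / (1.0 + math.exp(-(reward_1 - reward_2)))

print(preference_prob(1.0, 1.0))  # 0.50 -> equal scores, maximally uncertain
print(preference_prob(1.5, 1.0))  # ~0.62 -> item1 slightly better, leans toward item1
print(preference_prob(5.0, 1.0))  # ~0.98 -> item1 much better, near-certain preference
```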