Written by Ethan Smith
DPO is all the rage for learning from preference data right now, and I’ve been thinking about some use cases for it aside from the vanilla implementation. Specifically:
<aside> 📌 Are there ways we can automate or assume preference data?
DPO (Direct Preference Optimization) is a neat method that’s ~kinda RL, kinda not~ and has been pretty superb at learning from preference datasets for LLMs and diffusion models, among other things.
It’s a pretty easy setup that doesn’t come with a lot of the pains and complexity of RL, and it also generally converges more quickly than RL methods.
The reason I say it’s ~kinda~ RL is that the model still aims to maximize a reward; however, we never actually have a reward model in the typical sense of a classifier that outputs a score or similar.
I’ll walk through the paper linked above on DPO for diffusion and explain how we can get away with that.
We start off with the Bradley-Terry model, which says the probability that X is preferred over Y can be modeled as the difference between their reward scores passed through the sigmoid function. For some intuition on this, let’s look at the sigmoid function.
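In symbols (writing $r(\cdot)$ for the reward and $\sigma$ for the sigmoid), the Bradley-Terry model looks roughly like this:

$$
P(x_1 \succ x_2) \;=\; \sigma\big(r(x_1) - r(x_2)\big) \;=\; \frac{e^{r(x_1)}}{e^{r(x_1)} + e^{r(x_2)}}
$$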
To reiterate, we’re modeling the probability that item1 is preferred to item2.
When they have the same score (difference = 0 on the x-axis), the probability is 0.5, which is as uncertain as we can get about whether one is better than the other, which makes sense!
As the score of item1 surpasses that of item2 (positive difference), the probability increases, and we can be more certain that item1 is preferred over item2.
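To make that concrete, here’s a tiny sketch (plain NumPy; the name `preference_prob` is just mine, not from the paper) that plugs a few reward differences through the sigmoid:

```python
import numpy as np

def preference_prob(reward_1, reward_2):
    """Bradley-Terry: probability that item1 is preferred over item2."""
    return 1.0 / (1.0 + np.exp(-(reward_1 - reward_2)))

print(preference_prob(1.0, 1.0))  # 0.5   -> equal scores, total uncertainty
print(preference_prob(2.0, 1.0))  # ~0.73 -> item1 a bit better, modestly above 0.5
print(preference_prob(6.0, 1.0))  # ~0.99 -> item1 much better, probability saturates toward 1
```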