Paper Summary: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
An approach to align pre-trained LMs to human preferences without using Reinforcement Learning (RL).
WHY
Because RL-based instruction-tuning methods (such as RLHF) are costly and difficult to implement.
HOW
The authors derive a way to rewrite the RLHF objective as a simple loss function that can be optimized directly with standard gradient-based methods such as SGD.
The algorithm uses the same kind of training data used to train a reward model in RLHF (pairwise preference data), and the loss is computed from both the preferred and the dispreferred completion in each pair (see the sketch below).
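To make this concrete, here is a minimal sketch of the DPO loss in PyTorch. It is not the authors' code: the function and argument names are mine, and the summed log-probabilities of each completion under the current policy and the frozen reference model are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Hypothetical DPO loss for a batch of preference pairs.

    Each tensor has shape (batch,) and holds the sum of log p(y | x) over the
    tokens of the preferred ("chosen") or dispreferred ("rejected") completion.
    beta is the coefficient controlling the implicit KL penalty.
    """
    # Implicit reward of each completion: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification: the preferred completion should receive the higher
    # implicit reward. Loss is -log sigmoid(margin), averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the loss only needs log-probabilities from the policy and a frozen reference model, it can be minimized with ordinary minibatch SGD/Adam: no sampling, no separate reward model, and no RL loop.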
CLAIMS/QUOTES
Better than PPO in an objective evaluation: better results than PPO (the RL algorithm used by RLHF) as measured by the reward vs. KL-divergence frontier (KL measured against the reference policy).
Better than PPO in a subjective evaluation: also better results than RLHF-PPO, but the comparison setup is nonstandard and relies on proxies: the authors use GPT-4 judgments as a stand-in for human preferences, a pre-trained sentiment classifier to score generated text for sentiment, etc.
More stable than PPO: Learning with DPO is more stable (lower variance) than RLHF-PPO.
DPO converges quickly: "... DPO converges to its best performance relatively quickly."
Optimizing policies, not rewards: "...our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies." (The mapping is sketched right below.)
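For reference, this is the mapping the quote refers to, in the paper's notation (π_ref is the reference/SFT policy, β the KL coefficient, y_w/y_l the preferred/dispreferred completions):

```latex
% Optimal policy of the KL-constrained reward-maximization objective:
\pi_r(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
                      \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)

% Inverting it expresses the reward through its optimal policy:
r(x, y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)

% Substituting into the Bradley-Terry preference model, the intractable
% partition function Z(x) cancels, giving a loss directly over the policy:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```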
NOTES
LLM as a Judge: GPT-4 (zero-shot) was used to evaluate DPO against other types of fine-tuning. Crazy.
After SFT: DPO was applied to an LLM that had previously been fine-tuned with regular SFT.