Paper Summary: KTO: Model Alignment as Prospect Theoretic Optimization
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
The authors develop HALOs (Human-Aware Losses), a framework that brings insights from Prospect Theory¹ into the losses used to instruction-tune LLMs with DPO and similar strategies.
They introduce one specific HALO called KTO (Kahneman-Tversky Optimization) and use it to fine-tune LLMs.
WHY
Because they found that DPO already implicitly behaves like a HALO, and better losses could be designed by drawing on what Prospect Theory tells us about human decision-making under uncertainty.
HOW
They show that both DPO and PPO-Clip can be rewritten as HALOs.
They create KTO, a HALO based on Prospect Theory, whose hyperparameters let one adjust the relative weight of desirable vs. undesirable model outputs, as well as how far outputs may stray from the original LLM distribution (analogous to the KL penalty in DPO and RLHF); see the sketch below.
They fine-tune pretrained LLMs with KTO and compare it against other methods.
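To make the λ weights and the KL reference point concrete, here is a minimal PyTorch-style sketch of the KTO objective. The function name, argument names, and default values are my own, and the batch-mean estimate of the KL reference point is a simplification of the paper's mismatched-pair estimator; treat this as a sketch, not the authors' implementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of the KTO objective.

    policy_logps, ref_logps: (batch,) summed log-probs of each completion
        under the policy being trained and the frozen reference model.
    is_desirable: (batch,) bool tensor; True for a +1 label, False for -1.
    beta: controls how far the policy may drift from the reference.
    lambda_d, lambda_u: relative weights of desirable vs. undesirable outputs.
    """
    # Implied reward: log-ratio of policy to reference, as in DPO.
    rewards = policy_logps - ref_logps

    # Reference point z0: a crude batch-level stand-in for the KL penalty.
    # (The paper estimates this KL term from mismatched prompt/completion
    # pairs in the microbatch; taking the batch mean here is a simplification.)
    z0 = rewards.mean().clamp(min=0).detach()

    # Prospect-theoretic value: gains above the reference point are weighted
    # by lambda_d, losses below it by lambda_u.
    values = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),
    )

    # The paper minimizes E[lambda_y - v(x, y)]; the lambda_y offset is a
    # per-example constant, so minimizing -v gives the same gradients.
    return -values.mean()
```

This is where the regularization quoted below comes from: bluntly pushing up the reward of a desirable example also pushes up the KL reference point, so the value saturates unless the model finds what actually distinguishes good outputs.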
CLAIMS/QUOTES
Can use pointwise preference data (positive or negative): "KTO only requires a binary signal of whether an output is desirable or undesirable for an input."
KTO beats DPO: "KTO matches or exceeds DPO performance at scales from 1B to 30B parameters"
- Here they decomposed the pairwise preference data used for DPO into pointwise labels so they could train KTO on it
KTO is more sample-efficient than DPO: "[KTO matches] DPO performance while using up to 90% fewer desirable examples."
KTO can skip SFT: "When the pretrained model is sufficiently good, one can skip supervised finetuning (SFT) and go straight to KTO without a loss in generation quality, whereas SFT is always needed for best results with DPO."
DPO and PPO-Clip are human-aware losses (HALOs): As stated.
Below a certain scale, alignment adds little over SFT: "Up to a scale of 7B parameters, alignment provides virtually no gains over SFT alone."
Offline PPO with dummy rewards matches DPO: "Despite only using dummy +1/-1 rewards, our offline PPO variant performs as well as DPO."
- This is interesting: all the extra signal from graded preference levels made no difference.
- To be clear, this means that each output only needs a +1 (positive) or a -1 (negative) pointwise label from a human.
How KTO regularizes: "Intuitively, KTO works as follows: if the model increases the reward of a desirable example in a blunt manner, then the KL penalty also rises and no progress is made. This forces the model to learn exactly what makes an output desirable, so that the reward can be increased while keeping the KL term flat (or even decreasing it)."
EXTENDS/USES
- DPO
- Prospect Theory
NOTES
- On converting pairwise to pointwise preferences: the authors took pairwise datasets and assumed that the preferred element of each pair had a positive rating (+1) while the dispreferred element had a negative rating (−1). This assumption does not hold in general, and the authors acknowledge as much: a preferred output is not necessarily good in absolute terms, nor a dispreferred one bad. A sketch of the conversion follows below.
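A minimal sketch of that conversion, assuming the pairwise data comes as (prompt, chosen, rejected) tuples; the function and field names are mine, not the paper's or any specific library's.

```python
def pairwise_to_pointwise(pairs):
    """Split (prompt, chosen, rejected) preference pairs into pointwise
    examples: chosen -> desirable (+1), rejected -> undesirable (-1)."""
    pointwise = []
    for prompt, chosen, rejected in pairs:
        pointwise.append({"prompt": prompt, "completion": chosen, "label": True})
        pointwise.append({"prompt": prompt, "completion": rejected, "label": False})
    return pointwise
```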
MY 2¢
KTO looks good, but I'm not sure the comparisons with methods that use pairwise preference data are fair (maybe to the detriment of KTO itself). It's not at all clear what the trade-offs are when choosing one over the other.
Well-written paper with clear claims. Easy to read in spite of the dense mathematical notation.
REFERENCES
arXiv: KTO: Model Alignment as Prospect Theoretic Optimization
HuggingFace: Comparing DPO, IPO and KTO on 7B models
- According to this benchmark, KTO consistently lags DPO and IPO in performance
- A lot of manual hyperparameter tuning, so it's hard to say whether it's a fair comparison. Also, the decision on how to convert pairwise to pointwise preferences is probably crucial to KTO's performance.
1: Prospect Theory is the study of human biases and human decision-making under uncertainty.