Paper Summary: KTO: Model Alignment as Prospect Theoretic Optimization
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
The authors develop HALOs (Human-Aware Losses), a framework that brings insights from Prospect Theory¹ into the losses used to instruction-tune LLMs with DPO and similar strategies.
They introduce one specific HALO called KTO (Kahneman-Tversky Optimization) and use it to fine-tune LLMs.
WHY
Because they found that DPO already implicitly behaves like a HALO, and better losses could be designed by drawing on what Prospect Theory tells us about human decision-making under uncertainty.
HOW
They show that both DPO and PPO-Clip can be rewritten as HALOs.
They create KTO, a HALO based on Prospect Theory, whose hyperparameters let one adjust the relative weight of desirable vs. undesirable model outputs, as well as how far outputs may stray from the original LLM distribution (analogous to the KL penalty in DPO and RLHF); see the sketch below.
They fine-tune pretrained LLMs with KTO and compare it against other methods.
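To make the λ weights and the KL reference point concrete, here is a minimal PyTorch-style sketch of the KTO objective. The function name, argument names, and default values are my own, and the batch-mean estimate of the KL reference point is a simplification of the paper's mismatched-pair estimator; treat this as a sketch, not the authors' implementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of the KTO objective.

    policy_logps, ref_logps: (batch,) summed log-probs of each completion
        under the policy being trained and the frozen reference model.
    is_desirable: (batch,) bool tensor; True for a +1 label, False for -1.
    beta: controls how far the policy may drift from the reference.
    lambda_d, lambda_u: relative weights of desirable vs. undesirable outputs.
    """
    # Implied reward: log-ratio of policy to reference, as in DPO.
    rewards = policy_logps - ref_logps

    # Reference point z0: a crude batch-level stand-in for the KL penalty.
    # (The paper estimates this KL term from mismatched prompt/completion
    # pairs in the microbatch; taking the batch mean here is a simplification.)
    z0 = rewards.mean().clamp(min=0).detach()

    # Prospect-theoretic value: gains above the reference point are weighted
    # by lambda_d, losses below it by lambda_u.
    values = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(beta * (rewards - z0)),
        lambda_u * torch.sigmoid(beta * (z0 - rewards)),
    )

    # The paper minimizes E[lambda_y - v(x, y)]; the lambda_y offset is a
    # per-example constant, so minimizing -v gives the same gradients.
    return -values.mean()
```

This is where the regularization quoted below comes from: bluntly pushing up the reward of a desirable example also pushes up the KL reference point, so the value saturates unless the model finds what actually distinguishes good outputs.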
CLAIMS/QUOTES
Can use pointwise preference data (positive or negative): "KTO only requires a binary signal of whether an output is desirable or undesirable for an input."
KTO beats DPO: "KTO matches or exceeds DPO performance at scales from 1B to 30B parameters"
- Here they decomposed the pairwise preference data used for DPO into pointwise labels so they could train KTO on it
KTO is more sample-efficient than DPO: "[KTO matches] DPO performance while using up to 90% fewer desirable examples."
KTO can skip SFT: "When the pretrained model is sufficiently good, one can skip supervised finetuning (SFT) and go straight to KTO without a loss in generation quality, whereas SFT is always needed for best results with DPO."
DPO and PPO-Clip are human-aware losses (HALOs): As stated.
Below a certain scale, alignment adds little over SFT: "Up to a scale of 7B parameters, alignment provides virtually no gains over SFT alone."
Offline PPO with dummy rewards matches DPO: "Despite only using dummy +1/-1 rewards, our offline PPO variant performs as well as DPO."
- This is interesting: all the extra signal from graded preference levels made no difference.
- To be clear, this means that each output only needs a +1 (positive) or a -1 (negative) pointwise label from a human.
How KTO regularizes: "Intuitively, KTO works as follows: if the model increases the reward of a desirable example in a blunt manner, then the KL penalty also rises and no progress is made. This forces the model to learn exactly what makes an output desirable, so that the reward can be increased while keeping the KL term flat (or even decreasing it)."
EXTENDS/USES
- DPO
- Prospect Theory
NOTES
- On converting pairwise to pointwise preferences: the authors took pairwise datasets and assumed that the preferred element of each pair had a positive rating (+1) while the dispreferred element had a negative rating (−1). This assumption does not hold in general, and the authors acknowledge as much: a preferred output is not necessarily good in absolute terms, nor a dispreferred one bad. A sketch of the conversion follows below.
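A minimal sketch of that conversion, assuming the pairwise data comes as (prompt, chosen, rejected) tuples; the function and field names are mine, not the paper's or any specific library's.

```python
def pairwise_to_pointwise(pairs):
    """Split (prompt, chosen, rejected) preference pairs into pointwise
    examples: chosen -> desirable (+1), rejected -> undesirable (-1)."""
    pointwise = []
    for prompt, chosen, rejected in pairs:
        pointwise.append({"prompt": prompt, "completion": chosen, "label": True})
        pointwise.append({"prompt": prompt, "completion": rejected, "label": False})
    return pointwise
```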
MY 2¢
KTO looks good, but I'm not sure the comparisons with methods that use pairwise preference data are fair (maybe to the detriment of KTO itself). It's not at all clear what the trade-offs are when choosing one over the other.
Well-written paper with clear claims. Easy to read in spite of the dense mathematical notation.
REFERENCES
arXiv: KTO: Model Alignment as Prospect Theoretic Optimization
HuggingFace: Comparing DPO, IPO and KTO on 7B models
- According to this benchmark, KTO consistently lags DPO and IPO in performance
- A lot of manual hyperparameter tuning, so it's hard to say whether it's a fair comparison. Also, the decision on how to convert pairwise to pointwise preferences is probably crucial to KTO's performance.
1: Prospect Theory is the study of human biases and human decision-making under uncertainty.