Paper Summary: Llama 2: Open Foundation and Fine-Tuned Chat Models

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

[Cover image: Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv). Source]

WHAT

Updated version of LLaMA 1 (summary), trained on more data (still only publicly available data), with double the context length and grouped-query attention in the larger models.

Two model variants are published: a base LLM (Llama 2) and an instruction-tuned chat version (Llama 2-Chat).

HOW

  • LLaMA-2: Similar to LLaMA-1, trained on 40% more data (publicly available sources only), with better data cleaning and a context length doubled to 4k tokens. One epoch over the training data. The larger models also use grouped-query attention (a minimal sketch follows this list).

  • LLaMA-2-chat: SFT and RLHF instruction-tuning on top of LLaMA-2.
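
Below is a minimal sketch of grouped-query attention, the "enhanced attention" referenced above: groups of query heads share a single key/value head, which shrinks the K/V projections and the KV cache. The PyTorch code, head counts, and weight shapes here are illustrative assumptions, not the paper's implementation (the paper applies GQA only to its larger variants).

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    """Causal self-attention where groups of query heads share one K/V head."""
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_heads
    group = n_heads // n_kv_heads  # query heads per shared K/V head

    q = (x @ wq).view(bsz, seqlen, n_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)

    # Repeat each K/V head so it serves `group` query heads.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    causal_mask = torch.triu(torch.full((seqlen, seqlen), float("-inf")), diagonal=1)
    attn = F.softmax(scores + causal_mask, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(bsz, seqlen, dim)

# Example shapes: wq is (dim, dim); wk and wv are (dim, n_kv_heads * head_dim).
x = torch.randn(1, 16, 512)
out = grouped_query_attention(
    x, torch.randn(512, 512), torch.randn(512, 128), torch.randn(512, 128)
)
```

The point of GQA is visible in the shapes: the query projection keeps all `n_heads` heads, while the key/value projections (and therefore the KV cache at inference time) are a fraction of the usual multi-head-attention size.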

CLAIMS

  • Using a smaller but higher-quality SFT dataset yields better results than a larger, noisier one.

  • RLHF is responsible for most of the increase in instruction-following performance.

QUOTES

  • Small but high-quality instruction-following data for SFT: "We found that SFT annotations in the order of tens of thousands was (sic) enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations"

  • Reward model initialization: "We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model knows."
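
To make the reward-model setup above concrete, here is a rough sketch of the architecture and of the pairwise ranking loss with the preference-dependent margin the paper describes (also relevant to the regression-head note further down). It assumes a Hugging Face-style transformer backbone; the final-token pooling and all names are my own illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """LM backbone (initialized from the chat checkpoint) whose next-token head
    is replaced by a regression head producing one scalar reward per sequence."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                    # pretrained transformer
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1    # index of the final non-padding token
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(pooled).squeeze(-1)

def ranking_loss(r_chosen, r_rejected, margin):
    """Binary ranking loss with a margin that grows with how clear-cut the
    human preference was: -log sigmoid(r_chosen - r_rejected - margin)."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```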

EXTENDS/USES

  • Main architectural decisions from LLaMA-1 (Touvron et al., 2023).

  • Grouped-query Attention (GQA), from Ainslie et al., 2023.

  • RLHF loop from Instruct-GPT (Ouyang et al., 2022).

    • But they complement PPO with Rejection Sampling fine-tuning: earlier RLHF versions use rejection sampling only, with PPO applied on top afterwards (a sketch of the rejection-sampling step follows this list).
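
A schematic of the rejection-sampling step, under the assumption that the policy and reward model are exposed as simple callables (these interfaces are stand-ins, not the paper's actual code):

```python
from typing import Callable, List, Tuple

def rejection_sample(
    generate: Callable[[str], str],       # draws one answer from the current policy
    score: Callable[[str, str], float],   # reward model score for (prompt, answer)
    prompts: List[str],
    k: int = 4,
) -> List[Tuple[str, str]]:
    """For each prompt, sample K candidate answers and keep only the one the
    reward model scores highest."""
    best_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        rewards = [score(prompt, c) for c in candidates]
        best_pairs.append((prompt, candidates[rewards.index(max(rewards))]))
    return best_pairs
```

The resulting (prompt, best answer) pairs are then used as targets for ordinary supervised fine-tuning, and the loop is repeated with the updated policy.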

NOTES

  • Just like the DPO paper (summary), the authors used GPT-4 to evaluate the models subjectively.

  • Authors tried to decrease hallucination by up-sampling the most factual/trusted sources during pretraining.

  • Two reward models were trained: one optimized only for helpfulness, the other only for safety.

  • The reward model is also a transformer-based LM, but the next-token prediction head is replaced with a regression head that outputs a scalar reward (see the sketch under QUOTES above).

  • Authors introduce Ghost Attention (GAtt), a fine-tuning trick (not actually an attention variant) that helps the model keep following an initial instruction across the turns of a multi-turn chat conversation (see the data-construction sketch after this list).

  • Authors used red-team adversarial attacks on the model to test its safety.
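
For reference, a toy sketch of the GAtt data construction as I understand it: the instruction is attached to every user turn while sampling the synthetic dialogue, then kept only in the first turn for training. The dialogue representation and function names are my own illustration; the paper also zeroes the training loss on tokens from earlier turns, which is omitted here.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (user_message, assistant_reply)

def gatt_sampling_view(instruction: str, user_msgs: List[str]) -> List[str]:
    """While *generating* the synthetic dialogue, the system instruction is
    concatenated to every user message so all sampled replies respect it."""
    return [f"{instruction}\n{msg}" for msg in user_msgs]

def gatt_training_view(instruction: str, dialogue: List[Turn]) -> List[Turn]:
    """For *training*, the instruction is kept only in the first user turn and
    dropped from the rest, so the model has to carry it across later turns."""
    return [
        (f"{instruction}\n{user}" if i == 0 else user, assistant)
        for i, (user, assistant) in enumerate(dialogue)
    ]
```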

MY 2¢

  • Perplexity (PPL) shows no sign of saturation as more training tokens are used (Figure 5).
