Paper Summary: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (paper front page). Source

WHAT

Chain-of-Thought (CoT) prompting is a technique whereby the model first generates intermediate reasoning steps that augment and more thoroughly explain the input, and only then produces the final answer.

Chain-of-Thought prompting example. Source
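
To make this concrete, here is the arithmetic exemplar from the paper's Figure 1, written out as strings. The variable names are mine; the question and rationale text are from the paper.

```python
# One CoT exemplar from the paper (Figure 1): the demonstration pairs a
# question with a worked-out rationale instead of a bare answer.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

# A standard few-shot exemplar, by contrast, shows only the final answer:
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11."
)
```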

WHY

Because even large LLMs struggle with tasks that require multiple steps of reasoning and/or symbolic reasoning, and simply adding more parameters doesn't seem to help much.

HOW

Using the few-shot prompting technique, one adds worked examples showing how to "augment" the input, so that the task is demonstrated more explicitly and the model has more context to work with. Then, proceed as normal.
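
A minimal sketch of what this looks like in practice. Assumptions: `generate` is a hypothetical stand-in for whatever completion API your model exposes (the paper itself used LaMDA, GPT-3, and PaLM), and only one of the exemplars is shown.

```python
# Minimal sketch of few-shot CoT prompting against a text-completion model.

COT_EXEMPLARS = [
    # The paper's experiments used 8 (question, worked rationale) pairs;
    # one is shown here (from Figure 1 of the paper).
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.",
    ),
    # ... 7 more exemplars in the actual experiments ...
]

def build_cot_prompt(question: str) -> str:
    """Prepend the CoT exemplars to the new question, few-shot style."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in COT_EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA:"

# Usage (generate() is a placeholder for your model's completion call):
# completion = generate(build_cot_prompt(
#     "The cafeteria had 23 apples. If they used 20 to make lunch and "
#     "bought 6 more, how many apples do they have?"))
# The model should now emit a reasoning chain ending in "The answer is N."
```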

CLAIMS/QUOTES

  • Types of tasks amenable to CoT: "... chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks"

  • Interpretability: CoT adds some level of interpretability, since the generated chain shows how the LLM is "thinking" as it produces an output. This is interesting.

  • CoT can be added to any model without retraining or fine-tuning: you simply include CoT exemplars in the prompt of any pre-trained LLM and use few-shot prompting.

  • Emergent behavior: CoT only works in large models: "... chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ∼100B parameters."

    • In smaller models, the chains produced were "fluent but illogical", actually making results worse than with normal prompting.
  • Performance vs special-purpose models: vanilla LLMs with CoT in-context learning can outperform LLMs that have been fine-tuned to excel in specific domains (e.g. GPT-3 fine-tuned on math problems).

  • CoT helps more the more complex a task is: the performance gains from CoT are larger for tasks that require multi-step reasoning, such as logic and common-sense tasks.

NOTES

  • The CoT need not be shown to the user of the model; it can simply be hidden (see the sketch after this list).

  • The experiments used just 8 CoT exemplars in the context for few-shot prompting.
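
A sketch of how hiding the chain might work: since the exemplars end with a final-answer sentence, an application can surface only that part. The function and regex below are my illustration, not something from the paper.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Return only the final answer from a CoT completion, hiding the chain.

    Relies on the exemplars' convention of ending with "The answer is X.";
    if the model deviates from that format, fall back to the full text.
    """
    match = re.search(r"The answer is\s*(.+?)\.?\s*$", completion.strip())
    return match.group(1) if match else completion

# extract_final_answer("Roger started with 5 balls. ... The answer is 11.")
# -> '11'
```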

MY 2¢

  • It's important to realize that CoT is an inference-time technique. It does not change the training-time setup of a model at all!

    • "No language models were finetuned in the process of writing this paper."
  • This wasn't discussed in the paper, but there are definitely latency tradeoffs when adding CoT to a model, since it has to generate the whole reasoning chain before the final answer.

