If you only learn one advanced prompting technique, make it chain of thought. It's the single biggest improvement you can make to AI output quality on complex tasks, and it works across every major model.
This tutorial goes deeper than the usual "just add 'think step by step.'" You'll learn the different variants, when each one works best, and see real before-and-after examples that demonstrate exactly why this technique matters.
What Is Chain-of-Thought Prompting?
Chain-of-thought (CoT) prompting is a technique where you ask the model to reason through a problem step by step before giving its final answer. Instead of producing an answer directly, the model generates intermediate reasoning steps that lead to its conclusion.
Why does this matter? Language models generate text one token at a time. Each new token is influenced by all the tokens that came before it. When a model jumps straight to an answer, it's making that answer based only on the question. When it reasons step by step, each reasoning step becomes additional context that informs the next step and the final answer.
Think of it this way. If someone asks you "What is 347 times 28?" you don't just stare at the numbers and blurt out an answer. You break it down: 347 times 20 is 6,940. 347 times 8 is 2,776. Add them: 9,716. Each intermediate result makes the final answer more reliable. CoT prompting makes language models work the same way.
The Two Main Variants
Zero-Shot Chain of Thought
Zero-shot CoT is the simplest version. You just add a phrase like "Let's think step by step" or "Think through this carefully before answering" to your prompt. No examples needed.
Without CoT:
A store has 45 apples. They sell 12 in the morning and receive a shipment of 30 in the afternoon. Then they sell 18 more before closing. How many apples do they have at the end of the day?
Model response (no CoT): 55 apples. [Incorrect]
With zero-shot CoT:
A store has 45 apples. They sell 12 in the morning and receive a shipment of 30 in the afternoon. Then they sell 18 more before closing. How many apples do they have at the end of the day? Let's work through this step by step.
Model response (with CoT):
Step 1: Start with 45 apples.
Step 2: Sell 12 in the morning. 45 - 12 = 33 apples.
Step 3: Receive shipment of 30. 33 + 30 = 63 apples.
Step 4: Sell 18 before closing. 63 - 18 = 45 apples.
Answer: 45 apples. [Correct]
The phrase "Let's think step by step" was popularized by Kojima et al. in their 2022 paper "Large Language Models are Zero-Shot Reasoners." They showed this simple addition improved accuracy by 10 to 40 percentage points on reasoning benchmarks. It's one of the highest-impact, lowest-effort techniques in prompt engineering.
Other effective trigger phrases:
- "Think through this step by step before answering."
- "Break this problem down into parts."
- "Show your reasoning, then give your final answer."
- "Work through this carefully."
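In code, zero-shot CoT is nothing more than string concatenation. The sketch below shows one way to attach a trigger phrase to any prompt before sending it to a model; the helper name `with_cot` and the `COT_TRIGGERS` list are our own, not a library API.

```python
# Minimal zero-shot CoT sketch: append a trigger phrase to a prompt.
# The phrases come from the list above; names here are illustrative.

COT_TRIGGERS = [
    "Let's think step by step.",
    "Think through this step by step before answering.",
    "Break this problem down into parts.",
    "Show your reasoning, then give your final answer.",
]

def with_cot(prompt: str, trigger: str = COT_TRIGGERS[0]) -> str:
    """Return the prompt with a CoT trigger appended on its own line."""
    return f"{prompt.rstrip()}\n\n{trigger}"

print(with_cot("A store has 45 apples. They sell 12 in the morning..."))
```

Keeping the trigger on its own line, separated by a blank line, makes it easy to swap phrases during testing without touching the task text.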
Few-Shot Chain of Thought
Few-shot CoT combines few-shot prompting with chain-of-thought reasoning. You provide examples that include the reasoning steps, not just the final answers. The model learns to replicate the reasoning pattern.
Few-shot CoT prompt:
Q: Sarah is older than Tom. Tom is older than Jake. Is Sarah older than Jake?
A: Let me work through this logically.
- Sarah is older than Tom. So Sarah > Tom in age.
- Tom is older than Jake. So Tom > Jake in age.
- Since Sarah > Tom and Tom > Jake, by transitivity, Sarah > Jake.
- Therefore, yes, Sarah is older than Jake.
Q: All doctors are professionals. Some professionals are wealthy. Are all doctors wealthy?
A: Let me work through this logically.
- All doctors are professionals. This means every doctor belongs to the set of professionals.
- Some professionals are wealthy. This means at least one professional is wealthy, but not necessarily all.
- The fact that some professionals are wealthy doesn't guarantee that the doctor subset falls within the wealthy subset.
- Therefore, no, we cannot conclude that all doctors are wealthy.
Q: If it rains, the ground gets wet. The ground is wet. Did it rain?
A:
The model sees the pattern: analyze each premise, connect them logically, then draw a conclusion. It will follow this pattern for the new question, producing: "Let me work through this logically. If it rains, the ground gets wet. This is a conditional: rain implies wet ground. The ground is wet. But wet ground doesn't necessarily imply rain. A sprinkler could have caused it. This is the logical fallacy called 'affirming the consequent.' Therefore, no, we cannot conclude it rained just because the ground is wet."
Few-shot CoT is more reliable than zero-shot CoT when you need the reasoning to follow a specific structure, when the task requires domain-specific logic, or when you want consistent output formatting.
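Assembling a few-shot CoT prompt is mechanical once you have worked examples: join (question, reasoning) pairs in Q/A format and end with the new question and a bare "A:". A minimal sketch, with a hypothetical `build_few_shot_cot` helper:

```python
# Sketch of few-shot CoT prompt assembly. Each example pairs a question
# with its full reasoning chain, not just the final answer.

def build_few_shot_cot(examples, new_question):
    """Join (question, reasoning) pairs in Q/A format, ending with the new Q."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Sarah is older than Tom. Tom is older than Jake. Is Sarah older than Jake?",
     "Let me work through this logically. Sarah > Tom and Tom > Jake, "
     "so by transitivity Sarah > Jake. Therefore, yes."),
]
prompt = build_few_shot_cot(
    examples,
    "If it rains, the ground gets wet. The ground is wet. Did it rain?",
)
print(prompt)
```

Ending the prompt with "A:" invites the model to continue in the same reasoning style as the examples.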
When to Use Chain of Thought (and When Not To)
CoT Works Best For
- Math and arithmetic: Any task involving calculations, especially multi-step ones. CoT catches errors that occur when models try to do math "in their head."
- Logic and reasoning: Syllogisms, conditionals, transitive relationships. The step-by-step format prevents logical leaps.
- Multi-step analysis: Tasks where you need to consider multiple factors before reaching a conclusion. Diagnostic reasoning, root cause analysis, decision-making.
- Complex classification: When the classification depends on multiple criteria that interact. Sentiment analysis of nuanced text, compliance checking, medical coding.
- Word problems: Any task that requires extracting relevant information from natural language and applying it to reach an answer.
CoT Doesn't Help (and Can Hurt) For
- Simple factual questions: "What year was Python released?" doesn't benefit from step-by-step reasoning. The model either knows it or doesn't.
- Creative writing: Asking a model to "think step by step" before writing a poem usually produces worse results. Creativity benefits from fluidity, not structure.
- Simple classification: Binary sentiment analysis of clear-cut text ("I love this product!" or "Worst purchase ever") doesn't need reasoning steps. CoT adds tokens and latency without improving accuracy.
- Speed-critical applications: CoT generates more tokens, which means higher latency and cost. If you're classifying thousands of inputs per minute, the overhead matters.
Worked Examples
Let's look at real-world scenarios where CoT makes a measurable difference.
Example 3: Multi-Step Data Analysis
Prompt (without CoT): A company had $2.4M revenue in Q1, $3.1M in Q2, $2.8M in Q3, and $3.7M in Q4. Their costs were 65% of revenue in Q1-Q3 and 70% in Q4. What was their total annual profit?
Response: $3.89M [Often incorrect, especially the Q4 calculation]
Prompt (with CoT): A company had $2.4M revenue in Q1, $3.1M in Q2, $2.8M in Q3, and $3.7M in Q4. Their costs were 65% of revenue in Q1-Q3 and 70% in Q4. What was their total annual profit? Show your calculations for each quarter.
Response:
Q1: Revenue $2.4M, Costs $2.4M x 0.65 = $1.56M, Profit = $0.84M
Q2: Revenue $3.1M, Costs $3.1M x 0.65 = $2.015M, Profit = $1.085M
Q3: Revenue $2.8M, Costs $2.8M x 0.65 = $1.82M, Profit = $0.98M
Q4: Revenue $3.7M, Costs $3.7M x 0.70 = $2.59M, Profit = $1.11M
Total annual profit: $0.84M + $1.085M + $0.98M + $1.11M = $4.015M [Correct]
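The same quarterly arithmetic, checked in code. Revenue figures and cost ratios are the ones given in the prompt; this is simply the calculation a correct CoT trace should reproduce.

```python
# Verify the quarterly profit arithmetic from the example above.
revenue = {"Q1": 2.4, "Q2": 3.1, "Q3": 2.8, "Q4": 3.7}        # $M
cost_ratio = {"Q1": 0.65, "Q2": 0.65, "Q3": 0.65, "Q4": 0.70}

profit = {q: revenue[q] * (1 - cost_ratio[q]) for q in revenue}
total = sum(profit.values())
print(f"Total annual profit: ${total:.3f}M")
```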
Example 4: Ethical Reasoning
Prompt: A self-driving car's AI must choose between swerving left (hitting one pedestrian) or swerving right (hitting a concrete barrier, injuring the passenger). Going straight would hit three pedestrians. Analyze the ethical considerations step by step.
Response with CoT:
Step 1: Identify the options and their consequences.
- Option A (swerve left): 1 pedestrian harmed
- Option B (swerve right): 1 passenger harmed
- Option C (straight): 3 pedestrians harmed
Step 2: Apply utilitarian analysis. Pure numbers favor Option A or B (1 person harmed vs. 3). Between A and B, the harm count is equal but the relationship differs.
Step 3: Consider the duty-of-care distinction. The passenger chose to ride in the vehicle and accepted some risk. Pedestrians did not consent to any risk from the vehicle. This creates an asymmetric duty of care.
Step 4: Consider the precedent. If self-driving cars prioritize passenger safety over pedestrian safety, public trust decreases. If they sacrifice passengers, nobody buys them.
Step 5: Note the framing limitations. Real-world scenarios rarely present clean choices. The car should brake, alert, and minimize total harm rather than choosing a "target."
Without CoT, the model would likely give a surface-level answer about the trolley problem. The step-by-step format forces it to consider multiple ethical frameworks and practical implications.
Example 5: Code Debugging
Prompt: This Python function should return the second largest number in a list, but it's buggy. Find and explain the bug step by step.
def second_largest(nums):
    largest = nums[0]
    second = nums[0]
    for n in nums:
        if n > largest:
            second = largest
            largest = n
    return second
Response with CoT:
Let me trace through this with an example.
Input: [1, 5, 3, 9, 7]
- Start: largest = 1, second = 1
- n=1: 1 > 1? No. Skip.
- n=5: 5 > 1? Yes. second = 1, largest = 5
- n=3: 3 > 5? No. Skip.
- n=9: 9 > 5? Yes. second = 5, largest = 9
- n=7: 7 > 9? No. Skip.
- Return second = 5. But the second largest of [1, 5, 3, 9, 7] is 7, so the function already fails here: 7 is never recorded because it is only compared against `largest`, never `second`.
But try: [5, 5, 5]
- Start: largest = 5, second = 5
- All elements equal to largest, nothing updates.
- Return 5. Is 5 the second largest of [5, 5, 5]? Ambiguous, but probably fine.
Now try: [9, 5, 3, 1]
- Start: largest = 9, second = 9
- n=9: 9 > 9? No. Skip.
- n=5: 5 > 9? No. Skip.
- n=3: 3 > 9? No. Skip.
- n=1: 1 > 9? No. Skip.
- Return second = 9. Bug! The second largest is 5, not 9.
The bugs: the `if n > largest` branch is the only place `second` is updated, so values between `second` and `largest` (like 7 in [1, 5, 3, 9, 7]) are silently dropped, and when the largest element comes first, such as in a descending-order list, `second` is never updated at all. The fix: initialize both variables to negative infinity and add an `elif` branch for elements larger than `second` but not larger than `largest`.
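One possible fix along those lines, as a sketch. Note this variant treats duplicates as one value, so it returns the second largest distinct value; the original's behavior on [5, 5, 5] was ambiguous anyway.

```python
def second_largest(nums):
    """Return the second largest distinct value in nums."""
    largest = float("-inf")
    second = float("-inf")
    for n in nums:
        if n > largest:
            second = largest          # old maximum becomes the runner-up
            largest = n
        elif largest > n > second:    # between second and largest
            second = n
    return second

print(second_largest([9, 5, 3, 1]))   # the case that exposed the bug
```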
Example 6: Business Decision Analysis
Prompt: We're choosing between two vendors for our AI infrastructure. Vendor A: $0.03/1K tokens, 99.9% uptime SLA, 200ms average latency, limited to GPT-4. Vendor B: $0.05/1K tokens, 99.5% uptime SLA, 150ms average latency, supports GPT-4, Claude, and Gemini. We process 50M tokens/month. Analyze this decision step by step.
Response with CoT:
Step 1: Cost comparison at our volume.
- Vendor A: 50M/1K x $0.03 = $1,500/month
- Vendor B: 50M/1K x $0.05 = $2,500/month
- Difference: $1,000/month, $12,000/year
Step 2: Uptime impact.
- 99.9% uptime = 43.8 minutes of downtime/month
- 99.5% uptime = 3.65 hours of downtime/month
- Vendor B has roughly 5x more expected downtime. Depending on our use case, this could be significant.
Step 3: Latency comparison.
- Vendor B is 50ms faster per request (150ms vs. 200ms). For user-facing applications that difference is noticeable; for batch processing it's negligible.
Step 4: Model flexibility.
- Vendor A locks us to GPT-4. If OpenAI has outages, rate limits, or price increases, we have no alternative.
- Vendor B supports three providers. This gives us failover capability and negotiating power.
Step 5: Risk assessment.
- Single-vendor dependency is the biggest risk factor. Model providers have had multi-hour outages. With Vendor A, an OpenAI outage means we're down. With Vendor B, we failover to Claude or Gemini.
Recommendation: Vendor B. The $12K/year premium buys model diversity and failover capability. The lower uptime SLA is concerning, but the multi-model support actually reduces real-world downtime risk more than the SLA difference increases it.
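The cost and downtime arithmetic from steps 1 and 2, in code. Prices, SLAs, and volume are the figures from the prompt; the 43,800-minute month matches the ~30.4-day month the downtime numbers above imply.

```python
# Verify the vendor cost and downtime figures from the analysis above.
MONTH_MINUTES = 43_800  # ~30.4-day month, matching the text's 43.8-minute figure

def monthly_cost(tokens_per_month, price_per_1k):
    return tokens_per_month / 1_000 * price_per_1k

def downtime_minutes(sla):
    return (1 - sla) * MONTH_MINUTES

cost_a = monthly_cost(50_000_000, 0.03)
cost_b = monthly_cost(50_000_000, 0.05)
print(f"Vendor A ${cost_a:,.0f}/mo, Vendor B ${cost_b:,.0f}/mo")
print(f"Downtime: {downtime_minutes(0.999):.1f} min vs "
      f"{downtime_minutes(0.995) / 60:.2f} hours per month")
```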
Advanced Techniques
Tree of Thought
Tree of thought (ToT) extends chain of thought by exploring multiple reasoning paths simultaneously. Instead of one linear chain, the model considers several approaches and evaluates which path is most promising.
Prompt: I need to reduce customer churn by 20% in 6 months. Explore three different strategic approaches, evaluate each, then recommend the strongest one.
Approach 1: Proactive engagement. Identify at-risk customers using usage patterns and reach out before they leave. Evaluate: How quickly can we build the prediction model? Do we have the usage data?
Approach 2: Pricing restructuring. Offer flexible pricing tiers that match different usage levels so customers feel they're getting fair value. Evaluate: What's the revenue impact? How do current customers react to plan changes?
Approach 3: Product improvement. Focus on the top 3 features customers request and ship them fast. Evaluate: Do we know what features matter most? Can engineering deliver in 6 months?
Compare the three approaches on: speed of impact, cost, risk, and likelihood of hitting the 20% target.
Tree of thought is most useful when there are multiple viable approaches and you need to compare them systematically. It prevents the model from fixating on the first solution it generates.
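The "compare systematically" step can itself be made explicit. A sketch of a weighted-scoring comparison over the three approaches; the weights and scores here are illustrative placeholders, not real estimates, and the structure is our own rather than a standard ToT API.

```python
# Illustrative comparison matrix for the three churn-reduction approaches.
# Scores (1-5) and weights are placeholders you would replace with the
# model's evaluations or your own.

criteria_weights = {"speed": 0.3, "cost": 0.2, "risk": 0.2, "target_fit": 0.3}

approaches = {
    "proactive_engagement": {"speed": 4, "cost": 3, "risk": 4, "target_fit": 4},
    "pricing_restructure":  {"speed": 3, "cost": 2, "risk": 2, "target_fit": 3},
    "product_improvement":  {"speed": 2, "cost": 2, "risk": 3, "target_fit": 4},
}

def weighted_score(scores):
    return sum(criteria_weights[c] * s for c, s in scores.items())

ranked = sorted(approaches, key=lambda a: weighted_score(approaches[a]),
                reverse=True)
print(ranked)
```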
Self-Consistency
Self-consistency generates multiple chain-of-thought responses to the same prompt and takes the majority answer. It's essentially ensemble prompting.
Process:
1. Send the same prompt 5 times with temperature 0.7
2. Each response reasons through the problem step by step
3. Compare the final answers
4. Take the majority answer
If 4 out of 5 responses say "42", you have high confidence.
If responses split 2-2-1, the task might be ambiguous or the prompt needs refinement.
Self-consistency works best for tasks with definitive correct answers: math, classification, factual questions. It's less useful for creative or open-ended tasks where "correct" is subjective.
The tradeoff: self-consistency costs 5x more (5 API calls instead of 1). Use it selectively for high-stakes decisions where accuracy matters more than cost.
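The voting step is simple to implement. A sketch of the self-consistency loop: `ask_model` is a placeholder for your real API call (N samples at temperature 0.7), stubbed out here so the voting logic itself runs standalone.

```python
# Self-consistency sketch: sample N answers, take the majority.
from collections import Counter

def ask_model(prompt, temperature=0.7):
    # Placeholder: replace with a real API call that returns a final answer.
    raise NotImplementedError("replace with a real API call")

def majority_answer(answers):
    """Return (winning answer, vote count) from a list of final answers."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes

# With 5 sampled final answers:
samples = ["42", "42", "41", "42", "42"]
answer, votes = majority_answer(samples)
print(answer, f"{votes}/{len(samples)}")
```

A 4/5 split signals high confidence; a 2-2-1 split signals an ambiguous task or a prompt that needs refinement, exactly as described above.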
Practical Tips for Production CoT
Separating Reasoning from Output
In production, you often want the reasoning but don't want to show it to the end user. Structure your prompt to produce both, then extract only what you need.
Prompt pattern:
Analyze the following customer message and determine its category. First, think through your reasoning in a REASONING section. Then provide your final classification in an ANSWER section.
REASONING:
[Your step-by-step analysis here]
ANSWER:
[Single category label]
This gives you the reasoning for debugging and logging while keeping the user-facing output clean. Parse the ANSWER section for the downstream system.
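Extracting the sections is a small parsing job. A sketch that assumes the model followed the REASONING:/ANSWER: labels from the pattern above, with a fallback when it didn't:

```python
# Split a response into its REASONING and ANSWER sections.
import re

def split_reasoning_answer(text):
    m = re.search(r"REASONING:\s*(.*?)\s*ANSWER:\s*(.*)", text, re.DOTALL)
    if not m:
        return None, text.strip()  # model ignored the format; keep raw output
    return m.group(1).strip(), m.group(2).strip()

response = """REASONING:
The customer mentions a double charge and asks for a refund.

ANSWER:
billing"""

reasoning, answer = split_reasoning_answer(response)
print(answer)
```

Log `reasoning` for debugging and pass only `answer` downstream.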
Controlling Reasoning Length
Sometimes CoT produces excessively long reasoning. You can constrain it.
- "Think through this in 3 concise steps, then give your answer."
- "Briefly explain your reasoning (2-3 sentences), then provide the answer."
- "Identify the key factors (maximum 4) and explain how they lead to your conclusion."
The goal is enough reasoning to improve accuracy without generating thousands of unnecessary tokens.
CoT with Different Models
Different models respond to CoT differently. GPT-4 and Claude 3.5 Sonnet produce structured, methodical reasoning with minimal guidance. Smaller models sometimes need more explicit instruction about what "step by step" means. When using smaller models, provide few-shot CoT examples rather than relying on zero-shot.
Some newer models (like OpenAI's o1 series) have built-in chain-of-thought that runs internally. For these models, adding "think step by step" is redundant and can actually slow down responses without improving quality. Check the model's documentation to know whether explicit CoT is needed.
Combining CoT with Other Techniques
CoT + Role Prompting
Setting a specific expert role before requesting chain-of-thought reasoning produces more domain-appropriate reasoning steps.
Prompt: You are a senior financial analyst. A client asks whether they should refinance their mortgage. Current rate: 6.5%, remaining balance: $320,000, 22 years left. New rate offered: 5.1%, closing costs: $8,500, 30-year term. Analyze this step by step from a financial advisory perspective.
The role prompt ("senior financial analyst") ensures the reasoning steps include relevant financial concepts (break-even analysis, total interest comparison, opportunity cost) rather than generic math.
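A break-even calculation is exactly the kind of step that role prompt should elicit. A simplified sketch using the figures from the prompt; note it ignores the term extension (22 to 30 years), taxes, and opportunity cost, all of which a real analysis must include.

```python
# Simplified refinance break-even using the prompt's figures.
def monthly_payment(principal, annual_rate, years):
    """Standard amortized payment: P*r / (1 - (1+r)^-n), r and n monthly."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

current = monthly_payment(320_000, 0.065, 22)
new = monthly_payment(320_000, 0.051, 30)
savings = current - new
breakeven_months = 8_500 / savings
print(f"current ${current:,.0f}/mo, new ${new:,.0f}/mo, "
      f"break-even in about {breakeven_months:.0f} months")
```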
CoT + Output Formatting
You can combine chain-of-thought with strict output formatting by instructing the model to reason first, then format its final answer in a specific structure.
"Think through the classification step by step. After your reasoning, output a JSON object with fields: category (string), confidence (high/medium/low), and reasoning_summary (one sentence)."
This gives you the accuracy benefits of CoT with the parseable output format you need for downstream processing.
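On the parsing side, a sketch of pulling the JSON object out of a response that leads with free-text reasoning. It assumes the model emits the object last, as instructed; the sample response is fabricated for illustration.

```python
# Extract the trailing JSON object from a reasoning-then-JSON response.
import json
import re

def extract_json(text):
    """Pull the last {...} block out of a response and parse it."""
    matches = re.findall(r"\{.*\}", text, re.DOTALL)
    if not matches:
        raise ValueError("no JSON object found in response")
    return json.loads(matches[-1])

response = """The message mentions a refund and a broken item, so this is a
support complaint rather than a sales inquiry.

{"category": "support", "confidence": "high",
 "reasoning_summary": "Refund request about a damaged product."}"""

result = extract_json(response)
print(result["category"])
```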
Measuring CoT Impact
Don't just assume CoT helps. Measure it.
Build an evaluation set of 50 to 100 test cases with known correct answers. Run them through your prompt with and without CoT. Compare accuracy, latency, and cost. Document the results.
In our community's experience, CoT typically improves accuracy by 15 to 40% on reasoning-heavy tasks, has minimal impact (under 5%) on simple tasks, adds 50 to 200% more tokens to the output, and increases latency by 30 to 100% depending on reasoning length.
The accuracy gain is almost always worth the cost increase for tasks where getting the right answer matters. For high-volume, simple tasks, skip CoT and save the tokens.
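The evaluation loop itself is a few lines. A sketch of the A/B harness: `run_prompt` is a placeholder for your model call, stubbed here with canned answers so the harness runs standalone; in practice you would also record latency and token counts per call.

```python
# A/B evaluation harness: accuracy with vs. without CoT on a labeled set.
def run_prompt(question, use_cot):
    # Placeholder: replace with a real API call. The stub pretends CoT
    # fixes one extra case, mirroring the apple example earlier.
    canned = {
        ("2+2?", False): "4", ("2+2?", True): "4",
        ("45-12+30-18?", False): "55", ("45-12+30-18?", True): "45",
    }
    return canned[(question, use_cot)]

def accuracy(cases, use_cot):
    correct = sum(run_prompt(q, use_cot) == expected for q, expected in cases)
    return correct / len(cases)

cases = [("2+2?", "4"), ("45-12+30-18?", "45")]
print("without CoT:", accuracy(cases, False))
print("with CoT:   ", accuracy(cases, True))
```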
For more on building effective prompts, check our best practices guide. For career guidance on putting these skills to work, see our career roadmap and job board.
Frequently Asked Questions
Does chain-of-thought prompting work with all AI models?
CoT works with all major large language models (GPT-4, Claude, Gemini, Llama), but effectiveness varies with model size. Large models (70B+ parameters) show the biggest improvements. Smaller models sometimes produce reasoning steps that look right but contain errors. They mimic the format without actually reasoning more carefully. For smaller models, few-shot CoT with explicit examples tends to work better than zero-shot "think step by step."
How much extra does chain-of-thought cost in API calls?
CoT typically increases output tokens by 50 to 200%, which directly increases API costs by the same amount. A classification that normally uses 50 output tokens might use 150 with CoT. At GPT-4 pricing, that difference is still a fraction of a cent per call. For high-volume applications processing millions of requests, the cost adds up. The calculation is simple: multiply your current output token costs by 2 to 3x and decide whether the accuracy improvement justifies it.
Can I use chain of thought for creative tasks?
Yes, but differently. For creative writing, asking the model to outline its approach before writing can improve structure and coherence. "First, plan the narrative arc, then write the story." This is CoT applied to planning rather than reasoning. Avoid asking for step-by-step analysis in the middle of creative output, as it breaks the flow. The planning-then-executing approach works well for essays, marketing copy, and structured creative work.
What is the difference between chain of thought and chain of thought with self-consistency?
Standard CoT generates one reasoning chain and one answer. Self-consistency generates multiple reasoning chains (typically 5 to 10) with higher temperature, then picks the most common final answer through majority voting. Self-consistency is more accurate but costs N times more, where N is the number of samples. Use standard CoT for most tasks. Use self-consistency when accuracy is critical and you can afford the extra API calls, such as medical coding, financial calculations, or legal analysis.
Should I use chain of thought in system prompts or user prompts?
Put the CoT instruction in the system prompt if you want the model to always reason step by step for every user message. Put it in the user prompt if you only need CoT for specific queries. For chatbots, system-level CoT makes every response longer and more expensive, even for simple greetings. A better approach: include CoT as a conditional in the system prompt. "For questions involving math, analysis, or multi-step reasoning, think through your answer step by step before responding. For simple factual questions, answer directly."