Understanding the breakthrough in AI that's crushing olympiad-level problems - and whether it's genuine reasoning or elaborate pattern matching
Imagine you're stuck on a complex maths problem at 2 AM. You open ChatGPT, paste the question, and... it spits out an answer instantly. Correct, but how did it get there? Now imagine asking the new OpenAI O1 model the same question. This time, it "thinks" for 30 seconds, showing you its step-by-step reasoning before arriving at the answer. The difference is quite striking.
We've entered the era of "reasoning LLMs" - models that don't just predict the next word, but supposedly think through problems like humans do. OpenAI's O1, DeepSeek-R1, and others are crushing benchmarks that stumped earlier models. They're solving olympiad-level maths, debugging complex code, and tackling scientific problems with remarkable accuracy.
But here's the thing - are they actually reasoning? Or are they just really, really good at pattern matching? This isn't just academic hairsplitting. Understanding what these models can (and can't) do is crucial for anyone building AI systems.
Let's dive into how reasoning LLMs work, how they're trained, and tackle the big philosophical question head-on.
What Are Reasoning LLMs, Anyway?
Here's the simplest way to think about it: traditional LLMs are like someone who's brilliant at finishing your sentences. Reasoning LLMs? They're more like someone who stops, thinks carefully, and works through a problem on paper before answering.
The key difference comes down to something psychologists call System 1 versus System 2 thinking. System 1 is fast and intuitive - like when you instantly recognise a friend's face or dodge an obstacle whilst walking. System 2 is slow and deliberate - like working through a complicated problem or planning a complex project.
Traditional LLMs excel at System 1. They process your input and rapidly predict the most probable next token (word or part of a word). It's fast, efficient, and works brilliantly for many tasks. But when you need multi-step reasoning? That's where they struggle.
Reasoning LLMs attempt to implement System 2 thinking. They take their time, break down problems into steps, verify their work, and even backtrack when they spot errors. The results speak for themselves.
Take OpenAI's O1 model, released in September 2024. On the American Invitational Mathematics Examination (AIME) - a test that's hard enough to identify the top 500 students in the United States - O1 scores 93%. For context, earlier models barely scraped past 40%. DeepSeek-R1, an open-source model released in January 2025, achieves similar performance whilst being transparent about its training methods.
The Breakthrough Moment
The foundation for all this was laid by a brilliant 2022 paper from Google Research. Jason Wei and his colleagues discovered something remarkable: if you simply prompt a large enough language model with "Let's think step by step," its reasoning abilities improve dramatically. They called this Chain-of-Thought (CoT) prompting.
The magic wasn't in teaching the model some new trick. The capability was already there, lurking in models with around 100 billion parameters or more. CoT prompting just brought it out. On the GSM8K benchmark of maths word problems, a 540-billion parameter model with CoT prompting achieved state-of-the-art accuracy, surpassing even specially fine-tuned models.
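The technique itself is almost embarrassingly simple. Here's a minimal sketch in Python - the question wording and prompt phrasing are illustrative placeholders, not the exact setup from the paper:

```python
# A minimal sketch of Chain-of-Thought prompting.
# The prompts below are illustrative placeholders, not the exact
# wording used by Wei et al.

question = (
    "A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?"
)

# Standard prompting: ask for the answer directly.
standard_prompt = f"Q: {question}\nA:"

# Chain-of-Thought prompting: the only change is nudging the model
# to spell out intermediate steps before committing to an answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# With any chat-completion style API, you send cot_prompt instead of
# standard_prompt and parse the final line for the answer.
```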
Here's what makes this fascinating: traditional LLMs know the answer immediately but can't easily show their work. It's like asking a chess grandmaster to explain every calculation they made in a split second. Reasoning LLMs, by generating intermediate steps, make their thought process transparent. And that transparency isn't just nice to have - it actually helps them arrive at better answers.
Think of it like cooking. A traditional LLM has memorised thousands of recipes and can instantly tell you what goes into a dish. A reasoning LLM reads the recipe, checks what ingredients it has, plans the steps, and adjusts on the fly if something's missing. Both might give you a decent meal, but you'd trust the second approach for a complicated French pastry.
How Reasoning LLMs Are Actually Trained
Now, let's get into the fascinating bit - how do you train a model to reason? The journey from "predict the next word" to "solve olympiad-level maths" is quite ingenious.
The Foundation: Learning to Show Your Work
It starts with Chain-of-Thought training. Instead of just training the model on question-answer pairs, you train it on examples that include the full reasoning process. If you're teaching it to solve "Sarah has 5 apples, gives away 2, how many left?", you don't just show it "3". You show it "Let's think step by step. Sarah started with 5 apples. She gave away 2. So 5 - 2 = 3 apples remaining."
Do this enough times with enough examples, and the model learns to generate these intermediate steps naturally. It's not just mimicking the format - larger models genuinely develop better reasoning capabilities through this process.
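Concretely, a single training record might look something like this - a hypothetical example, as real datasets vary in format and field names:

```python
# A hypothetical supervised fine-tuning record for Chain-of-Thought training.
# Field names are illustrative; real datasets differ.
cot_training_example = {
    "question": "Sarah has 5 apples and gives away 2. How many are left?",
    "target": (
        "Let's think step by step. "
        "Sarah started with 5 apples. "
        "She gave away 2, so 5 - 2 = 3. "
        "The answer is 3."
    ),
}
# The model is trained to generate the full `target` string, reasoning
# included, rather than just the bare answer "3".
```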
But here's where it gets clever. Xuezhi Wang and colleagues (also from Google Research) discovered you could make this even more robust through "self-consistency." Instead of generating one reasoning path, generate several. Then pick the answer that appears most often. It's like solving a puzzle multiple ways and being more confident if you get the same answer each time.
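Here's a minimal sketch of self-consistency, assuming a `generate` function that samples one reasoning path (with some temperature) and returns only the final answer it reached:

```python
from collections import Counter

def self_consistent_answer(generate, question, n_samples=5):
    """Sample several independent reasoning paths, then majority-vote.

    `generate(question)` is assumed to sample one chain of thought and
    return just the final answer - it's a placeholder, not a real API.
    """
    answers = [generate(question) for _ in range(n_samples)]
    best_answer, votes = Counter(answers).most_common(1)[0]
    return best_answer, votes / n_samples  # answer plus a rough agreement score
```

If four out of five sampled paths land on the same answer, you report that answer with far more confidence than any single path would justify.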
The Reinforcement Learning Revolution
Traditional supervised learning has a limitation: you need humans to write out all those reasoning steps. That's expensive, slow, and limited by human creativity. Enter reinforcement learning (RL).
DeepSeek's R1 model, released in January 2025, proved something remarkable: you can develop sophisticated reasoning through pure RL, without needing human-written reasoning examples. Let the model explore, reward it when it gets things right, and it develops its own reasoning strategies.
But not all rewards are created equal. This is where Process Reward Models (PRMs) versus Outcome Reward Models (ORMs) become crucial.
Outcome Reward Models are straightforward: did you get the right final answer? Yes? Here's your reward. No? No reward. It's simple but has a problem - if the model gets the wrong answer, you don't know where in its reasoning chain it went wrong.
Process Reward Models are more sophisticated. They reward (or penalise) each step of the reasoning process. If the model correctly identifies the problem in step 1, reward. Correctly breaks it down in step 2, reward. Makes an error in step 3, penalise. This granular feedback helps the model learn what good reasoning actually looks like.
Research shows PRMs significantly outperform ORMs for mathematical reasoning. It makes sense, really - it's the difference between a teacher marking just your final exam score versus providing feedback on every question.
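The difference is easiest to see in code. This is a toy sketch - the scoring functions stand in for learned reward models, not anything from a real training pipeline:

```python
def outcome_reward(final_answer, correct_answer):
    """ORM-style signal: one number for the whole solution."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(steps, step_scorer):
    """PRM-style signal: one score per reasoning step.

    `step_scorer(step)` stands in for a learned model that rates a single
    step as sound or flawed, returning a float in [0, 1].
    """
    return [step_scorer(step) for step in steps]

# With an ORM, a five-step solution that fails at step 3 gets a flat 0.0.
# With a PRM, steps 1 and 2 still earn credit, so the training signal
# points at where the reasoning actually broke down.
```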
DeepSeek's Four-Phase Training Pipeline
DeepSeek's R1 model reveals the modern approach to training reasoning LLMs. It's a four-phase process:
Phase 1 - Cold Start: Begin with supervised fine-tuning on a small dataset of high-quality, readable examples. This gives the model a foundation to build on.
Phase 2 - Reasoning-Oriented RL: This is where the magic happens. Large-scale reinforcement learning on maths, coding, and logical reasoning tasks. They use an algorithm called Group Relative Policy Optimization (GRPO), which is 4.5 times faster than previous approaches. The rewards are rule-based: accuracy rewards for getting things right, plus format rewards to ensure the model's outputs are well-structured.
Phase 3 - Rejection Sampling + SFT: Generate numerous outputs, use another model to grade them, keep only the correct and readable ones, then fine-tune on this filtered data combined with other domain knowledge.
Phase 4 - Diverse RL: Continue reinforcement learning across an even broader range of scenarios.
The fascinating bit? The model develops capabilities nobody explicitly taught it. Self-reflection: "Wait, that doesn't look right..." Self-correction: going back to re-evaluate flawed steps. Researchers observed "aha moments" during training where the model suddenly figured out how to catch its own errors.
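To make GRPO less abstract, here's a toy sketch of its central trick: rather than training a separate value network, each sampled answer is scored relative to the other answers in its own group. The reward function below is a made-up stand-in for the rule-based accuracy and format rewards described above:

```python
import statistics

def rule_based_reward(output, correct_answer):
    """Toy stand-in for DeepSeek-style rule-based rewards:
    accuracy (right answer) plus a small format bonus."""
    reward = 0.0
    if output.get("answer") == correct_answer:
        reward += 1.0                      # accuracy reward
    if output.get("well_formatted"):
        reward += 0.1                      # format reward
    return reward

def group_relative_advantages(outputs, correct_answer):
    """Core GRPO idea: score each sampled output against its own group,
    so no separate critic/value model is needed."""
    rewards = [rule_based_reward(o, correct_answer) for o in outputs]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid dividing by zero
    return [(r - mean) / std for r in rewards]
```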
Inside the Architecture: What's Actually Happening?
Let's peek under the hood. How does a reasoning LLM actually work when you give it a problem?
OpenAI hasn't fully disclosed O1's internals, but researchers have reverse-engineered its behaviour into a six-step process:
1. Problem Analysis: The model rephrases the problem and identifies key constraints. It's not just reading your question - it's making sure it understands what you're really asking.
2. Task Decomposition: Complex problems get broken into smaller, manageable sub-problems. This is crucial. Humans do this naturally; teaching AI to do it is a big deal.
3. Systematic Execution: Build the solution step-by-step. Each step builds on the previous one, with explicit connections between them.
4. Alternative Solutions: Here's where it gets interesting - the model explores multiple approaches rather than committing to the first one that comes to mind. This is genuine exploratory thinking.
5. Self-Evaluation: Regular checkpoints to verify progress. "Does this step make sense given what came before? Am I still on track?"
6. Self-Correction: If errors are detected during self-evaluation, fix them immediately rather than ploughing ahead.
Let's say you ask it to solve a complex algebra problem. It might first rephrase it in simpler terms (step 1), break it into solving for x, then y, then combining them (step 2), work through each part systematically (step 3), try both substitution and elimination methods (step 4), check if intermediate results make sense (step 5), and backtrack if something doesn't add up (step 6).
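None of this corresponds to published code, but the reverse-engineered behaviour maps naturally onto a loop like the sketch below - every callable here is a hypothetical stand-in for something the model does internally, not a real API:

```python
def reasoning_loop(problem, analyse, decompose, solve_step, check, revise):
    """Rough sketch of the six-step pattern researchers infer from O1's
    behaviour. Each callable is a hypothetical stand-in."""
    restated = analyse(problem)                   # 1. problem analysis
    subproblems = decompose(restated)             # 2. task decomposition
    partial_results = []
    for sub in subproblems:
        result = solve_step(sub, partial_results)     # 3. systematic execution
        if not check(result, partial_results):        # 5. self-evaluation
            # 4./6. try an alternative approach and correct the flawed step
            result = revise(sub, partial_results)
        partial_results.append(result)
    return partial_results[-1]
```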
The Hidden Cost: Reasoning Tokens
Here's something most users don't realise: all that thinking has a cost. OpenAI's O1 uses something called "reasoning tokens" - essentially, internal tokens for its thinking process. You don't see these tokens in the output, but they consume context window space and you're billed for them as output tokens.
This is why O1 is slower and more expensive than GPT-4. When it's thinking for 30 seconds before answering, it's actually generating thousands of hidden reasoning tokens. The model adjusts this reasoning time based on problem complexity - simple questions get quick answers, hard problems get deep thought.
It's a tradeoff: better answers versus higher computational cost and longer wait times. For simple queries, you probably don't need it. For debugging a tricky piece of code or working through a complex mathematical proof? The extra cost is often worth it.
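A quick back-of-the-envelope sketch shows why this matters for your bill. The per-token prices below are made-up placeholders - check your provider's current rates:

```python
def estimate_cost(prompt_tokens, visible_output_tokens, reasoning_tokens,
                  input_price_per_1k=0.005, output_price_per_1k=0.015):
    """Hidden reasoning tokens are billed at the output rate even though
    you never see them. Prices here are placeholders, not real rates."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (billed_output / 1000) * output_price_per_1k

# A short question that triggers 6,000 hidden reasoning tokens costs far
# more than its 200-token visible answer suggests.
print(estimate_cost(500, 200, 6000))  # ~0.0955, versus ~0.0055 with no reasoning
```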
The Big Debate: Are They Actually Reasoning?
Right, let's tackle the elephant in the room. We've talked about what reasoning LLMs do, but are they genuinely reasoning, or just very sophisticated pattern matchers? The AI research community is quite divided on this.
The Case FOR Reasoning
If you look at what these models can do, it's tempting to call it reasoning. Here's the evidence:
Emergent abilities at scale: Reasoning capabilities appear naturally in large enough models. Nobody explicitly programmed in the ability to solve olympiad maths - it emerged from training. That's remarkable.
Novel problem-solving: These models handle tasks that aren't in their training data. Recent research on coding tasks showed reasoning models maintaining consistent performance on out-of-distribution problems. If they were just matching patterns from training, they'd fail on genuinely novel tasks.
Structured internal strategies: A January 2026 paper on propositional logical reasoning found evidence of "structured, interpretable strategies" in how LLMs process logic - not just opaque pattern matching.
Self-verification and correction: They catch their own errors and re-evaluate. That's not something simple pattern matching would do naturally.
If something solves problems systematically, adjusts its strategy based on intermediate results, explores alternatives, and self-corrects... isn't that reasoning? At least functionally?
The Case AGAINST: It's Pattern Matching All the Way Down
But here's the other side, and it's argued quite forcefully by people like Yann LeCun (Meta's Chief AI Scientist and a Turing Award winner).
Statistical foundation: Ultimately, these models are predicting the most probable next token based on statistical patterns in their training data. That's the fundamental mechanism, however sophisticated.
Training data dependency: Chain-of-Thought works brilliantly... because the training data contains massive amounts of human-written reasoning examples. The model learns to replicate the form of reasoning without necessarily understanding the content. It's excellent pattern completion.
Prompt sensitivity: Change the wording of a problem slightly, and performance can drop sharply. True reasoning should be robust to superficial changes in presentation.
Hallucinations in reasoning: LLMs generate plausible-sounding but completely wrong reasoning steps. They can construct elaborate, logical-looking arguments that lead to nonsense. That's concerning.
No world model: As LeCun emphasises, these models lack understanding of causality, physics, and common sense. They don't build internal models of how the world works - they just predict text. A four-year-old child has processed vastly more sensory data and built richer world models than the largest LLM.
Solving unsolvable problems: Give an LLM a paradox or a question with no answer, and instead of recognising the impossibility, it'll try to provide a solution based on learned patterns. True reasoning would identify when a problem is malformed.
LeCun's critique is sharp: LLMs are "elaborate mimicry, not intelligence." He argues that scaling up language models is a "dead end" for achieving general intelligence, and that we need fundamentally different architectures (like his proposed "world models") to get there.
The Nuanced Truth
So who's right? Well, it depends on how you define "reasoning."
If reasoning means: systematic, logical thought leading to accurate conclusions
Yes, reasoning LLMs qualify. They demonstrably perform systematic analysis and reach sound conclusions on complex problems.
If reasoning means: genuine understanding, consciousness, causal comprehension independent of statistical correlation
No, they're sophisticated pattern matchers. They don't "understand" in any human sense.
Here's the practical reality for those of us building AI systems: these models exhibit behaviours consistent with reasoning whilst using pattern recognition as their mechanism. They're reasoning-capable, not truly reasoning. And that distinction matters.
Why it matters:
- Know when to trust them: Verifiable domains like maths and code? Excellent. Common-sense reasoning about novel physical situations? Not so much.
- Know their blindspots: They struggle with tasks requiring genuine world knowledge or causal understanding.
- Use verification: For critical applications, always verify outputs with external tools or human review.
I think the most useful frame is: they're powerful tools that can augment human reasoning, not replace it. Use them where they excel, be cautious where they struggle, and always maintain oversight.
Performance and Benchmarks: How Good Are They Really?
Let's talk numbers. How do reasoning LLMs actually perform?
The Benchmark Saturation Era
By 2024, we hit an interesting milestone: the traditional benchmarks were too easy. Claude 3.5 Sonnet scores 96.4% on GSM8K (grade school maths word problems). Kimi K2 hits 95%. At this point, the benchmark isn't differentiating between top models anymore - they've all basically maxed out.
GSM8K was brilliant for measuring improvement from GPT-2 to GPT-4. But when everyone's scoring above 95%, you need harder tests.
The New Frontier: AIME and Expert-Level Benchmarks
Enter the American Invitational Mathematics Examination (AIME). This is serious stuff - problems that identify the top 500 mathematics students in the United States. It's not just applying formulas; it requires genuine problem-solving creativity.
Here's where it gets exciting:
- OpenAI O1: 93% on AIME 2024 (placing it among top 500 students nationally)
- Grok 3 beta: 93.3% on AIME 2025, 95.8% on AIME 2024
- DeepSeek-R1: 86.7% on AIME 2024 with majority voting
- Gemini 3 Pro: Reportedly 95%
Some sources claim GPT-5.2 hit a perfect 100% on AIME 2025, though this remains to be independently verified.
The trajectory is remarkable. Just two years ago, these problems stumped the best models. Now they're achieving gold-medal performance in mathematics competitions.
Beyond AIME, new benchmarks are emerging:
- GPQA: Graduate-level questions in chemistry, physics, and biology
- Humanity's Last Exam (HLE): Designed to be at the frontier of what's currently possible
The Performance Trajectory
Here's a striking statistic: the ability of state-of-the-art models to complete complex tasks is doubling approximately every seven months. If this trend continues (and that's a big if), we could see autonomous AI agents handling week-long tasks within the next few years.
2025 is being called "the year of reasoning" in AI circles. The focus has shifted from simply making models larger to making them think more effectively. Techniques like Reinforcement Learning from Verifiable Rewards (RLVR) - training models specifically to optimise for provably correct outputs - are becoming standard practice.
Real-World Applications and Critical Limitations
Let's get practical. Where should you actually use reasoning LLMs, and where should you be cautious?
Where Reasoning LLMs Excel
Mathematical problem-solving: This is the sweet spot. The model shows its work, you can verify each step, and it catches its own computational errors. Perfect for educational tools, automated grading, or helping students understand problem-solving approaches.
Code generation and debugging: Reasoning through code logic step-by-step produces better results than instant code completion. The model can explain why it chose a particular approach, identify edge cases, and debug issues systematically. I've seen it catch subtle concurrency bugs that took humans hours to spot.
Scientific analysis: Multi-step hypothesis testing, experimental design, and data interpretation all benefit from systematic reasoning. Researchers are using these models to help analyse complex datasets and propose experimental approaches.
Complex planning: Breaking down large tasks into subtasks, identifying dependencies, and creating execution strategies. This is useful for project planning, system design, and strategic decision-support.
Why they work well in these domains:
- Verifiable - you can check if the answer is right
- Logical structure - problems have clear reasoning paths
- Step decomposition helps - breaking things down actually improves performance
Critical Limitations You Need to Know
But - and this is important - reasoning LLMs have significant limitations:
1. Hallucination in reasoning steps: They can generate plausible, logical-sounding arguments that are completely wrong. The reasoning looks good, the steps seem to follow, but the underlying logic is flawed. This is dangerous because it's harder to spot than a simple factual error.
2. Computational cost: O1 is roughly 5-10x slower and more expensive than GPT-4. For many use cases, that cost isn't justified. You wouldn't use it to summarise a document or answer simple questions.
3. Prompt brittleness: Slight changes in how you phrase a question can lead to significant performance differences. This makes them less robust than you'd want for production systems.
4. No true common sense: Ask it to reason about everyday physical situations or social dynamics, and the cracks show. It hasn't built the rich world models humans develop through lived experience.
5. Relational reasoning gaps: Complex hierarchies, long-term causal chains, and nuanced relationships remain challenging. Human-level reasoning in these areas is still far off.
6. Ethical inconsistency: Unlike humans who (generally) apply consistent moral frameworks, LLMs produce unreliable ethical reasoning, contradicting themselves across similar scenarios.
Mitigation Strategies
So how do you work with these limitations?
Chain-of-Thought prompting: Explicitly ask for step-by-step reasoning. This doesn't eliminate errors but makes them easier to spot.
Self-consistency: Generate multiple reasoning paths and check if they agree. If five different approaches give you the same answer, you can be more confident.
External verification: Use specialised tools to verify outputs. For code, run it through compilers and tests. For maths, check calculations with symbolic math libraries. Don't trust the LLM alone.
Retrieval-Augmented Generation (RAG): Ground responses in factual, verified data rather than relying solely on the model's parametric knowledge.
Human-in-the-loop: For high-stakes decisions, always have human review. The LLM can draft, analyse, and suggest, but humans should approve.
Think of reasoning LLMs as brilliant but unreliable interns. They can do impressive work, but you'd never let them make critical decisions without oversight.
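One way to make the "external verification" advice above concrete: check a model's claimed maths answer with a symbolic solver rather than taking its reasoning on trust. Here's a minimal sketch using SymPy - the equation and claimed answers are just examples:

```python
import sympy as sp

def verify_claimed_root(equation_str, variable, claimed_value):
    """Check a model's claimed solution against a symbolic check.

    `equation_str` is an expression like "2*x + 3 - 11", interpreted as = 0.
    """
    x = sp.Symbol(variable)
    expr = sp.sympify(equation_str)
    residual = expr.subs(x, claimed_value)
    return sp.simplify(residual) == 0

# Example: the model claims x = 4 solves 2x + 3 = 11.
print(verify_claimed_root("2*x + 3 - 11", "x", 4))   # True
print(verify_claimed_root("2*x + 3 - 11", "x", 5))   # False
```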
The Road Ahead: What's Next for Reasoning AI?
We're at an inflection point. Here's what's coming and what to watch for.
2025 Trends
Reinforcement Learning from Verifiable Rewards (RLVR) is becoming the dominant training paradigm. Instead of just learning from human feedback, models are trained to optimise for provably correct outputs. This works brilliantly for maths and code where correctness is verifiable. The challenge now is extending it beyond STEM - can you use RLVR for legal reasoning? Philosophy? Creative problem-solving?
Distillation techniques are improving rapidly. Researchers are finding ways to transfer reasoning capabilities from massive models like O1 and DeepSeek-R1 into smaller, faster, cheaper models. This could make reasoning capabilities accessible for edge deployment and cost-sensitive applications.
Domain-specific reasoning models: Instead of one giant model that reasons about everything, expect to see specialised models optimised for specific domains - medical diagnosis, financial analysis, legal research. These can be smaller, faster, and more accurate within their domain.
Near-Term Expectations (6-12 months)
- More open-source reasoning models: DeepSeek-R1's release has opened the floodgates. Expect more open-source alternatives matching proprietary performance.
- Cheaper reasoning: Competition and optimisation will drive costs down. What costs ₹5 per query now might cost ₹0.50 in a year.
- Better transparency: Current reasoning processes are partially hidden. Expect better tools to visualise and understand how models arrive at conclusions.
- Hybrid approaches: Combining reasoning LLMs with traditional algorithms, knowledge graphs, and specialised solvers for more robust systems.
Key Questions to Watch
Can reasoning transfer to truly novel domains? Current success is mostly in domains with clear right/wrong answers. What about creative reasoning, ethical deliberation, or strategic planning where there's no single correct answer?
Will costs come down enough for widespread deployment? Reasoning capabilities are impressive but expensive. Broader adoption needs lower costs.
Can we solve the hallucination problem? Until we can reliably prevent hallucinations in reasoning steps, human oversight remains essential. This is the key unsolved challenge.
What's the next benchmark frontier? AIME will eventually saturate like GSM8K did. What comes next? Perhaps research-level problems or long-horizon tasks requiring days of reasoning?
For Practitioners: What You Should Do Now
Experiment now while the field is young. Understanding how to prompt, verify, and integrate reasoning capabilities gives you a competitive edge. The techniques you develop now will compound as models improve.
Build with verification in mind. Don't architect systems that blindly trust LLM outputs. Design for verification, validation, and human oversight from day one.
Watch the open-source space. DeepSeek-R1 proved open-source can match proprietary quality. You might not need to depend on expensive API calls forever.
Think hybrid. The best systems combine LLM reasoning with traditional tools. Use LLMs for what they're good at (ideation, decomposition, exploration) and other tools for what they excel at (exact calculation, database queries, rendering).
Conclusion: Reasoning-Capable, Not Truly Reasoning
Let's bring this all together.
Reasoning LLMs represent a genuine leap forward in AI capabilities. Whether they "truly" reason in some philosophical sense matters less than understanding what they can practically achieve - and they can achieve quite a lot.
The bottom line for AI engineers and data scientists:
1. Use them for verifiable domains. Maths, code, and formal logic where you can check answers? Excellent. Vague, subjective, or common-sense reasoning? Be cautious.
2. Always verify. Don't trust reasoning blindly, especially in critical applications. Build verification into your workflow.
3. Understand the tradeoff. Better quality comes with higher cost and latency. Not every problem needs reasoning capabilities - choose appropriately.
4. Watch the space rapidly evolve. With performance doubling every seven months and open-source alternatives emerging, what's expensive and proprietary today might be cheap and accessible tomorrow.
5. Think hybrid architectures. Combine reasoning LLMs with traditional tools, domain knowledge, and human expertise. The best systems leverage multiple complementary approaches.
The real question isn't "are they reasoning?" It's "when should I use reasoning capabilities?" The answer: when the problem is complex, systematically decomposable, verifiable, and the cost is justified by the value.
We're in early days. These models will get better, cheaper, and more reliable. The models we're discussing today will look primitive in two years. But the fundamental principles - understanding their capabilities, limitations, and appropriate use cases - will remain relevant.
Now, let's see what you build with them.
References
- Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903.
- DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948.
- OpenAI (2024). "Learning to Reason with LLMs." OpenAI Blog.
- Wang, X., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171.
- Lightman, H., et al. (2023). "Let's Verify Step by Step." arXiv:2305.20050.
- de Gregorio, I. (2024). "Do Large Language Models Actually Reason or Pattern Match?" Medium.
- Various Authors (2024). "Dual Process Theory and Large Language Models." arXiv:2407.14054.
- Various (2024). "Yann LeCun on LLM Limitations and World Models." Industry publications and interviews.
- OpenAI, Various (2024-2025). "Reasoning LLM Benchmark Results: MATH, GSM8K, AIME." Official announcements and leaderboards.
- Wikipedia Contributors (2024). "Hallucination (Artificial Intelligence)."
- Willison, S. (2025). "2025: The Year of Reasoning Models." Simon Willison's Blog.
- Towards Data Science (2025). "DeepSeek-R1: Training Reasoning Models Cost-Effectively."
- GitHub Community. "Chain-of-Thought Implementation Examples."
- OpenAI (2024). "OpenAI Reasoning Models Documentation."
- DeepSeek-AI (2025). "DeepSeek-R1 GitHub Repository."
Written by Girijesh Prasad
AI Engineer & Multi-Agent Expert
4 February 2026