Have o1 Models Cracked Human Reasoning?
Discover how o1 models work in a speculative exploration, and discover whether LLMs have cracked human reasoning.
OpenAI set the AI world abuzz with the release of their o1 models. As the dust settles on this news, I can't help but feel this is the perfect moment to share my thoughts on LLM reasoning, as someone who has spent a good chunk of their research understanding the capabilities of LLMs on compositional reasoning tasks. It's also an opportunity to open up publicly about the many “Faith and Fate” questions and concerns I have been receiving over the past year: Do LLMs truly reason? Have we achieved AGI? Can they really not solve simple arithmetic questions?
The buzz around the o1 models, code-named “strawberry,” has been building since August, fueled by rumors and media speculation. Last Thursday, Twitter exploded with OpenAI employees celebrating o1's performance boost on several reasoning tasks. The media amplified the excitement with headlines claiming that “human-like reasoning” is practically a solved problem in LLMs.
Without a doubt, o1 is extremely powerful and different from any other model. It's an incredible effort from OpenAI to release these models, and it's mind-blowing to see the significant jump in Elo scores on ChatBotArena compared to the incremental improvements from other large players. ChatBotArena remains the top platform for assessing models on live, real-world tasks where companies cannot easily game the system.
To those outside the research community, it might seem as if a revolutionary new paradigm has suddenly emerged from OpenAI's labs. However, it’s necessary to provide context around the “o1 models” and reasoning in general, and to explain how foundational work by numerous researchers (including OpenAI employees) has led to such development.
My goal in this blog is to take you beneath the surface of o1 and uncover its magic. I will explain that while we're making significant strides in LLM reasoning, we're not quite there yet. The problem is far from solved.
[Disclaimer: I don't actually know how the o1 model works; this is just my speculation.]
Is o1 a Magical Model? Deciphering o1's Training
We still don’t know the exact details of o1’s inner workings due to the high level of secrecy, but as a researcher who has been working on LLMs and reasoning, I can make an educated guess about how it might function. It’s not clear whether o1 is a single model or a set of models; there are different views and speculations about this. But I suspect the core concept revolves around a unified system. o1’s lead has confirmed that it's one model, though the specifics of its architecture remain undisclosed.
OpenAI hasn’t yet released the full “o1” model; it has only released the “o1-preview” model, probably for infrastructure reasons related to serving it to millions of users, or perhaps because it's not as safe as they want it to be. However, we can already see the gap between the “GPT-4o” and “o1” models. There's even a gap between humans and these models on PhD-level questions.
According to OpenAI’s blog post, these models are trained using reinforcement learning (RL) with chain of thoughts (CoT) to reason through problems step by step before generating the final answer. Making RL work successfully at such a large scale is undoubtedly a significant achievement and an impressive feat of engineering.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
Speculated Training Process
How was o1 trained? It's uncertain whether it involves a sophisticated, newly invented RL algorithm that explicitly trained the model to backtrack when it makes a mistake, or whether it follows more common, standard techniques. What seems evident to me is that o1 likely follows the successful alignment pipeline that has become standard in the field. This process typically begins with Supervised Fine-Tuning (SFT) on CoT data, followed by scoring the generated reasoning chains using a process reward model. The resulting score is then used to optimize the model, improving its ability to think through solutions. The optimization likely uses policy gradient algorithms such as Proximal Policy Optimization (PPO). PPO involves generating online completions, and to produce these completions, OpenAI must have explored different reasoning paths in search of optimal trajectories. I will discuss below various techniques that may have been used.
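To make the speculated pipeline concrete, here is a deliberately tiny, stdlib-only sketch of its last stage: a policy updated by a REINFORCE-style policy gradient against per-step rewards from a stand-in process reward model. Everything here (the "reasoning step" vocabulary, the reward values, the tabular policy) is invented for illustration; o1's actual algorithm is unknown.

```python
import math, random

random.seed(0)

STEPS = ["decompose", "guess", "verify"]   # toy "reasoning step" vocabulary
logits = {s: 0.0 for s in STEPS}           # tabular stand-in for a policy network

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {s: math.exp(v) / z for s, v in logits.items()}

def process_reward(step):
    # Stand-in for a learned process reward model (PRM):
    # careful steps score high, blind guessing scores negatively.
    return {"decompose": 1.0, "verify": 0.5, "guess": -1.0}[step]

def reinforce_update(lr=0.5, n_samples=200):
    # Sample a step from the policy, score it with the PRM, and push
    # probability mass toward high-reward steps (REINFORCE gradient).
    for _ in range(n_samples):
        p = probs()
        step = random.choices(STEPS, weights=[p[s] for s in STEPS])[0]
        r = process_reward(step)
        for s in STEPS:
            grad = (1.0 if s == step else 0.0) - p[s]  # d log pi / d logit_s
            logits[s] += lr * r * grad

reinforce_update()
final = probs()
print(sorted(final, key=final.get, reverse=True))  # steps ranked by learned preference
```

The real system would replace the tabular policy with the LLM itself and the toy reward with a trained PRM scoring each chain-of-thought step, but the optimization loop has the same shape.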
The key difference between o1's training and that of previous models (GPT4, ChatGPT, GPT4-o) is predominantly the nature of the training data and rewarding the intermediate steps leading to the solution. In previous iterations, models were primarily trained on (prompt, solution) pairs: given a problem as input, the model was expected to produce the final answer as output. In contrast, o1 is trained on input problems paired with detailed, step-by-step explanations of how to reach the solution and each step of the solution is then scored.
Data collection: OpenAI likely paid highly skilled annotators to create complex reasoning paths, probably including multiple paths for single problems. They likely also collected incorrect chain-of-thought paths that are difficult to identify as wrong, creating contrastive sets for a stronger learning signal. Besides human data, they have likely also collected large-scale synthetic data seeded by human data. Automating this process is challenging, as a single mistake in a step can lead to exponential error propagation, so the synthetic data must have been audited and filtered to ensure quality. Getting this high-quality data is the crucial part of building o1. OpenAI is clearly aware of the sensitivity of this information, hence their decision not to display the reasoning trace to users and their efforts to prevent attempts to extract this data from the model through jailbreaking.
Inference-Time Generation
The real innovation in these models seems to be in the inference stage, which requires significant computational resources and engineering. o1 models seem to be the first product to enable large-scale text search in real time, marking a significant breakthrough that is set to revolutionize deployment frameworks and raise expectations for AI products. The improvements we're seeing confirm the existence of inference-time scaling laws: the fraction of problems solved scales with the number of samples generated from the model at inference time. Consider this: the raw GPT-4 might generate only 1 correct response out of 1000 samples, a success rate of just 0.1%. By using a reward model to return the highest-scoring answer to the user, the success rate can in principle skyrocket to 100%. But is this real improvement or just clever filtering?
Google released a paper on inference-time scaling laws in late July. Their experiments reveal a clean relationship between the probability of finding the correct answer and the number of answers generated by the LLM. This relationship is often log-linear and can be modeled with an exponentiated power law.
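A concrete way to quantify this kind of scaling is the standard unbiased pass@k estimator (introduced in Chen et al.'s Codex paper), which gives the probability that at least one of k samples is correct, given that c out of n generated samples were correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given c correct samples out of n generated."""
    if n - c < k:
        return 1.0  # not enough wrong samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1000, c=1, k=1))     # 0.001 -> the raw 0.1% success rate
print(pass_at_k(n=1000, c=1, k=1000))  # 1.0   -> certain, if you can pick the right one
```

The gap between those two numbers is exactly the gap a good reward model is supposed to close: sampling buys you coverage, but only a reliable scoring function converts coverage into accuracy.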
How does o1 work at inference time? During inference, o1 generates multiple chains of thought (what OpenAI refers to as the model “thinking”). It then searches for the best answer, using a scoring function to evaluate the different reasoning trajectories. This process likely involves generating thousands, if not hundreds of thousands, of potential reasoning paths, distributing the generation across many GPUs, and then scoring and selecting the most promising solutions. The scale of this inference-time computation is unprecedented, potentially involving hundreds or thousands of GPUs working in parallel to explore the vast space of possible reasoning paths. The number of chain-of-thought steps generated at inference time is currently set by OpenAI, and users cannot control it. This has significant implications for the response time – essentially, how long one must wait for an answer, or “how much the model is thinking.” OpenAI has indicated that in the future, they plan to give users more control over this waiting time.
Jim Fan summarized nicely in a Twitter post the shifting compute expenditures for next-generation AI systems.
So what kind of inference algorithms might OpenAI have used?
Inference-Time Decoding Algorithms
The most straightforward way to generate text at inference time is to use greedy sampling, or nucleus sampling combined with top-k sampling, without imposing any constraints on the generation. However, these techniques are suboptimal and don't necessarily lead to the best outcome. They can also be inefficient: there's no point in continuing a path that is doomed from its very first incorrect steps. This is where decoding-time algorithms come into play. Controlled decoding has seen significant success in the NLP community, as it improves model performance without going through intensive training. Examples include NeuroLogic decoding (Lu et al., 2020) and GBS (Hokamp and Liu, 2017), which generalize beam search for lexically constrained decoding, DExperts (Liu et al., 2021b), and more. OpenAI has likely used a reward-guided decoding algorithm similar to Tree of Thoughts, which I explain below.
Tree of Thoughts:
(Yao et al., 2023) introduced this approach last year. It uses LLM self-feedback: sampling one thought at a time leading toward the solution, then self-evaluating by asking the model whether the generated step is likely correct. If a step is deemed incorrect, the model won't follow that path further. By rating multiple future reasoning steps before proceeding, the model has opportunities to identify and correct errors. This aligns with many key successes in deep reinforcement learning, such as AlphaGo. The process involves decoding algorithms that balance exploration of different thought paths with exploitation of promising directions. OpenAI, at least during training, has a mechanism to adjust the amount of compute used through search depth, search width, or heuristic calculation (scoring the nodes of the search).
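A minimal sketch of the idea (not Yao et al.'s actual implementation): expand partial “thoughts,” score each one with an evaluator standing in for LLM self-feedback, prune doomed paths, and keep a small beam. The toy task here, building a three-digit sequence whose digits sum to a target, is invented purely for illustration.

```python
TARGET, DEPTH, BEAM = 15, 3, 4

def propose(state):
    # Stand-in for "sample one thought at a time" from an LLM.
    return [state + [d] for d in range(10)]

def evaluate(state):
    # Stand-in for self-evaluation: can this partial path still reach TARGET?
    remaining = DEPTH - len(state)
    s = sum(state)
    if s > TARGET or s + 9 * remaining < TARGET:
        return 0.0  # doomed path: prune it, don't expand further
    # Heuristic score: prefer paths whose running sum is on track.
    return 1.0 / (1 + abs(TARGET - s - 7.5 * remaining))

def tree_of_thoughts():
    frontier = [[]]
    for _ in range(DEPTH):
        candidates = [c for s in frontier for c in propose(s)]
        scored = sorted(candidates, key=evaluate, reverse=True)
        frontier = [c for c in scored if evaluate(c) > 0][:BEAM]  # beam pruning
    return frontier[0]

best = tree_of_thoughts()
print(best, sum(best))
```

The depth, beam width, and heuristic here are exactly the knobs the paragraph above describes: they control how much compute the search spends before committing to an answer.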
Inference-Time Scoring Functions
Scoring in LLMs via Consensus
Consensus is the process of generating multiple candidates and selecting the most common one using regex exact matching. A straightforward way to pick the best response out of many candidates is majority voting: you choose the answer that the largest number of reasoning paths agree on. Consensus is easy to implement when the task is to return a single number. However, it's challenging to achieve consensus when writing a proof, as it's unlikely that the model will generate the same proof multiple times. By applying this consensus method over 1000 samples, Minerva from Google Research improved its performance on the MATH dataset from 33.6% to 50.3%.
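In code, the consensus idea is just a few lines: extract each sample's final answer with a regex and return the most common one. The answer format below is an assumption for illustration; real systems match whatever final-answer template the model was trained to emit.

```python
import re
from collections import Counter

def extract_answer(completion: str):
    # Regex exact-matching of a final numeric answer (assumed template).
    m = re.search(r"answer is (-?\d+)", completion)
    return m.group(1) if m else None

def consensus(completions):
    # Majority vote over the extracted answers; samples with no
    # recognizable answer simply don't get a vote.
    votes = Counter(a for a in map(extract_answer, completions) if a is not None)
    return votes.most_common(1)[0][0]

samples = [
    "… so the answer is 42",
    "… therefore the answer is 41",
    "… the answer is 42",
    "no final answer given",
]
print(consensus(samples))  # → 42
```

This also makes the proof-writing limitation obvious: two correct proofs almost never match under exact string comparison, so the vote collapses.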
Scoring in LLMs using Best-of-N:
Best-of-N is a method where multiple solutions are sampled and then scored using a reward model. The solution with the highest score is returned. With a sufficiently accurate reward model, Best-of-N can outperform consensus methods. However, this approach is ultimately limited by the quality of the reward model and risks overfitting to errors. In this context, the reward model is often referred to as the “outcome” model.
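Best-of-N is equally simple to sketch: score every candidate with an outcome reward model and return the argmax. The `reward_model` below is a hypothetical stand-in (it just prefers solutions that show their work), not a real API.

```python
def reward_model(solution: str) -> float:
    # Toy outcome reward: a stand-in scorer that prefers solutions
    # showing intermediate work. A real system uses a trained model.
    return float(len(solution.split()))

def best_of_n(candidates):
    # Return the single highest-scoring candidate.
    return max(candidates, key=reward_model)

candidates = ["42", "6 * 7 = 42", "first factor 42 as 6 * 7, so 42"]
print(best_of_n(candidates))
```

The structural weakness mentioned above is visible here too: the selected answer is only as good as `reward_model`, so any systematic scoring error gets amplified as N grows.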
Scoring using Process Reward Models:
Scoring using Process Reward Models (PRMs) involves verifying every step individually rather than the sample as a whole. This method is more effective than both outcome reward models (which score only the final solution, not the individual steps) and majority voting when searching over a large number of model-generated solutions. PRMs provide a richer signal because they specify both how many of the first steps were correct and the precise location of any incorrect steps. However, there is no simple way to automate process supervision; it relies on human data labelers to label the correctness of each step in model-generated solutions. To reduce dependence on costly human feedback, large-scale models are used to supervise small-scale model training. o1 likely used PRMs to score the generated chains of thought both during training and at inference time. The image below from @prnvrdy provides a nice illustration.
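The contrast with outcome scoring can be made concrete: score each step separately and aggregate, with min as a common aggregation choice (a chain is only as good as its weakest step). The per-step scorer below is a hypothetical stand-in for a learned PRM.

```python
def chain_score(steps, prm):
    """Score a chain of thought step by step and aggregate with min,
    so the result also reveals WHERE the chain goes wrong."""
    scores = [prm(s) for s in steps]
    return min(scores), scores

# Toy PRM: a stand-in that flags one known-bad step with a low score.
toy_prm = lambda step: 0.1 if "so 13 * 4 = 54" in step else 0.9

steps = ["13 * 4 = 13 * (2 + 2)", "13 * 2 = 26", "26 + 26 = 52", "so 13 * 4 = 54"]
overall, per_step = chain_score(steps, toy_prm)
print(overall, per_step.index(min(per_step)))  # 0.1 3 -> weak step located at index 3
```

An outcome reward model would only report that the final answer is wrong; the per-step scores additionally localize the first faulty step, which is exactly the richer signal described above.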
Have LLMs cracked reasoning?
OK, having discussed how o1 might roughly work, let's reflect on whether we have truly cracked reasoning and whether CoT and inference-time sampling mean reasoning is solved. Discussions around reasoning have been quite heated since the emergence of LLMs, and not just since the appearance of ChatGPT. Even back with the release of BERT and GPT-3, the research community was already debating whether these LLMs can truly reason. One famous debate was between Geoffrey Hinton, Yoshua Bengio, and Yann LeCun, who disagreed on whether LLMs can understand what they say. Geoffrey Hinton and Yoshua Bengio warn about existential risks, while Yann LeCun claims that current LLMs haven't yet reached dog-level intelligence.
But the hype intensified even further with the release of ChatGPT, GPT-4, Gemini, Claude, Llama 3, etc. The marketing of these products, coupled with their convincing-sounding responses (even when wrong), has painted an overly rosy picture, to the point that many people believe these models can handle any task effortlessly.
This pushed me to question their true capabilities: have these models cracked reasoning for real? Can they handle any compositional task? Can they learn implicit problem-solving rules via chain-of-thought training? And under what conditions do they succeed or fail, and why?
I still don't have answers for all of these questions, and it's difficult to give a final verdict on their reasoning skills. The fact that they already manage to answer very complex queries is mind-boggling in itself. But the mystery is how they do it: whether they've learned to reason like humans, or if their reasoning is a mixture of pattern recognition coupled with some form of reasoning.
My investigations led to the “Faith and Fate” paper, where we studied three compositional reasoning tasks: multiplication, dynamic programming, and one NLP task, the Einstein puzzle. The beauty of these tasks lies in their compositional nature: we can easily increase or decrease their complexity and observe how this affects LLMs' behavior (e.g., 2x2 multiplication vs. 10x10 multiplication).
Chain-of-Thought Training in LLMs: Performance Boost and Limitations
TL;DR: Chain-of-thought training leads to a significant boost in performance, BUT performance decreases to 0 as the complexity of the task increases and OOD cases are encountered.
We were particularly curious about whether training these LLMs with a massive amount of (prompt, chain-of-thought, solution) in an SFT style would enable them to learn the underlying algorithm and achieve 100% accuracy, regardless of the task's complexity. Could they for example generalize from 5x5 multiplication training to solve 6x6 multiplication problems at inference time?
Across all tasks, we observed a boost in performance compared to plain (prompt, solution) training. The main problem was OOD performance: accuracy quickly drops to 0% on, say, 5-digit x 5-digit multiplication if the model was trained only up to 4x4. The “o1-preview” model fails at long-digit multiplication, and “o1” models answer an 8x8-digit question incorrectly:
Prompting
> “Can you tell me how much 96745982 times 93678239 ?”
> The correct answer is “9,062,993,224,085,698”
Yuntian Deng tested “o1-mini” on up to 20x20 multiplication: it solves up to 9x9 multiplication with decent accuracy, while GPT-4o struggles beyond 4x4. GPT-4o was probably not trained with the same amount of CoT trajectories, hence the lower performance.
Let’s go back to the question: “Can you tell me how much 96745982 times 93678239 ?”
> The correct answer is “9,062,993,224,085,698”
> o1’s answer is “9,066,591,470,043,979,798”
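Incidentally, this failure is trivial to check with exact integer arithmetic, the kind of computation a tool call could do in microseconds:

```python
# Python integers are arbitrary-precision, so the true product is exact.
correct = 96745982 * 93678239
o1_answer = 9066591470043979798  # o1's reported (wrong) answer

print(correct)                                   # 9062993224085698
print(str(correct)[:2], str(o1_answer)[:2])      # first two digits agree: 90 90
print(str(correct)[-2:], str(o1_answer)[-2:])    # last two digits agree: 98 98
print(len(str(o1_answer)) - len(str(correct)))   # yet o1's answer has 3 extra digits
```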
If you look closely at o1's answer, you can notice that the model partially predicted the response correctly (the first and last two digits), even though the overall answer is incorrect. We observed the same phenomenon in “Faith and Fate,” and we explain it through relative information gain. What exactly is happening during model training?
Next-token prediction leads the model to learn superficial, spurious patterns. This means that if an output element heavily relies on a single input feature or a small set of them, LLMs are likely to pick up that correlation during training and make partial guesses at test time without executing the full multi-step reasoning required by the task. It's as if the model learns shortcuts from the CoT training. There is nothing wrong with learning shortcuts; we humans commonly use them to deliver answers swiftly. However, the key difference lies in our ability to discern when and how to use shortcuts, a skill LLMs seem to lack. Teaching LLMs with CoT and RL search does not necessarily solve this issue, especially when the complexity of the task is very high.
LLMs Reduce Multi-Step Compositional Reasoning into Subgraph Matching
To better understand why GPT-4 and ChatGPT cannot solve complex reasoning tasks with 100% accuracy, we examined their success cases to see whether they reflect genuine reasoning or simply exposure to similar examples during training.
How do we do this? We formulate these tasks as computation graphs which break down problem-solving into smaller, functional steps.
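As a simplified illustration of what a computation graph looks like for one of our tasks, the sketch below decomposes long multiplication into single-digit partial products plus additions. This mirrors the formulation in “Faith and Fate” in spirit only; the actual graph construction in the paper is more detailed.

```python
def multiplication_graph(a: int, b: int):
    """Decompose a * b into a graph of small functional steps:
    one node per single-digit partial product, one per running addition."""
    nodes = []  # (operation, inputs, output) triples
    total = 0
    for i, da in enumerate(reversed(str(a))):
        for j, db in enumerate(reversed(str(b))):
            partial = int(da) * int(db) * 10 ** (i + j)
            nodes.append(("mul", (int(da), int(db), i + j), partial))
            total += partial
            nodes.append(("add", partial, total))
    return nodes, total

nodes, result = multiplication_graph(27, 34)
print(result, len(nodes))  # 918 8 -> even a 2x2-digit problem needs 8 graph nodes
```

The graph's size grows quadratically with the number of digits, which is exactly why complexity can be dialed up so cleanly: an 8x8-digit problem already requires 128 such nodes, every one of which must be executed correctly.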
For each graph, we computed the average frequency with which the partial computations needed to solve a task appear in the training data, for both correctly and wrongly predicted examples. We found that Transformers' successes are heavily linked to having seen significant portions of the required computation graph during training, suggesting that compositional behaviors may truly be pattern matching.
Although the final prediction may be highly compositional, the true compositionality shown is much less impressive because the solutions could be easily extracted from input-output sequences present in the training data. This type of learning could be very effective when the compositional complexity of tasks is low but it becomes less efficient when tasks are increasingly complex. Shortcut learning via pattern-matching may yield fast correct answers when similar compositional patterns are available during training but does not allow for robust generalization to uncommon or complex examples.
Despite RL training and search, it's very likely that o1 exhibits the same behavior, as it produces similar error patterns.
As research progresses, we may need to redefine our concepts of “understanding” and “reasoning.” Perhaps these models are developing forms of cognition that are fundamentally different from human reasoning.
The Inference-Time Algorithm: A Step Forward or a Brute Force Detour?
While inference-time algorithms undeniably enhance LLM performance, we must question whether this approach truly advances our quest for artificial reasoning. Are we overlooking more clever, human-like solutions in our rush to improve benchmark scores?
OpenAI is claiming that models like o1 now “think” like humans. But let’s be clear: generating thousands, hundreds of thousands, or even millions of trajectories and then scoring them using a scoring function is far from human cognition. It's a brute-force approach that, while undeniably effective, is fundamentally different from human reasoning.
CoT combined with RL search and reward-model scoring is super powerful, but do we really need to search over thousands or millions of chains of thought to answer math questions that humans can solve with a simple algorithm or basic tools like a calculator? Maybe once compute becomes cheaper it won't be a big deal, especially since OpenAI has shown that they found a way to scale it up, and scale seems to be the most effective method, outperforming compute-poor models by a large margin. This is reminiscent of the bitter lesson from Rich Sutton.
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation.
One caveat: for open-ended tasks (e.g., writing, medicine), where solutions are rarely binary and often highly contextual, these brute-force approaches won't necessarily work. Search and RL can work great in domains with clear success metrics (e.g., coding or mathematics), but when the “correctness” of an answer becomes ambiguous or subjective, they become an inefficient use of time and resources.
Why Does the Debate Over Whether LLMs Show Signs of Reasoning or Not Matter?
It matters at least for me as a researcher. But, …
From a functional standpoint, it doesn't really matter. Most users care mainly about whether LLMs effectively satisfy their requests; whether the model reasons, and the underlying process by which it arrived at a solution, may be less relevant. The focus shifts to the end result rather than the specific cognitive steps taken.
However, from a deeper understanding and interpretability standpoint, we need to understand the full picture of their functioning. This is needed to help us make smart choices when it comes to their architecture and training strategies to advance their capabilities.
There is a big obstacle to doing this: evaluation is tricky. We currently test these models against human-like mental abilities, designing benchmarks that check whether they can follow the same algorithms humans use. But what if they rely on a completely different computational graph or fundamentally different mechanisms? Are our evaluations fair or complete?
The divergence in thinking processes becomes more significant here. If an LLM uses fundamentally different mechanisms or lacks a clear explanatory framework, understanding how it arrived at the output becomes super challenging. Without a clear insight into their reasoning process, we're left questioning whether our current evaluation methods are truly adequate or if they're potentially missing crucial aspects of AI capabilities and limitations.
Cognition Limit: Is There Some Upper Limit On How Much Cognition LLMs Can Obtain From Language Alone?
This is the question I get asked most. I think that to achieve a deeper comprehension of reality, we need systems that learn from sensory inputs like video, allowing them to grasp how the world truly functions. Current LLMs are trained purely on massive amounts of text, but much of human knowledge has nothing to do with language, so that part of the human experience is not captured by AI.
It's ironic that LLMs can now pass the Bar in the U.S., the examination required to become an attorney, yet they can't reliably do simple multiplication or addition. Models that incorporate vision-based learning would provide a more profound understanding. There are many efforts to build multimodal systems that process text, audio, and video at the same time, but are we there yet?
I think that without learning world models from sensory inputs and incorporating architectures capable of reasoning and planning (rather than just auto-regression), it may be hard to achieve AI at the level of humans.
Embracing Agent LLMs as the Next Frontier
LLMs are undeniably powerful, but by design, they are limited to producing natural language, which does not allow them to interact with the real world or use external tools as humans do. For example, we don't need a gigantic LLM trained on millions of arithmetic problems to perform simple multiplication tasks. The model could instead directly call an external calculator.
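The calculator example can be sketched in a few lines. The tool-call protocol below (a `CALC(...)` marker in the model's draft, replaced by a dispatcher) is invented for illustration; real frameworks use structured function-calling, but the division of labor is the same.

```python
import re

def calculator(expression: str) -> int:
    # Restricted evaluator: only digits, whitespace, parentheses, + - * allowed,
    # so eval() cannot be abused with arbitrary code.
    if not re.fullmatch(r"[\d\s+\-*()]+", expression):
        raise ValueError("unsupported expression")
    return eval(expression)

def run_with_tools(model_output: str) -> str:
    # Replace every CALC(...) call in the model's draft with the exact result.
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(calculator(m.group(1))),
                  model_output)

draft = "The product is CALC(96745982 * 93678239)."
print(run_with_tools(draft))  # → The product is 9062993224085698.
```

With this division of labor, the 8x8-digit multiplication that stumps o1 is solved exactly, with no search over reasoning paths at all.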
Tool use has been gaining popularity in the AI community recently. Sam Altman said last week that OpenAI is actively working on this direction. The next step is to augment these models with planning modules that allow them to make more strategic decisions about when to rely on their parametric knowledge and when to leverage external tools. Such composite systems will improve long-context processing for orchestrating multiple tools and, in turn, the multi-turn conversational abilities needed to make effective use of expert human feedback.
I will be writing more about LLMs and agents in the near future … Stay tuned!
The End.
Feel free to reach out if I missed something. Also, please leave comments or send me DMs if you have any questions!










