DeepSeek R1: Innovative Research and Engineering Can Rival Brute-Force Scaling
Nice display of engineering and research
DeepSeek-R1 dropped like a bombshell just as researchers (myself included) were trying to reverse-engineer OpenAI's o1 model. It exposed how o1 works and shattered the illusion that secret groundbreaking algorithms were developed behind closed doors. DeepSeek didn't just release a model—it published a paper packed with algorithms, model details, and training methods. It made the models open-source, free to use, though the data remains private. At a time when top AI labs are increasingly gatekeeping research due to rising competition, DeepSeek chose openness over secrecy.
What’s even more astonishing is how DeepSeek-R1 took the world by storm. Many called it a Sputnik moment. At first, I thought it was just a viral moment within research circles and academia. I was wrong. It shook the entire U.S. economy, wiping roughly $1 trillion off U.S. stocks and triggering the biggest single-day loss in U.S. stock market history: Nvidia shed a staggering $600 billion in market value. It didn’t stop there. DeepSeek-R1 became the most-downloaded free app on the App Store, surpassing even ChatGPT. Friends and family started reaching out, asking what was happening. The impact was bigger than anyone had expected.
What shocked us as researchers wasn’t just the results—but the cost of achieving them. DeepSeek was built on a fraction of the budget used by Meta, OpenAI, and other AI giants.
DeepSeek-R1 (an MoE model with 600B parameters) was trained on 2.8M GPU-hours, using 2,048 GPUs over two months, at a cost of around $6M. By contrast, Meta’s Llama 3.1 (405B) was trained on 30.8M GPU-hours, costing an estimated $720M. That means DeepSeek used ~11x less compute and ~120x less money while delivering far better performance. And DeepSeek trained on H800 GPUs, which are a tad slower than Meta’s H100s.
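As a quick sanity check, the ~11x and ~120x figures follow directly from the quoted numbers:

```python
# Back-of-the-envelope check of the compute/cost ratios quoted above,
# using the figures reported in the text.

deepseek_gpu_hours = 2.8e6    # DeepSeek-R1, on H800s
llama_gpu_hours = 30.8e6      # Llama 3.1 405B, on H100s

deepseek_cost = 6e6           # ~$6M
llama_cost = 720e6            # estimated ~$720M

compute_ratio = llama_gpu_hours / deepseek_gpu_hours   # ~11x less compute
cost_ratio = llama_cost / deepseek_cost                # ~120x less money

print(f"compute: ~{compute_ratio:.0f}x, cost: ~{cost_ratio:.0f}x")
```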
This is an incredible display of research and engineering. The U.S. restricted advanced chip exports to China, and China innovated. With a small team of top researchers and PhD students from leading Chinese universities and a budget of a few million dollars, DeepSeek proved that scaling compute alone isn’t the key to a breakthrough. It pushed the boundaries of innovation and showed the world (and researchers in particular) that innovative research and engineering can rival brute-force scaling.
The research community is celebrating DeepSeek-R1 as a direct blow to the secrecy of closed labs, which have kept scientific progress locked behind doors. Most importantly, it will show researchers which research dead-ends to avoid and which directions to pursue, saving them time and resources.
But while everyone is cheering, not everything was revealed. The paper lays out the training stages, but crucial details are missing: no dataset disclosures, no explanation for why certain methods work, no deep dive into what’s driving this level of performance. Training data is a crucial ingredient in the success of these models; without it, fully replicating DeepSeek-R1 is challenging. But the research community is actively working on this, with Hugging Face leading the push.
So what are DeepSeek R1 models? Why are they so powerful?
Since o1’s release in September, there’s been a lot of speculation about how it works, with OpenAI giving only vague hints. I wrote a blog post back then, guessing how o1 was built. Most of my assumptions turned out to be roughly right (assuming o1 is similar to DeepSeek-R1), except for one detail: the model was trained with RL from scratch, without fine-tuning first, and it worked incredibly well. I’ll dive deeper into this below.
There are three types of models: DeepSeek-R1-Zero, DeepSeek-R1, and a series of distilled models.
Let’s start with the first model:
DeepSeek-R1-Zero: RL with no cold-start SFT and the insanity it brings with it 🤯
This model was built by applying RL training on top of their V3 base model (released in December 2024), without any supervised fine-tuning (SFT) data. Typically, SFT provides models with crucial priors about response structure, reasoning patterns, and communication style.
What’s amazing is that DeepSeek-R1-Zero developed astonishing math/code capabilities through pure discovery and reinforcement: the model generates responses, receives reward signals telling it whether it succeeded or failed, and refines its parameters accordingly. This is different from SFT, where models are essentially given the "correct answers" upfront. This is huge! Think of how humans learn. Imagine a child solving a math problem. If their parents just hand them the correct answer, they never get the chance to struggle, make mistakes, get feedback, and improve. That’s essentially what supervised fine-tuning does—it forces the model to learn the "right" response without letting it truly discover the reasoning process on its own. This is reminiscent of AlphaGo Zero, which beat human players by learning to play from scratch using only self-play.
DeepSeek-R1-Zero proves that self-discovery and reinforcement learning can drive breakthroughs—and this changes everything.
The results are strong. Through exploration and self-discovery, the model’s average response length grew from 2,000 to 10,000 tokens during self-learning. And this wasn’t on just any data: it was tested on extremely challenging problems like AIME (the American Invitational Mathematics Examination).
DeepSeek-R1-Zero: The AHA moment
One fascinating thing was the emergence of AHA moments during training. For each reasoning question, the model generates chain-of-thought reasoning to work through the solution. But midway through training, the AHA moment occurred.
At a certain point, the model realized on its own that its initial chain of thought was flawed. Instead of blindly following it, it backtracked and corrected itself, a behavior that was never explicitly programmed. This wasn’t just reinforcement learning optimizing output, it was the model rethinking its own reasoning process.
This blew my mind. I’m still trying to wrap my head around it. How did these abilities emerge with zero fine-tuning data? How did the model figure it out on its own from just a base model?
It turns out that a very strong base model really matters for unlocking this.
How does DeepSeek-R1-Zero work?
DeepSeek-R1-Zero is given a templated prompt that guides its response. The model is instructed to first spell out its reasoning process and add it within <think>...</think> tags. Then, it outputs its final answer within <answer>...</answer> tags.
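As a rough sketch (paraphrasing the idea, not the paper’s verbatim template), such a prompt might look like this:

```python
# Sketch of an R1-Zero-style prompt template. The wording here is my own
# paraphrase; only the <think>/<answer> tag structure comes from the paper.

TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first reasons "
    "through the problem inside <think>...</think> tags, then gives the final "
    "answer inside <answer>...</answer> tags.\n"
    "User: {question}\n"
    "Assistant:"
)

# Fill in a concrete question before sending it to the model.
prompt = TEMPLATE.format(question="What is 7 * 8?")
print(prompt)
```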
What’s behind the RL training, GRPO?
GRPO is a modified version of PPO (Proximal Policy Optimization). Before diving into how GRPO works, let’s briefly go over how PPO works. I won’t go into deep technical details here; I’ll save those for another post.
The core goal of PPO is to maximize reward while ensuring the policy (the LLM’s behavior) doesn’t change too drastically in a single update. This is done by optimizing the following clipped objective function:

\( J_{\text{PPO}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, A_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \right) \right] \)

Its components are explained below.
Using PPO for large-scale LLM training is computationally expensive, as it needs to fit four big models into GPU memory: the base model, the policy model, the reward model, and the value model. To address these challenges, various implementations of PPO focus on optimizing GPU utilization, improving efficiency, and enabling scalability.
Policy ratio:

The policy ratio measures how much the new policy changes compared to the old policy for a given action. It is the ratio of the action’s probability under the new policy (\( \pi_{\theta} \)) and the old policy (\( \pi_{\text{old}} \)):

\( r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \)
Advantage (\( A_t \)):

The advantage function tells the model how much better (or worse) an action \( a_t \) (a generated response) was compared to the average expectation for that state \( s_t \).
How it’s computed:
\( A_t = R_t - V(s_t) \)

Where:
\( R_t \): The return (reward) obtained for the generated response \( a_t \). The reward usually comes from a trained reward model or from a deterministic check (e.g., running generated code through a Python interpreter to assess correctness, or comparing against the ground-truth answer for math questions). The KL term ensures the policy doesn’t learn to produce outputs that are too different from those the reward model saw during training.
\( V(s_t) \): The value function estimates the expected return starting from state \( s_t \). It needs to be trained alongside the policy model. During training, it is optimized to minimize the error between the predicted value and the actual return. The value loss is typically a mean-squared error: \( \mathcal{L}_V = \left( V(s_t) - R_t \right)^2 \)
Clipping:
The clipping term ensures that the policy ratio \( r_t(\theta) \) doesn’t deviate too far from 1. This prevents the model from making overly large updates, which keeps training stable.
Minimization:
The minimum operator takes the more pessimistic of the clipped and unclipped terms, so the policy gains nothing from pushing the ratio outside the clipping range. This stabilizes learning.
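Putting the ratio, advantage, clipping, and minimum together, the PPO loss can be sketched in a few lines of numpy (illustrative variable names, not DeepSeek’s actual code):

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Minimal sketch of the PPO clipped surrogate loss for a batch of samples."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages  # clipping term
    # min(...) takes the pessimistic bound; negate because optimizers minimize.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy example: two samples, one with positive and one with negative advantage.
loss = ppo_clipped_loss(
    logp_new=np.array([-1.0, -0.5]),
    logp_old=np.array([-1.2, -0.4]),
    advantages=np.array([1.0, -1.0]),
)
print(loss)
```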
Now, how does GRPO get around these resource constraints?
To reduce the computational burden of running four models simultaneously, GRPO drops the value model. But without it, how does it compute the advantage \( A_t \)? Instead of relying on a value model, GRPO estimates the advantage using the average reward of multiple sampled outputs. For each question \( q \), it samples a group of outputs \( \{o_1, o_2, \ldots, o_G\} \) from the old policy \( \pi_{\theta_{\text{old}}} \) and then optimizes the policy model. Another key difference is how KL divergence is handled: instead of penalizing KL divergence in the reward, GRPO adds the KL divergence between the trained policy and the reference policy directly to the loss function, modifying the original PPO formulation.
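The group-relative advantage can be sketched in a few lines (my own minimal illustration of the idea, assuming normalization by the group mean and standard deviation):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: each sampled output is scored against the
    group's own statistics, replacing the learned value-model baseline."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. G=4 sampled answers to one math question, rewarded 1 if correct else 0
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, wrong ones negative
```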
I’ll dive deeper into the specifics of this approach in a future blog post.
Issues with DeepSeek-R1-Zero:
Despite the impressive performance, the model mixed languages and produced responses in an inconsistent and unreadable format. So, the model needed further polishing. This was addressed by fine-tuning with high-quality SFT (Supervised Fine-Tuning) CoT data, leading to DeepSeek-R1.
DeepSeek-R1 Training Recipe
The recipe is simple and elegant: alternate between SFT and RL.
Step 1: Supervised Fine-Tuning (SFT)
Collect thousands of high-quality CoT (Chain-of-Thought) SFT examples.
Fine-tune DeepSeek-V3-Base on this dataset.
Step 2: Reinforcement Learning (GRPO)
Apply GRPO to the SFT model, following the same self-learning recipe as DeepSeek-R1-Zero.
Use two reward signals:
Accuracy Rewards – Evaluates whether the response is correct.
Format Rewards – Ensures the model structures its reasoning within <think>...</think> tags.
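These two signals can be sketched as simple rule-based functions (my own minimal versions assuming exact-match answers, not DeepSeek’s actual reward code):

```python
import re

# Response must be <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER = re.compile(
    r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>/<answer> template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside <answer> exactly matches the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

resp = "<think>7*8 is 56.</think><answer>56</answer>"
print(format_reward(resp), accuracy_reward(resp, "56"))
```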
Step 3: More Fine-Tuning on Expanded CoT Data
Fine-tune the RL-trained model on 800K CoT examples, including both reasoning and non-reasoning data.
The reasoning portion of this dataset was collected from the RL model of the previous step.
Main Takeaways
Standing on the shoulders of a strong base model can achieve amazing outcomes: a strong base model provides a robust prior that narrows the search space and lets the model reach solutions faster, though not always accurately enough.
SFT on reasoning data matures the model and boosts its confidence in finding the response. It enhances readability and improves the model’s understanding of instructions, which leads to responses that are "close enough to correct" but still not perfectly accurate.
RL addresses SFT limitations and shifts the objective from predicting the most likely next token to maximizing the outcome reward. It enables exploration, allows the model to discover the solution on its own, self-correct, and explore options rather than just being spoon-fed the answers as in SFT.
RL from the pretrained base model only works when the base model is already reasonably good at the task. If not, it’s more efficient to go back to pretraining rather than waste more compute on RL. That’s why RL-tuning a small, less competent model is WORSE than distilling from a larger RL-tuned model. Why? It’s simply more efficient to train on samples from an "approximately optimal" generative distribution (a strong RL-tuned model) than to use rejection sampling from a less accurate, weaker model.
Inference-time compute can help improve the model's chances of finding solutions, but it's not foolproof for solving complex reasoning tasks. It's more akin to brute force. If the model starts off poorly at approximating responses and its solution space is too far off, no amount of inference-time computation will miraculously find the right solution.
You can simulate reasoning by linearizing the thought process and training with autoregression. Complex tree search methods like MCTS might not be necessary for training data, though I’m reserved about this. It still needs more investigation.
For memory efficiency, you can drop the value model and do GRPO without loss in performance. So, no need for value functions that require another expensive copy of the model.
For reasoning tasks like math and code, there is no need for dense, complex process reward models. Rely as much as possible on verifiable ground-truth rewards.