Regarding DeepSeek

January 30, 2025 | Sagar Varma
You may be right, I may be crazy

In light of the DeepSeek API being down, I thought it would be a good idea to dive into how DeepSeek works and why it is causing such an uproar in the AI community.

DeepSeek R1 is an impressive model that stands out for its exceptional affordability. While OpenAI's o1 costs approximately $60 per million output tokens, DeepSeek charges just $2.20 per million output tokens, a staggering reduction of roughly 27x. Though DeepSeek may not outperform o1 in every aspect, it delivers comparable performance across most tasks, making it an incredibly cost-effective alternative.

Even more remarkable is the claimed training cost. DeepSeek's developers state that they trained the model for just $6 million [1], a fraction of what industry giants typically spend. For context, Claude 3.5 Sonnet required tens of millions of dollars to train, and GPT-4o likely exceeded that figure.

Additionally, DeepSeek offers effectively unlimited queries through its chat interface for free, a stark contrast to OpenAI's o1, which requires at least $20 per month for just 200 queries.

Given its drastic cost savings, strong performance, and generous free-tier access, DeepSeek is undeniably a game-changer in the AI space. While some of the excitement may be overhyped, much of it is well-deserved. The release of DeepSeek also shattered the "more capex = better AI model" narrative. That narrative was already being questioned [2], but DeepSeek blew it completely out of the water, and with it the stock prices of the companies that bet big on it (though the stocks have mostly rebounded from their Monday lows). Everyone now knows that scaling is not the only path to AGI. All of this, combined with the fact that it came out of China, which is under heavy export controls, has led to a perfect storm. While some of the hype is undeserved, most of it is justifiable, leading many across the industry to call this the Sputnik moment for AI.

Before diving into how the DeepSeek models work, it is important to understand reinforcement learning, and a few terms in particular:

- Agent: the entity that learns some task by taking actions in the environment.
- Policy: a function that maps the agent's current state to the next action it should take. It is the primary component being optimized in RL.
- Environment: the world in which the agent operates. It determines how actions change the state and dictates rewards through a reward function.
- Reward function: the function that evaluates the agent's actions by assigning rewards. It acts as a proxy for the task we want to get good at, so optimizing the agent for the reward function ends up leading to great performance on the actual task.
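To make those terms concrete, here is a toy sketch in Python of the agent-environment loop. The `ToyEnvironment` and `TabularPolicy` classes are made-up illustrations of the roles above, not anything DeepSeek actually uses.

```python
import random

class ToyEnvironment:
    """A 1-D world: the agent starts at 0 and is rewarded for reaching +5."""
    def __init__(self):
        self.state = 0

    def step(self, action):            # action is -1 or +1
        self.state += action
        reward = 1.0 if self.state == 5 else 0.0
        done = self.state in (-5, 5)
        return self.state, reward, done

class TabularPolicy:
    """Maps each state to a preference over the two actions."""
    def __init__(self):
        self.prefs = {}                # state -> preference for action +1

    def act(self, state):
        p = self.prefs.get(state, 0.5)
        return +1 if random.random() < p else -1

    def update(self, state, action, reward):
        # Nudge the preference toward actions that earned reward.
        p = self.prefs.get(state, 0.5)
        self.prefs[state] = min(1.0, max(0.0, p + 0.1 * reward * action))

# One episode of interaction: the environment dictates rewards,
# the policy is the thing being optimized.
env, policy = ToyEnvironment(), TabularPolicy()
state, done = env.state, False
while not done:
    action = policy.act(state)
    next_state, reward, done = env.step(action)
    policy.update(state, action, reward)
    state = next_state
```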

In AlphaGo, the agent was the model learning to play Go, while the policy, represented by the model's weights, processed the current game state and determined the best move to make next. The environment consisted of the Go board and the game rules, which dictated the legal moves, and the reward function was based on whether the model won or lost the game. Through millions of iterations, the model continuously refined its policy to maximize its expected reward, ultimately achieving world-class performance.

Humans also engage in reinforcement learning every day. The saying "practice makes perfect" reflects the same underlying principle—learning through repeated trial and feedback. Consider a student striving to land a prestigious internship. In this scenario, the student acts as the agent, with their mind functioning as the policy, guiding decisions on how to approach the application process. The environment is the job procurement process itself, including applications, interviews, and competition, while the reward function is determined by whether or not the student secures an offer. Over time, through repeated attempts and failures, the student refines their approach, learning from past experiences and improving their chances of success [3].

In the machine learning context, there are a few flavors of reinforcement learning. Classic RL follows the exact process described above and was used to train systems like AlphaGo. Then there is RLHF, which stands for Reinforcement Learning from Human Feedback—the method currently used to align language models for human use [4]. Another approach is Trust Region Policy Optimization (TRPO), which led to the development of Proximal Policy Optimization (PPO), and ultimately Group Relative Policy Optimization (GRPO), the core of DeepSeek R1.

Before diving into TRPO, it's important to understand how reinforcement learning (RL) is traditionally applied to language modeling. Unlike structured environments such as Go or chess, where the reward function is well-defined, language models lack a clear reward signal. The reward can come from human feedback (as in RLHF) or from methods like rejection sampling, but because these rewards are non-differentiable, we cannot backpropagate through them to update the model's weights, which rules out the standard supervised training recipe.

So, how do we update the weights? In the most basic form of RL, we use a Monte Carlo policy gradient method known as REINFORCE [5]. In this method, we treat whatever action the policy sampled as if it were the correct one, regardless of its actual quality. This simplifies the objective to:

log(πθ) × reward

where πθ represents the probability the policy assigned to the action it took. The key effect of this formulation is that actions leading to higher rewards have a greater influence on updating the policy parameters than those leading to lower rewards. Over time, if all goes well, this should result in a reasonably effective policy.
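As a rough illustration, here is what a single REINFORCE-style update could look like in PyTorch. The tiny network, the random state, and the hard-coded reward are placeholders; the point is only that the log-probability of the sampled action gets scaled by the reward it earned.

```python
import torch
import torch.nn as nn

# Placeholder policy: 4-dimensional state in, 2 possible actions out.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 4)                 # stand-in for an observed state
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()                    # treat the sampled action as "correct"
reward = 1.0                              # whatever the environment handed back

# loss = -log pi_theta(action) * reward (negated because optimizers minimize)
loss = (-dist.log_prob(action) * reward).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```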

However, REINFORCE suffers from several major issues. First, even if an action is neither particularly good nor bad, the parameters still get updated, leading to unnecessary noise in training. Second, the magnitude of updates is uncontrolled—they can be too large or too small. If updates are too large, the model may move away from a useful region, forgetting what it has learned. If they are too small, learning becomes painfully slow. Lastly, REINFORCE introduces high variance, making updates noisy and unstable, which is especially problematic for large-scale models with billions of parameters. These issues compound, leading to significant inefficiencies and instability during training.

Traditional policy optimization, while simple, is not particularly effective. To address its shortcomings, Trust Region Policy Optimization (TRPO) was introduced [6]. TRPO incorporates several key improvements to stabilize learning and improve efficiency.

One of the major additions is the advantage function, A, which refines the way updates are made. Instead of using log(πθ) × reward as in REINFORCE, TRPO modifies the update rule to log(πθ) × A. The advantage is computed as A = reward - baseline, where the baseline is a value predicted by a separate model, known as the value network or critic. The baseline represents the expected, "neutral" reward, so only actions that turn out better than expected receive positive updates, while worse-than-expected actions receive negative ones. This eliminates the inefficiency of updating the policy for neutral actions, a major drawback of REINFORCE. The other key ingredient, the one that gives TRPO its name, is the trust region: each update is constrained (via a KL-divergence bound) so the new policy cannot drift too far from the old one, which prevents the destructive, oversized updates that plague vanilla policy gradients.
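A minimal sketch of the advantage computation, again with placeholder networks: the critic predicts the baseline for the current state, and only the gap between the observed reward and that prediction drives the policy update (optimizer steps omitted).

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # predicts a single value

state = torch.randn(1, 4)
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = torch.tensor([1.0])

baseline = critic(state).squeeze(-1)      # expected reward for this state
advantage = reward - baseline             # positive only if better than expected

# Policy loss: log pi * A instead of log pi * reward.
policy_loss = (-dist.log_prob(action) * advantage.detach()).sum()
# Critic loss: push the baseline toward the observed reward.
critic_loss = (reward - baseline).pow(2).sum()
```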

Enter PPO, or Proximal Policy Optimization [7]. The core idea behind TRPO and PPO is the same, but PPO is far lighter to implement mathematically. Whereas TRPO relies on second-order optimization, approximating the KL constraint with a Hessian, solving via conjugate gradients, and finishing with a line search, all of which is a nightmare to compute, PPO uses plain first-order gradients and simple gradient descent, just as in vanilla policy optimization. It keeps the trust-region spirit by clipping the ratio between the new and old policy probabilities whenever an update would push the policy too far, and it keeps the advantage. The problem with PPO is that, although it is lighter in terms of compute, it is still not light enough. To get the advantage you need a critic model to predict the baseline, meaning a second network that takes in the same input the language model does and predicts a single number rather than the next word. That is very computationally heavy and is not feasible when the policy already has billions, soon to be trillions, of parameters.
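Here is a hedged sketch of PPO's clipped surrogate objective. The tensors stand in for per-action log-probabilities and advantages, and the clipping range of 0.2 is the default from the paper, not anything specific to DeepSeek.

```python
import torch

def ppo_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()         # pessimistic (clipped) bound

# Made-up numbers: the first action's large ratio gets clipped, so one update
# cannot move the policy too far from where it was.
logp_new = torch.tensor([-0.5, -2.0])
logp_old = torch.tensor([-1.0, -1.0])
advantage = torch.tensor([1.0, -0.5])
print(ppo_loss(logp_new, logp_old, advantage))
```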

Enter GRPO [8]. GRPO does everything PPO does, but instead of using a dedicated value model it simply samples from the policy multiple times for the same prompt. It then uses the average reward of those samples, the "group relative" part, as the baseline for the advantage. This cuts the compute required almost in half and makes it far more efficient and viable to train huge models this way.
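A sketch of that group-relative baseline: sample a group of completions for one prompt, score each of them, and measure every reward against the group's statistics. The rewards below are made up; normalizing by the group's standard deviation follows the GRPO formulation, though subtracting the group mean alone already captures the core idea.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: one score per sampled completion for the same prompt
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)   # better than the group -> positive advantage

# e.g. 8 completions for one math problem, reward 1 if the answer was correct
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
print(group_relative_advantages(rewards))   # correct ones positive, the rest negative
```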

With that out of the way, let's dive into how DeepSeek works. There are two models to consider: DeepSeek R1-Zero and DeepSeek R1. While R1 is the widely used model—capable of following instructions and engaging in natural, human-like conversation—R1-Zero is arguably the more impressive technical achievement.

Both models start from the same pre-trained base model, DeepSeek V3 [9]. This model has roughly 670 billion parameters (a mixture-of-experts architecture with about 37 billion active per token), putting it in the same class as frontier models like GPT-4o, and it uses the same training pipeline most LLMs use today: collect a lot of data, on the order of trillions of tokens [10], then train a transformer-based autoregressive model to predict the next word.
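For reference, that pre-training objective is just next-token prediction scored with cross-entropy. Here is a toy version with a single transformer layer and a tiny vocabulary, nothing like V3's actual configuration.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32                              # tiny placeholder sizes
embed = nn.Embedding(vocab_size, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))             # one sequence of 16 token ids
causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])

hidden = block(embed(tokens), src_mask=causal_mask)        # each position sees only its past
logits = lm_head(hidden)

# Predict token t+1 from everything up to token t.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
```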

Once the pre-training phase was over, straightforward reinforcement learning was applied. GRPO was used to update the policy weights, and two kinds of rewards were provided: accuracy rewards, for getting a question right, and format rewards, for laying out an explicit chain of thought before answering. Over lots and lots of steps, the model developed complex reasoning behaviors on its own, naturally.
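A rough sketch of what such rule-based rewards could look like in practice. The `<think>`/`<answer>` tags and the 0-or-1 scoring are illustrative assumptions, not DeepSeek's exact implementation.

```python
import re

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Reward 1.0 if the extracted final answer matches the reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def format_reward(completion: str) -> float:
    # Reward having an explicit chain of thought before the final answer.
    has_thinking = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
    return 1.0 if (has_thinking and has_answer) else 0.0

completion = "<think>2 + 2 is 4 because ...</think><answer>4</answer>"
print(accuracy_reward(completion, "4") + format_reward(completion))  # 2.0
```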

Now there were some problems with this model, namely that the text it produced was hard to read and follow (it would mix languages and ramble). The reasoning was there, but it was not a very good consumer-facing product.

Enter DeepSeek R1. They essentially added some supervised fine-tuning (SFT) to the mix and ended up with a model that is much more usable, but with the same reasoning capabilities as the purely RL-trained R1-Zero.

The big lesson from DeepSeek R1 is that you can strip away much of the traditional post-training pipeline and still create a really good model using reinforcement learning and little else. It goes to show how many algorithmic improvements are still left in AI, and how this is just the beginning of what will be a complete rewriting of what it means to be a human. What a time to be alive!

  1. Dario Amodei points this figure out in his recent essay: https://darioamodei.com/on-deepseek-and-export-controls
  2. An interesting analysis of the more capex narrative from sequoia: https://www.sequoiacap.com/article/ais-600b-question/
  3. Interesting perspective on that rat race: https://space.ong.ac/escaping-flatland#user-content-fnref-1
  4. Interesting X post from Karpathy detailing the issues with RLHF and why it is hardly a form of reinforcement learning https://x.com/karpathy/status/1821277264996352246?lang=en
  5. A great deep dive into how policy optimization works, and how various algorithms, including REINFORCE, use it: https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
  6. The trust region policy optimization paper: https://arxiv.org/abs/1502.05477
  7. The proximal policy optimization paper: https://arxiv.org/abs/1707.06347
  8. Interestingly, GRPO was introduced by DeepSeek themselves in early 2024, while PPO has been around since 2017 and was pioneered by OpenAI. The elimination of the critic model probably led to some decrease in training cost, since you are removing what would effectively be billions of parameters in favor of a simple group mean. Paper link: https://arxiv.org/abs/2402.03300
  9. DeepSeek V3 paper: https://arxiv.org/abs/2412.19437
  10. An interesting look at how dataset collection works and how they actually end up with trillions of tokens: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1