DeepSeek's reasoning AI shows power of small models, efficiently trained

27 January 2025

Author: Aili McConnon, Tech Reporter, IBM

DeepSeek-R1, the AI model from Chinese startup DeepSeek, soared to the top of the charts of the most downloaded and active models on the AI open-source platform Hugging Face hours after its launch last week. It also sent shockwaves through the financial markets as it prompted investors to reconsider the valuations of chipmakers like NVIDIA and the colossal investments that American AI giants are making to scale their AI businesses.

Why all the buzz? A so-called "reasoning model," DeepSeek-R1 performs as well as OpenAI’s o1 on certain AI benchmarks for math and coding tasks, yet it was trained with far fewer chips and is approximately 96% cheaper to use, according to the company.

“DeepSeek is definitely reshaping the AI landscape, challenging giants with open-source ambition and state-of-the-art innovations,” says Kaoutar El Maghraoui, a Principal Research Scientist and Manager at IBM AI Hardware.

Meanwhile, ByteDance, the Chinese tech giant that owns TikTok, recently announced its own reasoning agent, UI-TARS, which it claims outperforms OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini on certain benchmarks. ByteDance’s agent can read graphical interfaces, reason and take autonomous, step-by-step action.

From startups to established giants, Chinese AI companies appear to be closing the gap with their American rivals, in large part thanks to their willingness to open source or share the underlying software code with other businesses and software developers. “DeepSeek has been able to proliferate some pretty powerful models across the community,” says Abraham Daniels, a Senior Technical Product Manager for IBM’s Granite model. DeepSeek-R1 is offered on Hugging Face under an MIT license that permits unrestricted commercial use. “DeepSeek could really accelerate AI democratization,” he says.

Last summer, Chinese company Kuaishou unveiled a video-generating tool similar to OpenAI’s Sora but available to the public out of the gate. Sora was unveiled last February but was only fully released in December, and even then only those with a ChatGPT Pro subscription could access all of its features. Developers on Hugging Face have also snapped up new open-source models from the Chinese tech giants Tencent and Alibaba. While Meta has open-sourced its Llama models, both OpenAI and Google have pursued a predominantly closed-source approach to their model development.

Besides the boon of open source, DeepSeek engineers also used only a fraction of the highly specialized NVIDIA chips that their American competitors use to train their systems. DeepSeek engineers, for example, said they needed only about 2,000 GPUs (graphics processing units) to train their DeepSeek-V3 model, according to a research paper they published with the model’s release.

Reasoning models

“What’s really impressive is DeepSeek models’ ability to reason,” says Kush Varshney, an IBM Fellow. Reasoning models essentially verify or check themselves, representing a type of “meta cognition,” or “thinking about thinking,” Varshney says. "We are now starting to put wisdom into these models, and that’s a huge step."

Reasoning models became the hot topic of conversation last September when OpenAI previewed its o1 reasoning model. Unlike earlier AI models, which produce an answer without explaining how they arrived at it, a reasoning model solves complex problems by breaking them into steps. Reasoning models can take a few extra seconds or minutes to answer because they reflect on their analysis step by step, in a “chain of thought” manner.
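
To make the idea concrete, here is a minimal sketch, in plain Python, of how a chain-of-thought prompt differs from a direct prompt. The question and the prompt wording are illustrative assumptions, not DeepSeek’s or OpenAI’s actual prompts; reasoning models such as o1 and DeepSeek-R1 perform this step-by-step breakdown internally rather than relying on prompt wording alone.

```python
# A minimal, illustrative sketch (not DeepSeek's implementation) of how a
# chain-of-thought prompt differs from a direct prompt. Only the prompt
# construction is shown; the model and API call are omitted.

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompt: the model is asked for an answer only.
direct_prompt = f"Question: {question}\nAnswer with the final result only."

# Chain-of-thought prompt: the model is asked to reason step by step
# before committing to a final answer.
cot_prompt = (
    f"Question: {question}\n"
    "Think through the problem step by step, then state the final answer "
    "on a line starting with 'Answer:'."
)

print(direct_prompt)
print()
print(cot_prompt)
```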

Reinforcement learning

DeepSeek-R1 combines chain-of-thought reasoning with reinforcement learning in which an autonomous agent learns to perform a task through trial and error and without any instructions from a human user. Reinforcement learning differs from more commonly used forms of learning, such as supervised learning, which uses manually labeled data to produce predictions or classifications, and unsupervised learning, which aims to uncover and learn hidden patterns from unlabeled data.
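
The toy sketch below illustrates the reward-only idea: a simple agent is never shown a labeled solution and is rewarded only when its answer is correct. It is a bandit-style simplification for intuition, with a made-up question and update rule, not DeepSeek’s actual large-scale reinforcement learning setup for language models.

```python
import random

# Toy sketch of reward-only learning: the "model" is never shown a labeled
# solution; it is only rewarded when its answer is correct.

candidates = ["3", "4", "5"]                  # possible answers to "2 + 2"
preferences = {c: 0.0 for c in candidates}    # learned preference per answer

def reward(answer: str) -> float:
    # Reward correctness only; no step-by-step supervision is given.
    return 1.0 if answer == "4" else 0.0

learning_rate = 0.1
for step in range(200):
    # Epsilon-greedy exploration: mostly pick the current best answer,
    # occasionally try something else.
    if random.random() < 0.2:
        choice = random.choice(candidates)
    else:
        choice = max(preferences, key=preferences.get)
    # Nudge the preference for the chosen answer toward the observed reward.
    preferences[choice] += learning_rate * (reward(choice) - preferences[choice])

print(preferences)   # the correct answer ends up with the highest preference
```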

DeepSeek-R1 questions the assumption that models will improve their ability to reason by being trained on labeled examples of correct or incorrect behavior, or by extracting information from hidden patterns, says Yihua Zhang, a PhD student at Michigan State University who has authored dozens of papers on machine learning. “The key hypothesis is simple yet bold,” says Zhang. “Can we just reward the model for correctness and let it discover the best way to think on its own?”

Zhang says that what he and others have found particularly remarkable about the large-scale training of large language models such as DeepSeek’s is that “the model starts to show an actual ‘aha’ moment, where it steps back, spots mistakes and corrects itself.”


The cost calculus

Part of the hoopla around DeepSeek stems from its low price point. DeepSeek-V3, released on Christmas Day, cost USD 5.5 million to train and is much cheaper for developers looking to try it out, according to the technical report the company released. “It’s really impressive what they did for the cost of the model, and how long they took to train it,” says Chris Hay, Distinguished Engineer at IBM.

This low price tag, however, may not be the whole story, says Kate Soule, Director of Technical Product Management for Granite at IBM Research. The USD 5.5 million cost “represents only a fraction of the compute needed,” she says. It does not include cost details that companies keep proprietary even with open-source models, such as “the compute costs for reinforcement learning, data ablations and hyperparameter searches,” says Soule.

What’s not up for question, however, is that DeepSeek achieved greater cost efficiency by using a mixture of experts (MoE) architecture, which is dramatically less resource-intensive to train. The MoE architecture divides an AI model into separate sub-networks (or “experts”), each specializing in a subset of the input data. The model only activates the specific experts needed for a given task rather than activating the entire neural network. Consequently, the MoE architecture greatly reduces computation costs during pre-training and achieves faster performance at inference time. Several companies around the globe, including pioneering French AI company Mistral and IBM, have popularized the MoE architecture over the past year and achieved greater efficiency by combining MoE and open source.
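
As a rough illustration of why MoE saves compute, the sketch below implements a generic top-k MoE layer in PyTorch: a router scores all experts, but each token is processed by only two of them. The layer sizes, expert count and routing scheme are illustrative assumptions, not DeepSeek’s or Granite’s exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic top-k mixture-of-experts layer: only a fraction of the network
# is active for each token, which is what keeps training and inference cheap.

class TinyMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)   # scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                            # x: (tokens, dim)
        scores = self.router(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)   # each token passed through only 2 of the 8 experts
```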

In the case of IBM’s series of open-sourced Granite models (developed with an MoE architecture), enterprises are able to achieve frontier model performance at a fraction of the cost because they can adapt a large pre-trained model for their specific applications or use cases, effectively creating smaller, fit-for-purpose models. Packing tremendous capability into smaller models means these models can be used on smartphones and other devices that operate at the edge, like car computers or smart sensors on a factory floor.

This process of taking a larger model and distilling it into less resource-intensive smaller models also contributed to DeepSeek’s success. Alongside its banner R1 model release, the Chinese startup also released a series of smaller, fit-for-purpose models. Interestingly, DeepSeek showed that large models distilled into smaller ones perform better at reasoning than small models trained with reinforcement learning from the get-go.
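
For intuition, here is a minimal sketch of the generic knowledge-distillation objective: a small student model is trained to match the softened output distribution of a larger teacher. The temperature, tensor shapes and loss form are illustrative assumptions, not DeepSeek’s published recipe, which distills R1’s reasoning outputs into smaller open models.

```python
import torch
import torch.nn.functional as F

# Generic knowledge distillation: push the student's output distribution
# toward the (frozen) teacher's, using temperature-softened probabilities.

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: random logits over a vocabulary of 100 tokens.
teacher_logits = torch.randn(4, 100)                       # frozen teacher outputs
student_logits = torch.randn(4, 100, requires_grad=True)   # trainable student outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                            # gradients reach the student only
print(loss.item())
```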

A global AI reckoning?

As they rival or creep past their more established competitors on certain benchmarks, how will these new Chinese models impact the global AI landscape? “It's not just about the raw performance on benchmarks,” says El Maghraoui. “It’s about whether there is end-to-end integration of these models in a safe and ethical manner.” As a result, El Maghraoui says, it’s too early to tell whether DeepSeek-R1 and others will “transform human interactions, technology and enterprise applications.”

Ultimately, “the adoption rate by developers will determine the popularity of DeepSeek’s models,” says Daniels. In addition, it will be “very interesting to see the use cases they uncover for the models,” he says.  

And as IBM’s Varshney sees it, geopolitical differences may also matter less than one might assume in this global AI race. “Once a model is open-sourced, where it comes from stops mattering in many respects,” he says.

Get started with DeepSeek on IBM watsonx.ai today. Learn how to deploy distilled variants of DeepSeek-R1 for secure inference with watsonx.ai in this tutorial.
