🌐
Towards AI
pub.towardsai.net › deepseek-r1-model-architecture-853fefac7050
DeepSeek-R1: Model Architecture. This article provides an in-depth… | by Shakti Wadekar | Towards AI
March 13, 2025 - DeepSeek-R1 employs Multi-Head Latent Attention (MLA) layers instead of standard multi-head attention across all transformer layers. The first three transformer layers differ from the rest, using a standard Feed-Forward Network (FFN) layer. From layer 4 to 61, a Mixture-of-Experts (MoE) layer ...
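To make the layer layout described in this snippet concrete, here is a minimal PyTorch sketch of a 61-block decoder stack whose first three blocks use a dense FFN and whose remaining blocks use an MoE layer. Only the layer counts come from the snippet; the class names, dimensions, and the simple argmax routing are illustrative placeholders (standard multi-head attention stands in for MLA).

```python
import torch
import torch.nn as nn

NUM_LAYERS = 61        # total transformer blocks reported for DeepSeek-V3/R1
NUM_DENSE_LAYERS = 3   # the first three blocks use a standard dense FFN

class DenseFFN(nn.Module):
    """Plain position-wise feed-forward network (illustrative sizes)."""
    def __init__(self, d_model: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.SiLU(), nn.Linear(hidden, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Toy MoE layer: each token is sent to its single best-scoring expert."""
    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d_model) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):
        best = self.gate(x).argmax(dim=-1)          # (batch, seq) expert index per token
        out = x.clone()
        for i, expert in enumerate(self.experts):
            mask = best == i
            out[mask] = expert(x[mask])
        return out

class DecoderBlock(nn.Module):
    """One block: attention followed by either a dense FFN or an MoE layer."""
    def __init__(self, d_model: int, use_moe: bool):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)  # stand-in for MLA
        self.ffn = MoELayer(d_model) if use_moe else DenseFFN(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out
        return x + self.ffn(x)

# Blocks 1-3 dense, blocks 4-61 MoE, matching the layout described above.
blocks = nn.ModuleList(
    DecoderBlock(d_model=256, use_moe=(i >= NUM_DENSE_LAYERS)) for i in range(NUM_LAYERS)
)
print(sum(isinstance(b.ffn, MoELayer) for b in blocks), "MoE blocks")   # 58 MoE blocks
```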
🌐
HiddenLayer
hiddenlayer.com › home › research › innovation hub › analysing deepseek-r1’s architecture
Analysing DeepSeek-R1’s Architecture
March 25, 2025 - For the purposes of our analysis, our team converted the DeepSeek R1 model hosted on HuggingFace to the ONNX file format, enabling us to examine its computational graph. We used this, along with a review of associated technical papers and code, to identify shared characteristics and subgraphs observed within other models and piece together the defining features of its architecture.
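The general workflow HiddenLayer describes, exporting a Hugging Face checkpoint to ONNX and walking its computational graph, can be sketched roughly as below. Their exact tooling is not specified, a model of DeepSeek-R1's size needs far more memory than this toy example suggests, and the `optimum`/`onnx` calls shown are just one plausible route.

```python
# Sketch only: export a Hugging Face checkpoint to ONNX with Optimum, then
# tally operator types in the exported graph to look for recurring subgraphs.
from collections import Counter

import onnx
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1"   # hosted checkpoint referenced above
onnx_dir = "deepseek-r1-onnx"          # illustrative output directory

# export=True triggers the conversion to ONNX during loading.
ort_model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained(onnx_dir)

# Load the exported graph and count node op types (file name may vary by optimum version).
graph = onnx.load(f"{onnx_dir}/model.onnx").graph
op_counts = Counter(node.op_type for node in graph.node)
print(op_counts.most_common(10))
```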
🌐
Medium
medium.com › @namnguyenthe › deepseek-r1-architecture-and-training-explain-83319903a684
DeepSeek-R1: Architecture and training explained | by The Nam | Medium
January 25, 2025 - But does DeepSeek-R1 rely entirely on RL? The answer is both yes and no. The authors released two distinct models: DeepSeek-R1-Zero and DeepSeek-R1. The former used only RL in the post-training process. While it performed on par with OpenAI's o1 on certain reasoning benchmarks, it struggled with poor readability and occasional language mixing.
🌐
Hugging Face
huggingface.co › deepseek-ai › DeepSeek-R1
deepseek-ai/DeepSeek-R1 · Hugging Face
DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository.
🌐
Fireworks AI
fireworks.ai › blog › deepseek-model-architecture
DeepSeek v3 and R1 Model Architecture: Why it's powerful and economical
DeepSeek v3 and R1 continue to use the traditional Transformer block, incorporating SwiGLU, RoPE, and RMSNorm. They also inherit the Multi-Head Latent Attention (MLA) and radical Mixture-of-Experts (MoE) designs introduced by DeepSeek v2.
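For reference, here are compact PyTorch versions of two of the components named in this snippet, RMSNorm and the SwiGLU feed-forward block. The dimensions are illustrative, not the values used in DeepSeek-V3/R1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: SiLU(x W_gate) * (x W_up), then projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)             # (batch, sequence, features)
y = SwiGLU(512, 1408)(RMSNorm(512)(x))
print(y.shape)                          # torch.Size([2, 16, 512])
```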
🌐
GeeksforGeeks
geeksforgeeks.org › artificial intelligence › deepseek-r1-technical-overview-of-its-architecture-and-innovations
DeepSeek-R1: Technical Overview of its Architecture and Innovations - GeeksforGeeks
February 3, 2025 - At its core, DeepSeek-R1 distinguishes ... architecture is built on two foundational pillars: a cutting-edge Mixture of Experts (MoE) framework and an advanced transformer-based design....
🌐
GitHub
github.com › deepseek-ai › DeepSeek-R1
GitHub - deepseek-ai/DeepSeek-R1
DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base. For more details regarding the model architecture, please refer to DeepSeek-V3 repository.
Starred by 91.6K users
Forked by 11.8K users
🌐
NVIDIA
build.nvidia.com › deepseek-ai › deepseek-r1 › modelcard
deepseek-r1 Model by Deepseek-ai | NVIDIA NIM
Runtime Engine(s): vLLM and SGLang
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, and NVIDIA Volta architectures
[Preferred/Supported] Operating System(s): Linux
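Since vLLM is listed as a supported runtime engine, a minimal offline-inference sketch with vLLM's Python API looks like the following. Serving the full 671B-parameter model requires a multi-GPU node; the parallelism and sampling settings here are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,      # adjust to the number of available GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)
```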
🌐
YouTube
youtube.com › watch
DeepSeek R1 Theory Tutorial – Architecture, GRPO, KL Divergence - YouTube
Learn about DeepSeek R1's innovative AI architecture from @deeplearningexplained. The course explores how R1 achieves exceptional reasoning through reinforc...
Published March 11, 2025
🌐
Milvus
milvus.io › ai-quick-reference › what-is-the-architecture-of-deepseeks-r1-model
What is the architecture of DeepSeek's R1 model?
The base architecture likely includes features like pre-normalization (stabilizing training) and rotary positional embeddings (better handling of sequence length). The training framework emphasizes parallelism and optimization.
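Rotary positional embeddings, mentioned here, encode position by rotating pairs of query/key features through position-dependent angles. Below is a compact, generic implementation of that idea, not DeepSeek's own code; the head dimension and base are illustrative.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature pairs of x (seq_len, head_dim) by position-dependent angles."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)          # 8 positions, head dimension 64
print(apply_rope(q).shape)      # torch.Size([8, 64])
```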
🌐
HiddenLayer
hiddenlayer.com › home › research › innovation hub › deepseek-r1 architecture
DeepSeek-R1 Architecture
March 25, 2025 - Initial analysis revealed that DeepSeek-R1 shares its architecture with DeepSeek-V3, which supports the information provided in the model's accompanying write-up. The primary difference is that R1 was fine-tuned using Reinforcement Learning to improve reasoning and Chain-of-Thought output.
🌐
Medium
medium.com › @isaakmwangi2018 › a-simple-guide-to-deepseek-r1-architecture-training-local-deployment-and-hardware-requirements-300c87991126
A Simple Guide to DeepSeek R1: Architecture, Training, Local Deployment, and Hardware Requirements | by Isaak Kamau | Medium
January 23, 2025 - It features 671 billion parameters, utilizing a mixture-of-experts (MoE) architecture in which each token activates roughly 37 billion parameters. This model showcases emergent reasoning behaviors, such as self-verification, reflection, and ...
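The sparsity implied by these figures is easy to make explicit: only about one parameter in eighteen is touched per token.

```python
# Back-of-the-envelope ratio based on the figures quoted above.
total_params = 671e9
active_params = 37e9
print(f"Active fraction per token: {active_params / total_params:.1%}")   # ~5.5%
```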
🌐
Languagemodels
newsletter.languagemodels.co › p › the-illustrated-deepseek-r1
The Illustrated DeepSeek-R1 - by Jay Alammar
January 27, 2025 - Just like previous models from the days of GPT-2 and GPT-3, DeepSeek-R1 is a stack of Transformer decoder blocks. It's made up of 61 of them. The first three are dense, but the rest are mixture-of-experts layers (See my co-author Maarten's ...
🌐
arXiv
arxiv.org › pdf › 2501.12948
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Table 1 | Template for DeepSeek-R1-Zero. prompt will be replaced with the specific reasoning ... The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two
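The paper's Table 1 template wraps the model's reasoning in <think>...</think> tags and its final answer in <answer>...</answer> tags, and the rule-based reward combines format and accuracy checks. Below is a simplified Python sketch of such a reward; the exact rules and reward values used for DeepSeek-R1-Zero are paraphrased, not reproduced.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the content of <answer>...</answer> matches the reference exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0

def rule_based_reward(response: str, reference_answer: str) -> float:
    # Illustrative combination of the two reward types described in the paper.
    return format_reward(response) + accuracy_reward(response, reference_answer)

sample = "<think>2 + 2 = 4</think><answer>4</answer>"
print(rule_based_reward(sample, "4"))   # 2.0
```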
🌐
DeepWiki
deepwiki.com › deepseek-ai › DeepSeek-R1 › 2-model-architecture
Model Architecture | deepseek-ai/DeepSeek-R1 | DeepWiki
This architecture enables these models to have a massive parameter count while maintaining computational efficiency during inference. ... The core innovation in the DeepSeek-R1 models is the Mixture of Experts (MoE) architecture, which allows the model to have a large total parameter count ...
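To see how a large total parameter count can coexist with modest per-token compute, here is a schematic top-k gating step. The expert count, k, and the plain softmax gate are illustrative; DeepSeek's DeepSeekMoE design additionally uses shared experts and finer-grained routing.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 2):
    """Return per-token expert indices and mixing weights.

    hidden:      (num_tokens, d_model)
    gate_weight: (num_experts, d_model)
    """
    logits = hidden @ gate_weight.T                    # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)     # keep the k best experts per token
    topk_weights = F.softmax(topk_logits, dim=-1)      # renormalise over the selected experts
    return topk_idx, topk_weights

tokens = torch.randn(4, 512)            # 4 tokens, model width 512
gate = torch.randn(16, 512)             # 16 experts
idx, w = route_tokens(tokens, gate)
print(idx.shape, w.shape)               # torch.Size([4, 2]) torch.Size([4, 2])
```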
🌐
Fireworks AI
fireworks.ai › blog › deepseek-r1-deepdive
DeepSeek-R1 Overview: Features, Capabilities, Parameters
DeepSeek R1 excels at tasks demanding logical inference, chain-of-thought reasoning, and real-time decision-making. Whether it’s solving high-level mathematics, generating sophisticated code, or breaking down complex scientific questions, DeepSeek R1’s RL-based architecture allows it to self-discover and refine reasoning strategies over time.
🌐
ResearchGate
researchgate.net › publication › 388856323_Highlighting_DeepSeek-R1_Architecture_Features_and_Future_Implications
(PDF) Highlighting DeepSeek-R1: Architecture, Features and Future Implications
February 11, 2025 - DeepSeek-R1 emphasizes unique training processes devoid of supervised fine-tuning and utilizes rule-based reinforcement learning by means of Group Relative Policy Optimization (GRPO). The paper will identify the major features of DeepSeek-R1 and their ...
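GRPO replaces a learned critic with a group-relative baseline: several responses are sampled per prompt, and each reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation follows; the clipping and KL terms of the full objective are omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (group_size,) rewards for responses sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 2.0, 0.0])   # e.g. rule-based rewards for 4 samples
print(group_relative_advantages(rewards))
```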
🌐
Founderscreative
founderscreative.org › model-architecture-behind-deepseek-r1
Model Architecture Behind DeepSeek R1 – Founders Creative
This concludes this section. We explored three primary architectural patterns that the DeepSeek team adapted and enhanced to develop the DeepSeek-R1 model: DeepSeekMoE, Multi-Head Latent Attention, and Multi-Token Prediction. We also reviewed various improvements made to the training framework ...
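Of the three patterns named above, Multi-Head Latent Attention is the least familiar; its core idea is to cache a small per-token latent from which keys and values are reconstructed, shrinking the KV cache. The sketch below shows only that low-rank compression step; decoupled rotary embeddings, per-head splits, and the real dimensions are omitted or invented for illustration.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Toy low-rank KV compression: cache a small latent, reconstruct K and V from it."""
    def __init__(self, d_model: int = 512, d_latent: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compression; the latent is what gets cached
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # reconstruct values

    def forward(self, x: torch.Tensor):
        latent = self.down(x)                  # (batch, seq, d_latent), far smaller than full K/V
        return self.up_k(latent), self.up_v(latent)

x = torch.randn(1, 128, 512)
k, v = LatentKV()(x)
print(k.shape, v.shape)     # torch.Size([1, 128, 512]) torch.Size([1, 128, 512])
```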
🌐
BentoML
bentoml.com › blog › the-complete-guide-to-deepseek-models-from-v3-to-r1-and-beyond
The Complete Guide to DeepSeek Models: V3, R1, V3.1, V3.2 and Beyond
This builds on roughly $6 million spent to develop the underlying V3-Base model. R1 is also thought to be the first major LLM to undergo the peer-review process. This marks a rare moment of transparency in large-scale AI research. Image Source: DeepSeek-R1 Supplementary Information
🌐
Adyog
blog.adyog.com › home › how deepseek-r1 was built: architecture and training explained
How DeepSeek-R1 Was Built: Architecture and Training Explained
February 3, 2025 - DeepSeek-R1 is a text-generation AI model designed for complex reasoning and logical inference. It is based on a Mixture of Experts (MoE) architecture, which allows it to dynamically allocate computational resources to different specialized ...