Is there a difference between them? I only just saw the R1 Zero one today, curious if anyone’s tried it or not.
I found the guide using Ollama and Chatbox very helpful but I am very interested in the CoT R1-zero version (vs R1 that still uses SFT) and can't seem to find a distilled version anywhere. Has anyone figured this out?
Over the weekend I wanted to learn how DeepSeek-R1 was trained and what was so revolutionary about it. So I ended up reading the paper and wrote down my thoughts. The linked article is (hopefully) written in a way that's easy for everyone to understand -- no PhD required!
Here's a "quick" summary:
1/ DeepSeek-R1-Zero is trained with pure reinforcement learning (RL), without using labeled data. It's the first time someone has tried this and succeeded. (That we know of; the o1 report didn't show much.)
2/ Traditional RL frameworks (like PPO) have something like an 'LLM coach or critic' that tells the model whether the answer was good or bad, based on given examples (labeled data). DeepSeek uses GRPO, a pure-RL framework that skips the critic and instead scores each answer relative to the group average of sampled answers, using predefined rules.
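If it helps, here's a minimal sketch of that group-relative idea in Python. The function name and example rewards are mine, not DeepSeek's code; the point is just that each answer is scored relative to the other answers sampled for the same prompt, so no learned critic is needed.

```python
import statistics

def group_relative_advantages(rewards):
    """Toy version of GRPO's core trick: score each sampled answer
    relative to the group it was sampled with, instead of asking a critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero std
    return [(r - mean) / std for r in rewards]

# e.g. 4 answers sampled for the same prompt, scored by rule-based rewards
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```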
3/ But how can you evaluate performance if you don't have labeled data to test against? With this framework, the rules aren't perfect -- they're just a best guess at what "good" looks like. The RL process tries to optimize for things like:
Does the answer make sense? (Coherence)
Is it in the right format? (Completeness)
Does it match the general style we expect? (Fluency)
For example, on mathematical tasks, DeepSeek-R1-Zero could be rewarded for producing outputs that align with mathematical principles or logical consistency.
It makes sense... and it works... to some extent!
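To make the rule idea concrete, here's a toy rule-based reward in Python. The <think>/<answer> tag check mirrors the format reward the paper describes, but the specific checks and weights are illustrative assumptions, not DeepSeek's published reward function.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: no learned critic, only checks we can verify automatically."""
    reward = 0.0
    # Format check: did the model produce the expected <think>...</think><answer>...</answer> layout?
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.DOTALL):
        reward += 0.5
    # Accuracy check (for math): compare the final answer against a known result.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5
```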
4/ This model (R1-Zero) had issues with poor readability and language mixing -- something you'd expect from pure RL. So the authors put it through a multi-stage training process, combining several training methods:
5/ The DeepSeek-R1 model goes through a sequence of training methods, each with a different purpose:
(i) the cold-start data lays a structured foundation, fixing issues like poor readability;
(ii) pure RL develops reasoning almost on autopilot;
(iii) rejection sampling + SFT provides top-tier training data that improves accuracy; and
(iv) a final RL stage adds another level of generalization.
And with that, they're doing as well as or better than the o1 models.
Lmk if you have any questions (i might be able to answer them).
There's been a lot of excitement around DeepSeek R1 obviously, but I was wondering if anyone has had success running DeepSeek R1 Zero? I don't think there are as many quantizations or distillations of R1-ZERO out there. Is my only option to rent an 8xA100 cluster?
https://huggingface.co/deepseek-ai/DeepSeek-R1
https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero
I am waiting for this. Hopefully today.
Recently just added to OpenRouter. Anyone tried it? How is it?
The AI world is losing its mind over DeepSeek-R1-Zero, a model that skipped supervised fine-tuning (SFT) entirely and learned purely through reinforcement learning (RL). Unlike its sibling R1—which uses some SFT data to stay "human-readable"—R1-Zero’s training mirrors AlphaZero’s trial-and-error self-play. The result? Jaw-dropping performance (AIME math scores jumped from 15.6% → 86.7%) paired with bizarre, uninterpretable reasoning. Researchers observed "aha moments" where it autonomously rechecked flawed logic mid-process and allocated more compute to harder problems—without human guidance. But here’s the kicker: its outputs are riddled with garbled language mixes (e.g., Chinese/English spaghetti code) and logic leaps that even its creators can’t fully explain.
Meanwhile, R1 (the SFT-hybrid version) achieves similar performance without the chaos, proving that human-curated data still tames the beast. But at what cost? R1-Zero’s pure RL approach hints at a terrifying possibility: minds that optimize truth beyond human comprehension. And with API costs 50x cheaper than OpenAI’s, scaling this could democratize superintelligence—or unleash unreadable black-box AI.
If R1-Zero’s "alien logic" solves problems we can’t, does readability even matter… or is this how alignment dies?
For about a month I'd been looking for a free model that I really like. There were several with pros and cons among Cohere, Mistral, and Gemini, but when I tried DeepSeek R1 Zero (free) I was very satisfied with the responses, as it handles both NSFW and SFW. Sometimes it becomes repetitive, but it is easy to get out of it. Maybe I'm not demanding, but I like it when a model is aware of the scenario and character descriptions.
Hey guys, DeepSeek seems to only provide an API for R1 and not for R1-Zero, so is there another platform where I can find an API for R1-Zero?
If there's no API available, what GPUs do I need to run inference on R1-Zero?
Finally, there is a model worthy of the hype it has been getting since Claude 3.6 Sonnet. DeepSeek has released something hardly anyone expected: a reasoning model on par with OpenAI's o1 within a month of the v3 release, with an MIT license and at 1/20th of o1's cost.
This is easily the best release since GPT-4. It's wild; the general public seems excited about this, while the big AI labs are probably scrambling. It feels like things are about to speed up in the AI world. And it's all thanks to this new DeepSeek-R1 model and how they trained it.
Some key details from the paper:
- Pure RL (GRPO) on v3-base to get r1-zero. (No Monte Carlo Tree Search or Process Reward Modelling.)
- The model uses "Aha moments" as pivot tokens to reflect and reevaluate answers during CoT.
- To overcome r1-zero's readability issues, v3 was SFTd on cold-start data.
- Distillation works: small models like Qwen and Llama trained on r1-generated data show significant improvements.
Here's the overall r1-zero pipeline:
- v3 base + RL (GRPO) → r1-zero

And the r1 training pipeline (a toy sketch of both follows the list):
- DeepSeek-V3 Base + SFT (Cold Start Data) → Checkpoint 1
- Checkpoint 1 + RL (GRPO + Language Consistency) → Checkpoint 2
- Checkpoint 2 used to Generate Data (Rejection Sampling)
- DeepSeek-V3 Base + SFT (Generated Data + Other Data) → Checkpoint 3
- Checkpoint 3 + RL (Reasoning + Preference Rewards) → DeepSeek-R1
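Here's the toy sketch mentioned above, assuming stub stage functions (the function names and return values are placeholders of mine; they only show how the checkpoints feed into one another, not real training code):

```python
def sft(base_model: str, data: str) -> str:
    """Supervised fine-tuning stage (stub)."""
    return f"{base_model} + SFT({data})"

def rl(model: str, rewards: str) -> str:
    """GRPO reinforcement-learning stage (stub)."""
    return f"{model} + RL({rewards})"

def rejection_sample(model: str) -> str:
    """Generate reasoning traces and keep only verified, readable ones (stub)."""
    return f"filtered data from {model}"

# r1-zero: pure RL straight on the base model
r1_zero = rl("DeepSeek-V3-Base", "rule-based rewards")

# r1: the multi-stage recipe listed above
ckpt1 = sft("DeepSeek-V3-Base", "cold-start CoT data")
ckpt2 = rl(ckpt1, "rule-based + language-consistency rewards")
reasoning_data = rejection_sample(ckpt2)
ckpt3 = sft("DeepSeek-V3-Base", reasoning_data + " + general SFT data")
deepseek_r1 = rl(ckpt3, "reasoning + preference rewards")
```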
We know the benchmarks, but just how good is it?
Deepseek r1 vs OpenAI o1.
So, for this, I tested r1 and o1 side by side on complex reasoning, math, coding, and creative writing problems. These are questions that previously only o1 could solve, or that no model could.
Here’s what I found:
- Reasoning: Much better than any previous SOTA model short of o1. It is better than o1-preview but a notch below o1. This also shows on the ARC-AGI bench.
- Mathematics: Same story; r1 is a killer, but o1 is better.
- Coding: I didn't get to play much, but on first look, it's up there with o1, and the fact that it costs 20x less makes it the practical winner.
- Writing: This is where R1 takes the lead. It gives the same vibes as early Opus. It's free, less censored, has much more personality, is easy to steer, and is very creative compared to the rest, even o1-pro.
What interested me was how free the model's outputs and thought traces sounded, akin to a human internal monologue. Perhaps this is because of less stringent RLHF, unlike the US models.
The fact that you can get r1 from v3 via pure RL was the most surprising.
For in-depth analysis, commentary, and remarks on the Deepseek r1, check out this blog post: Notes on Deepseek r1
What are your experiences with the new Deepseek r1? Did you find the model useful for your use cases?
Link to blog post: An Analysis of DeepSeek's R1-Zero and R1
From Mike Knoop (ARC-Prize Cofounder) on X:
just published my full u/arcprize analysis of deepseek's r1-zero and r1. link below. key points:
r1-zero is more important than r1.
both r1-zero and r1 score ~15% on ARC-AGI-1. this is fascinating. it matches deepseek's own benchmarking showing comparable results in logical domains like math and coding across r1-zero and r1.
r1-zero removes the final human input bottleneck -- "expert CoT labeling", e.g. supervised fine-tuning ("SFT"). from there to AGI, it's all about efficiency.
deepseek says r1-zero suffers from incoherence and language mixing. this has been corroborated online. but we saw no evidence in our testing. all this suggests:
SFT is not necessary for accurate and legible CoT reasoning in domains with strong verification.
the r1-zero training process is capable of creating its own internal domain specific language (DSL) in token space via RL optimization.
SFT is currently necessary for increasing CoT reasoning domain generality with these LLM architectures
this makes intuitive sense, as language itself is effectively a reasoning DSL. The exact same "words" can be learned in one domain and applied in another, like a program. the pure RL approach cannot yet discover a broad shared vocabulary, and I expect this will be a strong focus for future research.
ultimately r1-zero demonstrates the prototype of a potential scaling regime with zero human bottlenecks – even in the training data acquisition itself.
more broadly, the public is very under-informed about impending inference demand. o3 beating ARC-AGI-1 (75%/86% on low/high compute) was barely reported mainstream. expect more market whiplash as the frontier progress isn't disseminated fast enough. mainstream press has important work to do.
o1/o3/r1 benchmark accuracy scores are exciting but the real practical impact will be massively improved reliability, leading agents to finally start working in 2025.
we'll also start seeing "synthetic data" (low quality) becoming "real data" (high quality) -- and the end user is paying for it! there is a legit power concentration potential feedback loop here to understand.
r1-zero and r1 being open is great for the world, deepseek has moved the science forward. many folks have told me they plan to use r1's ideas for ARC Prize 2025, which i'm excited to see. we are going to rapidly find the limits of LLMs + CoT search.
- DeepSeek's R1-Zero is significant because it achieves strong reasoning performance without human-labeled data (SFT); it relies only on reinforcement learning (RL).
- This overcomes the friction of human data bottlenecks.
- "Inference as training": reasoning systems can generate high-quality data during inference, which can then be used to further train and improve the model (a toy sketch of this loop follows below).
- This creates a powerful feedback loop and a potential runaway effect for companies with large user bases.
https://arcprize.org/blog/r1-zero-r1-results-analysis
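As a rough illustration of that "inference as training" loop (all the functions below are hypothetical stubs of mine, not anything ARC or DeepSeek published), the feedback cycle looks something like this:

```python
def generate_candidates(model: str, prompt: str, n: int = 8) -> list[str]:
    """Sample several reasoning traces at inference time (placeholder)."""
    return [f"trace {i} from {model} for {prompt!r}" for i in range(n)]

def verify(trace: str) -> bool:
    """Automatic check, e.g. a unit test or exact-match math verifier (placeholder)."""
    return "trace" in trace  # stand-in: a real verifier would check the actual answer

def fine_tune(model: str, data: list[str]) -> str:
    """Fold verified inference outputs back into training (placeholder)."""
    return f"{model} (+{len(data)} verified samples)"

model = "r1-like-model"
for _ in range(3):  # each round of user traffic yields new verified training data
    traces = generate_candidates(model, "some user prompt")
    verified = [t for t in traces if verify(t)]
    model = fine_tune(model, verified)
print(model)
```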
This is my personal experience. Small R1 models that can run fast enough generate too much output; effectively they end up being very slow compared to something like Llama 3.2. Even if you are OK with the speed, R1 fails to stick to simple output instructions.
Regarding the chain-of-thought concept: I am not convinced that it yields a significant improvement. Retrospection works when you have external feedback or a reference, not by going over your own thoughts like a schizophrenic exclaiming "wait, no" every now and then.
R1 gives the impression of a student who doesn't know the answer and is hoping to wing it by accidentally stumbling on something acceptable while stalling the teacher.
I decided I wanted to do a lit review of everything the deepseek team had published so far and try to get a sense of what they did differently. "Just a copy/rip-off of GPT" didn't really compute for me. Here's my plain-language, 5-minute analysis. Think of it as a warm-start to "how do I explain this to my dad?" then go read the papers cited.
On January 20th, 2025, a little-known firm operating out of the PRC open-sourced a model known as DeepSeek-R1, claimed to be a frontier-level reasoning model incorporating features such as long chains-of-thought. This advancement represents the first such model to be produced by researchers within the PRC and was accomplished without on-premises use of the NVIDIA H100 GPU, instead making use of the lower-clocked (1.75 vs 1.83 GHz) and lower-memory (80 vs 96 GB) H800 GPU (estimated 5% lower computational throughput). Performance of R1 was benchmarked by DeepSeek and found to be near that of OpenAI's o1-0912 across each of six benchmarks.
This level of performance on its own is not necessarily impressive. DeepSeek-V3 and R1 join a growing group of highly performant AI "chat" models available to the public. DeepSeek researchers were able, however, to write, train, distill and deploy a set of state-of-the-art models for a small fraction of the cost of American-led efforts. DeepSeek's self-published cost estimates for training the V3 LLM are in the range of 2.788M GPU-hours costing an estimated $5.576M USD, for a model of around 671B total parameters (DeepSeek-AI, 2024). This is in contrast to Sam Altman (CEO of OpenAI) estimating that GPT-4 cost over $100M USD to train at over 1 trillion parameters, with GPT-5 costs running into the billions (Buchholz, 2024). While DeepSeek utilized only 2,048 H800 GPUs, Meta AI (the publisher of the open-source Llama model family) is estimated to own "350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s" (Kevin Lee, 2024).
The task now is understanding what innovations led to this massive leap in training efficiency. Undoubtedly, having use of preexisting models substantially lowered the training costs for the DeepSeek venture; the DeepSeek team made ample use of the QwQ model published by the Alibaba Qwen team. Speedups came from technical expertise, such as using 8-bit floating-point precision (FP8), striking a middle ground between the larger FP16 and lower-precision INT4. Further speedups were gained from a novel load-balancing strategy, a multi-token prediction objective, and "co-design of algorithms, frameworks and hardware [to] overcome the communication bottleneck in cross-node MoE training". Great pains were clearly taken to optimize the training strategy for efficiency, with several other novel techniques not mentioned here that can be found in the DeepSeek-V3 technical report (DeepSeek-AI, 2024).
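As a toy illustration of what dropping to 8-bit floats involves (this assumes a recent PyTorch exposing the float8_e4m3fn dtype and has nothing to do with DeepSeek's actual FP8 training kernels), you can inspect the rounding error the narrower format introduces:

```python
import torch

# Cast a bf16 tensor to FP8 (e4m3) and back to see the precision loss FP8 training must manage.
x = torch.randn(8, dtype=torch.bfloat16)
x_fp8 = x.to(torch.float8_e4m3fn)   # requires PyTorch >= 2.1
x_back = x_fp8.to(torch.bfloat16)
print((x - x_back).abs().max())     # worst-case rounding error from the 8-bit format
```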
The key advancement offered by the DeepSeek-R1 training strategy was the shift from large, human-compiled datasets to an unsupervised strategy. DeepSeek-R1 was trained using only a small amount of supervised data and conducted the bulk of its learning through unsupervised reinforcement learning (RL). DeepSeek-R1-Zero, meanwhile, was trained using no supervised data, in a strategy reminiscent of the chess and shogi training of AlphaZero (Silver, 2017).
As detailed in the paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (Shao Z., 2024), DeepSeek researchers used a mixture-of-experts model which they trained under a strategy they call "Group Relative Policy Optimization" (GRPO). Under GRPO, computational costs are sharply reduced by eliminating the need for a second "critic" model to judge the reasoning of the model in training.
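For readers who want the math, the group-relative advantage at the heart of GRPO can be written roughly as follows (notation paraphrased from the DeepSeekMath paper, with per-token averaging omitted for brevity):

```latex
% For each question q, sample a group of G outputs o_1,...,o_G with rule-based rewards r_1,...,r_G.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}

% PPO-style clipped objective averaged over the group, with a KL penalty to a reference policy
% (no learned value/critic model is needed to compute \hat{A}_i).
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,\hat{A}_i,\;
      \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)
    - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right]
```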
DeepSeek had, by 2025, published several papers and open-source models approaching state-of-the-art performance in mathematical reasoning and coding. While the DeepSeek team did have use of existing open-source models and public APIs, to dismiss the real innovations in their techniques would be a mistake. DeepSeek-R1 and the strategies behind it represent a shift in priorities common in any industry where a resource becomes limited – a shift away from “scale is all you need” or “no replacement for displacement” and towards an optimization for efficiency.
References
Buchholz, K. (2024, August 23). The Extreme Cost of Training AI Models. Forbes.
DeepSeek-AI, A. L. (2024). DeepSeek-V3 Technical Report. arXiv.org.
Kevin Lee, A. G. (2024). Building Meta's GenAI Infrastructure. Engineering at Meta.
Shao Z., W. P. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv.org.
Silver, D. H. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv.org.