- Aha moments emerged naturally in RL: Self-correction behaviors like "Wait, let's reevaluate..." arose without SFT.
- Cold-start SFT fixed readability: ~1k structured examples resolved language mixing.
- GRPO cut RL costs by 30%: Group-wise reward normalization outperformed PPO (see the GRPO sketch after this list).
- RL increased CoT length autonomously: Reasoning traces grew from ~100 to ~1k tokens without any explicit length incentive.
- Distillation beat direct RL in small models: SFT on R1-generated data outperformed applying RL directly to the same base models.
- Process rewards failed; outcome rewards worked better: Rule-based final-answer checks stabilized training (see the outcome-reward sketch below).
- XML tags reduced hallucinations 15%: Structured <think>/<answer> tags improved reward clarity (see the format-reward sketch below).
- Language mixing fixed via consistency rewards: Penalized code-switching in multilingual outputs (see the language-consistency sketch below).
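
For the GRPO point: a minimal sketch of group-wise reward normalization, assuming the usual group-relative advantage formulation (function and variable names are mine, not from the paper's code). Each prompt gets a group of sampled responses, and every response is scored relative to its own group's mean and standard deviation, which removes the need for PPO's learned value model.

```python
# Group-relative advantage sketch (hypothetical names, not DeepSeek's implementation).
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its sampling group: A_i = (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # Identical rewards (all right or all wrong) carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled answers to the same prompt, scored 1.0 if correct else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```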
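For the outcome-reward point: a minimal sketch of a rule-based final-answer check. The exact matching rules used in training are not published, so the normalization here is illustrative only.

```python
# Rule-based outcome reward sketch (assumed matching logic).
def outcome_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference after light normalization."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

# Example usage with a numeric answer.
print(outcome_reward(" 72 ", "72"))              # 1.0
print(outcome_reward("the answer is 71", "72"))  # 0.0
```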
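For the <think>/<answer> point: a minimal sketch of a format reward that checks whether a completion follows the tagged template and extracts the answer span so a rule-based check like the one above can score it. The regex and the 0/1 scoring are my assumptions.

```python
# Format reward sketch for the <think>/<answer> template (assumed regex and scoring).
import re

TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning and answer in the expected tags, else 0.0."""
    return 1.0 if TEMPLATE.search(completion) else 0.0

def extract_answer(completion: str) -> str | None:
    """Pull the <answer> block's text for the outcome check."""
    match = TEMPLATE.search(completion)
    return match.group(2).strip() if match else None

# Example usage.
sample = "<think>6 * 12 = 72</think>\n<answer>72</answer>"
print(format_reward(sample), extract_answer(sample))  # 1.0 72
```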
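For the language-consistency point: a crude sketch that scores the chain of thought by the fraction of tokens written in the target language's script (here, Latin-script English), so code-switched outputs earn less reward. The heuristic is my assumption; the paper describes the reward as the proportion of target-language words in the CoT.

```python
# Language-consistency reward sketch (crude script-based heuristic, assumed).
import re

# Crude proxy for "written in the target language": Latin-script words only.
LATIN_WORD = re.compile(r"^[A-Za-z][A-Za-z'\-]*$")

def language_consistency_reward(cot_text: str) -> float:
    """Fraction of whitespace-separated tokens that look like target-language words."""
    tokens = cot_text.split()
    if not tokens:
        return 0.0
    matches = sum(1 for tok in tokens if LATIN_WORD.match(tok.strip(".,!?;:()\"")))
    return matches / len(tokens)

# Example: a chain of thought that code-switches mid-sentence scores below 1.0.
print(language_consistency_reward("Wait, let's reevaluate the 方程 step by step"))
```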
I find it funny that I've seen multiple AI YouTubers explain papers and they just go to another AI to help them in the video, but hey, it does a good job.
https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf