Understanding DeepSeek R1
DeepSeek-R1 is a model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, and in many benchmarks even surpass, OpenAI's o1 model, it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cheap, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials
The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag before answering with a final summary.
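As a rough illustration (the content below is made up, but the shape matches the documented think-then-answer format):

```python
# Illustrative only: the shape of an R1-style response, with the chain-of-thought
# inside <think> tags followed by a short final answer.
response = """<think>
The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.
</think>

17 * 24 = 408."""
```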
R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains excellent accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:
The usual training strategy: pretraining on a big dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multi-stage training pipeline with multiple SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general abilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.
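As a rough sketch of how that looks in practice (model names, prompts, and scale below are stand-ins, not DeepSeek's code; the actual distilled models were trained on the large curated R1 dataset): the teacher generates reasoning traces, and the student is fine-tuned on them with plain SFT.

```python
# Minimal sketch of trace-based distillation. Placeholder models and toy prompts;
# in practice the teacher's traces are generated and filtered at much larger scale.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

teacher_name = "deepseek-ai/DeepSeek-R1"   # teacher (realistically served remotely, not loaded locally)
student_name = "Qwen/Qwen2.5-7B"           # smaller student to be distilled

tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompts = ["Solve step by step: 12 * 17 = ?", "What is the derivative of x^3?"]

# 1) The teacher produces reasoning traces (chain-of-thought plus final answer).
records = []
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=512)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    records.append({"text": p + completion})

# 2) The student is fine-tuned on the teacher's traces with ordinary SFT (no RL).
trainer = SFTTrainer(
    model=student_name,
    train_dataset=Dataset.from_list(records),
    args=SFTConfig(output_dir="r1-distill-sketch"),
)
trainer.train()
```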
Group Relative Policy Optimization (GRPO)
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected <think>/<answer> format, and if the language of the answer matches that of the prompt.
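To make that concrete, here is a toy version of such a rule-based reward (my own illustration, not DeepSeek's implementation):

```python
import re

# Toy rule-based reward in the spirit described above: reward correctness,
# the expected <think>-style format, and answering in the prompt's language.
def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    reward = 0.0

    # 1) Format: reasoning must appear inside <think>...</think> before the answer.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5

    # 2) Correctness: compare the final answer against a known reference.
    final_answer = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    # 3) Language consistency: crude heuristic (an English prompt should get a
    #    mostly-ASCII answer); a real check would use a language-ID model.
    if prompt.isascii() and final_answer.isascii():
        reward += 0.25

    return reward
```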
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works (a short code sketch of the update follows the four steps):
1. For each input prompt, the model generates several different responses.

2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.

3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.

4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.
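Here is the sketch promised above, covering steps 3 and 4 for a single prompt. The per-sequence treatment, constants, and names are my own simplifications; real implementations work per token and batch many prompts.

```python
import torch

def grpo_loss(rewards, logprobs_new, logprobs_old, logprobs_ref,
              clip_eps=0.2, kl_coeff=0.04):
    # 1) Group-relative advantage: standardize each response's reward against
    #    the group of responses sampled for the same prompt (step 3).
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 2) PPO-style clipped surrogate objective, using the probability ratio
    #    between the current policy and the sampling (old) policy (step 4).
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # 3) KL penalty towards the reference policy keeps the model from drifting
    #    too far from its original behavior.
    delta = logprobs_ref - logprobs_new
    kl = delta.exp() - delta - 1  # unbiased "k3" estimator of the KL divergence
    return policy_loss + kl_coeff * kl.mean()

# Example: 4 sampled responses for one prompt, scored by rule-based rewards.
rewards = [1.0, 0.0, 0.5, 1.5]
logp_new = torch.tensor([-12.0, -15.0, -13.0, -11.0], requires_grad=True)
logp_old = logp_new.detach().clone()
logp_ref = logp_old.clone()
loss = grpo_loss(rewards, logp_new, logp_old, logp_ref)
loss.backward()
```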
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a reward when the model correctly uses the thinking-tag syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
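For orientation, a GRPO run with TRL looks roughly like the following (adapted from the pattern in TRL's documentation; exact argument names may shift between TRL versions, and the dataset and reward here are toys):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset; TRL expects a "prompt" column.
dataset = Dataset.from_list([
    {"prompt": "What is 13 * 7?"},
    {"prompt": "Name a prime number greater than 100."},
])

# Trivial rule-based reward: prefer short completions (stand-in for real checks).
def reward_short(completions, **kwargs):
    return [-float(len(c)) / 100 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_short,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=dataset,
)
trainer.train()
```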
Finally, Yannic Kilcher has an excellent video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the techniques they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
"These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."

Simply put, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses than about endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
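A toy calculation illustrates the point (the accuracies below are invented, purely to show the mechanism): if the pretrained model already solves a task on some fraction of samples, RL mostly raises the single-sample success rate, while pass@k was already high.

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1 - (1 - p_single) ** k

base_p, rl_p = 0.2, 0.7  # assumed per-sample accuracies, purely illustrative
for label, p in [("base", base_p), ("after RL", rl_p)]:
    print(f"{label:9s} pass@1 = {pass_at_k(p, 1):.2f}   pass@16 = {pass_at_k(p, 16):.2f}")
# base      pass@1 = 0.20   pass@16 = 0.97
# after RL  pass@1 = 0.70   pass@16 = 1.00
```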
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
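For reference, an invocation along these lines covers that setup (the GGUF path and the exact values are illustrative, not a transcript of my command):

```bash
# Partial offload of 29 layers to the GPU plus a 4-bit quantized K cache.
./llama-cli \
  --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --cache-type-k q4_0 \
  --n-gpu-layers 29 \
  --threads 16 \
  --ctx-size 8192 \
  --temp 0.6 \
  --prompt "<|User|>How many r's are in 'strawberry'?<|Assistant|>"
```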
Performance:
A r/localllama user described that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite usable for any serious work, but it's fun to run these big models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
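The matching Ollama tag should be something like the following (as far as I can tell, the 70b tag is the 4-bit K-M quantized Llama distill):

```bash
ollama pull deepseek-r1:70b
ollama run deepseek-r1:70b
```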
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
Resources
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).

DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.

The Illustrated DeepSeek-R1 - by Jay Alammar.

Explainer: What's R1 & Everything Else? - Tim Kellogg.

DeepSeek R1 Explained to your grandmother - YouTube
DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.