author: @himanshutwts

Hi! I hope you’re doing well. It’s been a long time since I’ve posted, but today we’re here to discuss the basic architecture of one of the best open-source models, one that beats Llama 3.1 405B, Qwen, and Mistral.

DeepSeek-v3 is the base model behind DeepSeek-R1.

TL;DR

Introduction

The architecture of DeepSeek-v3 incorporates innovative techniques such as a Mixture of Experts (MoE) with 671B total parameters and 37B activated per token, Multi-head Latent Attention (MLA), and a pretraining run over 14.8T tokens. The model then undergoes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to enhance performance. Key architectural improvements include an auxiliary-loss-free load balancing strategy and a Multi-Token Prediction (MTP) objective, which improves model performance and can also be used to accelerate inference. The model employs FP8 mixed precision during pretraining to address memory and communication bottlenecks, and reasoning capabilities distilled from the DeepSeek-R1 series of models are integrated into DeepSeek-v3.
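To make the 671B-total / 37B-activated distinction concrete, here is a minimal top-k routed MoE sketch in PyTorch. It shows the general MoE pattern only, not DeepSeek's DeepSeekMoE implementation; the layer sizes, number of experts, and `k` below are arbitrary placeholders, and details like shared experts are omitted.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE layer: each token only runs through k experts,
    so the parameters 'activated per token' are a small fraction of the total."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)        # routing probabilities per expert
        weights, idx = scores.topk(self.k, dim=-1)     # pick the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny smoke test: 4 tokens of width 16
moe = TopKMoE(dim=16)
print(moe(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```

Every expert's weights count toward the total parameter budget, but each token only pays the compute of its top-k experts, which is the idea behind the "activated per token" number.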

Andrej Karpathy on DeepSeek-v3

Training pipeline: pre-training, context-length extension, and post-training.


Architecture: an auxiliary-loss-free Load Balancing Strategy, and the Multi-Token Prediction (MTP) objective, for better model performance and inference acceleration.
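As a rough illustration of how an auxiliary-loss-free balancing scheme can work (a minimal sketch, not DeepSeek's actual implementation): a per-expert bias is added to the routing scores only when selecting experts, and that bias is nudged after each step based on observed load, so no extra balancing loss term is needed. The function names, `gamma`, and the exact update rule here are my own illustrative choices.

```python
import torch

def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """Select experts with bias-adjusted scores, but weight their outputs with the
    original scores: the bias only steers *which* experts are chosen."""
    adjusted = scores + bias                        # per-expert bias, broadcast over tokens
    _, idx = adjusted.topk(k, dim=-1)
    weights = scores.gather(-1, idx)                # gating weights still use raw scores
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return idx, weights

def update_bias(bias: torch.Tensor, expert_load: torch.Tensor, gamma: float = 1e-3):
    """Push down the bias of overloaded experts and push up the bias of
    underloaded ones, nudging future routing toward balance."""
    mean_load = expert_load.float().mean()
    return bias - gamma * torch.sign(expert_load.float() - mean_load)

# Toy usage: 6 tokens, 4 experts, route each token to its top-2 experts
scores = torch.rand(6, 4).softmax(dim=-1)
bias = torch.zeros(4)
idx, weights = biased_topk_routing(scores, bias, k=2)
load = torch.bincount(idx.flatten(), minlength=4)   # tokens received per expert
bias = update_bias(bias, load)
```

The appeal of this style of balancing is that it avoids an auxiliary loss that would pull gradients away from the language-modeling objective.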