DeepSeek V3 & R1: MLA Architecture, DeepSeekMoE, and the Reasoning Revolution
Self-Hosted LLM Series — Part 2/4: Inside the architecture that shocked the AI world

At a Glance
Deep-dive into DeepSeek V3's MLA and DeepSeekMoE architecture. Analysis of why R1 changed the reasoning game with pure RL training. Complete self-host guide: VRAM requirements, quantization table, Ollama/vLLM setup, and R1 prompt engineering patterns.
Series: The Complete Self-Hosted LLM Guide — 4 Parts
- Part 1: Overview & Comparison: DeepSeek, Llama, Qwen, Kimi
- Part 2 (you're here): DeepSeek V3 & R1 — Architecture, Reasoning, Self-Host
- Part 3: Meta Llama 3.3 70B — Enterprise Workhorse
- Part 4: Alibaba Qwen 2.5 — Coding Master & Multilingual Champion
In January 2025, DeepSeek R1 landed and shook the AI world: an open-source model from a Chinese startup had just edged out OpenAI's o1 on AIME 2024 (79.8% vs 79.2%) and MATH-500 (97.3% vs 96.4%). Reported training cost for V3: ~$5.5M (2.788M H800 GPU-hours for the final run, excluding earlier research and ablation experiments), often cited as roughly 1/50th of what frontier labs spend.
The immediate question: How?
The answer lies in two core architectural innovations — MLA and DeepSeekMoE — and an R1 training process that looks nothing like any reasoning model before it.
Part 1: Inside DeepSeek V3's Architecture
MLA — Multi-head Latent Attention: Killing the KV Cache Bottleneck
In standard transformers, Multi-head Attention (MHA) creates a major scaling bottleneck at inference time: every token requires storing a K (Key) and V (Value) vector for every attention head in every layer. As context grows, the KV cache balloons.
Real-world example: Llama 3.3 70B, which already trims its cache with grouped-query attention (8 KV heads), still needs ~40GB at 128K context in FP16 just for the KV cache.
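The arithmetic behind that kind of figure is simple enough to check by hand. Here is a minimal sketch, assuming the commonly published Llama 3.3 70B shape (80 layers, 8 KV heads, head dimension 128) and a 16-bit cache; the function name `kv_cache_bytes` is made up for this example, and a different cache precision or serving setup will give a different total.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence of context_tokens tokens."""
    # Factor 2 = one key vector plus one value vector per KV head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Assumed model shape (not taken from this article): 80 layers, 8 KV heads, head_dim 128.
llama_70b = kv_cache_bytes(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    context_tokens=128 * 1024,
)
print(f"{llama_70b / 2**30:.1f} GiB")  # 40.0 GiB under these assumptions
```

Halving `bytes_per_value` (an 8-bit cache) halves the result, which is why quoted KV cache sizes for the same model can differ.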
DeepSeek solves this with MLA (Multi-head Latent Attention):
Standard MHA:
Token → Q, K, V (separate per head) → Attention → Output
KV cache per token, per layer = num_heads × head_dim × 2 (K+V)
DeepSeek MLA:
Token → Q, [K,V] joint low-rank compression → latent vector c_KV → decompress on demand
KV cache per token, per layer = latent_dim, plus a small shared RoPE key
Result: with V3's published dimensions (128 heads of dimension 128, a 512-dimensional latent, and a 64-dimensional RoPE key), MLA caches 576 values per token per layer where equivalent MHA would cache 32,768, roughly 57× fewer. That makes 128K context genuinely practical, not just theoretically supported.
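The compress-then-decompress flow can be sketched in a few lines of plain Python. This is a toy illustration, not DeepSeek's implementation: the sizes are tiny and hypothetical, the weights are random, all helper names are invented here, and the separate RoPE key path that real MLA carries is left out.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical, chosen only to keep the example small).
HIDDEN, LATENT, HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
W_down = random_matrix(LATENT, HIDDEN, rng)            # joint K/V down-projection
W_up_k = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> keys for all heads
W_up_v = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> values for all heads


def compress(hidden_state):
    """Return the latent vector c_KV, the only thing cached for this token."""
    return matvec(W_down, hidden_state)


def decompress(c_kv):
    """Rebuild full keys and values from the cached latent when attention needs them."""
    return matvec(W_up_k, c_kv), matvec(W_up_v, c_kv)


token = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
c_kv = compress(token)
keys, values = decompress(c_kv)

mha_cached = len(keys) + len(values)  # what plain MHA would store for this token
mla_cached = len(c_kv)                # what the latent approach stores instead
print(mha_cached, mla_cached)         # 32 4
```

The saving comes entirely from the ratio between `HEADS * HEAD_DIM * 2` and `LATENT`; the trade-off is the extra up-projection work when keys and values are rebuilt.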
DeepSeekMoE: When Expertise Actually Specializes
MoE isn't new — but DeepSeek does it differently in two critical ways:
1. Fine-grained Expert Segmentation:
| | Standard MoE | DeepSeekMoE |
|---|---|---|
| Expert count | ~16 large experts | 256 small experts per layer |
| Routing | Top-2 of ~16 | Top-8 of 256 |
| Specialization depth | Low | Much higher |
By shrinking experts, each one can specialize in a much narrower slice of knowledge. The router then combines 8 highly specific micro-experts per token instead of 2 large generalist ones, which gives it far more possible expert combinations to choose from.
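To make the routing difference concrete, here is a rough sketch of generic top-k gating along with the number of expert combinations each setup allows. It assumes plain softmax gating with renormalized weights, which is a common textbook formulation and not necessarily DeepSeek's exact gating function; `route_top_k` and the random router scores are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


rng = random.Random(0)
scores = [rng.gauss(0, 1) for _ in range(256)]  # stand-in for one token's router logits
gates = route_top_k(scores, k=8)                # expert index -> mixing weight

# Distinct expert subsets the router can pick per token.
coarse = math.comb(16, 2)   # top-2 of 16
fine = math.comb(256, 8)    # top-8 of 256
print(coarse)               # 120
print(f"{fine:.1e}")        # on the order of 1e14
```

The token's output would be the weighted sum of the 8 selected experts' outputs using `gates`; the combination counts show why finer experts give the router so much more room to tailor the mix.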