DeepSeek V3 & R1: MLA Architecture, MoE, and Self-Host Guide

Series: The Complete Self-Hosted LLM Guide — 4 Parts

Part 1: Overview & Comparison: DeepSeek, Llama, Qwen, Kimi

Part 2 (you're here): DeepSeek V3 & R1 — Architecture, Reasoning, Self-Host

Part 3: Meta Llama 3.3 70B — Enterprise Workhorse

Part 4: Alibaba Qwen 2.5 — Coding Master & Multilingual Champion

In January 2025, DeepSeek R1 landed and shook the AI world: an open-source model from a Chinese startup had just edged out OpenAI's o1 on AIME 2024 (79.8% vs 79.2%) and MATH-500 (97.3% vs 96.4%). The reported training cost for V3, the base model behind R1, was ~$5.5M in GPU time for the final training run, roughly 1/50th of commonly cited estimates for frontier-lab training runs.

The immediate question: How?

The answer lies in two core architectural innovations — MLA and DeepSeekMoE — and an R1 training process that looks nothing like any reasoning model before it.

Part 1: Inside DeepSeek V3's Architecture

MLA — Multi-head Latent Attention: Killing the KV Cache Bottleneck

In a standard transformer, multi-head attention (MHA) creates the biggest inference-time scaling bottleneck: for every token in the context, the model must store a key (K) and a value (V) vector for every attention head in every layer. As context grows, this KV cache grows linearly with it and quickly dominates memory.

Real-world example: Llama 3.3 70B at 128K context needs roughly 40GB in FP16 just for the KV cache, even though it already uses grouped-query attention to shrink it.
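
The size of that cache can be checked with a back-of-the-envelope calculation. The sketch below assumes a Llama 3 70B-class configuration (80 layers, 64 query heads, 8 KV heads, head dimension 128) and FP16 storage; these values are assumptions for illustration, and the function name `kv_cache_bytes` is hypothetical. The result lands in the same range as the figure quoted above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


context = 128 * 1024  # 128K tokens

# Assumed config: 80 layers, 8 KV heads (grouped-query attention), head_dim 128.
llama_gqa = kv_cache_bytes(80, 8, 128, context)

# The same model if each of its 64 query heads kept its own K/V (plain MHA).
llama_mha = kv_cache_bytes(80, 64, 128, context)

print(f"GQA: {llama_gqa / 2**30:.0f} GiB, MHA: {llama_mha / 2**30:.0f} GiB")
```

Under these assumptions the cache costs 320 KiB per token, so a single 128K-token sequence uses about 40 GiB, and plain MHA would need eight times that.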

DeepSeek solves this with MLA (Multi-head Latent Attention):

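Standard MHA:
Token → Q, K, V (separate per head) → Attention → Output
KV cache = num_heads × head_dim × 2 (K+V) per token, per layer

DeepSeek MLA:
Token → Q, [K,V] joint low-rank compression → latent vector c_KV
→ per-head K and V are reconstructed from c_KV when attention is computed
KV cache = latent_dim per token, per layer (plus a small shared positional key)

Result: only the compressed latent is cached, so the KV cache is a small fraction of the MHA equivalent. The DeepSeek-V2 paper, which introduced MLA, reports a 93.3% smaller KV cache than the dense DeepSeek 67B and 5.76× higher maximum generation throughput. That makes 128K context genuinely practical, not just theoretically supported.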
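
The compress-then-reconstruct idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepSeek's implementation: the dimensions are made up, the weights are random, and the helper names (`append_token`, `keys_and_values`) are hypothetical. It also leaves out the separate positional (RoPE) key and the query-side compression that the real architecture uses.

```python
import random

random.seed(0)


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only (not DeepSeek's real dimensions).
d_model, n_heads, head_dim, d_latent = 64, 4, 16, 8

w_down = rand_matrix(d_latent, d_model)             # hidden state -> latent c_KV
w_up_k = rand_matrix(n_heads * head_dim, d_latent)  # latent -> keys for all heads
w_up_v = rand_matrix(n_heads * head_dim, d_latent)  # latent -> values for all heads

latent_cache = []  # the only per-token state that is stored


def append_token(hidden):
    """Compress a token's hidden state and cache only the latent vector."""
    latent_cache.append(matvec(w_down, hidden))


def keys_and_values():
    """Reconstruct K and V for every cached token from the latents."""
    keys = [matvec(w_up_k, c) for c in latent_cache]
    values = [matvec(w_up_v, c) for c in latent_cache]
    return keys, values


for _ in range(5):
    append_token([random.gauss(0, 1) for _ in range(d_model)])

keys, values = keys_and_values()
mha_elems = 2 * n_heads * head_dim  # K and V for every head, per token
mla_elems = d_latent                # one latent vector, per token
print(f"cached per token: MHA {mha_elems} vs MLA {mla_elems} elements")
```

With these toy sizes the cache holds 8 numbers per token instead of 128. The cost is the extra up-projection at attention time, which is why the size of the latent dimension is the main design trade-off.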

DeepSeekMoE: When Expertise Actually Specializes

MoE isn't new — but DeepSeek does it differently in two critical ways:

1. Fine-grained Expert Segmentation:

Aspect	Standard MoE	DeepSeekMoE
Expert count	~16 large experts	256 small experts per layer
Routing	Top-2 of ~16	Top-8 of 256
Specialization depth	Low	Much higher

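Splitting the same capacity into many small experts lets each one specialize in a narrower slice of the data. The router can then combine 8 highly specific micro-experts per token instead of 2 large generalists, which gives it far more possible expert combinations to choose from.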
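
A minimal sketch of top-k routing makes the difference concrete. It assumes a generic gate that applies a softmax over the selected experts only; DeepSeek's actual gating function and load balancing are not described in this section and may differ. The router scores are random stand-ins, and `route_top_k` is a hypothetical helper name.

```python
import math
import random


def route_top_k(scores, k):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]  # softmax over the selected experts only
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
router_scores = [random.gauss(0, 1) for _ in range(256)]  # stand-in for router logits
chosen = route_top_k(router_scores, k=8)

for expert_id, weight in chosen:
    print(f"expert {expert_id:3d}  gate weight {weight:.3f}")

# Number of distinct expert subsets the router can select for one token.
coarse = math.comb(16, 2)   # top-2 of 16
fine = math.comb(256, 8)    # top-8 of 256
print(f"top-2 of 16: {coarse} combinations; top-8 of 256: {fine:.2e}")
```

Top-2 of 16 allows only 120 expert pairs, while top-8 of 256 allows on the order of 10^14 subsets, which is the combinatorial room that fine-grained segmentation buys.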

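R1 vs o1: Real Benchmarks

Benchmark	DeepSeek R1	OpenAI o1
AIME 2024	79.8%	79.2%
MATH-500	97.3%	96.4%
Codeforces Rating	2029	2061
GPQA Diamond	71.5%	75.7%
SWE-bench Verified	49.2%	48.9%

Figures are those reported in the DeepSeek R1 paper, which compares against o1-1217. OpenAI's own published Codeforces rating for o1 is 1891, so that row depends on which source is used. R1 leads narrowly on the math benchmarks and SWE-bench Verified; o1 leads on GPQA Diamond.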

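Pick the Right Variant for Your Hardware

Model	FP16 VRAM	Q8 VRAM	Q4_K_M VRAM	Minimum Hardware
R1-Distill-Qwen-7B	14GB	7GB	4GB	1× RTX 3080
R1-Distill-Qwen-14B	28GB	14GB	8GB	1× RTX 4090
R1-Distill-Qwen-32B	64GB	32GB	18GB	1× RTX 4090 (Q4)
R1-Distill-Llama-70B	140GB	70GB	40GB	2× A100 80GB
Full DeepSeek R1	~1.3TB	~650GB	~350GB	8× A100 80GB (Q4); multi-node cluster above that

VRAM figures are approximate and cover model weights only. The KV cache and activations need additional headroom, which grows with context length and batch size.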
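
The weight sizes in this table follow closely from parameter count times bits per weight. The sketch below reproduces that arithmetic; the 4.5 bits per weight used for Q4_K_M is an assumed average chosen to approximate the table, the 671B parameter count for the full model is not stated in this article, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# FP16 and Q8 are nominal widths; 4.5 bits for Q4_K_M is an assumed average.
FORMATS = {"FP16": 16, "Q8": 8, "Q4_K_M": 4.5}

# Nominal parameter counts; 671e9 for the full model is an assumption of this sketch.
MODELS = {
    "R1-Distill-Qwen-7B": 7e9,
    "R1-Distill-Qwen-32B": 32e9,
    "R1-Distill-Llama-70B": 70e9,
    "Full DeepSeek R1": 671e9,
}

for name, n_params in MODELS.items():
    sizes = ", ".join(
        f"{fmt} {weight_memory_gb(n_params, bits):.0f}GB"
        for fmt, bits in FORMATS.items()
    )
    print(f"{name}: {sizes}")
```

Treat the output as a lower bound when sizing hardware: real quantized files vary by a few percent, and runtime memory for the KV cache comes on top.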

When to Use R1 vs V3

Use Case	Recommended Model	Reason
Code generation	V3	Faster, less verbose output
Multi-step bug analysis	R1	Reasoning chain finds root cause
Math / formal proofs	R1	Significantly higher accuracy
General Q&A / chat	V3	Faster, no CoT overhead
Security code review	R1	Deep analysis of logic flaws
Text generation	V3	No reasoning overhead needed
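
In an application that serves both models, this table can be encoded as a simple lookup. The sketch below is one possible shape, not a prescribed design: the task category strings and the `pick_model` helper are hypothetical names, and the default of V3 for unlisted tasks is an assumption.

```python
# Task categories are hypothetical labels mirroring the rows of the table above.
MODEL_FOR_TASK = {
    "code_generation": "V3",
    "bug_analysis": "R1",
    "math": "R1",
    "chat": "V3",
    "security_review": "R1",
    "text_generation": "V3",
}


def pick_model(task, default="V3"):
    """Return the model suggested for a task category, falling back to the default."""
    return MODEL_FOR_TASK.get(task, default)


print(pick_model("math"))         # R1
print(pick_model("translation"))  # V3 (not in the table, so the default applies)
```

Defaulting to the faster model keeps latency low for unclassified requests; flip the default if wrong answers cost more than slow ones.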
