
DeepSeek V4 Technical Report: Architecture, Training, and Benchmarks
The DeepSeek V4 technical report describes a preview V4 family with two Mixture-of-Experts language models:
- DeepSeek V4 Pro: 1.6T total parameters, 49B activated parameters, 1M context.
- DeepSeek V4 Flash: 284B total parameters, 13B activated parameters, 1M context.
What the technical report focuses on
The report frames DeepSeek V4 around efficient long-context intelligence. The headline product implication is simple: both V4 Pro and V4 Flash expose a 1M-token context window, but they target different cost and capability envelopes.
Pro is the higher-capacity model for hard reasoning, coding, and agentic workflows. Flash is the lower-cost model for high-volume chat, summarization, routing, and everyday product paths.
Architecture notes
The report highlights several architecture and optimization upgrades:
- Hybrid attention for long-context efficiency.
- Manifold-Constrained Hyper-Connections for stronger signal propagation.
- Muon optimizer for training stability and convergence.
- MoE scaling with separate Pro and Flash model sizes.

Use the architecture section to decide what to measure, not as a substitute for measuring your own prompts.
For builders, the practical question is not just which model has the larger parameter count. The question is where longer context, cache behavior, and reasoning effort change the cost-quality curve.
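As a concrete way to reason about that curve, here is a minimal cost sketch comparing the two models under placeholder per-token prices and a prompt-cache discount. All prices and the cache-discount factor are illustrative assumptions, not figures from the report or the pricing page.

```python
# Sketch: compare per-request cost for Flash vs Pro under placeholder prices.
# All prices below are hypothetical illustrations, NOT real DeepSeek pricing.
PRICES_PER_MTOK = {
    "deepseek-v4-flash": {"input": 0.10, "output": 0.40},
    "deepseek-v4-pro": {"input": 0.60, "output": 2.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0, cache_discount: float = 0.9) -> float:
    """Estimated USD cost; cached input tokens get cache_discount off."""
    p = PRICES_PER_MTOK[model]
    effective_input = input_tokens * (1 - cached_fraction * cache_discount)
    return (effective_input * p["input"] + output_tokens * p["output"]) / 1_000_000

# A long, mostly cached prompt can shift which model is economical:
flash = request_cost("deepseek-v4-flash", 800_000, 2_000, cached_fraction=0.95)
pro = request_cost("deepseek-v4-pro", 800_000, 2_000, cached_fraction=0.95)
```

The point of the sketch is the shape of the function, not the numbers: once caching covers most of a long context, output tokens and the model's base rate dominate, which is exactly where Flash-versus-Pro routing decisions get made.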
Training and post-training
DeepSeek says the V4 models are pre-trained on more than 32T tokens and then post-trained with a multi-stage process. The release materials describe domain-specific expert cultivation followed by model consolidation.
That matters for product evaluation because one benchmark score is not enough. You should test domain tasks directly: code repair, long document synthesis, tool-use workflows, structured extraction, and high-volume support chat.
Reasoning modes
The technical report and model card describe non-thinking, thinking, and max-thinking modes. In practice:
- Use non-thinking mode for low-risk, fast, low-cost responses.
- Use thinking mode for math, coding, planning, and multi-step reasoning.
- Use max-style reasoning only when the added latency and cost are justified.
The current DeepSeek API pricing page lists deepseek-v4-flash and deepseek-v4-pro as the V4 model IDs.
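Tying the modes to those model IDs, here is a hedged routing sketch. The model IDs come from the pricing page; the task taxonomy and the idea that reasoning effort is a per-request field are illustrative assumptions — check the current API docs for the actual parameter names.

```python
# Sketch: route a request to a model ID and reasoning mode.
# The model IDs are from the pricing page; the task categories and the
# "reasoning" field name are illustrative assumptions, not the real API.
def route(task: str, high_stakes: bool = False) -> dict:
    # Low-risk, high-volume paths: cheapest model, no extended reasoning.
    if task in {"chat", "summarize", "extract", "route"}:
        return {"model": "deepseek-v4-flash", "reasoning": "non-thinking"}
    # Math, coding, planning: thinking mode; escalate the model if stakes are high.
    if task in {"math", "code", "plan"}:
        model = "deepseek-v4-pro" if high_stakes else "deepseek-v4-flash"
        return {"model": model, "reasoning": "thinking"}
    # Everything else (hard agentic work): max-style reasoning, priced in.
    return {"model": "deepseek-v4-pro", "reasoning": "max-thinking"}
```

The design choice worth copying is the default-down posture: every request starts on the cheapest viable path and only explicit signals (task type, stakes) escalate model or reasoning effort.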
Benchmark signals
The release materials include benchmark snapshots across knowledge, coding, long-context, and agentic tasks. The site tracks a few practical anchor scores:
| Model | MMLU-Pro | LiveCodeBench | SWE Verified |
|---|---|---|---|
| DeepSeek V4 Flash Max | 86.2 | 91.6 | 79.0 |
| DeepSeek V4 Pro Max | 87.5 | 93.5 | 80.6 |
Treat these as routing hints, not final product truth. If your application depends on code changes, retrieval quality, or tool calls, build an eval set from your own traffic and compare Flash against Pro with the same prompts.
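One way to run that comparison is a paired eval: the same prompts against both models, then per-task pass rates. The sketch below assumes you have already collected and graded the responses; the record format is an assumption for illustration.

```python
from collections import defaultdict

# Sketch: paired comparison of Flash vs Pro on identical prompts.
# Each record is (task_type, flash_passed, pro_passed); grading is up to you.
def win_rates(records):
    """Per-task pass rates for each model, to guide routing decisions."""
    totals = defaultdict(lambda: {"n": 0, "flash": 0, "pro": 0})
    for task, flash_ok, pro_ok in records:
        t = totals[task]
        t["n"] += 1
        t["flash"] += flash_ok
        t["pro"] += pro_ok
    return {task: {"flash": t["flash"] / t["n"], "pro": t["pro"] / t["n"]}
            for task, t in totals.items()}

rates = win_rates([
    ("code_repair", True, True),
    ("code_repair", False, True),
    ("support_chat", True, True),
    ("support_chat", True, False),
])
```

Even toy data like this makes the routing story concrete: if Pro only wins on code repair, you pay Pro rates only on the code-repair path.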
Implementation checklist
Before adopting DeepSeek V4 in production, verify:
- Which workflows need Pro instead of Flash.
- Whether thinking mode improves your specific task enough to justify the cost.
- How much prompt caching reduces repeated-context cost.
- Whether your longest real documents fit cleanly inside the 1M context window.
- Whether tool-use and JSON outputs are stable enough for your product contracts.
The technical report explains the direction. Your own evals should decide routing, retry behavior, and credit pricing.
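For the last checklist item, a simple stability probe is to validate a batch of sampled outputs against your product contract and track the pass rate. The required-keys contract below is a made-up example, not a DeepSeek schema.

```python
import json

# Sketch: measure how often raw model outputs satisfy a JSON contract.
# REQUIRED_KEYS is a hypothetical product contract, not a DeepSeek schema.
REQUIRED_KEYS = {"intent", "confidence"}

def contract_pass_rate(raw_outputs: list) -> float:
    """Fraction of outputs that parse as JSON objects with the required keys."""
    ok = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            ok += 1
    return ok / len(raw_outputs) if raw_outputs else 0.0

rate = contract_pass_rate([
    '{"intent": "refund", "confidence": 0.9}',
    '{"intent": "refund"}',           # missing a required key
    'Sure! Here is the JSON: {...}',  # not valid JSON at all
])
```

Run a probe like this per model and per reasoning mode before wiring either into a contract-bound product path; a pass rate below your retry budget is a routing signal in itself.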

