
DeepSeek V4 Benchmark: Pro and Flash Scores
The DeepSeek V4 release materials include benchmark rows for DeepSeek V4 Flash and for DeepSeek V4 Pro running in Max mode.

Benchmarks are useful as a first routing signal, but production defaults should still be decided with prompts from your own workload.

Official snapshot
| Model | MMLU-Pro | LiveCodeBench | SWE-bench Verified |
|---|---|---|---|
| DeepSeek V4 Flash | 86.2 | 91.6 | 79.0 |
| DeepSeek V4 Pro | 87.5 | 93.5 | 80.6 |
Sources: DeepSeek-V4-Pro model card and DeepSeek_V4.pdf.

What the numbers suggest
Pro leads on all three benchmarks, by 1.3 to 1.9 points in this snapshot, and the gap matters most where reasoning and coding ceilings are the bottleneck. Flash is close enough that it can be the default for many high-volume workflows, especially when the task can tolerate a second pass or escalation.
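The "second pass or escalation" pattern can be sketched as a small router: answer with Flash first, and only call Pro when a cheap acceptance check rejects the Flash answer. This is a hedged sketch, not the real DeepSeek API; the model identifiers, `call_model`, and `passes_check` below are illustrative placeholders you would replace with your own client and validator.

```python
# Hedged sketch: Flash-first routing with escalation to Pro.
# Model ids and call_model() are illustrative stand-ins, not a real API.

FLASH = "deepseek-v4-flash"  # assumed id, for illustration only
PRO = "deepseek-v4-pro"      # assumed id, for illustration only


def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call; returns a canned answer here."""
    return f"{model} answer to: {prompt}"


def passes_check(answer: str) -> bool:
    """Cheap acceptance check; swap in your own validator or grader."""
    return "error" not in answer.lower()


def answer_with_escalation(prompt: str) -> tuple[str, str]:
    """Try Flash first; escalate to Pro only when the check fails."""
    first = call_model(FLASH, prompt)
    if passes_check(first):
        return FLASH, first
    return PRO, call_model(PRO, prompt)
```

The design point is that escalation converts Pro's per-token premium into a tail cost: you pay it only on the fraction of requests Flash cannot handle.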

How to evaluate in production
Do not ship on public benchmarks alone. Build a small internal eval set with your real prompts:
- 20 frequent user requests
- 20 difficult edge cases
- 20 code or reasoning tasks
- 10 long-context tasks
Run each prompt through Flash first and Pro second, then compare correctness, latency, and cost per task. The best default is usually workload-specific.
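The eval loop above can be sketched as a small harness that runs the same prompt set through both models and tallies the three metrics. Everything model-specific here is an assumption: `run_model` is a stand-in for a real API client, and the per-call prices are invented for illustration, not actual DeepSeek pricing.

```python
# Hedged sketch of an internal eval loop: run the same cases through
# two models and aggregate correctness, latency, and rough cost.
import time

PRICE_PER_CALL = {"flash": 0.001, "pro": 0.005}  # illustrative, not real pricing


def run_model(name: str, prompt: str) -> str:
    """Placeholder model call; echoes a deterministic answer."""
    return prompt.upper()


def evaluate(name: str, cases: list[tuple[str, str]]) -> dict:
    """cases: (prompt, expected) pairs; returns aggregate metrics."""
    correct, total_latency = 0, 0.0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = run_model(name, prompt)
        total_latency += time.perf_counter() - start
        correct += int(answer == expected)
    n = len(cases)
    return {
        "accuracy": correct / n,
        "avg_latency_s": total_latency / n,
        "cost_usd": n * PRICE_PER_CALL[name],
    }


cases = [("hello", "HELLO"), ("world", "WORLD")]
report = {m: evaluate(m, cases) for m in ("flash", "pro")}
```

With real clients plugged in, the resulting report makes the default-model decision concrete: pick Flash unless Pro's accuracy gain on your own cases justifies its latency and cost.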

