DeepSeek V4 Benchmark: Pro and Flash Scores

The DeepSeek V4 release materials include benchmark rows for DeepSeek V4 Flash and DeepSeek V4 Pro in Max mode.

Benchmarks are useful as a first routing signal, but production defaults should still be decided with prompts from your own workload.

Official snapshot

Model              MMLU-Pro  LiveCodeBench  SWE Verified
DeepSeek V4 Flash  86.2      91.6           79.0
DeepSeek V4 Pro    87.5      93.5           80.6

Sources: DeepSeek-V4-Pro model card and DeepSeek_V4.pdf.

What the numbers suggest

Pro leads on all three benchmarks in the snapshot: by 1.3 points on MMLU-Pro, 1.9 on LiveCodeBench, and 1.6 on SWE Verified. The gaps are small enough that Flash can be the default for many high-volume workflows, especially when the task can tolerate a second pass or escalation to Pro.
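One way to picture the "second pass or escalation" idea is a small router that tries Flash first and calls Pro only when the draft fails a check. The sketch below is illustrative only: `flash`, `pro`, and `is_acceptable` are hypothetical callables you would supply yourself, not part of any DeepSeek SDK.

```python
from typing import Callable, Tuple


def answer_with_escalation(
    prompt: str,
    flash: Callable[[str], str],                # hypothetical: calls the Flash model
    pro: Callable[[str], str],                  # hypothetical: calls the Pro model
    is_acceptable: Callable[[str, str], bool],  # hypothetical: your own quality check
) -> Tuple[str, str]:
    """Return (model_used, answer), escalating to Pro only when needed."""
    draft = flash(prompt)
    if is_acceptable(prompt, draft):
        return "flash", draft
    # The Flash draft failed the check, so pay for one Pro call.
    return "pro", pro(prompt)
```

The quality check carries most of the weight here. A cheap validator (a schema check, a unit test, a regex) keeps the escalation rate low; a weak one sends bad Flash answers straight through.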

How to evaluate in production

Do not ship on public benchmarks alone. Build a small internal eval set with your real prompts:

  • 20 frequent user requests
  • 20 difficult edge cases
  • 20 code or reasoning tasks
  • 10 long-context tasks

Run Flash first, Pro second, then compare correctness, latency, and cost. The best default is usually workload-specific.
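The comparison above could be run with a harness along these lines. It is a minimal sketch under stated assumptions: `EvalCase`, `run_eval`, and the flat `cost_per_call` figure are hypothetical names introduced for illustration, and each model is any callable that maps a prompt to an answer string. Real pricing is usually per token, so treat the cost field as a placeholder.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    category: str                 # e.g. "frequent", "edge", "code", "long_context"
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer is acceptable


def run_eval(
    model: Callable[[str], str],  # hypothetical: wraps a call to Flash or Pro
    cases: List[EvalCase],
    cost_per_call: float = 0.0,   # assumption: flat cost per request
) -> Dict[str, float]:
    """Run every case through one model and summarize the results."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = 0
    elapsed = 0.0
    for case in cases:
        start = time.perf_counter()
        answer = model(case.prompt)
        elapsed += time.perf_counter() - start
        passed += int(case.check(answer))
    n = len(cases)
    return {
        "accuracy": passed / n,
        "mean_latency_s": elapsed / n,
        "total_cost": cost_per_call * n,
    }
```

Calling `run_eval` once with a Flash wrapper and once with a Pro wrapper on the same case list gives two summaries to compare side by side. Grouping cases by `category` before calling it shows whether Pro's advantage is concentrated in one bucket, such as the long-context tasks.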

D-Chat Team
