DeepSeek V4

5 CREATORS · 5 VIDEOS · 153 CLAIMS

DeepSeek V4 is the latest open-source language model from DeepSeek, offered in two variants: Pro (1.66T total parameters, 49B active) and Flash (13B active). Released under an MIT license alongside a detailed technical paper, it reports near-frontier benchmark scores at a fraction of the cost. Real-world testing is more mixed, especially in coding and UI generation, and reviewers disagree about how good the model is in practice. This cross-analysis synthesizes the views of 5 prominent creators, highlighting areas of agreement (long context, pricing) and sharp disagreement (real-world quality, geopolitical implications).

SUMMARY

Enterprises should evaluate DeepSeek V4’s cost and long-context prowess but independently test real-world coding quality and consider open-source supply-chain risks before migrating from US models.

01

Model Release and Variants

Consensus
DeepSeek V4 was released as two models: Pro and Flash.
Matthew Berman, bycloud, WorldofAI, Bijan Bowen and 4 other creators agree.
Unique Insights
The V4 weights appeared on Hugging Face without warning, followed by an official announcement on X.
Highlights an unconventional release pattern that differs from typical US lab announcements.
02

Technical Specifications

Consensus
V4 Pro has approximately 1.66 trillion total parameters with 49 billion active parameters.
Matthew Berman, bycloud, WorldofAI, Bijan Bowen and 4 other creators agree.
V4 Flash uses 13 billion active parameters and both models natively support a 1 million token context window.
Matthew Berman, bycloud, WorldofAI, Bijan Bowen and 4 other creators agree.
The architecture is a Mixture of Experts (MoE) design.
Matthew Berman, bycloud, WorldofAI and 3 other creators agree.
Diverse Views
Total parameter count of V4 Flash: 284 billion vs 158 billion.
View A: Flash has 284 billion total parameters
Official technical report and model page list 284B total with 13B active.
View B: Flash has 158 billion total parameters
Bijan stated the Flash variant has 158B parameters, possibly based on early information or a different variant.
Editor's Note: Multiple authoritative sources align on 284B; the 158B figure may be a transcription error or refer to an earlier incomplete set of specifications.
Unique Insights
V4 Pro outputs can reach 384K tokens at most.
Adds a specific output-length limit not mentioned by other reviewers.
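The parameter counts quoted above imply how sparse each model is per token. The following is a minimal sketch of that arithmetic, using only the figures cited in this section (1.66T/49B for Pro, 284B/13B for Flash); the function name `active_fraction` is ours, not something from the technical report.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Share of parameters used per token in an MoE model (both in billions)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# Figures as quoted in this section; 284B is the Flash total from View A.
pro = active_fraction(49, 1660)
flash = active_fraction(13, 284)

print(f"Pro activates {pro:.1%} of its parameters per token")
print(f"Flash activates {flash:.1%} of its parameters per token")
```

Under these numbers Pro activates roughly 3% of its weights per token and Flash roughly 4.6%, which is the sense in which a trillion-parameter MoE can be served at a much smaller per-token compute cost.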
03

Efficiency and Compute Reduction

Consensus
V4 Pro drastically reduces FLOPs and KV cache relative to V3.2, using about 27% of the compute and only 10% of the KV cache memory.
AI Search, bycloud and 2 other creators agree.
Unique Insights
V4 Pro’s KV cache is reduced by 34‑49 times compared to a GQA baseline such as Llama 2/3.
Quantifies the improvement against a widely used attention baseline rather than just the previous DeepSeek version.
The paper reports exactly 3.7x lower FLOPs than the previous V3.2.
Provides a precise multiplier that grounds the efficiency claims in the technical paper.
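The "27% of the compute" and "3.7x lower FLOPs" figures above are the same claim stated two ways. A quick check, as a sketch using only the numbers quoted in this section (the helper name `share_from_reduction` is ours):

```python
def share_from_reduction(reduction_factor: float) -> float:
    """Convert an 'N times lower' factor into the remaining share of the baseline."""
    if reduction_factor <= 0:
        raise ValueError("reduction factor must be positive")
    return 1.0 / reduction_factor


# 3.7x lower FLOPs than V3.2, as quoted from the paper.
compute_share = share_from_reduction(3.7)

# "Only 10% of the KV cache" corresponds to a 10x reduction.
kv_reduction = 1.0 / 0.10

print(f"compute share vs V3.2: {compute_share:.0%}")
print(f"KV cache reduction vs V3.2: {kv_reduction:.0f}x")
```

The 34 to 49 times figure is measured against a different baseline (GQA), so it is not directly comparable with the 10x figure relative to V3.2.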
04

API Pricing

Consensus
Flash pricing is extremely low, on the order of cents per million tokens, making it cost-competitive with any commercial offering.
Matthew Berman, WorldofAI, bycloud and 3 other creators agree.
Diverse Views
Actual Pro output token price: ~$0.87/M versus $3.48/M.
View A: Pro output costs roughly $0.87 per million tokens
bycloud listed $0.87 as the DeepSeek API output price, possibly referring to Flash or a discounted Pro tier.
View B: Pro output costs $3.48 per million tokens
Both authors explicitly state $3.48 per million output tokens for the Pro model, which aligns with DeepSeek’s official Pro pricing.
Editor's Note: The $0.87 figure likely reflects the Flash tier; always verify the model variant when comparing costs. Pro output pricing is genuinely high, making total cost sensitive to output token volume.
DeepSeek’s profit margins on these prices.
View A: Estimated 50‑70% margin
Extrapolated from earlier V3 margins and the company’s aggressive permanent discounts.
View B: Not explicitly estimated
No other reviewer quantified margins; WorldofAI argued cost-efficiency alone doesn’t guarantee quality.
Editor's Note: bycloud’s margin estimate is speculative but based on publicly stated efficiency numbers; treat as an expert guess rather than confirmed fact.
Unique Insights
DeepSeek made a temporary cache-hit price discount permanent.
Signals a strategic commitment to undercut competitors permanently rather than running short-term promotions.
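The editor's note above points out that total cost is sensitive to output token volume. A small sketch of that effect follows. The per-million prices in it are placeholders chosen for illustration, not DeepSeek's published rates, and `request_cost` is a hypothetical helper; substitute the current prices for the variant you actually use.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


# Placeholder prices (assumptions for illustration only).
INPUT_PRICE = 1.00   # dollars per million input tokens
OUTPUT_PRICE = 4.00  # dollars per million output tokens

# Long-context summarisation: large input, short output.
summarise = request_cost(900_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)

# Code generation with long reasoning: small input, large output.
generate = request_cost(2_000, 100_000, INPUT_PRICE, OUTPUT_PRICE)

# Share of the generation bill that comes from output tokens.
output_share = (100_000 / 1_000_000 * OUTPUT_PRICE) / generate

print(f"summarise: ${summarise:.3f}")
print(f"generate:  ${generate:.3f} ({output_share:.0%} from output tokens)")
```

When comparing models, run this kind of estimate on your own token mix: a reasoning-heavy workload can be dominated by the output price even when the input price looks negligible.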
05

Performance Benchmarks

Consensus
On standard academic benchmarks, V4 Pro scores are close to frontier closed models like Opus 4.7, GPT‑5.5, and Gemini 3.1 Pro, often beating previous open-source records.
Matthew Berman, AI Search, bycloud, Bijan Bowen and 4 other creators agree.
Unique Insights
V4 Pro achieved a perfect 120/120 on the Putnam 2025 undergraduate mathematics benchmark.
A striking domain‑specific result not emphasized by other reviewers.
DeepSeek left some benchmark entries blank when comparing against Kimi K2.6 and GLM 5.1 because their APIs were too busy, signalling serving capacity issues.
Raises questions about the reproducibility of third‑party comparisons and hints at infrastructure strains.
06

Real-World Performance

Diverse Views
Practical coding, UI generation, and creative task quality of DeepSeek V4.
View A: Near state‑of‑the‑art, sufficient for most use cases, competitive with closed models.
Proponents cite benchmark parity and cost advantages, arguing that near-frontier intelligence is good enough for enterprise adoption.
View B: Subpar, lazy, benchmark-maxed; often lagging behind other Chinese models like Kimi K2.6, GLM 5.1, and MiniMax.
Multiple hands-on tests (browser OS, SVG, 3D objects, UI clones) produced buggy or inferior results compared to competitors, suggesting optimisation for benchmarks rather than real tasks.
Editor's Note: Bijan Bowen’s extensive testing found mixed results: impressive webOS and drum kit generation but glitches in games and non‑functional app features; thinking mode markedly improved output quality. Your mileage may vary depending on task type and prompt engineering.
Unique Insights
Thinking mode (DeepSeek’s reasoning mode) dramatically improved 3D printer simulation accuracy, while non‑thinking mode produced basic pancake stacking.
Demonstrates that test‑time compute scaling can turn a mediocre output into a polished result, highlighting the importance of selecting the right reasoning setting.
In one test, a terminal generated by V4 Pro could move windows and change desktop backgrounds via commands.
A rare functional integration beyond static UI generation, showing potential for interactive agentic behaviour.
V4 Flash sometimes performed better than Pro in certain prompting scenarios.
Suggests that the larger Pro model does not always dominate and that Flash may be more robust for some practical tasks.
07

Attention Mechanisms

Consensus
DeepSeek V4 uses Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to efficiently handle 1M‑token contexts.
AI Search, bycloud and 2 other creators agree.
Unique Insights
A Lightning indexer rapidly selects only the most relevant compressed blocks, skipping the rest.
Explains how the attention mechanism avoids wasting compute on irrelevant tokens, a detail not mentioned by bycloud.
CSA and HCA are interleaved 1:1 and both branches keep a 128‑token sliding window for recent precise context.
Reveals the exact mixing ratio and the use of sliding windows, which helps practitioners understand the trade‑off between compression and local fidelity.
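The selection logic described above (an indexer that keeps only the most relevant compressed blocks, plus a 128-token sliding window of recent context) can be illustrated with a toy sketch. This is not DeepSeek's implementation: the scores stand in for whatever the Lightning indexer computes, the function names are hypothetical, and the choice to start the 1:1 interleave with CSA is an assumption.

```python
def select_context(block_scores, top_k, seq_len, window=128):
    """Pick what a query at the end of the sequence would attend to.

    block_scores: one relevance score per compressed block (stand-in for
    the indexer's output). Returns (chosen block indices, recent token range).
    """
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    # Rank blocks by score and keep only the top_k; the rest are skipped.
    ranked = sorted(range(len(block_scores)),
                    key=lambda b: block_scores[b], reverse=True)
    chosen = sorted(ranked[:top_k])
    # The most recent tokens are always kept uncompressed.
    recent = range(max(0, seq_len - window), seq_len)
    return chosen, recent


def layer_pattern(num_layers):
    """1:1 interleave of the two attention types (starting type is assumed)."""
    return ["CSA" if i % 2 == 0 else "HCA" for i in range(num_layers)]


blocks, recent = select_context([0.1, 0.9, 0.3, 0.7], top_k=2, seq_len=1000)
print("blocks kept:", blocks)
print("recent tokens:", recent.start, "to", recent.stop - 1)
print("layers:", layer_pattern(4))
```

The point of the sketch is the trade-off the reviewers describe: distant context is reached only through a few selected compressed blocks, while the sliding window preserves exact detail for the most recent tokens.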
08

Training and Optimization

Consensus
Training employed the Muon optimizer and a curriculum strategy that gradually increased sequence length from 4K to 1M tokens.
AI Search, bycloud and 2 other creators agree.
Unique Insights
DeepSeek used manifold-constrained hyper-connections (mHC) with a 20-iteration Sinkhorn algorithm per layer to stabilise trillion-parameter training.
Provides an extremely in‑depth look at a novel stability technique, including the low overhead of 6.7% via fused GPU kernels.
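For readers unfamiliar with the Sinkhorn algorithm mentioned above: in its generic form it alternately normalises the rows and columns of a positive matrix until the matrix is approximately doubly stochastic (every row and column sums to 1). The sketch below shows that generic iteration with 20 steps in plain Python. How exactly the result is used inside mHC is not described in this analysis, so treat the connection as an assumption; the production version reportedly runs in fused GPU kernels.

```python
import math


def sinkhorn(logits, iterations=20):
    """Approximately project a square matrix of logits onto doubly stochastic form."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iterations):
        # Row normalisation: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalisation: each column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


mixing = sinkhorn([[0.2, -0.1, 0.4],
                   [0.0, 0.3, -0.2],
                   [-0.3, 0.1, 0.5]])

row_sums = [sum(row) for row in mixing]
col_sums = [sum(mixing[i][j] for i in range(3)) for j in range(3)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

A doubly stochastic mixing matrix neither amplifies nor shrinks the total signal it redistributes, which is one plausible reason such a constraint helps keep very deep, very large training runs stable.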
Post‑training used separate specialist models for math, coding, and agents, distilled into a unified model via on‑policy distillation.
Shows a cleaner alternative to direct RLHF on a single model, possibly explaining strong specialised benchmarks without degrading general intelligence.
V4 uses FP4 quantisation-aware training for MoE expert weights, so the model learns to tolerate extremely low-precision inference.
A cutting‑edge quantisation strategy that directly improves serving efficiency and is rarely described in other open‑source reports.
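To make the FP4 idea above concrete, here is a minimal sketch of "fake quantisation", the core operation in quantisation-aware training: weights are snapped to the nearest value a 4-bit float can represent, and the forward pass uses the snapped values so the model learns around the rounding error. The sketch assumes the common E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a single per-tensor scale. Whether DeepSeek uses this layout or a finer block-wise scale is not stated in this analysis, and the function name is hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_fp4(weights):
    """Round each weight to the nearest scaled FP4 value (per-tensor scale)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)
    # Map the largest weight onto the largest FP4 magnitude (6.0).
    scale = max_abs / 6.0
    quantized = []
    for w in weights:
        nearest = min(FP4_E2M1_MAGNITUDES,
                      key=lambda g: abs(abs(w) / scale - g))
        quantized.append(math.copysign(nearest * scale, w))
    return quantized


original = [6.0, -2.6, 0.2, 0.0]
snapped = fake_quantize_fp4(original)
print("original:", original)
print("snapped: ", snapped)
```

In a real training loop the rounding step is not differentiable, so gradients are typically passed through it unchanged (a straight-through estimator); that part is omitted here.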
09

Open Source and Transparency

Consensus
Model weights are freely available on Hugging Face under the MIT license, and DeepSeek published an extensive technical paper.
Matthew Berman, AI Search, bycloud, WorldofAI, Bijan Bowen and 5 other creators agree.
Unique Insights
The white paper is exceptionally honest about failures, more so than any closed‑source US lab.
Positions DeepSeek as a leader in research transparency, which could become a standard that Western labs are pressured to follow.
10

Geopolitical and Economic Implications

Unique Insights
DeepSeek V4’s low cost and open-source nature threaten US economic dominance by making Chinese AI infrastructure a strategic dependency. The reviewer argues this could lead to control over cultural narratives, and to economic collapse if US AI investments fail to produce returns.
The only reviewer to analyse far‑reaching political and economic consequences beyond technical benchmarks, including a call for the US to push open‑source or drastically cut costs.
US export controls are partially bypassed by China through algorithmic innovation and likely hardware smuggling.
Adds a concrete layer to the geopolitical narrative, quoting Jensen Huang’s argument that selling US chips is better for long‑term influence.
DeepSeek’s alleged distillation attack involved only 150K exchanges, far fewer than those attributed to other Chinese labs. That volume is insufficient to explain the model’s quality and could simply reflect benchmark comparisons.
Counters the narrative of industrial‑scale theft specifically against DeepSeek, while acknowledging broader security concerns.
11

Hardware and Infrastructure Constraints

Unique Insights
DeepSeek has significantly less compute, no top NVIDIA chips, and a team about 40 times smaller than OpenAI’s.
Quantifies the resource asymmetry, making the resulting model quality even more remarkable.
DeepSeek expects price reductions after deploying 950 super nodes in the second half of the year.
A forward‑looking infrastructure plan that directly addresses current capacity limitations noted in the white paper.
Inference stack is optimised for Huawei chips with day‑zero support, and the CSA indexer is pushed into lower precision.
Demonstrates a deliberate move away from NVIDIA dependency, aligned with China’s self‑sufficiency goals.