DeepSeek V4

bycloud · 38 Claims

Model Release
Neutral
DeepSeek released two models: DeepSeek V4 Pro and DeepSeek V4 Flash.
Factual statement from the transcript.
Source: How Did DeepSeek Make V4 So Cheap?
Technical Report
Neutral
DeepSeek published a 58-page technical report covering everything they did.
Factual statement from the transcript.
Source: How Did DeepSeek Make V4 So Cheap?
Benchmarks
Neutral
DeepSeek V4 is benchmarked as the third-best open-weights model, behind Kimi K2.6 and MiMo V2.5 Pro.
Referencing benchmark rankings.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
Flash with 13 billion active parameters can reach reasoning performance comparable to GPT-5.2 and Gemini 3.0 Pro when given a larger thinking budget.
Author reports DeepSeek's claim about Flash performance.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
Pro Max is claimed as the strongest open model, beating previous open-source models on reasoning, coding, long context, and agentic benchmarks; on Artificial Analysis it wins Terminal Bench Hard and ranks second or third as an open-weights model.
Performance claims from DeepSeek and third-party benchmarks.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
On reasoning, Pro Max falls slightly behind GPT-5.4 and Gemini 3.1 Pro, which puts it roughly 3 to 6 months behind frontier closed models.
Honest self-assessment from the paper as described by author.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
For agent tasks, Pro Max is on par with Kimi K2.6 and GLM 5.1 but still slightly behind frontier closed models.
Agent task performance comparison.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
Pro Max surpasses Gemini 3.1 Pro on academic long-context benchmarks.
Author highlights a direct win over Gemini on long context.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
DeepSeek left some benchmark entries blank when comparing against Kimi K2.6 and GLM 5.1 because their APIs were too busy to return responses, indicating serving capacity issues.
Anecdote from the paper about missing benchmark results.
Source: How Did DeepSeek Make V4 So Cheap?
Long Context & Pricing
Agree
DeepSeek V4 natively supports a 1-million-token context window with near state-of-the-art retrieval accuracy and is 10 to 100 times cheaper than competitors.
Author highlights this as an achievement no other lab accomplished.
Source: How Did DeepSeek Make V4 So Cheap?
Pricing
Neutral
DeepSeek API pricing: $0.435 per million input tokens and $0.87 per million output tokens, with a cache-hit discount bringing the price to $0.3625 per million tokens.
Factual pricing data.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
Per million tokens, GLM 5.1 costs $1.40 input and $4.40 output; Gemini 3.1 Pro costs $2 input and $12 output; Opus 4.6 costs $5 input and $25 output.
Comparative pricing facts.
Source: How Did DeepSeek Make V4 So Cheap?
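The prices in the two claims above can be compared on a concrete job. The following is a minimal sketch, not an official calculator: the workload (1 million input tokens, 100,000 output tokens) is a hypothetical example, the source does not say which V4 tier the DeepSeek price applies to, and the cache-hit discount is ignored.

```python
# Prices in USD per million tokens (input, output), copied from the claims above.
PRICES = {
    "DeepSeek V4": (0.435, 0.87),
    "GLM 5.1": (1.4, 4.4),
    "Gemini 3.1 Pro": (2.0, 12.0),
    "Opus 4.6": (5.0, 25.0),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at list price (no cache-hit discount)."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def ratio_vs_deepseek(model, input_tokens, output_tokens):
    """How many times more expensive `model` is than DeepSeek V4 for this job."""
    base = job_cost("DeepSeek V4", input_tokens, output_tokens)
    return job_cost(model, input_tokens, output_tokens) / base


if __name__ == "__main__":
    # Hypothetical long-context job: 1M tokens in, 100k tokens out.
    for name in PRICES:
        print(name, round(ratio_vs_deepseek(name, 1_000_000, 100_000), 2))
```

The ratio depends on the input/output mix, because output tokens are priced at a different multiple of input tokens for each provider.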
Agree
No other AI labs can match DeepSeek's pricing without operating at a loss.
Author asserts DeepSeek's cost leadership.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
DeepSeek is likely making a 50 to 70 percent margin on their pricing.
Author's estimate, described as a guess but plausible based on V3 margins.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
The price discount was made permanent.
Factual update about pricing policy.
Source: How Did DeepSeek Make V4 So Cheap?
Attention Mechanisms
Neutral
DeepSeek introduced two new attention mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
Factual mention of new techniques.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
The new attention setup makes DeepSeek V4 the best open-source model with a 1-million-token context window, rivaling closed-source models like Gemini 3.1 Pro and closing in on Opus 4.6 in retrieval accuracy.
Author claims superiority in long-context retrieval.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
CSA compresses every four tokens into one learned KV entry, not a simple average pooling but a learned compression using contribution weights.
Technical description of CSA.
Source: How Did DeepSeek Make V4 So Cheap?
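The difference between average pooling and a learned, weighted compression can be sketched as follows. This is an illustrative sketch only, not DeepSeek's implementation: the function name `compress_kv`, the per-token `weight_logits`, and the softmax normalisation are all assumptions standing in for whatever learned "contribution weights" the report defines.

```python
import numpy as np


def compress_kv(kv, weight_logits, group=4):
    """Pool each block of `group` consecutive tokens into one KV entry.

    kv:            (seq_len, dim) array of per-token key/value vectors.
    weight_logits: (seq_len,) hypothetical learned scores; a softmax inside
                   each block turns them into contribution weights.
    Returns an array of shape (seq_len // group, dim).
    """
    seq_len, dim = kv.shape
    assert seq_len % group == 0, "sketch assumes seq_len divisible by group"
    blocks = kv.reshape(seq_len // group, group, dim)
    logits = weight_logits.reshape(seq_len // group, group)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum over the tokens in each block.
    return (weights[:, :, None] * blocks).sum(axis=1)
```

With all logits equal, the weights are uniform and the result reduces to plain average pooling; learned logits let some tokens in a block contribute more than others.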
Neutral
HCA compresses every 128 tokens into a single KV entry, making memory usage 32 times lighter than CSA and enabling cheap dense global attention.
Technical description of HCA.
Source: How Did DeepSeek Make V4 So Cheap?
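The "32 times lighter" figure follows from the two compression ratios (128 / 4 = 32). A small sketch of the arithmetic, counting compressed KV entries per layer and ignoring the sliding-window entries and any per-entry size difference (both simplifying assumptions):

```python
CSA_TOKENS_PER_ENTRY = 4    # from the CSA claim above
HCA_TOKENS_PER_ENTRY = 128  # from the HCA claim above


def kv_entries(seq_len, tokens_per_entry):
    """Number of compressed KV entries for a sequence (ceiling division)."""
    return -(-seq_len // tokens_per_entry)


if __name__ == "__main__":
    seq_len = 1_000_000
    csa = kv_entries(seq_len, CSA_TOKENS_PER_ENTRY)
    hca = kv_entries(seq_len, HCA_TOKENS_PER_ENTRY)
    print("CSA entries:", csa)
    print("HCA entries:", hca)
    print("ratio:", HCA_TOKENS_PER_ENTRY // CSA_TOKENS_PER_ENTRY)
```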
Neutral
The model interleaves CSA and HCA at a 1:1 ratio, and both branches have an additional sliding window attention of 128 tokens to preserve recent context.
Architecture detail of hybrid attention.
Source: How Did DeepSeek Make V4 So Cheap?
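A 1:1 interleave and a 128-token sliding window can be sketched as below. Which mechanism comes first in the stack is not stated in the source, so starting with CSA is an assumption, and the function names are hypothetical.

```python
import numpy as np


def layer_schedule(n_layers):
    """Alternate CSA and HCA layers at a 1:1 ratio (CSA-first is an assumption)."""
    return ["CSA" if i % 2 == 0 else "HCA" for i in range(n_layers)]


def sliding_window_mask(seq_len, window=128):
    """Boolean mask where query i may attend to keys j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)
```

In this sketch the window mask covers only the most recent tokens at full resolution; the compressed CSA or HCA entries would supply the older context.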
Model Architecture
Neutral
DeepSeek V4 Pro has 1.6 trillion total parameters with 49 billion active; V4 Flash has 284 billion total parameters with 13 billion active; both support 1 million token context windows.
Model size specifications.
Source: How Did DeepSeek Make V4 So Cheap?
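The sparsity implied by these sizes is easy to work out. A short sketch using only the parameter counts quoted above:

```python
# (total parameters, active parameters) from the claim above.
MODELS = {
    "V4 Pro": (1.6e12, 49e9),
    "V4 Flash": (284e9, 13e9),
}


def active_fraction(total_params, active_params):
    """Share of the parameters used for each token in a sparse MoE model."""
    return active_params / total_params


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        print(f"{name}: {100 * active_fraction(total, active):.2f}% active per token")
```

Both models activate only a few percent of their weights per token, which is where much of the per-token compute saving of a mixture-of-experts design comes from.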
Neutral
DeepSeek V4 replaces the residual stream with manifold-constrained hyper-connections (MHC) to increase representational capacity and help information survive across depth.
Architecture change description.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
The MoE block uses 256 fine-grained routed experts plus shared experts, changes the router activation from sigmoid to square-root softplus, and adds a sequence-level balance loss to avoid extreme routing imbalance.
MoE design changes.
Source: How Did DeepSeek Make V4 So Cheap?
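A router with a square-root softplus activation might look like the sketch below. Only the activation and the count of 256 routed experts come from the claim above; the `top_k` value, the normalisation of the gates, and the function names are assumptions for illustration.

```python
import numpy as np


def sqrt_softplus(x):
    """sqrt(log(1 + exp(x))), computed stably with logaddexp."""
    return np.sqrt(np.logaddexp(0.0, x))


def route(logits, top_k):
    """Pick the top_k experts for one token and return normalised gate weights.

    logits: (n_experts,) raw router scores for a single token.
    """
    scores = sqrt_softplus(logits)
    top = np.argsort(-scores)[:top_k]
    gates = scores[top] / scores[top].sum()  # assumed normalisation
    return top, gates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts, gates = route(rng.normal(size=256), top_k=8)  # top_k=8 is hypothetical
    print(experts, gates)
```

Unlike sigmoid, this activation is not capped at 1, so the scores keep growing (slowly) for large logits instead of saturating.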
Neutral
Early layers use hash routing based on token ID instead of learned routing, giving each token a stable expert path without spending capacity on routing.
Novel routing approach for early layers.
Source: How Did DeepSeek Make V4 So Cheap?
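Hash routing needs no learned parameters: the expert is a fixed function of the token ID. The sketch below uses a multiplicative hash purely for illustration; the actual hash function and the number of experts selected per token are not given in the source.

```python
def hash_route(token_id, n_experts=256):
    """Map a token ID to a fixed expert index in [0, n_experts).

    Uses a Knuth-style multiplicative hash (an illustrative choice, not the
    function from the report) and keeps the high bits of the 32-bit result.
    """
    h = (token_id * 2654435761) % 2**32
    return (h * n_experts) >> 32


if __name__ == "__main__":
    for token_id in (0, 1, 2, 50_000):
        print(token_id, "->", hash_route(token_id))
```

Because the mapping depends only on the token ID, the same token always reaches the same expert regardless of context, which is the "stable expert path" described above.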
Efficiency
Agree
At 1 million tokens, V4 Pro uses only 27 percent of the inference FLOPs and 10 percent of the KV cache compared to V3.2.
Efficiency improvement data.
Source: How Did DeepSeek Make V4 So Cheap?
Agree
Against a GQA baseline (as used in Llama 2 and 3), V4 Pro reduces total KV cache by a factor of 34 and V4 Flash reduces it by a factor of 49.
Comparison of memory reduction.
Source: How Did DeepSeek Make V4 So Cheap?
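To get a feel for the scale of a GQA KV cache at 1 million tokens, the standard size formula can be evaluated directly. The configuration below (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) matches Llama 3 8B and is only an example; the source does not say which GQA configuration DeepSeek used as its baseline, so applying the 34x and 49x factors to it is purely illustrative.

```python
def gqa_kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for grouped-query attention.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Example configuration only (Llama 3 8B shape, 16-bit values).
    baseline = gqa_kv_cache_bytes(1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"GQA example:       {baseline / 1e9:.1f} GB")
    # Illustrative only: reduction factors quoted in the claim above.
    print(f"reduced 34x (Pro): {baseline / 34 / 1e9:.1f} GB")
    print(f"reduced 49x (Flash): {baseline / 49 / 1e9:.1f} GB")
```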
Neutral
DeepSeek V4 uses FP4 quantization-aware training for MoE expert weights, so the model learns to survive extremely low precision during inference.
Quantization technique for efficiency.
Source: How Did DeepSeek Make V4 So Cheap?
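Quantization-aware training typically runs the forward pass through "fake-quantized" weights so the model sees the rounding error during training. The sketch below shows that rounding step for a 4-bit float grid. The E2M1 value grid, the block size of 32, and the per-block max scaling are assumptions; the source does not specify DeepSeek's FP4 format or scaling scheme.

```python
import numpy as np

# Magnitudes representable in an E2M1 4-bit float (assumed format).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(w, block=32):
    """Round weights to the nearest FP4 value per block, then rescale back.

    In training, the forward pass would use this output while gradients flow
    to the full-precision weights unchanged (a straight-through estimator);
    only the forward rounding is sketched here.
    """
    assert w.size % block == 0, "sketch assumes size divisible by block"
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    scaled = flat / scale
    # Index of the nearest grid magnitude for every element.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_GRID[idx]
    return (quantized * scale).reshape(w.shape)
```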
Training
Neutral
DeepSeek V4 Flash was pre-trained on 32 trillion tokens and V4 Pro on 33 trillion tokens, doubling previous pre-training runs; the only other known model with a similar amount is Kimi K2.5 at 30 trillion tokens.
Pre-training data scale facts.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
The training used the Muon optimizer (first developed by Keller Jordan in December 2024, verified at scale by Kimi in February 2025) for most parameters, keeping AdamW for embeddings, prediction heads, and some MHC parameters.
Optimizer usage details.
Source: How Did DeepSeek Make V4 So Cheap?
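The core of Muon is to replace a 2D weight's momentum update with an approximately orthogonalised version, computed by a few Newton-Schulz iterations. The sketch below is a simplified NumPy rendering of the publicly available Muon reference implementation (coefficients and five steps as in that code, without Nesterov momentum or shape-dependent scaling); the learning rate and momentum values are illustrative, and DeepSeek's variant may differ.

```python
import numpy as np


def newton_schulz(G, steps=5):
    """Approximately orthogonalise a 2D matrix (push singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds singular values by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2D weight matrix."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz(momentum_buf)
    return W - lr * update, momentum_buf
```

Because the orthogonalisation only makes sense for matrices, parameters such as embeddings and prediction heads are commonly left to AdamW, which is consistent with the split described above.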
Agree
Muon provides faster convergence and better training stability according to DeepSeek.
Claimed benefits of the optimizer by the lab.
Source: How Did DeepSeek Make V4 So Cheap?
Neutral
Data processing includes DeepSeek V3-style preparation, a 128K tokenizer, token splitting, fill-in-the-middle, a new DSML special token for XML-based tool invocation, and sample-level attention masking for packed documents.
Data recipe details.
Source: How Did DeepSeek Make V4 So Cheap?
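Sample-level attention masking keeps tokens from attending across document boundaries when several documents are packed into one training sequence. A minimal sketch, assuming each position carries a document ID (the `doc_ids` input and the function name are hypothetical):

```python
import numpy as np


def packed_causal_mask(doc_ids):
    """Boolean mask: position i may attend to j only if j <= i and both
    positions belong to the same packed document."""
    doc_ids = np.asarray(doc_ids)
    n = doc_ids.shape[0]
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    return same_doc & causal


if __name__ == "__main__":
    # Two documents of two tokens each, packed into one sequence.
    print(packed_causal_mask([0, 0, 1, 1]).astype(int))
```

The result is block-diagonal: each document gets its own causal triangle, and the first token of a new document sees nothing from the previous one.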
Post-training
Agree
Post-training: instead of direct RL, multiple specialist models for math, coding, agents, and instruction following were trained separately, then distilled into a unified model via on-policy distillation.
Author finds this approach cleaner and more beautiful.
Source: How Did DeepSeek Make V4 So Cheap?
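In on-policy distillation, the student generates its own samples and is then pulled toward the teacher's token distribution at each generated position. One common choice of objective is the per-token reverse KL divergence, sketched below; the source does not state which loss DeepSeek used or how the specialist teachers are combined, so treat this as an assumption.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position.

    Both inputs have shape (seq_len, vocab) and are evaluated on a sequence
    sampled from the student, which is what makes the procedure on-policy.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)
```

Averaging this quantity over positions gives a loss that is zero when the student matches the teacher on its own samples and positive otherwise.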
Reasoning Modes
Neutral
DeepSeek V4 has three reasoning modes: non-thinking, thinking high, and think max, where think max uses longer context, weaker length penalties, and different system prompts.
Reasoning mode description.
Source: How Did DeepSeek Make V4 So Cheap?
Model Variants
Neutral
The release provides six main variants: Pro, Pro Base, Pro Max, Flash, Flash Base, Flash Max.
Model variant listing.
Source: How Did DeepSeek Make V4 So Cheap?
Infrastructure
Neutral
The inference stack is optimized for Huawei chips with day-zero support, compression of attention and KV cache, and pushing the CSA indexer into lower precision.
Hardware and system engineering choices.
Source: How Did DeepSeek Make V4 So Cheap?
Overall Assessment
Agree
DeepSeek V4 is a cost-performance release focused on long context efficiency and test-time scaling rather than chasing the top of the benchmark leaderboard.
Author's interpretation of the release philosophy.
Source: How Did DeepSeek Make V4 So Cheap?
Serving
Agree
DeepSeek is trying to solve the underlying serving cost and reliability problem that many powerful models face in the real world.
Author's positive spin on the engineering focus.
Source: How Did DeepSeek Make V4 So Cheap?
Industry Demand
Neutral
The demand for AI and LLMs is stronger than ever, and new datacenter buildouts are potentially necessary.
Observation about market demand derived from API overload issues.
Source: How Did DeepSeek Make V4 So Cheap?