DeepSeek V4

AI Search · 37 Claims

Neutral

DeepSeek released a new model called DeepSeek V4.

The transcript states 'They just dropped a new model, Deepseek V4.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek has significantly less compute than top closed AI labs.

The author says 'They don't have nearly as much compute' in contrast to top closed labs.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek does not have access to top NVIDIA chips.

The transcript explicitly states 'they don't even have the top NVIDIA chips.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek's team is about 40 times smaller than OpenAI's.

The video says 'their team is like 40 times smaller than OpenAI.'

Source: The insane engineering of Deepseek V4

Agree

DeepSeek V4 is on par with the top closed models.

The author states 'they managed to build a model that's on par with the top closed models out there.'

Source: The insane engineering of Deepseek V4

Agree

DeepSeek V4 Pro is on par with top closed models Opus 4.6 Max and Gemini 3.1 Pro across knowledge, reasoning, and agentic benchmarks.

The author shows benchmark charts and states it's 'pretty much on par with some of the top closed models out there, including Opus 4.6 Max and Gemini 3.1 Pro.'

Source: The insane engineering of Deepseek V4

Agree

DeepSeek V4 has a higher win rate than Opus 4.6 Max on average.

The transcript states 'DeepSeek version 4 on average has a higher win rate than Opus 4.6.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 achieved a perfect score of 120/120 on the Putnam 2025 undergraduate mathematics competition benchmark.

The video says 'Deepseek V4 achieved a perfect score, 120 out of 120' on the Putnam 2025.

Source: The insane engineering of Deepseek V4

Agree

At the extreme 1-million-token context length, DeepSeek V4's retrieval accuracy surpasses Google Gemini 3.1 Pro.

The transcript claims 'its retrieval accuracy even beats Google's latest Gemini 3.1 Pro' when pushed to the 1M limit.

Source: The insane engineering of Deepseek V4

Neutral

On the Artificial Analysis leaderboard, DeepSeek V4 Pro is the second-best open-source model, below Kimi K2.6, and close to the top closed models.

The video references the independent leaderboard showing DeepSeek V4 Pro as second best open-source model and edging close to top closed models.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 is open sourced.

The transcript mentions 'they even open sourced this.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 is available for free download on Hugging Face.

The author states 'they actually open sourced the model... it's out on Hugging Face, you can download this for free.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek released a paper detailing how they built V4.

The video says 'they even released a paper on how they built it.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek's paper reveals design, training, and infrastructure details that closed AI labs typically keep secret.

The transcript says they released a paper 'spilling all the info on how it was designed and how they trained it, including this infrastructure stuff, which is like top secret for the closed AI labs.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 Pro has 1.6 trillion parameters.

The author states 'this latest V4 Pro model is massive. It has 1.6 trillion parameters.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 has a context length of 1 million tokens.

The transcript says 'this new V4 also has a context length of 1 million tokens.'

Source: The insane engineering of Deepseek V4

Neutral

1 million tokens is approximately 750,000 words.

The video explains '1 million tokens is roughly 750,000 words.'

Source: The insane engineering of Deepseek V4

Neutral

Building a usable 1-million-token context window correctly is extremely difficult.

The author states 'a 1 million token context window is insanely hard to actually build correctly.'

Source: The insane engineering of Deepseek V4

Neutral

Processing 1 million tokens with full attention results in astronomical compute requirements that choke high-end hardware.

The transcript explains that at a million tokens 'the number of comparisons becomes astronomical' and 'high-end hardware starts to choke.'

Source: The insane engineering of Deepseek V4
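
To make the scaling concrete, here is a minimal sketch that counts query-key comparisons for a single attention head in a single layer, assuming standard full attention. The function name `attention_pairs` is illustrative and does not come from the video or the paper.

```python
def attention_pairs(num_tokens: int, causal: bool = True) -> int:
    """Count query-key score computations for one head in one layer."""
    if causal:
        # Each token attends to itself and to every earlier token.
        return num_tokens * (num_tokens + 1) // 2
    # Without a causal mask, every token is compared with every token.
    return num_tokens * num_tokens


for n in (4_096, 65_536, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} comparisons")
```

The count grows with the square of the sequence length, so going from 4K to 1M tokens multiplies the work per head by tens of thousands, not by a few hundred.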

Neutral

At 1 million tokens, the KV cache consumes gigabytes of GPU memory, creating a major memory bottleneck.

The video states the KV cache becomes 'absurd... gigabytes sitting in expensive GPU memory.'

Source: The insane engineering of Deepseek V4
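
A rough sizing sketch shows why the cache grows so large for an uncompressed model. The layer count, head count, head dimension, and 2-byte precision below are hypothetical values chosen for illustration only; they are not DeepSeek V4's published configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the size of an uncompressed KV cache for one sequence."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only.
size = kv_cache_bytes(num_tokens=1_000_000, num_layers=60, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.2f} GB for a single 1M-token sequence")
```

Because the size is linear in the token count, any scheme that stores fewer entries per token (as the compression methods described in the video aim to do) shrinks the cache proportionally.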

Neutral

DeepSeek uses compressed sparse attention (CSA) that merges small chunks of tokens (e.g. 4) into one representation, reducing sequence length by 4x.

The author explains CSA takes chunks of tokens and merges them, reducing sequence length by a factor of four.

Source: The insane engineering of Deepseek V4
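
The chunk-merging idea can be sketched as follows. Averaging is used here purely as a stand-in: the video does not specify how the merged representation is computed, and the real model most likely uses a learned compression. The function name `compress_chunks` is hypothetical.

```python
def compress_chunks(tokens, chunk_size=4):
    """Merge each run of `chunk_size` token vectors into one vector by averaging."""
    compressed = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        dim = len(chunk[0])
        # Element-wise mean over the vectors in this chunk.
        compressed.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return compressed


# Eight toy 2-dimensional token vectors become two compressed entries.
tokens = [[float(i), float(2 * i)] for i in range(8)]
short = compress_chunks(tokens)
print(len(tokens), "->", len(short))
```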

Neutral

DeepSeek's Lightning indexer rapidly selects only the most relevant compressed blocks for attention, ignoring the rest.

The transcript describes the Lightning indexer picking out 'only the most useful pieces' and everything else is skipped.

Source: The insane engineering of Deepseek V4
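
A minimal sketch of the selection step, assuming each compressed block is scored by a dot product with the current query and only the top `k` blocks are kept. The scoring rule, the function name `select_top_blocks`, and the value of `k` are assumptions; the video does not describe the indexer's internals.

```python
import heapq


def select_top_blocks(query, block_keys, k):
    """Return the indices of the k blocks whose keys score highest against the query."""
    scores = [sum(q * b for q, b in zip(query, key)) for key in block_keys]
    top = heapq.nlargest(k, range(len(block_keys)), key=lambda i: scores[i])
    # Restore positional order so the selected blocks stay in sequence order.
    return sorted(top)


query = [1.0, 0.0]
block_keys = [[0.1, 5.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 0.0]]
selected = select_top_blocks(query, block_keys, k=2)
print(selected)
```

Attention is then computed only over the selected blocks, so its cost depends on `k` and not on the total number of blocks.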

Neutral

DeepSeek uses heavily compressed attention (HCA) that compresses 128 tokens (a paragraph) into a single representation, allowing full attention over the entire history.

The video says HCA groups '128 tokens or like an entire paragraph' and the sequence becomes short enough to look at everything.

Source: The insane engineering of Deepseek V4
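
The arithmetic behind "short enough to look at everything" can be checked directly. This sketch only counts entries; it assumes full non-causal attention over the compressed sequence and says nothing about how the 128-token summary itself is computed.

```python
import math


def compressed_length(num_tokens, block_size=128):
    """Number of summary entries left after grouping tokens into fixed-size blocks."""
    return math.ceil(num_tokens / block_size)


n = 1_000_000
blocks = compressed_length(n)
full_pairs = n * n                  # comparisons with uncompressed full attention
compressed_pairs = blocks * blocks  # comparisons over the compressed sequence
print(f"{blocks:,} blocks, {full_pairs // compressed_pairs:,}x fewer comparisons")
```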

Neutral

DeepSeek uses sliding window attention to keep the most recent tokens (e.g. 128) uncompressed for immediate precise context.

The author introduces sliding window attention that 'continuously tracks the most recent tokens... with full exact fidelity.'

Source: The insane engineering of Deepseek V4
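
In index terms, a sliding window means each position sees only itself and the tokens immediately before it. A minimal sketch, assuming a causal window of 128 tokens; the function name `visible_positions` is illustrative.

```python
def visible_positions(query_pos, window=128):
    """Positions a token at `query_pos` may attend to under a causal sliding window."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


print(list(visible_positions(2)))    # early tokens see the whole prefix
print(len(visible_positions(1000)))  # later tokens see exactly `window` positions
```

The window size is fixed, so this part of the attention cost stays constant per token no matter how long the full context becomes.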

Neutral

DeepSeek V4 interleaves CSA, HCA, and sliding window attention through the neural network to balance efficiency and precision.

The transcript explains the three parallel views are interleaved 'layer by layer through the neural network' to achieve a 1M context window without devastating compute cost.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 Pro requires 3.7x fewer FLOPs than the previous V3.2, using only about 27% of the compute.

The paper's benchmarks show V4 Pro requires 3.7 times fewer FLOPs and 'runs on roughly 27% of the compute that was required for the previous version.'

Source: The insane engineering of Deepseek V4
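
The two figures in this claim are the same number expressed two ways, which a one-line conversion confirms. The helper name `remaining_fraction` is illustrative.

```python
def remaining_fraction(reduction_factor):
    """Convert an 'N times fewer' factor into the fraction of the original that remains."""
    return 1.0 / reduction_factor


print(f"3.7x fewer FLOPs -> {remaining_fraction(3.7):.1%} of the original compute")
```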

Neutral

DeepSeek V4 Pro reduces KV cache memory by about 90% compared to V3.2 (needing only 10% of the memory).

The video states the KV cache is 'almost 10 times smaller than the previous DeepSeek version' and 'only requires 10% of the KV cache memory.'

Source: The insane engineering of Deepseek V4

Neutral

Scaling neural networks to 1.6 trillion parameters risks signal explosions and training crashes without proper mitigation.

The transcript explains that at trillion-parameter scale 'signals flowing through the network start to amplify like crazy' causing feedback loops that crash training.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek introduced manifold-constrained hyper-connections (mHC) that enforce doubly stochastic matrix constraints, preventing signal amplification.

The paper introduces mHC, which constrains residuals to a 'manifold of doubly stochastic matrices' where every row and column sums to 1, forbidding blowup.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek uses a 20-iteration Sinkhorn algorithm per layer to enforce the doubly stochastic constraints.

The transcript mentions 'a rapid sequence of row and column normalizations around 20 iterations' to satisfy the constraints.

Source: The insane engineering of Deepseek V4
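
The alternating row and column normalization can be sketched in a few lines. This is a plain Sinkhorn iteration on a small square matrix with strictly positive entries; how DeepSeek parameterizes the matrix and fuses the step into GPU kernels is not covered by the video, and the names `sinkhorn` and `apply` are illustrative.

```python
def sinkhorn(matrix, iterations=20):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] /= s
    return m


def apply(matrix, vector):
    """Matrix-vector product: mix the signal components using the matrix weights."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


raw = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
ds = sinkhorn(raw)
signal = [3.0, -1.0, 2.0]
mixed = apply(ds, signal)
print([round(sum(row), 6) for row in ds], [round(v, 3) for v in mixed])
```

Each output component is a weighted average of the inputs with non-negative weights that sum to 1, so its magnitude cannot exceed the largest input magnitude. That is the sense in which the constraint forbids blowup.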

Neutral

The overhead of the Sinkhorn constraint enforcement is only 6.7% of total runtime due to aggressive low-level GPU kernel optimizations.

The video states they shrunk the overhead of the entire process to 'only 6.7% of runtime' using fused kernels and custom GPU tweaks.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek introduced anticipatory routing that uses historical parameter snapshots to ignore noise and prevent loss spikes during training.

The transcript states the system looks at slightly older versions of parameters to lock onto the underlying trend and stabilise when early signs of a loss spike are detected.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek replaced AdamW with the Muon optimizer, which uses a two-phase update (aggressive, then precise) for faster and more stable learning.

The author explains DeepSeek replaced AdamW with Muon, which first makes big rough adjustments and then tiny precise tweaks, like tuning a guitar.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek overlaps computation and communication using a wave-based pipeline to eliminate network idle time during training across racks.

The transcript describes breaking data into smaller sequential waves so that computation on wave N overlaps with transfer of wave N+1, making latency disappear.

Source: The insane engineering of Deepseek V4
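
A toy timing model illustrates why overlapping hides latency. It assumes every wave has the same compute and transfer time, that the link sends waves back to back, and that a wave can only be computed after it has arrived. The time units and function names are hypothetical.

```python
def sequential_time(num_waves, compute, transfer):
    """Total time when each wave is transferred, then computed, with no overlap."""
    return num_waves * (compute + transfer)


def overlapped_time(num_waves, compute, transfer):
    """Total time when the next wave is transferred while the current one is computed."""
    transfer_done = 0.0
    compute_done = 0.0
    for _ in range(num_waves):
        transfer_done += transfer  # the link sends waves back to back
        # Compute starts once the wave has arrived and the previous compute has finished.
        compute_done = max(compute_done, transfer_done) + compute
    return compute_done


print(sequential_time(8, compute=10, transfer=4))  # no overlap
print(overlapped_time(8, compute=10, transfer=4))  # transfers hidden behind compute
```

When compute time per wave is at least as long as transfer time, only the first transfer remains visible in the total; every later transfer is hidden behind computation.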

Neutral

DeepSeek wrote fused GPU kernels using the TileLang language and formally verified their correctness with a Z3 SMT solver.

The paper mentions using TileLang to develop fused kernels and a Z3 SMT solver to mathematically prove the kernel code was correct.

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek V4 Pro was trained on 33 trillion tokens.

The transcript explicitly states 'DeepSeek V4 Pro was trained on 33 trillion tokens.'

Source: The insane engineering of Deepseek V4

Neutral

DeepSeek used a curriculum training approach, starting with 4,000-token sequences and gradually increasing to the full 1M token window.

The video explains they started with short 4K token sequences, then raised to 16K, 64K, up to 1M tokens.

Source: The insane engineering of Deepseek V4
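
The staged schedule can be written as a simple geometric progression. This sketch assumes each stage multiplies the sequence length by 4 and that lengths are powers of two; the video names only the 4K, 16K, 64K, and 1M stages, so the intermediate 256K stage produced here is an assumption, as is the function name `context_schedule`. How many training steps each stage lasts is not stated and is not modelled.

```python
def context_schedule(start=4_096, final=1_048_576, factor=4):
    """List the sequence lengths used in successive training stages."""
    lengths = []
    n = start
    while n < final:
        lengths.append(n)
        n *= factor
    lengths.append(final)  # always finish at the full context window
    return lengths


for stage, length in enumerate(context_schedule(), start=1):
    print(f"stage {stage}: {length:,} tokens")
```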