DeepSeek V4 dropped today — $0.28/M output on 1M context, running on Huawei Ascend. Are you routing workloads to it?

DeepSeek just released V4 and the pricing is hard to ignore.

V4-Flash: $0.28/M output tokens. V4-Pro: $2.19/M. Both come with a 1M-token context window by default.

For reference: GPT-4 Turbo is $30/M output. Claude Opus 4.6 is $75/M. That’s not a marginal difference — it’s a structural one. I’ve been digging into the technical report and wanted to share what I found, because I think this release has implications beyond “another cheap Chinese model.”


The infrastructure story is the real headline

V4 is the first Tier-1 LLM to run on **Huawei Ascend chips at 85%+ utilization**. DeepSeek co-optimized inference kernels directly with Huawei’s teams for Ascend 910B/950. They report inference quality matching Nvidia A100 deployments at roughly 40% lower hardware cost.

This matters because the GPU export ban was supposed to slow Chinese AI development. DeepSeek V4 running on Huawei Ascend at 85% utilization while costing 100x less than Western alternatives is a pretty direct answer to how that played out.


Three architecture innovations that make the pricing possible

Engram Architecture — separates static knowledge (CPU RAM, hash-based lookup) from dynamic reasoning (GPU). CPU RAM is 10-20x cheaper per GB than GPU HBM. This is why 1M context doesn’t require proportional GPU memory growth, and why 1M context is the default even on the cheapest tier. The model offloads long-context storage to CPU memory rather than keeping everything in HBM.
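
To make that concrete, here's a toy sketch of the pattern the report describes: static knowledge sits in a hash-keyed store in ordinary CPU RAM, and only the entries a given forward pass needs get staged onto the GPU. The class and method names are my own illustration, not anything from DeepSeek's code:

```python
import hashlib
import torch

# Hypothetical illustration of the Engram idea: a hash-keyed store in cheap
# CPU RAM, with only the currently needed entries staged onto GPU memory.
class CpuKnowledgeStore:
    def __init__(self):
        self._store = {}  # hash -> tensor kept in CPU RAM

    def put(self, key: str, embedding: torch.Tensor):
        h = hashlib.sha256(key.encode()).hexdigest()
        self._store[h] = embedding.cpu()  # pin static knowledge off-GPU

    def fetch(self, key: str, device: torch.device) -> torch.Tensor:
        h = hashlib.sha256(key.encode()).hexdigest()
        # Only the entries a given forward pass touches ever occupy HBM.
        return self._store[h].to(device)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
store = CpuKnowledgeStore()
store.put("doc-chunk-0001", torch.randn(4096))   # stored in CPU RAM
active = store.fetch("doc-chunk-0001", device)   # staged to GPU on demand
print(active.device, active.shape)
```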

mHC (Manifold-Constrained Hyper-Connections) — training stability mechanism for the 1.6T parameter MoE via bi-stochastic matrix projection (Sinkhorn-Knopp). Prevents gradient explosion, reduces failed training runs, lowers amortized training cost. This is part of why they can offer these prices — fewer wasted training runs means lower cost basis.
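
If you haven't run into Sinkhorn-Knopp before, the projection itself is simple: alternately normalize the rows and columns of a positive matrix until both sum to one. A minimal numpy sketch of just that projection (how mHC wires it into the hyper-connections isn't public, so I'm not attempting that part):

```python
import numpy as np

def sinkhorn_knopp(m: np.ndarray, iters: int = 50) -> np.ndarray:
    """Project a positive matrix toward a doubly stochastic one by
    alternately normalizing rows and columns (Sinkhorn-Knopp)."""
    m = m.copy()
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True)  # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)  # columns sum to 1
    return m

rng = np.random.default_rng(0)
p = sinkhorn_knopp(rng.random((4, 4)) + 1e-6)
print(p.sum(axis=0), p.sum(axis=1))  # both approach [1, 1, 1, 1]
```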

DSA (DeepSeek Sparse Attention) — token-dimension compression that takes attention from O(n²) to near-linear scaling, with 60-70% memory bandwidth reduction per attention layer. Combined with the MoE architecture (1.6T total parameters, ~37B active per forward pass), this is what makes the Flash tier viable at $0.28/M output.
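
DeepSeek hasn't published DSA's kernels, but the general shape of token-dimension compression is easy to sketch: pool keys and values over blocks of tokens so each query attends to n/b compressed positions instead of n, cutting the score matrix from n×n to n×(n/b). Growing the block size with sequence length is how schemes like this approach near-linear cost. A toy numpy version, purely illustrative:

```python
import numpy as np

def compressed_attention(q, k, v, block: int = 64):
    """Toy token-dimension compression: mean-pool K/V over blocks of tokens,
    so each query attends to n/block positions instead of n."""
    n, d = k.shape
    nb = n // block
    k_c = k[: nb * block].reshape(nb, block, d).mean(axis=1)  # (n/block, d)
    v_c = v[: nb * block].reshape(nb, block, d).mean(axis=1)
    scores = q @ k_c.T / np.sqrt(d)                           # (n, n/block)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_c                                      # (n, d)

rng = np.random.default_rng(0)
n, d = 1024, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = compressed_attention(q, k, v)
print(out.shape)  # (1024, 64), via a 1024x16 score matrix instead of 1024x1024
```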

You’re not getting a smaller model. You’re getting selective activation of a very large model with near-linear attention scaling and cheap long-context storage.


Pricing table

| Model | Input | Output | Context |
|---|---|---|---|
| V4-Pro | $0.55/M | $2.19/M | 1M tokens |
| V4-Flash | $0.014/M | $0.28/M | 1M tokens |
| GPT-4 Turbo | $10/M | $30/M | 128K tokens |
| Claude Opus 4.6 | $15/M | $75/M | 200K tokens |


What this looks like on real workloads

  • Production chatbot (1M queries/month): $25,000/month on GPT-4 Turbo → $154/month on V4-Flash (assuming ~1,000 input and ~500 output tokens per query; the arithmetic is sketched after this list)

  • Agent coding assistant (1.5M output tokens/month): $112.50 on Opus 4.6 → $3.29 on V4-Pro

  • Enterprise doc processing (200K input + 10K output per doc): $2.30 on GPT-4 Turbo → $0.13 on V4-Pro
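
The arithmetic behind those numbers is simple enough to sanity-check yourself; here's a quick sketch you can adapt to your own traffic (the per-query token split in the chatbot example is my assumption, reverse-engineered from the totals above):

```python
# Cost per month = input_tokens * input_price + output_tokens * output_price,
# with prices quoted per million tokens (from the pricing table above).
PRICES = {                      # (input $/M, output $/M)
    "gpt-4-turbo": (10.00, 30.00),
    "v4-flash":    (0.014, 0.28),
    "v4-pro":      (0.55, 2.19),
}

def monthly_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Chatbot: 1M queries/month, assuming ~1,000 input + ~500 output tokens each.
queries = 1_000_000
print(monthly_cost("gpt-4-turbo", queries * 1_000, queries * 500))  # ~25000.0
print(monthly_cost("v4-flash",    queries * 1_000, queries * 500))  # ~154.0
```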

Even if V4 is meaningfully worse on some tasks, the cost gap is large enough that you can run multiple passes, add verification steps, or accept some quality tradeoff and still come out ahead economically.
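
One concrete version of "run multiple passes": sample the cheap tier several times and keep the majority answer. A rough sketch using the same OpenAI-compatible client shown in the quick start below; the voting logic is illustrative, not a claim about how you should do verification:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI(api_key="your-deepseek-api-key",
                base_url="https://api.deepseek.com/v1")

def majority_answer(prompt: str, passes: int = 5) -> str:
    """Run several cheap V4-Flash passes and keep the most common answer.
    At $0.28/M output, five passes still cost far less than one Opus call."""
    answers = []
    for _ in range(passes):
        resp = client.chat.completions.create(
            model="deepseek-chat-flash",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
            temperature=0.7,
        )
        answers.append(resp.choices[0].message.content.strip())
    return Counter(answers).most_common(1)[0][0]
```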


The two-stack future

One day before release, Reuters reported DeepSeek refused early API access to U.S. chip manufacturers including Nvidia — a deliberate mirror of the U.S. GPU export ban. The AI supply chain is splitting:

  • Western stack: Nvidia GPUs → CUDA → AWS/Azure/GCP → OpenAI/Anthropic/Google

  • Chinese stack: Huawei Ascend → CANN → Huawei Cloud/Alibaba Cloud → DeepSeek/Alibaba/Baidu

For developers and enterprises, this creates a strategic dimension that goes beyond benchmark comparisons. If you build on V4 and the geopolitical situation escalates, what’s your fallback? If you stay on Western APIs and the 100x cost gap persists, what’s the competitive pressure from teams that don’t?
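
For what an explicit fallback could look like in practice: since both stacks expose OpenAI-compatible APIs, a thin routing layer keeps the dependency swappable. A deliberately naive sketch; the provider order, keys, and model names are placeholders:

```python
from openai import OpenAI

# Ordered list of OpenAI-compatible providers; first one that answers wins.
# Keys, base URLs, and model names here are placeholders, not recommendations.
PROVIDERS = [
    {"base_url": "https://api.deepseek.com/v1", "api_key": "deepseek-key",
     "model": "deepseek-chat"},
    {"base_url": "https://api.openai.com/v1", "api_key": "openai-key",
     "model": "gpt-4-turbo"},
]

def complete_with_fallback(messages, max_tokens=1024):
    last_error = None
    for p in PROVIDERS:
        try:
            client = OpenAI(api_key=p["api_key"], base_url=p["base_url"])
            resp = client.chat.completions.create(
                model=p["model"], messages=messages, max_tokens=max_tokens)
            return resp.choices[0].message.content
        except Exception as exc:  # naive: any failure falls through to the next
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```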


API compatibility and quick start

The endpoint is OpenAI-compatible, so migration from existing OpenAI SDK integrations is minimal:


```python
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-chat",   # V4-Pro
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

For V4-Flash, use `model="deepseek-chat-flash"`. Both are live now.


The post-scaling paradigm shift

What’s interesting about V4 architecturally is that it represents a different thesis than “train bigger on more compute.” The Engram + DSA + mHC combination is about extracting more capability per dollar of inference cost, not just per dollar of training compute. If this approach generalizes, it suggests the next few years of model competition will be as much about inference efficiency as raw benchmark scores.

The open-source weights are on Hugging Face. The API is live at api.deepseek.com. Both tiers available now.



Questions

1. For teams currently spending significant budget on GPT-4 Turbo or Claude Opus for non-reasoning workloads — are you evaluating V4, or do compliance/data residency concerns make it a non-starter regardless of price? Curious what the actual blockers look like in practice.

2. Has anyone tested V4-Pro on agent/coding tasks specifically? The claim that it benchmarks near Opus 4.6 on non-reasoning tasks is interesting if it holds up in practice. Would love to hear real results rather than benchmark numbers.

3. For those thinking about the two-stack future — are you building with explicit fallback strategies in mind, or treating this as a “wait and see” situation? At what point does the cost gap become large enough that you’d accept the strategic dependency?

Also curious: for anyone self-hosting, what hardware are you running it on and what utilization are you seeing? And has anyone tested the full 1M context window in production — curious about latency at that scale.