AWS Bets on Cerebras to Split Inference in Two and Leave Nvidia Behind

AWS deploys Cerebras CS-3 chips in its data centers, pairing them with Trainium for split-hardware inference.

Oliver Senti, Senior AI Editor
March 14, 2026 · 6 min read
Cerebras CS-3 wafer-scale system installed in a cloud data center alongside server racks

Amazon Web Services and Cerebras Systems announced on March 13, 2026 that AWS will deploy Cerebras CS-3 systems directly in its data centers, available through Amazon Bedrock. The service will run open-source LLMs and Amazon's own Nova models on Cerebras hardware later this year. AWS is the first major cloud provider to offer Cerebras's inference solution, and it's an exclusive arrangement.

But the real story isn't just another chip vendor getting shelf space in a hyperscaler. It's the joint architecture the two companies are building together.

The split

Every time you ask an LLM a question, two distinct computational phases happen. First, prefill: the model ingests your prompt, processes all the tokens in parallel, and builds up the internal state it needs. That part is compute-heavy but parallelizable. Then decode: the model generates its answer one token at a time, sequentially, which is light on math but punishing on memory bandwidth. Every single output token requires a full pass through the model's weights.
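The shape of these two phases can be sketched in a few lines. This is a toy stand-in, not a real transformer: the "KV cache" is just a list and the "sampler" is arithmetic, but the structure is the point — prefill is one batched pass, decode is an unavoidable sequential loop.

```python
# Minimal runnable sketch of the two LLM inference phases. The "model"
# here is a dict-free stub, not a real transformer; only the control
# flow mirrors real inference.

def prefill(prompt_tokens):
    # Prefill: process every prompt token in one batch. In a real model
    # this is a single large, compute-bound forward pass that also
    # populates the KV cache (simulated here as a plain list).
    kv_cache = list(prompt_tokens)
    return kv_cache

def decode(kv_cache, max_new_tokens):
    # Decode: strictly sequential. Each new token depends on the cache
    # built so far, and in a real model each iteration streams the full
    # set of model weights -- the memory-bandwidth-bound part.
    output = []
    for _ in range(max_new_tokens):
        next_token = sum(kv_cache) % 100  # stand-in for the sampler
        output.append(next_token)
        kv_cache.append(next_token)       # cache grows one entry per step
    return output

cache = prefill([3, 1, 4, 1, 5])
print(decode(cache, 4))  # -> [14, 28, 56, 12]
```

Note that nothing in the decode loop can be parallelized across output tokens: token *n+1* needs the cache state left behind by token *n*. That dependency is why the two phases stress hardware so differently.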

Today, the same chips handle both phases. That means your hardware is either overpowered for decode or underpowered for prefill, depending on what it was optimized for. The AWS-Cerebras approach splits these phases across purpose-built silicon. Trainium, Amazon's custom AI chip, handles prefill. The Cerebras WSE-3 handles decode. Amazon's Elastic Fabric Adapter network shuttles the KV cache between them.
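A disaggregated pipeline adds one step to the picture above: the KV cache produced by the prefill device has to cross the network before decode can start. The sketch below is purely illustrative — the function names and the pickle "transport" are my stand-ins, not AWS or Cerebras APIs — but it shows the handoff the announcement describes.

```python
# Hedged sketch of disaggregated inference: prefill on one device,
# KV cache shipped across an interconnect, decode on another device.
# All names here are illustrative, not vendor APIs.

import pickle

def run_prefill(prompt_tokens):
    # On the prefill device (Trainium in the AWS design): one parallel
    # pass over the prompt yields the KV cache.
    return {"tokens": list(prompt_tokens), "len": len(prompt_tokens)}

def transfer(kv_cache):
    # Stand-in for the interconnect hop (Elastic Fabric Adapter in the
    # announcement): serialize the cache, move it, deserialize it.
    wire_bytes = pickle.dumps(kv_cache)
    return pickle.loads(wire_bytes)

def run_decode(kv_cache, max_new_tokens):
    # On the decode device (the WSE-3 in the AWS design): sequential
    # generation starting from the transferred cache state.
    out = []
    for i in range(max_new_tokens):
        out.append((kv_cache["len"] + i) % 7)  # toy "sampler"
    return out

cache = transfer(run_prefill([10, 20, 30]))
print(run_decode(cache, 3))  # -> [3, 4, 5]
```

The engineering risk lives in `transfer`: for long contexts the KV cache is gigabytes in size, so the interconnect latency and bandwidth directly gate how much of the per-phase speedup survives end to end.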

The claimed result: 5x more high-speed token capacity from the same hardware footprint. David Brown, VP of Compute and ML Services at AWS, put it more aggressively, saying the architecture would deliver inference "an order of magnitude faster" than what's currently available. That's a bold claim, and the gap between 5x and 10x suggests some marketing flexibility in how they're measuring things.

Why Cerebras actually makes sense here

The WSE-3 is a strange piece of hardware. It's a single chip the size of a dinner plate, fabricated on an entire silicon wafer rather than being diced into individual dies. The thing has 900,000 cores, 4 trillion transistors, and 44GB of on-chip SRAM with 21 petabytes per second of memory bandwidth. For context, Cerebras claims that's roughly 7,000x the memory bandwidth of an Nvidia H100.

That on-chip SRAM is what makes the decode story work. In a GPU cluster, generating each token means fetching model weights from external HBM memory, which creates a bottleneck. The WSE-3 stores weights directly on the chip in SRAM, eliminating the memory wall that makes decode slow on conventional hardware. Cerebras says its systems can generate thousands of output tokens per second where GPUs manage hundreds.
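The bandwidth argument reduces to simple arithmetic: if every output token must stream every weight once, tokens per second can't exceed bandwidth divided by model size. The numbers below are my assumptions for illustration (a 70B-parameter model at 2 bytes per weight, H100-class HBM around 3.35 TB/s); the 21 PB/s figure is Cerebras's own claim for the WSE-3.

```python
# Back-of-envelope ceiling on single-stream decode throughput:
# tokens/sec <= memory_bandwidth / bytes_of_model_weights.
# Model size and HBM bandwidth are illustrative assumptions.

def decode_ceiling_tokens_per_s(params, bytes_per_weight, bandwidth_bytes_per_s):
    model_bytes = params * bytes_per_weight
    return bandwidth_bytes_per_s / model_bytes

hbm  = decode_ceiling_tokens_per_s(70e9, 2, 3.35e12)  # GPU HBM assumption
sram = decode_ceiling_tokens_per_s(70e9, 2, 21e15)    # WSE-3 SRAM claim

print(f"HBM ceiling:  ~{hbm:.0f} tokens/s")   # ~24
print(f"SRAM ceiling: ~{sram:.0f} tokens/s")  # ~150000
```

These are single-stream ceilings that ignore batching, KV cache reads, and whether a 140GB model even fits in 44GB of SRAM (real deployments shard weights across multiple wafers), but the orders of magnitude line up with the "hundreds versus thousands" framing.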

I find the architectural argument compelling. Whether the numbers hold up at scale, with real customer workloads and production networking overhead, is a different question. Performance disclaimers at the bottom of the press release note that results "may vary depending on workload, configuration, date and models being tested." Standard stuff, but worth flagging.

What this means for AWS specifically

Amazon has been on a custom silicon tear. Trainium3, announced at re:Invent 2025, already has commitments from both Anthropic and OpenAI. OpenAI alone signed up for 2 gigawatts of Trainium capacity. Adding Cerebras to the mix gives AWS something neither Google Cloud nor Microsoft Azure can currently offer: a heterogeneous inference stack that pairs two specialized processors instead of throwing general-purpose GPUs at everything.

For Amazon's Nova models, this could be a real differentiator. Running your own models on your own custom inference architecture is a competitive moat that's hard to replicate quickly.

The companies say they'll support both the disaggregated setup and a traditional aggregated configuration where the CS-3 handles everything. Cerebras's blog post acknowledges that most customers run mixed workloads with different prefill-to-decode ratios, and the aggregated approach still makes sense for many of them. That's a refreshingly honest caveat from an announcement that otherwise leans hard into superlatives.

Cerebras's momentum problem (the good kind)

This AWS deal lands at an interesting moment for Cerebras. In January, the company signed a $10 billion deal with OpenAI for 750 megawatts of compute through 2028. In February, OpenAI launched GPT-5.3-Codex-Spark on Cerebras hardware, its first production model running on non-Nvidia chips. Oracle namedropped Cerebras alongside Nvidia and AMD on a recent earnings call. The company is reportedly preparing for a Q2 2026 IPO at a valuation around $23 billion.

That's a lot of validation in a short window. And the AWS partnership provides something the OpenAI deal didn't: access to the broadest cloud customer base in the world, through Bedrock, with no separate infrastructure to manage.

There's a counterpoint, though. Cerebras's revenue concentration has been a persistent concern. In the first half of 2024, UAE-based G42 accounted for 87% of revenue. The OpenAI and AWS deals help diversify that picture, but the company is still pre-IPO and burning cash. Whether inference speed advantages translate into sustainable market share against Nvidia's ecosystem is the $23 billion question.

The agentic angle

Cerebras's blog post makes an interesting observation about why fast decode matters right now. AI coding agents, the kind that write and execute code autonomously, generate roughly 15x more tokens per query than conversational chat. Reasoning models that "think" through problems before answering are similarly token-heavy on the decode side. As these workloads grow, the bottleneck shifts decisively toward output generation speed.

This is where disaggregated inference could find its killer app. If you're running an agentic coding workflow that generates tens of thousands of tokens per task, shaving the per-token decode latency is worth more than speeding up the one-time prefill of a 128k context window.
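The latency math behind that claim is straightforward. With an illustrative set of numbers (all assumed for the sketch, not from the announcement), a long agentic generation is dominated almost entirely by the per-token decode cost:

```python
# Illustrative latency arithmetic: total task time = one-time prefill
# plus per-token decode cost. All numbers are assumptions for the sketch.

def task_seconds(prefill_s, decode_tokens, s_per_token):
    return prefill_s + decode_tokens * s_per_token

chat       = task_seconds(prefill_s=0.5, decode_tokens=300,    s_per_token=0.02)
agent      = task_seconds(prefill_s=0.5, decode_tokens=30_000, s_per_token=0.02)
agent_fast = task_seconds(prefill_s=0.5, decode_tokens=30_000, s_per_token=0.002)

print(f"chat task:              {chat:.1f}s")        # 6.5s
print(f"agent task:             {agent:.1f}s")       # 600.5s
print(f"agent, 10x faster decode: {agent_fast:.1f}s") # 60.5s
```

Under these assumptions, a 10x decode speedup turns a ten-minute agent run into one minute, while halving prefill time would save a quarter of a second. That asymmetry is the whole case for optimizing the decode side first.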

What's missing

No pricing. No benchmark results against current Nvidia Blackwell systems with actual model comparisons. No timeline more specific than "the next couple of months" for availability. And the 5x capacity claim comes with enough caveats that I'd want to see independent testing before treating it as settled.

The whole thing runs on the AWS Nitro System for security and isolation, which is reassuring but also table stakes for any serious cloud deployment. Financial terms weren't disclosed.

Nvidia's GTC conference is next week. If Cerebras and AWS just handed Jensen Huang a talking point about disaggregated inference being something Nvidia can do with its own hardware stack, this announcement might age differently. But for now, the architecture is genuinely novel, and AWS has the distribution to make it matter.

Tags: Cerebras, AWS, inference, Trainium, wafer-scale engine, Amazon Bedrock, AI chips, disaggregated inference, Nvidia
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

