cuTile Rust Hits 96% of cuBLAS for GPU Kernels

A team including engineers tied to NVIDIA published a paper on June 14 describing cuTile Rust, a system that lets you write GPU kernels in Rust without abandoning the language's safety guarantees. The paper, "Fearless Concurrency on the GPU," claims to do this at performance levels that sit right next to hand-tuned CUDA.

The thing Rust couldn't do before

Rust's whole pitch is memory safety enforced at compile time. That pitch has mostly held up on the CPU. On the GPU it fell apart, because writing custom kernels meant dropping out of the ownership model and back into unsafe territory.

cuTile Rust tries to close that gap with a tile-based approach. Mutable outputs get split into disjoint pieces, so two parts of a kernel can't quietly stomp on the same memory. Kernel launches carry the host-side ownership contract through to the device. And when you actually need lower-level control, there are local opt-out hatches rather than one giant unsafe block.
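
To make the disjoint-output idea concrete, here is a CPU-side analogy in plain standard-library Rust. It is not cuTile Rust code and does not use its API; the function name `scale_in_tiles` and the tile size are made up for illustration. The point is only that once a mutable buffer is split into non-overlapping pieces, the compiler can prove that parallel workers never write the same element.

```rust
use std::thread;

/// Hypothetical helper (not part of cuTile Rust): multiplies every element
/// by `factor`, using one scoped thread per tile.
/// `chunks_mut` yields non-overlapping `&mut [f32]` slices, so the borrow
/// checker rejects any attempt to hand the same memory to two workers.
/// Note: `chunks_mut` panics if `tile_len` is 0.
fn scale_in_tiles(data: &mut [f32], tile_len: usize, factor: f32) {
    thread::scope(|s| {
        for tile in data.chunks_mut(tile_len) {
            // Each thread takes exclusive ownership of its own tile.
            s.spawn(move || {
                for x in tile.iter_mut() {
                    *x *= factor;
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}

fn main() {
    let mut out = vec![1.0_f32, 2.0, 3.0, 4.0, 5.0];
    scale_in_tiles(&mut out, 2, 2.0);
    println!("{:?}", out);
}
```

Trying to spawn two workers on the same `tile` fails to compile, which is the CPU version of the guarantee the paper says it carries over to GPU kernels.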

That last detail matters more than the marketing around it. Most "safe" systems work fine until you hit the case they didn't plan for, and then you're writing raw pointers anyway. Scoping the escape hatches locally is the honest version of the promise.
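
For readers who haven't seen the pattern, this is roughly what a locally scoped escape hatch looks like in ordinary Rust. It is a generic sketch, not cuTile Rust's actual opt-out mechanism, and `strided_sum` is a hypothetical function. The unchecked access is confined to a single expression inside an otherwise safe function, with the justification written next to it.

```rust
/// Hypothetical example: sums every `stride`-th element of `data`.
/// The public signature is safe; only one expression opts out of checks.
fn strided_sum(data: &[f32], stride: usize) -> f32 {
    assert!(stride > 0, "stride must be non-zero");
    let mut total = 0.0;
    let mut i = 0;
    while i < data.len() {
        // SAFETY: the loop condition guarantees `i < data.len()`,
        // so the unchecked read stays in bounds.
        total += unsafe { *data.get_unchecked(i) };
        // Saturating add avoids overflow for very large strides.
        i = i.saturating_add(stride);
    }
    total
}

fn main() {
    let values = [1.0_f32, 2.0, 3.0, 4.0, 5.0];
    println!("{}", strided_sum(&values, 2));
}
```

Everything outside that one `unsafe` expression is still checked by the compiler, which is the property a reviewer cares about when auditing.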

Does it actually keep up?

This is where the paper either earns attention or doesn't. The authors report 7 TB/s for element-wise operations on an NVIDIA B200, and 2 PFlop/s for GEMM, which they put at 96% of cuBLAS. They also say it matches their Python implementation within measurement noise.

Ninety-six percent of cuBLAS is a real number, not a rounding-up-to-parity number. cuBLAS is NVIDIA's own heavily optimized library, so landing within four points while writing idiomatic Rust is the headline result. The element-wise figure is harder to contextualize without knowing the exact roofline, and the paper leans on its own measurements rather than third-party benchmarks, so take the precision with the usual grain of salt.
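
A quick back-of-the-envelope check helps here. The sketch below derives the cuBLAS throughput implied by the reported figures and shows how the element-wise number would compare against a peak bandwidth. The 8 TB/s peak is an assumption taken from NVIDIA's headline B200 memory-bandwidth figure, not a number from the paper, and the function names are illustrative.

```rust
/// Baseline throughput implied by `achieved = fraction * baseline`.
fn implied_baseline(achieved: f64, fraction: f64) -> f64 {
    achieved / fraction
}

/// Share of a peak figure that a measured figure reaches.
fn fraction_of_peak(measured: f64, peak: f64) -> f64 {
    measured / peak
}

fn main() {
    // Reported: 2 PFlop/s for GEMM, stated as 96% of cuBLAS.
    let cublas_pflops = implied_baseline(2.0, 0.96);
    println!("implied cuBLAS GEMM: {:.2} PFlop/s", cublas_pflops);

    // ASSUMPTION: 8 TB/s peak memory bandwidth for the B200.
    // This is not from the paper; swap in the real roofline if known.
    let assumed_peak_tbs = 8.0;
    let share = fraction_of_peak(7.0, assumed_peak_tbs);
    println!("element-wise share of assumed peak: {:.1}%", share * 100.0);
}
```

Under that assumed peak, 7 TB/s would be close to the bandwidth ceiling, but the conclusion is only as good as the peak figure plugged in.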

Running real models

To prove the abstractions survive contact with actual workloads, the team built Grout, an inference engine on top of cuTile Rust, and ran it through a Qwen3 path.

The reported numbers: 171 generated tokens/s for Qwen3-4B on an RTX 5090, and 82 tokens/s for Qwen3-32B on the B200. The paper calls these "competitive with vLLM and SGLang," which is the careful phrasing of someone who knows competitive is not the same as faster.

Still, competitive is the bar that matters here. vLLM and SGLang are the engines people actually deploy. A Rust-native option landing in the same neighborhood, with compile-time safety, is the argument the paper is really making.

One caveat is worth holding onto: these are batch-1 decode numbers, meaning a single stream with one request at a time. Serving systems live or die on throughput under load, which the abstract doesn't address.
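
Because these are single-stream figures, tokens per second converts directly into per-token latency. The small sketch below does that conversion for the two reported rates; `ms_per_token` is an illustrative helper, and the result says nothing about aggregate throughput with many concurrent requests.

```rust
/// Converts a single-stream decode rate into milliseconds per token.
fn ms_per_token(tokens_per_sec: f64) -> f64 {
    1000.0 / tokens_per_sec
}

fn main() {
    // Rates as reported in the article (batch size 1).
    let reported = [
        ("Qwen3-4B on RTX 5090", 171.0),
        ("Qwen3-32B on B200", 82.0),
    ];
    for (label, rate) in reported {
        println!("{label}: {:.2} ms per token", ms_per_token(rate));
    }
}
```

That works out to roughly 6 ms and 12 ms per generated token, respectively, for one user at a time.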

So what

For anyone building ML infrastructure in Rust, this chips away at the standard objection that the language can't touch the GPU without giving up what makes it Rust. Whether the broader ecosystem adopts cuTile Rust depends on tooling, release timing, and whether NVIDIA backs it as a product rather than a paper.

The work was submitted June 14, 2026, as version one. No public release date for cuTile Rust or Grout appears in the abstract, so the next thing to watch is whether code lands alongside the paper.

Oliver Senti
