Machine Learning

An AI System Is Now Writing GPU Code That Outperforms NVIDIA's Own Libraries

Reinforcement learning enables automated kernel generation that beats hand-tuned CUDA implementations across 1,000 matrix configurations.

Oliver Senti
Senior AI Editor
December 6, 2025 · 5 min read
[Image: Abstract visualization of AI-optimized GPU code showing flowing data streams through a geometric representation of tensor core architecture]

A research group called DeepReinforce has published CUDA-L2, a system that combines large language models with reinforcement learning to automatically generate GPU code for matrix multiplication. The generated kernels run 11% to 29% faster than NVIDIA's own optimized libraries, depending on the workload scenario. The code and pre-optimized kernels for NVIDIA A100 GPUs are now available on GitHub.

The implications extend beyond academic interest. Matrix multiplication consumes the majority of compute time during training and inference of large language models. Even modest improvements in these operations translate directly into reduced costs and faster iteration cycles for AI development.

How CUDA-L2 Works

Traditional approaches to GPU kernel optimization rely on human engineers crafting templates and using automated tuners to adjust parameters like tile sizes and memory access patterns. CUDA-L2 takes a fundamentally different approach: the language model writes complete CUDA source code from scratch for each specific matrix dimension.
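To make the contrast concrete, here is a minimal sketch of what a conventional autotuner does: grid-search the parameters of a fixed kernel template and keep the fastest configuration. The `run_kernel` cost function below is a toy stand-in for compiling and timing a real kernel; nothing here is CUDA-L2's actual code.

```python
import itertools

def run_kernel(tile_m, tile_n, tile_k):
    """Stand-in for compiling and timing one templated kernel.
    A real tuner would launch it on the GPU and measure wall time;
    this toy cost model simply favors mid-sized tiles."""
    return abs(tile_m - 64) + abs(tile_n - 64) + abs(tile_k - 32)

def autotune(tile_sizes=(16, 32, 64, 128), k_sizes=(8, 16, 32)):
    """Exhaustive grid search over the template's parameter space.
    Note what it CANNOT do: change loop structure, programming
    style, or anything outside the enumerated parameters."""
    best_cfg, best_cost = None, float("inf")
    for tm, tn, tk in itertools.product(tile_sizes, tile_sizes, k_sizes):
        cost = run_kernel(tm, tn, tk)
        if cost < best_cost:
            best_cfg, best_cost = (tm, tn, tk), cost
    return best_cfg

print(autotune())  # -> (64, 64, 32) under this toy cost model
```

The limitation is structural: the search space is fixed by the template's author, which is exactly what CUDA-L2's source-level generation relaxes.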

The system uses DeepSeek V3, a 671-billion-parameter mixture-of-experts model, as its code generator. The researchers fine-tuned this model on a curated dataset of CUDA kernels drawn from PyTorch's ATen library, NVIDIA's CUTLASS templates, and various example implementations. This continued pretraining teaches the model the syntax and patterns of high-performance GPU code.

What makes the approach distinctive is its reinforcement learning loop. Generated kernels are compiled and executed on actual GPU hardware. The system measures both correctness (validated against FP32 CPU reference implementations) and execution speed. This performance data becomes the reward signal that guides subsequent code generation. Over many iterations, the model develops its own internal heuristics for what makes GPU code fast, rather than relying on rules encoded by human experts.
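The loop described above can be outlined in a few lines. This is a toy, CPU-only illustration of the reward structure, correctness gate first and then measured speed, not the released system; the candidate "kernels" here are plain Python callables standing in for generated CUDA code.

```python
import time

def reference_matmul(a, b):
    """CPU reference implementation used to validate candidates,
    playing the role of the paper's FP32 check."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def evaluate_candidate(kernel, a, b, baseline_time):
    """One iteration of the reward loop: run the candidate, reject
    it outright if the output is wrong, otherwise reward it in
    proportion to its measured speedup over the baseline."""
    start = time.perf_counter()
    out = kernel(a, b)
    elapsed = time.perf_counter() - start
    if out != reference_matmul(a, b):
        return 0.0                               # incorrect -> zero reward
    return baseline_time / max(elapsed, 1e-9)    # reward = measured speedup

# A correct candidate earns a positive reward; a wrong one earns none.
a, b = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
good = lambda x, y: reference_matmul(x, y)
bad = lambda x, y: [[0, 0], [0, 0]]
print(evaluate_candidate(good, a, b, 1.0) > 0)   # True
print(evaluate_candidate(bad, a, b, 1.0))        # 0.0
```

The key property is that the reward comes from actual execution, not from a performance model, so the generator is optimized against ground truth.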

The LLM can modify not just parameters but the fundamental structure of the code. This includes choosing between different programming styles (raw CUDA, CuTe, CUTLASS, or inline PTX assembly), altering loop structures, and selecting tiling strategies, padding approaches, and swizzle patterns that conventional autotuners cannot explore.

Benchmark Results Against Industry Standards

The research team evaluated CUDA-L2 across 1,000 distinct matrix dimension configurations on NVIDIA A100 GPUs. These configurations represent the range of shapes commonly encountered in attention layers and feed-forward networks of models like Llama, Qwen, and DeepSeek itself.

In offline benchmarks, where kernels execute consecutively without pauses, CUDA-L2 achieved an average speedup of 22% over PyTorch's torch.matmul and 19.2% over cuBLAS with optimal layout configuration. Against cuBLASLt with its heuristic-based algorithm selection, the improvement was 16.8%. The most competitive baseline, cuBLASLt-AutoTuning, which exhaustively tests up to 100 kernel candidates per configuration, still fell 11.4% short of CUDA-L2's performance.
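As a point of arithmetic, an "X% speedup" (read here as a throughput ratio of 1 + X/100, which is a common convention and an assumption about how the paper reports its numbers) means the new kernel needs 1/(1 + X/100) of the baseline's time:

```python
def time_ratio(speedup_pct):
    """Fraction of the baseline's runtime the faster kernel needs,
    reading 'X% speedup' as a throughput ratio of (1 + X/100)."""
    return 1.0 / (1.0 + speedup_pct / 100.0)

# Offline-benchmark figures from the paper, converted to time ratios.
for name, pct in [("torch.matmul", 22.0), ("cuBLAS", 19.2),
                  ("cuBLASLt", 16.8), ("cuBLASLt-AutoTuning", 11.4)]:
    print(f"{name}: runs in {time_ratio(pct):.1%} of baseline time")
```

So the 22% headline number corresponds to roughly 82% of the original runtime per operation.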

Server-mode benchmarks, designed to simulate real-time inference with random intervals between kernel calls, showed even larger gains. Here CUDA-L2 delivered 28.7% improvement over torch.matmul, 26% over cuBLAS, and 15.9% over cuBLASLt-AutoTuning. The researchers attribute the enhanced server-mode performance to kernels that handle intermittent workloads more efficiently than their hand-tuned counterparts.

What This Means for AI Training Economics

For organizations training or fine-tuning large language models, HGEMM (half-precision general matrix multiply) operations represent a dominant fraction of GPU utilization. A 20% improvement in these kernels does not mean 20% faster training overall, since matrix multiplies are only part of each training step, but the savings are still meaningful: they translate into more training tokens processed within the same compute budget, more hyperparameter sweeps, or lower cloud GPU costs.
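The gap between a 20% kernel speedup and end-to-end speedup is just Amdahl's law. A quick sketch, with the 60% HGEMM share of step time chosen purely for illustration (the article does not give an exact figure):

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: end-to-end speedup when only `kernel_fraction`
    of total runtime benefits from kernels that are
    `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Hypothetical: HGEMM is 60% of step time, kernels get 1.2x faster.
print(round(overall_speedup(0.60, 1.20), 3))  # -> 1.111
```

Under those assumptions, a 20% kernel gain yields roughly an 11% end-to-end improvement; the larger the matmul share of the workload, the closer the two numbers get.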

The breadth of optimization matters as much as the depth. Because CUDA-L2 handles approximately 1,000 real matrix sizes rather than a handful of manually tuned configurations, the speedups apply across diverse model architectures rather than only benefiting specific layer dimensions.

Current Limitations and Future Plans

The released kernels are currently optimized exclusively for NVIDIA's Ampere architecture (A100 GPUs). The researchers note that kernels trained on A100 may not transfer performance gains to other architectures. Running them on an RTX 3090 or H100 might work, but speedups are not guaranteed.

The team's roadmap includes expanding support to Ada Lovelace, Hopper, and Blackwell architectures, adding denser matrix configurations, supporting HGEMM with 32-bit accumulators (the current release uses 16-bit), and building easier deployment paths for open-source LLMs.

For researchers and practitioners who need matrix dimensions not currently covered, the team invites dimension requests via GitHub issues and has indicated willingness to release additional optimized kernels.

A Shift in How Performance Engineering Gets Done

The CUDA-L2 results challenge a long-held assumption in systems programming: that highly optimized, vendor-tuned libraries represent a performance ceiling. NVIDIA employs experienced engineers who have refined cuBLAS over many years with deep knowledge of GPU microarchitecture. That an automated system can consistently exceed this level of optimization suggests that the configuration space is simply too vast for manual exploration.

The researchers frame this as vindication of LLM-guided reinforcement learning for systems optimization. The approach sidesteps the need for explicit performance models or hand-crafted heuristics. Instead, the system learns what works by directly measuring what runs fast.

Whether NVIDIA or other chip vendors will adopt similar techniques for their own library development remains to be seen. At minimum, the research demonstrates that there is likely performance left on the table in many supposedly optimal implementations, waiting for more exhaustive search methods to find it.

Tags: CUDA, GPU optimization, reinforcement learning, LLM, NVIDIA, matrix multiplication, AI code generation, machine learning, DeepReinforce, cuBLAS

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


