NVIDIA released CUDA 13.1 this week, introducing CUDA Tile, a programming model that abstracts tensor core operations into higher-level data blocks. The update targets Blackwell architecture GPUs exclusively at launch, with C++ support and broader hardware compatibility planned for future releases.
What CUDA Tile Actually Does
Traditional CUDA programming requires developers to specify execution paths for individual threads using the SIMT (Single Instruction, Multiple Threads) model. CUDA Tile moves the abstraction up one level: developers define chunks of data called tiles and the mathematical operations on them, while the compiler and runtime determine thread scheduling.
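The shift in abstraction can be illustrated with a small conceptual sketch. This is NumPy standing in for GPU code, not cuTile syntax: the point is only that the SIMT style spells out per-thread indexing, while the tile style names a whole block of data and the operation on it.

```python
import numpy as np

# SIMT-style: each "thread" computes one element, with indexing spelled out.
def simt_style_add(a, b):
    c = np.empty_like(a)
    for tid in range(a.size):          # one iteration per logical thread
        c.flat[tid] = a.flat[tid] + b.flat[tid]
    return c

# Tile-style: the program names a whole tile and the operation on it;
# how that maps onto threads is left to the compiler and runtime.
def tile_style_add(a, b):
    return a + b                        # one operation over the whole tile

a = np.arange(8.0).reshape(2, 4)
b = np.ones((2, 4))
assert np.array_equal(simt_style_add(a, b), tile_style_add(a, b))
```

Both functions compute the same result; the difference is who owns the element-level scheduling decisions.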
The practical effect is that code targeting tensor cores no longer requires manual optimization for specific hardware generations. NVIDIA claims tile-based code will be forward-compatible with future GPU architectures without deep refactoring.
CUDA 13.1 ships two tile programming components. CUDA Tile IR provides a new virtual instruction set architecture analogous to PTX but designed specifically for matrix operations. cuTile Python offers a domain-specific language for authoring tile-based kernels in Python. A C++ implementation is expected in a subsequent release.
The current version focuses on AI algorithms and runs only on Blackwell products (compute capability 10.x and 12.x). NVIDIA's benchmarks show B200 and GB200 achieving roughly 2x speedups over H200 for BF16, FP8, and block-scaled operations, while B300 and GB300 reach between 2x and 6x depending on the workload and data type.
Green Contexts Move to Runtime API
Green contexts, introduced in CUDA 12.4's driver API, are now accessible through the runtime API. These lightweight alternatives to traditional CUDA contexts allow developers to partition GPU resources at the streaming multiprocessor level, dedicating specific SM sets to particular workloads.
The use case is latency-sensitive code that needs guaranteed compute resources. By allocating a subset of SMs to a green context for priority work, developers prevent contention from other GPU operations. CUDA 13.1 adds a more flexible split() API that reduces the number of calls needed to build SM partitions and allows configuration of work queues to minimize false dependencies between contexts.
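A toy model can show the shape of the idea. This is plain Python, not the CUDA green-context API, and every name in it is hypothetical; it illustrates how a single split call can carve a GPU's SM pool into several partitions at once, with sizes rounded up to a hardware granularity.

```python
# Toy model of SM partitioning -- NOT the CUDA green-context API.
# All names here are hypothetical illustrations.
def split(total_sms, counts, granularity=8):
    """Carve total_sms into partitions of roughly `counts` SMs each,
    rounded up to the granularity, returning partitions + remainder."""
    parts, used = [], 0
    for want in counts:
        size = -(-want // granularity) * granularity   # round up
        if used + size > total_sms:
            raise ValueError("not enough SMs left")
        parts.append(range(used, used + size))
        used += size
    remainder = range(used, total_sms)
    return parts, remainder

# One call yields two dedicated partitions plus the leftover pool,
# instead of one call per partition.
parts, rest = split(132, [16, 24])
```

The point of the flexible split is exactly this: fewer API calls to end up with the same set of dedicated SM groups.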
Multi-Process Service Gets Static Partitioning
CUDA 13.1 introduces static SM partitioning for the Multi-Process Service as an alternative to dynamic resource provisioning. The feature targets Ampere and newer architectures (compute capability 8.0+).
Developers enable static partitioning by launching the MPS control daemon with the -S flag. The partitioning granularity varies by architecture; Hopper discrete GPUs, for example, use 8-SM chunks. The goal is deterministic resource allocation and stronger isolation between MPS clients.
A new Memory Locality Optimization Partition feature on select Blackwell GPUs (compute capability 10.0 and 10.3) creates specialized CUDA devices optimized for memory locality. Each partition appears as a distinct device with separate compute and memory resources. B200 and B300 products each have two partitions; support for GB200 and GB300 is scheduled for a later CUDA release.
Developer Tools Updates
Nsight Compute 2025.4 adds Tile kernel profiling with a new column distinguishing Tile from SIMT kernels, a Tile Statistics section showing tile dimensions and pipeline utilization, and source page mapping to cuTile kernel source.
Compute Sanitizer 2025.4 introduces compile-time patching through the NVCC flag -fdevice-sanitize=memcheck. The instrumentation integrates memory error detection directly into compilation, catching issues like illegal accesses between adjacent allocations through base-and-bounds analysis. Only memcheck is currently supported.
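In practice the workflow looks something like the following. The -fdevice-sanitize=memcheck flag comes from the release itself; the file and binary names are placeholders.

```shell
# Compile with memcheck instrumentation baked in at build time
# (kernel.cu and app are placeholder names).
nvcc -fdevice-sanitize=memcheck -o app kernel.cu

# Run under Compute Sanitizer; the compile-time base-and-bounds checks
# catch out-of-bounds accesses, including between adjacent allocations.
compute-sanitizer --tool memcheck ./app
```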
Nsight Systems 2025.6.1, released alongside CUDA 13.1, adds system-wide CUDA tracing across process trees, CUDA host function tracing for Graph host function nodes, and hardware-based tracing as the default mode.
Math Library Performance
cuBLAS gains experimental Grouped GEMM APIs supporting FP8 and BF16/FP16 for Blackwell GPUs. The grouped operations provide a host-synchronization-free implementation with device-side shapes for CUDA Graph support. NVIDIA claims up to 4x speedup over multi-stream GEMM in mixture-of-experts workloads.
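The semantics of a grouped GEMM can be sketched in a few lines. This NumPy loop shows only the math the grouped call performs, not the cuBLAS API: one invocation multiplies many independent (A_i, B_i) pairs whose shapes may differ per group, which is the mixture-of-experts pattern where each expert sees a different-sized token batch.

```python
import numpy as np

# Semantics of a grouped GEMM: one call, many independent problems with
# per-group shapes. Sketch of the math only, not the cuBLAS API.
def grouped_gemm(groups):
    return [a @ b for a, b in groups]

rng = np.random.default_rng(0)
groups = [
    (rng.standard_normal((4, 8)), rng.standard_normal((8, 3))),   # "expert 0"
    (rng.standard_normal((2, 8)), rng.standard_normal((8, 5))),   # "expert 1"
]
outs = grouped_gemm(groups)   # result shapes: (4, 3) and (2, 5)
```

On the GPU, fusing these into one launch is what removes the host synchronization and the multi-stream juggling the 4x claim is measured against.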
cuSOLVER's batched SYEV routine for symmetric/Hermitian eigenvalue computation shows approximately 2x speedups on RTX PRO 6000 Blackwell Server Edition compared to L40S, tested on batches of 5,000 matrices with 24-256 rows.
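For readers unfamiliar with SYEV, the operation is a symmetric eigendecomposition applied independently to every matrix in a batch. Here np.linalg.eigh stands in for the cuSOLVER routine to show the shapes involved; the batch size is arbitrary.

```python
import numpy as np

# What a batched SYEV computes: eigenvalues/eigenvectors of each symmetric
# matrix in a batch. np.linalg.eigh stands in for the cuSOLVER routine.
rng = np.random.default_rng(1)
m = rng.standard_normal((16, 32, 32))
batch = (m + m.transpose(0, 2, 1)) / 2      # symmetrize each matrix
w, v = np.linalg.eigh(batch)                # w: (16, 32), v: (16, 32, 32)

# Spot-check the decomposition A v = v diag(w) for one matrix in the batch.
assert np.allclose(batch[0] @ v[0], v[0] * w[0])
```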
CCCL 3.1, bundled with CUDA 13.1, adds two new determinism options for floating-point reductions in CUB. The default two-pass algorithm guarantees bitwise-identical results run-to-run on the same GPU. A new "GPU-to-GPU" mode based on Kate Clark's reproducible reduction technique from GTC 2024 guarantees identical results across different GPUs at some performance cost. A "not-guaranteed" single-pass mode using atomics offers the fastest execution when determinism is expendable.
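The reason these modes differ comes down to float addition not being associative: the bit pattern of a sum depends on the order in which partial results combine. The sketch below is conceptual, not the CUB implementation; it contrasts a fixed-order two-pass reduction, which is reproducible by construction, with an atomics-style fold that accumulates partials in whatever order the contributing blocks happen to finish.

```python
# Float addition is not associative, so the sum's bit pattern depends on
# the order in which partials combine:
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Two-pass reduction: the combine order is fixed up front, so repeated runs
# on the same input are bitwise identical. (Conceptual sketch, not CUB.)
def two_pass_sum(xs, chunk=4):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)                  # combine in a fixed order

# Atomics-style single pass: partials fold in "arrival order", which on a
# real GPU varies from run to run, so the result can too.
def atomic_style_sum(xs, arrival_order, chunk=4):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    total = 0.0
    for i in arrival_order:               # models nondeterministic atomicAdd order
        total += partials[i]
    return total

xs = [0.1, 0.2, 0.3, 0.4] * 8
assert two_pass_sum(xs) == two_pass_sum(xs)   # reproducible on one GPU/input
```

The GPU-to-GPU mode goes one step further, also pinning the partial-sum boundaries so the order is identical across devices with different SM counts.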
Current Limitations
CUDA Tile's Blackwell-only restriction limits immediate adoption for developers targeting mixed hardware environments. The absence of C++ support at launch means Python is the only high-level entry point; teams with existing C++ CUDA codebases will likely wait for broader language support before evaluating migration.
NVIDIA describes the current release as focused on AI algorithms, with additional features and performance optimizations planned for subsequent versions. The tile model addresses a real pain point in GPU programming: the expertise required to extract peak performance from tensor cores across hardware generations. Whether the abstraction penalty proves acceptable compared to hand-tuned implementations remains an open question until independent benchmarks emerge.
CUDA 13.1 is available now from NVIDIA's developer portal.