
CUDA Tile shifts programming to a tile-centric abstraction. The developer describes computations as operations on tiles (structured blocks of data such as submatrices) without specifying threads, warps, or execution order. The compiler and runtime then automatically map those tile operations onto threads, tensor cores, Tensor Memory Accelerators (TMAs), and the GPU memory hierarchy. The programmer focuses on what computation should happen to the data, while CUDA determines how it runs efficiently on the hardware, which is meant to ensure performance portability across GPU generations, starting with Blackwell and extending to future architectures.
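To make the idea concrete, here is a minimal, framework-free sketch of tile-level thinking: a blocked matrix multiply written as operations on TILE x TILE sub-blocks rather than on individual elements or threads. This is plain Python for illustration only, not NVIDIA's CUDA Tile API; in the real model, the loop nest over tiles is what the programmer expresses, and the per-element work inside each tile is what the compiler and hardware take over.

```python
# Illustrative sketch only: tile-level decomposition of a matmul.
# Assumes square matrices whose dimension is a multiple of TILE.
TILE = 2

def matmul_tiled(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):           # iterate over output tiles
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):   # accumulate tile products
                # C-tile += A-tile @ B-tile; in a tile-native model,
                # this inner block is a single hardware tile operation.
                for i in range(i0, i0 + TILE):
                    for j in range(j0, j0 + TILE):
                        for k in range(k0, k0 + TILE):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The three outer loops describe *what* happens to the data at tile granularity; everything inside them is the part a tile-centric stack maps onto tensor engines and memory movers on the programmer's behalf.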
But why introduce such significant changes at the CUDA level? Several motives are behind the move: drastic architectural changes in GPUs, and the way modern GPU workloads operate. First, AI, simulation, and technical computing no longer revolve around scalar operations; they rely on dense tensor math. Second, Nvidia's recent hardware has followed the same trajectory, integrating tensor cores and TMAs as core architectural features. Third, both tensor cores and TMAs differ significantly between architectures.
From Volta (the first GPU architecture to incorporate tensor cores, as assisting units) to Blackwell (where tensor engines became the primary compute engines), Nvidia has repeatedly reworked how tensor engines are scheduled, how data is staged and moved, and how much of the execution pipeline is managed by warps and threads versus dedicated hardware. Early generations used tensor cores to execute warp-issued matrix instructions, but Blackwell shifted to tile-native execution pipelines with autonomous memory engines, fundamentally reducing the role of traditional SIMT control.
As a result, even as tensor hardware has scaled aggressively, the lack of uniformity across generations has made low-level tuning at the warp and thread level impractical. Nvidia therefore had to elevate CUDA toward higher-level abstractions that describe intent at the tile level rather than the thread level, leaving the optimizations to compilers and runtimes. A bonus of this approach is that it can extract performance gains across virtually all workloads throughout the active life cycle of each GPU architecture.
Note that CUDA Tile does not abandon the SIMT path through NVVM/LLVM and PTX altogether; developers can still write conventional SIMT kernels when they need them. However, when they need to use the tensor cores, they must write tile kernels.
At the center of this new CUDA Tile stack sits CUDA Tile IR, a virtual instruction set that plays the same role for tile workloads that parallel thread execution (PTX) plays for SIMT kernels. In the traditional CUDA stack, PTX serves as a portable abstraction for thread-oriented programs, ensuring that SIMT kernels persist across GPU generations. CUDA Tile IR is designed to provide the same long-term stability for tile-based computations: it defines tile blocks, their relationships, and the operations that transform them, but hides execution details that can change from one GPU family to another.
This virtual ISA also becomes the target for compilers, frameworks, and domain-specific languages that want to exploit tile-level semantics. Tool builders who previously generated PTX for SIMT can now create parallel backends that emit Tile IR for tensor-oriented workloads. The runtime takes Tile IR as input and assigns work to hardware pipelines, tensor engines, and memory systems in a way that maximizes performance without exposing device-level variability to the programmer.
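As a rough intuition for what "emitting Tile IR" means for a tool builder, the toy sketch below lowers a tile-level matmul description into a linear sequence of pseudo-instructions. The op names (`zero_tile`, `load_tile`, `tile_mma`, `store_tile`) are invented for this illustration and are not real CUDA Tile IR mnemonics; the point is only that a backend targets tile-granularity operations and leaves their mapping onto engines and memory to the runtime.

```python
# Toy illustration (not real CUDA Tile IR): lowering a tiled matmul
# C[m][n] += sum over k of A[m][k] * B[k][n] into pseudo tile ops.
def lower_tiled_matmul(m_tiles, n_tiles, k_tiles):
    ops = []
    for m in range(m_tiles):
        for n in range(n_tiles):
            ops.append("zero_tile %acc")              # clear accumulator tile
            for k in range(k_tiles):
                ops.append(f"load_tile %a, A[{m}][{k}]")
                ops.append(f"load_tile %b, B[{k}][{n}]")
                ops.append("tile_mma %acc, %a, %b")   # tile matrix-multiply-accumulate
            ops.append(f"store_tile %acc, C[{m}][{n}]")
    return ops
```

Nothing in the emitted sequence says which warp loads a tile or which tensor engine runs the multiply; in the real stack, those decisions belong to the runtime consuming the IR.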