
In direct comparisons with Dream, Block Diffusion, LLaDA, and speculative decoding with EAGLE-3-style draft verification, TiDAR offers the best balance of speed and benchmark accuracy within the paper’s test suite.
Those results make sense given the mechanism. TiDAR makes several predictions in the same forward pass, while the model’s weights and cached keys and values are already resident in GPU memory, so extra tokens are generated without extra memory traffic. At the small scales tested, the GPU remains memory-bound rather than compute-bound across several query positions, so the multi-token expansion runs almost for free.
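To make that intuition concrete, here is a minimal timing sketch (an assumption-laden toy, not the paper’s code) for a decode-style pass on a CUDA device. Because a batch-one pass is dominated by reading the weights, pushing a handful of query positions through the same weights costs barely more than pushing one.

```python
# Minimal sketch (assumptions, not the paper's code): in the memory-bound
# regime a batch-one pass is dominated by reading weights, so a few query
# positions cost nearly the same wall-clock time as a single one.
import time
import torch

device = "cuda"  # assumes a CUDA device, e.g. a single H100 as in the paper
d_model, n_layers = 4096, 32  # hypothetical model shape

# Stand-in per-layer weights; a real transformer adds attention and norms.
layers = [torch.nn.Linear(d_model, d_model, device=device, dtype=torch.float16)
          for _ in range(n_layers)]

def decode_step(num_query_tokens: int) -> float:
    x = torch.randn(1, num_query_tokens, d_model,
                    device=device, dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for layer in layers:
        x = layer(x)  # weight reads dominate at batch size 1
    torch.cuda.synchronize()
    return time.perf_counter() - start

for k in (1, 2, 4, 8, 16):
    decode_step(k)  # warm-up
    print(f"{k:>2} query tokens: {decode_step(k) * 1e3:.2f} ms")
```

Past some width the pass turns compute-bound and the latency curve bends upward; that boundary is exactly the “free slot” budget the next paragraph worries about.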
Ultimately, model size appears to be the limiting factor. Although the paper includes profiling results from Qwen3-32B, the method itself is not demonstrated beyond 8 billion parameters. The behavior of “free token slots” depends on the balance between compute density and memory bandwidth: a large model running in tensor-parallel mode may saturate compute earlier along the token dimension, shrinking the range in which multi-token expansions are cheap. The authors acknowledge this and mark long-context and large-scale experiments as future work.
Finally, the authors run all inference in standard PyTorch with FlexAttention on a single H100, without custom fused kernels or other low-level optimizations. This keeps the comparison between acceleration techniques fair, but it means the absolute throughput figures are not a ceiling: systems like Medusa, EAGLE-3, and optimized speculative decoders show materially higher speeds when tuned at the kernel level. TiDAR may benefit from similar engineering, but that work remains ahead.
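For readers curious what that setup looks like, the sketch below builds the kind of hybrid attention mask such a design implies using stock FlexAttention: causal over the committed prefix, bidirectional inside the draft block appended after it. The sizes and the exact mask shape are assumptions rather than the paper’s configuration, and it requires PyTorch 2.5+ on a CUDA device.

```python
# Hybrid-mask sketch with FlexAttention (assumptions, not the paper's code):
# causal attention over the committed prefix, bidirectional attention within
# the appended draft block.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

prefix_len, block_len = 128, 16  # hypothetical sizes
seq_len = prefix_len + block_len
B, H, D = 1, 8, 64

def hybrid_mask(b, h, q_idx, kv_idx):
    causal = q_idx >= kv_idx                                   # AR prefix
    in_block = (q_idx >= prefix_len) & (kv_idx >= prefix_len)  # draft block
    return causal | in_block

block_mask = create_block_mask(hybrid_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")

q, k, v = (torch.randn(B, H, seq_len, D, device="cuda", dtype=torch.float16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 144, 64])
```

Because the mask is expressed as a predicate rather than a materialized matrix, FlexAttention can compile it into a single fused attention call, which is what makes this kind of experiment possible without hand-written kernels.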
TiDAR represents an attempt to merge the two dominant families of multi-token decoding techniques. Instead of relying on a separate draft model, as speculative decoding does, or giving up chain factorization, as diffusion-only approaches do, the authors propose a single backbone that learns both behaviors. The benefit is simplicity at inference time and a reduction in model footprint. The trade-offs appear manageable at a small scale, and the method gives a tangible demonstration of how much unused parallelism lives inside a modern GPU during next-token generation.
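As an illustration of why a single backbone keeps inference simple, the toy loop below mimics the draft-then-verify cycle with stand-in heads (all names and numbers here are hypothetical; the paper’s exact procedure may differ): each step drafts a block in parallel, verifies it against one greedy causal pass, and commits the longest matching prefix plus one corrected token, so progress is guaranteed every step.

```python
# Toy draft-then-verify loop (hypothetical stand-ins, not the paper's code).
import torch

torch.manual_seed(0)
vocab, block_len = 50, 8

def ar_greedy(context: torch.Tensor, length: int) -> torch.Tensor:
    # Stand-in for the causal head: a deterministic function of the last token.
    out, last = [], context[-1]
    for _ in range(length):
        last = (last * 31 + 7) % vocab
        out.append(last)
    return torch.stack(out)

def draft_block(context: torch.Tensor) -> torch.Tensor:
    # Stand-in for the parallel drafter: mostly agrees with the causal head,
    # with occasional disagreements so rejection actually happens.
    block = ar_greedy(context, block_len)
    noise = torch.rand(block_len) < 0.2
    return torch.where(noise, torch.randint(0, vocab, (block_len,)), block)

context = torch.tensor([3, 14, 15])
for step in range(4):
    draft = draft_block(context)                        # parallel draft
    target = ar_greedy(context, block_len)              # one verify pass
    n_match = int((draft == target).long().cumprod(0).sum())
    committed = target[: min(n_match + 1, block_len)]   # prefix + 1 correction
    context = torch.cat([context, committed])
    print(f"step {step}: accepted {len(committed)}/{block_len} drafted tokens")
```

Because acceptance is checked against the causal head itself, the committed text is identical to what plain greedy next-token decoding would produce; the parallelism only changes how many tokens commit per pass.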
How far that promise carries depends entirely on whether TiDAR scales. If its training recipe can be applied to large models without destabilizing optimization or exhausting memory budgets, it could offer a path to higher per-GPU throughput in cloud settings and lower latency for local inference on consumer GPUs. If, on the other hand, the “free slot” region shrinks once parameter counts and context windows expand, TiDAR may settle as a useful piece of research rather than a practical replacement for speculative decoding or multi-head approaches.
What the paper succeeds in showing is that autoregressive and diffusion predictors do not need to exist in separate networks. A single transformer can learn both, and it can do so without discarding the KV cache structures that make next-token generation viable at scale.
That is a meaningful contribution to inference acceleration, and the real test will come when the architecture is pushed into the size range where commercial models operate and where memory bandwidth no longer hides the cost of expanding the token dimension.
Luke James is a freelance writer and journalist. Although his background is in law, he has a personal interest in all things tech, especially hardware and microelectronics, and anything regulatory.