
Today, Google DeepMind released DiffusionGemma — an experimental open model built for exceptionally fast text generation. NVIDIA has optimized DiffusionGemma to run even faster across NVIDIA GeForce RTX GPUs, the NVIDIA RTX PRO platform and NVIDIA DGX Spark systems, from local PCs to the cloud.
Rather than generating text one word at a time, DiffusionGemma generates multiple words in parallel to output whole blocks of text, opening a new, low-latency frontier for the kind of single-user workloads that developers, researchers and AI enthusiasts run every day.
Parallel generation: DiffusionGemma denoises up to 256 tokens per step instead of predicting one at a time.
Built on Gemma 4: DiffusionGemma pairs a diffusion head with Google’s Gemma 4 architecture, a 26-billion-parameter mixture-of-experts model that activates just 3.8 billion parameters per step.
Up to 4x faster performance: The boost delivers fast text generation on local hardware, where single-user generation usually stalls.
Open and local: DiffusionGemma is released as open weights under the permissive Apache 2.0 license and runs entirely on RTX and DGX Spark, with no cloud and no per-token cost. It has day-zero support in Hugging Face Transformers, vLLM and Unsloth.
Almost every large language model (LLM) in wide use today is autoregressive, meaning it generates text one token at a time, with each new token depending on all the tokens before it. That sequential process is what makes interactive AI feel like it’s typing.
DiffusionGemma takes a different path. Built on the Gemma 4 26B mixture-of-experts architecture, it generates text the way diffusion models generate images: by starting from noise and refining a whole block of text at once. Each step denoises up to 256 tokens in parallel rather than emitting a single token and waiting to compute the next.
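The difference in sequential work can be sketched with a toy loop. This is only an illustration, not DiffusionGemma's actual sampler: the `target` list stands in for whatever a real model would predict, the random reveal order is arbitrary, and the `steps` count of 8 is a hypothetical value, since the number of denoising steps per block is not stated here.

```python
import random

MASK = "<mask>"

def toy_autoregressive(target):
    """Emit one token per forward pass, left to right."""
    out, passes = [], 0
    for tok in target:
        passes += 1          # one model call per token
        out.append(tok)
    return out, passes

def toy_denoise_block(target, steps, seed=0):
    """Start from a fully masked block and fill in positions over `steps` passes.

    `target` is a stand-in for the model's predictions (an assumption of this toy);
    a real diffusion model would predict tokens rather than copy them.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)                      # arbitrary reveal order
    per_step = -(-len(target) // steps)      # ceiling division
    passes = 0
    while hidden:
        passes += 1                          # one model call refines the whole block
        for pos in hidden[:per_step]:
            block[pos] = target[pos]
        hidden = hidden[per_step:]
    return block, passes

tokens = [f"tok{i}" for i in range(256)]
ar_out, ar_passes = toy_autoregressive(tokens)
diff_out, diff_passes = toy_denoise_block(tokens, steps=8)
print(ar_passes, diff_passes)
```

Both loops end with the same 256 tokens, but the autoregressive loop needs one sequential pass per token, while the block loop needs only as many passes as there are denoising steps.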
The result is a model that thinks in blocks instead of sequentially. For latency-sensitive, single-user work — such as interactive chat, agentic loops or on-device assistants that plan and act — that parallelism translates into responses fast enough to keep pace with how developers think and iterate.
Generating one token at a time is fundamentally a memory-bound problem: a traditional LLM spends most of its time waiting on memory bandwidth, not doing math, which leaves a lot of compute on the table.
Diffusion flips the equation. Pulling a full 256-token block through the transformer in parallel is a compute-bound workload, which is exactly what NVIDIA GPUs are built for. NVIDIA Tensor Cores accelerate the dense parallel math, and the CUDA software stack lets the model run efficiently from day one without bespoke tuning. In short, the model’s design plays directly to the GPU’s strengths.
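A rough roofline-style estimate makes the memory-bound versus compute-bound distinction concrete. The sketch below assumes about 2 FLOPs per active parameter per token and 2-byte weights read once per forward pass. It ignores the KV cache, activations and the fact that a block of tokens in a mixture-of-experts model can touch more experts than a single token does. The peak-compute and bandwidth figures are illustrative placeholders, not specifications of any NVIDIA GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each weight is read once,
    so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a pass as memory-bound or compute-bound against a hardware ridge point."""
    ridge = peak_flops / mem_bandwidth   # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth (ridge = 100 FLOPs/byte).
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

print(bound(1, PEAK_FLOPS, MEM_BANDWIDTH))    # one token per pass
print(bound(256, PEAK_FLOPS, MEM_BANDWIDTH))  # a 256-token block per pass
```

Under these assumptions, a single-token pass does about 1 FLOP per byte of weights moved, far below the ridge point, while a 256-token block does about 256 FLOPs per byte and lands on the compute-bound side.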
That shows up in the numbers. DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark and the fastest local inference on NVIDIA DGX Station, roughly 4x faster than an equivalent autoregressive model running in the same single-user regime.
That advantage holds across NVIDIA’s full lineup, running: