Benchmarking Gemma 4 on DGX Spark
The GB10 Grace Blackwell Superchip in NVIDIA’s DGX Spark has one property that matters for LLM inference at this model size: 128GB of unified LPDDR5X memory, a single physical pool shared between CPU and GPU. When Gemma 4 shipped last week, I ran it through every available inference stack to find out what that hardware advantage actually translates to in practice.
The main finding: model architecture drives a roughly 8x throughput difference. Stack choice is secondary.
The Hardware
The GB10 integrates a 72-core Arm CPU and a Blackwell GPU on a single die, sharing 128GB of unified LPDDR5X memory at ~273 GB/s bandwidth. CPU and GPU access the same physical memory pool, which eliminates the PCIe transfer overhead present in discrete GPU setups.
For LLM inference, the relevant number is memory capacity. The 30B+ parameter range sits in an awkward spot: too large for consumer GPU VRAM, too small to justify full datacenter hardware. At 128GB unified, running these models at full precision is straightforwardly feasible. Gemma 4 31B at BF16 requires ~62GB. The 26B MoE model at FP16 sits around 55GB. Both fit with 66-73GB remaining for KV cache and system overhead.
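The capacity math is simple enough to sketch: weight memory is parameter count times bytes per parameter. This ignores embeddings, KV cache, and runtime overhead (which is why real checkpoints run a few GB above these figures), but it's close enough for fit/no-fit decisions:

```python
def weight_gb(params_b, bits_per_param):
    """Approximate weight memory in GB: params (billions) x bits per param / 8."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

TOTAL_GB = 128  # GB10 unified memory

for name, params, bits in [
    ("Gemma 4 31B @ BF16", 31, 16),
    ("Gemma 4 26B MoE @ FP16", 26, 16),
    ("Gemma 4 31B @ NVFP4", 31, 4),
]:
    gb = weight_gb(params, bits)
    # Actual footprints land slightly higher once embeddings/overhead are counted
    print(f"{name}: ~{gb:.0f} GB weights, ~{TOTAL_GB - gb:.0f} GB headroom")
```

The same function explains why the 4-bit variants are attractive on paper: 31B at 4 bits is ~15.5GB of weights, leaving over 100GB of headroom.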
What I Tested
- Models: Gemma 4 31B (dense) and Gemma 4 26B (Mixture of Experts, ~4B active parameters per token)
- Stacks: llama.cpp, Ollama, vLLM
- Quantizations: FP16 / BF16, Q8_0, Q4_K_M, NVFP4
- Tool: genai-perf 0.0.16, 512 input tokens, 512 output tokens, 5 requests per run
- Concurrency: 1 (single user) and 4 (light multi-user load)
Metrics that matter: output tok/s (generation speed), TTFT p50 (time to first token), ITL p50 (inter-token latency, the “streaming feel”).
For reference: 10-20 tok/s is usable but noticeably slow. 20-40 tok/s feels like comfortable typing speed. 40-80 tok/s is smooth enough that you stop noticing the generation. Above 80 tok/s feels instantaneous.
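To make the three metrics concrete, here is how they fall out of a stream of token arrival timestamps. This is a simplified sketch of what a tool like genai-perf computes, not its actual implementation; the timestamps below are illustrative:

```python
import statistics

def stream_metrics(request_start, token_times):
    """Compute TTFT, median ITL, and output tok/s from token arrival times (seconds)."""
    ttft = token_times[0] - request_start                      # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # gaps between tokens
    duration = token_times[-1] - request_start
    return {
        "ttft_ms": ttft * 1e3,
        "itl_p50_ms": statistics.median(itls) * 1e3,
        "output_tok_s": len(token_times) / duration,
    }

# Illustrative stream: first token at 0.5s, then one token every 40ms
times = [0.5 + 0.040 * i for i in range(100)]
print(stream_metrics(0.0, times))
```

The split matters because the three numbers can move independently: a stack can have great TTFT (fast prefill) and mediocre ITL (slow decode), or the reverse.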
The Obvious Setup, and Why It Let Me Down
I started with vLLM + Gemma 4 31B at BF16. Makes sense on paper, right? vLLM is the production-standard inference stack. BF16 keeps full model quality. The hardware can hold it. What could possibly go wrong?
3.51 tok/s at c1.
That’s near reading speed. Technically functional; completely impractical for interactive use.
Next up: NVFP4 quantization using NVIDIA’s own nvidia/Gemma-4-31B-IT-NVFP4 model, purpose-built for Blackwell tensor cores. This should be the throughput ceiling.
6.34 tok/s. Roughly on par with a mid-range consumer GPU running a 7B model.
Then Ollama on the same dense 31B model: 9.4 tok/s. Marginally usable for single-user chat. The 106ms inter-token latency makes streaming noticeably choppy.
Something was clearly broken with the NVFP4 result specifically. Worth digging into before I wrote off the whole path.
The NVFP4 Dead End
The most plausible explanation I’ve found involves two overlapping software gaps. I haven’t confirmed these by building custom containers from source, so treat this as the best available theory consistent with the error output and container inspection.
First: PyTorch in every available container wasn’t compiled with sm_121-specific kernels. The GB10 GPU is sm_121 (CUDA capability 12.1, the Blackwell microarchitecture). Available containers max out at sm_120 or earlier. So the GPU falls back to Hopper or older codepaths for matrix multiply operations instead of running native Blackwell FP4 tensor core kernels.
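On a live system this is checkable directly: `torch.cuda.get_arch_list()` reports the SM architectures a PyTorch build was compiled for, and `torch.cuda.get_device_capability()` reports the device's capability. The check itself is one line; the arch list below is illustrative of what the containers shipped, not an exact dump:

```python
def has_native_kernels(compiled_archs, capability):
    """True if the PyTorch build includes kernels for the device's SM version.

    On a real system: compiled_archs = torch.cuda.get_arch_list(),
    capability = torch.cuda.get_device_capability().
    """
    major, minor = capability
    return f"sm_{major}{minor}" in compiled_archs

# Illustrative arch list resembling the shipped containers (not an exact dump)
container_archs = ["sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]
print(has_native_kernels(container_archs, (12, 1)))  # GB10: CUDA capability 12.1
```

When the check fails, CUDA falls back to the closest compatible codepath, which is exactly the behavior described above.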
Second: torch.compile’s Triton JIT autotuner (which picks optimized GPU kernels at runtime) fails to find suitable sm_121 kernels and tells you so directly: `Not enough SMs to use max_autotune_gemm mode`.

I tested every available image:
| Container | ARM64 | sm_121 PyTorch | Gemma 4 transformers | Verdict |
|---|---|---|---|---|
| vllm/vllm-openai:v0.8.5 | No | No | Yes | x86 only, unusable on GB10 |
| gemma4-cu130 | Partial (x86 pull) | No | Yes | Falls back to compat path |
| gemma4-0409-arm64-cu130 | Yes | No | Yes | Best available, still falls back |
| cu130-nightly-aarch64 | Yes | Yes | No | sm_121 PyTorch but no Gemma 4 support |
No available image satisfies all three requirements at once. The NVFP4 weights load fine: 4-bit precision, roughly 20GB versus the 62GB BF16 model, and that memory reduction is real. But compute runs on a non-native fallback path, so the throughput benefit of Blackwell’s FP4 tensor cores never shows up. What you get is a modest bandwidth win from the smaller working set (6.34 vs 3.51 tok/s at c1), not the substantial uplift NVFP4 on native Blackwell hardware should deliver.
When NVIDIA eventually ships a container combining ARM64 + sm_121-compiled PyTorch + Gemma 4 transformers support, the NVFP4 number should improve substantially. Any projection is rough: somewhere above the best measured dense 31B result (~17 tok/s), with the actual native NVFP4 ceiling unknown until that container exists.
The Architecture Unlock
After exhausting the obvious options, I ran the same benchmarks against the other Gemma 4 configuration.
Gemma 4 ships in two architectures.
Gemma 4 31B is a standard dense transformer. Every token activates all 31 billion parameters. The full weight matrix participates in every single forward pass, no exceptions.
Gemma 4 26B is a Mixture of Experts model. Despite 26 billion total parameters, only approximately 4 billion are active per token. The inactive experts still sit in memory (and participate in prefill to some degree), but for decode the dominant cost is the active expert set: roughly 4B parameters read per step.
That distinction is the entire story.
I ran the 26B MoE through llama.cpp at FP16. No quantization, full precision, same stack I’d been using on the dense model. Results:
22.49 tok/s. TTFT 561ms. ITL 43ms.
Compare that to llama.cpp FP16 on the 31B dense model: 2.86 tok/s. TTFT 1,468ms. ITL 283ms.
Same stack. Same precision. Same hardware. ~7.9x faster throughput and ~6.6x better inter-token latency.
The explanation is pretty direct. LLM decoding is memory-bandwidth-bound: the bottleneck is how fast you can read model weights from memory to compute each token. At the GB10’s 273 GB/s unified memory bandwidth, reading 4B FP16 parameters (~8GB) takes about 29ms. Reading 31B FP16 parameters (~62GB) takes about 227ms. That 7.8x bandwidth ratio predicts the 7.9x throughput ratio almost exactly. The memory subsystem is the ceiling; the architecture determines how close you get to it.
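The back-of-envelope version of that paragraph, treating one full read of the active weights as the lower bound on per-token decode time:

```python
BANDWIDTH_GB_S = 273  # GB10 unified LPDDR5X bandwidth

def decode_ms_per_token(active_params_b, bytes_per_param=2):
    """Lower-bound decode time per token: one full read of the active FP16 weights."""
    weight_gb = active_params_b * bytes_per_param  # params in billions -> GB
    return weight_gb / BANDWIDTH_GB_S * 1e3

moe = decode_ms_per_token(4)     # ~29 ms/token -> ceiling of ~34 tok/s
dense = decode_ms_per_token(31)  # ~227 ms/token -> ceiling of ~4.4 tok/s
print(f"MoE: {moe:.0f} ms/token, dense: {dense:.0f} ms/token, ratio {dense/moe:.2f}x")
```

The measured results (22.49 and 2.86 tok/s) land below those ceilings, as they must, since decode also reads KV cache and does actual compute, but the ratio between them tracks the bandwidth math almost exactly.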
So the smaller model produces better practical results with no quantization needed. One honest caveat: I didn’t run formal quality benchmarks comparing the two architectures. The 26B MoE and 31B dense are distinct models, not size variants of the same thing. Their output on complex reasoning tasks or long-context work may differ.
Full Results
| Model | Stack | Quant | c | Output tok/s | TTFT p50 (ms) | ITL p50 (ms) |
|---|---|---|---|---|---|---|
| Gemma 4 31B | llama.cpp | FP16 | 1 | 2.86 | 1,468 | 283 |
| Gemma 4 31B | llama.cpp | Q8_0 | 1 | 5.07 | 1,668 | 166 |
| Gemma 4 31B | llama.cpp | Q8_0 | 4 | 13.77 | 2,934 | 185 |
| Gemma 4 31B | llama.cpp | Q4_K_M | 1 | 7.77 | 1,271 | 104 |
| Gemma 4 31B | llama.cpp | Q4_K_M | 4 | 17.44 | 2,502 | 120 |
| Gemma 4 31B | Ollama | Q4_K_M | 1 | 9.4 | 106 * | 106 |
| Gemma 4 31B | vLLM | BF16 | 1 | 3.51 | 596 | 284 |
| Gemma 4 31B | vLLM | BF16 | 4 | 8.66 | 428 | 294 |
| Gemma 4 31B | vLLM | NVFP4 | 1 | 6.34 | 359 | 157 |
| Gemma 4 31B | vLLM | NVFP4 | 4 | 15.84 | 357 | 159 |
| Gemma 4 26B MoE | llama.cpp | FP16 | 1 | 22.49 | 561 | 43 |
| Gemma 4 26B MoE | llama.cpp | FP16 | 4 | 41.01 | 978 | 73 |
| Gemma 4 26B MoE | llama.cpp | Q8_0 | 1 | 38.62 | 387 | 25 |
| Gemma 4 26B MoE | llama.cpp | Q8_0 | 4 | 63.33 | 930 | 48 |
| Gemma 4 26B MoE | llama.cpp | Q4_K_M | 1 | 57.62 | 332 | 17 |
| Gemma 4 26B MoE | llama.cpp | Q4_K_M | 4 | 91.80 | 914 | 33 |
| Gemma 4 26B MoE | Ollama | Q4_K_M | 1 | 59.4 | 20 * | 17 |
| Gemma 4 26B MoE | vLLM | BF16 | 1 | 21.53 | 259 | 45 |
| Gemma 4 26B MoE | vLLM | BF16 | 4 | 62.81 | 281 | 63 |
* Ollama results via native /api/chat, not genai-perf. TTFT excludes cold-start first request (model pre-loaded with keep_alive=-1). The 20ms TTFT under Ollama vs 332ms under llama.cpp at the same Q4_K_M quantization is a 16x gap I have not fully explained. Ollama’s model-keep-alive behavior likely eliminates loading overhead that llama.cpp may include in its TTFT measurement, but I have not isolated this variable. Treat the Ollama TTFT as a best-case, warm-model figure.
What to Actually Run
A few things worth knowing about the stack tradeoffs:
llama.cpp vs vLLM: vLLM consistently wins on TTFT: 2 to 2.5x faster prefill, thanks to fused attention and PagedAttention. At higher concurrency, vLLM’s continuous batching starts pulling ahead too; the MoE BF16 c4 result (62 tok/s) actually beats llama.cpp FP16 c4 (41 tok/s). But for single-user raw decode throughput, llama.cpp Q4_K_M wins by a pretty wide margin. The practical split, in my view: vLLM for multi-user APIs or anything latency-sensitive; llama.cpp for single-user throughput with simpler setup.
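One way to read the concurrency columns, assuming the c4 figures are aggregate throughput across the four streams (my reading of the table): aggregate goes up, per-user goes down, and vLLM's batching loses less per user. A quick pass over the 26B MoE pairs from the table above:

```python
# (config, c1 tok/s, c4 aggregate tok/s) for the 26B MoE, copied from the table
runs = [
    ("llama.cpp FP16",   22.49, 41.01),
    ("llama.cpp Q4_K_M", 57.62, 91.80),
    ("vLLM BF16",        21.53, 62.81),
]
for name, c1, c4 in runs:
    per_user = c4 / 4  # each of the 4 concurrent streams gets this on average
    print(f"{name}: c4 scaling {c4/c1:.2f}x, "
          f"per-user {per_user:.1f} tok/s ({per_user/c1:.0%} of c1)")
```

vLLM's ~2.9x scaling versus llama.cpp's ~1.6-1.8x is the continuous-batching advantage showing up in the numbers.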
Ollama: Wraps llama.cpp internally, and the throughput numbers track closely (59.4 vs 57.62 tok/s for MoE Q4_K_M at c1). Easiest stack to get running, solid performance for personal use. The TTFT anomaly noted above is worth keeping in mind if you’re comparing raw latency numbers across stacks. It’s probably not showing you what you think it is.
Quantization on the MoE model: Going from FP16 to Q4_K_M on the 26B MoE nets roughly 2.5x throughput (22 to 57 tok/s). At this model size, the practical quality difference between FP16 and Q4_K_M is small. I haven’t formally measured it, but anecdotally it’s not alarming. FP16 at 22 tok/s is genuinely comfortable for single-user chat. Q4_K_M at 57-91 tok/s gives you real headroom for multi-user or agentic workloads.
What Happens When NVIDIA Closes the Software Gap
NVFP4 at 6.34 tok/s isn’t the final answer. When NVIDIA ships a container combining ARM64 + sm_121-compiled PyTorch + Gemma 4 transformers support, that number should improve substantially. The compute path currently being bypassed is exactly what Blackwell’s FP4 tensor cores are designed for. That changes the stack conversation: NVIDIA’s official quantized model running on native hardware, potentially competitive with current llama.cpp numbers through a supported path.
The MoE architecture advantage is independent of that. A model activating 4B parameters per token decodes faster than one activating 31B at the same memory bandwidth, regardless of what the dense NVFP4 path eventually delivers. The bandwidth math doesn’t change.
For now: grab the 26B MoE GGUF, serve it through llama.cpp at FP16 or Q4_K_M, and revisit vLLM + NVFP4 when NVIDIA ships that container update.