Prefill vs Decode: Where the Time Goes, and What Batching Buys
Phase asymmetry and batch-scaling measurements for pfnet/plamo-2-1b
(Mamba-hybrid) vs pfnet/plamo-3-nict-2b (attention-only) — llama.cpp b9596,
Apple M4 Pro 48 GB (Metal), Q8 quants, llama-bench / llama-batched-bench
(npp=512, ntg=128).
1. The two phases are different problems
LLM inference has two phases with opposite resource profiles. Prefill processes the whole prompt in one parallel pass — it is compute-bound. Decode produces one token at a time, and each token must re-read every weight in the model — it is memory-bandwidth-bound. On this laptop the asymmetry is a factor of ~19×:
Decode at ~127 t/s × 1.34 GB of weights read per token ≈ 170 GB/s of effective bandwidth, about 62% of the ~273 GB/s the part can deliver: the textbook signature of bandwidth-bound decoding. Prefill speed in tokens per second is flat from 512-token to 2,048-token prompts, which indicates the GPU is already saturated with useful work at 512 tokens.
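The bandwidth arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes every decoded token streams the full set of weights exactly once and ignores KV-cache and recurrent-state traffic; the function name is illustrative, and the three input numbers are the ones quoted in the text.

```python
def effective_bandwidth_gbs(tokens_per_s: float, weight_gb: float) -> float:
    """GB of weights streamed per second if each decoded token reads all weights once."""
    return tokens_per_s * weight_gb


decode_tps = 127.0   # single-sequence decode rate quoted in the text
weight_gb = 1.34     # weight size quoted in the text
peak_gbs = 273.0     # nominal memory bandwidth quoted in the text

eff = effective_bandwidth_gbs(decode_tps, weight_gb)
utilization = eff / peak_gbs          # fraction of nominal bandwidth actually used
ceiling_tps = peak_gbs / weight_gb    # B=1 decode rate if bandwidth were fully used

print(f"effective bandwidth: {eff:.0f} GB/s")
print(f"share of peak:       {utilization:.0%}")
print(f"decode ceiling:      {ceiling_tps:.0f} t/s at B=1")
```

Under these assumptions the script reports about 170 GB/s, roughly 62% of peak, and a single-sequence ceiling near 204 t/s. No software change can push unbatched decode past that ceiling without shrinking the weights or sharing each read across sequences.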
2. What improved: batching amortizes the weight reads
If one decode step reads all weights to produce one token, the obvious fix is to make the same read produce N tokens — one per concurrent sequence. That is continuous batching, and it is nearly free throughput… for the attention model:
| Parallel sequences (B) | PLaMo 2 (hybrid), aggregate decode | scaling vs B=1 | PLaMo 3 (attention), aggregate decode | scaling vs B=1 |
|---|---|---|---|---|
| 1 | 128 t/s | 1.0× | 79 t/s | 1.0× |
| 2 | 223 t/s | 1.74× | 148 t/s | 1.88× |
| 4 | 263 t/s | 2.05× | 178 t/s | 2.27× |
| 8 | crash | — | 182 t/s | 2.31× |
| 16 | crash | — | 391 t/s | 4.97× |
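The scaling columns can be recomputed from the throughput columns, along with two derived figures the table does not show: throughput per sequence and the fraction of ideal linear scaling. The sketch below copies the table's rounded t/s values, so recomputed ratios can differ from the printed ones in the second decimal; the `summarize` helper is illustrative, not part of any benchmark tool.

```python
# Aggregate decode throughput (t/s) copied from the table; None marks a crashed run.
measured = {
    "PLaMo 2 (hybrid)":    {1: 128, 2: 223, 4: 263, 8: None, 16: None},
    "PLaMo 3 (attention)": {1: 79, 2: 148, 4: 178, 8: 182, 16: 391},
}


def summarize(runs):
    """Map B -> (scaling vs B=1, t/s per sequence, fraction of ideal B-fold scaling)."""
    base = runs[1]
    out = {}
    for b, tps in sorted(runs.items()):
        if tps is None:          # skip batch sizes that crashed
            continue
        scaling = tps / base
        out[b] = (scaling, tps / b, scaling / b)
    return out


for model, runs in measured.items():
    print(model)
    for b, (scaling, per_seq, eff) in summarize(runs).items():
        print(f"  B={b:>2}: {scaling:4.2f}x aggregate, "
              f"{per_seq:5.1f} t/s per sequence, {eff:4.0%} of ideal")
```

Per-sequence throughput is the number an individual user experiences: at B=16 the attention model's 391 t/s aggregate works out to roughly 24 t/s for each sequence.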
Three observations:
(a) The attention model gets ~5× aggregate decode throughput at B=16, with prefill flat at ~1,415 t/s throughout — it was already compute-bound, so batching adds nothing there. (The superlinear jump from B=8→16 likely reflects a Metal kernel-path switch at that batch width; unverified.)
(b) The hybrid bends earlier. Per-sequence Mamba state updates are compute that does not amortize across the batch the way shared weight-reads do — at B=4 the hybrid extracts 2.05× where the attention model gets 2.27×.
(c) The hybrid then crashes. At ≥6 parallel sequences, llama.cpp aborts with
GGML_ASSERT(… data_size + view_offs <= ggml_nbytes(view_src)) inside
build_plamo2_mamba_layer — an out-of-bounds view into the recurrent-state buffer. Deterministic
boundary: works at -npl 5, crashes at -npl 6; PLaMo 3 runs B=16 cleanly on the same
build. On this stack, multi-user serving of the hybrid simply stops at 5 concurrent sequences. (Upstream
report in preparation.)
llama-batched-bench -m plamo-2-1b-Q8.gguf -c 16384 -ngl 99 -npp 512 -ntg 128 -npl 6 # reproduces the crash
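To locate such a boundary without guessing, the same command can be swept over `-npl` one value at a time. The sketch below is one way to do that, under two assumptions: `llama-batched-bench` is on `PATH`, and a failed assertion makes the process exit with a non-zero status (on POSIX an abort shows up in Python as a negative return code). The helper names are hypothetical; the flags are exactly those of the command above.

```python
import subprocess


def bench_cmd(model_path: str, npl: int) -> list[str]:
    """Build the llama-batched-bench invocation shown above for a given -npl value."""
    return [
        "llama-batched-bench", "-m", model_path,
        "-c", "16384", "-ngl", "99",
        "-npp", "512", "-ntg", "128",
        "-npl", str(npl),
    ]


def find_first_failure(model_path: str, max_npl: int = 8, run=subprocess.run):
    """Return the smallest npl whose run exits non-zero, or None if every run succeeds."""
    for npl in range(1, max_npl + 1):
        # One process per value, so a crash at one batch size cannot hide the others.
        result = run(bench_cmd(model_path, npl), capture_output=True, text=True)
        if result.returncode != 0:
            return npl
    return None


# Example call (requires the binary and the model file to be present):
# print(find_first_failure("plamo-2-1b-Q8.gguf"))
```

The `run` parameter exists only so the sweep logic can be exercised with a stand-in function when the binary is not available.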
3. Why "more replicas" is the wrong lever on one box
Running N server processes on the same machine duplicates the weights N times in memory and makes N
processes compete for the same bandwidth — the exact resource decode is starved of. One server with
--parallel N shares a single weight copy across all sequences; that sharing is the 5×
above. Replication helps when you add machines (more aggregate bandwidth); on shared hardware,
batching wins by construction.
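The argument can be put in numbers with a toy bandwidth model. This is a sketch under strong assumptions: decode is limited only by weight reads, replicas split the memory bandwidth evenly, and compute, KV-cache traffic and per-sequence state updates are ignored. The results are therefore ceilings, not predictions, and the function names are illustrative. The inputs are the 273 GB/s and 1.34 GB figures from section 1.

```python
def replica_ceiling_tps(n_replicas: int, peak_gbs: float, weight_gb: float) -> float:
    """N processes each stream their own weight copy and split the same bandwidth."""
    per_process_bw = peak_gbs / n_replicas
    return n_replicas * (per_process_bw / weight_gb)   # N cancels: no aggregate gain


def batched_ceiling_tps(batch: int, peak_gbs: float, weight_gb: float) -> float:
    """One process streams the weights once per step and emits one token per sequence."""
    return batch * (peak_gbs / weight_gb)


def resident_weights_gb(n_copies: int, weight_gb: float) -> float:
    """Memory held by weights when n_copies of the model are loaded."""
    return n_copies * weight_gb


PEAK_GBS, WEIGHT_GB = 273.0, 1.34
for n in (1, 4, 16):
    print(f"N={n:>2}: replicas <= {replica_ceiling_tps(n, PEAK_GBS, WEIGHT_GB):6.0f} t/s "
          f"using {resident_weights_gb(n, WEIGHT_GB):5.2f} GB of weights; "
          f"batched <= {batched_ceiling_tps(n, PEAK_GBS, WEIGHT_GB):6.0f} t/s "
          f"using {resident_weights_gb(1, WEIGHT_GB):5.2f} GB")
```

In this model the replica ceiling stays near 204 t/s for every N while the weight footprint grows linearly. The batched ceiling grows with N only until compute or per-sequence work takes over, which is why the measured gain at B=16 is about 5× rather than 16×.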
4. Future improvements in this space
Ordered roughly by how directly they attack the measured bottleneck (bandwidth-bound decode):
| Lever | Idea | Status for PLaMo |
|---|---|---|
| Continuous batching | Amortize weight reads across users | Works for PLaMo 3 (~5×); hybrid capped at 5 seqs by the crash above — fixing it is the cheapest win |
| Lower-bit quantization | Fewer bytes per weight read = proportionally faster decode | Q4 ≈ 2× over Q8 in principle; for PLaMo 2 requires the ssm_out exemption we documented (study 1; llama.cpp#24501) |
| Speculative decoding | Verify k drafted tokens in one weight-read | Measured in study 1 (on llama.cpp): lossless but unprofitable for the hybrid (SSM checkpoint overhead). The same rollback now exists in llama.cpp, vLLM and SGLang (conditionally) — but the open win is self-drafting on attention-only PLaMo 3, which avoids the rollback tax entirely |
| MTP / EAGLE heads | Self-drafting without a separate model | No public head exists for any PLaMo; PLaMo 3's vocabulary suggests PFN has internal scaffolding (study 2) |
| Chunked prefill | Interleave prefill chunks with decode steps to protect latency under load | Supported by llama.cpp/vLLM; PFN reported ~3–10% gains in their vLLM meetup talk |
| P/D disaggregation | Separate prefill and decode onto different hardware pools | Server-scale only, and notably unavailable for SSM models in vLLM (state can't be transferred like KV cache): yet another place the hybrid architecture taxes deployment |
The through-line of all three studies: Samba-style hybrids buy training-time and
long-context efficiency, but on today's inference stacks they pay three measured deployment taxes —
speculative-decoding rollback overhead (study 1), quantization fragility of ssm_out (study 1),
and a batching ceiling (this study). PLaMo 3's return to full attention avoids all three. Quantifying that
trade-off is exactly the work of inference optimization.