Prefill vs Decode: Where the Time Goes, and What Batching Buys

Phase asymmetry and batch-scaling measurements for pfnet/plamo-2-1b (Mamba-hybrid) vs pfnet/plamo-3-nict-2b (attention-only) — llama.cpp b9596, Apple M4 Pro 48 GB (Metal), Q8 quants, llama-bench / llama-batched-bench (npp=512, ntg=128).

Gurunath · June 2026 · Study 3 of the PLaMo series · working notes

1. The two phases are different problems

LLM inference has two phases with opposite resource profiles. Prefill processes the whole prompt in one parallel pass, so it is compute-bound. Decode produces one token at a time, and each token must re-read every weight in the model, so it is memory-bandwidth-bound. On this laptop the asymmetry is a factor of ~19×.

Decode at ~127 t/s × 1.34 GB of weights ≈ 170 GB/s of effective bandwidth on a ~273 GB/s part — the textbook signature of bandwidth-bound decoding. Prefill speed is flat from 512 to 2,048-token prompts: the GPU is already saturated with useful work.
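The arithmetic behind that estimate is short enough to write down. This is a rough sketch, assuming each decoded token streams the full 1.34 GB of weights exactly once and ignoring KV-cache, recurrent-state and activation traffic; the function and constant names are illustrative only.

```python
# Back-of-envelope check of the bandwidth-bound claim, using the figures quoted above.
DECODE_TOKENS_PER_S = 127.0   # measured single-sequence decode rate (t/s)
WEIGHT_GB = 1.34              # weight bytes read per decoded token (GB)
PEAK_BANDWIDTH_GB_S = 273.0   # nominal memory bandwidth of the part (GB/s)


def effective_bandwidth_gb_s(tokens_per_s: float, weight_gb: float) -> float:
    """GB/s streamed if every token re-reads all weights once (assumption)."""
    return tokens_per_s * weight_gb


def decode_ceiling_tokens_per_s(peak_gb_s: float, weight_gb: float) -> float:
    """Upper bound on single-sequence decode if weight reads were the only cost."""
    return peak_gb_s / weight_gb


effective = effective_bandwidth_gb_s(DECODE_TOKENS_PER_S, WEIGHT_GB)
utilization = effective / PEAK_BANDWIDTH_GB_S
ceiling = decode_ceiling_tokens_per_s(PEAK_BANDWIDTH_GB_S, WEIGHT_GB)

print(f"effective bandwidth: {effective:.0f} GB/s")
print(f"share of nominal peak: {utilization:.0%}")
print(f"weight-read-only decode ceiling: {ceiling:.0f} t/s")
```

Under these assumptions the measured rate sits at roughly 60% of the nominal peak, which is the kind of gap one would expect between a datasheet bandwidth figure and what a real kernel sustains.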

2. What improved: batching amortizes the weight reads

If one decode step reads all weights to produce one token, the obvious fix is to make the same read produce N tokens — one per concurrent sequence. That is continuous batching, and it is nearly free throughput… for the attention model:

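| Parallel sequences | PLaMo 2 (hybrid) | scaling | PLaMo 3 (attention) | scaling |
| --- | --- | --- | --- | --- |
| 1 | 128 t/s | 1.0× | 79 t/s | 1.0× |
| 2 | 223 t/s | 1.74× | 148 t/s | 1.88× |
| 4 | 263 t/s | 2.05× | 178 t/s | 2.27× |
| 8 | crash | | 182 t/s | 2.31× |
| 16 | crash | | 391 t/s | 4.97× |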
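The scaling column can be recomputed from the throughput column, and the same numbers also give parallel efficiency and the per-sequence rate each user would see. A minimal sketch, using the rounded t/s values from the table; because of that rounding the recomputed ratios can differ from the table's in the second decimal (presumably the table was computed from unrounded rates, which is an assumption here). The names `MEASURED` and `scaling_rows` are illustrative.

```python
# Aggregate decode throughput (t/s) per number of parallel sequences.
# None marks a run that crashed.
MEASURED = {
    "PLaMo 2 (hybrid)": {1: 128, 2: 223, 4: 263, 8: None, 16: None},
    "PLaMo 3 (attention)": {1: 79, 2: 148, 4: 178, 8: 182, 16: 391},
}


def scaling_rows(by_batch):
    """Return (batch, speedup, efficiency, per_sequence_tps) for each completed run.

    speedup     = aggregate t/s relative to the single-sequence run
    efficiency  = speedup / batch (1.0 would be perfect linear scaling)
    per_seq_tps = aggregate t/s divided by the number of sequences
    """
    base = by_batch[1]
    rows = []
    for batch, aggregate in sorted(by_batch.items()):
        if aggregate is None:  # skip crashed configurations
            continue
        speedup = aggregate / base
        rows.append((batch, speedup, speedup / batch, aggregate / batch))
    return rows


for model, by_batch in MEASURED.items():
    print(model)
    for batch, speedup, efficiency, per_seq in scaling_rows(by_batch):
        print(f"  B={batch:<2}  {speedup:4.2f}x  "
              f"efficiency {efficiency:4.0%}  {per_seq:6.1f} t/s per sequence")
```

The per-sequence column is the one that matters for latency: aggregate throughput rises with batch width while each individual stream slows down.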

Three observations:

(a) The attention model gets ~5× aggregate decode throughput at B=16, with prefill flat at ~1,415 t/s throughout — it was already compute-bound, so batching adds nothing there. (The superlinear jump from B=8→16 likely reflects a Metal kernel-path switch at that batch width; unverified.)

(b) The hybrid bends earlier. Per-sequence Mamba state updates are compute that does not amortize across the batch the way shared weight-reads do — at B=4 the hybrid extracts 2.05× where the attention model gets 2.27×.

(c) The hybrid then crashes. At ≥6 parallel sequences, llama.cpp aborts with GGML_ASSERT(… data_size + view_offs <= ggml_nbytes(view_src)) inside build_plamo2_mamba_layer — an out-of-bounds view into the recurrent-state buffer. Deterministic boundary: works at -npl 5, crashes at -npl 6; PLaMo 3 runs B=16 cleanly on the same build. On this stack, multi-user serving of the hybrid simply stops at 5 concurrent sequences. (Upstream report in preparation.)

llama-batched-bench -m plamo-2-1b-Q8.gguf -c 16384 -ngl 99 -npp 512 -ntg 128 -npl 6   # reproduces the crash
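Locating such a boundary can be automated with a bisection over -npl. The sketch below is an assumption-laden helper, not part of llama.cpp: it uses only the flags from the command above, treats any nonzero exit status as a crash, and assumes the failure is monotone (every value below the boundary works, every value above it aborts). The function names are hypothetical.

```python
import subprocess


def bench_command(model_path, npl, ctx=16384, ngl=99, npp=512, ntg=128,
                  binary="llama-batched-bench"):
    """Build the benchmark command line, mirroring the flags used above."""
    return [binary, "-m", model_path, "-c", str(ctx), "-ngl", str(ngl),
            "-npp", str(npp), "-ntg", str(ntg), "-npl", str(npl)]


def runs_cleanly(model_path, npl):
    """True if the benchmark exits with status 0; an abort gives a nonzero status."""
    result = subprocess.run(bench_command(model_path, npl),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_working(ok, lo, hi):
    """Largest n in [lo, hi] for which ok(n) holds.

    Assumes ok is monotone: true up to some boundary, false beyond it.
    """
    if not ok(lo):
        raise ValueError(f"even n={lo} fails")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Usage would look like `largest_working(lambda n: runs_cleanly("plamo-2-1b-Q8.gguf", n), 1, 16)`, which takes a handful of benchmark runs instead of sixteen. If the monotonicity assumption is in doubt, a linear sweep is the safer check.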

3. Why "more replicas" is the wrong lever on one box

Running N server processes on the same machine duplicates the weights N times in memory and makes N processes compete for the same bandwidth, the exact resource decode is starved of. One server with --parallel N shares a single weight copy across all sequences; that sharing is where the ~5× measured for PLaMo 3 above comes from. Replication helps when you add machines (more aggregate bandwidth); on shared hardware, batching wins by construction.
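The argument can be put as two idealized ceilings. This is a toy model under strong assumptions: weight reads are the only cost, there is no cache sharing between processes, and per-sequence KV/state traffic and compute limits are ignored, so the batched figure is an upper bound that real runs fall well short of (the measured gain at B=16 is ~5×, not 16×). The 1.34 GB and 273 GB/s inputs are the PLaMo 2 figures from section 1; the function names are illustrative.

```python
def replicas_ceiling_tps(n_processes, peak_gb_s, weight_gb):
    """Aggregate decode ceiling for N independent processes on one machine.

    Every token in every process costs one full weight read, and all reads
    share one memory bus, so the bound does not grow with n_processes.
    """
    return peak_gb_s / weight_gb


def batched_ceiling_tps(batch, peak_gb_s, weight_gb):
    """Aggregate decode ceiling for one process decoding `batch` sequences.

    One weight read per step yields `batch` tokens (idealized upper bound).
    """
    return batch * peak_gb_s / weight_gb


def resident_weight_gb(copies, weight_gb):
    """Memory held by weights when `copies` separate copies are loaded."""
    return copies * weight_gb


for n in (1, 4, 16):
    print(f"N={n:<2}  replicas: {replicas_ceiling_tps(n, 273.0, 1.34):6.0f} t/s, "
          f"{resident_weight_gb(n, 1.34):5.2f} GB of weights | "
          f"batched: {batched_ceiling_tps(n, 273.0, 1.34):6.0f} t/s, "
          f"{resident_weight_gb(1, 1.34):5.2f} GB of weights")
```

The point of the model is the shape, not the numbers: replication multiplies memory use without raising the ceiling, while batching raises the ceiling with a single weight copy.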

4. Future improvements in this space

Ordered roughly by how directly they attack the measured bottleneck (bandwidth-bound decode):

| Lever | Idea | Status for PLaMo |
| --- | --- | --- |
| Continuous batching | Amortize weight reads across users | Works for PLaMo 3 (~5×); hybrid capped at 5 seqs by the crash above, so fixing it is the cheapest win |
| Lower-bit quantization | Fewer bytes per weight read = proportionally faster decode | Q4 ≈ 2× over Q8 in principle; for PLaMo 2 requires the ssm_out exemption we documented (study 1; llama.cpp#24501) |
| Speculative decoding | Verify k drafted tokens in one weight-read | Measured in study 1 (on llama.cpp): lossless but unprofitable for the hybrid (SSM checkpoint overhead). The same rollback now exists in llama.cpp, vLLM and SGLang (conditionally), but the open win is self-drafting on attention-only PLaMo 3, which avoids the rollback tax entirely |
| MTP / EAGLE heads | Self-drafting without a separate model | No public head exists for any PLaMo; PLaMo 3's vocabulary suggests PFN has internal scaffolding (study 2) |
| Chunked prefill | Interleave prefill chunks with decode steps to protect latency under load | Supported by llama.cpp/vLLM; PFN reported ~3–10% gains in their vLLM meetup talk |
| P/D disaggregation | Separate prefill and decode onto different hardware pools | Server-scale only, and notably unavailable for SSM models in vLLM (state can't be transferred like KV cache): a further place the hybrid architecture taxes deployment |
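The "Q4 ≈ 2× over Q8 in principle" entry can be made slightly more precise with a bytes-per-weight estimate. A rough sketch, assuming decode time scales with weight bytes read and assuming the llama.cpp Q8_0 and Q4_0 block layouts (32 weights plus one fp16 scale per block, i.e. 8.5 and 4.5 bits per weight); the notes only say "Q8 quants", so the exact quant types are an assumption. `exempt_fraction` is a hypothetical knob for the share of weights kept at the original precision (for example an exempted ssm_out); no measured value for it is implied.

```python
Q8_0_BITS_PER_WEIGHT = (32 * 8 + 16) / 32   # 32 int8 values + fp16 scale = 8.5
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32   # 32 4-bit values + fp16 scale = 4.5


def decode_speedup_bound(bits_from, bits_to, exempt_fraction=0.0):
    """Upper bound on decode speedup from requantizing, if weight reads dominate.

    exempt_fraction: share of weights left at `bits_from` (hypothetical input).
    """
    if not 0.0 <= exempt_fraction <= 1.0:
        raise ValueError("exempt_fraction must be between 0 and 1")
    mixed_bits = (1.0 - exempt_fraction) * bits_to + exempt_fraction * bits_from
    return bits_from / mixed_bits


full = decode_speedup_bound(Q8_0_BITS_PER_WEIGHT, Q4_0_BITS_PER_WEIGHT)
partial = decode_speedup_bound(Q8_0_BITS_PER_WEIGHT, Q4_0_BITS_PER_WEIGHT,
                               exempt_fraction=0.05)  # illustrative value only
print(f"all weights requantized: {full:.2f}x")
print(f"5% of weights exempted:  {partial:.2f}x")
```

Because of the per-block scales the bound lands a little under 2×, and any exempted tensors pull it down further.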

The through-line of all three studies: Samba-style hybrids buy training-time and long-context efficiency, but on today's inference stacks they pay three measured deployment taxes — speculative-decoding rollback overhead (study 1), quantization fragility of ssm_out (study 1), and a batching ceiling (this study). PLaMo 3's return to full attention avoids all three. Quantifying that trade-off is exactly the work of inference optimization.