PLaMo Inference & Fine-Tuning Studies
Three hands-on studies of Preferred Networks' PLaMo model family — speculative decoding, LoRA fine-tuning for function calling, and prefill/decode batch-scaling — all measured on a single laptop.
Study 1 · Inference
Speculative Decoding for PLaMo 2 (Mamba-Hybrid) in llama.cpp
A measured negative result. Two speculation schemes are benchmarked on Apple Silicon with
verified-lossless acceptance: draft-model speculation, with pfnet/plamo-2-1b drafting for
pfnet/plamo-2-8b, and n-gram speculation. Neither pays off for this Mamba-hybrid architecture
under the tested conditions. The investigation also surfaced a novel quantization bug in the
conversion path, documented with a reproducible case.
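To make "verified-lossless acceptance" concrete, here is a minimal sketch of greedy draft-and-verify decoding. It is not the llama.cpp implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 8B and 1B models, and the target is called once per position for readability, where a real engine would score all drafted tokens in one batched pass. Because every emitted token comes from the target, the output matches plain greedy decoding exactly.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function: context -> token id.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int, int]:
    """Greedy draft-and-verify. Returns (new_tokens, accepted, proposed)."""
    seq = list(prompt)
    accepted = proposed = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        ctx = list(seq)
        guesses = []
        for _ in range(k):
            guess = draft(ctx)
            guesses.append(guess)
            ctx.append(guess)
        proposed += k
        # 2. The target checks each guess in order.
        for guess in guesses:
            token = target(seq)
            seq.append(token)      # always emit the target's own token
            if token != guess:
                break              # first mismatch: discard the remaining guesses
            accepted += 1
        else:
            seq.append(target(seq))  # all k accepted: one bonus target token
    return seq[len(prompt):][:n_new], accepted, proposed


# Hypothetical toy models, for illustration only.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + len(ctx)) % 11


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    # Disagree with the target whenever the context length is a multiple of 4.
    return guess if len(ctx) % 4 else (guess + 1) % 11


tokens, accepted, proposed = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(f"acceptance rate: {accepted / proposed:.2f}")
```

The acceptance rate `accepted / proposed` is the quantity that decides whether speculation pays off: each rejected guess is draft compute that produced no output token.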
Study 2 · Fine-Tuning
Teaching PLaMo 3 to Call Functions — A LoRA on PFN's Hidden Rails
The first open-weights PLaMo with function calling, built by activating undocumented control tokens
already present in pfnet/plamo-3-nict-2b-base and training a LoRA on glaive-style data.
A data-centric case study in two evaluation traps. The first was a contaminated eval that inflated
accuracy to 97.5%. The second was a label flaw in the training data that silently degraded argument
accuracy from 90% to 55% as training progressed; once fixed, the model reached 100% argument-exact
accuracy on a frozen 52-case eval. Every transition is a lesson about the data, not the model.
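The two traps above correspond to two small checks, sketched here under stated assumptions. The call schema (a JSON object with `name` and `arguments` keys) and the helper names `argument_exact` and `contaminated` are hypothetical; the study's actual scoring code and data format may differ.

```python
import json
from typing import Iterable, List, Optional


def parse_call(text: str) -> Optional[dict]:
    """Parse a model output as a JSON function call, or None if malformed."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call


def argument_exact(pred_text: str, gold: dict) -> bool:
    """True only if the function name and every argument match the gold call.

    Arguments are compared as parsed dicts, so key order and whitespace
    in the raw text do not matter.
    """
    call = parse_call(pred_text)
    if call is None:
        return False
    return (call["name"] == gold["name"]
            and call.get("arguments", {}) == gold.get("arguments", {}))


def contaminated(eval_prompts: Iterable[str], train_prompts: Iterable[str]) -> List[str]:
    """Return eval prompts that also appear in training, ignoring case and spacing."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    seen = {norm(p) for p in train_prompts}
    return [p for p in eval_prompts if norm(p) in seen]


gold = {"name": "get_weather", "arguments": {"city": "Tokyo", "unit": "c"}}
pred = '{"name": "get_weather", "arguments": {"unit": "c", "city": "Tokyo"}}'
print(argument_exact(pred, gold))
```

Running the contamination check before trusting any accuracy figure, and scoring arguments separately from function names, are the kinds of guards that would expose both problems early. Exact-match normalization only catches verbatim leaks; paraphrased overlap needs a fuzzier comparison.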
Study 3 · Serving
Prefill vs Decode: Where the Time Goes, and What Batching Buys
A measured tour of the two phases of inference on a laptop. Prefill is compute-bound, while decode is bandwidth-bound (19× slower per token). Continuous batching recovers ~5× aggregate decode throughput for the attention-only PLaMo 3, whereas the Mamba-hybrid PLaMo 2 scales worse and crashes outright beyond 5 concurrent sequences. This completes a three-study picture of what hybrid architectures cost in deployment today.
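The compute-bound versus bandwidth-bound distinction can be illustrated with a back-of-the-envelope roofline model. This is an idealized sketch, not a reproduction of the study's measurements: it assumes each decode step reads all weights once regardless of batch size, costs roughly 2 FLOPs per parameter per token, and ignores KV-cache traffic and scheduling overhead. The hardware numbers are placeholders chosen for round arithmetic.

```python
def decode_step_seconds(params: float, bytes_per_param: float, batch: int,
                        bandwidth: float, peak_flops: float) -> float:
    """Idealized time for one decode step that emits one token per sequence."""
    memory_time = params * bytes_per_param / bandwidth  # weights read once per step
    compute_time = 2.0 * params * batch / peak_flops    # ~2 FLOPs per parameter per token
    return max(memory_time, compute_time)               # the slower resource sets the pace


def aggregate_tokens_per_second(params: float, bytes_per_param: float, batch: int,
                                bandwidth: float, peak_flops: float) -> float:
    """Total tokens per second across all sequences in the batch."""
    step = decode_step_seconds(params, bytes_per_param, batch, bandwidth, peak_flops)
    return batch / step


# Placeholder values (assumptions, not the laptop used in the study).
PARAMS = 2e9          # 2B-parameter model
BYTES_PER_PARAM = 2   # 16-bit weights
BANDWIDTH = 100e9     # 100 GB/s memory bandwidth
PEAK_FLOPS = 4e12     # 4 TFLOP/s

for batch in (1, 2, 4, 8, 64, 128):
    tps = aggregate_tokens_per_second(PARAMS, BYTES_PER_PARAM, batch, BANDWIDTH, PEAK_FLOPS)
    print(f"batch={batch:>3}  aggregate tokens/s ~ {tps:.0f}")
```

In this model, aggregate throughput grows linearly with batch size while the step is bandwidth-bound, then flattens once compute becomes the bottleneck. Real systems fall short of that ideal, which is consistent with a measured gain well below the batch size.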