PLaMo Inference & Fine-Tuning Studies

Three hands-on studies of Preferred Networks' PLaMo model family — speculative decoding, LoRA fine-tuning for function calling, and prefill/decode batch-scaling — all measured on a single laptop.

Gurunath · June 2026

Study 1 · Inference

Speculative Decoding for PLaMo 2 (Mamba-Hybrid) in llama.cpp

A measured negative result. Two approaches were benchmarked on Apple Silicon with verified-lossless acceptance: draft-model speculation, with pfnet/plamo-2-1b drafting for pfnet/plamo-2-8b, and n-gram speculation. Neither pays off for this Mamba-hybrid architecture under the tested conditions. Along the way, the investigation surfaced a novel quantization bug in the conversion path, documented with a reproducible case.
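To make "lossless acceptance" concrete, here is a minimal sketch of one verification round, assuming greedy decoding (the summary above does not state the sampling scheme). It is an illustration of the idea, not llama.cpp's implementation, and the function name `verify_greedy` is hypothetical.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one speculation round.

    draft:  k tokens proposed by the draft model (or an n-gram lookup).
    target: k + 1 tokens the target model picks greedily at the same
            positions, computed in one batched forward pass. target[i] is
            conditioned on the prompt plus draft[:i].
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold exactly one more token than draft")

    emitted: List[int] = []
    for proposed, chosen in zip(draft, target):
        if proposed != chosen:
            # First disagreement: keep the target's own token and stop.
            emitted.append(chosen)
            return emitted
        emitted.append(proposed)

    # Every draft token matched, so the extra target token comes for free.
    emitted.append(target[-1])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([1, 2, 3], [1, 2, 9, 4]))
    print(verify_greedy([1, 2], [1, 2, 3]))
```

Under this scheme the emitted sequence is identical to what the target model would produce alone, which is what makes the comparison a pure speed question. Each round yields between 1 and k + 1 tokens, so speculation only helps when drafting plus batched verification costs less than decoding the accepted tokens one at a time.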

Study 2 · Fine-Tuning

Teaching PLaMo 3 to Call Functions — A LoRA on PFN's Hidden Rails

The first open-weights PLaMo with function calling, built by activating undocumented control tokens already present in pfnet/plamo-3-nict-2b-base and training a LoRA on glaive-style data. A data-centric case study in two evaluation traps: a contaminated eval that inflated accuracy to 97.5%, and a label flaw in the training data that silently degraded argument accuracy from 90% to 55% as training progressed. With the labels fixed, the model reaches 100% argument-exact accuracy on a frozen 52-case eval. Every transition is a lesson about the data, not the model.
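The two checks named above can be sketched in a few lines. This is a hedged illustration, not the study's evaluation harness: the call schema (`{"name": ..., "arguments": {...}}`), the whitespace and case normalization, and all function names are assumptions made for the example.

```python
import json
from typing import Iterable, List, Sequence


def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivially different copies match."""
    return " ".join(prompt.lower().split())


def contaminated(train_prompts: Iterable[str], eval_prompts: Iterable[str]) -> List[str]:
    """Return eval prompts that also appear (after normalization) in training."""
    seen = {normalize(p) for p in train_prompts}
    return [p for p in eval_prompts if normalize(p) in seen]


def argument_exact(pred: str, gold: str) -> bool:
    """True when the predicted call has the gold name and identical arguments.

    Both inputs are JSON strings in the assumed schema. Key order is
    ignored; value types are not coerced, so 1 and "1" differ.
    """
    try:
        p = json.loads(pred)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a miss
    g = json.loads(gold)
    return (
        isinstance(p, dict)
        and p.get("name") == g["name"]
        and p.get("arguments") == g["arguments"]
    )


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of cases that are argument-exact."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal length")
    return sum(argument_exact(p, g) for p, g in zip(preds, golds)) / len(golds)
```

Running the contamination check before freezing an eval set is the cheap step here. An exact-match metric is only as trustworthy as the gold labels it compares against, which is why a label flaw can drag the score down without any change in the model.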

Study 3 · Serving

Prefill vs Decode: Where the Time Goes, and What Batching Buys

A measured tour of the two phases of inference on a laptop: prefill is compute-bound, while decode is bandwidth-bound and 19× slower per token than prefill. Continuous batching recovers ~5× aggregate decode throughput for the attention-only PLaMo 3. The Mamba-hybrid PLaMo 2 scales worse and crashes outright beyond 5 concurrent sequences, completing a three-study picture of what hybrid architectures cost in deployment today.
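Why batching helps a bandwidth-bound decode can be seen with a simple roofline-style estimate. The sketch below is an assumption-laden toy model, not the study's methodology: it ignores KV-cache and recurrent-state traffic, and the function name and every number in the usage note are hypothetical rather than measured.

```python
def decode_throughput(
    weight_bytes: float,
    bandwidth: float,
    flops_per_token: float,
    peak_flops: float,
    batch: int,
) -> float:
    """Estimate aggregate decode throughput in tokens per second.

    One decode step reads all weights once (shared by the whole batch)
    and performs flops_per_token of compute for each sequence. The step
    takes as long as the slower of the two.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    memory_time = weight_bytes / bandwidth            # seconds, independent of batch
    compute_time = batch * flops_per_token / peak_flops  # seconds, grows with batch
    step_time = max(memory_time, compute_time)
    return batch / step_time


if __name__ == "__main__":
    # Hypothetical hardware and model sizes, chosen only for round numbers.
    for b in (1, 5, 20):
        print(b, decode_throughput(2e9, 100e9, 4e9, 2e12, b))
```

With these made-up inputs the estimate is about 50 tokens/s at batch 1, 250 at batch 5, and 500 at batch 20: throughput grows linearly while the weight read dominates, then flattens once compute becomes the limit. The model says nothing about architecture-specific overheads, so it cannot by itself explain the gap between the two model families.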