Teaching PLaMo 3 to Call Functions: a LoRA on PFN’s Hidden Rails

The first open-weights PLaMo with function calling — built by activating undocumented control tokens already present in pfnet/plamo-3-nict-2b-base, with two evaluation traps found and fixed along the way. A data-centric ML case study: 0% → 90% → 55% → 100% argument-exact accuracy, where every transition is a lesson about the data, not the model.

MacBook Pro M4 Pro · 48 GB unified memory · Apple MPS · LoRA r=16 · MLflow-tracked · June 2026 · Gurunath Lunkupali Venugopal · model on Hugging Face · code on GitHub · companion report: speculative decoding for PLaMo 2

1. Why this experiment

Preferred Networks ships tool calling only in its API-only PLaMo Prime line; every open PLaMo release is a base model with no chat or tool capability. So as of this writing there is no open-weights PLaMo you can hand a tool schema to.

The key discovery that makes this experiment cheap: pfnet/plamo-3-nict-2b-base (2.6B parameters, attention-only, PLaMo Community License) ships an official chat_template.jinja, and its vocabulary contains an undocumented control-token set clearly designed for chat and structured output. PFN laid the rails for these capabilities in the open weights — tokenizer, template, special tokens — but never exposed the capability itself. The question: how far does a small LoRA get by simply activating those rails, in a single local session on a MacBook?

Answer: surprisingly far — and the two failures encountered on the way turned out to be more instructive than the final number.

2. The hidden rails: PLaMo 3’s control tokens

The official chat template formats each turn as role<|plamo:msg|>content<|plamo:tag|>. Beyond those two, the vocabulary carries a family of control tokens that none of PFN’s public documentation for the open models mentions (none of these exist in PLaMo 2):

token	apparent purpose (inferred from names — undocumented)
`<\|plamo:tag\|>`	turn terminator in the official chat template
`<\|plamo:msg\|>`	separates role from content in each turn
`key`, `val`	structured key/value output
`choice`	selecting among enumerated options
`constrain`	constrained / schema-guided decoding
`fim_prefix`, `fim_suffix`, `fim_middle`	fill-in-the-middle code infilling

The thesis of this study: these tokens are the infrastructure of PLaMo Prime’s structured-output features, present but dormant in the open weights. A qualitative probe before training supports this: the base model already follows the chat template (it answers helpfully in role format) but ignores tool-call instructions — asked for the weather with a get_weather schema in the system prompt, it explains how to use a weather API instead of emitting a call. The rails exist; nothing runs on them. The LoRA’s job is just to put a train on the track.

3. Method

Base model: pfnet/plamo-3-nict-2b-base (2.6B, attention-only, PLaMo Community License). Pure-PyTorch modeling code, so it runs on Apple MPS directly.
Data: single-turn examples derived from glaiveai/glaive-function-calling-v2 (Apache 2.0), query-level deduplicated. The system message lists tool schemas; the assistant answers with a JSON call {"name": ..., "arguments": ...} — or a plain answer for no-tool cases.
Format: the official PLaMo 3 chat template (role<|plamo:msg|>content<|plamo:tag|>) — no custom tokens, no template surgery.
Training: LoRA r=16, α=32, all-linear, assistant-only loss masking, 1 epoch, bf16, Apple M4 Pro (MPS), MLflow-tracked (experiment plamo3-tool-calling).
Eval: 52 cases (40 call + 12 no-call), greedy decoding. Metrics: parse rate, function-name accuracy, argument exact-match (over the 40 call cases), false-call rate (over the 12 no-call cases). The 40 call questions are frozen across all conditions, with zero train overlap (verified). Per-question records for every condition are saved in finetune/results/*.json — they power the explorer in section 8.

4. Headline results

All rows below are the leak-free eval (see trap #1). All share the same frozen 40 unseen call questions:

condition	parse	func-name	args-exact	false-call
base zero-shot	35.0%	32.5%	0.0%	0.0%¹
base 2-shot	67.5%	30.0%	0.0%	41.7%
LoRA v1 400 ex (5 min)	92.5%	92.5%	90.0%	41.7%
LoRA v1 ckpt 3.2k ex	87.5%	87.5%	87.5%	25.0%
LoRA v1 ckpt 6.4k ex (run halted)	55.0%	55.0%	55.0%	16.7%
LoRA v2 1,718 ex fixed data 39 min	100%	100%	100%	16.7%²

¹ trivially low — the zero-shot base model rarely emits a call at all. ² measured on 12 strict no-call cases (see section 7); the v1 rows’ false-call column inherited the label flaw and is not directly comparable.

Three things to read off this table. Prompting cannot do this: 2-shot prompting raises JSON-shaped output (67.5% parse) but argument-exact correctness stays at exactly 0%, while the model starts firing tools on 41.7% of questions that need none. A 5-minute LoRA on 400 examples takes args-exact from 0% to 90% on unseen queries. And the v1 rows get worse with more data — the middle of the story, told in section 6.

Metric:

All six conditions on the frozen eval. Hover for exact values; use the dropdown to isolate one metric. For parse / func-name / args-exact, higher is better (40 call cases); for false-call rate, lower is better (12 no-call cases). The v2 false-call bar uses the strict no-call set.

5. Evaluation trap #1: the 97.5% that was too good

The first eval of the 400-example sprint adapter read 97.5% across the board. That number was wrong, and it was wrong in the most classic way: 41 of 52 eval queries appeared verbatim in the training set. glaive-function-calling-v2 contains massive row duplication — a random train/eval split does not separate anything. After query-level dedup, the entire dataset has only ~71 unique unseen call queries beyond our training set.

If you fine-tune on glaive-function-calling-v2 and evaluate on a random held-out split, you are very likely evaluating on your training data. Deduplicate at the query level first. Our contaminated run is archived (not deleted) in finetune/results/leaked-eval/ for honesty; every number in this report comes from the rebuilt eval with zero query overlap, verified.

The honest sprint number after the rebuild: 90% args-exact, not 97.5%. Still a striking result for 400 examples and 5 minutes — and now a real one.

6. Evaluation trap #2: more data made the model worse

With the leak fixed, the obvious next step was scale: the full v1 dataset, 11,182 deduplicated single-turn examples (~45% no-call mix), checkpointed every 200 steps with the frozen eval run at each checkpoint. The checkpoint evals showed something a loss curve never would:

Call recall (args-exact) degrades as v1 training data scales — 90% → 87.5% → 55% — while the precision of emitted calls stays at ~100% and the false-call rate falls. Stars: v2 on fixed data. Static version: finetune/figures/v1_data_flaw_scaling.png. Note the v1 sprint point (400 ex) was a separate run with a 12.5% no-call mix; the 3.2k/6.4k points are checkpoints of the full v1 run.

The decomposition is the diagnosis. At the 6.4k-example checkpoint the model emitted a parseable call on only 22 of 40 call cases — but all 22 were exactly correct (name and arguments). The model was not getting worse at calling functions; it was increasingly choosing not to call them. The per-question diff showed casual zero-argument requests (“flip a coin”) flipping from a correct call to polite prose (“Sure, let me…”). The run was halted deliberately at step ~870.

Root cause: 96% of the “no-call” class was mislabeled

glaive conversations are multi-turn. The assistant often first asks a clarifying question, or says filler like “Sure, let me calculate that for you” — and only calls the function in a later turn. Single-turn extraction labeled all of those first replies “no-call”, which means 96% of v1’s no-call class consisted of tool-worthy requests labeled as deserving prose. The model learned exactly what it was taught.

Striking sub-finding: under a strict definition of no-call (no function call anywhere in the conversation), all ~113k rows of glaive-function-calling-v2 contain only ~330 genuine no-call conversations. The dataset has almost no examples of an assistant correctly declining to use tools.

7. v2: fix the labels, not the model

The v2 fix changed nothing about the architecture, the LoRA config, or the hyperparameters. It changed the data: the no-call class was rebuilt from strictly call-free conversations only (318 train / 12 eval), combined with 1,400 call examples — 1,718 examples total, trained in 39 minutes. The deferred-clarification pattern was excluded entirely (a future v3 could model it properly as multi-turn behavior, which is arguably the correct assistant behavior and what PFN’s Prime API does).

Training loss vs. examples seen, from MLflow (experiment plamo3-tool-calling): the halted v1 run (11,182-example dataset, killed at step ~870 once checkpoint evals exposed the label flaw) and the finished v2 run (1,718 fixed examples). Both runs optimize happily — nothing in either loss curve hints that one of them is learning the wrong behavior. Zoom and hover are live.

Result on the same frozen 40 unseen call questions: 100% parse, 100% function-name, 100% argument exact-match. On the 12 strict no-call cases, the false-call rate is 16.7% — the two misses are tool-adjacent requests (e.g. “calculate my monthly loan payment”) where no matching tool was listed, and the model hallucinated a plausible function instead of declining. You can inspect both misses in the explorer below.

The case-study arc in one line: the model was never the problem. 90% → 55% was caused by mislabeled data; 55% → 100% was achieved by fixing labels and shrinking the dataset by 6.5×.

8. Per-question explorer

Every prediction from every condition, on every eval question — the raw records behind every percentage in this report. The 40 call questions are shared by all six conditions; the no-call questions come in two sets (the v1-era set, which inherited the label flaw, and the v2 strict set).

Question:

Badges are computed with the same scoring code as the headline metrics (finetune/eval_tools.py): a call-case prediction must parse as JSON with a name field, match the gold function name, and match the gold arguments exactly. A no-call prediction is a false call if it emits any parseable function call.

9. Honest caveats

n=40. By the rule of three, 40/40 supports only a lower bound of roughly ≥91% at 95% confidence. “100%” is the observed value on a small frozen set, not a claim about the population.
In-distribution eval. Train and eval both come from glaive’s synthetic style, and 39 of the 40 eval functions were seen in training by name (new queries and arguments, same schemas). This measures in-distribution generalization, not novel-tool generalization — that would need BFCL-style held-out schemas, which remains untested.
What is verified: zero query overlap between train and eval. The right summary is distribution-narrowness, not example-memorization.
English-only training data; single-turn only; single seed; exact-match argument scoring with no partial credit.
The two v2 false calls are a real failure mode: when a request sounds tool-shaped but no matching tool is listed, the model invents a plausible function rather than declining.

10. Artifacts & reproducibility

Model (LoRA adapter): Gurunath/plamo-3-nict-2b-tool-calling-lora — v2 adapter at the repo root, the 400-example sprint adapter in checkpoint-400/, per-question eval records included.
Code & logs: rajagurunath/pfn-plamo-inference-study
Companion study on the same hardware: Speculative decoding for PLaMo 2 in llama.cpp (a measured negative result, plus a quantization bug).

python finetune/prep_data.py        # build data/{train,eval}.jsonl from glaive-v2 (dedup + strict no-call)
python finetune/train_lora.py       # LoRA r16 all-linear, assistant-only masking; ~39 min on M4 Pro MPS
python finetune/eval_tools.py --condition base           # before, zero-shot
python finetune/eval_tools.py --condition base-fewshot   # before, 2-shot
python finetune/eval_tools.py --condition lora           # after

# loss curves
mlflow ui --backend-store-uri sqlite:///mlflow.db        # experiment: plamo3-tool-calling

Eval: 52 cases (40 call + 12 no-call), greedy decoding, frozen call set across all conditions, zero train/eval query overlap (verified). Per-question records: finetune/results/*.json; contaminated first eval archived in finetune/results/leaked-eval/. Training tracked in MLflow (sqlite, experiment plamo3-tool-calling): two killed v1 runs (n=11,182) and one finished v2 run (n=1,716 logged / 1,718 prepared). Data: glaiveai/glaive-function-calling-v2 (Apache 2.0). Base model: pfnet/plamo-3-nict-2b-base (PLaMo Community License). All numbers in the charts are embedded in this file; it has no network dependency except the Plotly.js CDN. Author: Gurunath Lunkupali Venugopal.