Teaching PLaMo 3 to Call Functions: a LoRA on PFN’s Hidden Rails

The first open-weights PLaMo with function calling — built by activating undocumented control tokens already present in pfnet/plamo-3-nict-2b-base, with two evaluation traps found and fixed along the way. A data-centric ML case study: 0% → 90% → 55% → 100% argument-exact accuracy, where every transition is a lesson about the data, not the model.

MacBook Pro M4 Pro · 48 GB unified memory · Apple MPS · LoRA r=16 · MLflow-tracked · June 2026 · Gurunath Lunkupali Venugopal · model on Hugging Face · code on GitHub · companion report: speculative decoding for PLaMo 2

1. Why this experiment

Preferred Networks ships tool calling only in its API-only PLaMo Prime line; every open PLaMo release is a base model with no chat or tool capability. So as of this writing there is no open-weights PLaMo you can hand a tool schema to.

The key discovery that makes this experiment cheap: pfnet/plamo-3-nict-2b-base (2.6B parameters, attention-only, PLaMo Community License) ships an official chat_template.jinja, and its vocabulary contains an undocumented control-token set clearly designed for chat and structured output. PFN laid the rails for these capabilities in the open weights — tokenizer, template, special tokens — but never exposed the capability itself. The question: how far does a small LoRA get by simply activating those rails, in a single local session on a MacBook?

Answer: surprisingly far — and the two failures encountered on the way turned out to be more instructive than the final number.

2. The hidden rails: PLaMo 3’s control tokens

The official chat template formats each turn as role<|plamo:msg|>content<|plamo:tag|>. Beyond those two, the vocabulary carries a family of control tokens that none of PFN’s public documentation for the open models mentions (none of these exist in PLaMo 2):

tokenapparent purpose (inferred from names — undocumented)
<|plamo:tag|>turn terminator in the official chat template
<|plamo:msg|>separates role from content in each turn
key, valstructured key/value output
choiceselecting among enumerated options
constrainconstrained / schema-guided decoding
fim_prefix, fim_suffix, fim_middlefill-in-the-middle code infilling

The thesis of this study: these tokens are the infrastructure of PLaMo Prime’s structured-output features, present but dormant in the open weights. A qualitative probe before training supports this: the base model already follows the chat template (it answers helpfully in role format) but ignores tool-call instructions — asked for the weather with a get_weather schema in the system prompt, it explains how to use a weather API instead of emitting a call. The rails exist; nothing runs on them. The LoRA’s job is just to put a train on the track.

3. Method

4. Headline results

All rows below are the leak-free eval (see trap #1). All share the same frozen 40 unseen call questions:

conditionparsefunc-nameargs-exactfalse-call
base zero-shot35.0%32.5%0.0%0.0%¹
base 2-shot67.5%30.0%0.0%41.7%
LoRA v1 400 ex (5 min)92.5%92.5%90.0%41.7%
LoRA v1 ckpt 3.2k ex87.5%87.5%87.5%25.0%
LoRA v1 ckpt 6.4k ex (run halted)55.0%55.0%55.0%16.7%
LoRA v2 1,718 ex fixed data 39 min100%100%100%16.7%²

¹ trivially low — the zero-shot base model rarely emits a call at all. ² measured on 12 strict no-call cases (see section 7); the v1 rows’ false-call column inherited the label flaw and is not directly comparable.

Three things to read off this table. Prompting cannot do this: 2-shot prompting raises JSON-shaped output (67.5% parse) but argument-exact correctness stays at exactly 0%, while the model starts firing tools on 41.7% of questions that need none. A 5-minute LoRA on 400 examples takes args-exact from 0% to 90% on unseen queries. And the v1 rows get worse with more data — the middle of the story, told in section 6.

Metric:

All six conditions on the frozen eval. Hover for exact values; use the dropdown to isolate one metric. For parse / func-name / args-exact, higher is better (40 call cases); for false-call rate, lower is better (12 no-call cases). The v2 false-call bar uses the strict no-call set.

5. Evaluation trap #1: the 97.5% that was too good

The first eval of the 400-example sprint adapter read 97.5% across the board. That number was wrong, and it was wrong in the most classic way: 41 of 52 eval queries appeared verbatim in the training set. glaive-function-calling-v2 contains massive row duplication — a random train/eval split does not separate anything. After query-level dedup, the entire dataset has only ~71 unique unseen call queries beyond our training set.

If you fine-tune on glaive-function-calling-v2 and evaluate on a random held-out split, you are very likely evaluating on your training data. Deduplicate at the query level first. Our contaminated run is archived (not deleted) in finetune/results/leaked-eval/ for honesty; every number in this report comes from the rebuilt eval with zero query overlap, verified.

The honest sprint number after the rebuild: 90% args-exact, not 97.5%. Still a striking result for 400 examples and 5 minutes — and now a real one.

6. Evaluation trap #2: more data made the model worse

With the leak fixed, the obvious next step was scale: the full v1 dataset, 11,182 deduplicated single-turn examples (~45% no-call mix), checkpointed every 200 steps with the frozen eval run at each checkpoint. The checkpoint evals showed something a loss curve never would:

Call recall (args-exact) degrades as v1 training data scales — 90% → 87.5% → 55% — while the precision of emitted calls stays at ~100% and the false-call rate falls. Stars: v2 on fixed data. Static version: finetune/figures/v1_data_flaw_scaling.png. Note the v1 sprint point (400 ex) was a separate run with a 12.5% no-call mix; the 3.2k/6.4k points are checkpoints of the full v1 run.

The decomposition is the diagnosis. At the 6.4k-example checkpoint the model emitted a parseable call on only 22 of 40 call cases — but all 22 were exactly correct (name and arguments). The model was not getting worse at calling functions; it was increasingly choosing not to call them. The per-question diff showed casual zero-argument requests (“flip a coin”) flipping from a correct call to polite prose (“Sure, let me…”). The run was halted deliberately at step ~870.

Root cause: 96% of the “no-call” class was mislabeled

glaive conversations are multi-turn. The assistant often first asks a clarifying question, or says filler like “Sure, let me calculate that for you” — and only calls the function in a later turn. Single-turn extraction labeled all of those first replies “no-call”, which means 96% of v1’s no-call class consisted of tool-worthy requests labeled as deserving prose. The model learned exactly what it was taught.

Striking sub-finding: under a strict definition of no-call (no function call anywhere in the conversation), all ~113k rows of glaive-function-calling-v2 contain only ~330 genuine no-call conversations. The dataset has almost no examples of an assistant correctly declining to use tools.

7. v2: fix the labels, not the model

The v2 fix changed nothing about the architecture, the LoRA config, or the hyperparameters. It changed the data: the no-call class was rebuilt from strictly call-free conversations only (318 train / 12 eval), combined with 1,400 call examples — 1,718 examples total, trained in 39 minutes. The deferred-clarification pattern was excluded entirely (a future v3 could model it properly as multi-turn behavior, which is arguably the correct assistant behavior and what PFN’s Prime API does).

Training loss vs. examples seen, from MLflow (experiment plamo3-tool-calling): the halted v1 run (11,182-example dataset, killed at step ~870 once checkpoint evals exposed the label flaw) and the finished v2 run (1,718 fixed examples). Both runs optimize happily — nothing in either loss curve hints that one of them is learning the wrong behavior. Zoom and hover are live.

Result on the same frozen 40 unseen call questions: 100% parse, 100% function-name, 100% argument exact-match. On the 12 strict no-call cases, the false-call rate is 16.7% — the two misses are tool-adjacent requests (e.g. “calculate my monthly loan payment”) where no matching tool was listed, and the model hallucinated a plausible function instead of declining. You can inspect both misses in the explorer below.

The case-study arc in one line: the model was never the problem. 90% → 55% was caused by mislabeled data; 55% → 100% was achieved by fixing labels and shrinking the dataset by 6.5×.

8. Per-question explorer

Every prediction from every condition, on every eval question — the raw records behind every percentage in this report. The 40 call questions are shared by all six conditions; the no-call questions come in two sets (the v1-era set, which inherited the label flaw, and the v2 strict set).

Question:

Badges are computed with the same scoring code as the headline metrics (finetune/eval_tools.py): a call-case prediction must parse as JSON with a name field, match the gold function name, and match the gold arguments exactly. A no-call prediction is a false call if it emits any parseable function call.

9. Honest caveats

10. Artifacts & reproducibility

python finetune/prep_data.py        # build data/{train,eval}.jsonl from glaive-v2 (dedup + strict no-call)
python finetune/train_lora.py       # LoRA r16 all-linear, assistant-only masking; ~39 min on M4 Pro MPS
python finetune/eval_tools.py --condition base           # before, zero-shot
python finetune/eval_tools.py --condition base-fewshot   # before, 2-shot
python finetune/eval_tools.py --condition lora           # after

# loss curves
mlflow ui --backend-store-uri sqlite:///mlflow.db        # experiment: plamo3-tool-calling

Eval: 52 cases (40 call + 12 no-call), greedy decoding, frozen call set across all conditions, zero train/eval query overlap (verified). Per-question records: finetune/results/*.json; contaminated first eval archived in finetune/results/leaked-eval/. Training tracked in MLflow (sqlite, experiment plamo3-tool-calling): two killed v1 runs (n=11,182) and one finished v2 run (n=1,716 logged / 1,718 prepared). Data: glaiveai/glaive-function-calling-v2 (Apache 2.0). Base model: pfnet/plamo-3-nict-2b-base (PLaMo Community License). All numbers in the charts are embedded in this file; it has no network dependency except the Plotly.js CDN. Author: Gurunath Lunkupali Venugopal.