Teaching PLaMo 3 to Call Functions: a LoRA on PFN’s Hidden Rails
The first open-weights PLaMo with function calling — built by activating undocumented
control tokens already present in pfnet/plamo-3-nict-2b-base, with two evaluation traps found and fixed
along the way. A data-centric ML case study: 0% → 90% → 55% → 100%
argument-exact accuracy, where every transition is a lesson about the data, not the model.
1. Why this experiment
Preferred Networks ships tool calling only in its API-only PLaMo Prime line; every open PLaMo release is a base model with no chat or tool capability. So as of this writing there is no open-weights PLaMo you can hand a tool schema to.
The key discovery that makes this experiment cheap: pfnet/plamo-3-nict-2b-base (2.6B parameters,
attention-only, PLaMo Community License) ships an official chat_template.jinja, and its
vocabulary contains an undocumented control-token set clearly designed for chat and structured
output. PFN laid the rails for these capabilities in the open weights — tokenizer, template, special tokens
— but never exposed the capability itself. The question: how far does a small LoRA get by simply
activating those rails, in a single local session on a MacBook?
Answer: surprisingly far — and the two failures encountered on the way turned out to be more instructive than the final number.
2. The hidden rails: PLaMo 3’s control tokens
The official chat template formats each turn as role<|plamo:msg|>content<|plamo:tag|>.
Beyond those two, the vocabulary carries a family of control tokens that none of PFN’s public documentation for
the open models mentions (none of these exist in PLaMo 2):
| token | apparent purpose (inferred from names — undocumented) |
|---|---|
<|plamo:tag|> | turn terminator in the official chat template |
<|plamo:msg|> | separates role from content in each turn |
key, val | structured key/value output |
choice | selecting among enumerated options |
constrain | constrained / schema-guided decoding |
fim_prefix, fim_suffix, fim_middle | fill-in-the-middle code infilling |
The thesis of this study: these tokens are the infrastructure of PLaMo Prime’s
structured-output features, present but dormant in the open weights. A qualitative probe before training supports
this: the base model already follows the chat template (it answers helpfully in role format) but ignores
tool-call instructions — asked for the weather with a get_weather schema in the system prompt, it
explains how to use a weather API instead of emitting a call. The rails exist; nothing runs on them. The LoRA’s
job is just to put a train on the track.
3. Method
- Base model:
pfnet/plamo-3-nict-2b-base(2.6B, attention-only, PLaMo Community License). Pure-PyTorch modeling code, so it runs on Apple MPS directly. - Data: single-turn examples derived from
glaiveai/glaive-function-calling-v2(Apache 2.0), query-level deduplicated. The system message lists tool schemas; the assistant answers with a JSON call{"name": ..., "arguments": ...}— or a plain answer for no-tool cases. - Format: the official PLaMo 3 chat template
(
role<|plamo:msg|>content<|plamo:tag|>) — no custom tokens, no template surgery. - Training: LoRA r=16, α=32, all-linear, assistant-only loss masking, 1 epoch, bf16,
Apple M4 Pro (MPS), MLflow-tracked (experiment
plamo3-tool-calling). - Eval: 52 cases (40 call + 12 no-call), greedy decoding. Metrics: parse rate, function-name
accuracy, argument exact-match (over the 40 call cases), false-call rate (over the 12 no-call cases). The
40 call questions are frozen across all conditions, with zero train overlap (verified).
Per-question records for every condition are saved in
finetune/results/*.json— they power the explorer in section 8.
4. Headline results
All rows below are the leak-free eval (see trap #1). All share the same frozen 40 unseen call questions:
| condition | parse | func-name | args-exact | false-call |
|---|---|---|---|---|
| base zero-shot | 35.0% | 32.5% | 0.0% | 0.0%¹ |
| base 2-shot | 67.5% | 30.0% | 0.0% | 41.7% |
| LoRA v1 400 ex (5 min) | 92.5% | 92.5% | 90.0% | 41.7% |
| LoRA v1 ckpt 3.2k ex | 87.5% | 87.5% | 87.5% | 25.0% |
| LoRA v1 ckpt 6.4k ex (run halted) | 55.0% | 55.0% | 55.0% | 16.7% |
| LoRA v2 1,718 ex fixed data 39 min | 100% | 100% | 100% | 16.7%² |
¹ trivially low — the zero-shot base model rarely emits a call at all. ² measured on 12 strict no-call cases (see section 7); the v1 rows’ false-call column inherited the label flaw and is not directly comparable.
Three things to read off this table. Prompting cannot do this: 2-shot prompting raises JSON-shaped output (67.5% parse) but argument-exact correctness stays at exactly 0%, while the model starts firing tools on 41.7% of questions that need none. A 5-minute LoRA on 400 examples takes args-exact from 0% to 90% on unseen queries. And the v1 rows get worse with more data — the middle of the story, told in section 6.
All six conditions on the frozen eval. Hover for exact values; use the dropdown to isolate one metric. For parse / func-name / args-exact, higher is better (40 call cases); for false-call rate, lower is better (12 no-call cases). The v2 false-call bar uses the strict no-call set.
5. Evaluation trap #1: the 97.5% that was too good
The first eval of the 400-example sprint adapter read 97.5% across the board. That number was
wrong, and it was wrong in the most classic way: 41 of 52 eval queries appeared verbatim in the training
set. glaive-function-calling-v2 contains massive row duplication — a random train/eval
split does not separate anything. After query-level dedup, the entire dataset has only ~71 unique unseen call
queries beyond our training set.
If you fine-tune on glaive-function-calling-v2 and evaluate on a random held-out split, you are very
likely evaluating on your training data. Deduplicate at the query level first. Our contaminated run is
archived (not deleted) in finetune/results/leaked-eval/ for honesty; every number in this report comes
from the rebuilt eval with zero query overlap, verified.
The honest sprint number after the rebuild: 90% args-exact, not 97.5%. Still a striking result for 400 examples and 5 minutes — and now a real one.
6. Evaluation trap #2: more data made the model worse
With the leak fixed, the obvious next step was scale: the full v1 dataset, 11,182 deduplicated single-turn examples (~45% no-call mix), checkpointed every 200 steps with the frozen eval run at each checkpoint. The checkpoint evals showed something a loss curve never would:
Call recall (args-exact) degrades as v1 training data scales — 90% → 87.5% →
55% — while the precision of emitted calls stays at ~100% and the false-call rate falls. Stars: v2 on fixed
data. Static version: finetune/figures/v1_data_flaw_scaling.png. Note the v1 sprint point (400 ex) was
a separate run with a 12.5% no-call mix; the 3.2k/6.4k points are checkpoints of the full v1 run.
The decomposition is the diagnosis. At the 6.4k-example checkpoint the model emitted a parseable call on only 22 of 40 call cases — but all 22 were exactly correct (name and arguments). The model was not getting worse at calling functions; it was increasingly choosing not to call them. The per-question diff showed casual zero-argument requests (“flip a coin”) flipping from a correct call to polite prose (“Sure, let me…”). The run was halted deliberately at step ~870.
Root cause: 96% of the “no-call” class was mislabeled
glaive conversations are multi-turn. The assistant often first asks a clarifying question, or says filler like “Sure, let me calculate that for you” — and only calls the function in a later turn. Single-turn extraction labeled all of those first replies “no-call”, which means 96% of v1’s no-call class consisted of tool-worthy requests labeled as deserving prose. The model learned exactly what it was taught.
Striking sub-finding: under a strict definition of no-call (no function call anywhere in the conversation), all ~113k rows of glaive-function-calling-v2 contain only ~330 genuine no-call conversations. The dataset has almost no examples of an assistant correctly declining to use tools.
7. v2: fix the labels, not the model
The v2 fix changed nothing about the architecture, the LoRA config, or the hyperparameters. It changed the data: the no-call class was rebuilt from strictly call-free conversations only (318 train / 12 eval), combined with 1,400 call examples — 1,718 examples total, trained in 39 minutes. The deferred-clarification pattern was excluded entirely (a future v3 could model it properly as multi-turn behavior, which is arguably the correct assistant behavior and what PFN’s Prime API does).
Training loss vs. examples seen, from MLflow (experiment
plamo3-tool-calling): the halted v1 run (11,182-example dataset, killed at step ~870 once checkpoint
evals exposed the label flaw) and the finished v2 run (1,718 fixed examples). Both runs optimize happily —
nothing in either loss curve hints that one of them is learning the wrong behavior. Zoom and hover are live.
Result on the same frozen 40 unseen call questions: 100% parse, 100% function-name, 100% argument exact-match. On the 12 strict no-call cases, the false-call rate is 16.7% — the two misses are tool-adjacent requests (e.g. “calculate my monthly loan payment”) where no matching tool was listed, and the model hallucinated a plausible function instead of declining. You can inspect both misses in the explorer below.
The case-study arc in one line: the model was never the problem. 90% → 55% was caused by mislabeled data; 55% → 100% was achieved by fixing labels and shrinking the dataset by 6.5×.
8. Per-question explorer
Every prediction from every condition, on every eval question — the raw records behind every percentage in this report. The 40 call questions are shared by all six conditions; the no-call questions come in two sets (the v1-era set, which inherited the label flaw, and the v2 strict set).
Badges are computed with the same scoring code as the headline metrics
(finetune/eval_tools.py): a call-case prediction must parse as JSON with a name field,
match the gold function name, and match the gold arguments exactly. A no-call prediction is a false call if it
emits any parseable function call.
9. Honest caveats
- n=40. By the rule of three, 40/40 supports only a lower bound of roughly ≥91% at 95% confidence. “100%” is the observed value on a small frozen set, not a claim about the population.
- In-distribution eval. Train and eval both come from glaive’s synthetic style, and 39 of the 40 eval functions were seen in training by name (new queries and arguments, same schemas). This measures in-distribution generalization, not novel-tool generalization — that would need BFCL-style held-out schemas, which remains untested.
- What is verified: zero query overlap between train and eval. The right summary is distribution-narrowness, not example-memorization.
- English-only training data; single-turn only; single seed; exact-match argument scoring with no partial credit.
- The two v2 false calls are a real failure mode: when a request sounds tool-shaped but no matching tool is listed, the model invents a plausible function rather than declining.
10. Artifacts & reproducibility
- Model (LoRA adapter): Gurunath/plamo-3-nict-2b-tool-calling-lora
— v2 adapter at the repo root, the 400-example sprint adapter in
checkpoint-400/, per-question eval records included. - Code & logs: rajagurunath/pfn-plamo-inference-study
- Companion study on the same hardware: Speculative decoding for PLaMo 2 in llama.cpp (a measured negative result, plus a quantization bug).
python finetune/prep_data.py # build data/{train,eval}.jsonl from glaive-v2 (dedup + strict no-call)
python finetune/train_lora.py # LoRA r16 all-linear, assistant-only masking; ~39 min on M4 Pro MPS
python finetune/eval_tools.py --condition base # before, zero-shot
python finetune/eval_tools.py --condition base-fewshot # before, 2-shot
python finetune/eval_tools.py --condition lora # after
# loss curves
mlflow ui --backend-store-uri sqlite:///mlflow.db # experiment: plamo3-tool-calling
Eval: 52 cases (40 call + 12 no-call), greedy decoding, frozen call set across all conditions, zero train/eval query
overlap (verified). Per-question records: finetune/results/*.json; contaminated first eval archived in
finetune/results/leaked-eval/. Training tracked in MLflow (sqlite, experiment
plamo3-tool-calling): two killed v1 runs (n=11,182) and one finished v2 run (n=1,716 logged / 1,718
prepared). Data: glaiveai/glaive-function-calling-v2 (Apache 2.0). Base model: pfnet/plamo-3-nict-2b-base (PLaMo
Community License). All numbers in the charts are embedded in this file; it has no network dependency except the
Plotly.js CDN. Author: Gurunath Lunkupali Venugopal.