TOON Benchmarks 2026: Token Savings and Accuracy Across GPT-5, Claude, Gemini & Grok

A data-driven look at TOON vs JSON across 5,016 LLM calls: 39.9% fewer tokens at 76.4% retrieval accuracy, plus per-model and per-data-shape results.

By JSON to TOON Team

The official TOON benchmark — 5,016 LLM calls across four frontier models — shows TOON using 39.9% fewer tokens than JSON while achieving 76.4% retrieval accuracy vs JSON's 75.0%. That translates to 27.7 accuracy-points per 1,000 tokens, compared to JSON's 16.4: a 69% gain in token efficiency.

Benchmark Methodology: What Was Tested

The benchmark is reproducible and fully documented at toonformat.dev/guide/benchmarks. The test matrix covered:

  • 209 questions per format per model
  • 6 formats tested head-to-head (including JSON, TOON, TONL, CSV, YAML, TOML)
  • 4 models: claude-haiku-4-5, gemini-3-flash-preview, gpt-5-nano, grok-4-1-fast
  • 5,016 total LLM calls (209 × 6 × 4)
  • Tokenized with GPT-5 o200k_base via gpt-tokenizer for consistent cross-model comparisons

Questions spanned five types — field retrieval, structure awareness, validation, aggregation, and filtering — across five real-world data shapes. This is the most systematic publicly available dataset on TOON performance to date.

The Headline Number: TOON vs JSON Side by Side

Before diving into the data tables, it helps to see concretely why token counts diverge. Here is the same three-user array expressed in JSON and then in TOON:

// JSON: every key repeated on every row, every string value quoted
[
  { "id": 1, "name": "Alice", "role": "admin",  "active": true  },
  { "id": 2, "name": "Bob",   "role": "viewer", "active": true  },
  { "id": 3, "name": "Carol", "role": "editor", "active": false }
]

// TOON: header declares the schema once; rows are plain comma-separated values
users[3]{id,name,role,active}:
  1,Alice,admin,true
  2,Bob,viewer,true
  3,Carol,editor,false

The JSON version above consumes roughly 58 tokens with the o200k_base tokenizer; the TOON version consumes roughly 28 tokens, about a 52% reduction for this flat, uniform shape. The header line users[3]{id,name,role,active}: states the row count and the field names explicitly, which is a plausible reason retrieval accuracy holds up even as tokens fall.
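To make the header-plus-rows layout concrete, here is a minimal Python sketch of a tabular encoder. The function name to_toon_table is hypothetical and this is not the official TOON library: it assumes every row is a flat dict with the same keys in the same order, emits compact rows with no padding spaces, and deliberately skips the quoting and escaping that real values containing commas or colons would need.

```python
def to_toon_table(name, rows):
    """Encode a non-empty list of flat, same-keyed dicts as a TOON tabular array.

    Illustrative only: no quoting, escaping, or nested values are handled.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("all rows must share the same keys in the same order")

    def fmt(value):
        # JSON-style literals for booleans and null; everything else via str().
        if value is True:
            return "true"
        if value is False:
            return "false"
        if value is None:
            return "null"
        return str(value)

    # Header: name, declared row count, and the field list, written once.
    header = f"{name}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(fmt(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


users = [
    {"id": 1, "name": "Alice", "role": "admin", "active": True},
    {"id": 2, "name": "Bob", "role": "viewer", "active": True},
    {"id": 3, "name": "Carol", "role": "editor", "active": False},
]
toon = to_toon_table("users", users)
print(toon)
```

For production data, prefer a maintained TOON encoder; the sketch only shows where the savings come from: field names appear once in the header instead of once per row.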

For a deeper look at the structural differences, see What is TOON? and the full JSON vs TOON comparison.

How Much Does TOON Save? Token Reduction by Data Shape

Token savings are not uniform: they depend largely on how repetitive the data is. Flat uniform tables and time-series benefit most; mixed structures benefit least. All figures are from the official benchmark using the o200k_base tokenizer.

Data Shape                 | TOON Tokens | JSON Tokens | Reduction
Flat / uniform tables      | 67,778      | 164,452     | 58.8%
Time-series (60 days)      | 9,115       | 22,245      | 59.0%
GitHub repo data           | 8,744       | 15,144      | 42.3%
E-commerce orders (nested) | 73,126      | 109,599     | 33.3%
Mixed structures           | 227,830     | 291,711     | 21.9%

The pattern is clear: TOON's savings scale with key repetition. A flat table of 10,000 user records repeats the same field names 10,000 times in JSON; TOON declares them once. Mixed or deeply nested data, where objects do not share a schema, captures far less benefit. This is the same idea that lets CSV and columnar formats like Parquet store a schema once rather than once per record, applied to LLM prompts.
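The Reduction column can be recomputed from the two token counts. The short Python sketch below does so; the token counts are copied from the table above, while the names TOKEN_COUNTS and reduction_pct are illustrative.

```python
# (toon_tokens, json_tokens) per data shape, copied from the table above.
TOKEN_COUNTS = {
    "Flat / uniform tables": (67_778, 164_452),
    "Time-series (60 days)": (9_115, 22_245),
    "GitHub repo data": (8_744, 15_144),
    "E-commerce orders (nested)": (73_126, 109_599),
    "Mixed structures": (227_830, 291_711),
}


def reduction_pct(toon_tokens, json_tokens):
    """Percentage of JSON's tokens that TOON avoids for the same data."""
    return (1 - toon_tokens / json_tokens) * 100


for shape, (toon_tokens, json_tokens) in TOKEN_COUNTS.items():
    print(f"{shape:<28} {reduction_pct(toon_tokens, json_tokens):5.1f}%")
```

Running the same function on token counts from your own payloads is a quick way to see which row of the table your data most resembles.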

Want to see how TOON compares against YAML, CSV, and TOML? See the full format comparison.

Accuracy by Model: Which LLM Reads TOON Best?

Token savings only matter if the model can actually comprehend the format. The benchmark tested all four models on the same 209 questions with TOON as the input format. Source: toonformat.dev/guide/benchmarks.

Model            | TOON Accuracy | Assessment
Gemini 3 Flash   | 96.7%         | Production-ready
GPT-5 Nano       | 90.9%         | Production-ready
Claude Haiku 4.5 | 59.8%         | Use JSON for safety
Grok 4.1         | 58.4%         | Use JSON for safety

The gap between model families is substantial. Gemini 3 Flash and GPT-5 Nano handle TOON's tabular structure well, possibly because their training data included large volumes of structured text such as CSVs, log files, and markdown tables; the benchmark itself does not establish the cause. Claude Haiku 4.5 and Grok 4.1 score near 59%, roughly 31 to 38 points below the two leaders. Model size alone does not explain the gap, since GPT-5 Nano is also a small model, so limited exposure to TOON or TOON-like data during pretraining is a hypothesis rather than a finding. If you are routing to Claude Haiku or Grok 4.1, benchmark your specific workload before committing to TOON at scale, or stick with JSON for those endpoints.

Accuracy by Question Type: Where TOON Excels and Where It Struggles

Aggregate accuracy numbers mask important variation across task types. TOON was designed primarily for data retrieval — and the question-type breakdown confirms that design intent.

Question Type         | TOON Accuracy | Example question
Field retrieval       | 99.6%         | "What is the value of X?"
Structure awareness   | 89.0%         | "How many records are there?"
Structural validation | 70.0%         | "Is this field present?"
Aggregation           | 61.9%         | "What is the average X?"
Filtering             | 56.8%         | "Which records match condition Y?"

Field retrieval at 99.6% is essentially perfect — TOON's explicit header schema gives the model a positional anchor for every column, so lookup is near-deterministic. Structure awareness benefits from the same mechanism: the users[3]{...}: header declares the row count directly, so the model does not need to count brackets.
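The same header that helps the model also lets ordinary code check a TOON table before it is sent. The Python sketch below parses a header of the form name[count]{fields}: and verifies the declared row count and column width. The function name check_toon_table and the regular expression are assumptions made for illustration; the naive comma split does not handle quoted values, so treat it as a sanity check rather than a parser.

```python
import re

# Matches headers such as: users[3]{id,name,role,active}:
HEADER_RE = re.compile(r"^(?P<name>\w+)\[(?P<count>\d+)\]\{(?P<fields>[^}]*)\}:$")


def check_toon_table(text):
    """Return (name, fields, rows) if the rows match the header, else raise ValueError."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty input")
    match = HEADER_RE.match(lines[0].strip())
    if match is None:
        raise ValueError("first line is not a tabular header")

    fields = match["fields"].split(",")
    # Naive split: assumes no value contains a comma.
    rows = [[cell.strip() for cell in line.split(",")] for line in lines[1:]]

    declared = int(match["count"])
    if len(rows) != declared:
        raise ValueError(f"header declares {declared} rows, found {len(rows)}")
    for index, row in enumerate(rows, start=1):
        if len(row) != len(fields):
            raise ValueError(f"row {index} has {len(row)} values, expected {len(fields)}")
    return match["name"], fields, rows


doc = """users[3]{id,name,role,active}:
  1,Alice,admin,true
  2,Bob,viewer,true
  3,Carol,editor,false"""
name, fields, rows = check_toon_table(doc)
print(name, fields, len(rows))
```

A check like this catches truncated payloads, for example a table cut off by a context-window limit, before the model is asked questions about rows that are not there.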

Aggregation and filtering are harder because they require the model to iterate across all rows — a task that benefits less from TOON's structural clarity and more from the model's arithmetic reasoning capability. For workloads heavy in aggregation or filtering, consider pairing TOON with a query layer. See how to reduce LLM API costs in production.
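One way to read "query layer" is to do the filtering and arithmetic in ordinary code and hand the model only the result. The Python sketch below illustrates that idea under stated assumptions: the summarize helper and the order records are hypothetical and are not part of the benchmark or of any TOON tooling.

```python
from statistics import mean


def summarize(rows, predicate, field):
    """Filter rows in code and aggregate one numeric field over the matches."""
    matches = [row for row in rows if predicate(row)]
    values = [row[field] for row in matches]
    return {
        "count": len(matches),
        "average": mean(values) if values else None,
        "rows": matches,
    }


# Hypothetical order records, for illustration only.
orders = [
    {"id": 1, "status": "shipped", "total": 120.0},
    {"id": 2, "status": "shipped", "total": 80.0},
    {"id": 3, "status": "pending", "total": 40.0},
]

shipped = summarize(orders, lambda row: row["status"] == "shipped", "total")
print(shipped["count"], shipped["average"])
```

The prompt then carries the matching rows and the precomputed average, so the model is left with the lookup-style work where TOON scores highest.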

The Efficiency Metric: Accuracy per 1,000 Tokens

Raw accuracy alone is a misleading metric if two formats have different token costs. The benchmark computes a composite efficiency score: accuracy-points delivered per 1,000 tokens consumed.

Format | Overall Accuracy | Accuracy-pts / 1K tokens
TOON   | 76.4%            | 27.7
JSON   | 75.0%            | 16.4

TOON delivers comparable accuracy using 39.9% fewer tokens. At an equal token budget, that leaves room for roughly 66% more data (1 / (1 - 0.399) ≈ 1.66) at similar accuracy; the 69% figure is the gain in accuracy per token, which also counts TOON's 1.4-point accuracy edge. For RAG pipelines that retrieve dozens of records per query, token efficiency is the number that matters most. See how this plays out in practice in the JSON vs TOON deep-dive.
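Two ratios are easy to conflate here: the gain in accuracy per token, and the extra data that fits in a fixed token budget. The Python sketch below computes both from the published figures; the function and variable names are illustrative.

```python
def accuracy_per_1k(accuracy_pct: float, tokens: float) -> float:
    """Accuracy points delivered per 1,000 tokens consumed."""
    return accuracy_pct / (tokens / 1000)


# Published scores from the table above.
TOON_SCORE, JSON_SCORE = 27.7, 16.4
TOKEN_REDUCTION = 0.399  # TOON uses 39.9% fewer tokens than JSON

# Relative gain in accuracy per token (includes TOON's small accuracy edge).
efficiency_gain = TOON_SCORE / JSON_SCORE - 1

# Extra data that fits in the same token budget, ignoring accuracy.
extra_capacity = 1 / (1 - TOKEN_REDUCTION) - 1

print(f"efficiency gain: {efficiency_gain:.1%}")
print(f"extra capacity:  {extra_capacity:.1%}")
```

The first ratio comes out near 69% and the second near 66%; the gap between them is the contribution of the 76.4% vs 75.0% accuracy difference.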

When TOON Does Not Beat JSON: The "Prompt Tax" and Non-Linear Payoff

No benchmark summary is complete without the caveats. An independent paper, "Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation" (arXiv 2603.03306, Feb 2026), identifies two conditions where JSON wins or ties:

  1. Generation tasks. When the model must output TOON (not just read it), plain JSON achieves better one-shot and final accuracy. The reason is the "prompt tax": to reliably produce valid TOON, you must include format instructions in the system prompt. Those instructions consume tokens — partially offsetting TOON's savings — and introduce a new failure mode if the model ignores them.
  2. Small payloads. TOON's token savings are non-linear. The header declaration (fieldName[n]{col1,col2,...}:) has a fixed upfront cost, and any format instructions in the prompt add to it. On a payload of only a few rows, that overhead can cancel out the savings from omitting repeated keys; the three-row example earlier comes out ahead only because it carries no format instructions. The official benchmark shows a related shape effect: mixed and heavily nested datasets show only 21.9% reduction, while flat datasets with hundreds of rows see 59%. As a rule of thumb rather than a measured result, TOON's payoff threshold is somewhere around 20–30 rows of uniform data. Below that, JSON's simplicity wins.
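The fixed-cost argument can be turned into a simple break-even estimate. The Python sketch below is a linear model under loud assumptions: the function break_even_rows is hypothetical, and the example numbers (150 overhead tokens, 19 JSON tokens per row, 7 TOON tokens per row) are made up for illustration, not taken from the benchmark or the paper. Measure your own payloads with a tokenizer before relying on any threshold.

```python
import math


def break_even_rows(fixed_overhead_tokens, json_tokens_per_row, toon_tokens_per_row):
    """Smallest row count at which TOON's total token cost drops below JSON's.

    Assumes a linear cost model: TOON pays a one-time overhead (header plus any
    format instructions) and then saves a constant number of tokens per row.
    Returns None if TOON saves nothing per row.
    """
    saving_per_row = json_tokens_per_row - toon_tokens_per_row
    if saving_per_row <= 0:
        return None
    # Need rows * saving_per_row > fixed_overhead_tokens.
    return math.floor(fixed_overhead_tokens / saving_per_row) + 1


# Illustrative numbers only.
rows_needed = break_even_rows(
    fixed_overhead_tokens=150,
    json_tokens_per_row=19,
    toon_tokens_per_row=7,
)
print(rows_needed)
```

With these made-up inputs the model breaks even at 13 rows; heavier format instructions or less repetitive rows push the threshold higher.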

Additionally, the model-accuracy table above shows that TOON comprehension is highly model-dependent. Claude Haiku 4.5 and Grok 4.1 score near 59% on TOON input. JSON's 75.0% is an average across all four models, not a per-model baseline, so it does not by itself show that TOON is a net negative on those two endpoints; compare each model's own JSON accuracy in the published results before deciding. Always test the specific model, task type, and data shape for your production workload before switching formats at scale.

What This Means in Production

The benchmark data supports a practical decision framework:

  • Use TOON when: injecting large uniform arrays (logs, time-series, product catalogs, user records) into a Gemini 3 Flash or GPT-5 Nano prompt. Expect 40–59% token reduction and near-baseline accuracy.
  • Use JSON when: the model must generate structured output, the payload is fewer than ~20 rows, you are using Claude Haiku 4.5 or Grok 4.1, or the data is deeply heterogeneous.
  • Test first when: your workload involves aggregation or filtering queries, where TOON accuracy drops to 57–62%.
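The framework above can be written down as a small routing function. The Python sketch below is one possible encoding, not an official recommendation: the function choose_format, the 90% accuracy cutoff, and the 20-row minimum are assumptions chosen to match the bullets, and the accuracy values are the per-model TOON figures quoted earlier in this article.

```python
# Per-model TOON accuracy from the benchmark table in this article.
TOON_ACCURACY = {
    "gemini-3-flash-preview": 96.7,
    "gpt-5-nano": 90.9,
    "claude-haiku-4-5": 59.8,
    "grok-4-1-fast": 58.4,
}


def choose_format(model, row_count, uniform, generates_output,
                  task="retrieval", min_accuracy=90.0, min_rows=20):
    """Return "toon", "json", or "test-first" for a given workload.

    min_accuracy and min_rows are assumed cutoffs, not benchmark results.
    Unknown models fall back to JSON.
    """
    if generates_output or not uniform or row_count < min_rows:
        return "json"
    if TOON_ACCURACY.get(model, 0.0) < min_accuracy:
        return "json"
    if task in {"aggregation", "filtering"}:
        return "test-first"
    return "toon"


print(choose_format("gemini-3-flash-preview", 500, uniform=True, generates_output=False))
print(choose_format("claude-haiku-4-5", 500, uniform=True, generates_output=False))
```

Keeping the cutoffs as parameters makes it easy to replace them with numbers from your own evaluation runs.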

To convert your existing JSON to TOON instantly without writing any code, use the free json2toon converter. For a comprehensive look at which format fits which scenario, read the TOON format comparison guide.

Frequently Asked Questions

How many tokens does TOON save vs JSON?

In the official TOON benchmark (5,016 LLM calls across 4 models), TOON used 39.9% fewer tokens than JSON on average. Savings vary by data shape: flat/uniform tables see up to 58.8% reduction, while mixed structures see around 21.9%. Source: toonformat.dev/guide/benchmarks.

Is TOON more accurate than JSON for LLMs?

Marginally yes for retrieval tasks: TOON achieved 76.4% overall retrieval accuracy vs JSON's 75.0% in the official benchmark. The real advantage is efficiency — TOON delivers 27.7 accuracy-points per 1,000 tokens vs JSON's 16.4, a 69% improvement in accuracy per token spent.

Which model reads TOON best?

Gemini 3 Flash leads at 96.7% accuracy on TOON inputs, followed by GPT-5 Nano at 90.9%. Claude Haiku 4.5 scored 59.8% and Grok 4.1 scored 58.4%, suggesting TOON comprehension quality varies significantly across model families.

Does TOON ever lose to JSON?

Yes. An independent arXiv study (2603.03306) found that for generation tasks — asking the model to output TOON — plain JSON had better one-shot accuracy due to TOON's "prompt tax." TOON's efficiency is also non-linear; on small payloads, format instructions can cost more tokens than they save.

What types of questions does TOON answer best?

Field retrieval is TOON's strongest use case at 99.6% accuracy, followed by structure awareness at 89.0%. Accuracy drops for validation (70.0%), aggregation (61.9%), and filtering (56.8%). TOON is purpose-built for lookup and comprehension, not complex multi-step reasoning.
