Using TOON with LangChain and LlamaIndex
Custom output parsers and document formatting to feed TOON-encoded context into LangChain and LlamaIndex pipelines for materially lower token usage.
Neither LangChain nor LlamaIndex speaks TOON natively — but a thin formatting step is all it takes. By serializing retrieved documents as TOON tables before they enter the prompt, you can reclaim 30–60% of context tokens on uniform row data, with no changes to your retrieval logic or model configuration.
Why token count matters in RAG
In a retrieval-augmented pipeline, the majority of tokens in every request come not from the user question but from the injected context: retrieved document chunks, metadata rows, or structured records pulled from a vector store. Each of those tokens costs money, consumes context window space, and — past a saturation point — begins to dilute the signal the model actually needs.
JSON is the default serialization for retrieved records in most frameworks because it is human-readable and universally parsed. But JSON was designed for machine interchange, not LLM consumption. Repeated keys, braces, quotes, and commas are pure structural overhead when the same schema repeats across dozens of retrieved rows. A token is roughly 0.75 English words, so every redundant delimiter eats into the budget that could hold more context.
TOON (Token-Oriented Object Notation) addresses this directly. Its table syntax declares the schema once in a header — for example products[50]{id,name,price,category}: — then emits bare comma-separated values for each row, with no repeated keys. The official TOON benchmarks (5,016 LLM calls across four models) show 58.8% fewer tokens for flat, uniform tables and 33.3% fewer tokens for nested e-commerce orders compared to JSON. See JSON vs TOON for a full format comparison, and What is TOON? for a format primer.
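To make the header-plus-rows idea concrete, here is a minimal sketch of a table encoder. It is not the official TOON implementation: the function names `encode_table` and `_cell` are hypothetical, the quoting rules are deliberately simplified, and it only handles flat rows that all share the same keys.

```python
import json


def _cell(value) -> str:
    """Render one value; quote strings that would be ambiguous in a row."""
    if not isinstance(value, str):
        return json.dumps(value)  # numbers, true/false, null
    needs_quotes = value == "" or value != value.strip() or any(
        ch in value for ch in ',:"\n'
    )
    return json.dumps(value) if needs_quotes else value


def encode_table(name: str, rows: list[dict]) -> str:
    """Encode uniform dict rows as a TOON-style table (simplified sketch)."""
    if not rows:
        return f"{name}[0]:"  # assumed empty-array form; check the spec
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("rows are not uniform; keep them as JSON")
    field_list = ",".join(fields)
    lines = [f"{name}[{len(rows)}]{{{field_list}}}:"]  # schema declared once
    for row in rows:
        lines.append("  " + ",".join(_cell(row[f]) for f in fields))
    return "\n".join(lines)


print(encode_table("products", [
    {"id": 1, "name": "USB-C Hub", "price": 29.99},
    {"id": 2, "name": "Laptop Stand", "price": 34.99},
]))
# products[2]{id,name,price}:
#   1,USB-C Hub,29.99
#   2,Laptop Stand,34.99
```

The keys appear once in the header, so each additional row costs only its values and delimiters. For production use, prefer a maintained TOON library over a hand-rolled encoder like this one.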
JSON context vs TOON context: what changes in a RAG prompt?
The table below uses a concrete example: 10 retrieved product records (id, name, price, category, stock) injected into a RAG prompt. Token counts are estimated with the GPT o200k_base tokenizer, consistent with the TOON benchmark methodology.
| Dimension | JSON context | TOON context |
|---|---|---|
| Approx. tokens (10 rows, 5 fields) | ~420 | ~175 (−58%) |
| Schema declared explicitly | Implicitly (repeated per row) | Yes — header line gives count + fields |
| Structural noise per row | High (braces, quotes, colons, commas) | Low (values only) |
| Retrieval accuracy (benchmark) | 75.0% overall | 76.4% overall; 99.6% on field retrieval |
| Best for | Heterogeneous, deeply nested records | Uniform arrays of objects (typical RAG metadata) |
| LLM generation accuracy | Best one-shot output format | Suitable for context input, not output |
| Framework support | Native (LangChain, LlamaIndex) | Thin formatting layer (no native support yet) |
The accuracy gap is narrow at the overall level, but the token savings are decisive for large retrieval sets. An important caveat from an independent arXiv study (2603.03306): TOON's efficiency is non-linear. The per-row savings only amortize the format-instruction overhead at scale. On small document sets the instructions themselves may cost more tokens than they save, so test with your actual payload sizes before committing.
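The amortization argument can be turned into a rough break-even estimate. The sketch below is an illustration, not a measurement: the per-row figures are simply the averages implied by the ~420 and ~175 token estimates in the table above, and the 60-token instruction overhead is an assumed placeholder you should replace with the measured size of your own format hint.

```python
import math


def break_even_rows(json_per_row: float, toon_per_row: float,
                    instruction_tokens: float) -> int:
    """Smallest row count at which per-row savings cover the instruction cost."""
    saving = json_per_row - toon_per_row
    if saving <= 0:
        raise ValueError("TOON does not save tokens per row for this data")
    return math.ceil(instruction_tokens / saving)


# ~420 / 10 rows for JSON and ~175 / 10 rows for TOON, from the table above.
# instruction_tokens=60 is an assumption, not a benchmark figure.
print(break_even_rows(42.0, 17.5, instruction_tokens=60))  # 3
```

With these inputs the instructions pay for themselves from the third row onward. Less uniform data shrinks the per-row saving and pushes the break-even point higher.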
Pattern 1: LangChain — a TOON-aware format_docs function
LangChain pipelines typically assemble context through a format_docs function that turns a list of Document objects into a single prompt string. That function is the ideal injection point for TOON. The pattern below is illustrative: it shows the structure, not a guaranteed stable API, and assumes a Python TOON encoder is available. The encode_toon import stands in for whichever binding you use (the toon-format JavaScript package has Python equivalents, but the name and signature shown here are hypothetical).
# Pattern: LangChain format_docs with TOON serialization
# Illustrative: adapt field names to your actual Document schema.
from toon_format import encode_toon  # hypothetical Python binding


def format_docs_as_toon(docs: list) -> str:
    """
    Drop-in replacement for the format_docs helper used in LangChain RAG examples.
    Serializes retrieved document metadata as a TOON table,
    then appends raw page_content as prose beneath it.
    """
    if not docs:
        return ""

    # 1. Extract structured metadata fields you want the LLM to see.
    rows = [
        {
            "source": d.metadata.get("source", ""),
            "title": d.metadata.get("title", ""),
            "score": d.metadata.get("relevance_score", ""),
            "chunk": d.metadata.get("chunk_index", ""),
        }
        for d in docs
    ]

    # 2. Serialize the metadata table as TOON.
    #    Header emitted once: docs[N]{source,title,score,chunk}:
    toon_table = encode_toon(rows, array_name="docs")

    # 3. Append the actual text content below the table.
    #    Passage numbers [1], [2], ... follow the row order of the table.
    passages = "\n\n".join(
        f"[{i+1}] {d.page_content}" for i, d in enumerate(docs)
    )
    return f"{toon_table}\n\n{passages}"


# Wire into a standard LCEL chain:
#
# from langchain_core.output_parsers import StrOutputParser
# from langchain_core.runnables import RunnablePassthrough
#
# rag_chain = (
#     {"context": retriever | format_docs_as_toon, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
The metadata table benefits most from TOON — uniform fields across all retrieved rows are exactly the workload where the benchmark shows up to 58.8% token reduction. The free-text page_content is left as prose because TOON does not help heterogeneous, sentence-structured content.
One practical tip: expose both a JSON and a TOON path gated on payload size. For fewer than five retrieved documents the format overhead may not justify itself; switch to TOON when the row count exceeds a threshold you tune empirically for your data.
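A size gate like the one described above could look like the following sketch. The function name `format_rows`, the `toon_encoder` callable, and the default threshold of five rows are all assumptions for illustration; tune the threshold against your own data.

```python
import json
from typing import Callable


def format_rows(rows: list[dict],
                toon_encoder: Callable[[list[dict]], str],
                min_rows: int = 5) -> str:
    """Use TOON only for large, uniform payloads; otherwise emit compact JSON."""
    key_sets = {tuple(row.keys()) for row in rows}
    is_uniform = len(key_sets) == 1
    if len(rows) >= min_rows and is_uniform:
        return toon_encoder(rows)
    # Small or heterogeneous payloads: the TOON instructions may not pay off.
    return json.dumps(rows, separators=(",", ":"))
```

Passing the encoder in as a callable keeps the gate independent of any particular TOON library, and makes it easy to log which path each request took while you tune `min_rows`.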
Pattern 2: LlamaIndex — TOON in a node postprocessor
LlamaIndex separates retrieval from synthesis cleanly. Retrieved NodeWithScore objects pass through a chain of node postprocessors (subclasses of BaseNodePostprocessor) before reaching the response synthesizer. That postprocessor slot is the right place to reformat node metadata as TOON.
# Pattern: LlamaIndex node postprocessor that exposes node metadata as one TOON table
# Illustrative: check current LlamaIndex API docs before using in production.
from typing import Optional

from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle
from toon_format import encode_toon  # hypothetical Python binding

TOON_KEY = "toon_context"


class ToonMetadataPostprocessor(BaseNodePostprocessor):
    """
    Attaches a single TOON-serialized metadata table to the first node and
    hides the per-node metadata from the LLM, so the synthesizer sees one
    compact table instead of repeated key/value blocks.
    """

    def _postprocess_nodes(
        self,
        nodes: list[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> list[NodeWithScore]:
        if not nodes:
            return nodes

        # Drop any table left on a reused node by an earlier query.
        for n in nodes:
            n.node.metadata.pop(TOON_KEY, None)

        # Collect all metadata rows for a single TOON table.
        # Assumes every node carries the same metadata keys.
        rows = [dict(n.node.metadata) for n in nodes]
        toon_str = encode_toon(rows, array_name="retrieved_nodes")

        # Hide the original keys from the LLM-facing text rather than deleting
        # them, so source attribution on the response keeps working.
        for n in nodes:
            n.node.excluded_llm_metadata_keys = list(n.node.metadata.keys())

        # Attach the TOON table to the first node only.
        nodes[0].node.metadata[TOON_KEY] = toon_str
        return nodes


# Usage in a query engine:
#
# from llama_index.core import VectorStoreIndex
# from llama_index.core.query_engine import RetrieverQueryEngine
#
# index = VectorStoreIndex.from_documents(documents)
# retriever = index.as_retriever(similarity_top_k=20)
#
# query_engine = RetrieverQueryEngine.from_args(
#     retriever=retriever,
#     node_postprocessors=[ToonMetadataPostprocessor()],
# )
# response = query_engine.query("Which products are under $50?")
The synthesizer's prompt template will receive the TOON table as part of the context string. You can also go one step further and customize the synthesizer prompt itself to acknowledge the TOON table header:
# Pattern: custom synthesizer prompt that names the TOON table
# Use with LlamaIndex's get_response_synthesizer(text_qa_template=...),
# wrapping the string in llama_index.core.PromptTemplate first.
# Note: depending on your LlamaIndex version, the literal braces in the syntax
# line may need escaping as {{ and }} so they are not read as template variables.
TOON_QA_TMPL = """The following context contains a TOON-formatted table of retrieved document
metadata, followed by the source passages. TOON tables use the syntax:

  array_name[count]{field1,field2,...}:
    value1,value2,...

Use the context below to answer the question. Answer in plain English.
Do NOT reproduce the TOON format in your answer.

Context:
-----------
{context_str}
-----------
Question: {query_str}
Answer: """
The explicit instruction "Do NOT reproduce the TOON format in your answer" is important. As the arXiv study on TOON generation (2603.03306) found, asking the model to output TOON reduces accuracy compared to plain JSON output. Use TOON as a dense input format, not as an output constraint.
How much can you actually save?
The savings depend almost entirely on data shape. The official TOON benchmark measured several real-world data shapes under identical conditions, including:
- Flat / uniform tables: 58.8% fewer tokens (67,778 vs 164,452 tokens). This is the sweet spot for RAG metadata rows.
- Time-series (60-day): 59.0% fewer tokens — nearly identical gains for log data or sensor readings.
- E-commerce orders (nested): 33.3% fewer tokens. Still significant even with moderate nesting.
- Mixed structures: 21.9% fewer tokens. Diminishing but still positive.
Overall, TOON achieved 27.7 accuracy-points per 1,000 tokens versus JSON's 16.4 — a 69% better efficiency ratio. For a deep dive on when these savings show up (and when they do not), see Optimizing RAG Pipelines with TOON.
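The headline percentages can be re-derived from the raw figures quoted above. The two helper names below are hypothetical; the inputs are the numbers already cited in this section.

```python
def token_reduction(baseline_tokens: float, candidate_tokens: float) -> float:
    """Fraction of tokens saved relative to the baseline format."""
    return 1 - candidate_tokens / baseline_tokens


def relative_gain(new_value: float, old_value: float) -> float:
    """How much larger new_value is than old_value, as a fraction."""
    return new_value / old_value - 1


# Flat / uniform tables: 67,778 TOON tokens vs 164,452 JSON tokens.
print(f"{token_reduction(164_452, 67_778):.1%}")  # 58.8%

# Accuracy points per 1,000 tokens: 27.7 (TOON) vs 16.4 (JSON).
print(f"{relative_gain(27.7, 16.4):.0%}")  # 69%
```

Running the same two calculations on token counts from your own prompts gives numbers that are directly comparable with the benchmark's.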
The one counter-case to keep in mind: an independent study (arXiv 2603.03306) found that on very small payloads, the format-instruction overhead — the lines you add to explain TOON syntax to the model — can exceed the per-row savings. TOON's efficiency is non-linear: it pays off on large, repetitive payloads, not on a handful of heterogeneous chunks. Benchmark on your own data before deploying broadly.
A side-by-side prompt comparison
To make the token difference concrete, here is the same five-product retrieval result serialized both ways. You can paste either into the json2toon.co converter to see the conversion live.
# JSON version (~210 tokens for 5 rows)
[
  {"id": 1, "name": "Wireless Headphones", "price": 49.99, "category": "Electronics", "stock": 12},
  {"id": 2, "name": "USB-C Hub", "price": 29.99, "category": "Electronics", "stock": 45},
  {"id": 3, "name": "Laptop Stand", "price": 34.99, "category": "Accessories", "stock": 8},
  {"id": 4, "name": "Mechanical Keyboard", "price": 89.99, "category": "Electronics", "stock": 3},
  {"id": 5, "name": "Monitor Light", "price": 24.99, "category": "Accessories", "stock": 20}
]

# TOON version (~88 tokens for 5 rows, ~58% fewer)
products[5]{id,name,price,category,stock}:
  1,Wireless Headphones,49.99,Electronics,12
  2,USB-C Hub,29.99,Electronics,45
  3,Laptop Stand,34.99,Accessories,8
  4,Mechanical Keyboard,89.99,Electronics,3
  5,Monitor Light,24.99,Accessories,20
The TOON header products[5]{id,name,price,category,stock}: gives the model an explicit schema declaration and row count — the same structural signal JSON embeds implicitly via repetition, but at a fraction of the token cost. See JSON vs TOON for a comprehensive syntax comparison.
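Because the header carries both the field list and the row count, a reader (human or program) can rebuild the records and check that nothing was truncated. The sketch below is a simplified illustration with a hypothetical `decode_table` function: it handles only a single flat table, leaves every value as a string, and does not support quoted cells that contain commas.

```python
import re

# name[count]{field1,field2,...}:
HEADER = re.compile(r"^(\w+)\[(\d+)\]\{([^}]*)\}:$")


def decode_table(text: str) -> tuple[str, list[dict]]:
    """Parse one flat TOON-style table into (array_name, rows)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("missing or malformed table header")
    name = match.group(1)
    declared_count = int(match.group(2))
    fields = match.group(3).split(",")

    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split(",")]
        if len(cells) != len(fields):
            raise ValueError(f"expected {len(fields)} values, got {len(cells)}")
        rows.append(dict(zip(fields, cells)))

    # The declared count lets us detect a truncated or padded table.
    if len(rows) != declared_count:
        raise ValueError(f"header declares {declared_count} rows, found {len(rows)}")
    return name, rows
```

The row-count check is the useful part in a RAG setting: if a context window limit or a chunking step cuts the table short, the mismatch is detectable instead of silent.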
Implementation checklist
- Identify your context hot spots. Count tokens in a typical RAG prompt. Most of them are probably in the metadata table, not the passage text — that is where TOON helps.
- Gate by row count. Only switch to TOON when retrieved row count exceeds your empirically determined threshold (typically 5–10 rows). Below that, the format-instruction lines may not amortize.
- Inject TOON, ask for JSON output. When you need structured output, request JSON from the model; never ask it to reproduce TOON. The arXiv study (2603.03306) found that plain JSON has the best one-shot generation accuracy.
- Add a format hint in the system prompt. A brief one-line explanation — "The context table uses TOON format: one header, then one row per item" — is sufficient for modern models.
- Monitor accuracy, not just tokens. Run a small retrieval eval on your domain data. The TOON benchmark showed 99.6% field retrieval accuracy, but aggregation (61.9%) and filtering (56.8%) were weaker — know which question types you rely on.
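For the last checklist item, a per-question-type breakdown is more informative than a single accuracy number. The sketch below assumes you have already graded each eval answer as correct or not; the `accuracy_by_type` name and the question-type labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate graded eval results into accuracy per question type."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


# Each tuple is (question type, whether the model's answer was graded correct).
graded = [
    ("field_retrieval", True),
    ("field_retrieval", True),
    ("aggregation", True),
    ("aggregation", False),
]
print(accuracy_by_type(graded))
# {'field_retrieval': 1.0, 'aggregation': 0.5}
```

Run the same graded set once with JSON context and once with TOON context, then compare the two dictionaries type by type before switching formats.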
Frequently Asked Questions
Does LangChain support TOON natively?
No. LangChain has no built-in TOON serializer as of mid-2026. However, you can add TOON as a thin formatting step: convert retrieved documents to TOON using the toon-format library before assembling the prompt string, without touching any LangChain internals.
How do I use TOON in a RAG pipeline?
Intercept the document-formatting step — the function that turns retrieved chunks into a prompt string — and emit TOON instead of JSON or plain text. For tabular metadata (product rows, log entries, user records), TOON cuts token usage by 33–59% compared to JSON according to the official TOON benchmarks.
Will the LLM understand TOON context?
Yes, for context comprehension. The official TOON benchmark (5,016 LLM calls across four models) found 76.4% retrieval accuracy with TOON vs 75.0% with JSON. Field retrieval accuracy reached 99.6%. Use TOON to feed context, but do not ask the model to output TOON — generation accuracy is lower than plain JSON.
Does TOON work with LlamaIndex?
LlamaIndex has no native TOON support, but its NodePostprocessor and response-synthesizer prompt templates are easy extension points. You can serialize node metadata as a TOON table inside a custom postprocessor before the nodes reach the synthesizer, saving tokens on every query.
When is TOON not worth using in RAG?
An arXiv study (2603.03306) found that TOON's per-row savings only amortize the format-instruction overhead on large, repetitive payloads. For very small document sets (a handful of heterogeneous chunks), the extra prompt instructions can cost more tokens than they save. Test with your actual data volumes.
Recommended Reading
Optimizing RAG Pipelines with TOON
Learn how replacing JSON with TOON in your RAG context chunks can significantly reduce token usage, lower latency, and cut API costs.
Token-Efficient AI Agents: Using TOON for Tool Calls and MCP Pipelines
How to cut token costs in agent loops and Model Context Protocol servers by passing tool results as TOON instead of JSON, with concrete patterns and caveats.
When NOT to Use TOON: The Prompt-Tax Trap and How to Pick a Format
TOON isn't always the cheapest option. Learn about the 'prompt tax', the data shapes where JSON or CSV win, and a framework for choosing an LLM data format.