Parquet & Avro vs TOON: Storage Formats Meet LLM Formats
Parquet and Avro dominate big-data storage, but they're binary and built for disks, not prompts. Learn the store-in-Parquet, serialize-to-TOON pattern for AI pipelines.
Use Parquet or Avro for storage — they dominate data lakes and streaming pipelines for good reason. Use TOON for the LLM prompt. The correct architecture is not a choice between them: query your Parquet data, project only the columns the model needs, then serialize that slice to TOON. Complementary tools, not competitors.
What Are Parquet and Avro Actually For?
The big-data ecosystem has converged on two broad categories of file format, distinguished by how they lay out records on disk.
Row-based formats — CSV, JSON, and Avro — store each complete record together. Avro is the binary, schema-carrying row format that became the standard companion for Apache Kafka: write a stream of events row by row, replay them in order, decode on the consumer side. Its binary encoding is compact and its schema (written in JSON) travels with the file, enabling schema evolution without breaking existing consumers.
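As a rough illustration of schema evolution, the sketch below writes two versions of a hypothetical `OrderEvent` Avro schema as Python dicts (Avro schemas are JSON documents) and mimics, in plain Python, how a reader on the newer schema fills in a default for a field that older records lack. The record and field names are invented for the example, and `resolve_record` is a simplified stand-in for the schema resolution an Avro library performs, not a real API.

```python
import json

# Version 1 of a hypothetical event schema.
SCHEMA_V1 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "status", "type": "string"},
    ],
}

# Version 2 adds a field with a default, so data written with V1 stays readable.
SCHEMA_V2 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": SCHEMA_V1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}


def resolve_record(record, reader_schema):
    """Simplified stand-in for Avro schema resolution: keep the fields the
    reader schema knows about and fill missing ones from their defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"order_id": 1001, "status": "shipped"}  # written with V1
print(json.dumps(SCHEMA_V2, indent=2))
print(resolve_record(old_record, SCHEMA_V2))
```

The design point is that the new field carries a default: without one, a reader on the newer schema would have no value to supply for records written before the change.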
Columnar formats — Parquet and ORC — store data column-by-column rather than row-by-row. When an analytics query needs only two fields from a hundred-column table, the reader can skip every other column on disk without decoding it. According to Airbyte's format comparison, Parquet wins roughly 90% of data-lake use cases because of columnar compression and predicate/projection pushdown. Reading 2 of 7 columns from a Parquet file can be roughly 13 times faster than scanning the equivalent CSV.
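The difference between the two layouts can be shown with plain Python containers. This is a toy in-memory model, not how Parquet encodes bytes on disk; the table contents and the `project_columns` helper are made up for the illustration.

```python
# The same three records in a row layout (one dict per record) ...
rows = [
    {"order_id": 1001, "customer": "Alice", "total_usd": 149.99, "status": "shipped"},
    {"order_id": 1002, "customer": "Bob", "total_usd": 39.00, "status": "pending"},
    {"order_id": 1003, "customer": "Charlie", "total_usd": 299.50, "status": "shipped"},
]

# ... and in a column layout (one list per field).
columns = {name: [row[name] for row in rows] for name in rows[0]}


def project_columns(columnar, wanted):
    """Return only the requested columns; the other lists are never touched."""
    return {name: columnar[name] for name in wanted}


# A row layout has to walk every complete record to pull out two fields.
from_rows = [(row["order_id"], row["status"]) for row in rows]

# A column layout reads two lists and ignores the rest.
projected = project_columns(columns, ["order_id", "status"])
from_columns = list(zip(projected["order_id"], projected["status"]))

print(from_rows == from_columns)
```

Both paths return the same pairs; what differs is how much data each one has to visit to get there, which is the property projection pushdown relies on.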
The pay-per-scan economics are equally compelling. On services that bill by bytes scanned, such as BigQuery on-demand queries and Athena, columnar storage can cut query costs 80–90% versus scanning raw JSON, because unused columns are never read. Snowflake bills for compute time rather than bytes scanned, but reading fewer columns typically shortens queries there too.
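The arithmetic behind that saving is simple enough to sketch. The table shape and the price below are placeholders chosen for round numbers, not figures from any provider, and the model only captures column projection; it ignores compression, which is part of the gap versus raw JSON.

```python
def scan_cost_usd(column_sizes_gb, columns_read, usd_per_tb):
    """Cost of a query billed by bytes scanned when only some columns are read.

    Uses 1 TB = 1000 GB for simplicity; real billing units may differ.
    """
    scanned_gb = sum(column_sizes_gb[name] for name in columns_read)
    return scanned_gb / 1000 * usd_per_tb


# Hypothetical table: ten columns of 50 GB each (500 GB total).
column_sizes_gb = {f"col_{i}": 50.0 for i in range(10)}
PRICE_PER_TB = 5.0  # placeholder price, not a quote from any provider

full_scan = scan_cost_usd(column_sizes_gb, list(column_sizes_gb), PRICE_PER_TB)
two_columns = scan_cost_usd(column_sizes_gb, ["col_0", "col_1"], PRICE_PER_TB)
saving = 1 - two_columns / full_scan

print(f"full scan: ${full_scan:.2f}, two columns: ${two_columns:.2f}, saving: {saving:.0%}")
```

With these assumed numbers, reading two of ten equally sized columns scans a fifth of the data, so the bill drops by the same proportion.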
Neither format was designed with LLM prompts in mind. Both are binary. Both require a decode step before any token can be sent to a model.
Why Parquet and Avro Cannot Go Directly Into an LLM Prompt
LLM APIs — OpenAI, Anthropic, Google Gemini, and every hosted inference endpoint — accept text in the message body. Parquet and Avro are binary formats. To include their data in a prompt, an application must:
1. Open and decode the file using a library (PyArrow, pandas, fastavro, etc.).
2. Select the relevant rows and columns (the projection step).
3. Serialize the resulting data structure to a text format the model can read.
Step three is where format choice matters for LLM costs. Most developers default to JSON for that serialization. JSON is fine, but it repeats every field name on every row — a significant token overhead on wide or deep result sets. TOON declares field names once in a header and then emits only values per row. On flat uniform tables (the typical output of a SQL projection), the official toonformat.dev benchmarks show TOON at 58.8% fewer tokens than JSON (67,778 vs 164,452 tokens) across 5,016 LLM calls with 76.4% retrieval accuracy versus JSON's 75.0%.
The architecture is additive: Parquet gives you fast, cheap, columnar storage; TOON gives you a token-efficient text representation for the slice of that data the model actually needs.
For a deeper look at how binary formats compare to text formats across the full LLM pipeline, see our guide on embeddings and binary data in LLM formats.
The Recommended Pipeline: Parquet → Project → TOON
The pattern that minimizes both storage cost and prompt cost is a three-stage pipeline. Here is a concrete example using Python pseudo-code alongside the resulting TOON output.
# Stage 1: Read from Parquet, projecting only the columns the LLM needs
import pyarrow.parquet as pq
table = pq.read_table(
    "orders.parquet",
    columns=["order_id", "customer", "total_usd", "status"],
)
rows = table.to_pylist()
# rows is now a list of dicts: small, in-memory, only 4 of potentially 30 columns
# Stage 2: Serialize to TOON for the prompt
# (use the @toon-format/toon package or the json2toon.co converter)
# Result sent to the LLM:
orders[5]{order_id,customer,total_usd,status}:
  1001,Alice,149.99,shipped
  1002,Bob,39.00,pending
  1003,Charlie,299.50,shipped
  1004,Diana,19.99,cancelled
  1005,Eve,89.75,processing
# Compare: the equivalent JSON would be:
[
{"order_id": 1001, "customer": "Alice", "total_usd": 149.99, "status": "shipped"},
{"order_id": 1002, "customer": "Bob", "total_usd": 39.00, "status": "pending"},
{"order_id": 1003, "customer": "Charlie", "total_usd": 299.50, "status": "shipped"},
{"order_id": 1004, "customer": "Diana", "total_usd": 19.99, "status": "cancelled"},
{"order_id": 1005, "customer": "Eve", "total_usd": 89.75, "status": "processing"}
]
# Every key is repeated 5 times. On 500 rows, the flat-table benchmark suggests TOON saves roughly 58% of the tokens.
The columnar projection in Stage 1 is critical: reading 4 of 30 columns from Parquet skips the other 26, roughly 87% of the stored columns. Sending the already-slim result to the model as TOON rather than JSON then cuts prompt tokens by up to 58.8% on top of that. The two savings stack: less data is read from storage, and fewer tokens are sent to the model.
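To get a feel for the size difference on the five-row example, the sketch below compares character counts for compact JSON and the equivalent TOON block. Character count is only a rough proxy: real token counts depend on the model's tokenizer, so treat the printed ratio as indicative, not as a reproduction of the benchmark figures. The TOON text is built inline here because every value in this sample is simple enough to need no quoting.

```python
import json

orders = [
    {"order_id": 1001, "customer": "Alice", "total_usd": 149.99, "status": "shipped"},
    {"order_id": 1002, "customer": "Bob", "total_usd": 39.00, "status": "pending"},
    {"order_id": 1003, "customer": "Charlie", "total_usd": 299.50, "status": "shipped"},
    {"order_id": 1004, "customer": "Diana", "total_usd": 19.99, "status": "cancelled"},
    {"order_id": 1005, "customer": "Eve", "total_usd": 89.75, "status": "processing"},
]

# Compact JSON with no extra whitespace, the most favorable form for JSON.
as_json = json.dumps(orders, separators=(",", ":"))

# The TOON block for the same rows: one header line, then values only.
fields = list(orders[0])
header = f"orders[{len(orders)}]{{{','.join(fields)}}}:"
body = ["  " + ",".join(str(row[f]) for f in fields) for row in orders]
as_toon = "\n".join([header] + body)

reduction = 1 - len(as_toon) / len(as_json)
print(f"JSON: {len(as_json)} chars, TOON: {len(as_toon)} chars, reduction: {reduction:.0%}")
```

The gap widens as rows are added, because the JSON cost of repeating keys grows with every row while the TOON header is paid once.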
For the RAG use case specifically — retrieving rows from a data store and inserting them into a retrieval-augmented generation prompt — see our dedicated guide on optimizing RAG pipelines with TOON.
Parquet vs Avro vs JSON vs TOON: Full Comparison
| Criterion | Parquet | Avro | JSON | TOON |
|---|---|---|---|---|
| Orientation | Columnar | Row | Row (text) | Row (text, header once) |
| Binary? | Yes | Yes | No | No |
| LLM-promptable directly? | No (decode first) | No (decode first) | Yes | Yes |
| Column-selective reads | Up to ~13× faster than CSV | Must scan full row | Must scan full record | N/A (in-memory slice) |
| Prompt token count (vs JSON) | N/A | N/A | Baseline | −39.9% overall; −58.8% on flat tables |
| Schema evolution | Yes (metadata) | Yes (schema file) | Implicit / manual | Header declares fields per block |
| Best for | Analytics, data lakes, pay-per-scan warehouses | Kafka / streaming, schema evolution | APIs, configs, LLM output | Feeding structured data into LLM prompts |
When Should You Use JSON Instead of TOON for the Prompt?
TOON's token savings are not uniform across all data shapes. The official benchmarks show a range: 58.8% on flat uniform tables down to 21.9% on mixed structures. For small result sets — fewer than roughly ten rows — the fixed cost of TOON format instructions can erode the per-row savings. In those cases, plain JSON is simpler and the cost difference is negligible.
TOON's advantage is strongest precisely when Parquet is most useful: large, wide, uniform tables where you are projecting a slice of many rows sharing the same schema. The two formats were made for each other at their respective layers.
If your Parquet data has highly non-uniform or deeply nested records — think document-style JSON stored in Parquet's nested group types — then plain JSON may be a better serialization target for the prompt slice. See our JSON vs TOON comparison for the full decision framework.
For purely flat, single-level data with no nesting at all, CSV is worth considering — it carries zero structural overhead and every model understands it without format instructions. Our CSV vs TOON guide covers exactly where that boundary sits.
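The guidance in this section can be condensed into a small heuristic. The function below is a sketch under the assumptions stated above: the ten-row threshold comes from the "roughly ten rows" rule of thumb, and `choose_prompt_format` is a hypothetical helper, not part of any TOON tooling. Tune both the threshold and the checks against your own token measurements.

```python
def choose_prompt_format(rows, min_rows_for_toon=10):
    """Pick a serialization for a list of dict records bound for a prompt."""
    if len(rows) < min_rows_for_toon:
        return "json"  # too few rows for the one-time header to pay off

    # Uniform: every record has exactly the same set of fields.
    uniform = all(row.keys() == rows[0].keys() for row in rows)
    # Flat: no nested objects or arrays among the values.
    flat = all(
        not isinstance(value, (dict, list))
        for row in rows
        for value in row.values()
    )
    if not (uniform and flat):
        return "json"  # irregular or nested records

    return "toon"  # flat, uniform table; CSV is also worth considering here


flat_rows = [{"id": i, "status": "ok"} for i in range(50)]
nested_rows = [{"id": i, "tags": ["a", "b"]} for i in range(50)]

print(choose_prompt_format(flat_rows))       # large flat table
print(choose_prompt_format(flat_rows[:3]))   # tiny result set
print(choose_prompt_format(nested_rows))     # nested values
```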
Why Tabular Structure Specifically Helps LLMs
There is independent research explaining why the Parquet-to-TOON pipeline works beyond token counting. An arXiv paper first submitted in December 2024 (arXiv 2412.17189), "Talking with Tables for Better LLM Factual Data Interactions", found that providing data as tabular structures yields a 40.29% average performance gain plus better robustness and token efficiency versus semi-structured formats like knowledge graphs and JSON. The authors' attention analysis showed that tables help LLMs attend to relevant information more efficiently, which is consistent with accuracy gains arriving alongside the token savings.
Parquet organizes data into columns internally; TOON organizes data into a table block externally (in the prompt). Both exploit the same fundamental property: structured, regular layouts are easier to process than freeform nesting. The difference is that Parquet exploits it for disk I/O while TOON exploits it for attention heads.
Frequently Asked Questions
Should I use Parquet/Avro or TOON for LLMs?
Use Parquet or Avro for storage and transport — they are binary formats optimized for analytics and streaming respectively. Use TOON for the LLM prompt. The correct pattern is: query your Parquet or Avro data, project only the columns the model needs, then serialize that slice to TOON. The two formats are complementary, not competitors.
Why can't I send Parquet files directly to an LLM?
Parquet is a binary columnar format. LLM APIs accept text. To include Parquet data in a prompt you must decode it first, which means reading it into a data structure and then re-serializing to a text format. TOON is a token-efficient text format for that final serialization step, using 39.9% fewer tokens than JSON on average according to the official benchmarks.
What is the difference between Parquet and Avro?
Parquet is columnar: it stores data column-by-column, making it ideal for analytics queries that read only a few columns from wide tables. Avro is row-oriented: it stores records row-by-row and is the standard pairing for Kafka and streaming pipelines. Parquet wins roughly 90% of data-lake use cases; Avro dominates real-time streaming.
How much faster is Parquet than CSV for column-selective queries?
Reading 2 of 7 columns from a Parquet file can be roughly 13 times faster than scanning the equivalent CSV, because Parquet's columnar layout lets the reader skip irrelevant columns entirely. On services that bill by bytes scanned, such as BigQuery on-demand queries and Athena, columnar storage can also cut query costs 80–90% versus scanning raw JSON. Snowflake bills for compute time rather than bytes scanned, but reading fewer columns typically shortens queries there too.
How many tokens does TOON save versus JSON for tabular data?
According to official toonformat.dev benchmarks across 5,016 LLM calls, TOON uses 39.9% fewer tokens than JSON overall. On flat uniform tables — the typical output of a SQL projection or Parquet read — the reduction reaches 58.8% (67,778 vs 164,452 tokens), with 76.4% retrieval accuracy versus JSON's 75.0%.