Apache Avro vs. JSON: An Intro into Data Serialization

Apache Avro is a data serialization system that balances performance with flexibility. To understand why Avro is useful, we must first look at the inefficiencies of standard text-based formats like JSON.

The Problem: Why JSON Wastes Space

JSON is self-describing, meaning every single message carries its own structural metadata (field names).

Example JSON Message:

{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

The "Waste" Breakdown

If we analyze the raw string of the message above (ignoring whitespace):

{"userName":"Martin","favoriteNumber":1337,"interests":["daydreaming","hacking"]}
  • Total Size: 81 bytes.

  • Actual Data Values: "Martin", 1337, "daydreaming", "hacking" (34 bytes, counting the quotes around the strings).

  • Metadata Overhead: userName, favoriteNumber, interests, plus {, }, ", :, [, ] and , (47 bytes).

In this example, ~58% of the payload is overhead rather than data. If you send 1 million records, you repeat those same 47 bytes of structural "noise" 1 million times, spending ~47 MB of network transfer per million records on it.
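The byte counts are easy to check yourself. The snippet below is a minimal sketch using only the Python standard library; the helper name encoded_size is made up for this illustration. It treats the quoted strings, the number, and the array elements as "data" and everything else as overhead.

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Compact separators drop the spaces json.dumps inserts by default.
payload = json.dumps(record, separators=(",", ":")).encode("utf-8")


def encoded_size(value):
    """Bytes one value occupies in compact JSON, quotes included."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


# Leaf values only: the string, the number, and each array element.
value_bytes = (
    encoded_size(record["userName"])
    + encoded_size(record["favoriteNumber"])
    + sum(encoded_size(item) for item in record["interests"])
)
overhead_bytes = len(payload) - value_bytes

print("total:", len(payload))
print("values:", value_bytes)
print("overhead:", overhead_bytes, f"({overhead_bytes / len(payload):.0%})")
```

Whether the quotes around string values count as data or as overhead is a judgment call; moving them to the overhead side only makes the ratio worse.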

Why this is inefficient:

  1. Redundant Metadata: In high-throughput systems like Kafka, the majority of your bandwidth is often spent sending field names rather than the data itself.

  2. Textual Representation: The number 1337 is stored as four ASCII characters ("1", "3", "3", "7"), taking 4 bytes. In binary, a number like 1337 can be packed into just 2 bytes.

  3. No Strict Typing: JSON has a single "number" type and no schema; a value could arrive as an integer in one message and a float in the next, so consumers have to check and convert types at runtime.
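Points 2 and 3 can be seen in a few lines of Python. This is only an illustration with the standard library: the fixed 16-bit layout is an assumption chosen for the comparison, not what Avro itself uses.

```python
import json
import struct

# The same number as JSON text and as a fixed-width binary integer.
as_text = json.dumps(1337).encode("utf-8")  # the four ASCII digits
as_binary = struct.pack(">H", 1337)         # unsigned 16-bit, big-endian

print(len(as_text), "bytes as text,", len(as_binary), "bytes as binary")

# JSON does not distinguish integers from floats, so the type the
# consumer ends up with depends on how the sender formatted the value.
first = json.loads('{"favoriteNumber": 1337}')["favoriteNumber"]
second = json.loads('{"favoriteNumber": 1337.0}')["favoriteNumber"]

print(type(first).__name__, type(second).__name__)
```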

Avro solves this by stripping away the metadata from the data payload entirely.

How Avro Packs Data

Avro’s efficiency stems from its "Schema-Direct" approach. It does not store field names or types within the data itself.

The Binary Layout

When Avro serializes an object, it writes values in the exact order defined in the schema:

  • Positional Writing: Unlike JSON which labels every value with a key, Avro writes only the raw values in a strict sequence. The reader knows the first field is a userName because the schema defines it as the first entry.

  • Variable-Length Encoding: Integers (int and long) use variable-length zig-zag encoding. Values from -64 to 63 take 1 byte; larger magnitudes take more, up to 5 bytes for an int and 10 bytes for a long.

  • Strings/Bytes: Stored as a long length followed by that many bytes of UTF-8 data.

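  • Nulls: A null value itself occupies zero bytes in the stream. A nullable field is declared as a union (e.g., ["null", "string"]), and the union writes a short branch index ahead of the value, so a null field typically costs one byte.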
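The encoding rules above fit in a handful of lines. The following is a hand-rolled sketch for illustration, not the Avro library's API; the function names are made up, and it assumes values fit in a signed 64-bit long.

```python
def zigzag(n):
    """Interleave signed values so small magnitudes stay small:
    0, -1, 1, -2, 2 ... become 0, 1, 2, 3, 4 ..."""
    return (n << 1) ^ (n >> 63)  # assumes -2**63 <= n < 2**63


def encode_long(n):
    """Zig-zag the value, then emit 7 bits per byte, low bits first.
    The high bit of each byte signals that more bytes follow."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_nullable_string(s):
    """Union ["null", "string"]: branch index first, then the value.
    Branch 0 (null) writes nothing after the index."""
    if s is None:
        return encode_long(0)
    return encode_long(1) + encode_string(s)


print(encode_long(63).hex())          # still one byte
print(encode_long(64).hex())          # spills into a second byte
print(encode_long(1337).hex())
print(encode_string("Martin").hex())
print(encode_nullable_string(None).hex())
```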

Example Packing

Schema:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "int"},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}

Data:

{ "userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"] }

Binary Output:

[Image: binary encoding of the record, taken from Designing Data-Intensive Applications]

  • 0C: Length of "Martin" (6).

  • 4D...6E: ASCII for "Martin".

  • F2 14: Zig-zag encoded 1337.

  • 04: Array block count (2 items).

  • 16...: String lengths and content for interests.

  • 00: Array termination marker.

Crucial Point: Without the schema, this binary blob is undecipherable. You wouldn't know if F2 14 is a number or part of a string. The payload carries no field names and no type tags, only length prefixes and block counts.
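To make the point concrete, here is a minimal sketch of a decoder for exactly this schema, written by hand rather than with the Avro library. The helper names are made up for this example. Note that the field order in decode_user is the only thing telling the code what each run of bytes means.

```python
import io


def read_long(buf):
    """Read one zig-zag varint (Avro uses it for both int and long)."""
    result, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)  # undo zig-zag
        shift += 7


def read_string(buf):
    """Length prefix, then that many bytes of UTF-8."""
    return buf.read(read_long(buf)).decode("utf-8")


def read_array(buf, read_item):
    """Arrays come in blocks: an item count, the items, and a final 0."""
    items = []
    while True:
        count = read_long(buf)
        if count == 0:
            return items
        if count < 0:  # negative count: a block size in bytes follows
            count = -count
            read_long(buf)
        items.extend(read_item(buf) for _ in range(count))


def decode_user(data):
    """Read the fields in the order the schema declares them."""
    buf = io.BytesIO(data)
    return {
        "userName": read_string(buf),
        "favoriteNumber": read_long(buf),
        "interests": read_array(buf, read_string),
    }


payload = (
    b"\x0c" + b"Martin"          # length 6, then the text
    + b"\xf2\x14"                # 1337
    + b"\x04"                    # array block with 2 items
    + b"\x16" + b"daydreaming"   # length 11, then the text
    + b"\x0e" + b"hacking"       # length 7, then the text
    + b"\x00"                    # end of array
)

print(len(payload), "bytes")
print(decode_user(payload))
```

Swap two lines inside decode_user and the same bytes decode to garbage or fail outright, which is exactly why the reader needs the writer's schema.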

Why Two Schemas?

The core innovation of Avro is Schema Resolution. For this to work, the system must use two distinct schemas simultaneously at read-time. These schemas often live inside the applications themselves.

  1. The Writer's Schema: When an application encodes data, it uses the schema version it currently knows. This version is often compiled directly into the application code.

  2. The Reader's Schema: When an application decodes data, it expects the data to follow a specific structure. This is the schema the application code is relying on.

The Key Idea: Decoupling

The writer's schema and the reader's schema do not have to be the same. They only need to be compatible.

  • The Problem: In Avro's binary world, there are no keys. If the Writer added a new field (e.g., email) at the beginning of the schema, all the subsequent bytes shift over, and a Reader decoding purely by position would misread every field after it.

  • The Solution: At read-time, the Avro library compares the two schemas, matching fields by name. If the Writer's schema has a field the Reader doesn't recognize, the library uses the Writer's schema to work out how many bytes that field occupies and skips it. If the Reader expects a field the Writer didn't include, the library fills it in with the default declared in the Reader's schema.

This "handshake" allows different versions of an application to communicate without ever needing to embed field names in the binary payload.
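A stripped-down version of that resolution step might look like the sketch below. It is a hand-written illustration, not the Avro library: it only handles string, int and long fields, it skips unknown fields by reading and discarding them, and the country field and its default are hypothetical additions made up for the example.

```python
import io


def read_long(buf):
    """Read one zig-zag varint."""
    result, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def read_string(buf):
    return buf.read(read_long(buf)).decode("utf-8")


READ = {"string": read_string, "int": read_long, "long": read_long}


def decode(data, writer_schema, reader_schema):
    buf = io.BytesIO(data)
    wanted = {f["name"] for f in reader_schema["fields"]}
    seen = {}
    # The bytes are laid out in the writer's field order, so walk that.
    for field in writer_schema["fields"]:
        value = READ[field["type"]](buf)  # always consume the bytes
        if field["name"] in wanted:
            seen[field["name"]] = value   # unknown fields are dropped
    result = {}
    # Then build the record the reader expects, matching by name.
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in seen:
            result[name] = seen[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return result


WRITER_SCHEMA = {"fields": [
    {"name": "email", "type": "string"},  # new field, written first
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "int"},
]}
READER_SCHEMA = {"fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "int"},
    {"name": "country", "type": "string", "default": "unknown"},
]}

payload = b"\x1a" + b"m@example.com" + b"\x0c" + b"Martin" + b"\xf2\x14"
user = decode(payload, WRITER_SCHEMA, READER_SCHEMA)
print(user)
```

The reader never sees email, and country comes from the reader's own default, even though neither side's field list matches the other's.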

Backward and Forward Compatibility

Backward Compatibility

Reading old data with a new schema.

  • Requirement: New fields in the Reader's schema must have a default value.

  • Mechanism: If the Reader's schema contains a field that isn't in the Writer's data (old data), the library fills it with the default.

Forward Compatibility

Reading new data with an old schema.

  • Requirement: Any field removed in the new schema must have had a default value in the old schema.

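  • Mechanism: If the old Reader encounters fields it doesn't recognize, it uses the Writer's schema (the new one) to know exactly how many bytes to skip for the "unknown" new fields. For a field the new Writer no longer sends, the old Reader falls back on the default from its own schema.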

Summary of Rules

| Change | Compatibility / Mechanism | Requirement |
| --- | --- | --- |
| Adding a Field | Backward Compatible | New field must have a default. |
| Removing a Field | Forward Compatible | Deleted field must have had a default. |
| Changing Types | Promotion | Only "upward" (e.g., int to long). |
| Renaming Fields | Aliases | Use aliases to map old names to new ones. |

Important: If a field was originally created without a default value, it cannot be safely removed while old readers are still running. Removing it breaks forward compatibility: old readers would find the field missing, have no default value to fall back on, and fail with a decoding error.
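These rules are mechanical enough to check in code before a schema change ships. The function below is a simplified sketch, not a real registry's compatibility check: the name read_problems is made up, it only understands primitive type names, and it ignores aliases. The promotion table lists the primitive promotions Avro allows.

```python
# Writer type -> reader types it may be promoted to.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def read_problems(writer_schema, reader_schema):
    """List what would go wrong when this reader decodes this writer's data."""
    written = {f["name"]: f["type"] for f in writer_schema["fields"]}
    found = []
    for field in reader_schema["fields"]:
        name, wanted = field["name"], field["type"]
        if name not in written:
            if "default" not in field:
                found.append(f"{name}: missing from data and has no default")
        elif written[name] != wanted and wanted not in PROMOTIONS.get(written[name], set()):
            found.append(f"{name}: cannot read {written[name]} as {wanted}")
    return found


V1 = {"fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "int"},  # no default
]}
# Adds a field with a default.
V2 = {"fields": V1["fields"] + [{"name": "email", "type": "string", "default": ""}]}
# Removes a field that never had a default.
V3 = {"fields": [{"name": "userName", "type": "string"}]}
# Widens int to long.
V4 = {"fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "long"},
]}

print("new reader, old data:", read_problems(V1, V2))  # backward
print("old reader, new data:", read_problems(V2, V1))  # forward
print("field removed:       ", read_problems(V3, V1))
print("int read as long:    ", read_problems(V1, V4))
print("long read as int:    ", read_problems(V4, V1))
```

The third case is the one the warning above describes: once favoriteNumber is gone from the data, an old reader has nothing to put in its place.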

When to use JSON vs. Avro

While Avro is highly efficient, it introduces complexity (schema management). Choosing between them depends on your infrastructure.

Use JSON when:

  • Human Readability is Priority: You need to debug data by looking at raw logs or using simple curl commands.

  • Public APIs: You are providing data to third-party developers who shouldn't be forced to implement Avro libraries or fetch schemas.

  • Frontend/Browser Communication: Communication from the frontend should typically use JSON. Syncing schemas between a backend and a browser/client environment often carries a higher overhead than the network savings are worth.

  • Low Volume / Simple Architectures: If you only send a few messages an hour, the storage savings of Avro are negligible compared to the overhead of setting up a Schema Registry.

Use Avro when:

  • High Throughput & Scale: You are processing millions of events regularly (e.g., Kafka streams).

  • Big Data Storage: You are storing petabytes of data in a data lake (Hadoop, S3). The 30-70% space savings translate to massive cost reductions.

  • Internal Microservices: Within a private ecosystem, you can enforce strict schema contracts between teams, which saves time when internal services communicate with each other millions of times every hour.

  • Long-term Evolution: You need a formal system to guarantee that a change in one service won't accidentally break five downstream consumers.