DDIA Chapter 5 Guide — Encoding and Evolution, Part 1

In Chapter 5 of DDIA, the author compares common data encoding formats used in production systems, including JSON, XML, Protocol Buffers, and Avro. The chapter is not only about how these formats represent data, but also about how systems handle compatibility as data formats evolve.

In real systems, you rarely get to update every service, every record, and every client at the same time. Old and new versions of code and data often need to coexist. That makes schema evolution a central concern: can the data format tolerate changes to its structure over time?

The second half of the chapter then discusses where these formats are used in practice: databases, web services, REST APIs, RPC, workflow engines, and event-driven systems such as message queues.

Encoding Data

When we write programs, we usually deal with at least two representations of data. One is the in-memory data structure, such as an array, a tree, or a hash table, laid out for efficient access by the CPU and often linked together with pointers. The other is a self-contained sequence of bytes, which is what we need when data is written to a file or sent over the network. We cannot simply transmit the in-memory representation, because memory addresses only make sense inside the current process. On another machine, or even in another process, those pointers are meaningless.

So when data needs to be transmitted, we encode it: we convert it into a format that can be stored in a file, sent over the network, and understood by another machine. JSON is one of the most widely used examples.
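
A minimal Python sketch of that round trip, using the standard json module. The record values are borrowed from the book's example that appears later in this guide, and the variable names are only illustrative.

```python
import json

# An in-memory structure: a dict holding a string, an integer, and a list.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Encoding: turn the in-memory object into a self-contained byte sequence.
encoded = json.dumps(record).encode("utf-8")

# Decoding: the receiver rebuilds an equivalent structure from the bytes alone.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__)   # bytes
print(decoded == record)        # True: same values
print(decoded is record)        # False: a separate object in memory
```

The decoded object is equal in value but shares no memory with the original, which is the point: only the bytes crossed the boundary.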

In large systems, data almost never stays inside one piece of code. It gets written to databases and sent through APIs from the backend to the frontend. That makes encoding an important topic in data-intensive systems: we need to turn convenient in-memory structures into formats that other machines and other parts of the system can understand.

So when we compare JSON, XML, and Protocol Buffers, the real question is: when data needs to move, how do we represent it in a way that is safe, stable, and efficient for others to read?

JSON, XML, and Binary Variants

When people think about standardized encoding, JSON and XML are often the first formats that come to mind. Many languages also ship their own built-in encoding, such as java.io.Serializable in Java, pickle in Python, and Marshal in Ruby, but those formats are tied to one language and are hard to read from any other. JSON and XML are language-independent, and almost every programming language has mature libraries for parsing and generating them. If the backend is written in Java, the frontend in JavaScript, and data processing in Python, agreeing on JSON lets all sides exchange data.

Still, JSON and XML come with several practical drawbacks.

The first issue is number encoding. In everyday programming, a number may feel like just a number and a string like just a string. But in XML, <id>12345</id> does not tell us whether the value is intended to be a number or a string unless there is an additional schema. JSON can distinguish numbers from strings, but it does not distinguish integers from floating-point values, and it does not define precision.

That can be risky. JavaScript represents every number as an IEEE 754 double-precision float, which can hold integers exactly only up to 2^53. If a system uses 64-bit integers as IDs and sends them as JSON numbers to JavaScript, any ID above that limit may be silently rounded during parsing. The book mentions Twitter as an example: tweets were identified by 64-bit numbers, so API responses included both a numeric id and a decimal string representation to avoid JavaScript precision problems.
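
The rounding can be reproduced in Python by forcing every JSON integer through a double, which approximates what a JavaScript parser does. This is a sketch under that assumption: the post_id value is made up, and the id and id_str field names are illustrative of the two-field approach described above.

```python
import json

# Hypothetical 64-bit ID just above the largest exactly representable range.
post_id = 2**53 + 1

# Send the ID twice: once as a JSON number, once as a decimal string.
payload = json.dumps({"id": post_id, "id_str": str(post_id)})

# parse_int=float makes the parser treat integers as doubles,
# imitating a JavaScript client.
parsed = json.loads(payload, parse_int=float)

print(int(parsed["id"]) == post_id)       # False: the number was rounded
print(int(parsed["id_str"]) == post_id)   # True: the string survived intact
```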

The second issue is that JSON and XML do not natively support binary data. If you need to send images, audio files, or encrypted file contents, you are dealing with raw bytes. A common workaround is to encode those bytes as Base64 strings. That works, but it is awkward: you are turning bytes into text just to transmit them, and because Base64 writes every 3 bytes as 4 characters, the result is about 33% larger. For large files, that extra space matters.
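
A small sketch of the overhead using Python's standard base64 module. The 768-byte payload is arbitrary stand-in data, not a real image.

```python
import base64

# Arbitrary binary payload (768 bytes covering every byte value).
raw = bytes(range(256)) * 3

# Base64 maps every 3 input bytes to 4 ASCII characters.
text = base64.b64encode(raw).decode("ascii")

print(len(raw), len(text))             # 768 1024
print(base64.b64decode(text) == raw)   # True: the round trip is lossless
```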

The third issue is that schema support is optional and brings its own complexity. Looking at a JSON document alone does not tell you the full shape of the data, so you often need JSON Schema. JSON Schema supports both open and closed content models, and the open model is the default. Suppose a schema declares userName as a string:

{
  "type": "object",
  "properties": {
    "userName": { "type": "string" }
  }
}

In an open content model, the following document is still allowed:

{
  "userName": "Martin",
  "age": 30
}

That means an open schema does not say which fields may or may not exist. It only says that if a defined field appears, it must follow the rule. Undefined fields may still appear; they are simply not constrained. This makes evolution easier, because a newer service can add fields while remaining compatible. The downside is that you cannot always infer the exact shape of the data from the schema alone. A closed content model, enabled with "additionalProperties": false, rejects undefined fields instead.
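
The difference between the two models can be sketched with a toy checker. This is not a JSON Schema implementation: the validate function and its allow_extra flag are hypothetical names invented for this illustration.

```python
def validate(doc: dict, field_types: dict, allow_extra: bool = True) -> bool:
    """Toy check: field_types maps a field name to the required Python type.

    allow_extra=True imitates an open content model,
    allow_extra=False imitates a closed one.
    """
    for name, value in doc.items():
        if name in field_types:
            if not isinstance(value, field_types[name]):
                return False   # a defined field broke its rule
        elif not allow_extra:
            return False       # closed model: undefined field rejected
    return True


rules = {"userName": str}
doc = {"userName": "Martin", "age": 30}

print(validate(doc, rules))                      # True: open model ignores age
print(validate(doc, rules, allow_extra=False))   # False: closed model rejects age
print(validate({"userName": 42}, rules))         # False: defined field, wrong type
```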

JSON Schema can go further with patternProperties. The example below, adapted from the book, requires every key to match the regular expression ^[0-9]+$, meaning it must consist only of digits, and each corresponding value must be a string. This gives you very fine-grained control, but it also increases schema management complexity.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "patternProperties": {
    "^[0-9]+$": {
      "type": "string"
    }
  },
  "additionalProperties": false
}
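
What this schema accepts can be approximated in a few lines of Python with the standard re module. The matches_map_schema function is a hypothetical stand-in for a real validator, written only to show the rule.

```python
import re

# Same intent as the schema's "^[0-9]+$": the whole key must be digits.
DIGITS_ONLY = re.compile(r"[0-9]+")


def matches_map_schema(doc) -> bool:
    """Accept only objects whose keys are all digits and whose values are strings."""
    if not isinstance(doc, dict):
        return False
    return all(
        isinstance(key, str)
        and DIGITS_ONLY.fullmatch(key) is not None
        and isinstance(value, str)
        for key, value in doc.items()
    )


print(matches_map_schema({"1": "a", "42": "b"}))     # True
print(matches_map_schema({"1": "a", "name": "b"}))   # False: non-digit key
print(matches_map_schema({"1": 5}))                  # False: value is not a string
```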

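Because JSON and XML have these limitations and are not especially compact, people have proposed binary encodings of the same data model. In the book's MessagePack example, the pair userName: "Martin" is written as the byte 0xa8 (meaning a string of 8 bytes follows), then the 8 bytes of "userName", then the byte 0xa6 (a string of 6 bytes follows), then the 6 bytes of "Martin". Type and length prefixes replace the quotes, colons, and commas, but because there is no schema, the field names are still stored in every record. The saving is therefore modest: 66 bytes for the full example record, compared with 81 bytes for JSON with whitespace removed.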
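
The byte values quoted above match MessagePack's encoding. A hand-rolled sketch of just the subset needed for the book's record (short strings, small non-negative integers, short lists and maps) shows where the bytes go. The pack function is hypothetical and is not the API of any MessagePack library.

```python
import json


def pack(value) -> bytes:
    """Encode a tiny subset of MessagePack, enough for the example record."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("sketch only handles strings up to 31 bytes")
        return bytes([0xA0 | len(data)]) + data       # prefix: 0xa0 + length
    if isinstance(value, int) and 0 <= value <= 0xFFFF:
        if value <= 0x7F:
            return bytes([value])                     # small int: one byte
        if value <= 0xFF:
            return b"\xcc" + bytes([value])           # 8-bit unsigned
        return b"\xcd" + value.to_bytes(2, "big")     # 16-bit unsigned
    if isinstance(value, list) and len(value) <= 15:
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict) and len(value) <= 15:
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body
    raise ValueError(f"unsupported value in this sketch: {value!r}")


record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

packed = pack(record)
compact_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(hex(packed[1]), packed[2:10])     # 0xa8 b'userName'
print(len(packed), len(compact_json))   # 66 81
```

Note that the field name userName is still present in the binary output; only the punctuation and number formatting got cheaper.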

But these formats have not become universally popular. The main reason is readability. In many scenarios, using a little more space is acceptable, but making data hard for humans to inspect can impose a large communication cost across teams, such as frontend and backend teams or teams that own different microservices. In those cases, the trade-off may not be worth it.

Protocol Buffers

Compared with schemaless binary encodings, Google's Protocol Buffers and Apache Thrift (originally developed at Facebook) also represent data compactly, but they preserve readability at the schema level. The transmitted data is still binary, but the data structure is explicitly defined in an interface definition language (IDL), such as a .proto file. Developers can read the schema and understand the fields and their meanings. That is why these formats have been more widely adopted.

Protocol Buffers is a binary serialization format that requires you to define the data format in advance. It can encode data more compactly than JSON because it relies on a predefined schema that must be followed, unlike JSON's more permissive open model.

Consider the example from the book. Here is the data in JSON:

{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

For a single record, this is fine. But if you have a long list of records, every item repeats field names such as "userName", "favoriteNumber", and "interests". At scale, those repeated strings waste space and network bandwidth.

With Protocol Buffers, you first define the structure in an IDL:

message Person {
    string user_name = 1;
    int64 favorite_number = 2;
    repeated string interests = 3;
}

This means field number 1 represents user_name, field number 2 represents favorite_number, and field number 3 represents interests. Each encoded field carries only its field number, a small type indicator, and the value; the field names never appear in the encoded bytes. That significantly reduces storage and transmission size: in the book's example the record shrinks to 33 bytes. As long as the receiver has the same definition, it can reconstruct the meaning.
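
To make the layout concrete, here is a hand-written sketch of the Protocol Buffers wire encoding for the Person record, limited to non-negative integers and strings. It does not use the protobuf library; the helper names (varint, tag, string_field, int_field) are invented for this illustration.

```python
VARINT, LEN = 0, 2   # wire types: variable-length integer, length-delimited


def varint(n: int) -> bytes:
    """Encode a non-negative integer in base-128, 7 bits per byte."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Field number and wire type packed into one varint."""
    return varint((field_number << 3) | wire_type)


def string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return tag(field_number, LEN) + varint(len(data)) + data


def int_field(field_number: int, value: int) -> bytes:
    return tag(field_number, VARINT) + varint(value)


encoded = (
    string_field(1, "Martin")         # user_name = 1
    + int_field(2, 1337)              # favorite_number = 2
    + string_field(3, "daydreaming")  # repeated field: one entry per value
    + string_field(3, "hacking")
)

print(len(encoded))        # 33
print(encoded[:11].hex())  # 0a064d617274696e10b90a
```

The first byte 0x0a is field number 1 with wire type 2, and 0x10 is field number 2 with wire type 0. No field name appears anywhere in the output.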

But how do we keep data compatible when fields change?

Protocol Buffers has mechanisms for both forward and backward compatibility. Suppose the Person message adds string email = 4. Older code does not know field 4, but each encoded field carries a type indicator that tells the parser how many bytes to skip, so the old code can treat it as an unknown field and carry on instead of failing. That gives forward compatibility: old code can read data written by new code. Conversely, when newer code reads older data that does not contain email, it sees the default value (an empty string for a string field), which gives backward compatibility: new code can read data written by old code.
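
Both directions can be sketched with a small hand-written decoder. This is an illustration of the idea, not the protobuf library: the field and decode helpers, the schema dictionaries, and the email address are all hypothetical, and the repeated interests field is left out to keep the example short.

```python
def varint(n: int) -> bytes:
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_varint(buf: bytes, pos: int):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def field(number: int, value) -> bytes:
    """Ints use wire type 0 (varint); strings use wire type 2 (length-delimited)."""
    if isinstance(value, int):
        return varint((number << 3) | 0) + varint(value)
    data = value.encode("utf-8")
    return varint((number << 3) | 2) + varint(len(data)) + data


def decode(buf: bytes, schema: dict) -> dict:
    """schema maps field number -> (name, default)."""
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in schema:
            record[schema[number][0]] = value
        # Unknown field number: already skipped, thanks to the wire type.
    return record


OLD_SCHEMA = {1: ("user_name", ""), 2: ("favorite_number", 0)}
NEW_SCHEMA = {**OLD_SCHEMA, 4: ("email", "")}

old_data = field(1, "Martin") + field(2, 1337)
new_data = old_data + field(4, "martin@example.com")

print(decode(new_data, OLD_SCHEMA))   # old code, new data: email is skipped
print(decode(old_data, NEW_SCHEMA))   # new code, old data: email defaults to ""
```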

There is an important rule: if a field is removed, you should not reuse its field number for a different purpose. If favorite_number = 2 is removed and later replaced with string email = 2, old data may be misinterpreted as email. In practice, removed field numbers are usually marked as reserved to avoid accidental reuse.

Changing a field type also requires care. If int32 favorite_number = 2 becomes int64 favorite_number = 2, it may look like a harmless widening of the numeric range. But newer code could write a value too large for older code still reading it as int32, leading to truncation. Schema changes always need to be evaluated against these edge cases.
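
The truncation can be imitated in Python by keeping only the low 32 bits of a value and reinterpreting them as a signed integer. The read_as_int32 function is a hypothetical stand-in for what an older reader with a 32-bit variable would end up with.

```python
def read_as_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low >= 2**31 else low


print(read_as_int32(1337))        # 1337: small values survive unchanged
print(read_as_int32(2**31))       # -2147483648: one past the int32 maximum
print(read_as_int32(2**32 + 5))   # 5: the high bits are simply lost
```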
