Intro
Serialization is one of those things that’s really tempting to build - it’s a constrained problem, has a simple abstraction with lots of depth behind it, is domain-specific so general-purpose solutions leave something to be desired, is relatively easy to test, and can be built and productionized within a reasonable amount of time by a solo engineer. As evidence, I can recall at least 10 different popular serialization formats off the top of my head: Protocol Buffers, JSON, BSON, MsgPack, Cap’n Proto, FlatBuffers, Simple Binary Encoding (SBE), cereal, Avro, Thrift, Arrow Flight, etc. And I’m sure this is missing a large number of alternatives.
In this post, I’d like to talk about serialization & schema design under a specific but rather widely applicable set of constraints:
- There is some data that I’d like to both send from one app to another, and write to long-term storage.
- I’d like serialization, data transfer, and deserialization to be as fast & resource-efficient as possible, within reason. Concretely: (de)serialization should be fast, data size should be relatively small, and data copying & memory allocation should be kept to a minimum.
- I’d like good schema evolution support - stored data must be backwards and forwards compatible, and I must be able to freely update schemas within some compatibility rules.
- I’d like to use a somewhat widely known format, so there is community tooling and support for many languages.
That’s it. These are pretty much the baseline requirements for communication between micro-services (whether through direct RPC or a message broker). They are also appropriate for single-host multi-process communication, logging structured events, frontend-backend communication, etc.
The Landscape
Despite the common enough requirements, I don’t think there is a satisfactory, general-purpose solution to this problem (yet).
Let’s look at a few popular options and their problems:
- JSON / BSON - Increase data size, relatively slow. Additionally, JSON lacks native support for binary data.
- Protocol Buffers - Fine as a pure wire format, but has a large library dependency & generated-code footprint, allocations everywhere, generated language-specific classes that are neither fast nor easy for apps to use, and lacks a robust, established file format for storage (the closest being Riegeli).
- Cap’n Proto / FlatBuffers - Closer to the ideal, but sacrifice a lot in the name of “zero-cost deserialization”. Schema evolution support is a little worse than Protocol Buffers’, data size tends to be larger (no varints), serialization of large messages becomes slower due to the data size, etc.
- SBE - Very fast but very barebones; poor schema evolution support makes it unsuitable as a storage format, it shares the data size downsides of Cap’n Proto & FlatBuffers, and it uses XML as its schema IDL.
What We Want
Going back to our original requirements, what do we really want in a serialization format, from an application’s point of view?
In terms of balancing data size, (de)serialization speed, and schema evolution support, I believe the Protocol Buffers encoding strikes quite a good balance. It has lightweight compression via varints (a small value stored in an int64 takes one or two bytes on the wire instead of eight), great schema evolution support, encoding/decoding can be quite fast, and it has a large community. Considering only the wire format, it’s difficult to beat on all fronts: optimizing for speed on top of the Protocol Buffers encoding usually means sacrificing a little data size & schema evolution, while optimizing for data size means more compression, which sacrifices speed.
Next, let’s talk about what an application really wants from Protocol Buffers but doesn’t get automatically: fast (de)serialization, and ease of use.
A typical application trying to send/receive data will likely have native in-memory types defined for that data (let’s not worry about pure data forwarders here). With stock Protocol Buffers, the app has to either maintain translation logic between the native types and the protobuf-generated types, or be forced to use the generated types for business logic - neither of which is great. Instead, I’d like the Protocol Buffers schema and the native in-memory type to be seamlessly integrated. For example, if I have (examples in C++, as that’s what I mostly work with):
struct Order {
  double price;
  int qty;
  string order_id;
};
and then the following Protocol Buffers schema:
message Order {
  double price = 1;
  int32 qty = 2;
  string order_id = 3;
}
I’d like the C++ type and the protobuf wire format to “map” to each other. I’d like to be able to write:
Order order1;
Serialize(order1, buffer, size);
Order order2;
Deserialize(buffer, size, &order2);
without having to implement or maintain Serialize/Deserialize myself, or update them when schemas change. Achieving this requires a lot of reflection - possibly in both the native language and the schema IDL. Protocol Buffers already provides excellent reflection, and popular languages either support reflection natively, or there’s usually some way to work around the lack of it.
Additionally, Serialize/Deserialize should be zero-allocation and minimal-copy, going straight from native type to bytes and vice versa: no intermediate generated types, and no memory allocation, hence the buffer+size API. This gets us close to Cap’n Proto-level performance out of Protocol Buffers.
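To illustrate that this is feasible, here is a minimal sketch of what an auto-derived Serialize could look like in C++17, using Boost.PFR for compile-time struct reflection (my choice for illustration - a real library could equally use code generation). It maps field i in declaration order to field number i+1, handles only three wire types, and defers bounds checking to an assert:

#include <boost/pfr.hpp>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Appends a varint and returns the new write position.
inline uint8_t* WriteVarint(uint64_t v, uint8_t* out) {
  while (v >= 0x80) { *out++ = uint8_t(v) | 0x80; v >>= 7; }
  *out++ = uint8_t(v);
  return out;
}

// Serializes a "boring" aggregate straight into a caller-provided buffer,
// with zero allocations; returns the number of bytes written.
template <typename T>
size_t Serialize(const T& msg, uint8_t* buffer, size_t size) {
  uint8_t* out = buffer;
  boost::pfr::for_each_field(msg, [&](const auto& field, std::size_t i) {
    using F = std::decay_t<decltype(field)>;
    const uint64_t number = i + 1;  // declaration index -> field number
    if constexpr (std::is_same_v<F, double>) {
      out = WriteVarint((number << 3) | 1, out);  // wire type 1: 64-bit
      std::memcpy(out, &field, sizeof(field));
      out += sizeof(field);
    } else if constexpr (std::is_integral_v<F>) {
      out = WriteVarint((number << 3) | 0, out);  // wire type 0: varint
      out = WriteVarint(uint64_t(int64_t(field)), out);  // int32/int64 sign-extension
    } else if constexpr (std::is_same_v<F, std::string>) {
      out = WriteVarint((number << 3) | 2, out);  // wire type 2: length-delimited
      out = WriteVarint(field.size(), out);
      std::memcpy(out, field.data(), field.size());
      out += field.size();
    }
  });
  assert(size_t(out - buffer) <= size);  // real code must bounds-check as it writes
  return size_t(out - buffer);
}

With this in place, Serialize(order1, buffer, size) works for any aggregate whose fields line up with the schema; a real library would also need the Deserialize direction and compatibility checks against the schema.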
However, it’s unrealistic to always auto-generate Serialize/Deserialize auto-magically between native types and serialization schemas, because wire-format schemas are not expressive enough. Native in-memory types can be much richer - maps, sets, queues, “newtype” wrappers, AoS vs SoA (array-of-structs vs struct-of-arrays) layouts, etc. Perhaps I’d like to translate the following schema:
message NamedPoints {
  repeated string names = 1;
  repeated double xs = 2;
  repeated double ys = 3;
}
to the following native type:
struct Point { double x; double y; };
DEFINE_NEW_TYPE(Name, std::string);
using NamedPoints = std::unordered_map<Name, Point>;
Trying to auto-generate Serialize/Deserialize directly becomes quite pointless in cases like this. Instead, a more realistic goal is to make it as easy as possible to implement a custom zero-allocation Serialize/Deserialize, built on auto-generated helpers derived from the message schema.
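For instance, with a hypothetical generated NamedPointsWriter (the class, its write_* methods, and the Name::value() accessor are all my inventions for illustration), the hand-written glue could shrink to something like:

#include <cstddef>
#include <cstdint>
#include <string_view>

// Imagined output of the schema compiler for the NamedPoints message;
// Point, Name, and NamedPoints are the native types defined above.
class NamedPointsWriter {
 public:
  NamedPointsWriter(uint8_t* buffer, size_t size);  // writes into caller memory
  void write_names(std::string_view name);  // repeated string names = 1;
  void write_xs(double x);                  // repeated double xs = 2;
  void write_ys(double y);                  // repeated double ys = 3;
  size_t bytes_written() const;
};

// Hand-written, zero-allocation glue from the native map to the wire format.
// Each packed field is emitted as one contiguous run, so we make three
// passes; iteration order over an unmodified unordered_map is stable.
size_t Serialize(const NamedPoints& points, uint8_t* buffer, size_t size) {
  NamedPointsWriter w(buffer, size);
  for (const auto& [name, point] : points) w.write_names(name.value());
  for (const auto& [name, point] : points) w.write_xs(point.x);
  for (const auto& [name, point] : points) w.write_ys(point.y);
  return w.bytes_written();
}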
A Better Solution
Given the above, let’s consider what an ideal library for Protocol Buffers looks like. It should support:
- Auto-generation of per-field zero-allocation writers based on a message schema. For example, with the above Order message, there could be an OrderWriter class with methods OrderWriter.write_price, OrderWriter.write_qty, and OrderWriter.write_order_id (see the sketch after this list). Nested messages will require a slightly more complicated API, but that should be solvable.
- For “simple” schemas, direct auto-generation of the Serialize() method. Requirements for such simple schemas might be:
  - One-to-one field mapping between native types and protobuf schema types. No single-field to multi-field mappings, in either direction.
  - Schema fields must have “supported” types - e.g. no oneof, no maps.
  - Fields must have compatible types.
  - Native types should be relatively boring - no special tricks like inheritance, bitfields, etc.
  - That’s mostly it. A lot of other things can be supported through extra protobuf annotations - missing fields, extra fields, field renaming, newtypes.
- Auto-generation of a visitor-style API for decoding serialized messages, such that the user can supply a callback to handle each field in the message schema. Require users to handle every field, or compilation fails. Think something like std::visit for variants in C++.
- For simple schemas with clean per-field mapping, auto-generation of the Deserialize() method.
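Concretely, for the Order message above, the shape of the generated code might look something like this (all names are hypothetical - this library doesn’t exist):

#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical generated per-field writer: appends directly to the caller's
// buffer, no intermediate message object, no allocation.
class OrderWriter {
 public:
  OrderWriter(uint8_t* buffer, size_t size);
  void write_price(double price);            // double price = 1;
  void write_qty(int32_t qty);               // int32 qty = 2;
  void write_order_id(std::string_view id);  // string order_id = 3;
  size_t bytes_written() const;
};

// Hypothetical generated visitor-style decoder: one callback per schema
// field, and the call fails to compile unless every field is handled.
template <typename OnPrice, typename OnQty, typename OnOrderId>
bool VisitOrder(const uint8_t* data, size_t size,
                OnPrice on_price, OnQty on_qty, OnOrderId on_order_id);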
I don’t think this library exists today. There are probably similar closed-source libraries out there, but none open-source that I know of. The closest thing to the above requirements I’ve seen is Pigweed Protobuf, but it is advertised as an embedded library and doesn’t seem intended for use in servers - I’m not sure why (beyond some minor missing feature support), since it looks quite close to an ideal general-purpose solution. It also doesn’t support the visitor-style API mentioned above.
Caveats
I’ve glossed over some considerations to make some points more salient. Notably, serialization requirements differ a lot between problem domains, so the requirements mentioned above aren’t always applicable.
If your problem domain does not need much in terms of schema evolution, you can get true zero-copy deserialization in the vein of Cap’n Proto, or something language-specific like rkyv, where the serialized format is also the in-memory format - even for complicated data structures like hash maps.
If you’d like to do selective reads over serialized data (e.g. via mmap) where the data being read is a small portion of the overall corpus, or if you’d like to do selective in-place mutation of the data, then something like Cap’n Proto or FlatBuffers will serve you better. A downside of Protocol Buffers’ compact wire format is the lack of random access: reading even a single field requires scanning the entire message.
If you only work within a single language, then Protocol Buffers doesn’t buy you much. You can use a language-specific serialization library, which will always be more ergonomic than a language-agnostic one.
Efficient Schema Design
Making schemas generate fast, easy-to-use code is only part of the story. It turns out that if you want good performance out of serialization, designing schemas appropriately for efficiency is just as important.
As application programmers, our intuition may be to try as much as possible to have a one-to-one correspondence between our app’s native types and the wire format schema: an int for an int, an array for a repeated field, an optional for an optional, a hash map for a protobuf map, a sub-message for a protobuf sub-message, etc.
This intuition does not always lead to the best use of protobuf schemas. While it may feel natural, native types and protobuf schemas are ultimately different things: protobuf schemas are quite limited in their type selection and have their own quirks. For example:
- Protobuf maps are very slow in generated code and cause lots of tiny memory allocations.
- Protobuf fields have default values that may not match the defaults of your native types.
- Marking a protobuf field “optional” does not prevent readers from reading its value without presence checks - readers simply see the default value for empty optional fields - so its semantics are subtly different from native optional types.
- Repeated scalar fields are packed; repeated sub-messages cannot be. This leads to large wire-size discrepancies between different schema designs (see the comparison below).
- Integers are encoded as varints by default. Varints are great for small values but not for large ones: a varint spends 7 payload bits per byte, so a present-day nanosecond timestamp stored as an int64 takes 9 bytes instead of 8. If you are storing high-precision timestamps as an int64, think again.
- Sub-messages in protobuf carry additional costs in both serialization and allocation (in generated types), so grouping related fields into a sub-message adds overhead - it isn’t purely a question of business-logic appropriateness.
I’m probably missing other gotchas. The overall point stands - designing protobuf schemas for performance is a very different game from designing schemas for native types.
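To make the sub-message and packing gotchas concrete, here is a back-of-the-envelope wire-size comparison of two ways to encode 1,000 (x, y) points (my arithmetic; the message names are made up for illustration):

// Array-of-structs: each point costs an outer tag (1 byte), a length byte,
// and two inner tag + 8-byte-double pairs (2 x 9 bytes) = 20 bytes.
// 1,000 points: ~20,000 bytes, parsed as 1,000 tiny sub-messages.
message PointsAoS {
  message Point {
    double x = 1;
    double y = 2;
  }
  repeated Point points = 1;
}

// Struct-of-arrays: each packed field costs one tag (1 byte), one length
// varint (2 bytes for a length of 8,000), and 8,000 bytes of payload = 8,003.
// 1,000 points: ~16,006 bytes, read as two contiguous runs.
message PointsSoA {
  repeated double xs = 1;
  repeated double ys = 2;
}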
My current advice when designing protobuf schemas is the following:
- Prefer structure-of-arrays designs to take advantage of packed scalar fields. You can encode an n-dimensional matrix of numbers with n+1 arrays.
- Default to non-optional fields. Only use optional if business logic really requires presence checks and can’t rely on default-value checks. Non-optional fields save wire size and prevent gratuitous presence checks.
- Use sub-messages sparingly. Use repeated sub-messages even more sparingly. Prefer flat schemas.
- Do not use maps. Validate key uniqueness at app level.
- Use sfixed / fixed types for large integers.
- Prefer many small messages over one large message. Avoid unusually high field numbers - field numbers above 15 already cost two bytes per tag, and above 2047 cost three.
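Putting the advice together, here is a sketch of what a schema designed along these lines might look like (the message and field names are invented for illustration):

// Flat, structure-of-arrays, no maps, no sub-messages, low field numbers.
message OrderUpdates {
  // A present-day epoch-nanosecond timestamp takes 9 bytes as an int64
  // varint, but always 8 as sfixed64.
  repeated sfixed64 timestamps_ns = 1;
  // Parallel packed scalar arrays instead of one tiny sub-message per update.
  repeated double prices = 2;
  repeated int32 qtys = 3;
  // Strings can't be packed, but stay cheap in a flat repeated field; key
  // uniqueness is validated at the app level instead of using a map.
  repeated string order_ids = 4;
}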
Conclusion?
You might expect this to be where I start shilling my own library, but there isn’t one! You’re welcome to build it if you agree this is a good idea. Or, if you know of a serialization format that readily solves these problems better than what’s proposed here, please do let me know.
Protocol Buffers are great. Let’s use them to their fullest potential.