Shouldn’t the fact that there is a better way be reason enough? Is a thousand times faster reason enough?
When we talk about data serialization we usually mean converting between two data formats. A format we can save to disk, or send over the network, and a format we keep in memory. Now would it not be awesome to have a format that we can use for both? Cross language and no conversions, just flat, direct access. This is exactly what FlatBuffers is.
Optimized for performance
FlatBuffers are fast, very fast, simply because the format on disk is the same as the format in memory. Because the format is fully indexed, you can access just the part of the data you are interested in, instantly. The only overhead is reading the file into memory. And there are some things we can do to improve the reading and writing speed as well, as we will see later on.
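To make "fully indexed" a bit more concrete, here is a minimal sketch using only java.nio. The layout (a small offset table followed by three floats) and the class name IndexedRecord are made up for illustration and are not the real FlatBuffers wire format. The point is only that a reader can jump straight to one field without parsing the rest.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Toy illustration of indexed access; NOT the real FlatBuffers wire format. */
final class IndexedRecord {
    private static final int FIELD_COUNT = 3;
    private static final int INDEX_BYTES = FIELD_COUNT * Integer.BYTES;

    /** Writes an offset table followed by three float fields. */
    static ByteBuffer write(float x, float y, float z) {
        ByteBuffer buf = ByteBuffer.allocate(INDEX_BYTES + FIELD_COUNT * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        float[] values = {x, y, z};
        for (int i = 0; i < FIELD_COUNT; i++) {
            int offset = INDEX_BYTES + i * Float.BYTES;
            buf.putInt(i * Integer.BYTES, offset); // index entry for field i
            buf.putFloat(offset, values[i]);       // the field itself
        }
        return buf;
    }

    /** Reads one field by following its index entry; no other field is touched. */
    static float readField(ByteBuffer buf, int field) {
        int offset = buf.getInt(field * Integer.BYTES);
        return buf.getFloat(offset);
    }

    public static void main(String[] args) {
        ByteBuffer buf = write(1.5f, 2.5f, 3.5f);
        System.out.println(readField(buf, 2)); // reads only the third field
    }
}
```

Reading a field costs two absolute reads from the buffer, regardless of how many other fields the record holds.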
Another huge pro of FlatBuffers is that they are completely type safe. Because all clients are generated from the same schema, every client knows what is where in the data file. So it knows there is a float at index x. Whenever that value is read it will be a float, and whenever you write it, it will be a float. All generated accessors are typed, so there is no way to accidentally write the wrong type at that position. This type safety is preserved cross platform and cross language. Anything written in C can safely be read in Java, C#, Go or even PHP, and vice versa.
FlatBuffers are also a lot smaller than other data formats. There is no need for field names, and there is no need for separators and markup to improve readability. Nobody needs to read the raw data anyway. When you are debugging your code you can still access every field using its accessor. It is really no different from using JSON, but instead of reading from the JSON data, you read from the FlatBuffer object. The APIs ensure the data is written and read correctly.
Because the data format contains almost no overhead, it is also faster to transmit it over the network and it takes less space on disk. This is also true when applying gzip compression. The format can still be compressed very well.
Something that is easy to forget when serializing data is allocations. Allocations cause delays and the memory needs to be freed at some point. Platforms that use a garbage collector, and especially Android, benefit greatly when less garbage is generated. When used correctly, FlatBuffers generate almost no garbage at all. You should be aware, however, that object reads, including strings, do allocate a new object every time you read them.
So cache the result if possible when you are accessing a nested object repeatedly. You can also choose to wrap your FlatBuffer object in an object that caches the allocated objects for you. This allows you to reduce the number of allocations even further, at the cost of an extra wrapper class. Just remember to measure before you start applying optimizations.
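A caching wrapper could look roughly like the sketch below. CachedField is a hypothetical helper, not part of FlatBuffers, and the lambda in the demo merely stands in for a generated accessor that allocates on every call. The sketch is not thread-safe.

```java
import java.util.function.Supplier;

/** Hypothetical caching wrapper: resolves a value once and reuses it afterwards. */
final class CachedField<T> {
    private final Supplier<T> accessor; // stand-in for a generated, allocating accessor
    private T value;
    private boolean loaded;

    CachedField(Supplier<T> accessor) {
        this.accessor = accessor;
    }

    /** Calls the accessor on first use only; later calls return the cached object. */
    T get() {
        if (!loaded) {
            value = accessor.get();
            loaded = true;
        }
        return value;
    }
}

final class CachedFieldDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // A real accessor would decode a string or nested object from the buffer.
        CachedField<String> name = new CachedField<>(() -> {
            calls[0]++;
            return new String("Orc");
        });
        name.get();
        name.get();
        System.out.println(calls[0]); // counts how often the accessor actually ran
    }
}
```

Caching like this is only safe while the underlying buffer does not change. If you update the buffer in place, the cached objects can go stale.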
Unfortunately there is one catch when using FlatBuffers. FlatBuffers are immutable. There are two approaches you can take to update them.
Your first choice is to regenerate your FlatBuffer with the updated data. Even though this is still very fast, there may be a better option, although that one does not work in all cases.
The second, and most of the time better, option is to generate mutators when you run the schema compiler. This allows you to update all fields that have a static size. In practice this is everything except strings and arrays. And although you can’t change the length of an array, you can still manipulate the objects inside it. You just can’t add or remove items.
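Why fixed-size values can be updated in place while lengths cannot is easiest to see on a toy layout. The sketch below uses plain java.nio and an invented class, FixedFloatArray, with a 4-byte count followed by the elements. It is a simplification for illustration, not the generated mutator API or the real FlatBuffers encoding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Simplified fixed-length float array; a toy layout, not the real FlatBuffers encoding. */
final class FixedFloatArray {
    /** Layout: a 4-byte element count followed by the elements themselves. */
    static ByteBuffer write(float... values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + values.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0, values.length);
        for (int i = 0; i < values.length; i++) {
            buf.putFloat(Integer.BYTES + i * Float.BYTES, values[i]);
        }
        return buf;
    }

    static int length(ByteBuffer buf) {
        return buf.getInt(0);
    }

    static float get(ByteBuffer buf, int index) {
        return buf.getFloat(Integer.BYTES + index * Float.BYTES);
    }

    /** Overwrites one element in place; returns false when the slot does not exist. */
    static boolean mutate(ByteBuffer buf, int index, float value) {
        if (index < 0 || index >= length(buf)) {
            return false; // growing the array would require rebuilding the buffer
        }
        buf.putFloat(Integer.BYTES + index * Float.BYTES, value);
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer buf = write(1f, 2f, 3f);
        System.out.println(mutate(buf, 1, 20f)); // existing slot: updated in place
        System.out.println(mutate(buf, 3, 40f)); // no fourth slot: rejected
        System.out.println(get(buf, 1));
    }
}
```

Every element occupies a slot of known size, so overwriting one never moves its neighbours. Adding an element would shift everything that follows, which is why that requires a rebuild.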
Actually there is a third option. Facebook has been using FlatBuffers for some time now, and they have implemented their own method of storing updates. They store the updates in a secondary FlatBuffer so they do not have to regenerate the entire buffer every time. Read more about this.
FlatBuffers in practice
All I can say is: “When you use ’em, use ’em right”. There are multiple ways of using FlatBuffers when you write them to disk or to the network. FlatBuffer performance is great and there are a few things you can do to make them perform even better.
Writing to FlatBuffers
But before we talk about persisting them, there are a few things you should be aware of. FlatBuffers are created using the builder. The most important thing to know is that you cannot write nested objects inline. So before you start writing an object, you first write all of its nested objects to the buffer, and only then the object itself. This may seem a bit odd at the beginning, but in practice it does not really matter.
To persist a FlatBuffer you have two choices. You can convert it to a byte array and write it to an OutputStream like most people usually do, or you can use something a bit more powerful: channels. Channels are Java's way of exposing lower-level operating system APIs. For example, they allow you to allocate direct buffers, or to map a file directly into memory, and read data from it or write data to it. These are low-level operating system operations and can be a lot faster than using streams. In practice this means you keep writing the ByteBuffer to the channel for as long as bytes are remaining. Flip the buffer first if you filled it yourself; the buffer returned by the builder's dataBuffer() is already positioned at the start of the data.
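The write loop can be sketched with nothing but java.nio. ChannelWriter is an illustrative name, and the wrapped byte array in main is a stand-in for the ByteBuffer you would get from the builder. The sketch assumes the buffer is already positioned at the start of the data to write.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelWriter {
    /** Writes all remaining bytes of data to path, replacing any existing content. */
    static void write(Path path, ByteBuffer data) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            // A single write() call may not drain the buffer, so loop until it is empty.
            while (data.hasRemaining()) {
                channel.write(data);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("flatbuffer", ".bin");
        // Stand-in payload; in real code this would be the builder's data buffer.
        ByteBuffer data = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        write(path, data);
        System.out.println(Files.size(path) + " bytes written to " + path);
    }
}
```

The same loop works for a SocketChannel, where partial writes are far more common than with files.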
When reading FlatBuffers, you have the same choice: either use an InputStream, or an operating system optimized FileChannel (or SocketChannel). The cool thing here is that if the file isn’t too large, you can map it into your memory space and create a FlatBuffer on top of it. This is even faster because the data does not need to be copied into local memory. Instead you read straight from a memory-mapped ByteBuffer. Just call getRootAsMyObject on the MyObject class, providing the ByteBuffer, to start reading data.
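Mapping a file into memory might look like the sketch below. MappedReader is an illustrative name, and the four-byte file written in main is only a stand-in for a real FlatBuffer file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedReader {
    /** Maps the whole file read-only; no bytes are copied onto the Java heap. */
    static ByteBuffer map(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel has been closed.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return mapped.order(ByteOrder.LITTLE_ENDIAN);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("flatbuffer", ".bin");
        Files.write(path, new byte[] {0x2A, 0, 0, 0}); // stand-in file content
        ByteBuffer buffer = map(path);
        System.out.println(buffer.getInt(0)); // little-endian int read from the mapping
    }
}
```

With a real FlatBuffer file, the returned buffer is what you would hand to getRootAsMyObject on your generated class.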
Evolving your data format
As your projects evolve, so do your data formats. And FlatBuffers supports evolving data models as well. You can add and remove fields as needed and everything will just keep on working. New fields will be ignored on older clients and old fields will be ignored on newer clients. For example, when a field becomes deprecated, just mark it as deprecated in your schema and no accessors will be generated for it. One thing to note is that new fields must always be added after existing fields. This makes sure they do not conflict with existing field indexes.
The most obvious use cases for FlatBuffers are sending data across the network and persisting it to disk. But there are some other useful cases as well. For example, the Nearby API allows you to send data between devices as byte arrays. Another good example is sending data from a phone to a watch. Both of these cases become a lot simpler when using FlatBuffers. The data format suddenly becomes known and well documented (schemas) and can evolve without compatibility problems. And in the case of the watch, you can store the data you received locally and load it instantly when the app is restarted.
Google has published several benchmarks comparing FlatBuffers to different serialization technologies. And FlatBuffers are about 1000 times as fast as other solutions like JSON. Facebook reduced the time needed to load a story from 36ms to 4ms and reduced transient memory allocations by 75 percent. See the conclusion of this document.
It all started with a simple question: “Why can’t data on disk have the same structure as the data in memory?” It is funny that something we never give much thought to can have such an impact on performance. I will definitely start making use of FlatBuffers in my projects whenever there is a good use case for it. And it turns out, most of the time there is. Apps tend to spend a lot of time dealing with data.
Next we need to get the people who design our REST APIs to embrace FlatBuffers and add support for them as well. Shouldn’t the fact that there is a better way to do things be the biggest motivator to start using it, especially when it is at least 1000 times faster?
This post first appeared on the warm beer blog.