Comparing the performance of various serializers

Written by kingoipo | Published 2017/03/29
Tech Story Tags: programming | benchmark | csharp | json | binary


Various articles benchmarking serializers already exist, as do plenty of GitHub projects with no article attached to them. This goes for every language, a testament to how much performance matters to some people.

However, it’s incredibly hard to create reproducible, consistent benchmarks. That’s not so surprising: as Hanselman points out, there are many factors to control for:

  • CPU Affinity/Process Priority
  • Other running processes on your test machine
  • If present, handle Garbage Collection without skewing results
  • Using the correct time measuring calls
  • Using the frameworks/libraries under test properly
  • Taking result outliers into account
  • Displaying the results in an easy to interpret way

And probably some other factors that I’ve missed. It’s no surprise then that many existing benchmarks can be criticised for missing one or more of these points. Let’s take a look at some:

GLD.SerializerBenchmark

Even though it’s been superseded by serbench, I wanted to give it some attention because it illustrates a lot of common oversights, even though at first glance it seems a valuable source of information.

The first thing you notice with GLD.SerializerBenchmark is a build error:

This apparently needs a manual DLL download, as described by a comment in the code. After fixing that, don’t forget to pass “100” as a program argument, as stated in the article. This produces the following output:

Full output here

As you can see, not all tests run successfully. There are plenty of exceptions and failing checks, which makes comparison with other benchmarks, or even with its own earlier results, harder. Not to mention that the article doesn’t give the full output to compare against.

Aside from problems running it, there are a couple of measuring problems I noticed:

  • It sets neither CPU Affinity nor Process Priority
  • In the Jil serializer, the options could be given a default, as is done here.
  • Also in the Jil serializer, constructing a StreamWriter is not necessary, but it is supported in case you need it for something like WebAPI
  • Even though the parent class contains an overridable Initialize member, it’s never actually overridden, and initialization sometimes even happens during the serialize call. Serializers such as Jil use heavy reflection, but only on the first call, making the initialize step crucial for proper results (see the warm-up sketch after this list).
  • The NetJSON serializer contains an unnecessary extra StringWriter
  • The Garbage Collector isn’t run between test runs, leading to possible garbage collections during crucial code execution.
  • All results are averaged, making it impossible to see what the minimum and maximum values were; nor is it obvious how many samples fell within a certain range (i.e. how much jitter there is)
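
To illustrate why that initialize/warm-up step matters for a library like Jil, here is a minimal sketch; the helper name is mine, not GLD’s:

```csharp
using Jil;

static class JilWarmUp
{
    // Jil builds its (de)serializers via reflection/IL emit on the first
    // call for each type, so that one-time cost must be paid outside the
    // timed runs rather than during them.
    public static void WarmUp<T>(T sample)
    {
        string json = JSON.Serialize(sample); // pays the one-time build cost
        JSON.Deserialize<T>(json);            // same for the deserialize path
    }
}
```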

A lot of work has gone into adding many, though not all, of the serializers used in the .NET ecosystem. But I wouldn’t trust this benchmark to produce very consistent measurements.

Serbench

GLD.SerializerBenchmark’s README points to this project. However, at the time of writing, some of the shortcomings of the GLD project also exist in Serbench: the Jil serializer contains the same unnecessary StreamWriter/StringWriter, Jil is still not initialized properly, and CPU Affinity and Process Priority are not set.

It does seem to collect the garbage before every run, and it has removed the duplicate StringWriter, gives Jil a new Options object every run, and includes the Apolyton.FastJson reference directly in the repository. Despite that, there is no mention of a memory profile to ascertain whether collection happens exclusively when induced or also during test runs. Also, serbench creates an extra thread on which to run all the tests, which strikes me as odd.

However, I couldn’t get serbench to work. Apparently it requires something called NFX, which is capable of writing the results to a variety of outputs, such as an RDBMS. They seem to be aiming for a benchmark that emulates running serializers in parallel, supposedly making it more true to real life than the synthetic benchmarks usually shown. While chain-benchmarking would certainly be interesting, I’m quite skeptical as to what they’re trying to prove. Isolating serializers in synthetic benchmarks makes them easy to compare, but once you introduce a whole chain of software stacks, you’re bound to run into the issue that everyone has their own combination. Soon you’d have to create an all-encompassing benchmark suite that runs on multiple platforms and has pluggable frameworks and pluggable serializers.

Admittedly, I did not spend too much time getting it to work, but I do expect to be able to open the project, press start, and have it work. I hope they’ll make it easier to use in the future.

SimpleSpeedTest

While not a complete benchmark for a vast array of serializers, it provides a relatively easy setup to create one.

One of the examples shows how to use it, and in it you can see that it avoids the smaller oversights found in the GLD and serbench projects. It only uses streams where necessary, though it doesn’t compare the same library with and without streams. It also runs garbage collection, but like GLD, it averages all the results into one number.

While the results will be reasonably trustworthy, here too, I am missing CPU Affinity and Process Priority. I can see this working for a quick one-on-one comparison, but not for a full benchmark suite.

A Better Comparison?

I’m sure I haven’t looked at all possible benchmark software for serializers in the .NET environment; my time is limited, after all. However, I did create a project which aims to account for as many factors as possible, for a wide variety of serializers, in an isolated, synthetic benchmark style.

Hardware & Software

As is usual in proper benchmarks, to minimize variance and increase reproducibility, I’ll state the hardware and software used.

As you can see in these screenshots, I run all tests on a non-overclocked i5-4570 with 8 GB of RAM at the following clock/timings.

As for the software, I’m running 64-bit Windows 10 Professional with the latest updates, all applications except File Explorer closed, and automatic updates and all privacy-sensitive settings disabled. I noticed that automatic updates easily take up all your disk I/O and one whole CPU core while downloading and installing updates in the background, which would not be viable for a benchmark.

I’ve used Visual Studio 2017 (not an update version) with .NET Framework 4.6.2 to compile the C# solution.

For used libraries in C# see the packages.config.

For NodeJS I used version 7.7.4.

For C++ I used Visual Studio 2017 (not an update version) with Cereal 1.2.2 (which uses rapidjson and rapidxml) and protobuf 3.2.0 (the static library can be found in the repository).

Methodology

What most other benchmarks do is create a couple of objects to serialize and deserialize, run them a number of times in a row, and calculate the average. While this gives you a representation of the total time required, it does lose some valuable data.

For this project, I want to create a moderately large object, measure serialization and deserialization separately, and store every single run in a list of measurements. This way I can create an OHLC graph. However, I’m going to alter the definition of the various points: high and low will be the highest and lowest times measured respectively, but the open point will be the 20th-percentile measurement and the close point will be the 80th-percentile measurement.

On the left you can see an example with 250 repetitions. The fastest sample (L for Low) is 2117 µs, the slowest sample (H for High) is 2888 µs. All measurements are sorted from lowest to highest. The 20th-percentile measurement (O) is measurement #50, which is 2117 µs; the 80th-percentile measurement (C) is measurement #200, which is 2302 µs. You can see that most of the samples (60% of them) are in the 2117–2302 µs range, with only a couple of outliers below that but more outliers above it.

This way, you have information about how jittery/consistent the library is, as well as a general idea of how the library will perform.
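
In code, deriving the four points from a list of timings could look like this minimal sketch (the class and method names are mine, and it assumes a reasonably large sample count):

```csharp
using System.Collections.Generic;
using System.Linq;

static class Ohlc
{
    // Derives the OHLC points described above from per-run timings in µs.
    public static (double Open, double High, double Low, double Close) From(IEnumerable<double> timings)
    {
        double[] sorted = timings.OrderBy(t => t).ToArray();
        int n = sorted.Length;

        double low   = sorted[0];                 // fastest run
        double high  = sorted[n - 1];             // slowest run
        double open  = sorted[n * 20 / 100 - 1];  // 20th percentile: sample #50 of 250
        double close = sorted[n * 80 / 100 - 1];  // 80th percentile: sample #200 of 250
        return (open, high, low, close);
    }
}
```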

Further, all benchmark processes will be run on CPU #0 (so CPU affinity is set) with Process Priority High, and they will be run as administrator so that the process itself can apply the previous two settings.

As a last step, I’ll profile the memory to see if garbage collections are strictly occurring when I want them to and not during a test.

My first test will be Jil in various setups (normal, streaming, with and without attributes on the data object class, with and without options) on x86 and x64.

My second test will be a few JSON serializers in C# on x86 and x64.

My third test will be a few binary serializers in C# on x86 and x64.

My fourth test will compare a couple of serializers in C# to ones in C++ and NodeJS.

Code

Just to make sure I’ve done it correctly, I’d like to run you through some of my code.

The first thing the program does is set Affinity and Priority.

Full code here
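
In essence it boils down to something like this (a sketch; see the linked code for the real version):

```csharp
using System;
using System.Diagnostics;

static class ProcessSetup
{
    public static void SetAffinityAndPriority()
    {
        Process current = Process.GetCurrentProcess();

        // Bitmask 0x1 pins the process to CPU #0, so the scheduler
        // can't migrate the benchmark between cores mid-run.
        current.ProcessorAffinity = (IntPtr)1;

        // High priority reduces interference from other processes; this
        // (and the affinity above) is why the benchmark runs as administrator.
        current.PriorityClass = ProcessPriorityClass.High;
    }
}
```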

This is the code used to run a single measurement. .NET 4.6 introduces the GC.TryStartNoGCRegion function, which allows you to tell the garbage collector to pre-allocate memory and try not to run a collection until you end the region. I try to allocate 1 MB before calling the action.

Full code here
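
A minimal sketch of that measurement pattern, assuming the linked code follows the GC.TryStartNoGCRegion approach described above (names are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

static class SingleRun
{
    // Times one invocation of the action inside a no-GC region, in µs.
    public static double Measure(Action action)
    {
        // Collect up front so the heap is clean before the region starts.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // Pre-allocate ~1 MB and ask the GC not to run inside the region.
        bool noGc = GC.TryStartNoGCRegion(1024 * 1024);

        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();

        // Only end the region if it is still active (allocating past the
        // budget can end it implicitly).
        if (noGc && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
            GC.EndNoGCRegion();

        return sw.Elapsed.TotalMilliseconds * 1000.0;
    }
}
```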

Each test is warmed up, so we don’t measure cold startup time. Then each test is run for 250 repetitions, which is hard-coded. Technically, the warm-up is the only “measurement” that is thrown out for all tests.

Full code here
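
Shaped roughly like this, reusing the hypothetical SingleRun.Measure from the sketch above:

```csharp
using System;

static class TestRunner
{
    private const int Repetitions = 250; // hard-coded, as mentioned above

    // Warm-up first (not recorded), then take 250 individual samples.
    public static double[] Run(Action action)
    {
        action(); // warm-up: pays JIT and first-call reflection costs

        var samples = new double[Repetitions];
        for (int i = 0; i < Repetitions; i++)
            samples[i] = SingleRun.Measure(action);
        return samples;
    }
}
```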

For all tests, I create an object with 1000 documents. I tried to make it a representative object, containing datetimes, UTF-8 strings and an integer. I realise that many more combinations are possible, but I’m not sure they would add much.

Of course, there are some variations on the specific action, such as when streams are required, when a specific file with preloaded JSON/XML/binary content is read into memory before running the test, or when a different type of Person/Document is required for the library, but this is the basic structure for all tests.
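
The data object could look roughly like this; the Person/Document split and the field kinds come from the text above, while the property names are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative shape: one root object holding 1000 documents, each with
// a datetime, a UTF-8 string and an integer.
public class Document
{
    public DateTime CreatedOn { get; set; }
    public string Text { get; set; }
    public int Index { get; set; }
}

public class Person
{
    public List<Document> Documents { get; set; }
}
```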

Results — Test #1

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x86 run, be aware of the Y-axis not starting at 0

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x64 run, be aware of the Y-axis not starting at 0

I apologise for not letting the Y-axis start at 0, I am using Live Charts, and I have yet to find out how to change that. If you know how, let me know!

The first thing you see in these graphs is that Jil is fast and consistent, especially for a JSON serializer. Most of the calls are under 1 ms. The consistency is probably due to it being fast, as we’ll see in the next tests. Second, stream serialization slows it down, but a StringWriter apparently speeds it up. Third, “With Attributes” means that the data object was created with DataContract and Serializable attributes on the class. What actually happens is that Jil fails to recognise some of the DataContract attributes, which leads to datetimes not being serialized/deserialized, so the speedup you see in the graph is entirely attributable to that. Lastly, deserialization with streams is a lot slower than without, which is a shame: when using Jil with WebAPI, the stream API is used instead of the direct version.
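
For reference, the “normal” and “streaming” variants map onto Jil’s two public entry points. A sketch, using the illustrative Person type from earlier and an illustrative options object:

```csharp
using System.IO;
using Jil;

static class JilVariants
{
    public static void Demonstrate(Person person)
    {
        var options = new Options(dateFormat: DateTimeFormat.ISO8601); // illustrative choice

        // "Ser": direct string output.
        string json = JSON.Serialize(person, options);

        // "StrSer": serialize into a TextWriter, which is the path WebAPI uses.
        using (var writer = new StringWriter())
            JSON.Serialize(person, writer, options);

        // "Des" and "StrDes": the mirrored deserialize calls.
        Person fromString = JSON.Deserialize<Person>(json, options);
        using (var reader = new StringReader(json))
        {
            Person fromReader = JSON.Deserialize<Person>(reader, options);
        }
    }
}
```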

The profile will be at the end, since all of the results are from one run of the benchmark.

Results — Test #2

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x86 run, be aware of the Y-axis not starting at 0

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x64 run, be aware of the Y-axis not starting at 0

This is where it starts to get interesting! One of the findings of GLD.SerializerBenchmark was that NetJSON was faster than Jil, but not in this benchmark! Except for x64 serialization, Jil is faster by a pretty noticeable margin. And even in the x64 serialization case, NetJSON is only ~75 µs faster, whereas in x64 deserialization NetJSON is 374 µs slower per call.

Newtonsoft.Json is 2–3x as slow as either Jil or NetJSON, but the real slowpoke is the DataContractJsonSerializer deserializer, which ships with the .NET Framework. It is about 6–8x slower than Jil and NetJSON, as well as being the least consistent of all the frameworks. I have no explanation as to why, though.

Results — Test #3

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x86 run, be aware of the Y-axis not starting at 0

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, x64 run, be aware of the Y-axis not starting at 0

As I had expected, binary serializers can be even faster than JSON serializers; they don’t have to parse text, after all. But what really surprised me was that ZeroFormatter is so incredibly fast. Its bold claims on GitHub are indeed no lie: it’s faster than Hyperion (the successor to Wire) and protobuf.

If you compare Hyperion or protobuf to Jil or NetJSON, you won’t find much of a speed difference, in C# at least. MsgPack is a bit slower, but you really want to steer clear of BinaryFormatter.

Results — Test #4

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, C++ x64 run, be aware of the Y-axis not starting at 0

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, C++ x64 run

Ser = Serialization test, Des = Deserialization test, StrSer = Stream Serialization, StrDes = Stream Deserialization, NodeJS x64 run, be aware of the Y-axis not starting at 0

These last three graphs cover serializers in C++ (the first two graphs) and Node.JS (the third).

What’s interesting about the first graph is that I had expected JSON and XML serialization in C++ to be faster than in C#. But somehow, cereal’s JSON serialization is as slow as DataContractJsonSerializer’s deserialization. Its deserialization is incredibly fast, though; I’m not sure if I found a performance bug or if I did something wrong. That’s why in the second graph I removed JSON and XML, and there you can see that protobuf is indeed faster in C++. More importantly, since C++ has no garbage collector to interfere, the results are incredibly consistent. The second graph might make it look as jittery as C#, but it’s actually all within a few hundred µs.

The third and last graph shows the results of a couple of Node.JS serializers: the default one, of course, plus a couple that claim to be faster. The thing is, I think I might be having trouble getting an accurate time in JS. Node.JS supports hrtime, which should be accurate, but still my results are all over the place. I’m not sure I could call the Node.JS benchmark accurate.

Results — Profiling

To be sure that my C# benchmarks were not hindered by garbage collection, I did a memory profile.

Garbage Collection

As you can see, garbage collection only happens when GC.Collect() is called.

CPU Profile

In the CPU profile, most of the time is spent in P/Invoke calls. I’d hazard a guess that it’s the garbage collector work I’m inducing, since a GC.Collect every 4 milliseconds does add up. Otherwise, the CPU time goes to the serialization libraries.

Raw measurements can be found here: C# x86, C# x64, C++ and Node.JS.

Conclusion

In this (rather long) article, I have shown that most benchmarks miss some steps when doing measurements, that doing measurements correctly is hard, and that binary serialization definitely is faster than JSON serialization.

While I have undoubtedly not tested with a data object that resembles whatever you’re going to use in your software, I think it’s safe to say that Jil and NetJSON are the fastest JSON serializers I tested for C#, and that ZeroFormatter is definitely the fastest binary serializer I tested.

But if you really, REALLY want every ounce of performance, you still have to use C++, or other statically compiled languages.

If you’ve stuck with me this far — thank you for reading! I hope you had as much fun reading as I had making this. If you have any questions, don’t hesitate to ask.

