Why We Built Our Own Data Format

Written by Zaiku | Published 2017/09/24
Tech Story Tags: programming | internet-of-things | startup | tech | microservices


Today we live in a 24/7 data-driven world in which we are estimated to generate around 2.5 quintillion bytes of data per day on average, much of it generated by users of popular services from household names such as Google, Facebook, Amazon, LinkedIn and Netflix. A recent report by Cisco predicts the following:

● By 2020, the gigabyte (GB) equivalent of all movies ever made will cross the global Internet every 2 minutes.

● Globally, IP traffic will reach 511 terabits per second (Tbps) in 2020, the equivalent of 142 million people streaming Internet high-definition (HD) video simultaneously, all day, every day.

● Global IP traffic by 2020 will be equivalent to 504 billion DVDs per year, 42 billion DVDs per month, or 58 million DVDs per hour.

Cisco also recently updated their Global consumer web, email, data traffic prediction for 2016–2021.

Another interesting prediction by Cisco relevant to this post is related to file sharing for 2016–2021.

Building smarter, scalable and fault-tolerant applications that handle such a high volume of data is a huge challenge, but it also represents a big opportunity for both startups and consumers. Disruptive startups (e.g. Blockchain startups) addressing big markets such as financial services are in a strong position to challenge the big incumbent financial institutions, which mostly rely on centralised legacy technologies that were not designed for the current 24/7 explosion of connectivity and consumer-generated big data. For consumers, it opens up more choice and better quality of service at more competitive pricing than the big incumbents offer.

Why do we need a new data format?

Recently we have witnessed a sensational return of distributed systems to the mainstream software industry. Microservices are without doubt amongst the hottest buzzwords in the software industry right now; proof of this is the presence of microservices in the so-called Gartner Hype Cycle! In this post we asked whether early stage startups should adopt microservices.

A couple of years ago the benefits of distributed computing seemed to be centered only on academic research. A commonly cited use case is scientists tackling hard problems in fields such as genomics through cross-disciplinary and cross-institutional collaboration, using grid computing to perform huge data processing and analysis tasks that would otherwise take months to accomplish.

With distributed systems now in the mainstream software industry, we at Nanosai are of the view that when exchanging data between nodes in a distributed system, it is very advantageous to encode data using a fast, compact and versatile data format. We felt that the existing formats (e.g. Protobuf, CBOR, MessagePack, JSON) were not versatile and fast enough for the kind of use cases that we envision for the distributed systems of the future. Other reasons include the following:

  1. A fast data format is of course faster to read and write (deserialize and serialize) for the communicating nodes. We also wanted something that can be traversed in its binary form if developers need maximum speed.

  2. A compact data format requires fewer bytes to represent the encoded data. Fewer bytes require less network bandwidth and can thus be transferred faster across the network.
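
To make the compactness point concrete, here is a minimal sketch that encodes the same record as JSON text and as a fixed binary layout. The record, its field names and the binary layout are assumptions made up for this illustration; this is not the ION encoding.

```python
import json
import struct

# Hypothetical sensor reading, used only for illustration.
reading = {"sensor_id": 1042, "timestamp": 1506211200, "temperature": 21.5}

# Text encoding: field names and every digit are spelled out as characters.
as_json = json.dumps(reading, separators=(",", ":")).encode("utf-8")

# Binary encoding (an assumed layout, NOT ION):
# "<" = little-endian without padding, H = unsigned 16-bit,
# I = unsigned 32-bit, f = 32-bit float.
LAYOUT = "<HIf"
as_binary = struct.pack(
    LAYOUT,
    reading["sensor_id"],
    reading["timestamp"],
    reading["temperature"],
)

json_size = len(as_json)
binary_size = len(as_binary)

# Reading the binary form back yields the original values.
decoded = struct.unpack(LAYOUT, as_binary)

print(f"JSON: {json_size} bytes, binary: {binary_size} bytes")
```

In this sketch the binary form takes 10 bytes against 60 bytes for the JSON text. A fixed layout like this one gives up the self-describing nature of JSON, which is part of the trade-off a general-purpose binary format has to balance.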

Finally, let’s take the example of JSON, which is currently a very popular data format widely used by developers. In fact, it is so popular that it is common for companies to make their APIs JSON only. But JSON also has shortcomings, including the following:

i) JSON is not a good format for raw binary data. Raw bytes must be Base64 or Hex encoded and transferred as strings. Base64 encoding increases the size of the encoded data to 4/3 of the raw size, and Hex encoding increases it to twice the raw size.
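
The size overhead described above can be checked with Python's standard library. This is a small sketch; the helper name `encoded_sizes` and the sample payload are assumptions for illustration.

```python
import base64
import binascii


def encoded_sizes(raw: bytes) -> dict:
    """Return the byte length of raw data and of its Base64 and Hex encodings."""
    return {
        "raw": len(raw),
        "base64": len(base64.b64encode(raw)),
        "hex": len(binascii.hexlify(raw)),
    }


# 3072 bytes of sample data; a multiple of 3, so Base64 adds no padding.
payload = bytes(range(256)) * 12
sizes = encoded_sizes(payload)

print(sizes)
```

For this 3072-byte payload the Base64 form is 4096 bytes (4/3 of the raw size) and the Hex form is 6144 bytes (twice the raw size). When the raw length is not a multiple of 3, Base64 padding adds up to two more characters.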

ii) JSON is not that versatile in the sense that it is not that good at modelling all types of data structures. For example JSON is weak at modelling tables of similar data with rows and columns (e.g. CSV files). JSON would encode such tabular data as arrays of objects, meaning the column name would be repeated for every single object (row) in the table. This is a clear waste of data.
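
The cost of repeating column names can be seen by comparing the usual array-of-objects layout with a layout that states the column names once. This is only a sketch: the sample rows and the `to_table` / `from_table` helpers are hypothetical, and the second layout is still plain JSON, not ION.

```python
import json

# Hypothetical tabular data: 100 rows with the same three columns.
rows = [{"id": i, "name": f"user{i}", "city": "Liverpool"} for i in range(100)]


def to_table(records: list) -> dict:
    """Store column names once, followed by rows of bare values."""
    columns = list(records[0].keys())
    return {
        "columns": columns,
        "rows": [[record[c] for c in columns] for record in records],
    }


def from_table(table: dict) -> list:
    """Rebuild the array-of-objects form from the column/row layout."""
    return [dict(zip(table["columns"], row)) for row in table["rows"]]


as_objects = json.dumps(rows)
as_table = json.dumps(to_table(rows))

print(len(as_objects), len(as_table))
```

The array-of-objects string carries `"id"`, `"name"` and `"city"` in every one of the 100 rows, while the table layout carries them once, so the second string is noticeably shorter and the gap grows with the number of rows.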

iii) JSON is not the fastest data format to read or write. Being verbose it is also slower to transfer, especially for devices with limited bandwidth like small IoT devices, mobile phones on weak connections or ships floating in the middle of the ocean.

What Is ION?

Before proceeding, let us clarify something that may otherwise cause confusion. We published our ION data format well before Amazon Web Services publicly released their format, which is also called ION! We therefore did not choose the name ION because of its similarity to Amazon’s ION. In fact, we did not discover the similarities until after we had designed and named our ION format. The acronym ION derives from “IAP Object Notation”, where IAP, our open network protocol, stands for “Internet Application Protocol”. The evidence is straightforward:

  • Our Co-Founder published an article on Infoq.com about IAP which clearly mentions ION.
  • Our Hacker News announcement about ION can be compared with the Hacker News announcement about Amazon’s ION. As you will notice, our announcement is much older than Amazon’s.

In short, our ION is a versatile binary data format that can be used to encode a wide variety of data. It is expressive enough to contain serialized objects (e.g. Java or C# objects), CSV, JSON, XML, text and binary data. It is very fast and reasonably easy to parse and generate, more compressed on the wire than JSON and XML, and easy to handle for servers and routers and other lightweight hardware (we believe).

ION is one of the central pieces of our open distributed systems stack, as illustrated below. We designed ION as the default data format for our open source protocol IAP; thus, all IAP messages are encoded using ION. IAP is a versatile message-oriented network protocol designed for both synchronous and asynchronous communication, making it suitable for many different use cases such as RPC, file exchange, streaming and message queue subscriptions. We created IAP because existing protocols such as HTTP did not meet our versatility and high performance requirements.

Being a data format, ION can be used independently of our network protocol IAP. Developers can use ION as a data format in data files and log files, as a data format for binary messages transmitted over HTTP, etc. Because it can contain binary data, developers can also embed other formats inside it when necessary, e.g. an MP3 file, ZIP file or JPG file.

We have a more detailed description of how ION compares to other data formats in the posts ION vs. Other Formats and ION Performance Benchmarks. For those curious to try out ION, please check out release 0.5.0. Over the coming weeks we'll be pushing new ION updates and updating the documentation, benchmarks etc.

Nanosai Data Streaming Survey

If you are a developer or a startup CEO/CTO: we are currently working on a persistent data streaming service (built on ION and launching this year) as an alternative to existing offerings such as AWS Kinesis and hosted Kafka services. It would be great if you could spare a minute to complete our short Streaming Survey. HackerNoon readers who complete it will get special free tier accounts (with support) when we launch. Many thanks in advance!

Sensor City UK

We are delighted to have Nanosai join Sensor City's ecosystem of startups building innovative sensor-related technologies. We're particularly excited about Nanosai's use case pilot project around realtime streaming of sensor data, where ION will play a very important role. So watch this space!

Posted by Bambordé Baldé, Co-Founder | Twitter: @cloudbalde | LinkedIn: linkedin.com/in/bambordé


Published by HackerNoon on 2017/09/24