Beneath this mask is Data — Part 4 (or Rise of the Open Data-ards)

“This is your last chance. After this, there is no turning back. You take the blue pill — the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill — you stay in Wonderland, and I show you how deep the rabbit hole goes. Remember: all I’m offering is the truth. Nothing more.” — Morpheus to Neo in ‘The Matrix’

Although the notion of Open Data, and especially Open Government Data, has been around for quite some time, the idea started gaining traction circa 2009. Open Government Data became visible in the mainstream when various Governments, including those of the USA, UK, Canada and New Zealand, announced initiatives to open up their public information. The Indian Government also released some of its data online as part of its Open Government Data initiative. These datasets are large, and their value lies in how they can be reused and recombined in multiple ways.

A woman in Denmark built findtoilet.dk, which shows all Danish public toilets, so that people with bladder problems can trust themselves to go out more. The Indian Institute of Tropical Meteorology has created a service called SAFAR-air (I’m not a big fan of the UI here) that shows a city’s air quality (you could also look at OpenAQ). In New York you can easily find out where you can walk your dog, and find other people who use the same parks. ‘Round’ is an app that helps you find the most beautiful route to a place, rather than the shortest one. Services like ‘mapumental’ in the UK and ‘mapnificent’ in Germany help you find places to live, taking into account the duration of your commute to work, housing prices, and how beautiful an area is. All these applications make use of Open Data.

How to successfully Open Data

Releasing Open Data is largely considered to be the domain of big data collectors and aggregators: Governments, and large organizations like Google, Facebook and the like. This, however, is not the whole story. Individuals can also contribute to this large pool of open data for others to recombine and reuse in their projects.

Here is an example of open data released by an individual. I mapped all the tea stalls on the IIT Bombay campus and created a simple application that shows which shops are open at what times of the day. The data collected for the application was released openly as a CSV file.
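To make the idea concrete, here is a minimal sketch of how someone could reuse a CSV like that. The column names (name, opens, closes), the sample rows and the function names are assumptions made up for illustration; the actual file and application may be laid out quite differently.

```python
import csv
import io
from datetime import time

# Hypothetical sample rows; the real file's columns and values may differ.
SAMPLE_CSV = """name,opens,closes
Stall A,07:00,19:00
Stall B,16:00,02:00
Stall C,10:00,22:30
"""


def parse_hhmm(text):
    """Turn an 'HH:MM' string into a datetime.time."""
    hours, minutes = text.strip().split(":")
    return time(int(hours), int(minutes))


def is_open(opens, closes, now):
    """True if `now` falls in [opens, closes), allowing hours that wrap past midnight."""
    if opens <= closes:
        return opens <= now < closes
    return now >= opens or now < closes


def stalls_open_at(csv_file, now):
    """Return the names of stalls open at `now`, in file order."""
    reader = csv.DictReader(csv_file)
    return [
        row["name"]
        for row in reader
        if is_open(parse_hhmm(row["opens"]), parse_hhmm(row["closes"]), now)
    ]


if __name__ == "__main__":
    # Which stalls could serve a late-night chai?
    print(stalls_open_at(io.StringIO(SAMPLE_CSV), time(23, 30)))
```

Because the data is a plain CSV, the whole lookup needs nothing beyond the standard library, which is rather the point of releasing it in that form.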

Openness of data means that anyone is free to use the data for any purpose, without restrictions from copyrights, patents or other control mechanisms. Cheap and easy access to this data is paramount if it is to be used in novel ways. There are a few rules of thumb one could follow when releasing open data:

Keep it simple: Data provided should be easily machine readable. The most common machine-readable formats are tabular ones like CSV and TSV, and widely used structured formats like JSON and GeoJSON. Stay away from PDFs or Word files when releasing Open data, since these are difficult to parse.
Move fast: A constant flow of new and updated data is important. Data that already exists only on paper should be digitized.
Be pragmatic: It is impossible to have accurate data at all times. Sometimes good enough is good enough. More data often trumps accurate data. It is better to give raw data now than perfect data six months later. Embrace the messiness. This does not mean the data should be blatantly incorrect or false, but there needs to be room for inaccuracy. Perfection is a non-existent state.

Giving cheap and easy access, through a download on the internet or some other means, should be the first priority of Open data providers. Slight inaccuracies are forgivable, and the decision to release the data is at the data provider’s discretion.
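As a rough illustration of the "keep it simple" rule, the sketch below writes the same small dataset out as CSV and as GeoJSON using only Python's standard library. The records, coordinates and function names are hypothetical, chosen just to show the shape of the output.

```python
import csv
import io
import json

# Hypothetical records; names and coordinates are made up for illustration.
RECORDS = [
    {"name": "Stall A", "lat": 19.10, "lon": 72.90},
    {"name": "Stall B", "lat": 19.20, "lon": 72.80},
]


def to_csv(records):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "lat", "lon"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_geojson(records):
    """Serialize records as a GeoJSON FeatureCollection of points."""
    features = [
        {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
            "properties": {"name": r["name"]},
        }
        for r in records
    ]
    return json.dumps({"type": "FeatureCollection", "features": features}, indent=2)


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(to_geojson(RECORDS))
```

Either output can be dropped on a web server as a static file, which is about as cheap and easy as access gets.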

What about security?

Providing cheap and easy access to data enables a large number of individuals to use it in clever ways. It is also of great value to the Governments themselves. With the data already available, people ask fewer questions (or more, but more important, ones), which in turn increases efficiency and reduces costs. However, not all data can be made public, and there will always be exceptions for sensitive data related to national security and the like.

Not all data is of this nature, though, and there will still be people who argue against opening any public data because of security concerns. I believe, however, that the bad guys will get the data anyway, through open or closed means. There are more good guys in the world than bad ones, and keeping the data secret does us more harm than good. The goal is to take the species forward, and in the information age, data is the catalyst.

It is difficult to imagine the full potential of using and reusing Open data. The possibilities are endless. Consider how Dr. John Snow (not the one from Game of Thrones) mapped the locations of cholera victims in London’s Soho district to trace the source of the 1854 outbreak to a contaminated water pump. This supported the conclusion that cholera is a waterborne disease. There are more opportunities now than ever, since data is so abundantly available.

Dr. John Snow and the famous cholera outbreak map

Many Open datasets are extremely large, and it would be difficult, and in many cases impossible, for humans to sift through them manually and generate insights. These datasets can easily contain more than a billion data points, and they are being used extensively to make important decisions related to our lives. We even have a name for this now.

Behold BIG DATA!

To be continued…

P.S

This post is the fourth part in a series. Here are links to the previous posts:

References

There is no spoon