Serious about big data visualization? Consider using MapD.

Last summer MapD open-sourced their technology and made it available to everybody. At that time my colleagues and I at Dimebox were working on a POV for a potential big client that we had to impress. Our data analytics capabilities were already advanced, but they couldn’t handle a lot of data because everything ran client-side. We decided to hop on the MapD train, and the results thus far are pretty amazing.

MapD is a GPU database platform. It consists of a few standalone packages that work together. These are:

MapD Core: an in-memory, column store, SQL relational database that was designed from the ground up to run on GPUs.
MapD Charting: dimensional charting built to work natively with crossfilter, rendered using d3.js.
MapD Crossfilter: JavaScript library for exploring large multivariate datasets in the browser. Based on crossfilter.
MapD Connector: A JavaScript library for connecting to a MapD GPU database and running queries.

Combine them all together and you have a platform that can almost instantly visualize billions of data records.

The GPU database

The big difference between the MapD platform and a lot of other data visualization platforms is that MapD runs on a GPU database. GPU databases offer significant improvements over conventional CPU databases when performing repetitive operations on large amounts of data. This is because a GPU can have thousands of cores, while a CPU usually has just a few. A GPU can therefore process many streams of data simultaneously, where a CPU can handle only a few.

Mark Litwintschik conducted a benchmark with a 1.1 billion record taxi dataset. The results are as follows:

Image from https://www.mapd.com

Query 1: SELECT cab_type, count(*) FROM trips GROUP BY cab_type;

Query 2: SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;

Query 3: SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year;

Query 4: SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc;
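To make concrete what these GROUP BY queries compute, here is a minimal sketch of Query 1 and Query 2 written as plain JavaScript over an in-memory array. This is roughly the kind of work a client-side library has to do in the browser, with every record loaded first. The sample rows and the helper names countBy and averageBy are made up for illustration and are not part of any MapD library.

```javascript
// Hypothetical sample rows; the real benchmark table holds 1.1 billion of them.
const trips = [
  { cab_type: "yellow", passenger_count: 1, total_amount: 12.5 },
  { cab_type: "green", passenger_count: 2, total_amount: 8.0 },
  { cab_type: "yellow", passenger_count: 1, total_amount: 20.0 },
  { cab_type: "yellow", passenger_count: 2, total_amount: 10.0 },
];

// Same idea as Query 1: count the rows in each group.
function countBy(rows, key) {
  const counts = new Map();
  for (const row of rows) {
    counts.set(row[key], (counts.get(row[key]) || 0) + 1);
  }
  return counts;
}

// Same idea as Query 2: average one field within each group.
function averageBy(rows, key, valueField) {
  const totals = new Map();
  for (const row of rows) {
    const entry = totals.get(row[key]) || { sum: 0, n: 0 };
    entry.sum += row[valueField];
    entry.n += 1;
    totals.set(row[key], entry);
  }
  const averages = new Map();
  for (const [group, { sum, n }] of totals) {
    averages.set(group, sum / n);
  }
  return averages;
}

const cabCounts = countBy(trips, "cab_type");
const avgAmounts = averageBy(trips, "passenger_count", "total_amount");
console.log(cabCounts, avgAmounts);
```

Both helpers visit every row once, which is fine for four records but is exactly what stops scaling once the rows no longer fit in the browser.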

System configurations

MapD: 1 machine (16 cores, 512 GB RAM, 2 x 1TB SSD, 8 Nvidia Pascal Titan X GPUs)
Redshift: 6 machines (36 cores, 244 GB RAM, 16TB HDD, AWS ds2.8xlarge)
Presto: 50 machines (4 cores, 15 GB RAM, 100GB SSD, GCP n1-standard-4)
Spark: 11 machines (4 cores, 15 GB RAM, 2 x 40GB storage, AWS m3.xlarge)

As you can see, MapD runs on just one machine, yet it is roughly 10 to 100 times faster than the other options. Pretty awesome, isn’t it?

Visualizing the data

For me as a front-end developer this is obviously the most exciting part. As mentioned earlier, we at Dimebox first used a client-side solution: a combination of dc.js and crossfilter. These libraries are pretty awesome, but since they run client-side the amount of data you can display is limited. MapD solves this problem. With MapD Charting and MapD Crossfilter you get the same libraries, but with the ability to display billions of data records. The possibilities are endless. Here are some examples:

Because of crossfilter all graphs are linked

You can also crossfilter while drawing on a map

The map examples are especially awesome, but you will probably end up using “normal” graphs more often. The supported graphs are:

Bar chart
Bubble chart
Row chart
Pie chart
Line chart
Count chart
Number chart
Geochoropleth chart

Some charts still have issues, but the MapD team is working on fixing those and adding new ones. Nevertheless, combine the charts above and you can already build some pretty powerful dashboards. I have also created an example dashboard with all of these graphs, which you can find on my GitHub profile: https://github.com/luukgruijs/mapd-examples. It is also a nice reference if you want to get started with any of the graphs above.

This all looks very promising

Yes, it does, but there are also a few points that can certainly be improved or should be addressed:

First of all, the documentation is not very rich. A lot of the graphs have no examples, so it’s a bit of a shot in the dark if you’re new to dc.js. There are also no written guidelines yet on how to, for example, leverage MapD in your existing API. You can of course ask yourself whether it’s their job to provide this, but I think it would help with wider adoption and thus more open-source contributions. Luckily there is https://community.mapd.com/, where you can ask questions and usually get quality responses in a decent timeframe.

Second, the database does not support UPDATE and DELETE queries yet, although the MapD team says they are working on this. For now this means you either have to wipe the entire database and re-insert the new data, or work with partly duplicated data.
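One way to live with partly duplicated data is to treat the table as append-only: every change is inserted as a new row with a higher version number, and only the newest row per record counts when reading. The sketch below illustrates that idea in plain JavaScript on an in-memory array. The row layout, the version field and the latestPerId helper are assumptions made for this example; in a real setup you would express the same selection in the queries you send to the database.

```javascript
// Hypothetical append-only rows: an "update" is simply a new row with a higher version.
const rows = [
  { id: 1, status: "open", version: 1 },
  { id: 2, status: "open", version: 1 },
  { id: 1, status: "closed", version: 2 }, // supersedes the first row for id 1
];

// Keep only the row with the highest version for each id.
function latestPerId(allRows) {
  const latest = new Map();
  for (const row of allRows) {
    const seen = latest.get(row.id);
    if (!seen || row.version > seen.version) {
      latest.set(row.id, row);
    }
  }
  return [...latest.values()];
}

const current = latestPerId(rows);
console.log(current);
```

The price of this approach is that old versions keep taking up space until you do wipe and reload the table.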

Third, by default MapD is vulnerable to SQL injection, because queries are sent from the browser to the server. Anyone can intercept a request and extend or change the query in whatever way they like. You need to create some logic on your server to fix this and prevent bad things from happening.
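What that server-side logic looks like depends on your setup. As one possible starting point, here is a minimal sketch of a query allowlist: the server reduces each incoming query to its "shape" (literals replaced by a placeholder, whitespace and case normalized) and only forwards queries whose shape it already knows. The function names queryShape and isQueryAllowed and the example allowlist are made up for illustration, and the sketch assumes your dashboard only ever sends a limited, known set of query shapes. It is not a complete defense and not something MapD ships.

```javascript
// Reduce a SQL string to its shape: literals become "?", whitespace and case are normalized.
function queryShape(sql) {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?") // string literals, including '' escapes
    .replace(/\b\d+(?:\.\d+)?\b/g, "?") // numeric literals
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim()
    .replace(/;$/, "") // tolerate a single trailing semicolon
    .trim()
    .toLowerCase();
}

// Hypothetical allowlist: the query shapes this particular dashboard is expected to send.
const ALLOWED_SHAPES = new Set(
  [
    "SELECT cab_type, count(*) FROM trips GROUP BY cab_type",
    "SELECT cab_type, count(*) FROM trips WHERE passenger_count = 1 GROUP BY cab_type",
  ].map(queryShape)
);

// Accept a query only if its shape is on the allowlist; everything else is rejected.
function isQueryAllowed(sql) {
  return ALLOWED_SHAPES.has(queryShape(sql));
}

console.log(isQueryAllowed("SELECT cab_type, count(*) FROM trips WHERE passenger_count = 4 GROUP BY cab_type"));
console.log(isQueryAllowed("SELECT cab_type, count(*) FROM trips; DROP TABLE trips"));
```

The check fails closed: a query that was extended or restructured in transit no longer matches a known shape and is dropped, while a changed filter value still passes.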

Fourth, MapD has not published their packages on npm yet. You can of course still install them directly from their GitHub, but an npm package would make it a lot easier to add them to existing projects.

Last but not least, GPU instances are relatively expensive. While this is of course not really MapD’s problem, it’s worth mentioning. If, for example, you have multiple clients and need to run multiple GPU instances, things can get costly quite quickly. The cheapest GPU instance on Amazon costs 700 dollars a month. While you always have to put costs like this in perspective, let’s just say you probably can’t use MapD for a fun, data-rich hobby project.

Conclusion

To me MapD is certainly one of the most exciting technologies out there right now. But it’s not for everyone, yet. To use MapD in your existing product you need some knowledge of d3, dc and crossfilter on the front-end. You also need enough back-end knowledge to make everything safe and polished to your needs. I hope the project receives more contributions over the coming months. I have already started contributing to the MapD Charting project myself and am planning to do more. Exciting times!
