Interacting with MapR-DB

Written by anicolaspp | Published 2018/12/11
Tech Story Tags: big-data | mapr | database | spark | hadoop

The MapR Platform is an excellent choice for solving many of the problems associated with the humongous, continuously growing datasets of today's businesses.

The distributed, highly efficient file system, along with the powerful yet simple and standard streaming API, is a key component of the platform's success. However, one of its most celebrated pieces is its distributed, highly available NoSQL JSON database.

MapR-DB supports the HBase API for backward compatibility, but the newer OJAI API is its core.

Let’s look at an example of a document we could store in MapR-DB.
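A user profile might look like the following (the fields shown here are illustrative):

```json
{
  "_id": "user001",
  "first_name": "John",
  "last_name": "Doe",
  "age": 34,
  "address": {
    "city": "San Jose",
    "state": "CA"
  },
  "interests": ["spark", "maprdb", "drill"]
}
```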

This is pure JSON that MapR-DB can store. The document can be as complex as we need: there is virtually no limit on document size, the number of fields, or the depth of nested fields.

Documents are stored across the MapR cluster so that reading and writing from/to a table happens in parallel, distributing the workload and gaining impressive performance numbers as shown in some independent benchmarks.

The benchmark charts (images omitted here) show that MapR-DB can perform significantly more operations per second than its rivals while keeping latency low, constant, and predictable.

The entire comparison can be found here.

When reading or updating documents, MapR-DB knows which parts of a document need to be read or updated, and only those parts are actually touched. MapR-DB tries to manipulate documents, tables, and the underlying file system efficiently in order to keep performance at its best.

Querying MapR-DB

MapR-DB is a NoSQL database, so it does not support SQL natively. The OJAI API is the preferred way to interact with MapR-DB, and by using this API we can take advantage of every feature the database offers.

We can use any of the provided clients to run queries on MapR-DB. An example of creating a document using the Java API is the following.
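A minimal sketch is shown below; the connection string is the standard OJAI driver URL, while the table path and fields are illustrative:

```java
import org.ojai.Document;
import org.ojai.store.Connection;
import org.ojai.store.DocumentStore;
import org.ojai.store.DriverManager;

public class CreateDocument {
    public static void main(String[] args) {
        // Connect to the cluster using the OJAI driver
        final Connection connection = DriverManager.getConnection("ojai:mapr:");

        // Get a handle to the table (path is illustrative)
        final DocumentStore store = connection.getStore("/apps/users");

        // Build a JSON document using the fluent OJAI API
        final Document user = connection
                .newDocument()
                .setId("user001")
                .set("first_name", "John")
                .set("last_name", "Doe")
                .set("age", 34);

        // Persist the document
        store.insertOrReplace(user);

        store.close();
        connection.close();
    }
}
```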

As we can see, the API lets us manipulate objects in a friendly way, as they represent JSON documents.

Through the OJAI API, we can do all kinds of operations against MapR-DB such as inserts, updates, etc…

Basically, from any application able to use the OJAI API we can do most of our work in MapR-DB. However, we could ask ourselves: what about other types of tools that require different processing capabilities?

Examples of these are BI tools doing aggregations such as counts, group-bys, and sums. We should also be able to quickly look at values in the database without having to write applications. Is this possible in MapR-DB? Let's explore our options.

MapR DB Shell

MapR-DB offers a tool called dbshell that can be used to query the database using its native language.

Using the dbshell we can explore what tables we have, query them in all possible ways and more. Let’s see some examples.

Let’s start by listing the tables we have under a path.

Let’s insert some values into this table.

Now, let’s list the documents.

We can query by id.

Or we can use any other fields.
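Putting these steps together, a dbshell session might look like the following (the table path, document fields, and flags are illustrative; run `help` inside dbshell for the exact syntax of your release, and note that since MapR-DB tables live in MapR-FS, `hadoop fs -ls /apps` lists the tables under a path):

```
$ mapr dbshell
maprdb root:> insert /apps/users --value '{"_id":"user001","first_name":"John","last_name":"Doe","age":34}'
maprdb root:> find /apps/users
maprdb root:> find /apps/users --id user001
maprdb root:> find /apps/users --c '{"$eq":{"age":34}}'
```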

Notice how the query is done. This is the OJAI query language and API playing their roles. This is native to MapR-DB. Remember, it is not a SQL database.

As you can imagine, the dbshell is a nice way to get a taste of how MapR-DB works and to do quick, simple explorations. However, it is hard to see it as the preferred tool for large and complex queries.

Let’s continue to explore the options we have and how to use them.

MapR-DB Connector for Apache Spark

MapR offers a connector for Apache Spark that can be used for large data processing on top of MapR-DB.

The connector can be used with the different Spark APIs, such as RDD[A], DStream[A], and DataFrame/Dataset[A].

To use the connector, we must first add the right dependencies to our Spark project. The following is based on the build.sbt file from the [Reactor](https://github.com/anicolaspp/reactor) project.
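A sketch of the relevant pieces looks roughly like this (the artifact versions are illustrative and must match your MapR and Spark release):

```scala
// build.sbt (sketch)

// MapR artifacts are published in MapR's own Maven repository
resolvers += "MapR Releases" at "https://repository.mapr.com/maven/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.3.1"           % Provided,
  "org.apache.spark" %% "spark-sql"    % "2.3.1"           % Provided,
  "com.mapr.db"       % "maprdb-spark" % "2.3.1-mapr-1808" % Provided
)
```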

Now, we should be able to use the connector without problems.
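A sketch of such an app might look like this (table paths are illustrative):

```scala
import com.mapr.db.spark.sql._
import org.apache.spark.sql.SparkSession

object UsersJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("maprdb-example").getOrCreate()

    // Load a MapR-DB JSON table as a DataFrame
    val users = spark.loadFromMapRDB("/apps/users")

    // Any regular Spark transformation applies
    val adults = users.filter(users("age") >= 18)

    // Save the result back to another MapR-DB table
    adults.saveToMapRDB("/apps/adult_users", createTable = true)

    spark.stop()
  }
}
```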

Notice how the connector is used to load and save DataFrames from/to MapR-DB. The same can be done with the other Apache Spark abstractions mentioned before.

Using the MapR-DB connector for Apache Spark opens up limitless possibilities: by combining the distributed nature of MapR-DB and Apache Spark, we can truly process data at scale.

Even though Apache Spark is one of the best tools we can have in our toolset, sometimes it is just not enough. We need to ask ourselves how users with no coding experience can use the powerful features of MapR-DB without going through the Spark learning curve which, frankly, is neither short nor easy.

Distributed Processing using Apache Drill

When we need SQL, we have Drill.

Using Apache Drill, we can query almost any dataset living in the MapR Platform, regardless of where it is stored, how it is formatted, or its size.

Interacting with Drill can be done through its different interfaces. Let's start with the Drill shell, since it offers a very simple, shell-based solution.
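A session might look like the following (the hostname, table path, and fields are illustrative; on a MapR cluster, ZooKeeper typically listens on port 5181):

```
$ sqlline -u "jdbc:drill:zk=node1:5181"
0: jdbc:drill:> SELECT _id, first_name, last_name, age
. . . . . . .> FROM dfs.`/apps/users`
. . . . . . .> WHERE age > 30;
```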

As we can see, we can query MapR-DB, a NoSQL database, using pure SQL through Apache Drill. The result, as expected, comes back as a table. As you might suspect, queries of all kinds can be executed; aggregations are especially interesting.
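For instance, an aggregation over the same illustrative table might look like this:

```sql
SELECT age, COUNT(*) AS total_users
FROM dfs.`/apps/users`
GROUP BY age
ORDER BY total_users DESC;
```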

Running queries like this on top of MapR-DB is mind-blowing. Drill knows exactly how to transform the SQL queries to the underlying MapR-DB query language.

It is important to note that Drill also runs distributed on the MapR cluster, so the same principles of data distribution and high performance continue to apply here.

Other Apache Drill Interfaces

The shell is not the only interface Drill supports. We can also use Drill through the REST interface.
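For example, a query can be submitted by POSTing to a Drillbit's `/query.json` endpoint (the hostname and table path are illustrative; 8047 is Drill's default web port):

```shell
curl -s -X POST http://drill-node:8047/query.json \
  -H 'Content-Type: application/json' \
  -d '{
        "queryType": "SQL",
        "query": "SELECT _id, first_name, age FROM dfs.`/apps/users` LIMIT 10"
      }'
```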

Drill also offers a web interface for friendlier usage. Accompanying these interfaces are the JDBC and ODBC interfaces, which are very important for BI tools like Tableau, MicroStrategy, and others to connect to and interact with Drill.

The same ideas we discussed before apply here. For example, Tableau could connect to Drill through JDBC and Drill will run distributed queries on top of MapR-DB. This makes MapR-DB a very versatile and capable database.

Conclusions

MapR-DB is one of the most capable NoSQL options out there. It offers HBase and JSON capabilities on the same platform, and it runs distributed on the MapR cluster, sharing most of the properties of the underlying platform (MapR-FS). MapR-DB can be queried in many ways: the OJAI API for applications, dbshell for quick and simple interactions, Apache Spark for data processing at scale, and Apache Drill for SQL queries, data analytics, and BI tool integrations. Regardless of the tool being used, MapR-DB keeps performance a priority by maintaining low latency and high throughput at any scale, which makes it a great fit for the next generation of workloads.

Other tools for MapR-DB have been developed independently, for instance [_maprdbcls_](https://github.com/anicolaspp/maprdb-cleaner), which allows deleting documents (records) based on queries.


Published by HackerNoon on 2018/12/11