Going From Data Lakes to Oceans

Written by Shahar | Published 2019/11/12
Tech Story Tags: datasets | data-lakes | nosql | architecture | latest-tech-stories | google-bigquery | aws | database


Aggregating into data lakes is the solution of today — but are Federated Sources the solution of tomorrow?
TL;DR
Data lakes are considered to be a popular solution for uniting insights from ALL the organization’s data sources.
An emerging alternative is the federated data sources approach, which generates the same output with a fraction of the effort and cost, and without the data duplication. In this post, we’ll explore the advantages and disadvantages of both options and where each may make sense for your team.

A perfect database. What If?

Imagine the perfect database: the best aspects of all database types, combined into one, without limitations:
  • Unlimited data storage and processing, as you get with Hadoop
  • The fast query execution you enjoy with Redis or DynamoDB
  • Lightning-fast CRUD (create, read, update, delete)
  • RDBMS capabilities like Joins, Unions, and aggregation
  • Support for structured and unstructured data
  • Querying as if you’re using a data warehouse
  • Permissions configurable per table, column, and row
Even if such a super-database existed, I’m not sure all of these capabilities could logically reside in one system. Either way, it will be a while before we get there. Until then, organizations use multiple databases with different capabilities for different use cases.
But that comes with its own issues.

The problem

Two main issues arise from juggling data across multiple databases and database types. First, because each database uses a different API or query language, each database type requires its own experts on hand: IT/admin, DevOps, and application developers. Backups and maintenance also differ per database. Second, and more important, the organization’s main resource, its data, ends up distributed and duplicated, and cannot be aggregated for global insights in a clean, unified manner.

The data lake solution

The first solution, which has become a common buzzword over the past few years, is to build a data lake. The rationale for having a data lake is to bring all data into one place, where it can be queried using a single query engine.
However, getting the data into the lake requires many ETL (Extract, Transform, and Load) processes, using tools like Airflow, AWS Glue, and Google Cloud Dataflow, among others. Along the way, data is duplicated and often not refreshed as frequently as needed. Every organization has its data graveyards: huge stores of data that no one knows about or has the courage to delete. With data lakes, the graveyard’s size will multiply.
Lastly, ETL processes can raise security and regulatory concerns, they carry high maintenance and cost overhead, and current lake solutions are provider-locked, exposing the data only to some of that provider’s own services.
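To make the duplication point concrete, here is a minimal sketch of the kind of load step a lake-bound ETL pipeline typically schedules. The dataset and table names (staging.orders_export, lake.orders_history) are hypothetical, and the syntax loosely follows BigQuery-style SQL:

-- Hypothetical nightly load: copy yesterday's orders from an exported
-- staging table into the lake's history table. Every row now lives in
-- (at least) two places, and is only as fresh as the last run of this job.
insert into lake.orders_history (order_id, customer_id, amount, order_ts)
select order_id, customer_id, amount, order_ts
from staging.orders_export
where date(order_ts) = date_sub(current_date(), interval 1 day);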

The Federation solution

Federated Sources brings a new approach to the table. Each database processes and stores its own data, as it was meant to. Data is not transferred out. Instead, one engine queries multiple types of databases and seamlessly merges the results.
What does an engine capable of querying federated sources need?
It must have three important characteristics:
  1. One common API or language to query every database (SQL?)
  2. A mechanism that can unify all the data from disparate sources so it can be queried and aggregated.
  3. The security, governance, and auditing controls one has with a data lake solution.
Two important notes about federation:
  • Performance isn’t the main benefit of this approach. If your output requires specific capabilities, load the data into the relevant data engine (BigQuery, DynamoDB, etc.) and use it there. But if you need business logic that spans multiple database types, federation will do the job for you. Business-logic output will always be relatively small: the DWH does what it does best and responds with an aggregated output, which the federation engine can then join/merge/union with results from other databases to produce the business insights you need (see the sketch after this list).
  • This federation engine will also need an integration for each database that implements its API, translates data formats so results can be merged or joined with other results, and exposes one common (SQL) syntax for querying all of these databases.
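As a rough sketch of that push-down idea (all table names here are made up), a federated query might look like this: the aggregation runs inside the data warehouse, and only the small aggregated result is joined with a table living in an operational database:

-- dwh.orders lives in the data warehouse; ops.users lives in an
-- operational database. The engine pushes the group by down to the
-- warehouse and only joins the compact result with the users table.
select u.plan, sum(o.total_spend) as revenue
from (
    select customer_id, sum(amount) as total_spend
    from dwh.orders           -- executed inside the warehouse
    group by customer_id
) as o
join ops.users as u           -- fetched from the operational database
  on u.id = o.customer_id
group by u.plan;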

Who is investing in federated solutions?

Google BigQuery is investing some effort in the direction of federated sources, with a caveat: Google-only services (and only three, as of now). They sometimes refer to it as federation, and sometimes as “external data sources”. In their case, BigQuery acts as the federation engine and SQL is the common syntax. You can query external resources: Cloud Bigtable, Cloud Storage, and Google Drive. The data sources can be MySQL tables or file types like ORC and Parquet.
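For illustration (the project, dataset, table, and connection names below are placeholders), an external table defined over Parquet files in Cloud Storage can be queried like any native BigQuery table, and EXTERNAL_QUERY sends a statement down to a Cloud SQL (MySQL) source:

-- 1. An external table defined over Parquet files in Cloud Storage,
--    queried like a native table (all names are placeholders):
select user_id, count(*) as events
from `my_project.my_dataset.gcs_events_parquet`
group by user_id;

-- 2. A federated query against a MySQL source (Cloud SQL) via
--    EXTERNAL_QUERY; the inner statement runs on the MySQL side:
select *
from EXTERNAL_QUERY(
  'my_project.us.my_mysql_connection',
  'select id, email from users where active = 1'
);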
Amazon AWS also invests in the federation approach with a new project called PartiQL. It enables unified query access across multiple data stores and data formats by separating the syntax and semantics of a query from the underlying format of the data or the data store being accessed, and it is open source.
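To give a feel for PartiQL (the data shape below is illustrative, loosely in the style of the PartiQL tutorial rather than anything from this article), the same SQL-style syntax can reach into nested, semi-structured records:

-- employees is a collection of records, each with a nested projects array;
-- PartiQL lets the from clause unnest it like another table.
select e.name as employeeName, p.name as projectName
from hr.employees as e,
     e.projects as p
where p.name like '%security%';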

Wrapping up

While federated querying has its obvious advantages over data lakes, the solutions out there today aren’t complete.
Once we can use SQL syntax to query any data source, and Join, Union, and aggregate the outputs in a single query, we’ll have a complete solution.
It’ll look something like this:
select *
from bigqueryTable as bqt
join dynamoTable as dt on …
join facebookApi as fbt on …
join logfiles as lt on …
where …
group by …
At superQuery, we believe that this “complete” engine is possible, and are building for it — we call it “Data Alloy”.

Published by HackerNoon on 2019/11/12