What the Heck Is Malloy?

Written by ProgRockRec | Published 2022/10/07
Tech Story Tags: data-analytics | duckdb | data-lake | data-science | business-intelligence | guide | beginners-guide | how-to

TLDRMalloy is an experimental language for describing data relationships and transformations. It is both a semantic modeling language and a querying language that runs queries against a database. Malloy natively supports DuckDB, BigQuery and Postgres, and it can connect to DuckDB. There is a Visual Studio Code extension for Malloy, as well as npm modules for javascript and typescript, in addition to a Composer. The language is more evolved and open-source than Looker, which Google bought back in 2019.via the TL;DR App

I follow some very bright people on Linkedin and Twitter (about the only reason I’m on Twitter), and I learn about interesting tech and interesting use cases by random chance.

The latest is Malloy, and as an added benefit, the example is going to include DuckDB. The intro paragraph from the Malloy GitHub page says:

Malloy is an experimental language for describing data relationships and transformations. It is both a semantic modeling language and a querying language that runs queries against a relational database. Malloy currently connects to BigQuery and Postgres, and natively supports DuckDB. We've built a Visual Studio Code extension to facilitate building Malloy data models, querying and transforming data, and creating simple visualizations and dashboards.

Malloy was developed by Lloyd Tabb the creator of Looker, which Google bought back in 2019. Malloy is meant to be a more evolved and open-source version of Looker essentially. Looker has some clever ideas in it obviously, but I’ve also seen it referred to as:

 "feels like JSON made a baby with SQL, in the worse possible way"

To be clear, I know about Looker, but I haven’t done any real work with it. So that brings us to Malloy.

What Is Malloy?

Lloyd describes Malloy on the GitHub page as

an experimental language for describing data relationships and transformations. It is both a semantic modeling language and a querying language that runs queries against a relational database.” also stating: “SQL is complete but ugly…Everything is expressible, but nothing is reusable; simple ideas are complex to express; the language is verbose and lacks smart defaults. Malloy is immediately understandable by SQL users, and far easier to use and learn.

Malloy natively supports DuckDB, which I find exciting, and it can currently connect to BigQuery and Postgres. There is a Visual Studio Code extension for Malloy, as well as npm modules for javascript and typescript, in addition to a Composer. Malloy can (from the docs):

  • Queries compile to SQL, optimized for your database.
  • Has both a semantic data model and a query language. The semantic model contains reusable calculations and definitions, making queries short and readable.
  • Excels at reading and writing nested data sets.
  • Things that are complicated in SQL are simple to express in Malloy. For example, level of detail calculations, percent of the total, aggregating against multiple tables across a join safely, date operations, reasonable ordering by default, and more.

First, a couple of examples taken from the docs so you can see what a Malloy description looks like and the SQL it generates:

query: table('malloy-data.faa.airports') -> {
  group_by: fac_type
  aggregate: airport_count is count()
  where: state = 'CA'
  order_by: airport_count desc
}

SELECT
   base.fac_type as fac_type,
   COUNT( 1) as airport_count
FROM `malloy-data.faa.airports` as base
WHERE base.state='CA'
GROUP BY 1
ORDER BY 2 desc

This will translate to this SQL:

And from the dataset, it will produce this result:

Note that you can display this as HTML, JSON, or SQL. The basic structure of a Malloy Query takes the form of:

query: <source> {
  join_one: <source> with …
  join_many: <source> on …
} -> {
   group_by:
    <field/dimension>
    <field/dimension>
   aggregate:
    <aggregation/measure>
    <aggregation/measure>
   nest:
    <named_query OR query_def>
    <named_query OR query_def>
   where: <filter_expression>, <filter_expression>, …
   having: <aggregate_filter_expression>, <aggregate_filter_expression>
   order_by: <field/dimension>, <aggregation/measure>, …
   limit: <limit>
}

This maps to the following SQL query structure:

SELECT
   <group_by>, <group_by>, …
   <aggregate>, <aggregate>, …
   <nest>, <nest>, …  		       -- very much a simplification; read more in Nesting Queries doc.
FROM <source>
LEFT JOIN <source> ON …
LEFT JOIN <source> ON …
WHERE (<filter_expression>) AND (<filter_expression>) AND …
GROUP BY <group_by>, <group_by>, …
HAVING <aggregate_filter_expression> AND <aggregate_filter_expression> AND …
ORDER BY <group_by> | <aggregate>
LIMIT <limit>

If you don’t know Mimoune Djouallah, you should definitely follow him on Twitter or Linkedin, he does lots of very clever things in the data and analytics space. I’m going to shamelessly make use of the example he just put out for this article, just because it is so elegant.

So, I’m going to summarize what Mimoune has going on here on his Malloy Fiddle.

In the left-hand pane, we have the schema information, which is pretty obvious. Then to the right of that on top is our Malloy Query, in which you can see the nested aggregations easily described in just a few lines.

Below that is the Semantic Model, and to the right is the generated bar chart that was created from the query. You can try out various saved queries from the “Queries” on the top row. Here is what is really wild, however, look at the below screenshot:

That is the generated SQL from the Malloy Query, which was only 6 actual lines of code; it replaced 40 lines of SQL. You can see the embedded DuckDB connection that is accessing the various Parquet files that have the data for the query in the Semantic Model.

It’s totally worth clicking into the Fiddle and playing around a bit. There is another really fun Fiddle from Lloyd that connects to the IMDB dataset that is absolutely worth checking out here.

What I really like here is how you can create nested data sets that Mallow will run together in a single query; you can aggregate pretty much anything into a single query and simplify your SQL.

Summary

The main purpose of this article is to bring Malloy and some ideas to your mind; it’s not meant to be an exhaustive tutorial by any means. Yes, it’s another language, but it’s similar enough that it should be easy to pick up on.

What really excited me about this whole affair was doing full analytics and visualizations on a ton of data in a browser, with no servers running, with DuckDB and Malloy. I’ve seen a lot of use cases where analysts would love to have this speed and ease of use.

I hope I got you thinking a little bit and you’ll check it all out, but keep in mind that Malloy isn’t an actual product at this time and no one is offering support.


Written by ProgRockRec | Technology and blockchain developer and enthusiast as well as a prolific musician.
Published by HackerNoon on 2022/10/07