Introduction

I like to spend time on Linkedin reading through posts from companies about news, new releases, funding, new products, etc.. During a recent perusal, I saw something about “DuckDB”, and as much as I pay attention to the data space, I hadn’t heard of this one, so I thought I’d investigate and answer the question “What the heck is DuckDB?”.

Overview

DuckDB itself is an MIT-licensed open source project whose code is hosted on GitHub. DuckDB Labs is a commercial company formed by the creators of DuckDB in July 2021 to provide support, custom extensions, and even custom versions of the product as a way to monetize it. This model reminds me of the early days of open source monetization and is one I like.

DuckDB is briefly described as SQLite for analytic workloads. While SQLite is an embeddable, row-based, and b-tree indexed data store well suited for OLTP workloads, DuckDB is an embeddable column-based data store that uses vectorized processing to optimize OLAP workloads, you could also think of it as an embedded Amazon Redshift or a mutant offspring of SQLite and Redshift. Some of the features are:

Simple installation
Single-File storage format
No server
Fast processing
Language library integrations
Not reliant on any external config files or settings
Programmatic SQL API
Fully ACID compliant
WASM (web assembly) version available

OLTP	OLAP
For your operation workloads	For your analytic workloads
Shorter queries	Longer queries for complex questions
Tables are more highly normalized	Tables are de-normalized
Typically implemented as row-oriented data stores	Typically implemented as column-oriented data stores

Testing

I decided to use their very clever WASM web-based shell to try out querying some Parquet files, of which I grabbed some to play with from here. I started with the “.files add” command to load up the parquet file:

Then I did some basic SQL to check it out:

duckdb> select count(*) from userdata1.parquet;
┌──────────────┐
│ count_star() │
╞══════════════╡
│         1000 │
└──────────────┘
Elapsed: 1 ms

duckdb>

duckdb> select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌────────────┬───────────┬────────────────────────────┐
│ first_name ┆ last_name ┆ email                      │
╞════════════╪═══════════╪════════════════════════════╡
│ Emily      ┆ Stewart   ┆ estewart9@opensource.org   │
│ Annie      ┆ Torres    ┆ atorrest@ning.com          │
│ William    ┆ Green     ┆ wgreen63@phpbb.com         │
│ Jack       ┆ Medina    ┆ jmedina7y@fda.gov          │
│ Jeremy     ┆ Bennett   ┆ jbennettck@wikipedia.org   │
│ Carlos     ┆ Day       ┆ cdaycn@gravatar.com        │
│ Ryan       ┆ Mills     ┆ rmillsgj@angelfire.com     │
│ Betty      ┆ Gibson    ┆ bgibsonka@tamu.edu         │
│ Wanda      ┆ Stanley   ┆ wstanleymz@sourceforge.net │
│ Evelyn     ┆ Spencer   ┆ espencerpi@ted.com         │
│ George     ┆ Howard    ┆ ghowardqh@mapquest.com     │
└────────────┴───────────┴────────────────────────────┘
Elapsed: 2 ms

You can even do an ‘explain’:

duckdb> explain select first_name, last_name, email from userdata1.parquet where country = 'Nigeria';
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         first_name        │
│         last_name         │
│           email           │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│        PARQUET_SCAN       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          country          │
│         first_name        │
│         last_name         │
│           email           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  Filters: country=Nigeria │
│   AND country IS NOT NULL │
└───────────────────────────┘

A very nice feature with DuckDB is if you are working with Python for example, you can just add it as a library by adding “import duckdb” to your python script, and then it is in your python process, so it then feels very integrated into your program, unlike working with something like MySQL or Postgres. So, using our userdata1.parquet file, we could do something like this:

import duckdb
myconnector = duckdb.connect('myduckdb.duckdb')
cursor = myconnector.cursor()
cursor.execute("""
CREATE TABLE userdata(
   registration_dttm  date,
   Id                 int,
   first_name         varchar, 
   Last_name          varchar,
   email              varchar,
   gender             varchar,
   ip_address         varchar,
   cc                 varchar,
   country            varchar,
   birthdate          varchar,
   salary             float,
   title              varchar,
   comments 
)
"""
)
 
cursor.execute("COPY userdata FROM 'userdata1.parquet' (HEADER)")
 
print(cursor.execute('select count(*) from userdata).fetchall())
cursor.close()
conn.close()

In the code snippet shown above, we connect to the ‘myduckdb.duckdb’ database, create a table that matches our parquet file, copy the data into it and then perform a simple count query.

Summary

This is a really cool project. While I’ve been aware of the advantages of columnar data stores for about eight years because of Sisense initially, I only started working with them more extensively in the past year. I’m especially excited by their WASM implementation and the clever things they did with Arrow as a data protocol for the data import and query results. Not only is it a great technical example of WASM, but it has some great utility as well. Then, as I was wrapping this up, I ran across a YouTube video of DuckCon, which was the first DuckDB user group meeting from April 2022 that you might want to check out. Lots of great things on the roadmap. As a final note, I have nothing to do with this project or the people behind it, they don’t even know I wrote this, so if I seem like a bit of a fanboy, it’s because I’m legitimately very keen on the project.

What The Heck Is DuckDB?

Introduction

Overview

Testing

Summary