Nested Maps Considered Harmful

Based on a talk given at the Clojure NYC meetup

The thing that makes Clojure one of my favorite languages, the one I choose first when I want to fluidly bang out an idea, is its commitment to generic programming.

Most people associate generic programming with C++ because of the Standard Template Library, but few are aware the ideas in the STL were originally conceived by Alexander Stepanov in Scheme. This is where C++ vectors got their name (after all, vectors in the mathematical sense are of a fixed length).

Stepanov’s original Scheme course notes are still available on his website. He begins by introducing the primitives of list processing, then moves on to recursion, higher order functions like map and reduce, and finally builds up to the conception of the iterator — which he demonstrates by implementing implement several common sorting and searching algorithms.

Coming from a Scheme background myself, I was struck by how Clojure seamlessly extended the list processing paradigm to apply to the most cutting edge immutable data structures: hash array mapped tries for associative data (introduced by Phil Bagwell only six years prior), similarly designed binary search tries for sequential data, plus one of the best implementations of lazy sequences in a strict language I’ve ever seen.

The familiar cons, car, and cdr are still front and center (albeit with different names, although “first” and “next” appeared in dialects of Lisp long before I was even born) except targeting interfaces rather than specific implementations. Even higher order functions are generic, for example the built-in reduce uses Java’s Iterator interface instead of first/next recursion whenever possible for efficiency’s sake.

Clojure really seemed to have done the unthinkable and yet again advanced the idea of what it means to be a Lisp.

The Problem

Building support for generic collection-based operations into the core language is not without its downsides. Tightly coupling complex data structures with a common set of functions used to manipulate them can encourage poor design decisions. As an example, I’ll turn to the title of this piece and discuss one of the most common antipatterns I see in Clojure code—deeply nested hash-maps — as well as presenting a technique I’ve developed to replace it.

Nested hash-maps are so used so often because, abstracted from their implementation, they simply make sense as a way to organize data. Clojure has famously never provided many obvious functions for working with them, many of which fill various utility libraries, yet the core language does provide functions that replicate basic van Laarhoven lenses: assoc-in, update-in, and get-in all take a vector of keys to in order to act as getters and setters for values in nested maps.

Personally, I’ve written code that produces nested hash-maps in the process of recursively generating data that “fans out” at each level. An example from my Madhava library is a function that calculates all partial derivatives of a given function, stored in sparse form using n-tuples, up to a certain order. Each order is generated from the previous one and contain n^k partials, where n is the number of dimensions and k is the order. In an equation of three variables the first order would contain three partials, the second order would contain nine, the third order would contain twenty-seven, and so on. It made sense to me both in writing the function that does the computation and ones that need to access certain partials, particularly related ones, to use nested maps:

{0

Okay, so a bit ugly even with pretty-printing. But hopefully it’s still legible how the curly brackets group together partials of the functions that precede them. With this style of nesting, I could access the second-order partial derivative pertaining to the x variable like so: (get-in diff-map [1 1]).

For a very different (and less monstrous) use of nested maps, let’s look at the team at Ladders’ DSL for extracting SQL queries to Clojure hash-maps: sqlium. I don’t mean to pick on them; they’re good engineers and sqlium is a pretty cool idea.

Sqlium represents the results of individual SQL queries as separate maps based on the tables they’re pulled from with keywords representing table names. Here’s an example of one result from a typical query:

{:name "Abbey Road":artist [{:name "The Beatles"}]:tracks [{:name "Come Together":number 1:artist [{:name "The Beatles"}]:songwriter [{:name "John Lennon"}{:name "Paul McCartney"}]:producer [{:name "George Martin"}]}{:name "Something":number 2:artist [{:name "The Beatles"}]:songwriter [{:name "George Harrison"}]...}... etc]}

It’s three layers deep, meaning the map representing all such queries would be four layers deeps. However, the top layer—also the one with by far the most elements — would be entirely flat, which is why I referred to this example as less of a monstrosity than my own. In fact, the Ladders team may be so concerned with vertical scaling that the amount of performance lost from the nesting in each individual entry is preferable in order to carry the structure of the original data explicitly through their ETL pipeline.

Regardless, these two examples are different enough it should be clear nested hash-maps are produced in a variety of manners for a variety of applications, in all of which they seem like the right choice semantically.

What’s Wrong?

Hash-maps are trees, recursive data structures, so it makes sense to want to store data in hierarchical batches. Clojure hash-maps are tries, immutable yet with an impressive log32(n) insertion and lookup time (effectively treatable as constant), so these batches would take the form of leaves with common parents and the nesting pattern tracing all the way up to the root.

But is that actually what we’re getting in the examples above? Clearly not. Most of the time it’s even worse than one leaf in a HAMT being the root of another HAMT. HAMTs don’t make sense for small amounts of data, so Clojure’s implementation uses a simple array (of the class PersistentArrayMap) to store less than 16 elements (or 8 key-value pairs) and then gracefully switches to a PersistentHashMap as it grows. This means that in both examples above we’re creating ad hoc trees out of doubly-boxed (the elements themselves as well as the additional class containing the data structure) Java arrays. This is obviously a pile of shit.

My approach to this problem follows from two beliefs:

From an API design standpoint it still makes sense to store elements in batches with nested semantics. This comes with benefits in terms of lookup time as well as the qualitative value of human legibility.
The answer cannot involve further complecting data structures with the core language. In other words, we don’t want to expose details of how values are actually stored in order to give programmers access to finer-grained optimization. This would likely only encourage further abuse. There’s also great benefit from encapsulating the implementations, demonstrated by several alternative versions that meet the same interfaces: including one we’ll see in a minute.

A Solution

I wouldn’t be so brazen as to claim this is an ideal solution for everyone. Software architecture clearly isn’t bijective. However, it’s one that has worked quite well for me and I hope it will get others thinking about how to better structure their own data.

I used to be really into information theory. One of my favorite pieces of trivia is how Claude Shannon changed the name of his famous paper “A Mathematical Theory of Communication” to The Mathematical Theory of Communication when publishing it in book form. I remember first implementing Huffman coding as exercises in SICP. These type of coding schemes are where I found inspiration to solve my nested map problem.

I should first note that most key coding schemes, including the two I’ll present here, require the map be sorted. This wasn’t a problem for me personally as instead of the built-in Clojure hash-map I used Zach Tellman’s int-map, a radix trie optimized for integer keys. Being able to use integer keys not only gave me a ~10% increase in speed, but sorting for free.

Clearly this constraint is not possible in the majority of cases, although I would argue more people could use integer keys than they think — likely by developing their own coding schemes and building custom pretty-printers that convert them to strings only when necessary. Therefore, I’ll also present a solution that uses keywords with Clojure’s built-in sorted-map. In this case sorting does come with around a 10% penalty on insertion speed, but this is more than made up for in using a flat HAMT vs. nested arrays.

Once I thought of this solution it seemed almost too obvious. It’s even in the name of the type of data structure I was using: “radix.” All I needed was to simply increment the keys an order of magnitude for every order of partial derivatives. When generating the partials, this meant passing the index along with the previous partial and multiplying it by ten in every recursive call. For lookup, I similarly needed only multiply indices by ten. In cases where I needed all partials of a given order, such as the gradient or Laplacian of a scalar function, I could simply filter for keys between a given range using a predicate such as (and (> k 9) (< k 100)).

Here’s what my map of partials looks like now:

{0 [[5 4 3 3] [8 2 1 2] [1 0 4 0] [2 0 0 3] [5 1 0 0]],1 [[20 3 3 3] [16 1 1 2] [5 0 0 0]],2 [[15 4 2 3] [8 2 0 2] [4 0 3 0]],3 [[15 4 3 2] [16 2 1 1] [6 0 0 2]],11 [[60 2 3 3] [16 0 1 2]],12 [[60 3 2 3] [16 1 0 2]],13 [[60 3 3 2] [32 1 1 1]],21 [[60 3 2 3] [16 1 0 2]],22 [[30 4 1 3] [12 0 2 0]],23 [[45 4 2 2] [16 2 0 1]],31 [[60 3 3 2] [32 1 1 1]],32 [[45 4 2 2] [16 2 0 1]],33 [[30 4 3 1] [16 2 1 0] [12 0 0 1]]}

Of course, this limits me to functions of nine dimensions (since I’m indexing from one for clarity’s sake), but that seemed like more than enough in this case. I don’t really understand string theory, but from what I could gather even the equations describing 15 degrees of freedom only use the four independent variables of Minkowski spacetime since conformal symmetry implies the ten from the Poincaré group as well as the one for time dilation cancel each other out. I concluded my representation was sufficient for even the most complex practical applications.

Had I required more than ten keys at the first stage of nesting, one option would be to use a different base. However, I almost certainly would have stuck with decimal and simply incremented by as many orders of magnitude as necessary to represent each level of would-be nesting.

One important takeaway from this was that any significantly complex operations on the keys themselves would negate most of the value of the coding scheme. One of the greatest benefits of this particular one was that insertion and lookup required nothing more than simple multiplication. Had I needed to convert back and forth to a digit-wise sequence on every single operation it likely would have become prohibitory. This was important guidance in the more difficult case where I couldn’t use integer keys.

All in all, I experienced an ~18% speedup on insertion using a flat map with my key coding scheme vs. the nested mess above. Many programmers might scoff at that, but in the numerical world 18% is quite significant. Furthermore, using only one map allowed me to fully take advantage of Clojure’s transient feature that temporarily creates mutable versions of data structures for aggressive manipulation. Using a transient _int-map_ increased speed of insertion by nearly 4x over my original nested version.

What About Keywords?

The keyword version, I’ll admit, is a bit of a hack. I was also notified it doesn’t work in ClojureScript, which throws exceptions when namespaces don’t resolve. I haven’t used it in my own code, but as mentioned I hope it will at least get people thinking about alternatives for their own use cases.

First, let’s talk a bit about keywords in general. They’re a fairly idiosyncratic convention developed in Common Lisp and I see Clojurians using strings as keys often enough that an explanation is clearly in order.

Lisps historically have had a concept of symbols, which refer to all global references including built-in functions. Clojure complicates this a bit by drawing a distinction between symbols as names themselves and vars, or symbols which refer to a value or function. This stems from the fact that most of the language is written in Java and the compiler emits JVM bytecode, but from here on I’ll only talk about symbols for simplicity’s sake.

A keyword is just a symbol that evaluates to itself. In Clojure it’s prefixed by one colon, or two to automatically qualify it with the current namespace, e.g. :my-keyword or ::my-keyword, which is the same as :my-app.core/my-keyword. Keywords are implemented as Java objects that store their own name and namespace with public getters and setters for each. When used as keys in associative data structures, Clojure keywords are hashed using a Java implementation of the common MurmurHash3 developed by Austin Appleby at Google. This alone is an obvious reason to use them instead of strings.

Another reason to use keywords instead of strings is that strings are inherently slow. If you’re a professional string salesperson there are many interesting problems in that domain, but I would argue a lot more programmers think they’re in the string business than actually are. For example, a large amount of development these days consists of writing CRUD apps. Although the vast majority of the “crud” being shuffled around consists of strings, the very acronym implies they’re not being directly manipulated.

Clojure even goes so far as to put most string functions in a separate namespace, and for good reason. It can be tempting to use strings as an intermediary format, for example converting integers to digit-wise sequences — the bullet we dodged above. This is the absolute worst since you’re not even working with strings at all. Just don’t do it.

Taking our lesson from the integer case that coding schemes are only valuable if they require minimal operation on the keys themselves, our rule of thumb here will be that although keywords are obviously just strings stored inside objects we should never do any actual string manipulation directly.

It was this heuristic that led me to exploiting the fact that keywords store namespaces as separate values (and, crucially, on the JVM that Clojure does not check to see whether they resolve to actual namespaces) in order to work with compound field names without converting them to strings.

The trick goes like this:

(keyword "foo" "bar") => :foo/bar(name :foo/bar) => "bar"(namespace :foo/bar) => "foo"

(keyword "foo" (str (keyword "bar" "baz"))) => :foo/bar/baz(namespace (keyword (name :foo/bar/baz))) => “bar"

Let’s run some microbenchmarks with Criterum to see how this technique compares to splitting strings:

(bench (namespace :foo/bar)) ;; ~3.6 ns(bench (clojure.string/split (str :foo/bar) #”/”)) ;; ~170 ns

(bench (namespace (keyword (name :foo/bar/baz)))) ;; ~51 ns(bench (clojure.string/split (str :foo/bar/baz) #”/”)) ;; ~212 ns

With one level of nesting, it’s nearly as fast as the multiplication used in our example with integer keys. With two levels, it becomes significantly slower due to the necessity of converting the string returned by name back into a keyword just once. However, it’s still over 4x as fast as string manipulation. Further examination shows this scales linearly, as one would expect as the calls to keyword are doing all the work here:

(bench (namespace (keyword (name (keyword (name (keyword (name :foo/bar/baz/qux/quz))))))));; ~153 ns

(bench (clojure.string/split (str :foo/bar/baz/qux/quz) #”/”));; 290 ns

This made me wonder: at how many levels of nesting would this multiple-namespace keyword hack reach parity with clojure.string/split?

(bench (namespace (last (take 10 (iterate (comp keyword name) :foo/bar/baz/qux/quux/quuz/corge/grault/garply/waldo/fred)))));; ~631 ns

(bench (clojure.string/split (str :foo/bar/baz/qux/quux/quuz/corge/grault/garply/waldo/fred) #”/”)) ;; ~615 ns

I can’t say I’ve ever seen a nested hash-map 11 levels deep.

To round things out, we can test that creating nested keywords also scales linearly:

(bench (keyword “foo” “bar”)) ;; ~18.7 ns(bench (keyword “foo” (str (keyword “bar” “baz”)))) ;; ~32 ns(bench (keyword “foo” (str (keyword “bar” (str (keyword “baz” “qux”)))))) ;; ~51.8 ns

and takes less than half the time of calling str:

(bench (keyword “foo” (str (keyword “bar” “baz”)))) ;; ~32 ns(bench (keyword “foo” (str “bar/” “baz”))) ;; ~77.9 ns(bench (keyword (str “foo/” “bar/” “baz”))) ;; ~127 ns

Using this technique Ladders’ SQL query for Abbey Road could become:

{:name "Abbey Road",:artist/name "The Beatles",:tracks/name/Come Together/name "Come Together",:tracks/Come Together/number 1,:tracks/Come Together/artist/name "The Beatles",:tracks/Come Together/songwriter/name/1 "John Lennon":tracks/Come Together/songwriter/name/2 "Paul McCartney"],:tracks/Come Together/producer/name "George Martin",:tracks/Something/name "Something",:tracks/Something/number 2,:tracks/Something/artist/name "The Beatles",:tracks/Something/songwriter/name "George Harrison",... etc}

Obviously this involves some design decisions in how to deal with repetitive keys, which should be considered carefully with regard to the specific data.

Lessons Learned

When discussing the Unix philosophy in Notes on C Programming, Rob Pike famously wrote:

Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

He explicitly referenced Fred Brooks’ The Mythical Man-Month:

Show me your flow charts and conceal your tables and I shall continue to be mystified, show me your tables and I won’t usually need your flow charts; they’ll be obvious.

In the context of generic programming, I’ll take this one step further: