Binary Search Trees and Order Statistics for Ethereum

Written by robhitchens | Published 2019/03/02
Tech Story Tags: ethereum | solidity | data-storage | smart-contracts | data-structures

TLDRvia the TL;DR App

Finite Trees for Infinite Sets

This should not be a first choice

This post is about what we can do when we’ve exhausted all of the alternatives.

I’ve written before that sort order can almost always be handled outside Ethereum Smart Contracts: The Joy of minimalism in Smart Contract Design.

And, I’ve written before about patterns for efficiently organizing data in unordered lists. Solidity CRUD.

Thanks to mappings, there is usually no need to “search” to find records of interest in a Solidity contract. It follows that a principle reason for sorting records, namely finding them again, is generally not a sound justification for the increased cost and complexity of sorting records.

In many cases, sorted and filtered data is interesting to software-clients. But, since software-clients are well-capable of sorting the information themselves, that too is a poor justification for burdening a smart contract with the heavy lifting of sorting a list.

Many statistics such as lowest, highest, average and moving average can be computed on-the-fly without resorting to sorting the data. Those, too, are poor justifications for sorting a list in an Ethereum smart contract.

If contracts only need to find previous and next records and enumerate them in sorted order, a linked list is a simple pattern. When combined with hints about where to start, insertion cost can be scale-invariant. That’s what we want because anything that gets more expensive as the set expands is probably a serious defect: Getting Loopy with Solidity.

We should always prefer an efficient workaround over the considerably more complex idea of sorting a list.

There are cases, however, where the sort order is important to a contract and none of the aforementioned tools are right for the job.

What if a contract needs to know the median value in a set? Or, the rank of a given value? Or, the value at a certain rank? For example, consider test scores graded on a curve. Suppose everyone in the top 10% of the sorted test results is to get an A+ and the contract logic (a prize, perhaps?) depends on this result. And, suppose this requirement needs to be handled by a contract. An Oracle is a possibility, but what if the implied return to centralization isn’t acceptable?

Intuitively, we can see that such a process is both simple and deterministic. A smart contract ought to be able to do it. But, such calculations may depend on a sorted list and the kind of random access that a linked list doesn’t excel at.

Cases like this call for a general-purpose and efficient sorting solution.

This is a non-trivial undertaking and the alternatives mentioned are better-suited to Ethereum if they can be applied successfully.

Challenges

I believe Nick Johnson first starting talking about amortization of work in smart contracts. The idea is to take processes that would cost too much gas if tackled all once and break them down into smaller tasks that can be completed individually on a limited gas budget. Some extra difficulty comes from the need to ensure the contract state is valid at all times even when the overall process in only partially complete.

If we’re going to deal with sorted lists, then it would be a good idea to deal with lists that are sorted at all times rather than a monolithic process that sorts a completely disarranged set. A Binary Search Tree is such a structure.

I won’t cover the details here. In summary, BSTs maintain an ordered list by maintaining a set of pointers as data is added and removed.

BSTs offer certain assurances. For example, in a “perfectly balanced” tree, any record can be found in O(log n) steps. In a set of 1,024 records, one can find any record in, at most, 10 steps because 2**10 is 1,024. Each node is a sorted value with up to two child nodes, usually called left and right. Left points to something smaller, and right points to something larger. This is the “tree” aspect of the structure.

“Perfectly balanced” means that any node in the tree has an ~equal (+/- 1) number of children on the left and right side, so the search algorithm can reduce the search area by half with each step. This is the “binary” aspect of a BST. Achieving and maintaining this balance is accomplished by re-organization during inserts and deletes. (Modification is a delete followed by an insert).

When everything is organized like this, it is always possible find an empty node where new data can be inserted while maintaining the searchable aspect of the tree. Re-organization for tree balance is always possible with a little rearranging of node positions. However, remember O(log n). Execution cost increases as the tree gets bigger. And, balancing adds steps to the insertion and removal processes.

This arrangement is not optimized for the economics of Ethereum smart contracts. Smart contract costs are heavily weighted toward write operations, so it doesn’t make a lot of sense to perfectly optimize read-back efficiency while disregarding the cost of frequent, intricate reorganization.

Rising transaction costs as the tree grows is a serious concern for Ethereum smart contracts because it implies that the cost of using the contract might increase to the point of being impractical or even impossible to use. When the gas cost exceeds the block gas limit it’s not possible to run a transaction even if one is willing to pay for it. That’s usually unacceptable.

On the other hand, a “significantly imbalanced” tree is also unacceptable. It defeats the purpose of using a BST and we would end up with costs comparable to iteration through a linked list.

We need some sort of compromise.

Red Black Tree

A Red Black Tree is a variant of a BST that reduces update costs by tolerating limited imbalance in the tree. Such a tree can be a little lopsided, but not too much. It’s still possible to reason about the maximum cost in the worst-case scenario. Tolerating limited imbalance significantly reduces the frequency and extent of necessary reorganizations.

Cost still increases with size, but at a reduced rate of increase. We get most of the benefits of a balanced tree at significantly reduced update cost.

This is progress, but …

What we really need is fixed maximum cost, so we can be sure that it will always be possible to use the contract, at a reasonable cost.

Finite Trees for Infinite Sets

Returning to the purpose of sorting in the first place, there are problems that can be solved with imperfect sorting. Finding the records in the top 10% is such a case. Tolerating less-then-perfect sorting will be key to ensuring it’s possible to calculate a fixed cost for the worst-case scenario on a set of unlimited size.

Let us consider the question of pinpointing the value (say … test score) at a precise percentile rank. Let’s say we want the 90th percentile of 100,000 test scores in the range of 0/50 to 50/50.

A naïve way to approach this would be to consider just sorting the results, implying a BST with 100,000 nodes. We can see right away this is probably too expensive and subject to failure due to the gas cost of insertion and deletion into any sort of tree structure. Do we really need to do that?

Perhaps surprisingly, we could accomplish our goal with a tree of 51 nodes.

Each node will be the set of test scores with a certain grade. The range of possible scores is 0–50, so at most there will be 51 unique grades. There will be thousands of records that share the same grade, but that’s not important. By definition, they also share the same rank.

The point is that we don’t need to sort the 100,000 test scores, only the 51 unique possible values. We can treat each unique value as a set that contains many results from many instances (student test results). We will need to know how many such instances exist at each node so we can figure out how many instances exist above and below, so we can figure out the rank.

For emphasis, we don’t need to sort the tests with the same result, at all.

Each node contains the unordered set of instances that share the same sorted value and the count of the instances that exist in the subtree starting at that node. Illustration adapted from image at https://en.wikipedia.org/wiki/Red%E2%80%93black_tree

What if we want more precision?

Suppose we have values in the range of 1–5,000,000 (and lots of them) and we want to be able to pinpoint the rank, percentile, median, etc., to three significant digits of accuracy. The node count has nothing to do with the number of instances to sort or even the range of the sorted values. It’s the number of unique values to sort that matters.

We could scale and round the values to guarantee that there can never possibly be more than 1,000 nodes in the tree and to ensure that we don’t pay for sorting beyond the precision we actually need.

We don’t need to sort similar values with the same result, at all.

Each node in the tree contains the set of instances with approximately equal values. These are treated as an unordered set.

Limiting Tree Size is key to establishing a maximum update cost. It means there will always be a calculable cost for the worst-case scenario.”

We know that we can maintain an unordered set in each node using the Solidity CRUD pattern, so …

If we have maximum tree maintenance cost coupled with maximum set maintenance cost, then we have sorted (enough) lists with maximum gas costs at any scale.

Photo by Zoltán Cse on Unsplash

Bingo. We can work with that. A BST with a known insertion and deletion cost at any scale is appropriate for Ethereum.

You can put as many instances as you want into such a structure and it will never exceed the predetermined maximum gas cost.

FIFO method of limiting tree size.

For something like a reputation system, it might suffice to limit the tree to n recent records by ejecting the oldest first when a limit is exceeded.

The method of limiting the overall tree size is flexible. Just don’t skip it. Before you use this structure, you should be able to show how your overall design imposes an upper bound on the tree size.

Order Statistics Tree

An Order Statistics Tree is a BST with nodes that contain a count. The count tallies the number of instances in the subtrees that start at each node. This makes it possible to efficiently (cheaply) work out the number of instances above and below any position. When you have above and below counts, you have rank. When you have the rank of a given value and the records at each level, you can enumerate all the records in the top 10% which is what we wanted to do.

A contract can see things like the fact that there are 37 records in the 65–70th percentile and it can enumerate them.

In the linked implementation, unordered lists with insert and delete logic (See Solidity CRUD) contain the identifiers of all the instances that share the sorted value of a given node. It’s a self-balancing tree that uses the Red Black Tree re-balancing algorithm. That’s everything we need for a potpourri of handy tree exploration and statistical functions.

The tree size limiting process is intentionally left out for maximum flexibility. Two ideas, precision and FIFO are described above. More are contemplated. You have flexibility here. Use any strategy you like as long as the gas cost is acceptable at your maximum possible tree size.

Other Statistics Tree

The example implementation gets a lot of mileage out of maintaining a count at the node level. A similar approach can be applied to all sorts of statistics about the sets.

For example, consider an order book where the product of bid and volume is interesting. If each node contains a sum of bid x volume in the subtree at the node_,_ then it would be just as fast (and cheap) to work out market depth above and below any price. Could that be interesting to a contract? The process and the algorithm that maintains the structure can probably be applied to a number of statistics that haven’t even been considered.

A word of caution: This is a non-trivial library that has not been audited. You should consider it experimental and remember that it is presented without warranty of any kind. You are advised that you should not use it or any variation of it in a production contract without first publicly disclosing your quality-assurance process, code audits, test plan and bug bounty. You use it at your own exclusive risk.

Feedback and contributions are welcomed with gratitude.

The code: https://github.com/rob-Hitchens/OrderStatisticsTree

Here’s a peek at the functions in the library:

using HitchensOrderStatisticsTreeLibrary for ... Library.Tree

HitchensOrderStatisticsTreeLibrary.Tree tree;

function snippet() public {uint value; // inspecting by valueuint row; // inspecting instances of value, by rowuint _rank; // inspecting by rankuint _percentile; // inspecting by percentile

    uint root = tree.root;  
    uint frst = tree.first();  
    uint last = tree.last();  
    uint next = tree.next(value);  
    uint prev = tree.prev(value);  
    bytes32 key = tree.valueKeyAtIndex(value,row);  
    bool vexs = tree.exists(value);  
    bool kexs = tree.keyExists(key, value);  
    uint cont = tree.count();  
    uint pctl = tree.percentile(value);  
    uint prml = tree.permil(value);  
    uint atpc = tree.atPercentile(\_percentile);  
    uint atpm = tree.atPermil(value);  
    uint medn = tree.median();  
    uint rank = tree.rank(value);  
    uint belw = tree.below(value);  
    uint abov = tree.above(value);  
    uint atrk = tree.atRank(\_rank);

    // Insert and Remove key/value pairs

    tree.insert(key, value);  
    tree.remove(key, value);  
}

Shout out to Bokky Poobah who was first to release a Solidity Red Black Tree implementation upon which the Order Statistics Tree is constructed.

Rob Hitchens is a Canadian smart contract design consultant, co-founder of Ethereum smart contract auditor Solidified.io and a courseware co-author and instructor of Ethereum, Hyperledger Fabric, Hyperledger Sawtooth Lake, Corda and Quorum bootcamps by B9lab.


Published by HackerNoon on 2019/03/02