Analyze Almost 1 Billion Bitcoin Transactions in Less Than 1 Minute Using This Tool

The lead image for this article was generated by HackerNoon's AI Image Generator via the prompt "blockchain transaction hashes on a whiteboard"

In this article we will talk about BlockSci, a little-known open-source tool that lets you query useful data from blockchains at lightning speed.

We will give sample queries that one may make to get useful data out of the blockchain and we will talk about the tool's architecture that allows it to do analysis at such high speeds.

Before that though, let's talk about why you need to analyse transactions from the blockchain in the first place.

Each Bitcoin transaction contains the

sender address
recipient address
amount sent

and, combined with the history of the chain, it also reveals the

sender's balance
how long the coins were dormant in sender's wallet
recipient's balance

And from the above, you can extrapolate and get information about

How long senders usually keep their Bitcoin before sending it
What the average balance of all wallets on the blockchain is
Average transaction size

and 100+ other metrics.
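To make this concrete, here is a minimal sketch of how such aggregate metrics could be computed once the per-transaction fields are available. It runs on made-up sample data in plain Python; the ToyTransfer class, its field names and the helper functions are our own illustration and are not part of any blockchain library.

```python
from dataclasses import dataclass
from statistics import mean

SATOSHIS_PER_BTC = 100_000_000


@dataclass
class ToyTransfer:
    """Hypothetical, simplified view of one transfer (not a real library type)."""
    sender: str
    recipient: str
    amount_sat: int       # amount sent, in satoshis
    dormant_days: float   # how long the spent coins sat unspent


# Made-up sample data, for illustration only.
transfers = [
    ToyTransfer("addr_a", "addr_b", 50_000_000, 12.0),
    ToyTransfer("addr_b", "addr_c", 20_000_000, 3.0),
    ToyTransfer("addr_a", "addr_c", 110_000_000, 300.0),
]


def average_amount_btc(txs):
    """Average transaction size, converted from satoshis to BTC."""
    return mean(t.amount_sat for t in txs) / SATOSHIS_PER_BTC


def average_dormancy_days(txs):
    """How long coins were held, on average, before being sent."""
    return mean(t.dormant_days for t in txs)


def net_flow_sat(txs):
    """Satoshis received minus satoshis sent, per address."""
    flow = {}
    for t in txs:
        flow[t.sender] = flow.get(t.sender, 0) - t.amount_sat
        flow[t.recipient] = flow.get(t.recipient, 0) + t.amount_sat
    return flow
```

On a real chain the same aggregations would run over hundreds of millions of records, which is why the speed of the underlying tool matters so much.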

As you can see, the data that we can get about the Bitcoin market is much more comprehensive compared to legacy markets. And most importantly the average person can get access to it. In contrast, only hedge funds and banks have access to data like this in legacy markets.

Hence, by analysing transactions on the Bitcoin blockchain, one can understand the Bitcoin market better and make better-informed investment decisions.

Apart from trading and investing, blockchain analysis has several other applications:

Deanonymization of wallet addresses
Identifying illicit activity
Compliance

So how does this tool work?

BlockSci's architecture

BlockSci aims to address three pain points of existing blockchain analysis tools:

poor performance,
limited capabilities,
and a cumbersome programming interface.

Poor performance is the pain point that BlockSci solves best. The brute-force approach to analysing Bitcoin transactions does work: you can run a Bitcoin node on your own computer or server and query that node directly. But this approach is so slow that it may take years to process all 1 billion Bitcoin transactions.

Other existing tools also suffer from poor performance, especially those built on general-purpose graph databases, which are hundreds of times slower than BlockSci for sequential queries and substantially slower for all other queries, including graph traversal queries.

BlockSci's design is predominantly based on the fact that in blockchains, blocks in the past cannot be altered, and all the new data that appears on the blockchain is append-only.

This means that the ACID properties of transactional databases are unnecessary, making an in-memory analytical database the natural choice. Using memory instead of disk storage significantly speeds up data processing, which is exactly what we need.

In fact, BlockSci loads the whole blockchain in memory to perform calculations and avoids the distributed processing approach. This is motivated by the fact that blockchain data is graph-structured, and thus hard to partition effectively.

Its designers' conjecture was that the use of a traditional, distributed transactional database for blockchain analysis has infinite COST (Configuration that Outperforms a Single Thread), in the sense that no level of parallelism can outperform an optimized single-threaded implementation.

It also applies several techniques, such as converting hash pointers to actual pointers and deduplicating address data, to increase speed even further and to reduce the size of the data.
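The two techniques just mentioned can be illustrated with a toy sketch in Python. This is not BlockSci's actual code: the function name, the tuple layout and the sample hashes are hypothetical. The idea is simply that each 32-byte transaction hash is replaced by a small sequential integer, and each address string is stored once and referred to by an integer ID afterwards.

```python
def index_transactions(raw_txs):
    """Convert hash references to integer IDs and deduplicate addresses.

    raw_txs: list of (tx_hash, [spent_tx_hash, ...], [address, ...]) tuples,
    in chain order, so every spent transaction appears before its spender.
    """
    tx_index = {}      # tx hash -> sequential integer ID (a "pointer")
    address_ids = {}   # address string -> integer ID, each address stored once
    compact = []       # position in this list is the transaction's integer ID

    for tx_hash, spent_hashes, addresses in raw_txs:
        tx_index[tx_hash] = len(compact)
        # Replace each referenced hash with the referenced transaction's ID.
        spent_ids = [tx_index[h] for h in spent_hashes]
        # Reuse the existing ID if the address was seen before.
        addr_ids = [address_ids.setdefault(a, len(address_ids)) for a in addresses]
        compact.append((spent_ids, addr_ids))

    return compact, address_ids


# Made-up sample data: short strings stand in for real hashes and addresses.
raw = [
    ("aa", [], ["addr1"]),
    ("bb", ["aa"], ["addr2", "addr1"]),
    ("cc", ["aa", "bb"], ["addr2"]),
]
compact, address_ids = index_transactions(raw)
```

After this step, following a reference from one transaction to another is a list index lookup rather than a search by hash, and repeated addresses cost a few bytes each instead of a full string.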

To import data from the node, BlockSci uses its own high-performance importer that reads the raw data directly from disk and NOT through the built-in JSON-RPC interface. Even then, it usually takes about 24 hours to import and index all the data from the node. Once the data is parsed, though, the actual analytics are fast.

The format in which blockchains are stored on disk isn't easy to analyse. It is designed for other tasks, such as validating transactions and finding data in a large network. It is also designed to save memory by keeping blocks in a basic format on disk. For analysis, the data has to be transformed so that it fits in memory, so BlockSci has a parser that handles this step. The authors made sure that this parser is well optimised.
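To give a feel for what reading raw block data involves, here is a minimal sketch that decodes the fixed 80-byte header at the start of every serialized Bitcoin block. It is not BlockSci's parser; the function name and the sample values are ours, and a real importer also has to handle the transactions that follow the header.

```python
import struct
from collections import namedtuple

# Little-endian layout of a Bitcoin block header:
# version (int32), previous block hash (32 bytes), merkle root (32 bytes),
# timestamp (uint32), difficulty bits (uint32), nonce (uint32).
HEADER_FORMAT = "<i32s32sIII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes

BlockHeader = namedtuple(
    "BlockHeader", "version prev_hash merkle_root timestamp bits nonce"
)


def parse_header(raw: bytes) -> BlockHeader:
    """Decode the first 80 bytes of a serialized block into named fields."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("need at least 80 bytes for a block header")
    return BlockHeader(*struct.unpack(HEADER_FORMAT, raw[:HEADER_SIZE]))


# Synthetic header built from sample values, only to exercise the parser.
sample_header = struct.pack(
    HEADER_FORMAT, 1, bytes(32), bytes(range(32)), 1231006505, 0x1D00FFFF, 2083236893
)
header = parse_header(sample_header)
```

Because every field has a fixed width, the header can be decoded with a single unpack call and no text parsing, which is one reason reading from disk is so much faster than going through JSON-RPC.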

Another way the tool achieves such high speeds is the Bloom filter optimisation. A Bloom filter is a probabilistic data structure for testing membership in a set: it can return false positives but never false negatives. In BlockSci, it stores all addresses seen so far, which keeps lookups for existing addresses correct while minimising the number of database queries for addresses that have not been seen before. The optimisation rests on two observations: about 88% of inputs spend outputs created in the last 4000 blocks, and only 8.6% of Bitcoin addresses are used more than once.
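Here is a minimal sketch of the idea in Python. It is a generic Bloom filter placed in front of a dictionary that stands in for the on-disk address database; the class names, the filter size and the number of hash functions are arbitrary choices of ours and do not reflect BlockSci's implementation.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class AddressIndex:
    """Assigns integer IDs to addresses, skipping the database for new ones."""

    def __init__(self):
        self.seen = BloomFilter()
        self.db = {}          # stand-in for a slow on-disk database
        self.db_lookups = 0   # counts how often the "database" is queried

    def get_or_create_id(self, address):
        # Only query the database if the filter says the address may exist.
        if self.seen.might_contain(address):
            self.db_lookups += 1
            if address in self.db:
                return self.db[address]
        # Definitely new (or a false positive): assign the next ID.
        new_id = len(self.db)
        self.db[address] = new_id
        self.seen.add(address)
        return new_id
```

Since most addresses appear only once, most calls take the fast path: the filter answers "definitely not seen" and the database is never touched.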

The last major optimisation behind BlockSci's high analysis speeds is its data layout, which keeps analytics fast without taking too much of a toll on memory.

BlockSci's data layout divides the available data into three categories and combines them in a hybrid scheme. The core transaction graph is required for most analyses, so it is always loaded in memory and stored in a row-based format. Scripts and additional data required for only a subset of analyses are stored in a hybrid (partially column-based, partially row-based) format and loaded on demand. Indexes to look up individual transactions or addresses by hash are stored in a separate database on disk.

Additionally, it uses fixed-size encodings for data fields where possible, optimizes the memory layout for locality of reference, links outputs to inputs for efficient traversal, and relies on memory mapping so that the data can be shared across parallel processes.
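Fixed-size encodings and memory mapping work well together, as the following sketch shows. The 16-byte record layout, the file name and the helper names are hypothetical and chosen only for illustration; BlockSci's real on-disk format is different. The point is that when every record has the same size, record number i lives at byte offset i times the record size, so it can be read straight from a memory-mapped file without scanning or deserializing anything else.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: value in satoshis (uint64),
# address ID (uint32), index of the spending transaction (uint32).
RECORD = struct.Struct("<QII")  # always 16 bytes


def write_records(path, records):
    """Write the records back to back in their fixed-size binary form."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_record(buf, i):
    """Random access: record i starts at byte offset i * RECORD.size."""
    return RECORD.unpack_from(buf, i * RECORD.size)


def demo():
    records = [(50_000_000, 0, 1), (20_000_000, 1, 2), (5_000, 2, 0)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "outputs.bin")
        write_records(path, records)
        # Map the file read-only; the OS pages data in lazily and can share
        # the same pages between several processes that map the file.
        with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [read_record(mm, i) for i in (2, 0)]
```

Calling demo() reads the third and first records directly by position, without touching the second one.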

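Lastly, here is the query that finds all transactions with a fee greater than 0.1 Bitcoin (10**7 satoshis) in under a minute. It assumes that chain is an already-loaded BlockSci blockchain object.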

chain.blocks.txes.where(lambda tx: tx.fee > 10**7).to_list()
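If you do not have BlockSci installed, the following plain-Python stand-in shows what the query above computes. It is only a sketch on made-up data: the dictionary layout and function names are ours, not BlockSci's. A transaction's fee is the sum of its input values minus the sum of its output values, both in satoshis, and 0.1 Bitcoin equals 10**7 satoshis.

```python
FEE_THRESHOLD_SAT = 10**7  # 0.1 BTC, since 1 BTC = 10**8 satoshis


def fee_sat(tx):
    """Fee in satoshis: total input value minus total output value."""
    return sum(tx["inputs"]) - sum(tx["outputs"])


def high_fee_txids(txes, threshold_sat=FEE_THRESHOLD_SAT):
    """IDs of transactions whose fee is strictly above the threshold."""
    return [tx["txid"] for tx in txes if fee_sat(tx) > threshold_sat]


# Made-up sample transactions, values in satoshis.
sample_txes = [
    {"txid": "t1", "inputs": [100_000_000], "outputs": [80_000_000]},          # fee 0.2 BTC
    {"txid": "t2", "inputs": [50_000_000], "outputs": [49_990_000]},           # fee 0.0001 BTC
    {"txid": "t3", "inputs": [30_000_000, 30_000_000], "outputs": [50_000_000]},  # fee exactly 0.1 BTC
]
result = high_fee_txids(sample_txes)
```

Note that the comparison is strict, so a transaction with a fee of exactly 0.1 Bitcoin is not included.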

Summary

BlockSci is a blockchain analysis tool that allows for lightning-fast querying of useful data from blockchains. By analyzing Bitcoin transactions, one can understand the market better and make better investment decisions.

BlockSci's architecture is based on the fact that blockchain data is append-only, graph-structured and hard to partition effectively, so it loads the whole blockchain in memory to perform calculations. It applies several techniques to increase speed, including converting hash pointers to actual pointers, deduplicating address data, and using a Bloom filter optimization. Its data layout delivers high analytics speeds without taking too much of a toll on memory.