Here's Why We Built An Open-Source Goldmine of Crypto-Markets Datasets

Written by julian-molina | Published 2020/08/16
Tech Story Tags: datasets | crypto | cryptocurrency | crypto-trading | algorithmic-trading | crypto-trading-bots | data-structures-and-algorithms | hackernoon-top-story

TL;DR: Julian Molina is co-founder of Superalgos.org, an open-source project building a Collective Trading Intelligence. He explains how to run a distributed data-mining operation to source and process crypto market data at zero cost. The Data Mines infrastructure was built as the information leg of an algorithmic crypto-trading platform, but it may handle all sorts of data, provided that the right sensors are added.

How to run a distributed data-mining operation to source and process crypto market data at zero cost.
I should have known from that first conversation that I was about to get dragged into yet another multi-year project. It all started back in the summer of 2017.
Luis Molina, my brother, had been toying for weeks with a visual interface he envisioned would be the foundation for a new kind of trading bots platform.
He insisted that being able to visualize entire datasets was at the core of creating trading intelligence.
To me, data visualization and trading bots in the same sentence seemed counter-intuitive. However, his next statement made it obvious…
Bots — he said — don’t need data visualization, but humans building bots do.
Little did I know that we would spend the next three years building a system of titanic proportions! Boy was I dragged into a roller coaster ride!
We launched a startup, built a team, and — at some point — decided to radically change plans and go open-source instead.
I will get to pen that story one day…
But today, I’m writing to share with the data science community the free and open-source data-mining infrastructure we have created for Superalgos, which may be of use to people doing crypto-markets research, building trading AIs, or looking for a visual-scripting solution to process data or run backtests.
As quants in the community have rightly pointed out, the infrastructure not only serves the purpose of providing our systematic strategies with high-quality Technical Analysis datasets…
Superalgos Data Mines solve the software and data engineering needs of data scientists — like extracting, collecting, cleaning, and storing market data — as well as some of the data analysis needs — like transforming, aggregating, and processing data.
The Data Mines infrastructure was built as the information leg of an algorithmic crypto-trading platform, but, to be precise, it may handle all sorts of data, provided that the right sensors are added.
What follows is the bird’s-eye view of what we’ve built so far, and what you get right out-of-the-box, free of charge, no ads, no collection of personal information, no paid pro-version… just good, old-school open-source software you download and run on your machine.

System Architecture

We wanted a flexible visual environment that would lower the barrier to entry and help the less technically-minded users make the most out of the system. At the same time, we wanted a robust and reliable backend, worthy of a mission-critical financial system.
At the highest level, the system is divided into a frontend web app running an animated HTML5 canvas in the browser, and a backend consisting of a set of Node.js processes.
The codebase is distributed with the system and constitutes a fully functional development environment right out of the package. The system runs directly from the source code, with no compilation step.
Data mines are designed to sustain a real-time data-mining and data-processing operation that feeds elaborate information to trading bots, but may be used to download data and process it offline as well.

Exchange Connectivity

We started out working solely with Poloniex, the top-of-the-food-chain crypto exchange back in mid-2017. By the time the core of the system had gone through a few iterations, Poloniex was nowhere to be found in the top-20.
Boy, do things change fast in crypto-land! It was time to open up the game to more exchanges…
Crypto exchanges expose APIs which may evolve and change over time. This is a long-term challenge if you have to maintain different connectors on your own.
To deal with multi-exchange communication, we integrated the CCXT library, an open-source project that exposes a unified interface to the APIs of many cryptocurrency exchanges. In short, the integration grants access to most major exchanges.

Sensor Bots

A single sensor bot may interact with all supported crypto exchanges and fetch data from any market. Data completeness and integrity are crucial in a trading application, so we put sizeable effort into building a bullet-proof sensor.
The sensor downloads OHLCV data (open, high, low, and close prices, and volume) of one-minute candles and outputs a data product that is later used as input for other bots.
Some exchanges support downloading raw trades and order book data as well.
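To give a feel for what a sensor has to take care of, here is a minimal sketch that pages through one-minute candles and then looks for holes in the series. It assumes an exchange object that follows CCXT's unified fetch_ohlcv(symbol, timeframe, since, limit) call and returns rows shaped as [timestamp_ms, open, high, low, close, volume]. The helper names download_candles and find_gaps are hypothetical, and this is not the actual Superalgos sensor code.

```python
from typing import Any, List, Tuple

ONE_MINUTE_MS = 60_000

# Assumed row layout, as returned by CCXT's unified API:
# [timestamp_ms, open, high, low, close, volume]
Candle = List[float]


def download_candles(exchange: Any, symbol: str, since_ms: int, until_ms: int,
                     page_size: int = 500) -> List[Candle]:
    """Page through one-minute candles with timestamps in [since_ms, until_ms)."""
    candles: List[Candle] = []
    cursor = since_ms
    while cursor < until_ms:
        page = exchange.fetch_ohlcv(symbol, timeframe="1m", since=cursor, limit=page_size)
        # Keep only rows inside the requested range and not seen before.
        page = [row for row in page if cursor <= row[0] < until_ms]
        if not page:
            break  # no more data in range
        candles.extend(page)
        cursor = int(page[-1][0]) + ONE_MINUTE_MS  # resume right after the last row
    return candles


def find_gaps(candles: List[Candle]) -> List[Tuple[int, int]]:
    """Return (first_missing_ms, last_missing_ms) for each hole between consecutive candles."""
    gaps = []
    for previous, current in zip(candles, candles[1:]):
        expected = int(previous[0]) + ONE_MINUTE_MS
        if int(current[0]) > expected:
            gaps.append((expected, int(current[0]) - ONE_MINUTE_MS))
    return gaps


# Usage with the real library (requires `pip install ccxt` and network access):
#   import ccxt
#   exchange = ccxt.binance()
#   candles = download_candles(exchange, "BTC/USDT", start_ms, end_ms)
#   print(find_gaps(candles))
```

Passing the exchange in as a parameter keeps the paging and gap-checking logic independent of any one exchange, which is the same benefit the unified interface gives the system as a whole.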

Indicator Bots

We borrowed the name indicator to describe a type of bot because the main interest of a trading application is to produce Technical Analysis and Technical Studies. However, the infrastructure allows all sorts of calculations.
In Superalgos, an indicator bot takes other bots’ data products as inputs, processes them by applying a user-defined algorithm, and outputs a new dataset.
The data-processing logic may be nested in multiple product definitions within an indicator, or even across indicators, producing multiple data products that may feed other processes downstream.
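As an illustration of the input-to-output idea, the sketch below takes a candles data product and produces a new dataset holding a simple moving average of the closing price. The function name and the [timestamp_ms, open, high, low, close, volume] record layout are assumptions made for the example; in Superalgos the actual calculation would live inside the indicator's data-building procedure.

```python
from typing import List, Sequence


def simple_moving_average(candles: Sequence[Sequence[float]], period: int) -> List[List[float]]:
    """Return [timestamp_ms, sma_of_close] records, starting once `period` closes are available."""
    if period <= 0:
        raise ValueError("period must be positive")
    output: List[List[float]] = []
    window_sum = 0.0
    for index, candle in enumerate(candles):
        window_sum += candle[4]  # close price (assumed to be at position 4)
        if index >= period:
            window_sum -= candles[index - period][4]  # drop the close that left the window
        if index >= period - 1:
            output.append([candle[0], window_sum / period])
    return output


candles = [
    [0, 1.0, 1.0, 1.0, 1.0, 10.0],
    [60_000, 2.0, 2.0, 2.0, 2.0, 10.0],
    [120_000, 3.0, 3.0, 3.0, 3.0, 10.0],
    [180_000, 4.0, 4.0, 4.0, 4.0, 10.0],
]
print(simple_moving_average(candles, period=2))
```

The output is itself an array of records, so it could in turn serve as the input of another calculation downstream.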

Visual Definitions

Indicator bots may be created and defined from within a visual environment, and the only coding required is the actual data-building procedure. The definition of processes, dependencies, records, dataset structure, and so on, is done from within the visual-scripting interface.
The visual definition of processes and data products.



Dataset Structure

Indicators produce standardized datasets consisting of highly fragmented files with data stored in the form of arrays in the JSON format for every time frame: 1, 2, 3, 4, 5, 10, 15, 20, 30, and 45 minutes; and 1, 2, 3, 4, 6, 8, 12 and 24 hours.
... ,[1586044800000,1586131199999,6339.464,440.893,881.7869310984372], [1586131200000,1586217599999,6440.327,424.904,849.8082144837152],[1586217600000,1586303999999,6530.541,382.382,764.7649354906382], ...
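Here is a small sketch of reading records like the ones above. The first two values of each record look like begin and end timestamps in Unix milliseconds (each record spans exactly one UTC day), so the code decodes them as such. The meaning of the remaining three values is not stated here, so they are kept as an unnamed list. The surrounding brackets are added to turn the fragment into valid JSON.

```python
import json
from datetime import datetime, timedelta, timezone

sample = (
    "[[1586044800000,1586131199999,6339.464,440.893,881.7869310984372],"
    "[1586131200000,1586217599999,6440.327,424.904,849.8082144837152],"
    "[1586217600000,1586303999999,6530.541,382.382,764.7649354906382]]"
)


def to_utc(ms: int) -> datetime:
    """Convert Unix milliseconds to an aware UTC datetime without float rounding."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(milliseconds=ms % 1000)


records = []
for begin_ms, end_ms, *values in json.loads(sample):
    # Assumption: positions 0 and 1 are the begin and end of the period.
    records.append({"begin": to_utc(begin_ms), "end": to_utc(end_ms), "values": values})

for record in records:
    print(record["begin"].isoformat(), record["end"].isoformat(), record["values"])
```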
Files are stored in a tree structure of folders organized by Exchange > Market > Bot > Product > Time Frame and, in the case of lower time frames, files are also sorted by Year > Month > Day to improve accessibility.
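The folder hierarchy can be pictured with a short sketch like the one below. The helper name dataset_folder, the example folder names, and the zero-padded date folders are all assumptions made for illustration; the real folder names on disk may differ.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def dataset_folder(exchange: str, market: str, bot: str, product: str,
                   time_frame: str, day: Optional[date] = None) -> PurePosixPath:
    """Build Exchange/Market/Bot/Product/Time Frame, plus Year/Month/Day when a day is given."""
    folder = PurePosixPath(exchange, market, bot, product, time_frame)
    if day is not None:
        # Lower time frames are additionally split by date (padding is an assumption).
        folder = folder / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}"
    return folder


# Hypothetical names, for illustration only.
print(dataset_folder("binance", "BTC-USDT", "sensor", "candles", "24-hs"))
print(dataset_folder("binance", "BTC-USDT", "sensor", "candles", "01-min", date(2020, 4, 5)))
```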

Execution

We needed an execution environment that would work well both offline — for instance, to process data on which to run backtests — and in a real-time trading situation when raw data needs to be fetched from the exchange, processed in real-time, and fed to trading bots.
Bots are managed from within the visual environment too.

In a real-time data-mining operation, bots run in short bursts lasting seconds, usually at one-minute intervals. Because bots depend on other bots, the system tracks dependencies and coordinates the sequential execution of bots to maintain the chain of data processing from the moment the raw data is extracted from the exchange until the last indicator bot has run and stored its output.
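The dependency tracking described above amounts to running bots in topological order. Below is a minimal sketch using Python's standard graphlib module (Python 3.9+). The bot names and the run_cycle helper are hypothetical, and the real system coordinates separate processes rather than calling functions in a loop.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Order bots so that each one runs after the bots it depends on.

    `dependencies` maps a bot name to the set of bots whose output it consumes.
    graphlib raises CycleError (a ValueError) if the dependencies form a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_cycle(dependencies: Dict[str, Set[str]], run_bot: Callable[[str], None]) -> None:
    """Run one burst: every bot once, upstream bots first."""
    for bot in execution_order(dependencies):
        run_bot(bot)


# Hypothetical chain: a sensor feeds an indicator, which feeds another indicator.
dependencies = {
    "candles-sensor": set(),
    "moving-average": {"candles-sensor"},
    "bollinger-bands": {"moving-average"},
}
run_cycle(dependencies, lambda bot: print("running", bot))
```

In a live operation, a cycle like this would be repeated at each interval, so that every indicator always works on the freshest output of the bots upstream.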

Data Visualization

I already mentioned how obsessed we are with data visualization! We wanted to integrate the powerful data building capabilities with a robust, user-defined data visualization solution… so we invented plotters.
Plotters are devices defined from within the visual environment to create visual representations of entire datasets — with zero coding requirements. It’s visual scripting at its best!
Plotters… visual-scripting at its best!

You start by defining data points with [datetime, price] coordinates and use the points to define polygons, which in turn may be assigned graphic styles. The resulting graphics are overlaid on top of a timeline, where you may analyze all datasets, synchronized in time, in all time frames.
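In Superalgos all of this is defined visually, without code, but the underlying geometry can be pictured with a short sketch. The example below turns one candle into four [datetime, price] points forming the rectangle of the candle body, and picks a fill style for it. The function names, the style dictionary, and the colors are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (datetime as Unix milliseconds, price)


def candle_body_polygon(begin_ms: int, end_ms: int,
                        open_price: float, close_price: float) -> List[Point]:
    """Return the four corners of the candle body, clockwise from the top-left."""
    top = max(open_price, close_price)
    bottom = min(open_price, close_price)
    return [(begin_ms, top), (end_ms, top), (end_ms, bottom), (begin_ms, bottom)]


def polygon_style(open_price: float, close_price: float) -> Dict[str, str]:
    """Pick a style for the polygon: green if the price closed at or above the open."""
    return {"fill": "green" if close_price >= open_price else "red"}


polygon = candle_body_polygon(0, 59_999, open_price=100.0, close_price=105.0)
print(polygon, polygon_style(100.0, 105.0))
```

Because every point carries a datetime coordinate, polygons built this way from different datasets would line up on the same timeline.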
The system allows browsing data in all time frames along the timeline and makes all data products available as layers that may be overlaid on top of each other. It allows for creating multiple charts and even multiple dashboards with sets of charts.
Plotters produce custom data visualization over the charts.

Distributed Data-Mining

We wanted to enable different use cases, some of which required using outdated or minimalist hardware to process data, while others involved deploying vast resources.
Moving tasks across nodes in the network.

We ended up setting up the system to run distributed on a network of nodes. This allows setting up data-mining farms to analyze multiple dimensions of market data in real-time, across multiple pieces of hardware, with central management. It is the available hardware, not the software, that determines how much processing you may accomplish.
The network configuration and the distribution of mining tasks across the network are also done from an intuitive visual environment, with zero coding requirements.
Every time I start a new project, I know my whole world will end up revolving around it for years. Superalgos has been no exception. We are obsession-driven people!
What I just shared with you is probably around one-fifth of the functionality currently packed in the system and one-tenth of what remains to be developed!
We’d love to get some feedback from the data science community and hope that what we’ve created so far is of use to you.
All the information you need to get started is available in the Superalgos documentation site. You’ll find the links to download the software as well as video tutorials that walk you through the learning curve.

If you have any questions or need help, you may touch base with the Superalgos Community, where you’ll find other users, the developers, and myself.
Enjoy!
(Disclaimer: The author is a co-founder of Superalgos.)



Published by HackerNoon on 2020/08/16