Getting to know Tarantool 1.6

Evgeniy Shadrin (Sberbank Digital Ventures)

This is a translated transcript of Evgeniy’s talk at the HighLoad++ conference — #1 Internet/engineering/database conference in Russia — that took place in Moscow in November 2015. Original transcript available at https://habrahabr.ru/company/oleg-bunin/blog/319968/

If you’ve been following tech news for the last few years, you’ll have noticed that new NoSQL solutions are released almost every other week. Of course, not many of them establish themselves in the market, being ousted by the competition or fading into oblivion. But the fact is that the NoSQL ecosystem is constantly resupplied with new products.

At this conference, you can find both those who have never used NoSQL and those who have been using NoSQL in their projects and at their companies for over five years. Some attendees even contribute to open-source projects, but they’re relatively few.

My name’s Evgeniy and I work at Digital Ventures, a Sberbank division that implements innovative products and solutions. The long and the short of it is we create IT prototypes based on various cutting-edge technologies.

In this talk, I’d like to describe an example use case of a NoSQL solution, so let’s first quickly refresh the theory.

What exactly is NoSQL?

The acronym stands for “not only SQL” and refers to a class of solutions based on data models other than the relational one and designed with a specific purpose in mind — say, to simplify scaling. Since using NoSQL solutions doesn’t require specifying schemas, entities and endless configurations, it’s usually very easy to scale systems, deploy multiple clusters comprised of many nodes and to add/delete these nodes. Also, NoSQL solutions are often quite specialized: each group of developers is trying not to create a versatile project, but to handle a specific task. Such specialization makes for high performance when dealing with concrete issues. Using NoSQL solutions for these tasks can be handy and easy.

The slide above shows the most popular databases, which fall into several categories. You’ve heard about Redis and Riak: they use the key-value model for storing data. MongoDB, a document-oriented database, is quite widespread and well known. The document-oriented model is slightly more complex than the key-value model and allows storing massive hierarchical data. Then there are column-oriented databases, such as Apache HBase, which make it easier to work with lots of distributed information. One database that stands apart from the rest is OrientDB: it’s multi-model, but I classify it as a graph database. The graph model has one advantage: it’s very convenient for tracing links between data, which might come in handy when working on projects similar to social networks.

How do you avoid getting lost in this abundance and choose the solution that’s right for you? I make my decisions based on the following principles:

Don’t reinvent the wheel. I’ve seen many eager developers who were going to create their own little database that would suit their needs and store only the necessary data types. Turns out it’s easier said than done. Take Tarantool — this database has been in development for over four years now. It’s maintained by a team of professional developers, and they’re regularly faced with new issues. Many NoSQL solutions are more than ten years old. That’s why you should pick a database that’s good at solving your particular problem.
Most databases are created for performing particular tasks. If you understand your task well, you’re highly likely to find whatever solution you’re looking for.
I hope the “learn from others” point is clear. In the age of the Internet, it’s pretty easy to look things up online or boldly write the developers an email saying something like “Here’s my use case. What’s the best way to go about it?”. Many developers prove to be quite cooperative and do give you some advice.
If you’re lucky enough to have several tools capable of solving your problem, you shouldn’t spend too much time on benchmarking and testbedding to find out which one’s better. Just pick whichever tool you have experience with and save yourself some time studying a completely new technology. If you know a tool, that’s good; if your colleague knows it, not bad either: you can always ask them for a piece of advice.

Below are a few typical NoSQL use cases:

I have first-hand experience with most of them.

Data “caching” is a commonplace task for the well-known Memcached. Storing intermediate data also belongs here: it’s data you need quick access to, right now or at some specific point in time.
“Big data” might seem incorrect, since relational databases also work with big, massive data streams. What I meant here is exactly the word “stream” — say, you have a stream of server requests that you need to quickly save and then later figure out what to do with them. That’s what HBase, being a Hadoop database, is good at.
Queue services. NoSQL can be a part of a queue service. For example, I’ve seen the RabbitMQ + Redis bundle a couple of times — it’s a simple and easy-to-use NoSQL backend.
Statistical data processing is a separate use case, where, due to limited memory capacity, you don’t want to store all the data you receive. You can process this data on the fly and obtain relevant user features, normalize them and store as key-value vectors in, say, Redis; all the irrelevant features can be discarded.
You can also use NoSQL as a nifty little storage backend to, well, simply store things. MongoDB is a fast and easy-to-deploy database well suited for this task.

Of course, there are many more use cases, but listed above are only those I’ve personally encountered. At Sberbank Digital Ventures, I develop real-time systems that receive and save data from a server, process it to figure out what kind of data it is, and send a relevant response back to the server.

For example, I receive all the useful information I was able to gather about a user surfing the Internet, analyze it and, as a result, can segment this user, that is, determine that this user is, say, a 25-year-old man interested in cars or an 18-year-old girl applying to a university.

To solve this particular problem, I’m using a NoSQL database called Tarantool. Later I’ll tell you why I’ve chosen it and how it helps me deal with my tasks.

The slide above features a quote from the main page of the Tarantool team’s site. It’s their product statement: “A NoSQL database running in a Lua application server”. In other words, the developers themselves position Tarantool as a product consisting of two parts: a NoSQL database and a Lua application server.

Incidentally, notice how most NoSQL logos nowadays use gray and red colors. I guess they’re all the rage right now =)

There will be code snippets further down in the text, so if you feel like following along, check try.tarantool.org — it’s an interactive service that allows you to run a Tarantool instance (allocated specially for you on the developers’ servers) in your web browser. You can type my code examples directly in there.

So what exactly makes Tarantool stand out from the large NoSQL crowd?

Tarantool stores all data in RAM, which makes for really quick access to it. The fact that Tarantool stores everything in memory doesn’t mean it’s not safe and data can be lost. Tarantool has data persistence mechanisms — transaction logs and snapshots — that work together: you have save points and descriptions of operations performed on data before and after a particular save point. With this information on hand, you can always restore the data to a particular state.
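The interplay between save points and logged operations can be illustrated with a toy model in plain Lua. This is only a sketch of the general idea (load a snapshot, then replay the operations recorded after it); the names snapshot, xlog and recover are made up for this illustration and say nothing about how Tarantool implements recovery internally.

```lua
-- Toy model: a snapshot is a saved copy of the data,
-- the log is the list of operations performed after that snapshot.
local snapshot = { [1] = 'alice', [2] = 'bob' }

local xlog = {
    { op = 'insert',  key = 3, value = 'carol' },
    { op = 'delete',  key = 1 },
    { op = 'replace', key = 2, value = 'robert' },
}

-- Rebuild the state: start from the snapshot, then replay the log in order.
local function recover(snap, log)
    local state = {}
    for key, value in pairs(snap) do
        state[key] = value
    end
    for _, entry in ipairs(log) do
        if entry.op == 'delete' then
            state[entry.key] = nil
        else
            -- 'insert' and 'replace' both store a value under the key
            state[entry.key] = entry.value
        end
    end
    return state
end

local state = recover(snapshot, xlog)
print(state[1], state[2], state[3])
```

Replaying a longer log simply takes longer, which is one reason to take snapshots from time to time.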

Storing data in RAM used to quickly deplete memory resources in the past. To be fair, memory can get used up even nowadays, but RAM capacity is constantly growing, so in-memory databases are becoming increasingly widespread. Tarantool is based on a document-oriented model: it stores data in an abstraction called a document that has its own fields, which is exactly what Tarantool works with.

One peculiarity of Tarantool as a database is support for secondary indexes, which speeds up data processing and makes it more vivid and fun.

I haven’t used this feature in my project yet, but Tarantool supports full-blown transactions. As far as I know, some companies, such as Mail.Ru Group or Avito, successfully use them in their projects. Also, Tarantool has a lightweight thread (or so-called green thread) model: it’s a multi-thread model, whereby threads are created not on the Unix level, but inside the application itself, which allows implementing asynchronous things like event models.
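For readers curious about what a transaction might look like, here is a minimal sketch that assumes the box.begin, box.commit and box.rollback functions of Tarantool 1.6. The accounts_demo space and the transfer helper are hypothetical names invented for this example, and the snippet has to be run by Tarantool itself, not by a plain Lua interpreter.

```lua
-- Assumption: executed by Tarantool 1.6; box.cfg{} writes its files to the current directory.
box.cfg {}

-- Hypothetical demo space: {account id, balance}
local s = box.schema.space.create('accounts_demo')
s:create_index('primary', { type = 'tree', parts = {1, 'NUM'} })
s:insert({1, 500})
s:insert({2, 100})

-- Move money between two accounts; either both updates apply or neither does.
local function transfer(from, to, amount)
    box.begin()
    local ok, err = pcall(function()
        s:update(from, {{'-', 2, amount}})
        s:update(to, {{'+', 2, amount}})
    end)
    if ok then
        box.commit()
    else
        box.rollback()
    end
    return ok, err
end

transfer(1, 2, 200)
local balance1, balance2 = s:get(1)[2], s:get(2)[2]
print(balance1, balance2)

s:drop() -- clean up the demo space
```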

Besides, Tarantool can work with network and files: it has its own HTTP server and libraries that open and save files — this came in handy as well when I was working on my tasks.

Tarantool is a Lua application server, and Lua is Tarantool’s embedded language. Below is a contrived code example that would never be used in real life, but that illustrates the essence of Lua well:

#!/usr/bin/tarantool
-- This is a Lua script

function hw(a, b)
    print(a.hello .. b.world)
end

b = {}
a = { hello = 'Hello ' }

b['world'] = 'world!'

hw(a, b)

Lua was designed in Brazil, at the Pontifical Catholic University of Rio de Janeiro. It descended from SOL, a data-description language created for working with databases. As you can see, the snippet above is not just a script, but an executable script. At the top, we’re using a Unix shebang (#!), which specifies how this script should be run. If we type tarantool script.lua into the console, we’ll see Hello world! appear on the screen. The snippet contains a function that works with two objects, which are initialized below the function declaration.

The main data structure in Lua is a table. Objects a and b are tables, and I initialized them differently on purpose, just to show you that Lua is quite flexible and syntactically nice. These tables can contain some other data — for example, similar tables that, in their turn, can also contain other tables. Sometimes, due to lack of experience, I ended up having deeply nested structures. Functions can also be stored in tables. In fact, you can even treat a function object as a table — Lua provides special methods for it.
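The points about tables can be tried out in a few lines of plain Lua. This is a sketch with made-up names (user, greeter, counter); the last part shows one possible reading of the remark about special methods, namely a metatable with __call that lets a table be called like a function.

```lua
-- Two equivalent ways to fill a table
local a = { hello = 'Hello ' }
local b = {}
b['world'] = 'world!'

-- Tables can be nested inside other tables
local user = {
    name = 'Evgeniy',
    contacts = { email = { work = 'user@example.com' } },
}

-- Functions are ordinary values and can be stored in tables too
local greeter = {
    greet = function(x, y) return x.hello .. y.world end,
}

-- A table becomes callable once its metatable defines __call
local counter = setmetatable({ count = 0 }, {
    __call = function(self)
        self.count = self.count + 1
        return self.count
    end,
})

print(greeter.greet(a, b))
print(user.contacts.email.work)
counter()
print(counter())
```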

Below is a more practical script that can be improved upon and potentially deployed to production. It solves a small problem, and does it in a pretty straightforward way: it simply counts unique page visitors.

#!/usr/bin/tarantool
-- Tarantool init script

local log = require('log')
local console = require('console')
local server = require('http.server')

local HOST = 'localhost'
local PORT = 8008

box.cfg {
    log_level = 5,        -- DEBUG
    slab_alloc_arena = 1, -- GB of RAM for data
}
console.listen('127.0.0.1:33013')

-- Create the space only on the first run; after a restart it is restored from disk
if not box.space.users then
    local s = box.schema.space.create('users')
    s:create_index('primary', {type = 'tree', parts = {1, 'NUM'}})
end

function handler(self)
    local id = self:cookie('tarantool_id')
    local ip = self.peer.host
    local data = ''
    log.info('Users id = %s', id)
    if not id then
        -- First visit: store the IP address and hand out a cookie
        data = 'Welcome to Tarantool server!'
        box.space.users:auto_increment({ip})
        id = box.space.users:len()
        return self:render({ text = data }):setcookie({ name = 'tarantool_id', value = id, expires = '+1y' })
    else
        -- Returning visitor: show the id and the total number of users
        local count = box.space.users:len()
        data = 'Your id is ' .. id .. '. We have ' .. count .. ' users'
        return self:render({ text = data })
    end
end

httpd = server.new(HOST, PORT)
httpd:route({ path = '/' }, handler)
httpd:start()

This is a so-called executable Lua script that’s run by Tarantool and performs a series of predefined actions.

Let’s briefly go over the main portions of the script and then dwell on each in greater detail.

First I’m loading the necessary packages (log, console, server) via a Lua mechanism called require and then I’m declaring a couple of variables for later use.

After that, I’m configuring the Tarantool database via box.cfg, where I specify the two parameters that I need. I’m launching the console and creating database entities with box.schema.space.create('users'): here I’m creating a users space. I’ll talk about all of this a bit later.

The second part of the script works with a Tarantool server: I’m declaring a handler function to handle requests and further down I’m creating a server and a route. After that I’m launching this server.

From a user’s perspective, the execution of this script results in something like this:

When a user goes to, say, localhost, they see a welcome message. If this user refreshes the page, they’ll just be shown the number of unique page visitors, since by that time the user will have a cookie and be assigned some id.

This short script solves my problem, and this answers the question of why we’re using Lua.

Lua is a fairly simple language. The Internet abounds in “Lua in 15 minutes” and “Lua in 30 minutes” crash courses. It takes very little to start using it: in a couple of hours, you’ll know all of its peculiarities.

Tables being the main data structure in Lua, it’s very convenient to work with the rest of your data in the same way.

The standard Lua interpreter in and of itself isn’t particularly fast; it’s quite slow, in fact. But there’s an alternative implementation, LuaJIT, that performs JIT compilation, and it’s way faster. Lua owes much of its high performance to LuaJIT.

There’s a library called luafun that allows for functional-style Lua programming, and thanks to LuaJIT it’s lightning fast. You can look it up on the Internet and read performance reviews — it’s fascinating stuff.

Also, Lua is a great embedded language that boasts a seamless integration with C: C procedures can be run from inside Lua, and vice versa — this feature accounts for a wide adoption of Lua in game development. Fun fact: in a popular game World of Warcraft, a great number of extensions, quests and various game mechanics were and are being implemented in Lua.

Tarantool is a full-fledged Lua interpreter, which means once you run Tarantool, you can work with Lua. Just like that.

Tarantool can be run in two ways:

As an interpreter: just run Tarantool and execute commands line by line. It may come in handy if you’re unfamiliar with some command and simply want to execute it to see what it does.
Via init.lua (name is arbitrary, you may pick whatever you like), a startup script containing a series of commands.

Let’s now study the startup script I provided above in more detail.

It all starts with configuring the database via box.cfg — here, box is a module that contains a configurable cfg table. This module’s responsibility is working directly with the database. You can run Tarantool, execute some procedures or functions, print some messages, but you won’t be able to run the database without configuring box.cfg. In my example, I specified two important parameters that I need: a logging level of 5 (DEBUG) and slab_alloc_arena of 1 GB — this is the amount of RAM allocated for my data.

The box module contains a lot of other useful things, such as:

box.info — library that displays general information about Tarantool.
box.slab — important table for monitoring memory capacity.
box.stat — statistical library that shows the number of insert, select and other operations you performed.

If you type box.cfg in the Tarantool interpreter after specifying all the necessary parameters, you’ll get an object with all the available parameters described: not only those that I specified explicitly, but also the default ones.

On the slide above, you can see the two parameters I specified, as I mentioned earlier: slab_alloc_arena (RAM capacity of 1 GB) and log_level (5, or DEBUG). There are also some other important parameters like snapshot_count, which defines how many snapshots Tarantool should store. In this case, the 6 latest snapshots are kept. By the way, snapshot periods are regulated by a parameter called (you guessed it!) snapshot_period. It defaults to 3,600 seconds, that is, Tarantool will be taking snapshots hourly. Setting the appropriate safety level is up to you: you can configure Tarantool to take snapshots every minute or even every second, but that will severely affect overall performance. As for snap_dir and wal_dir, these parameters determine where you keep your snapshots and transaction logs, respectively.
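Put together, a configuration that sets these parameters explicitly might look like the sketch below. The parameter names are the ones discussed above; the directory paths are hypothetical, and I’m assuming both directories already exist before Tarantool starts.

```lua
-- Sketch of a Tarantool 1.6 configuration; paths are made up for illustration
box.cfg {
    log_level = 5,             -- DEBUG
    slab_alloc_arena = 1,      -- GB of RAM for data
    snapshot_period = 3600,    -- seconds between automatic snapshots
    snapshot_count = 6,        -- how many of the latest snapshots to keep
    snap_dir = './snapshots',  -- where snapshots are written (hypothetical path)
    wal_dir = './xlogs',       -- where transaction logs are written (hypothetical path)
}
```

Spelling the values out, even when they match the defaults, makes the trade-off between safety and performance visible to whoever reads the script later.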

The slide above illustrates the use of the box.info module. Here, you can get general information about Tarantool: if it’s run as a daemon, you can obtain its PID, version (at the time of this talk, the latest version is 1.6.5), uptime and current status.
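In code, reading this information could look roughly like this. It’s a sketch that must be run by Tarantool; the field names (pid, version, uptime, status, INSERT.total, arena_used and so on) are the ones I remember from 1.6 and should be treated as assumptions to check against the documentation for your version.

```lua
-- Assumption: executed by Tarantool 1.6 with a default configuration
box.cfg {}

-- General information about the running instance
local info = box.info
print('pid:    ', info.pid)
print('version:', info.version)
print('uptime: ', info.uptime)
print('status: ', info.status)

-- Per-operation counters
local stat = box.stat()
print('inserts:', stat.INSERT.total, 'selects:', stat.SELECT.total)

-- Memory arena usage
local slab = box.slab.info()
print('arena used:', slab.arena_used, 'of', slab.arena_size)
```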

After the configuration is over, you can turn to creating entities, or data itself inside Tarantool.

The slide above displays an image from the official documentation that details Tarantool’s data model: all data is stored in spaces; each space holds tuples (a tuple is analogous to a record in a relational database) and has a primary index and, optionally, secondary indexes.
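To make the model more tangible, here is a small sketch of a space with a primary and a secondary index, written against the Tarantool 1.6 API as I understand it. The space name model_demo, the index name by_ip and the sample tuples are invented for this example, and the code must be run by Tarantool.

```lua
-- Assumption: executed by Tarantool 1.6
box.cfg {}

local s = box.schema.space.create('model_demo')
-- Primary index: unique, on the first field (a number)
s:create_index('primary', { type = 'tree', parts = {1, 'NUM'} })
-- Secondary index: non-unique, on the second field (a string)
s:create_index('by_ip', { type = 'tree', unique = false, parts = {2, 'STR'} })

-- Tuples in the same space may have different lengths
s:insert({1, '10.0.0.1', 'first visit'})
s:insert({2, '10.0.0.2'})
s:insert({3, '10.0.0.1'})

local by_key = s:get(1)                          -- lookup via the primary index
local by_ip = s.index.by_ip:select({'10.0.0.1'}) -- lookup via the secondary index
print(by_key[2], #by_ip)

s:drop() -- clean up the demo space
```

The secondary index is what lets you find all visits from one IP address without scanning the whole space.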

Once I set all the necessary parameters, I need a space to store all my user data.

As you may have noticed, I’m creating the space inside an if statement, and I’m doing so on purpose. Suppose your Tarantool instance was stopped for some reason. If you have some snapshots and xlogs saved and you relaunch Tarantool, it will first load the latest snapshot and then replay the operations recorded in the xlogs written after it, thus restoring its state. In that case, the users space already exists, and Tarantool won’t let you create it a second time, so you’ll often see such if statements that help avoid unnecessary errors. If you don’t have a users space, it gets created, along with an index. In my example, it’s a primary tree index built on a single numeric field.

Further down in the script, I need to add new user records. It can be done with a regular insert operation, where the key is passed explicitly along with the rest of the tuple, but in this case it’s much easier with auto_increment: when a new user visits the page, they’re automatically assigned a key that’s one greater than the largest existing key, which, as long as no records are deleted, equals the current number of records plus one. If I want to know how many records I have in my database, I can use the built-in len() method. As you can see, the syntax is quite simple and clear.
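The difference between insert and auto_increment is easy to see in a short experiment. This sketch assumes Tarantool 1.6 and uses a throwaway space called autoinc_demo (a hypothetical name). As far as I can tell, auto_increment picks the largest existing primary key plus one, which is why the example starts from a key of 10 to make that visible.

```lua
-- Assumption: executed by Tarantool 1.6
box.cfg {}

local s = box.schema.space.create('autoinc_demo')
s:create_index('primary', { type = 'tree', parts = {1, 'NUM'} })

s:insert({10, '10.0.0.1'})                   -- key passed explicitly
local tuple = s:auto_increment({'10.0.0.2'}) -- key chosen by Tarantool

local new_key, count = tuple[1], s:len()
print(new_key, count)

s:drop() -- clean up the demo space
```

In the visitor counter the two numbers coincide only because keys start at 1 and nothing is ever deleted; taking the id from the tuple returned by auto_increment would avoid relying on that.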

As I mentioned earlier, Tarantool is not just a database, but a full-blown Lua application server. What the developers probably meant here is that you can write your own modules and packages in Lua and implement any missing logic that you need. Actually, you don’t reinvent one large wheel, but rather a few small ones if really necessary or if other solutions don’t have what you’re looking for.

You can find the details in GitHub repositories. Packages that are most often used are http and queue. For example, try.tarantool.org that I recommended at the beginning of my talk is written completely in Tarantool, with a Tarantool store and a Tarantool server. Also, Tarantool supports LuaRocks, a package manager that works with its own repository and makes installing packages a breeze — it’s done with just one command.

Let’s talk about packages now. The first thing to know about them is that they need to be loaded.

A package is another Lua script containing some logic. By loading a package you can use methods, data and variables defined in it. On the slide above, I’m loading two packages (console and log) via Lua’s require mechanism.
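What a package looks like from the inside can be shown in plain Lua. Normally the module would live in its own file; in this sketch I register it through package.preload so that the example is self-contained. The module name greeting and its contents are made up for illustration.

```lua
-- Normally this function body would be the contents of greeting.lua (hypothetical file).
-- package.preload registers it inline so that require can find it.
package.preload['greeting'] = function()
    local M = {}

    M.default_name = 'world'

    function M.hello(name)
        return 'Hello, ' .. (name or M.default_name) .. '!'
    end

    return M -- whatever the module returns is what require hands back
end

local greeting = require('greeting')
print(greeting.hello())
print(greeting.hello('Tarantool'))

-- require caches loaded modules, so a second call returns the same table
print(require('greeting') == greeting)
```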

I’m launching the console on localhost and making it listen on port 33013. With the log package, I can write to my log. The console, in this context, is an admin console, or remote control console, that allows monitoring Tarantool’s state. It’s not that tricky to use: if you have your console running, you can connect to it with standard Unix utilities such as telnet, optionally wrapped in rlwrap. telnet is used for connecting to the port, while rlwrap comes in handy for editing commands and keeping command history.

You can connect to a Tarantool instance that’s currently running and get some information from box.info or box.stat.

One package that I use most often is http. It’s an HTTP server with limited functionality, but it supports many useful mechanisms. On the slide above, I’m loading the package, creating a server and a route and then launching the server. The handler function returns a server response as text and assigns a cookie to the user: the cookie’s name is tarantool_id (name = 'tarantool_id') and its value is the user’s id (value = id). I’m also specifying the expiration date, that is, when the cookie gets deleted; in my example, cookies are stored for one year.

http’s main mechanisms allow you to implement some basic logic, as the package provides both a full-fledged server and a client. http works with cookies and supports templates with embedded Lua, which means that you can write little Lua procedures inside HTML.

#!/usr/bin/tarantool
-- Tarantool init script

local log = require('log')
local console = require('console')
local server = require('http.server')

local HOST = 'localhost'
local PORT = 8008

box.cfg {
    log_level = 5,        -- DEBUG
    slab_alloc_arena = 1, -- GB of RAM for data
}
console.listen('127.0.0.1:33013')

if not box.space.users then
    local s = box.schema.space.create('users')
    s:create_index('primary', {type = 'tree', parts = {1, 'NUM'}})
end

I tried to go over the basics of my example script, so it should make more sense to you now. To make sure you have it down, let’s briefly review it once again. What we have is an executable Lua script with a comment on top. First I’m loading packages via require. Then I’m declaring two variables, HOST and PORT. After that I’m configuring the Tarantool database via box.cfg, where I’m specifying two parameters: log_level (logging level) and slab_alloc_arena (necessary RAM capacity).

I’m creating an admin console that I’ll be using further down in the script. Then, if I don’t have a users space, I’m creating it with box.schema.space.create and setting an index on it.

httpd = server.new(HOST, PORT)
httpd:route({ path = '/' }, handler)
httpd:start()

In the handler function, I’m reading the page visitor’s tarantool_id cookie and IP address and writing the id to my log. If there’s no tarantool_id cookie, I’m adding the visitor’s IP address to my database with auto_increment, looking up their id and returning a welcome message; the cookie value gets set to the visitor’s id (value = id). Otherwise, I’m counting how many records I have in my database and showing the visitor their id and the number of unique visitors. At the bottom of my script, after the function declaration, I’m creating the server, registering the route and starting it.

It’s a relatively simple example, but, given all the modules and Lua’s extensibility, it can iteratively be improved upon until it’s fit to be used in real-life projects.

Tarantool has lots of different packages. There’s one for working with JSON, there’s a package called fiber (I’ll provide more details on it a bit later), yaml, a cryptographic library digest (contains basic encryption mechanisms). Also, Tarantool has a package of non-blocking sockets, so you can work over network and implement various protocols. There’s a package that allows working with MessagePack, and a library called fio (file input/output) for handling files. One particularly interesting mechanism is net.box that enables Tarantool to work over the binary protocol — say, with another Tarantool instance. It’s very fast and convenient. You can also find net.box.sql that allows interacting with relational SQL databases.

Fibers are so-called lightweight threads based on the green thread model. Their main difference from regular threads is that they’re created and work inside Tarantool, so it takes very little time to create them and they have fairly low switch time. They may come in handy if you’re implementing an asynchronous model or if you need to launch a daemon that performs some side task in parallel with the main one.

Basic principles to keep in mind when working with fibers: a fiber is created with fiber.create, it can be put to sleep with fiber.sleep, and a fiber object can always be cancelled if you want to stop working with it.

fiber.time is a handy function that returns the current time as tracked by the event loop.
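The fiber basics above fit into one small script. This is a sketch that assumes the fiber package of Tarantool 1.6 and must be run by Tarantool; the ticker function and the chosen intervals are arbitrary values picked for illustration.

```lua
-- Assumption: executed by Tarantool 1.6
local fiber = require('fiber')

local ticks = 0

-- Background task: count ticks, yielding to other fibers between them
local function ticker(interval)
    while true do
        ticks = ticks + 1
        fiber.sleep(interval)
    end
end

local started = fiber.time()
local f = fiber.create(ticker, 0.1) -- the new fiber starts running right away
fiber.sleep(0.35)                   -- the main fiber waits; the ticker keeps working
f:cancel()                          -- stop the background fiber

local elapsed = fiber.time() - started
print(ticks, elapsed)
```

Because fibers switch only at yield points such as fiber.sleep, the counter can be shared between them without any locking.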

A very popular library built with the fiber library is expirationd that, based on some predefined criteria (usually it’s time), deletes records from your database: say, everything older than a month gets removed.

I could go on and on about Tarantool, but I don’t know all there is to know about it. I doubt even the developers know everything. You can always check the official documentation at tarantool.org; it’s become more readable lately.

Tarantool supports most Unix-like systems — the team has their own Buildbot — and we at Sberbank Digital Ventures constantly keep an eye out for new packages, since we have Red Hat Enterprise Linux installed on our machines. The developers also maintain the official Tarantool package shipped with Debian.

One thing I like a lot about Tarantool is that you can contact the Tarantool dev team. I had some questions, so I just found some team members on Skype and pinged them. Konstantin Osipov, principal Tarantool developer, gave a short talk on queues at this conference. Developers, especially those new to the field, find it very important to be able to ask questions and learn first-hand how best to approach a particular problem. You need to be prepared for the fact that the open-source community is quite peculiar. Perhaps this image will tell you more than I’d ever be able to:

At the same time, interacting with community members may be an exciting experience that helps you grow and make your projects a little better.

I’d like to wrap up this talk by sharing a few takeaways with you.

Each NoSQL solution has its own application. It’s often very difficult to say what database is better or worse, or more or less performant. They are just different and usually created for solving different problems.

Development tools are extremely important: if chosen well, they allow you to speed up and simplify the development process and avoid lots of unnecessary problems. But you shouldn’t forget about what’s more important still: your ideas and end goal. After all, every developer’s objective is to solve a problem at hand, bring their ideas to life and make this world a slightly better place.

I hope I’ve managed to persuade you that Tarantool isn’t that complicated and you can start using it as well. Thanks for your attention!