How to Make Rough Estimates of SQL Queries

Written by asundukov | Published 2023/02/09

TL;DR: Each database runs a query through a standard process (with some deviations depending on the DB implementation): syntax parsing, semantic parsing, planning, executing. The most time-consuming part is usually execution, because the database must go to disk, fetch some data, and send it to the client. The other stages mostly use the CPU and some cached metadata.

Imagine writing increasingly complex queries: more than just retrieving one entity by its primary key, especially if the query runs in a loop or in an environment that is extremely intolerant of long-running queries. In more complex cases, you also don’t have access to the production database and can’t execute a real query on live data and production hardware.

How do you make rough estimates of SQL query execution time?

Stages of query processing

Each database runs a query through a standard process (with some deviations depending on the DB implementation):

  • Syntax parsing

  • Semantic parsing

  • Planning

  • Executing

The most time-consuming part is usually execution, because the database must go to disk, fetch some data, and send it to the client. The other stages mostly use the CPU and some cached metadata. Planning builds candidate plans, compares them, selects the best one, and hands it to the executing stage. The executing stage runs the plan: it goes to disk, loads data, sorts or filters it, goes to disk again for another chunk of data, joins datasets, and so on until the plan is complete and the result is ready to be sent to the client. These operations usually take up most of the total query time.
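
Conceptually, the planning stage can be pictured as a small loop that scores each candidate plan and keeps the cheapest one. A minimal sketch of that idea (the candidatePlans collection, the estimateCost function, and the stats object are hypothetical placeholders, not a real planner API):

// Conceptual sketch only: score every candidate plan and keep the cheapest
Plan best = null;
for (Plan candidate : candidatePlans) {            // candidatePlans is hypothetical
  double cost = estimateCost(candidate, stats);    // estimateCost and stats are hypothetical
  if (best == null || cost < estimateCost(best, stats)) {
    best = candidate;
  }
}
execute(best);                                     // the winning plan goes to the executing stage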

Execution plans

Let’s look at what happens with this sample query:

SELECT u.id, u.last_login, ud.phone
FROM user u
JOIN user_detail ud ON ud.user_id = u.id
WHERE u.firstname = 'John';

From the software perspective

Let’s write some plans to run such a query.

For example:

  • The direct way, which should be the best in most cases:
for (User u: Users) {
  if (u.firstname.equals("John")) {        // string comparison, not ==
    for (UserDetail ud: UserDetails) {
      if (ud.user_id == u.id) {            // comparison, not assignment
        result.put(u.id, u.last_login, ud.phone);
      }
    }
  }
}
return result;

  • If we know that almost all users are “Johns” and only a small part of them have a user_detail record, the following code should work faster.
for (UserDetail ud: UserDetails) {
  for (User u: Users) {
    if (ud.user_id == u.id && u.firstname.equals("John")) {   // drive the join from the smaller table
      result.put(u.id, u.last_login, ud.phone);
    }
  }
}
return result;

Additional data

Indexes. They are additional structures and, in terms of code, usually look like pre-calculated maps, typically TreeMaps. Take the first plan, for example: a user_id to UserDetail map can be recalculated on each change of the table, and that structure can speed up our select query considerably.
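
As a rough sketch of that idea (reusing the same in-memory Users and UserDetails collections as the plans above; the map names match the plans below, although real database indexes are on-disk B-trees rather than Java collections):

// user_id is unique in user_detail, so an ordinary sorted map is enough
TreeMap<Long, UserDetail> user_id_to_UserDetails = new TreeMap<>();
for (UserDetail ud: UserDetails) {
  user_id_to_UserDetails.put(ud.user_id, ud);
}

// firstname is not unique, so this index maps a name to a list of users
TreeMap<String, List<User>> firstname_to_User = new TreeMap<>();
for (User u: Users) {
  firstname_to_User.computeIfAbsent(u.firstname, k -> new ArrayList<>()).add(u);
}

// Both maps would have to be updated on every insert, update, and delete.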

  • Using an index over user_id in UserDetails
for (User u: Users) {
  if (u.firstname.equals("John")) {
    result.put(u.id, u.last_login, user_id_to_UserDetails[u.id].phone);
  }
}
return result;

  • Using an index over firstname in User. Since firstname is not a unique value, this index works like a MultiValuedMap (one name maps to many users).
for (User u: firstname_to_User["John"]) {
  for (UserDetail ud: UserDetails) {
    if (ud.user_id == u.id) {
      result.put(u.id, u.last_login, ud.phone);
    }
  }
}
return result;

  • Using both indexes
for (User u: firstname_to_User["John"]) {
  result.put(u.id, u.last_login, user_id_to_UserDetails[u.id].phone);
}
return result;

So, we have plenty of plans. How do we choose the best one?

At first glance, it looks like the last plan should be the best choice. But that is not true in every situation.

Let’s find out why.

Statistics and hardware data

The database has already completed several queries on the table, so it knows the amount of data and some other things, such as how and where the data is stored. The database also knows how and where the additional structures (indexes) are stored.

In the end, each plan has its own complexity. That complexity also depends on the additional structures, because some useful things can be precalculated. The final timing (or cost) of running depends on the speed of the hardware, and not only on the amount of data or the number of requests needed to retrieve it, but also on the nature of those requests: they can generate both sequential and random reads, and this can greatly affect the final timing.

Executing a query

To select the best plan, the database should have:

  • the plans themselves;

  • knowledge of pre-calculated things such as indexes or caches from earlier queries;

  • knowledge of the location of the data and the additional structures (disk or memory cache);

  • knowledge of how the disk and memory perform (speed of sequential and random access, number of IO operations per second).

All this helps the database choose an optimal plan to run. Moreover, the best plan can change during the database’s lifetime as the table itself changes. At one point the table may fit entirely in memory; at the next, it won’t. At one point the index may fit completely in memory; at the next, it won’t. In some cases, the data occupies less space than the indexes themselves.

The good news is that we don’t need to be very precise about the timing of our query. All we have to do is predict the most likely scenarios for our plan.

Searching for the best scenario

Each table can be read from disk, with some filters applied, in two ways:

  • by full scan. The DB loads all entities of the table, iterates through them, and filters them. This method usually generates sequential reads from storage;
  • by index. The DB loads an index, searches it for the necessary positions, then loads only the requested rows for further processing. This method usually generates random reads from storage.

Let’s compare both approaches with an example

To get the rows of the table where firstname = 'John':

If there is no index on firstname, the database:

  • loads the entire table (all records). This operation requires sequential reads from the disk system, and the number of operations depends entirely on the table size;

  • filters all entries by the firstname field;

  • sends the result to the client.

If there is an index on firstname, the database:

  • searches the index and receives the positions of the entries with firstname = 'John'. This operation requires a few random reads from the disk and depends on the index size and row count, but since the search has only logarithmic complexity, it isn’t many requests, usually just a few;

  • loads the records one by one directly from their positions;

  • sends the result to the client.

To assess the timing of these approaches, we must know the specifications of the disk system and the sizes of the tables and indexes. For example, let’s take typical HDD characteristics: up to 500 MB/s for sequential reads and up to 300 operations per second for random reads. Assume we have 1 million rows, each averaging 50 bytes. This means we have a table size of ~50 MB and an index size of 6-7 MB (assuming an average of 6 characters per first name). Also assume there are ~300 Johns in the table.

So, to calculate the timings of both approaches:

  • with no index, we must read 50 MB from the disk and filter it on the CPU. Reading is much slower than the CPU work, so it will take 99%+ of the time. This approach will cost us 50 / 500 = 0.1 sec;
  • with the index, we first read the positions of all records with firstname = 'John', which takes just a few reads.

Then we need to do up to 300 random reads from the disk to fetch all the required rows one by one. Some of them may sit on the same page, so the actual timing could be lower, but as an upper bound it is 300 reads: 300 reads / 300 ops per second = 1 sec. This means that the second approach takes up to one second to complete.
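
The same back-of-the-envelope arithmetic, written out as a tiny sketch (every constant is one of the assumptions above, not a measurement):

// Rough cost model for the example above; all numbers are assumptions
double tableSizeMb       = 50.0;   // 1,000,000 rows * ~50 bytes
double seqReadMbPerSec   = 500.0;  // HDD sequential throughput
double randomReadsPerSec = 300.0;  // HDD random IOPS
double matchingRows      = 300.0;  // users with firstname = 'John'

double fullScanSeconds  = tableSizeMb / seqReadMbPerSec;     // 50 / 500  = 0.1 sec
double indexScanSeconds = matchingRows / randomReadsPerSec;  // 300 / 300 = 1.0 sec

System.out.println("full scan:  " + fullScanSeconds + " sec");
System.out.println("index scan: " + indexScanSeconds + " sec");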

This highlights that a database doesn’t always benefit from an index; in some cases, a full scan of the whole table is simply more efficient. It also shows, in general terms, how databases estimate execution plans and how they choose the best one.

Other optimizations

Databases have other tools to speed up queries as well. But in the end, all of them work with the disk and memory systems, trying to reduce disk operations or move them into memory.

For example, when a query can use several indexes over the same data, the database can combine them with a bitmap operation. This is an operation in which the database loads two or more indexes at the same time and performs logical operations over them: union for OR and intersection for AND. Then the database loads the table rows using the result of those operations over the indexes.

So in that case, the whole operation requires fewer and smaller disk reads, some CPU work, and random reads from the disk to fetch the matching table rows.
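
A very simplified sketch of that idea, using sorted sets of row positions instead of real on-disk bitmaps (the second condition on city, the index handles, and the helper functions are made up purely for illustration):

// Row positions matching each condition, taken from two different indexes
TreeSet<Long> byFirstname = positionsFromIndex(firstname_index, "John");   // hypothetical helper
TreeSet<Long> byCity      = positionsFromIndex(city_index, "London");      // hypothetical helper

// AND -> intersection of the two position sets
TreeSet<Long> matchesAnd = new TreeSet<>(byFirstname);
matchesAnd.retainAll(byCity);

// OR -> union of the two position sets
TreeSet<Long> matchesOr = new TreeSet<>(byFirstname);
matchesOr.addAll(byCity);

// Only now does the database touch the table itself: one random read per surviving position
for (Long position : matchesAnd) {
  result.add(readRowAt(position));   // hypothetical helper
}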

Knowing about typical internal DB operations helps developers not only estimate queries but also choose the right set of indexes and write faster queries.

Conclusion

One important caveat today is that we use clouds everywhere, and it can be difficult to find out what hardware sits deep inside the cloud provider, which makes estimation harder. Nevertheless, there are laws of physics that we cannot circumvent: every read takes time, and every operation has a latency. And unless you have terabytes of memory in your database setup, this is enough knowledge to make reasonable estimates for typical queries.

