Super Duper SQL Tips for Software Engineers

Written by deft | Published 2022/07/19
Tech Story Tags: javascript | sql | java | software-development | software-engineering | distributed-systems | sql-server | sql-beginner-tips

TL;DR: In this post, we will talk about how you can improve your database queries and speed up your applications. Everything is collected from my own experience and from things found on the Internet. Running explain on a query is a more constructive option than judging speed by eye or sitting with a stopwatch, and a simple selectivity query over a table tells you whether an index is justified: the lower the value of the second column, the more likely it is that the index will pay off.

In this post, we will talk about the features of working with SQL: how you can improve your database queries and speed up your applications.
Everything described in this piece is based on my experience and on bits found on the Internet. Everything that was found was, of course, tested.

How to check the speed of a query? Prove that it works fast.

When I want to check a massive select query, I make sure to run it through explain on a stage environment with near-current data:

explain <query text>;
-- or
explain analyze <query text>;

The answer is a wall of not-very-comprehensible information that you then try to make sense of. What you need to know is that this command exists, and it proves, even to yourself, how optimal a query is or how much it has improved after optimization. This is a more constructive option than judging by eye or sitting with a stopwatch.
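For illustration, here is roughly what this looks like on a hypothetical users table (the table and the numbers are made up):

explain analyze select * from users where email = 'a@b.c';

-- Illustrative output (abridged):
-- Seq Scan on users (cost=0.00..431.00 rows=1 width=72) (actual time=0.04..3.21 rows=1 loops=1)
--   Filter: (email = 'a@b.c'::text)
-- Planning Time: 0.11 ms
-- Execution Time: 3.25 ms

A Seq Scan over a big table for a single-row lookup like this is usually a hint that an index is missing.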

Indexes in databases. Or how to speed up my select query.

Of course, you can't get anywhere without indexes: after analyzing a query you will usually find suboptimal reads. You can try to add filtering and so on, but if I decide to add an index on some fields, I try to check the data selectivity first.


Of course, the size of the table matters too, but that is a topic for a separate discussion; what matters here is how the data is distributed in the table.
I estimate whether indexing a field makes sense like this:

select
    column_name,
    count(1)::float / (select sum(1) from table_name) * 100
from table_name
group by column_name
order by 2 desc;

The query may not be optimal, but it's enough for our purposes: the lower the value in the second column, the more likely it is that an index will be justified.
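For illustration, here is what the output might look like for a hypothetical status column in an orders table (all values are made up):

 status   | pct
----------+------
 pending  | 52.0
 shipped  | 47.9
 refunded |  0.1

Filtering on pending or shipped will hardly benefit from an index, while queries that filter on rare values like refunded can.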

Be sure to check the use of indexes.

After creating indexes, or after inheriting already-created ones, it is worth checking whether they are used at all. The primary query that can be used:

SELECT
    t.schemaname,
    t.tablename,
    c.reltuples::bigint AS num_rows,
    pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
    psai.indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
    CASE WHEN i.indisunique THEN 'Y' ELSE 'N' END AS "unique",
    psai.idx_scan AS number_of_scans,
    psai.idx_tup_read AS tuples_read,
    psai.idx_tup_fetch AS tuples_fetched
FROM
    pg_tables t
    LEFT JOIN pg_class c ON t.tablename = c.relname
    LEFT JOIN pg_index i ON c.oid = i.indrelid
    LEFT JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
WHERE
    t.schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2;

Whatever is not used is a candidate for deletion. More precisely, I prefer to drop an index if it takes up a lot of space but is never scanned, or if mutations on the table are slow.
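A sketch of the deletion itself, assuming the statistics above showed an unused index; CONCURRENTLY avoids taking an exclusive lock on the table:

drop index concurrently index_name;
-- note: drop index concurrently cannot be run inside a transaction block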

Broken and bloated indexes.

Indexes can become corrupted or bloated; a corrupted index can break so badly that queries against the table fail. In that case you can recreate the index. Also, when I see that an index size looks suspicious, my hand reaches out to rebuild it:

REINDEX INDEX index_name;

REINDEX TABLE table_name; -- recreates all of the table's indexes
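Keep in mind that plain REINDEX blocks writes to the table while it runs. On PostgreSQL 12 and newer there is a gentler option:

REINDEX INDEX CONCURRENTLY index_name;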

The "in" expression and what to do with it.

Sometimes you have to use the in (value1, value2, ...) expression, and sometimes you need to search by tuples; more precisely, like this:

select *
from table_name
where (column_one, column_2) in (('1','2'), ('h', 'z'));

Sometimes you have to use the in (...) expression in queries. Passing many values as query parameters is not great, but life sets its own priorities, so it's helpful to know about this construction:

select *
from table_name 
where column_name in (values ('a'), ('b'), ('c'));

Here, values creates a virtual table, and with a large list (say, 10K values) the fetch will be a little faster. But again, stuffing that many values into in is evil.

Check yourself. Check your query.

Before executing a query, or adding it to a migration, that changes data in the database, I check its execution time and results on a pre-prod/prod database. I do it through a transaction with a rollback:

begin;
-- the query under test
-- queries that check the resulting data
rollback;
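For example, with a hypothetical orders table:

begin;
update orders set status = 'archived' where created_at < now() - interval '1 year';
-- check the effect before deciding anything
select count(1) from orders where status = 'archived';
rollback; -- nothing is committed, but you now know the timing and the result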

In addition to letting you check the correctness of the data changes, this approach also shows how long the changeset will take to roll out. This is especially relevant for projects that do not separate migration roll-out from application start-up: when deployed in an orchestrator (say, in k8s), if a slow migration delays the instance from reaching the live state, the scheduler will kill the instance and an endless cycle of restarts begins.


Remember that it is better not to perform this check on the prod database, especially if you have replication enabled or the query can lock records (and it will) for a critical amount of time. You can kill your prod and/or overload the prod DB this way.


A little deeper on what happens with replication: during any transaction, WAL (Write-Ahead Log) records are written, and they are written even for a transaction that ends in a rollback. Only at the end is the rollback recorded, and the whole sequence is then replayed on the slave. From this we again conclude that it is better to do this on near-real data, on a stage.
You can also add various kinds of fields this way, BUT keep in mind that changing the DDL (table structure) can take locks. This is especially true for PostgreSQL versions < 11.

Notes about materialized views.

Everyone knows about views, but many databases also have materialized views. They are almost like ordinary views, but materialized.


Here, I want to highlight that they need to be refreshed periodically… If you run REFRESH MATERIALIZED VIEW name_view, all queries against the view will be blocked while it runs, but a consistent state is guaranteed. If strict consistency is not critical for you, you can run REFRESH MATERIALIZED VIEW CONCURRENTLY name_view instead.
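A minimal sketch, with a hypothetical orders table. Note that the CONCURRENTLY variant requires a unique index on the materialized view:

create materialized view daily_totals as
select created_at::date as day, sum(amount) as total
from orders
group by 1;

create unique index on daily_totals (day); -- required for concurrent refresh

refresh materialized view concurrently daily_totals;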

Let the whole world wait! Notes about locking.

Here, I would like to remind those who do not know that there are pessimistic locks in databases…
One caveat: it is better not to use this mechanism at all, but it happens that a conditional select * from table_name where id = ? for update comes in handy.
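A minimal sketch with a hypothetical accounts table:

begin;
-- the selected row stays locked until commit/rollback; concurrent writers will wait
select * from accounts where id = 42 for update;
update accounts set balance = balance - 100 where id = 42;
commit;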

About exclusive locks

Sometimes an incident occurs in which a bunch of queries don't just run slowly but stop completely. This applies primarily to mutating operations. In this situation it is worth looking at the blocked queries and trying to understand what is holding them up.

See the example that follows:

SELECT blocked_locks.pid AS blocked_pid,
       blocked_activity.usename AS blocked_user,
       blocking_locks.pid AS blocking_pid,
       blocking_activity.usename AS blocking_user,
       blocked_activity.query AS blocked_statement,
       blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

It can also be useful to look at the currently active queries in general:

select * from pg_stat_activity;

How to filter it you can quickly figure out yourself; here is one example:
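For instance, to see the longest-running non-idle queries (a sketch; the columns come from pg_stat_activity):

select pid, state, now() - query_start as runtime, query
from pg_stat_activity
where state <> 'idle'
order by runtime desc;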

Notes about table partitioning.

If the table has grown, well, indecently heavy, you can partition it.
More precisely: if I know in advance that a table will be significant, I include partitioning in its design. The main note for me as a developer is that queries against partitioned tables must use the partition key; otherwise you can create very slow queries.

The partitioning key can be, say, a date. You can also make nested partitions with a different partitioning key (say, a city), but then you will need to use both that date and the city in queries against the partitioned tables.

Based on the above, partitioning should be included in the design as early as possible wherever there is confidence that it will be needed.
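A minimal sketch of declarative range partitioning (PostgreSQL 10+), with a hypothetical events table:

create table events (
    id bigint not null,
    created_at date not null,
    payload text
) partition by range (created_at);

create table events_2022_07 partition of events
    for values from ('2022-07-01') to ('2022-08-01');

-- this query can be pruned down to a single partition:
select * from events where created_at = '2022-07-19';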

About master/slave.

If you have a loaded system and need to move read operations off the master database, please consider one point: the data on the slave may lag indecently far behind the master. You can see whether this is happening with a query like this, run on the replica:

select pg_last_xact_replay_timestamp();
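To turn that timestamp into an actual lag value, you can run something like this on the replica:

select now() - pg_last_xact_replay_timestamp() as replication_lag;
-- returns NULL when run on the master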

Instead of a conclusion.

I hope the points described here give you a deeper understanding of working with SQL. There is a lot of information on the Internet on each topic, and I believe that what I have written will serve as a starting point for improving your SQL skills.


Written by deft | Senior Software Engineer with 7+ YoE building massively scalable systems both from scratch and diving into a codebase
Published by HackerNoon on 2022/07/19