Practical scaling techniques for web sites

Written by derwiki | Published 2016/05/24
Tech Story Tags: web-development | programming | ruby-on-rails | database | startup


After 8 years of working on high-volume web sites, I’ve seen a lot of interesting scaling techniques. While it’s true that very specific strategies can eke out enormous efficiency gains, complication doesn’t come without a cost. In my experience, scaling problems that a moderately-sized site will encounter can be solved with:

  • Asynchronous work queue
  • Proper database use
  • Appropriate caching

The examples below are written in the context of Ruby on Rails, but should translate well to other languages and frameworks.

Asynchronous Work Queue

As a web site grows, controllers can easily become bloated and slow down with all the additional tasks they might have to perform, such as:

  • Analytics tracking
  • Sending emails
  • Creating additional database records
  • Inadvertent N+1 query side effects

For example, imagine a user wants to delete their profile and the thousands of records associated with it. That can take a while, and it doesn't have to happen immediately: the response could simply acknowledge that the request has been received and promise an email when the deletion finishes. You could then perform the actual delete in an asynchronous work queue. An async work queue consists of:

  1. Queue of performable jobs and job parameters (often in Redis, MongoDB, MySQL, etc).
  2. Pool of workers that pluck jobs from the queue and perform them.
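As a concrete sketch of the profile-deletion example, here's roughly what it might look like with Sidekiq (the job, model, and mailer names are illustrative):

class DeleteProfileJob
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    user.destroy!  # the slow part: cascades through associated records
    ProfileMailer.deletion_complete(user.email).deliver_now
  end
end

# In the controller: acknowledge immediately, defer the real work.
def destroy
  DeleteProfileJob.perform_async(current_user.id)
  head :accepted
end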

Basically, anything that can be deferred from the controller should be:

  • API requests to 3rd party services. These are a must-defer because their response time could be variable/slow and the web request shouldn’t block rendering the response. Even worse, an API provider that is failing could bring down your site if you don’t properly detect and abort slow responses. Sidestep the problem and perform as many API calls as possible in async workers.
  • Email. Basically an API request.
  • Database record creation. While often orders of magnitude faster than an API request, inserts can still take 10–100 milliseconds because of factors like database load and the number of indexes and foreign key constraints.

The only reason not to defer is if the result is important to the response; e.g. a payment request might block so that the customer is immediately notified that payment has failed.

Outside of the web request cycle, async work queues are very useful to parallelize large work loads. For example, imagine a daily scheduled script that announces new products to a 10k email distribution list. If this script sends all the emails in sequence, this could take upwards of an hour. With a dozen async workers this might take closer to 5 minutes. It’s also simple to temporarily scale the worker count if you need to handle job/traffic spikes.
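Fanning that workload out is mostly a matter of enqueueing one small job per recipient; a sketch, assuming a Sidekiq-style worker and an EmailSubscription model (both hypothetical):

# One lightweight job per recipient; a dozen idle workers will drain
# the queue in parallel instead of one process sending serially.
EmailSubscription.pluck(:email).each do |address|
  AnnouncementEmailJob.perform_async(address)
end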

Asynchronous workers are tremendously effective for a variety of problems and have a low implementation cost, making them a pragmatic solution in a variety of situations.

Proper Database Use

The programming language your web site is written in is probably very general-purpose: it's optimized for flexibility at the cost of efficiency. A database, on the other hand, is a highly optimized computation engine for relational algebra. Literally hundreds (if not thousands) of books have been written about databases, so I'm only going to mention a few common scenarios I run across.

N+1 Query Problems

One of the biggest complaints I've heard about Object Relational Mappers (ORMs), and especially ActiveRecord in Rails, is that they make it really easy to write inefficient queries without knowing it. Consider, for example:

User.all.each { |u| puts u.address }

Seems pretty innocuous. But what’s not obvious is that Rails is doing one query to load the User relation, and then in each iteration of the loop querying the Address table:

SELECT * FROM users;

SELECT * FROM addresses WHERE user_id = 1;

SELECT * FROM addresses WHERE user_id = 2;

...

This class of performance problem is called an N+1 query problem, and it's easy to identify with a quick scan of your SQL log. The fix is just as straightforward: instead of querying in a loop, fetch all of the addresses in one statement:

SELECT * FROM addresses WHERE user_id in (SELECT id FROM users);

Or, equivalently, with a join:

SELECT * FROM users INNER JOIN addresses ON (users.id = addresses.user_id);

Ruby on Rails provides a few “eager loading” mechanisms to make it easy to avoid N+1 query patterns:

User.includes(:address).each { |u| puts u.address }

Under the hood, this instructs Rails to run a SELECT ... FROM ... WHERE id IN (...) query against addresses using the user_ids collected from users.
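With eager loading in place, the SQL log shows two queries no matter how many users are loaded, roughly:

SELECT * FROM users;

SELECT * FROM addresses WHERE user_id IN (1, 2, 3, ...);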

Missing Indexes

When querying a table by any column, the database engine has two options: look at each record in the table, or use an index, a highly-optimized data structure that can select or eliminate records before touching the table itself.

On tables smaller than 10k records, it's easy to not realize you're missing an index because queries are fast enough anyway. Beyond 10k, though, they become noticeably slow. Foreign key relationships (other_table_id) are an obvious place to put an index. Covering indexes, where all the information you need to return is in the index itself, can speed up queries further by obviating the need to touch the table at all. If you're not sure whether a query is using an index, you can use your database engine's EXPLAIN command:

EXPLAIN SELECT * FROM foo;

                   QUERY PLAN
---------------------------------------------------------
 Seq Scan on foo  (cost=0.00..155.00 rows=10000 width=4)
(1 row)

EXPLAIN SELECT sum(i) FROM foo WHERE i < 10;

                         QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=23.93..23.93 rows=1 width=4)
   ->  Index Scan using fi on foo  (cost=0.00..23.92 rows=6 width=4)
         Index Cond: (i < 10)
(3 rows)

This tool will quickly show you the strategy your database engine will use. The documentation for Postgres’ EXPLAIN should be followable even if you’re using another database. The excellent database administrator tool PgHero will also automatically surface index add/drop recommendations based on usage statistics.
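When you do find a missing index, adding it in Rails is a one-line migration (the table and column names here are illustrative):

class AddUserIdIndexToAddresses < ActiveRecord::Migration
  def change
    # Index the foreign key so lookups by user_id don't scan the whole table.
    add_index :addresses, :user_id
  end
end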

Serialized Data

Relational algebra works on structured data: pre-defined tables, columns, and relationships. Semi-structured data such as XML and JSON isn't always predefined, though. The traditional workaround has been to serialize the XML or JSON and store it as a text column. Recent releases of databases like Postgres have added native column types for JSON (stored as the original text) and JSONB (stored as a parsed binary structure).
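In a Postgres-backed Rails app, declaring such a column looks like this (the stripe_charges table is a running example here, not a convention):

class AddTransferToStripeCharges < ActiveRecord::Migration
  def change
    # jsonb stores a parsed binary representation that can be queried and indexed.
    add_column :stripe_charges, :transfer, :jsonb, default: {}, null: false
  end
end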

API responses are popular JSON structures to store in a database. Using a native JSON column type, API response details can easily be queried in the database instead of in your programming language after deserialization:

SELECT * FROM stripe_charges WHERE (transfer->>'amount')::int > 1000;

Or you can pluck individual values from a large JSON object:

SELECT (transfer->'created'), (transfer->>'amount')::int
FROM stripe_charges;

Aggregate in the Database

Imagine you wanted to select all users who have purchased more than $100:

users = User.includes(:payments).select do |u|
  u.payments.map(&:amount).sum > 100
end

This will instantiate an ActiveRecord object for every user in the table, every payment in the table, and then throw away records it doesn’t need. Instead, we can instruct the database to do the filtering and only return User records that meet the criteria:

User.joins(:payments)
    .group('users.id')
    .select('users.*, SUM(payments.amount) AS total_amount')
    .having('SUM(payments.amount) > 100')

This will generate the appropriate JOIN, GROUP BY, and HAVING:

SELECT users.*, SUM(payments.amount) AS total_amount
FROM users
INNER JOIN payments ON payments.user_id = users.id
GROUP BY users.id
HAVING SUM(payments.amount) > 100

Database Wrap-up

Databases are mature, highly-optimized data stores. Whenever possible, let databases do what they’re good at instead of doing it more slowly in your web programming language.

Appropriate Caching

A common saying in the development community is that there are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors. Caching can be hard to get right, leading to hard-to-track-down stale value bugs. I'm definitely not claiming that caching is a silver bullet, but it is certainly a lead bullet that should be in your arsenal.

Often, entire responses can be cached, such as:

  • API endpoints with the same response for every user (e.g. product catalog, search typeahead database)
  • Logged out homepage
  • sitemap.xml

These cacheable responses are some of my favorite low-hanging fruit for performance improvements. On a cache hit, no template engine needs to be invoked and the web server immediately begins serving content to the client. It's hard to beat a .1ms response. If 50% of traffic enters via the logged-out homepage, caching the entire page can be a huge win, especially because Google considers page response time as a signal when ranking search results. Beyond the user and ranking benefits, cached responses also lighten server load.

If at first it doesn't seem like the entire response is cacheable, think again. A logged-in homepage might just be the logged-out homepage with a different menu bar; in that case you can quickly serve the cached logged-out page, then use AJAX to swap in the logged-in menu.

Over 50% of requests to www.cameralends.com were served in under 10ms
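One lightweight way to get whole-response caching, sketched below, is to set HTTP cache headers so that a CDN or reverse proxy in front of the app can answer repeat requests without invoking Rails at all:

# In the logged-out homepage controller: allow any shared cache
# (CDN, reverse proxy) to serve this response for ten minutes.
def index
  expires_in 10.minutes, public: true
  render :index
end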

Even if an entire response isn’t cacheable, expensive bits can be:

  • results of expensive queries (e.g. aggregation queries like SELECT COUNT(*), complicated joins)
  • view partials that require heavy computation, are used many times on the same page, or can be cached for a very long time
  • 3rd-party API responses
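For the expensive-query case, Rails' low-level cache makes this nearly a one-liner; a sketch, with an illustrative key and model:

# Run the expensive COUNT(*) only on a cache miss; for the next hour,
# serve the memoized value instead.
listing_count = Rails.cache.fetch('listings/count', expires_in: 1.hour) do
  Listing.count
end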

A common problem with caching is the cache stampede problem:

Cache stampede (Wikipedia): a cache stampede is a type of cascading failure that can occur when massively parallel computing systems with caching mechanisms come under very high load. https://en.wikipedia.org/wiki/Cache_stampede

The problem centers on cache expiration. Imagine a JSON endpoint that takes 10s to generate and whose cached value is about to expire:

  1. Until expiration, it returns in .1 ms
  2. It expires and is evicted
  3. The first process that tries to access the cached value finds that it is missing, and begins performing the 10s calculation
  4. .1 seconds later, the second process tries to access the cached value, finds it missing, and also kicks off a 10s calculation

If requests come in every .1s, 100 expensive requests will have begun before the original 10s computation finishes, if it ever finishes. If it's an expensive database query, the piled-up load increases drastically; it could now take 20s (or more!) for the first query to finish, and that could be enough to cause downtime. If there are only 100 web workers, it will cause downtime: every worker ends up stuck on the same expensive request.

A common solution is to use a scheduled job to recompute the cached value before it expires (or even to never expire the value, so there's always a fast response). This requires a bit of boilerplate for each cached value, which can discourage widespread use. That's why I favor an asynchronous cache recompute strategy.

Something like a product catalog might only change a few times a day or week, and when it does, it's acceptable for tens or hundreds of requests to respond with a slightly stale catalog. The asynchronous caching strategy I've used works like this:

  1. Until expiration, it returns in .1 ms
  2. It soft expires: not evicted but marked as stale
  3. Soft expiration queues an async job to refresh the cached value
  4. Processes that access the stale value before the async job finishes will receive the stale cached value in .1ms
  5. After the async job finishes, the cached value is updated and stale flag is cleared

This strategy always returns quickly except on the initial access, when the value has to be computed inline. Alternatively, you could have that initial access return nil and degrade gracefully, queueing an async job to fill the cached value for future requests.
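Here's a minimal sketch of that scheme, assuming a Rails cache store and a Sidekiq-style RefreshCacheJob that knows how to recompute the value for a given key (all names are illustrative):

def fetch_with_soft_ttl(key, soft_ttl: 10.minutes)
  entry = Rails.cache.read(key)
  if entry.nil?
    # Initial access: compute inline and record a soft expiration time.
    value = yield
    Rails.cache.write(key, { value: value, stale_at: soft_ttl.from_now })
    value
  elsif entry[:stale_at].past?
    # Soft-expired: serve the stale value immediately, refresh in the background.
    RefreshCacheJob.perform_async(key)
    entry[:value]
  else
    entry[:value]
  end
end

A production version would also set a short "refresh already queued" flag so that a burst of stale reads doesn't enqueue duplicate refresh jobs.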

Rails provides a simple mitigation for cache stampedes via race_condition_ttl in [ActiveSupport::Cache::Store#fetch](http://api.rubyonrails.org/classes/ActiveSupport/Cache/Store.html#method-i-fetch). When this parameter is set, the first process that accesses the stale value will:

  1. Extend the expiration on the cache by race_condition_ttl
  2. Recompute the cached value

Processes that access the cache key after that are returned the slightly stale value. If the recomputed value hasn't been written within race_condition_ttl, the next process to miss will bump the expiration and recompute it itself. It's worth noting that the unlucky request that does the recomputing still pays the full cost, so this strategy has some effect on user experience.
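In code, it's a single extra parameter:

# After expiry, one process recomputes; for the next 10 seconds everyone
# else is served the slightly stale value instead of piling on.
stats = Rails.cache.fetch('homepage/stats',
                          expires_in: 1.hour,
                          race_condition_ttl: 10.seconds) do
  compute_expensive_stats  # stand-in for the slow work
end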

Static caching

For large amounts of unchanging content, static caching can be a good option. Static site generators like Jekyll can build the entire file structure ahead of time to serve through Nginx or Apache. On Heroku, you can simply use Rack to serve a static directory.
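A minimal config.ru for the Rack approach might look like this (the paths are illustrative, and it assumes the generated site lives in public/):

# config.ru: serve a prebuilt static site through Rack.
use Rack::Static,
  urls: ['/css', '/js', '/images'],
  root: 'public'

run lambda { |env|
  [
    200,
    { 'Content-Type' => 'text/html', 'Cache-Control' => 'public, max-age=86400' },
    File.open('public/index.html', File::RDONLY)
  ]
}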

Concluding Thoughts

This collection of easy wins is far from a comprehensive optimization guide, but it can take the scalability of your web site surprisingly far. Before adding a complicated performance solution, it's worth checking whether a dumber, more straightforward solution will work.

