How to Scale WebSockets

Scaling WebSockets

While talking to developers who haven’t used WebSockets yet, they usually have the same concern: how do you scale it out across multiple servers?

Publishing to a channel on one server is fine, provided all subscribers are connected to that one server. As soon as you have multiple servers, you need to add something else to the mix. That’s what this post attempts to address.

Scaling HTTP vs WebSockets

To see why scaling out WebSockets might seem daunting, let’s contrast it with HTTP, because most people understand it well.

With HTTP, you have a once off request/reply pattern, you don’t expect the next request from the client to come back to the same server. At least you shouldn’t, because that means you have a sticky session problem and that you can’t easily scale out for performance or redundancy.

With HTTP, you can run a virtually unlimited amount of web server instances behind a load balancer. The load balancer passes the request to a healthy web server instance when it comes in and the response gets passed back to the client once the web server has calculated it. HTTP connections are usually very short lived, they exist only until the response has been given. This is a well understood and ubiquitous approach and it scales very well. There is an exception to this in the form of long polling, but it’s not as common and not at all important for this post.

On the other hand, WebSockets differ from HTTP requests in the sense that they are persistent. The WebSocket client opens up a connection to the server and reuses it. On this long running connection, both the server and client can publish and respond to events. This concept is called a duplex connection. A connection can be opened through a load balancer, but once the connection is opened, it stays with the same server until it’s closed or interrupted.

This in turn means that the interaction is stateful; that you will end up storing at least some data in memory on the WebSocket server for each open client connection. For example, you’ll probably be aware which user is on the client-side of the socket and what kind of data the user is interested in.

The fact that WebSocket connections are persistent is what makes it so powerful for real-time applications, but it’s also what makes it more difficult to scale.

An example WebSocket app

Let’s talk about an example app, that way we can discuss problems and approaches in the context of something a tad more concrete.

For our example, let’s settle on a collaborative whiteboard app. In this app, there are multiple whiteboards, meaning multiple drawings that people can collaborate on. When a user draws on a particular whiteboard, it publishes the coordinates over a WebSocket connection and through to all the other users who have the same whiteboard open. In other words, we have a pub/sub pattern exposed over WebSockets.

In this example, it means that the server-side of the socket connection for each user of the app will need to know, at the very least, what whiteboard the user has open.

Web socket implementations like socket.io have the concept of channels. Think of it as an address where clients subscribe to, and either a service or other clients publishes to.

It might be tempting to think that all we need to do to build our collaborative white board app is to employ channels (each whiteboard has it’s own channel) and then sit back and relax, but as you’ll see through the rest of this post, you’ll still have issues with scaling out and fault tolerance.

You need a Publish/Subscribe broker

First off, what do I mean when I say “pub/sub broker”? There is a variety of technologies around that support the pub/sub pattern at considerable scale.

When you need to scale out a pub/sub architecture over sockets, you will need to find a good pub/sub technology to have at the core of your solution.

We don’t need to tie down a specific option for this post to make sense, but here are some good options: Redis, RabbitMQ, Kafka, and RethinkDB.

To see why we need to add a pub/sub broker to the mix to help you scale your WebSockets, let’s think about our example in the context of one server first.

With one server, it’s actually quite easy to build a pub/sub service with just WebSockets. This is because on one server the server will be aware of all clients and what data the clients are interested in.

Think about our example app. When a client sends through coordinates for a drawing, we just find the correct channel for the drawing and publish the updates made to the drawing to that channel. All the clients are connected to the one server, so they all get notified of the change. It’s kind of like in-memory pub/sub.

But in reality, we would want to scale across multiple servers and we want to do this for 2 reasons: 1) sharing processing power, and 2) redundancy.

So what can we do to ensure that our app scales out? Well, we need some way to let other services with connected clients know that the data has changed.

When building an app like this, you would probably have a database in the mix already, even before you start thinking about scaling. You wouldn’t just trust connected clients to store all the data for all the drawings would you. No, you’ll want to persist the drawing data as it comes in from the clients, so that you can serve up drawing data anytime a user opens up a drawing.

But here’s the problem. If a WebSocket on Server A writes some data to the database, how would WebSocket on Server B know to go and get the latest data from the database so that it can notify it’s clients of the new data?

Let’s talk through the process of using Redis at the centre of your solution. Although you might have hundreds of WebSocket servers in your cluster, let’s imagine you have just 3 to make things a bit simpler. We’ll call these servers WS1, WS2, and WS3. Sometimes I surprise myself with my amazingly creative names for stuff!

Ok, so let’s say that you have 9 people that opened up a specific drawing of a dog riding a pony riding a dinosaur, saved in your database with an id of abc123. And let’s say that you have 3 people connected to each server in the cluster (WS1, WS2, WS3).

One of the users connected to WS1 draws something on the whiteboard. In your WebSocket server logic, you write to the database, to ensure that the changes have been persisted, and then publish to a channel based on a unique identifier associated to the drawing, most probably based on the database id for the drawing. Let’s say that the channel name in this case is drawing_abc123.

At this point, you have the data written away safely in the DB and you have published an event to your pub/sub broker (Redis channel) notifying other interested parties that there is new data.

Because you have users connected to the other WebSocket servers (WS2, WS3), interested in the same drawing, they will have open subscriptions to Redis on the drawing_abc123 channel. They get notified of the event, and each of the servers queries the DB for updates and emit it on the WebSocket channel used on your WebSocket tier.

So you see, the pub/sub broker is used to allow you to expose a pub/sub model with a scaled-out WebSocket cluster.

Handling failover

Another benefit of using a pub/sub broker to coordinate your WebSockets is that now it’s possible to easily handle failover.

When a client is connected to a WebSocket server, and that server falls over, the client can open a connection through the load balancer to another WebSocket server. The new WebSocket server will just ensure that there is a subscription to the pub/sub broker for the data that the WebSocket client is interested in and start piping through changes on the WebSocket when they occur.

Working with deltas

One thing to take into consideration when a client reconnects is making the client intelligent enough that it sends through some sort of data synchronization offset, probably in the form of a timestamp, so that the server doesn’t send it all the data again.

If every update to the drawing in question is time stamped, the clients can easily store the latest timestamp that they received. When the client loses the connection to a particular server, it can just reconnect to your websocket cluster (through your load balancer) by passing in the last timestamp that it received and that way the query to the DB can be built up so that it’ll only return updates that occur after the client last successfully received updates.

In loads of applications it might not be that important to worry about duplicates going down to the client. But even then, it might be a good idea to use a timestamp approach to save both your resources and the bandwidth of your users.

Conclusion

Building a pub/sub service running on one server is relatively easy. The challenge is to build a service that can be scaled horizontally for load sharing and fault tolerance.

When you scale out, you need a way for the web socket services to subscribe to changed data, because changes to said data will also originate from other servers than itself. A Database that supports live queries is perfect for this purpose, for example RethinkDB. That way you have only WebSockets and your DB. That said, you might already be using a pub/sub capable technology (Redis, RabbitMQ, Kafka) in your environment, and it’ll be a much easier sell than introducing a new DB technology to the mix.