A Guide To Launching Battle-tested Apps For Engineering Teams

Written by akp | Published 2021/10/24
Tech Story Tags: sentry | retool | firebase | incident-io | error-reporting | load-testing | high-traffic | performance

TLDR: On the runway toward launch, working at a startup involves putting out one fire after another. But once you find the sweet spot of product-market fit, you can no longer put off crossing that metaphorical bridge: it’s time for proactive action, not just reaction. OSlash wanted to make sure that when we opened the gates to the public, we avoided nasty surprises and nobody died. To meet the objectives, we divided the engineering team into three groups picked at random.

On the runway toward launch, working at a startup involves a lot of putting out one fire after another. You squash a bug when a user reports it. Speed up the product when someone complains. Roll out a feature when there’s definite demand. It’s so important to keep the momentum and build what the users want.

But once you find the sweet spot of product-market fit, you can no longer put off crossing that metaphorical bridge. Beta testing is over. Users like and want your product. Now it’s time to level up. That means proactive action, not just reaction.

Of course, it’s impossible to get everything right in the first rollout. But being caught on the back foot is worse than making a slightly imperfect first impression. To ready ourselves for the launch, we at OSlash kicked off a mission, internally named Project Apollo.

Project Apollo

“One small step for OSlash, a giant leap for productivity”

We wanted to make sure that when we open the gates to the public, we avoid nasty surprises and that nobody dies. Of course, that’s asking a lot, but a team can try.

Our objectives were clear:

  • Build a baseline for our current systems’ performance

  • Prepare for widening service limits

  • Build an observability platform

  • Design a process for incident management & identifying SLOs

To meet the objectives, we divided the engineering team into three groups of people selected at random.

The teams worked toward the success of the mission alongside their everyday tasks. It was no easy feat, yet they could not have done it better.

Squad 1 - Performance Squad

The objective of this group was to answer the question “How many users can our platform serve today?”. With this answer, we gained two new abilities — identifying performance degradations and capacity planning for expected traffic surges (like, well, a product launch!).

For most engineering systems, the usual metrics for quantifying performance are:

🛎 Availability

Availability is a measure of how much of the time the system is up and able to respond to requests. It is usually expressed as a percentage of uptime in a given year; 99.9% availability, for example, allows roughly 8.8 hours of downtime per year.

⏱ Latency

Latency is a measure of how long it takes for a request to get a response. It is often measured between different layers (for example, API gateway ↔ Lambda) and usually expressed as 95th or 99th percentiles.

🚛 Throughput

Throughput is a measure of how many requests can be handled. This is usually expressed in rps (requests per second).

🪖 Durability

Durability is a measure of whether the data we stored is always there when we look for it. This is usually a function of the database we use. For example, see DynamoDB’s durability promise.

✅ Correctness

Correctness is a measure of the system always returning the expected result.

To get these numbers onto our dashboard, the performance squad broke performance down into parts and measured each aspect in turn.

Part I: Measure How Fast the Current System Is

To do that, the squad employed Firebase Performance Monitoring, which tells us exactly how our users are experiencing our product (a minimal setup sketch follows the list below).

It returns:

  • Load Time
  • Wait Time
  • Regional Delays
  • Performance on Wi-Fi, 4G networks, etc.
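
Getting those numbers is mostly a one-time setup in the web client. Here is a minimal sketch, assuming the Firebase v9 modular SDK; the config object, endpoint, and custom trace name are placeholders, not OSlash internals.

```typescript
import { initializeApp } from "firebase/app";
import { getPerformance, trace } from "firebase/performance";

const app = initializeApp({ /* your Firebase project config */ });
const perf = getPerformance(app); // enables automatic page-load and network request traces

// Optionally wrap a critical flow in a custom trace
async function resolveShortcutTraced(shortcut: string): Promise<Response> {
  const t = trace(perf, "resolve_shortcut"); // illustrative trace name
  t.start();
  try {
    return await fetch(`/resolve?shortcut=${encodeURIComponent(shortcut)}`);
  } finally {
    t.stop(); // the duration shows up in the Firebase Performance dashboard
  }
}
```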

The squad complemented Firebase with Sentry Performance Tracing to delve further into the exact user experience. Sentry returns the amount of time taken in different parts of our product; a sketch of this instrumentation follows the example below.

Let’s say a user requests a shortcut -> o/roadmap.

Sentry returns:

  • the time it takes for the server to find it
  • the time it takes for the network to deliver it
  • the time it takes for the browser to complete the redirect on the client side
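
Instrumented with Sentry, that flow might look roughly like the sketch below. This is a hedged example assuming @sentry/browser with performance tracing enabled; the transaction name, span details, and endpoint are illustrative.

```typescript
import * as Sentry from "@sentry/browser";

Sentry.init({ dsn: "<your-dsn>", tracesSampleRate: 0.2 });

async function redirectShortcut(shortcut: string): Promise<void> {
  const transaction = Sentry.startTransaction({ name: "shortcut.redirect" });

  const lookup = transaction.startChild({ op: "http.client", description: "resolve shortcut" });
  const res = await fetch(`/resolve?shortcut=${encodeURIComponent(shortcut)}`);
  const { url } = await res.json();
  lookup.finish(); // captures server + network time for the lookup

  transaction.finish(); // spans appear under this transaction in Sentry Performance

  window.location.assign(url); // client-side redirect
}
```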

Part II: Identify the Bottlenecks

We split every touchpoint in the product into smaller chunks, known as transactions, to easily calculate how much time each transaction takes to complete. Any transaction that takes more than 1.2 seconds to complete leads to user misery. Identifying such transactions gave us a fine-grained analysis of all the potential bottlenecks.

The bottlenecks were identified using the Sentry User Misery Score which returns the following:

Transactions per minute: How many times a given operation occurs in a minute

Latency: Measured as P50, P95, and P99

P50 - 50% of operations finish within this time (the median, i.e. typical speed)

P95 - 95% of operations finish within this time; the slowest 5% take longer (tail speed)

P99 - 99% of operations finish within this time, capturing all but the worst 1% of requests
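
For reference, here is a tiny sketch of how those percentiles fall out of raw latency samples (nearest-rank method; the sample values are made up):

```typescript
// Compute a latency percentile from raw samples (in ms) using the nearest-rank method.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [38, 41, 45, 52, 60, 75, 120, 400]; // illustrative samples
console.log(
  `P50=${percentile(latenciesMs, 50)}ms`,
  `P95=${percentile(latenciesMs, 95)}ms`,
  `P99=${percentile(latenciesMs, 99)}ms`
);
```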

Part III: Keep an Eye on the Speedometer

Another tool we set up to monitor all our Lambda functions, including the time taken to complete a transaction, is Lumigo.

By tagging every prod release appropriately, we are able to ensure that the performance doesn’t degrade over time.

Part IV: Ensure Preparedness For Incoming Traffic

Once our product is launched, in the best-case scenario, we should expect a sudden spike in traffic. To be sure we don’t falter at this crucial juncture, the squad helped us answer two key questions:

  • How many simultaneous users can our product handle? More users mean more requests to our servers, eventually leading to HTTP 429 (Too Many Requests) errors
  • How fast will our servers be when the number of users is high? A transaction that takes 40ms under regular conditions might take 400ms in high-traffic scenarios. This meant we needed to figure out the threshold for the essential transactions in this situation

Load Testing with Vegeta

Vegeta is a tool that allows teams to simulate heavy traffic. If you are gearing up for a big launch or PR push, it is worth simulating every condition to make sure the product does not break anywhere. With Vegeta, we were able to figure out how much time the most frequent transactions took under peak traffic.
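
As a rough illustration, a basic Vegeta run from the command line looks something like this; the target URL, rate, and duration are placeholders, not our actual numbers.

```bash
# Fire 100 requests/second at the endpoint for 60 seconds, then summarize
# success rate and latencies (including the P95/P99 tail).
echo "GET https://api.example.com/resolve?shortcut=o/roadmap" | \
  vegeta attack -rate=100 -duration=60s | \
  tee results.bin | \
  vegeta report
```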

Part V: Fix the Low-Hanging Fruit

With help from Sentry and the data obtained from Vegeta, the squad made it possible for the engineering team to immediately fix issues, which instantly made the product faster for all users.

Also, by linking all issues in Linear to Sentry, we were able to ensure that fixed issues don’t resurface in production unnoticed.

Squad 2 — Observability Squad

Observability means being able to quickly find out anything you want to know about a system. In other words, observability is a collection of systems (error trackers, uptime monitors, log aggregators, and tracers) that together give a bird’s-eye view of all system components at any given point in time. Observability is also different from monitoring: monitoring tells you when errors you already know about (but haven’t fixed yet) happen again, whereas observability gives you more real-time information and helps you predict faults.

We wanted to ensure that the observability system is our single source of truth for all platform events (errors, alerts, system & app metrics) and that it helps us quickly jump to a particular flow or transaction when debugging and fixing issues.

Basic building blocks

1. 🐞 Error Alerting

We’ve been using Sentry for tracking all errors. How can we better leverage Sentry’s capabilities?

  • Enable and set up source maps so errors are readable
  • Use appropriate levels to mark the severity of each issue
  • Identify users so we can spot deeper patterns
  • Set up Slack alerts for high-severity issues
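
Here is a minimal sketch of what that setup might look like with the Sentry browser SDK; the DSN, release string, and user fields are placeholders, and the Slack alert rules themselves live in the Sentry UI rather than in code.

```typescript
import * as Sentry from "@sentry/browser";

Sentry.init({
  dsn: "<your-dsn>",
  release: "my-app@1.2.3",  // placeholder; should match the release the source maps were uploaded for
  environment: "production",
});

// Identify the user and mark severity explicitly when capturing an error
function reportFatal(user: { id: string; email: string }, err: Error): void {
  Sentry.setUser({ id: user.id, email: user.email }); // makes per-user error patterns visible
  Sentry.withScope((scope) => {
    scope.setLevel("fatal");                          // severity level drives high-severity alert rules
    Sentry.captureException(err);
  });
}
```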

2. 🔄 Tracing

Tracing means generating a unique ID that can be used to follow a transaction across systems. For example, when a user initiates an action, the initial request sent to the backend should carry a unique ID (called a trace ID) that is propagated across all discrete systems. This way, if we search for, say, request ID req-abc-123, we should be able to trace it from the extension to the backend and back to the extension, and finally to the user. Tools & platforms (search term: “distributed tracing”) include Jaeger, Honeycomb, AWS X-Ray, and OpenTracing. Sentry errors can also be enriched with this tracing metadata.
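
Here is a rough sketch of the idea, not our actual implementation; the header name, endpoint, and handler shape are illustrative.

```typescript
// Client / extension side: tag the outgoing request with a trace ID.
// (crypto.randomUUID() is available in modern browsers and recent Node versions.)
async function resolveShortcut(shortcut: string): Promise<Response> {
  const traceId = `req-${crypto.randomUUID()}`;
  return fetch(`https://api.example.com/resolve?shortcut=${encodeURIComponent(shortcut)}`, {
    headers: { "x-trace-id": traceId },
  });
}

// Backend side (e.g. a Lambda behind API Gateway): read the same ID and attach it
// to every log line, so one search surfaces the whole journey of a request.
export const handler = async (event: { headers: Record<string, string> }) => {
  const traceId = event.headers["x-trace-id"] ?? "unknown";
  console.log(JSON.stringify({ traceId, msg: "resolving shortcut" }));
  return { statusCode: 200, headers: { "x-trace-id": traceId }, body: "ok" };
};
```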

3. 📡 Status Page

Set up a status page (it can be private too) so that everyone can see at a glance whether all systems are operational.

4. 📊 Dashboard

A company-wide real-time dashboard with the most important metrics (like total installs, total active users, daily user signups, the total number of Shortcuts created, overall system throughput, error rate, number of extension installs, etc.) is incredibly useful for a number of reasons:

  • Throw it on a giant display in the center of the office and everyone gathers around to watch and celebrate big milestones (like, say, the 1,000th user).

  • Everyone knows what the most important goals for the company are.

  • Easier to notice anomalies and take corrective action.

Product and engineering teams came together to identify important metrics for both teams and build an interactive dashboard together.

To make sure the dashboard is built super fast, the Observability Squad used Retool.

With Retool, we’ve unearthed some great insights into the product. Each time a crucial number sees a sudden spike or an expected uptick, our hearts collectively skip a beat.

Squad 3 — Incident Management & Preparedness Squad

Incident management is a set of definitions and rules that answer the following questions —

  • What exactly happens when an incident (could be an error, a security breach, or downtime) happens?
  • Who or which team is the first responder?
  • Who takes responsibility for coordinating various teams in responding to the incident?
  • How long can a bug take to fix before it is escalated?
  • Who will respond to incidents on non-working days?
  • How much is reasonable pay for on-call duty?
  • What is the template for a responsible disclosure or an RCA?

To make sure there is a process in place that can answer every question, the Incident Squad created playbooks that followed a carefully laid-out set of steps.

Step I: Categorization & Severity of Issues

Step II: Service-Level Agreements according to SOC-2

  • Severity critical: 3 business days
  • Severity high: 30 days
  • Severity moderate: 60 days
  • Severity low: 180 days

Step III: Define SLAs according to OSlash Security

  • Critical (S0): Within 1 week of being reported
  • High (S1): Within 1-2 weeks of being reported
  • Medium (S2): Within 2-3 weeks of being reported
  • Low (S3): Within 3-4 weeks of being reported
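
One way to make SLAs like these actionable is to encode them as data that dashboards and incident tooling can read. Here is a small, purely illustrative sketch; the type and helper names are ours, not part of any tool.

```typescript
// Map each severity to its SLA, mirroring the OSlash Security targets above
// (using the upper bound of each range, in days).
type Severity = "S0" | "S1" | "S2" | "S3";

const slaDays: Record<Severity, number> = {
  S0: 7,  // Critical: within 1 week of being reported
  S1: 14, // High: within 1-2 weeks
  S2: 21, // Medium: within 2-3 weeks
  S3: 28, // Low: within 3-4 weeks
};

// Compute the deadline an incident must be resolved by.
function slaDeadline(severity: Severity, reportedAt: Date): Date {
  return new Date(reportedAt.getTime() + slaDays[severity] * 24 * 60 * 60 * 1000);
}

console.log(slaDeadline("S1", new Date("2021-10-01")).toISOString()); // 2021-10-15T00:00:00.000Z
```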

To make the whole process seamless, the incident squad ended up trying out a bunch of incident-monitoring tools such as PagerDuty, Opsgenie, VictorOps, and incident.io.

In our personal experience, incident.io ticked all the boxes we were looking for.

After classifying all issues by level of severity, the incident squad went on to describe how issues would be communicated to users, who would stay on call, and how that person would be monetarily compensated for the extra hours put in.


If you are looking to build your own version of Project Apollo, here are a few key tips that might prove helpful:

1. Document Everything

Don’t worry about where to put the docs or how they should be laid out. Just document everything.

2. Checkpoint Daily

Catch up daily over squad standups, and feel free to go as deep and for as long as you want. Keep your eyes on the prize and make sure you are achieving your squad goals on a day-to-day basis.

3. Prefer No-code and Managed Solutions

Reach for no-code, managed, battle-tested tools rather than building something on your own.

4. Accommodate Existing Priorities

Keep your everyday tasks at a higher priority and work on the project for a couple of hours daily.

5. Find Consensus with Help from Senior Folks

If the squad members cannot come to an agreement, consult the senior folks; they’ll ask more questions, invite more discussion, and help find common ground between what’s best for the team and what’s best for the company.

6. Think Critically

Ask questions like “should we even do this?”. A squad coming to a “we don’t need this at all” conclusion is still okay if they can back it up with data.

The entire mission took us two weeks to complete. In hindsight, to lessen the burden on the already stressed engineering team, we could have earmarked a couple more weeks for the activity.

We hope you found value in our experience, and we hope to see you alongside us on the journey ahead.

First published here


Written by akp | CEO of OSlash.
Published by HackerNoon on 2021/10/24