Scale Summit 2018 Notes

This year I went to my third Scale Summit event, an unconference about building scalable, high-performance systems.

I’ve summarised notes from the various sessions I attended…

Scaling Teams & Culture Keynote

The day started with a Keynote from @Geek_Manager titled 5 Things I Wish I Knew Sooner About Scaling Teams & Culture…

Scaling Teams & Culture Keynote at Scale Summit: "I was delighted to do the opening keynote at Scale Summit 2018 this morning - thanks for the warm welcome, folks. Below…" (blog.geekmanager.co.uk)

1. DRY doesn’t work for human communication

People need to hear things more than once for them to register
Architectural decision records show context for why things were done in the past

2. Scaling teams is about creating conditions for success

First, Break All The Rules

Understand Drive
Purpose, Autonomy, Mastery and Inclusion

3. Inflection points

Different things come for free at different inflection points
In smaller teams it’s easier for everyone to know what’s going on
Focus on the right problems at the right time
You’re not Netflix and probably don’t have the same problems
Stumbling on happiness
Humans edit the past and are bad at predicting the future
The most useful person to learn from is just ahead of you

4. With people, observability > testing

Your impact may not match your intent, no matter how hard you try
Need to check if your intentions match reality

5. Culture add matters a lot more than culture fit

Focus on getting the most out of difference
We’re not interchangeable resource units

Think of people and roles as a matter of casting
Assemble teams with complementary abilities

Pragmatic Approaches to Real Problems

One massive server is easier to deal with than a distributed system and costs about as much as a few smaller ones
The risk of failed automation can be greater than the cost of the downtime from doing things like manual failovers
Simpler, smaller, systems let you get to market faster
What’s the cost of being wrong vs the cost of making sure you’re right?
What’s the business context?
GDPR may tip the cost of taking the risk in a different direction
Tradeoffs against the cost of it becoming a legacy system you have to maintain
Legacy systems can just become anything you don’t understand
Make it clear what the priorities are and why you are making the tradeoffs
You can build new replacement systems for legacy things and never fully migrate to them
Make old things better before building replacements. Ringfence the old systems and put APIs in front of them
Phrase Jira tickets in terms of problems so you can discuss lots of solutions
How will you know it’s working when you ship it?
Explain “Why can’t we just…?”

Tin for Cloud Kids

If you’ve only ever used the cloud, how do you deal with a project that requires you to run on ‘real’ servers?

Run your own hypervisor? Many devices like NICs support passthrough to the VM

Killing and replacing machines

Can netboot and install machines
Speed of doing so can be limited by your BIOS
Can achieve 10–15 minute cycles for a full install

Create ground rules that no server is sacred

Cattle not pets
Think about availability when building everything
Have hot spares
Raw hardware is cheap compared to AWS
Have a lower threshold for capacity when you buy new hardware
Consider support packages and hardware life cycles
Classify expense as opex rather than capex
You can get a new server within 6 hours
Capacity planning needs to be someone’s job
Can use out of warranty hardware for things like Jenkins
You have a pool of compute rather than servers with dedicated jobs
You will hit the physical limits of the hardware at some point if you continue to run more VMs on it. E.g. switching on NICs is done in hardware up to a point, after which it’s emulated in software much more slowly
Have to deal with disk failures
Metal as a Service from Ubuntu
Packet.net for buying bare metal compute

Production Performance Monitoring

New Relic

Easy to get started
Very expensive
You push data to them; if your server falls over it may not be able to push the crucial data you need

Tracing

Zipkin
Jaeger
X-Ray: cheap, but the UX is poor
Sampling can lose the traces you really need
Can’t choose to trace retroactively after you’ve hit an error condition, need to choose to trace at the ingress point
X-Ray < OpenTracing because you can’t switch it out as easily
Hard to get started as you need people to modify their code
Get buy-in by showing it off on hack days
Span tracing can be tricky if context passes between threads
Even with 1 thread it needs to work with all the libraries you use
Easy for new code, but no one’s really going to go back and instrument all the old code
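
The point about context passing between threads can be illustrated with a minimal sketch using Python's standard contextvars module. The names current_trace_id, submit_with_context and handle_request are hypothetical, and a real tracing library would carry a full span context rather than a bare id.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical name: holds the trace id of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def do_work():
    # Runs on a worker thread and reports the trace id it can see.
    return current_trace_id.get()


def submit_with_context(executor, fn, *args):
    # Snapshot the caller's context and run fn inside that snapshot,
    # so the trace id travels with the work to the worker thread.
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


def handle_request(trace_id):
    # The trace id is set once at the ingress point, then work fans out.
    token = current_trace_id.set(trace_id)
    try:
        with ThreadPoolExecutor(max_workers=1) as executor:
            return submit_with_context(executor, do_work).result()
    finally:
        current_trace_id.reset(token)
```

Every place that hands work to another thread needs a wrapper like this, which is part of why retrofitting tracing onto old code is tedious.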

Graphite vs Prometheus

Pull model can be hard to deploy for existing projects, if you are in regulated environments or have security teams that make it hard to get the access to the scraping endpoint
The Pushgateway can be a way around this
Need code changes to expose metrics endpoint for Prometheus
Managing TSDBs is hard
Prometheus struggles with a 300–400 node Kubernetes cluster, need to add more instances, federation is hard
Hosted Graphite is nice, adds alarms on top

Why use Prometheus?

Kubernetes adds endpoints for it
Standardised a metrics format
Can add labels to metrics
Query language and data storage is nice
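
To make the standardised format and labels concrete, here is a small sketch that renders a counter in the Prometheus text exposition format using only the standard library. The function names and the metric are made up for illustration; in practice you would more likely use an official client library than format the text by hand.

```python
def escape_label_value(value):
    # The text format requires backslash, double quote and newline to be escaped.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        # Only emit braces when there is at least one label.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

Serving this text from an HTTP endpoint is the code change that the pull model asks of each application.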

HoneyComb?

Nice for seeing the bigger picture and then drilling down into common factors
Not a logging replacement
Cheap way to store lots of datapoints, costs less than building and maintaining a similar solution yourself
Can sample events based on if they are a success or error
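
Sampling by outcome could look roughly like the sketch below: keep every error and about 1 in N successes, and record the rate on each kept event so counts can be scaled back up later. This is not Honeycomb's API; sample_event, the status field and the default rate are all assumptions for illustration.

```python
import random


def sample_event(event, success_rate=20, rng=random):
    """Keep every error, and roughly 1 in success_rate successful events.

    Returns the event with a sample_rate field attached, or None if dropped.
    """
    rate = 1 if event["status"] == "error" else success_rate
    if rng.randrange(rate) != 0:
        return None
    # Recording the rate lets a backend weight each kept event when counting.
    return {**event, "sample_rate": rate}
```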

Logging

Buy an ELK stack, everyone has them and maintaining your own is hard
Splunk is amazing if you have the money
Attach trace ids to logs so you can link them back
Turn async writes on for Elasticsearch
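
One way to attach trace ids to logs, sketched with Python's standard logging module. The get_trace_id callable, the logger name and the log format are assumptions; how you obtain the current trace id depends on your tracing setup.

```python
import io
import logging


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace id of the current request."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id() or "-"
        return True  # never drop the record, only annotate it


def build_logger(stream, get_trace_id):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter(get_trace_id))
    logger = logging.getLogger("traced_app")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the id in every line, a search for one trace id in the log store pulls up everything logged for that request.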

Strategies for Testing in Production

Monitoring systems that log into prod and perform a bunch of actions work well for a few people in the room

Have found breakages and issues belonging to other teams
Add headers to requests to identify them as test runs so that systems can decide how to respond to them. Useful for things like payment systems that can drop the request and return dummy data
Need to be sure that you update the mock data when the systems change
Be careful when interacting with 3rd party APIs: it’s easy to get banned for posting data that looks like spam, or to hit rate limits and break your prod app
Running these tests against prod systems means they also emit metrics / logs to the standard systems that you can monitor
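
The test-header idea can be sketched as follows. The header name X-Test-Run, the response shape and the charge_card stand-in are all hypothetical; the point is only that the handler short-circuits before touching the real payment provider.

```python
TEST_RUN_HEADER = "X-Test-Run"  # hypothetical header name


def charge_card(amount_pence):
    # Stand-in for the real call to a payment provider.
    raise RuntimeError("real payment provider called")


def handle_payment(headers, amount_pence):
    """Take a payment, unless the request is marked as a synthetic test run."""
    normalised = {k.lower(): v for k, v in headers.items()}
    if normalised.get(TEST_RUN_HEADER.lower()) == "true":
        # Drop the request and return dummy data that mirrors the real response shape.
        return {"status": "accepted", "charge_id": "test-charge", "test_run": True}
    charge_id = charge_card(amount_pence)
    return {"status": "accepted", "charge_id": charge_id, "test_run": False}
```

The dummy response is exactly the mock data that has to be kept in step with the real system as it changes.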

Canarying

CNAMEs that let you switch between environments
Roll out your code to a subset of users, controlled by things like feature flags
If you can roll back fast, there’s less need for things like blue / green deploys
How to Deploy Software
GitHub Scientist: for testing new codepaths safely (available for a bunch of languages)
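
The pattern behind Scientist is roughly: run the old and new code paths side by side, record whether they agree, and always return the old result. The sketch below shows that idea only; run_experiment and its report callback are hypothetical and are not the library's API.

```python
def run_experiment(name, control, candidate, report):
    """Run both code paths, report whether they agree, always return the control result."""
    control_result = control()
    try:
        candidate_result = candidate()
        matched = candidate_result == control_result
        error = None
    except Exception as exc:  # a broken candidate must never affect users
        matched, error = False, exc
    report({"experiment": name, "matched": matched, "error": error})
    return control_result
```

Note that the candidate really does run in production, so this only suits code paths without side effects, or ones where the side effects are safe to duplicate.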

Shadow Traffic deploys

Be careful about the extra load you’re placing on downstream systems

Ways to duplicate traffic

Have code in the client, controlled by a feature flag that makes requests to new and old systems
Use something like Kafka streams to replay prod traffic against new systems, or pipe it into development environments

Envoy Proxy

Rate limiting requests to 3rd party APIs
Test credential swapping
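
Envoy has its own rate limiting configuration, which is not shown here. As an illustration of the underlying idea, this is a minimal in-process token bucket; the class and its parameters are assumptions made for the sketch.

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller checks allow() before each request to the 3rd party API and either waits or drops the request when it returns False.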

Feature Flags

Put all new features behind feature flags so you can deploy them in small pieces early on
Be sure to remove the flag at the end
Product owners can control who / when the flag is turned on
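
A percentage rollout behind a flag can be sketched by hashing the flag name and user id into a stable bucket. The function name and the flag name used in testing are hypothetical; hosted feature flag services offer this kind of targeting out of the box.

```python
import hashlib


def flag_enabled(flag_name, user_id, rollout_percent):
    """Return True if this user falls inside the rollout percentage for the flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% keeps seeing it as the percentage is raised.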

Observability or has anyone tried HoneyComb?

If you have events you can generate metrics
If you just store metrics you lose the events
Can correlate things like CPU spikes with other events occurring in the system
Don’t know what data you’ll need until the event has passed
Metrics show you that you have a problem. Observability shows you the common traits of anomalous metrics
Etsy Skyline
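
The events-versus-metrics point can be shown with a small sketch: metrics fall out of the events by aggregation, but asking a new question afterwards (here, which customers saw slow requests) needs the raw events. The field names are made up for the example.

```python
from collections import Counter


def metrics_from_events(events):
    """Aggregate raw request events into the kind of numbers a metrics system stores."""
    by_status = Counter(e["status"] for e in events)
    return {
        "requests_total": len(events),
        "errors_total": by_status["error"],
        "max_duration_ms": max((e["duration_ms"] for e in events), default=0),
    }


def slow_requests_by(events, field, threshold_ms):
    """Group slow events by any field; only possible because the events were kept."""
    return Counter(e[field] for e in events if e["duration_ms"] >= threshold_ms)
```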

eBPF

Awesome but new, not in a lot of kernels
Can hook into events happening in the kernel, with very little overhead
Hard to use currently

What’s Changed Since Scale Summit 2017 / Predictions for Next Year

Predictions from last year

Alexa / Voice interfaces won't take off

Seem to be bigger in the USA than in Europe
HomePod doesn’t seem to have taken off
Alexa laughs at you
If you live in a small house and have an Alexa that controls your lock, you can shout “Alexa open the door” from outside the house

Rust

Go seems to be big in the operations field but not as popular elsewhere
Firefox Servo rewrite big success for Rust
WASM looks interesting

Brexit

Still a mess and unclear

IR35 / Gov Tech

People have been leaving GDS and no one is really taking over the community leader roles

Yarn

Package lock files have become more popular
People are moving from npm -> yarn
Hard to keep up with the rate packages are updated
Dependabot alerts you about updates and looks at the test run results for the new version across the internet to work out how safe it is

Kubernetes on AWS

Happened
All the major clouds now have it
Not ready for prime time yet, released at re:Invent as a marketing thing
Kubernetes is complex, Nomad is easier to run if you’re going to do it yourself
Lots of excitement about managing stateful services on Kubernetes now

ELK CVE

Didn’t happen
Ransomware for Elasticsearch clusters accidentally exposed to the internet

Predictions for next year

Smarter viruses that don’t kill the host

More valuable to stay hidden and mine cryptocurrencies
Until that market collapses

SWE Ethics

Will become more of a hot topic, is growing after the Volvo incident
Machine Learning ethics will become a bigger topic
More attacks against machine learning

Crowd Sourcing Behind the Scenes

Expensify using Mechanical Turk
Duolingo swapping translations with users learning the opposite languages
Will continue to grow

More attacks against hardware

Maybe new attacks against AMD chips have already happened?

A country will have its CA chain revoked

Social Media Regulation

More transparency around who paid for ads you see
More spam messages that are really close to looking like a human wrote them

Private companies will start competing with branches of Government

Citymapper buses.

You can view tweets from the event on the #ScaleSummit18 hashtag. I’m even in one of them.