A Guide to Building Complex Serverless Infrastructures

In this article, we’ll rewind to the very beginning of the AWS Well-Architected Framework to understand how and why it came to be, and why it is so important, yet so often underrated, for serverless developers to learn, understand and apply this framework of best practices. We’ll also look at how the framework has evolved and how it should be used in 2021.

The History of AWS Well-Architected Framework

The AWS Well-Architected Framework came about in 2012 in response to the build-out of the AWS service portfolio. Industry and user feedback consistently showed that while there was plenty of documentation on the services, there wasn’t enough on best practice. As ever, AWS listened to its users and published the first version of the framework in 2015.

In 2016, the Operational Excellence pillar was introduced, again as a result of user feedback. The original Well-Architected Framework was rightly technically heavy, but users needed and wanted more on improving operational posture, reducing heavy lifting, and improving the day-to-day running of AWS infrastructure.

By the next year, services had proliferated massively, and AWS wanted to cater for different kinds of environments, since the Reviews looked very different for classic architectures and for serverless ones. So, in 2017, lenses were introduced: overlays on the original framework that speak to specific workload types. This meant that the Reviews and the adoption of AWS could be much more refined. We’ll dig deeper into the lenses later on in this article.

In 2020, there were some big framework updates, including updates to all pillars. AWS also launched a Serverless Lens, which can be found in the Well-Architected tool within the console.

Why use Well-Architected?

Customers want to build and deploy faster

Often, customers start with experimentation, and workloads tend to grow organically through successive additions. During this growth, it’s common to drift away from best practice, which makes it harder to layer further complexity on top. For this reason, we and AWS want to teach customers to align what they already have with best practices to ensure faster deployment and a better security posture.

Lower or Mitigate Risks

Risks span all pillars of the Framework (Reliability, Performance, Operational Excellence, Cost, and Security), and the Review process works to lower or mitigate them over time. Many customers aren’t sure what their risk profile is, which becomes a big worry for C-level executives, who start asking “where do we sit in our risk profile?” and “how do we work to reduce that over a 6-12 month period?”

Make Informed Decisions

With education and knowledge comes more power and better-informed decisions. Let’s say a customer has a 2-3 year-old workload that needs to be redeployed into a new environment with segregated accounts. Their options are to improve the existing environment or to migrate the workload to a new one. The Well-Architected Review will be able to show the work needed for the remediations, including the upfront changes, ongoing maintenance, and costs.

Learn AWS Best Practice

We have found that the happiest customers are those that feel well-educated from an AWS platform and service point of view. This includes instilling best practices, knowledge of new releases, the release cycles, and how services evolve. From here, they are more willing to share and build greater trust with AWS as a service.

What is the Well-Architected Framework?

The pillars are the cornerstone of any AWS architecture, where the vertical segregation can be applied to any workload whether that is Serverless, EC2 or Big Data.

AWS Well-Architected Framework pillars

Operational Excellence is the ability to understand how customers use their workloads. For example, how much time does an employee spend on routine tasks, and how are business objectives being met?

It’s crucial that security is applied in depth, at every layer, and that it is aligned with every service. This should always be a Day 1 initiative.

Reliability is the ability to withstand a geo-specific event or even something more local. The thing to consider is: what steps have been taken so that, no matter the event, your service can still run as needed?

Performance often bleeds into scalability. For example, on Black Friday, can your environment scale and remain elastic? Can it cope with a 5-, 6- or 7-fold increase? This also works in reverse: we want to avoid over-deploying services that are too compute-heavy, which will have an impact elsewhere.

Cost is a pillar that always comes with a welcome smile. Cost optimization is so often a positive with customers. It’s important to know how services evolve, what the different cost modeling options are, and how to use Spot Instances for some areas of the architecture. It also includes working with a partner to recode, using microservices, and using Lambda.

Design principles

The design principles represent the goals we are aiming to achieve in each pillar. Looking at the objectives of the workloads is key here, and a Review is preceded by many questions, architectural diagrams, and a gathering of information. The aim is to have a data-driven, informative review.

Intent of Review

The intent of a Review is to provide insight into best practices for AWS. It’s important to remember that it is not aligned with an audit or any regulatory body and that the data isn’t shared. It’s there to simply improve posture against all Well-Architected Framework pillars.

The Reviews provide pragmatic, proven advice that AWS knows works and that is tailored to the customer’s needs.

For example, if the customer has a tight security posture requirement, we will bias the review towards that.

When it comes to an AWS Review, it’s important to keep in mind that these aren’t intended as a one-time check. These reviews and best practice sessions should be run at a regular cadence; twice a year is often sufficient to avoid any glaring holes, and for any holes that are found, we want to find them early. It’s simple: a regular cadence provides greater efficiency.

Well-Architected Lenses

The Well-Architected Lenses were created to be specific to a workload type. While running the same review over diverse workloads was useful, AWS wanted to allow more specificity, so over the coming years more lenses will appear in the Well-Architected tool itself.

The design principles are specific to each lens, and the lens documentation includes popular scenarios; for instance, the Serverless Lens covers RESTful APIs and mobile backends. There are also lenses for High Performance Computing and for IoT.

At their core, the lenses are there to focus the review so that it has the maximum effect on the customer’s business outcome.

The Serverless Lens Design Principles

1. Speedy, Simple, Singular

It’s important that functions remain concise and single-purpose. Customers are already moving away from monolithic design; however, what we are seeing is Lambda code sprawling out like tentacles, with one function reaching into many responsibilities. We don’t want Swiss-army-knife Lambdas!
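As a minimal sketch of what "single-purpose" can look like, the two handlers below each do one job instead of sharing one function that branches on the request. The handler names, the order fields and the placeholder status are assumptions made for illustration; the event shapes follow the common API Gateway proxy format (`body`, `pathParameters`).

```python
import json


def create_order(event, context=None):
    """Single-purpose handler: validate and accept one new order."""
    order = json.loads(event["body"])
    if "item" not in order or order.get("quantity", 0) <= 0:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid order"})}
    return {"statusCode": 201, "body": json.dumps({"accepted": order["item"]})}


def get_order_status(event, context=None):
    """A separate handler, so it can be deployed, secured and scaled on its own."""
    order_id = event["pathParameters"]["order_id"]
    # Placeholder result; a real function would look the order up in a data store.
    return {
        "statusCode": 200,
        "body": json.dumps({"order_id": order_id, "status": "PENDING"}),
    }
```

Keeping the handlers apart means each one can be given only the permissions it needs and can be monitored and debugged in isolation.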

2. Concurrency

Serverless applications take advantage of the concurrency model, and the design trade-offs are made at the start. Remember to think in terms of concurrent requests: you don’t need to look at the total number of requests.
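To make the difference between concurrent and total requests concrete, here is a small worked estimate. It uses the usual rule of thumb that concurrency is roughly the request rate multiplied by the average duration of a request; the function name and the traffic figures are illustrative assumptions, not measurements.

```python
import math


def estimated_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Approximate number of requests in flight at the same moment."""
    in_flight = requests_per_second * avg_duration_seconds
    # Round first to absorb floating-point noise, then round up to whole executions.
    return math.ceil(round(in_flight, 6))


# Same request rate (and so the same total requests), very different concurrency:
fast = estimated_concurrency(200, 0.25)  # 50 concurrent executions
slow = estimated_concurrency(200, 2.0)   # 400 concurrent executions
```

Shortening the duration of a function reduces the concurrency it needs just as effectively as reducing traffic would.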

3. Share Nothing

Functions are short-lived by nature, so the underlying runtime environment, and anything held in it, isn’t guaranteed to persist. For anything that must be durable, decoupled persistent storage is preferred instead.
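A minimal sketch of the idea, assuming a hypothetical `InMemoryStore` class that stands in for a durable external store (a managed database or object store in practice): the handler keeps no state of its own and reads and writes everything through the store it is given.

```python
class InMemoryStore:
    """Hypothetical stand-in for a durable, external key-value store."""

    def __init__(self):
        self._items = {}

    def get(self, key, default=None):
        return self._items.get(key, default)

    def put(self, key, value):
        self._items[key] = value


def count_visit(event, store):
    """Stateless handler: all durable state lives in the external store."""
    user_id = event["user_id"]
    # Read-modify-write is not atomic; a real store would need an atomic increment.
    visits = store.get(user_id, 0) + 1
    store.put(user_id, visits)
    return {"user_id": user_id, "visits": visits}
```

Because nothing is kept in module-level variables, it does not matter which execution environment serves the next invocation.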

4. No Hardware Affinity

Using technology and hardware in an agnostic way means that code will keep working over a long period of time.

5. Orchestration

This is undoubtedly one of the key benefits of Serverless. Chaining functions together by having one call the next recreates the tightly coupled monolith designs we are used to, so please be mindful and don’t fall into that trap. AWS has state machine structures to build out the complex orchestration needed, so make full use of them.

With this in mind, combining precise, single-purpose functions means that you can build out those complex workflows much more easily.
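As an illustration, the sketch below describes a three-step workflow as a state machine definition in the Amazon States Language style, built as a Python dictionary and serialized to JSON. The state names, the function names and the account ID in the ARNs are hypothetical placeholders; the helper that checks transitions is only a local sanity check, not part of any AWS API.

```python
import json


def lambda_arn(name):
    # Hypothetical region and account ID, for illustration only.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


definition = {
    "Comment": "Order workflow orchestrated by a state machine, not by chained functions",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": lambda_arn("validate-order"),
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": lambda_arn("charge-payment"),
            # Retries are declared in the workflow rather than coded in the function.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "SendConfirmation",
        },
        "SendConfirmation": {
            "Type": "Task",
            "Resource": lambda_arn("send-confirmation"),
            "End": True,
        },
    },
}


def dangling_transitions(machine):
    """Return the names of states that are referenced but never defined."""
    states = machine["States"]
    targets = [machine["StartAt"]] + [s["Next"] for s in states.values() if "Next" in s]
    return [name for name in targets if name not in states]


definition_json = json.dumps(definition, indent=2)
```

Each function stays unaware of what runs before or after it; the order of steps and the retry policy live in the definition, where they can be changed without touching function code.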

6. Event-driven

A biggie in Serverless. Make use of this principle to ensure events and responses align with business functionality.
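A minimal sketch of an event-driven function, assuming a business event called "order placed" that arrives through a queue: the handler reacts to each record and produces the business response (a stock reservation). The `Records` and `body` keys follow the shape of an SQS-triggered Lambda event; the order fields and the handler name are assumptions for illustration.

```python
import json


def reserve_stock(event, context=None):
    """React to 'order placed' events delivered in a batch of queue records."""
    reservations = []
    for record in event["Records"]:
        order = json.loads(record["body"])
        reservations.append(
            {"order_id": order["order_id"], "reserved": order["quantity"]}
        )
    return reservations
```

The function is named after the business action it performs, which keeps the mapping between events and business functionality easy to follow.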

7. Failures and Duplicates

Another major component in Serverless. Ensure that appropriate retries for downstream calls are included within your code and, because the same event can arrive more than once, that processing a duplicate is safe.
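The sketch below shows one way to combine the two ideas: a retry helper with exponential backoff for downstream calls, and a wrapper that skips messages that have already been processed. `TransientError`, `call_with_retries` and `process_once` are hypothetical names, and the `seen` dictionary stands in for a durable store of processed message IDs.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller or the platform handle it.
            sleep(base_delay * 2 ** (attempt - 1))


def process_once(message_id, operation, seen, **retry_options):
    """Run operation at most once per message ID, so duplicates are harmless."""
    if message_id in seen:
        return seen[message_id]
    result = call_with_retries(operation, **retry_options)
    seen[message_id] = result
    return result
```

Passing `sleep` in as a parameter keeps the helper easy to test, and keeping `seen` outside the function reflects that deduplication state has to outlive any single invocation.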

Challenges for Serverless Teams

1. When it comes to Serverless, there is a lot of surface area: the many functions and managed services make for an environment that is complex to understand and navigate.

2. As a result of this, a significant amount of time is often spent on debugging, as opposed to building.

3. Responding to incidents quickly, and having the confidence that you’ll know about them in real time, before any customers find out.

4. Continuing to follow best practices as other areas become a priority.

Some of the common concerns about best practices are:

How can I be sure I’m following them as I should?
I want somebody to tell me what they are, and recommend how to be better!

How Can Tooling Help?

Observability

With such a large surface area and so much data moving around, observing your infrastructure is one of the hardest elements to keep up with. To make this easier, data should be centralized and made as accessible as possible.

It’s incredibly important not to see logs, metrics, and traces only in silos, but instead to look across the spectrum of your managed services. Ask, for example, how your SQS queue interacts with your Lambda functions.
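One way to make data from different services easier to join up is to emit structured log lines that share a correlation ID. This is a minimal sketch under that assumption; the field names, the service names and the `trace` helper are illustrative and not tied to any particular logging product.

```python
import json
import time


def log_event(service, message, correlation_id, **fields):
    """Emit one structured log line that can be joined across services."""
    entry = {
        "timestamp": time.time(),
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
        **fields,
    }
    line = json.dumps(entry)
    print(line)
    return line


def trace(lines, correlation_id):
    """List the services one request passed through, in the order logged."""
    entries = (json.loads(line) for line in lines)
    return [e["service"] for e in entries if e["correlation_id"] == correlation_id]
```

With every component logging the same identifier, a single query can follow a message from the queue into the function that processed it, instead of searching each log stream separately.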

Tooling also helps in reducing time to discovery and resolution. Good tooling tells you when something is wrong, what has gone wrong, and the best way to fix it. In doing so, it naturally encourages best practices too, making it the best way to automate all of the above.

Tooling for Serverless, particularly for monitoring, security, alerting and failure detection, should come down to automation and abstraction.

Take Cognito and SQS queues as examples. For a team to use SQS, they first need to implement the queues, and then understand the risks and monitor them. Once you start adding new queues and functions, that same alert coverage and monitoring for unknown failures must always be extended to the rest of the infrastructure.

It’s important, therefore, that tooling constantly adjusts itself to the ever-changing infrastructure.
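A rough sketch of what "adjusting itself" can mean in practice: alert rules are derived from an inventory of resources rather than written by hand, so a new queue or function is covered as soon as it appears in the inventory. The inventory format, the rule format and both function names are assumptions for illustration; the metric names are standard CloudWatch metrics for Lambda and SQS.

```python
# Default metrics to watch per resource type (an illustrative selection).
DEFAULT_METRICS = {
    "lambda": ["Errors", "Throttles"],
    "sqs": ["ApproximateAgeOfOldestMessage"],
}


def alert_rules(inventory):
    """Derive one alert rule per default metric for every known resource."""
    rules = []
    for resource in inventory:
        for metric in DEFAULT_METRICS.get(resource["type"], []):
            rules.append({"resource": resource["name"], "metric": metric})
    return rules


def uncovered(inventory, rules):
    """Names of resources that ended up with no alert rule at all."""
    covered = {rule["resource"] for rule in rules}
    return [r["name"] for r in inventory if r["name"] not in covered]
```

Running `uncovered` after every deployment gives a quick signal that a new kind of resource has been introduced without any default monitoring.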

It’s a bit of a no-brainer, but worth highlighting, that tooling helps to manage the underlying infrastructure. The log pipeline and log ingestion can be managed, as can an alarm or alerting system. This really is the Serverless way!

As touched on before, it also enables learning. A good tool makes it clear how the system has worked historically, and how changes have affected the system’s performance over time.

This article was put together based on a Dashbird webinar with Tim Robinson, Well-Architected Framework’s Geo Lead at AWS, and Taavi Rehemägi, CEO at Dashbird.

Previously published at https://dashbird.io/blog/building-complex-well-architected-serverless-architectures/