How to Build Multi-Tenant Internal Services in AWS and CDK (Part 1): API Gateway and AppSync

In this blog post series, I would like to discuss best practices for building multi-tenant services in AWS. Existing literature on how to build multi-tenant services is usually aimed at SaaS applications with hundreds of customers (e.g. Building a Multi-Tenant SaaS Solution Using AWS Serverless Services).

The main rationale for this series is to focus on building multi-tenant services for use cases with fewer clients that are all deployed to AWS accounts. Usually, this would apply to scenarios when you build a multi-tenant service for internal use.

I will split the series of blog posts into three parts for each type of service-to-service integration: synchronous, asynchronous, and batch integration.

Part 1 will discuss multi-tenant architecture for two AWS services: API Gateway and AppSync. Throughout the article, I refer to the code from the sample application built for this article in TypeScript and AWS CDK: https://github.com/filletofish/aws-cdk-multi-tenant-api-example/tree/main.

Content Overview

1. Multi-tenancy for internal services

1.1. Tenant isolation

1.2. Multi-tenant monitoring

1.3. Scaling

2. Multi-tenancy with API Gateway

2.1. Tenant-isolation - access-control

2.2. Tenant-isolation - noisy neighbor problem

2.3. Multi-tenant monitoring

2.4. Metrics, Alarms, Dashboards

2.5. Onboarding and offboarding API clients

3. Multi-tenancy with AWS AppSync

4. Conclusion

1. Multi-tenancy for internal services

Multi-tenancy is the ability of software to serve multiple customers or tenants with a single instance of the software.

Once you allow more than one team to call your service API, your service becomes multi-tenant. Multi-tenant architecture introduces additional complexity to your services, such as tenant isolation, tenant-level monitoring, and scaling.

1.1. Tenant isolation

Tenant isolation addresses two concerns. The first is security: tenants must be prevented from accessing another tenant’s resources. The second is fault isolation: failures or load spikes caused by one tenant should not impact other tenants of your service, which is often referred to as the noisy neighbor problem. See more in the AWS Whitepaper on Tenant Isolation Strategies https://d1.awsstatic.com/whitepapers/saas-tenant-isolation-strategies.pdf.

1.2. Multi-tenant monitoring

Once multiple tenants start sharing infrastructure resources, you would need to monitor how each of your tenants uses your system. It usually means that the tenant name or identifier should be present in your logs, metrics, and dashboards. Multi-tenant monitoring could be useful for several reasons:

Troubleshooting issues: simplifies problem identification and resolution by distinguishing tenant-specific issues from broader ones.
Resource allocation and capacity planning: lets you track per-tenant resource consumption. Even if your service is serverless, you still need to understand how much your clients consume to see whether you are about to hit any AWS limits (a typical example is the Lambda function concurrent execution limit).
SLA management: allows tracking of tenant-specific performance against SLAs.
Billing: it’s unlikely that you will start billing other teams for using your internal service right away. However, at some scale of company growth, billing other teams could be a good way to encourage frugal usage of the service.

1.3. Scaling

Multi-tenant services are likely more exposed to scaling challenges than single-tenant services. However, scalability is a huge topic and I won’t cover it in this blog post.

2. Multi-tenancy with API Gateway

If you are building a web service with a REST, HTTP, or WebSocket API in AWS, you are most likely using API Gateway.

2.1. Tenant-isolation — access-control

AWS recommends deploying each service in its own AWS account(s) to isolate the service’s resources and data, simplify cost management, and separate test and production environments (see details in AWS Whitepaper Organizing Your AWS Environment Using Multiple Accounts).

If your company’s services are deployed in AWS, then the most obvious solution for managing access to your API Gateway is AWS IAM. AWS Cognito is another option for managing access to a multi-tenant API (see Throttling a tiered, multi-tenant REST API at scale using API Gateway, The case for and against Amazon Cognito).

A comparison between AWS IAM and AWS Cognito deserves a separate deep-dive. For this article, I will stick with AWS IAM, as it’s the simplest way to manage access when your company’s services are in AWS.

Once you enable AWS IAM authorization for the API Gateway Method (see CFN), all API requests for this method should be signed with credentials of IAM identity allowed to call your API Gateway.

By default, no access is allowed between AWS accounts. For example, invoking your API Gateway with credentials of another AWS account will fail. To integrate your customers with your API you need to set up cross-account access. For granting cross-account access to your API Gateway you can use two methods: resource-based authorization (not available for API Gateway HTTP API) and identity-based authorization (see more at https://repost.aws/knowledge-center/access-api-gateway-account):

Onboarding a client with resource-based authorization. For resource-based access, you need to update the API Gateway Resource Policy and add the AWS Account of your client. The main disadvantage of this method is that once you update the resource policy, the API Gateway stage needs to be redeployed for changes to take effect (see AWS docs [1] and [2]). However, if you use CDK you can automate the deployment of new stages (see AWS CDK Docs for Api Gateway). Another disadvantage is the limit for the maximum length of resource policy.
Onboarding a client with identity-based authorization. For identity-based access control, you need to create an IAM role for the client and allow the client to assume it by updating the role’s trust policy (trust relationships). You could use IAM users, but IAM roles are better from the security point of view: roles allow authentication with temporary credentials and do not require storing IAM user credentials. There is a default limit of 1,000 roles per account, but this limit is adjustable. The other disadvantage of the role-based method is that you need to create an IAM role for every new API client. However, role management can be automated with CDK (see code sample from provided CDK app).
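To make the two onboarding options more concrete, here is a minimal sketch that builds the relevant policy documents as plain objects. The helper names, account IDs, and API ARN are made up for illustration, and the sample application may well structure this differently (for example, with CDK constructs instead of raw JSON).

```typescript
// Hypothetical helpers: they only build IAM policy documents as plain objects.
interface PolicyStatement {
  Effect: 'Allow' | 'Deny';
  Principal?: { AWS: string[] };
  Action: string;
  Resource?: string;
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// An AWS account expressed as an IAM principal ARN.
const accountRoot = (accountId: string): string => `arn:aws:iam::${accountId}:root`;

// Option 1 (resource-based): API Gateway resource policy that lets
// the listed client accounts invoke the API.
function buildApiResourcePolicy(clientAccounts: string[]): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Principal: { AWS: clientAccounts.map(accountRoot) },
      Action: 'execute-api:Invoke',
      // Shorthand for any stage, method, and path of the API the policy is attached to.
      Resource: 'execute-api:/*/*/*',
    }],
  };
}

// Option 2 (identity-based), part A: trust policy of the per-client role,
// which lets the client accounts assume the role.
function buildClientRoleTrustPolicy(clientAccounts: string[]): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Principal: { AWS: clientAccounts.map(accountRoot) },
      Action: 'sts:AssumeRole',
    }],
  };
}

// Option 2, part B: permissions policy of the per-client role,
// which allows invoking the API identified by apiArn.
function buildClientRoleInvokePolicy(apiArn: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Action: 'execute-api:Invoke',
      Resource: `${apiArn}/*/*/*`,
    }],
  };
}
```

With the resource-based option, the calling identity in the client account typically also needs an identity policy that allows execute-api:Invoke on your API. With the role-based option, the role lives in your account, so both policies stay under your control.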

AWS IAM authorization only allows you to control access to the API Gateway (using an IAM policy, you can specify which AWS account can call which API Gateway endpoints). It’s your responsibility to implement access control for the data and other underlying resources of your service. Within your service, you can use the AWS IAM ARN of the caller that is passed with the API Gateway request for further access control:

import { APIGatewayEvent, APIGatewayProxyResult, Context } from 'aws-lambda';

export const handler = async (event: APIGatewayEvent, context: Context): Promise<APIGatewayProxyResult> => {
  // IAM Principal ARN of the api caller
  const callerArn = event.requestContext.identity.userArn!;

  // .. business logic based on caller
  return {
    statusCode: 200,
    body: JSON.stringify({
      message: `Received API Call from ${callerArn}`,
    })
  };
};
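The caller ARN is a plain string, so the business logic usually needs to parse it before making decisions. The sketch below is one possible way to do that; the function names and the ownership rule are hypothetical and only illustrate the idea of tying data access to the caller identity.

```typescript
interface CallerIdentity {
  // Account that owns the role or user (for a role in your account, this is your account).
  accountId: string;
  // 'assumed-role', 'role' or 'user'.
  principalType: string;
  // Role name or user name, without path or session name.
  principalName: string;
}

// Parses ARNs such as:
//   arn:aws:sts::123456789012:assumed-role/RoleName/SessionName
//   arn:aws:iam::123456789012:user/path/UserName
// Returns undefined when the ARN does not look like an IAM principal.
function parseCallerArn(arn: string): CallerIdentity | undefined {
  const match = /^arn:[^:]+:(?:sts|iam)::(\d{12}):(assumed-role|role|user)\/(.+)$/.exec(arn);
  if (!match) {
    return undefined;
  }
  const [, accountId, principalType, rest] = match;
  const segments = rest.split('/');
  // For assumed roles the role name comes first and the session name last;
  // for users and roles the name is the last segment (after an optional path).
  const principalName =
    principalType === 'assumed-role' ? segments[0] : segments[segments.length - 1];
  return { accountId, principalType, principalName };
}

// Hypothetical rule: a record is accessible only to the client role that owns it.
function canAccess(callerArn: string, recordOwnerRole: string): boolean {
  const caller = parseCallerArn(callerArn);
  return caller !== undefined && caller.principalName === recordOwnerRole;
}
```

Inside the handler, `canAccess(callerArn, record.ownerRole)` (with `ownerRole` being whatever ownership attribute your data model stores) could then gate every read or write.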

2.2. Tenant-isolation — noisy neighbour problem

The default API Gateway limit is 10,000 TPS per AWS account per Region (API Gateway Quotas and Limits). However, due to your downstream dependencies, your service might require a lower TPS limit. To prevent an overload of API requests from a single tenant from impacting the availability of the whole system, you should implement per-tenant API rate limiting (also referred to as “throttling” or “admission control”).

You can use API Gateway usage plans and API keys to configure limits for each client separately (for details see AWS documentation [1], [2], and [3]).
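A usage plan throttle is defined by a rate and a burst, and API Gateway documents its throttling in terms of the token bucket algorithm. The following is a simplified in-memory model of that idea, not API Gateway's actual implementation; the class name and the explicit clock parameter are assumptions made to keep the example deterministic.

```typescript
// Simplified token bucket: `burst` is the bucket capacity,
// `ratePerSecond` is how fast tokens are refilled.
class TokenBucket {
  private readonly ratePerSecond: number;
  private readonly burst: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerSecond: number, burst: number, nowMs: number = 0) {
    this.ratePerSecond = ratePerSecond;
    this.burst = burst;
    this.tokens = burst; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is admitted, false if it should be throttled.
  tryAcquire(nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per tenant: a noisy tenant drains only its own bucket.
const paymentBucket = new TokenBucket(10, 2);
const orderBucket = new TokenBucket(1, 1);

// Three simultaneous calls from one tenant: the third exceeds the burst of 2.
const paymentResults = [0, 0, 0].map(() => paymentBucket.tryAcquire(0));
// The other tenant is unaffected and still admitted.
const orderResult = orderBucket.tryAcquire(0);
```

The point of keeping one bucket per client is exactly the isolation property discussed above: a spike from one client is rejected at the edge before it can consume capacity that other clients depend on.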

2.3. Multi-tenant Monitoring

API Gateway has two types of logs:

API Gateway Execution Logs: contains data such as request or response parameter values, what API keys are required, whether usage plans are enabled, and so on. Not enabled by default, but can be configured.
API Gateway Access Logs feature: allows you to log who has accessed your API, how it was accessed, what endpoint was accessed, and the result of the API call. You can provide your log format and choose what to log with context variables (see docs, in CDK).

To monitor the requests of your API clients, I would recommend enabling access logging. At the very least, you can log the AWS IAM ARN of the caller ($context.identity.userArn), the request path ($context.path), your service response status code ($context.status), and API call latency ($context.responseLatency).

Personally, for a service with AWS IAM Auth and Lambda function as compute I found this API Gateway Access Logging configuration useful:


const formatObject = {
  requestId: '$context.requestId',
  extendedRequestId: '$context.extendedRequestId',
  apiId: '$context.apiId',
  resourceId: '$context.resourceId',
  domainName: '$context.domainName',
  stage: '$context.stage',
  path: '$context.path',
  resourcePath: '$context.resourcePath',
  httpMethod: '$context.httpMethod',
  protocol: '$context.protocol',
  accountId: '$context.identity.accountId',
  sourceIp: '$context.identity.sourceIp',
  user: '$context.identity.user',
  userAgent: '$context.identity.userAgent',
  userArn: '$context.identity.userArn',
  caller: '$context.identity.caller',
  cognitoIdentityId: '$context.identity.cognitoIdentityId',
  status: '$context.status',
  integration: {
    // The status code returned from an integration. For Lambda proxy integrations, this is the status code that your Lambda function code returns.
    status: '$context.integration.status',
    // For Lambda proxy integration, the status code returned from AWS Lambda, not from the backend Lambda function code.
    integrationStatus: '$context.integration.integrationStatus',
    // A string that contains an integration error message.
    error: '$context.integration.error',
    latency: '$context.integration.latency',
  },
  error: {
    responseType: '$context.error.responseType',
    message: '$context.error.message',
  },
  requestTime: '$context.requestTime',
  responseLength: '$context.responseLength',
  responseLatency: '$context.responseLatency',
};

const accessLogFormatString = JSON.stringify(formatObject);
const accessLogFormat = apigw.AccessLogFormat.custom(accessLogFormatString);

Once logging is enabled, you can use CloudWatch Logs Insights to easily get the latest calls from a chosen API client with:


fields @timestamp, path, status, responseLatency, userArn
| sort @timestamp desc
| filter userArn like 'payment-service'
| limit 20

2.4. Metrics, Alarms, Dashboards

CloudWatch Metrics supported by API Gateway by default are aggregated for all requests. But you can parse API Gateway access logs to publish custom CloudWatch metrics with an additional dimension of your client name to be able to monitor client (tenant) usage of your API. At the very minimum, I would recommend publishing per-client CloudWatch metrics Count, 4xx, 5xx, Latency split by Dimension=${Client}. You could also add dimensions like status code and API path.

2.4.1. Using metric log filters for publishing per-client metrics

CloudWatch Metric Log Filters (see docs) allow you to provide a custom filter and extract metric values from API Gateway Access Logs (see example below). Metric Log Filters also allow extracting value for custom metrics dimensions from logs. For multi-tenancy monitoring, the dimension Client could be the IAM ARN of the caller.

The main advantages of Metric Log Filters are that (1) there is no compute to manage and (2) they are simple and cheap. However, you cannot do any data modifications (e.g. set more readable client names instead of IAM ARNs), and there is a limit of 100 metric filters per log group (docs).

Example of CloudWatch Metric Log Filter to Publish Count with dimension Client and Path


new logs.MetricFilter(this, 'MultiTenantApiCountMetricFilter', {
  logGroup: accessLogsGroup,
  filterPattern: logs.FilterPattern.exists('$.userArn'),
  metricNamespace: metricNamespace,
  metricName: 'Count',
  metricValue: '1',
  unit: cloudwatch.Unit.COUNT,
  dimensions: {
    client: '$.userArn',
    method: '$.httpMethod',
    path: '$.path',
  },
});

See all metric filters for 4xx, 5xx error, and latency metrics at the provided sample CDK application.

2.4.2. Using Lambda function for publishing per-client metrics

The alternative option is to create a Lambda function to parse the logs, extract metrics, and publish them. This allows you to do more custom processing, like filtering out unknown clients or extracting the client name from the userArn.

It takes just a couple of lines of CDK code to subscribe a Lambda function to API Gateway Access Logs:

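import * as lambda from 'aws-cdk-lib/aws-lambda-nodejs';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as logsd from 'aws-cdk-lib/aws-logs-destinations';

const logProcessingFunction = new lambda.NodejsFunction(
  this,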
  'log-processor-function',
  {
    functionName: 'multi-tenant-api-log-processor-function',
  }
);

new logs.SubscriptionFilter(this, 'MultiTenantApiLogSubscriptionFilter', {
  logGroup: accessLogsGroup,
  destination: new logsd.LambdaDestination(logProcessingFunction),
  filterPattern: logs.FilterPattern.allEvents(),
});

See full example in code as well as implementation of Log Processor Lambda Function.
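For orientation, here is a rough sketch of what the core of such a log processor could look like once a single access log line has been decoded. It is not the implementation from the sample application: the `api-client-` role name prefix, the function names, and the metric names are assumptions, and decoding the CloudWatch Logs subscription payload and calling PutMetricData are left out.

```typescript
// Subset of the access log fields defined in the log format above.
interface AccessLogRecord {
  userArn: string;
  path: string;
  status: string;
  responseLatency: string;
}

// Shaped after a CloudWatch PutMetricData datum (only the fields used here).
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: 'Count' | 'Milliseconds';
  Dimensions: { Name: string; Value: string }[];
}

// Hypothetical convention: each client calls the API through a role named api-client-<name>.
const ROLE_PREFIX = 'api-client-';

// Maps an assumed-role ARN to a readable client name, or undefined for unknown callers.
function clientNameFromArn(userArn: string, knownClients: string[]): string | undefined {
  const roleName = userArn.split('/')[1] || '';
  if (roleName.indexOf(ROLE_PREFIX) !== 0) {
    return undefined;
  }
  const name = roleName.slice(ROLE_PREFIX.length);
  return knownClients.indexOf(name) !== -1 ? name : undefined;
}

// Converts one JSON access log line into per-client metric data.
function toMetricData(logLine: string, knownClients: string[]): MetricDatum[] {
  const record = JSON.parse(logLine) as AccessLogRecord;
  const client = clientNameFromArn(record.userArn || '', knownClients);
  if (client === undefined) {
    return []; // skip calls from unknown clients
  }
  const status = Number(record.status);
  const dimensions = [{ Name: 'Client', Value: client }];
  const counter = (name: string, hit: boolean): MetricDatum => ({
    MetricName: name,
    Value: hit ? 1 : 0,
    Unit: 'Count',
    Dimensions: dimensions,
  });
  return [
    counter('Count', true),
    counter('4xx', status >= 400 && status < 500),
    counter('5xx', status >= 500),
    {
      MetricName: 'Latency',
      Value: Number(record.responseLatency),
      Unit: 'Milliseconds',
      Dimensions: dimensions,
    },
  ];
}
```

Publishing zeros for the 4xx and 5xx counters on successful calls keeps error-rate math simple in dashboards, at the cost of more data points; whether that trade-off is worth it depends on your traffic.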

Once you have started publishing API Gateway metrics that are split by Client, you can now create CloudWatch Dashboards and CloudWatch Alarms for each client separately.

2.5. Onboarding and offboarding API clients

Your CDK app could be an easy solution to store a config with client names, their AWS accounts, requested TPS limits, and other metadata. To onboard a new API client you would need to add it to the config managed in code:

interface ApiClientConfig {
  name: string;
  awsAccounts: string[];
  rateLimit: number;
  burstLimit: number;
}

const apiClients: ApiClientConfig[] = [
  {
    name: 'payment-service',
    awsAccounts: ['111122223333','444455556666'],
    rateLimit: 10,
    burstLimit: 2,
  },
  {
    name: 'order-service',
    awsAccounts: ['777788889999'],
    rateLimit: 1,
    burstLimit: 1,
  },
];

Using this config, the CDK app can then create an IAM role and an API Gateway usage plan with an API key for each client, and pass the name of the client to the Lambda function that parses access logs (see it in the sample application code).
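Since the config is plain code, it can also be checked before any resources are created. The sketch below is an optional guard, not something taken from the sample application; `validateApiClients` and its rules are assumptions about what a reasonable check could look like.

```typescript
// Mirrors the config interface shown above.
interface ApiClientConfig {
  name: string;
  awsAccounts: string[];
  rateLimit: number;
  burstLimit: number;
}

// Returns a list of human-readable problems; an empty list means the config looks sane.
function validateApiClients(clients: ApiClientConfig[]): string[] {
  const errors: string[] = [];
  const seenNames: string[] = [];

  for (const client of clients) {
    // Client names end up in role names and metric dimensions, so they must be unique.
    if (seenNames.indexOf(client.name) !== -1) {
      errors.push(`duplicate client name: ${client.name}`);
    }
    seenNames.push(client.name);

    if (client.awsAccounts.length === 0) {
      errors.push(`${client.name}: at least one AWS account is required`);
    }
    for (const account of client.awsAccounts) {
      // AWS account IDs are 12-digit numbers.
      if (!/^\d{12}$/.test(account)) {
        errors.push(`${client.name}: invalid AWS account id "${account}"`);
      }
    }
    if (!(client.rateLimit > 0)) {
      errors.push(`${client.name}: rateLimit must be positive`);
    }
    if (!(client.burstLimit > 0)) {
      errors.push(`${client.name}: burstLimit must be positive`);
    }
  }
  return errors;
}
```

Calling this at the top of the stack and throwing when the list is non-empty would fail `cdk synth` early, so a typo in an account ID never reaches a deployment.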

3. Multi-tenancy with AWS AppSync

If your service has a GraphQL API you probably use AppSync. Similarly to API Gateway, you can use IAM Auth to authorize AppSync requests. AppSync does not have a resource policy (see GH issue), so you can only use a role-based authorization for setting up access control to AppSync API. Similarly to API Gateway, you would create a separate IAM role for every new tenant of your service.

Unfortunately, AppSync has limited support for per-client throttling that we need for tenant isolation and monitoring. While you can set up TPS limits for AppSync with WAF, you cannot create separate per-client limits to isolate your service tenants. Similarly, AppSync does not provide access logs as API Gateway does.

Solution? You can add API Gateway as a proxy in front of your AppSync API and use all the API Gateway features described above to implement multi-tenancy requirements like tenant isolation and monitoring. On top of that, you can use other API Gateway features like Lambda Authorizers, Custom Domain, and API lifecycle management that do not yet exist in AppSync. The disadvantage is a slight additional latency for your requests.

4. Conclusion

That’s it. If you have any questions or ideas, let me know in the comments or contact me directly. In the next part of this series, I will review best practices for asynchronous internal integration with AWS EventBridge and AWS SQS / SNS.

If you want to dive deep into the topic of building multi-tenant services on top of AWS I found these resources useful:

Also published here.