Enterprise Cloud Security: Setting Up Structure, Identity-Based Access, and Network Control

This article is part of the Enterprise Cloud Security Series. Part I: Introduction introduces the space and how it differs from on-premise security. Part II covers the security considerations for building the cloud foundation.

Enterprise cloud infrastructure can primarily be split into two pieces: foundation and application landing zone. The foundation refers to infrastructure setup typically performed by the cloud platform team within the enterprise. The application landing zone is the part of the infrastructure provided to developers with appropriate guardrails in place to deploy applications and associated infrastructure.

Please note that depending on the size and maturity of the organization, its multi-cloud strategy, and its hybrid design, there may be landing zones specific to a business unit, geolocation, and/or privacy or sensitivity level to enable the organization to meet its organizational or regulatory needs.

This article focuses on the role of security in various components that form part of the cloud foundation.

Management Structure

An important part of any cloud deployment within the enterprise is the need to organize the resources. Depending on the size of the organization, cloud service providers recommend different approaches.

Multi-account model

There has been significant interest in the use of AWS accounts, Azure subscriptions, and GCP projects to create landing zones for applications or higher-level organizational components.

This approach, if implemented correctly, can be an important part of the strategy to control the “blast radius” by providing micro-segmentation across identity and network, which is an important tenet of zero-trust architecture. Most standard cloud models use one or more of the following types of landing zones.

Management: The management zone is primarily used to concentrate all the components that are used for the management of the environment.
Shared service: The shared service zone contains shared services like DNS and NTP that are typically managed by the cloud platform team or a similar infrastructure team.
Network: Most networking models (like a hub-spoke model) use a network-specific landing zone to terminate the on-premise and/or inter-cloud connectivity, which simplifies management and connectivity.
Log aggregation: Log aggregation is an important part of logging and monitoring to ensure that a comprehensive view across all the landing zones is available for analysis.
Application: Application or organization unit-specific landing zones are zones with specific guardrails to ensure application teams can deploy infrastructure within predefined parameters like access to the network, allowed services and associated configuration, and the allowed access model.

Folders

Most Cloud Service Providers provide the capability to build a hierarchical model that enables organizations to structure a very large set of such landing zones into more manageable chunks. AWS organizational units (Control Tower), Azure management groups, and GCP folders can be used to structure landing zones based on the organizational structure, environments, security, and privacy considerations.

In addition to that, the ability to apply a specific set of organizational policies (AWS Config, AWS service control policies, Azure Policy, GCP Organization policies) based on specific criteria (e.g. non-prod vs prod, accounts with PII data) can also play a role in designing this model. I have seen the organizational hierarchy, where supported, used to provide access to a large number of resources (e.g. Database Administrators may need access to all databases across the accounts), which may be a questionable use of this functionality.
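To make the idea of policies flowing down a hierarchy concrete, here is a minimal sketch in Python. It assumes a simplified, purely additive inheritance model in which a landing zone is subject to every policy attached to itself or to any ancestor; actual evaluation rules differ by provider. The `Node` class, the node names, and the policy names are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A folder-like container or a landing zone (account, subscription, project)."""
    name: str
    parent: Optional["Node"] = None
    # Policies attached directly at this level (hypothetical policy names).
    policies: set = field(default_factory=set)


def effective_policies(node: Node) -> set:
    """Return the union of policies attached to the node and all of its ancestors."""
    result = set()
    current = node
    while current is not None:
        result |= current.policies
        current = current.parent
    return result


# Hypothetical hierarchy: root -> prod -> prod-pii -> payments landing zone.
root = Node("org-root", policies={"require-encryption-at-rest"})
prod = Node("prod", parent=root, policies={"deny-public-ip"})
pii = Node("prod-pii", parent=prod, policies={"restrict-regions"})
payments = Node("payments-account", parent=pii)
sandbox = Node("sandbox-account", parent=root)

print(sorted(effective_policies(payments)))
print(sorted(effective_policies(sandbox)))
```

In this sketch the payments landing zone picks up all three policies, while the sandbox only inherits the root policy. That is the property that makes criteria such as prod vs non-prod or PII vs non-PII natural candidates for levels in the hierarchy.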

Given that a lot of these features are currently evolving, it will be interesting to see whether organizations go through multiple phases of re-organization of management structure to better reflect and align with their requirements or use new capabilities that Cloud Service Providers introduce in this space.

Tagging

Tagging provides an orthogonal mechanism to the hierarchical model described above. Tagging is an important part of overall resource organization strategy and, if used correctly, can play an important part in security classification.

Tags can be used to apply many security facets, like privacy and criticality, across resources instead of making these aspects part of the hierarchical model. However, the lack of consistent features such as tag inheritance, prevention of tag overriding, and the ability to reference tags in policies and access rules has made this mechanism a challenge to use while designing management models within the enterprise.
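As an illustration of how tags could carry security classification, the following Python sketch validates a resource's tags against a required set. The tag keys and allowed values are assumptions made for this example, not a provider-defined schema; in practice a check like this would typically be enforced through the provider's policy engine rather than a standalone script.

```python
# Hypothetical required tags and their allowed values.
REQUIRED_TAGS = {
    "data-classification": {"public", "internal", "confidential", "pii"},
    "criticality": {"low", "medium", "high"},
}


def tag_violations(tags: dict) -> list:
    """Return a list of human-readable problems with a resource's tags."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if value is None:
            problems.append(f"missing tag: {key}")
        elif value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


print(tag_violations({"data-classification": "pii", "criticality": "high"}))
print(tag_violations({"criticality": "urgent"}))
```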

Identity-Based Access

Identity forms one of the core foundations of cloud platforms. It is very important to ensure that the identity layer is built to handle different use-cases and associated access models.

Account store

An important initial design decision for the cloud foundation is which user account store the cloud platform should use for each type of account. If the cloud’s internal account store (e.g. Azure AD Users, AWS IAM Users) is used, then each user account lifecycle must be managed within the cloud.

In addition to that, the internal store should also be able to support MFA and password management capabilities. Alternatively, by leveraging an external store and integrating through federated SSO, the user can be mapped (for example through a SAML assertion or a JWT) during the authentication process to an access role or profile in the internal store. This removes the need to manage the account in the cloud account store.

Please note that there will still be some basic accounts, like tenant owner, break-glass, and service accounts, that are needed for technical reasons. These should be created in the cloud’s internal account store and managed through applicable privileged account management processes.
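The mapping step can be sketched as follows in Python. This is a minimal illustration that assumes the assertion or token has already been validated (signature, audience, expiry) and that the identity provider sends group membership in a `groups` claim; the claim name, group names, and role names are hypothetical.

```python
# Hypothetical mapping from identity-provider group names to cloud roles.
GROUP_TO_ROLE = {
    "cloud-platform-admins": "FoundationAdmin",
    "app-team-payments": "AppDeveloper",
    "security-auditors": "ReadOnlyAuditor",
}


def roles_for_claims(claims: dict) -> list:
    """Return the cloud roles granted by the 'groups' claim.

    Groups without a mapping are ignored, so an unknown group grants nothing.
    """
    groups = claims.get("groups", [])
    return sorted({GROUP_TO_ROLE[g] for g in groups if g in GROUP_TO_ROLE})


# Claims as they might look after validation (illustrative values only).
print(roles_for_claims({"sub": "alice", "groups": ["app-team-payments", "unknown"]}))
print(roles_for_claims({"sub": "bob"}))
```

Because access is derived from the claims at sign-in, removing a user from a group in the external store removes the corresponding cloud role without touching any cloud-side account.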

Authentication

The cloud identity system should typically support user authentication against the internal store and through federated SSO mechanisms like SAML and OpenID Connect. It is important that the authentication process is flexible enough to support different types of authentication mechanisms for different types of accounts across all the access interfaces, i.e. Portal/Web console, Command Line Interface (CLI), and REST API. Lack of support for these mechanisms across all the access points has been one of the significant challenges to automation in the past.

Types of accounts

The cloud should be designed to support access from the following types of users.

End-users: represent the application users who access the environment through the edge components (e.g. AWS Application Load Balancer, Azure API Management). As part of the zero-trust architecture, their identity may need to flow to downstream PaaS and SaaS components to ensure that access can be managed and audited in the context of the end-user instead of a generic service account.
Developers (Privileged access): represent users with access to make changes to the infrastructure. Depending on the regulatory guidance and other internal policies, some or all of a developer’s access may be deemed privileged access. Privileged access typically triggers additional control requirements like access approval, session monitoring and recording, just-in-time access, and additional audits and monitoring across all the components to ensure comprehensive coverage. This helps to understand who tried to perform an action within a given context and for what reason. It is important to evaluate the configured authentication mechanism and identity store to ensure that they meet the policy requirements, especially with regard to making user identity details available in audit and monitoring records for traceability purposes.
Service Account: represents accounts used by automation scripts and other software that connect to the cloud to perform various operations. These accounts have specific lifecycles that differ from the user account lifecycle and may use different authentication mechanisms like API keys or certificates. Credential management may involve the expiry and rotation of the credential at a regular interval.
Machine Identity: is a special case of the service account that represents the identity of the VM or service making the call. It allows the establishment of identity within the cloud infrastructure (and possibly outside) without any need for a user ID and password.
Break-glass and owner accounts: represent cloud-native accounts that are expected to be used in case of any significant failure or lockout across the identity infrastructure. These accounts typically have a very high criticality associated and are typically used as accounts of last resort for accessing the environment.
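The credential rotation mentioned for service accounts can be sketched as a simple age check in Python. The 90-day interval is an assumption for illustration only; the actual interval should come from organizational policy.

```python
from datetime import datetime, timedelta, timezone

# Assumed rotation interval; set this from your own credential policy.
MAX_CREDENTIAL_AGE = timedelta(days=90)


def needs_rotation(created_at: datetime, now: datetime,
                   max_age: timedelta = MAX_CREDENTIAL_AGE) -> bool:
    """Return True when a credential has reached or exceeded its maximum age."""
    return now - created_at >= max_age


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(needs_rotation(created, datetime(2024, 3, 1, tzinfo=timezone.utc)))
print(needs_rotation(created, datetime(2024, 4, 1, tzinfo=timezone.utc)))
```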

Access Model

Most cloud services support a role-based access control (RBAC) model that associates permissions with a role based on the access model. The access model should follow the principle of least privilege to ensure that roles are defined in alignment with specific use-cases and then assigned to the users through a group.

It is important to go through the access model development for each and every service being used to ensure that permissions are classified into at least a core operational set associated with the cloud foundation (e.g. creation of a VPC or VNet) and an application development specific set (e.g. creation of VMs within a specific subnet). Depending on specific use-cases and the operational model, additional roles may be created for DevOps operations (e.g. continuous deployment).
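A minimal Python sketch of this model follows: permissions are attached to roles, roles are assigned to groups, and users receive access only through group membership, with everything else denied by default. All role, group, user, and permission names are hypothetical and do not correspond to any provider's built-in roles.

```python
# Roles split into a foundation set and an application set (hypothetical names).
ROLE_PERMISSIONS = {
    "FoundationOperator": {"network.vnet.create", "network.route.update"},
    "AppDeveloper": {"compute.vm.create", "compute.vm.delete"},
    "Deployer": {"compute.vm.create"},
}

# Roles are assigned to groups, never directly to users.
GROUP_ROLES = {
    "platform-team": {"FoundationOperator"},
    "payments-dev": {"AppDeveloper"},
    "payments-cd": {"Deployer"},
}

USER_GROUPS = {
    "alice": {"platform-team"},
    "bob": {"payments-dev"},
}


def permissions_for(user: str) -> set:
    """Collect the permissions a user holds through group-assigned roles."""
    granted = set()
    for group in USER_GROUPS.get(user, set()):
        for role in GROUP_ROLES.get(group, set()):
            granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def is_allowed(user: str, permission: str) -> bool:
    """Default deny: allow only permissions explicitly granted via a role."""
    return permission in permissions_for(user)


print(is_allowed("alice", "network.vnet.create"))
print(is_allowed("bob", "network.vnet.create"))
```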

Figure: Access Model for Azure RBAC

In addition to that, the following practices should be evaluated while designing an access control model.

Where possible, lock down security attributes (e.g. disk encryption should be enabled and should not be possible to disable). If this is not supported by the cloud platform, use a preventative policy or monitoring/remediation to overcome this platform deficiency.
Ensure that all the machine identities associated with application infrastructure for non-management components have very limited access (like adding log entries) so that, in case of a breach, no additional access is available through these machine identities.
Service accounts should have very limited access and/or should request access to the very limited scope needed for the work being performed by scripts, to reduce the chances of major outages.
Use cloud-specific features (e.g. AWS IAM limits the scope of an authenticated session to a specific role in an account) where available to limit the scope of access during assignment and as part of operations, to avoid potential major impact due to human or automation errors.
Consider the identity and access model within the service where applicable (e.g. VMs or databases have an internal identity and access model) as part of the overall service access model. In addition, either integrate the service’s internal identity and access model with the cloud platform identity or have a privileged access management solution in place to ensure adequate controls for monitoring privileged access.
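The first practice above, locking down security attributes through a preventative policy, can be sketched in Python as a check that runs before a resource change is accepted. The resource types, attribute names, and dictionary shape are assumptions made for this example; a real implementation would normally use the provider's policy service.

```python
# Hypothetical locked-down attributes: (resource type, attribute) -> required value.
LOCKED_ATTRIBUTES = {
    ("disk", "encryption_enabled"): True,
    ("storage_bucket", "public_access"): False,
}


def violations(resource: dict) -> list:
    """Return the locked attributes that a proposed resource configuration breaks.

    A missing attribute counts as a violation, so the secure value must be explicit.
    """
    found = []
    for (rtype, attr), required in LOCKED_ATTRIBUTES.items():
        if resource.get("type") == rtype and resource.get(attr) != required:
            found.append(f"{rtype}.{attr} must be {required}")
    return found


print(violations({"type": "disk", "encryption_enabled": False}))
print(violations({"type": "disk", "encryption_enabled": True}))
```

Rejecting the request when the list is non-empty gives the preventative behaviour; running the same check periodically against deployed resources gives the monitoring/remediation fallback.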

Network Control

Two aspects have a very significant impact on how most enterprises design the cloud:

Network connectivity with on-premise through datacenter connections (AWS Direct Connect, Azure ExpressRoute, GCP Cloud Interconnect).
Chokepoints for ingress and egress traffic to reduce the attack surface and cost and to enforce various security controls like DLP and malware monitoring.

Most enterprises use a hub-spoke model with defence in depth while designing network architecture even though that may not be the most appropriate model for zero-trust architecture.

A simple hub-spoke model is shown below.

Most earlier hub-spoke models were built with the hub located in a partner colocation facility or datacenter to ensure that all the traffic could pass through on-premise security appliances. Over the past few years, there has been significant growth in the availability of cloud-based network appliances for routing, next-generation firewall, Intrusion Detection System (IDS), Intrusion Prevention System (IPS), malware scanning, content monitoring, and data loss prevention (DLP). This has enabled more efficient connectivity between regions or across clouds without the need to route the traffic through a datacenter unless required by organizational policy.

In addition to that, you can use intermediate networks between the hub and application networks with static routing to isolate sensitive workloads.

Some of the on-premise approaches, like multi-homed VMs with dedicated NICs for administration and backup services, are typically not replicated in the cloud. Identify such practices and plan for alternate designs that can scale in the cloud.

Ensure that you plan for the presence of shared services like DNS, NTP, vulnerability scanning, EDR, etc. in each cloud and region to reduce the need to communicate with on-premise infrastructure, and collect logs in local storage for analysis to reduce egress charges.

Ensure adequate sizing of IP networks for workloads of different sensitivity to enable simple firewall and route rules and to avoid leaking traffic across sensitivity and criticality boundaries. Where available, use named IP collections or tags to create rules to simplify rule updates across complex network architectures.

Where possible, make calls to the cloud control plane and data plane over a private network (e.g. Azure Private Link, AWS VPC endpoints) to reduce the flow of data over a public network.
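To illustrate how contiguous, well-sized address blocks keep rules simple, here is a small Python sketch using the standard `ipaddress` module. The `10.20.0.0/16` range, the four sensitivity tiers, and the equal /18 split are assumptions for the example; real sizing depends on the expected workload per tier.

```python
import ipaddress

# Hypothetical address space for one spoke network.
SPOKE = ipaddress.ip_network("10.20.0.0/16")

# Split the /16 into four equal /18 blocks, one per sensitivity tier.
TIERS = ["public", "internal", "confidential", "restricted"]
ZONES = dict(zip(TIERS, SPOKE.subnets(new_prefix=18)))


def zone_of(ip: str):
    """Return the sensitivity tier an address belongs to, or None if outside the spoke."""
    addr = ipaddress.ip_address(ip)
    for name, network in ZONES.items():
        if addr in network:
            return name
    return None


for name, network in ZONES.items():
    print(name, network, network.num_addresses)

print(zone_of("10.20.70.5"))
print(zone_of("10.21.0.1"))
```

Because each tier is a single CIDR block, a firewall or route rule between tiers needs only one source and one destination prefix instead of a list of scattered subnets.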

Additional Considerations

An important lesson to take away is that foundation technologies like identity, network, and management structures form an intricate mesh with each other, and that the security implementation percolates through all of them. Besides these foundational technologies, there are additional considerations that should be kept in mind while building the cloud foundation.

Resiliency

Resiliency is an important pillar of building a foundation, so that applications can build failover and disaster recovery on a platform that provides resiliency across identity, network, and shared services. This is typically achieved through the right combination of cloud-native capabilities, like global services, paired regions, and geo-replication, and shared services designed to be resilient across all the active regions and geographies.

Shared services

Most cloud foundations are developed with a few shared services like DNS, NTP, etc. These services may be either built-in cloud capabilities or deployed as infrastructure to achieve integration with on-premise infrastructure where such integration is not possible with the built-in cloud capability. It is very important to ensure that all the controls expected to secure any other application are applied to these services, including but not limited to:

Privileged Access: access to these services should be limited to a very small team, and all changes should preferably be made through a change control workflow with adequate review and approval in place.
Vulnerability and Drift: services and platforms should be scanned and penetration tested to detect vulnerabilities and patched at an appropriate interval. In addition to that, the configuration should be monitored to identify any change from the “desired state” based on the hardening configuration.
Restrict network access: reduce the attack surface by restricting network access to trusted source networks and to the specific ports needed. Where possible, access over secured channels (e.g. TLS) should be enforced.
Data: data stored by the service should be secured at rest and in motion. All the data should be backed up at a regular interval, and recovery should be tested at appropriate time intervals to ensure that the backup process is appropriate. Use built-in or custom data integrity checks to ensure that data has not been tampered with.
Logging: audit events and other operations performed should be logged and stored for the duration prescribed by organization policies and industry practices (e.g. leading practice for incident management suggests storing log data for at least 180 days, which may not align with organization policy or may exceed compliance requirements).
Resiliency: resiliency of services should be planned for at design time, and the implementation should be verified at regular intervals to ensure that the service is highly available within the parameters needed by dependent services.
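The drift monitoring described in the list above can be sketched in Python as a comparison between a hardening baseline and the configuration actually observed. The setting names and values are hypothetical; how the observed configuration is collected is outside the scope of this sketch.

```python
ABSENT = "<absent>"  # marker for a setting that exists on only one side


def find_drift(desired: dict, actual: dict) -> dict:
    """Return settings whose observed value differs from the desired state.

    The result maps each drifting setting to a (desired, actual) pair.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical hardening baseline and observed configuration for a shared service.
desired_state = {"tls_min_version": "1.2", "ssh_root_login": "no", "ntp_server": "10.0.0.5"}
observed_state = {"tls_min_version": "1.0", "ssh_root_login": "no", "debug_port": "8080"}

for setting, (want, have) in find_drift(desired_state, observed_state).items():
    print(f"{setting}: desired={want} actual={have}")
```

The check reports settings that were changed, removed, or added outside the baseline, which are the three kinds of change a "desired state" comparison needs to surface.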

Log aggregation

One of the shared services that forms an important part of security operations is the log aggregation infrastructure. This capability enables collection of logs across different landing zones, services, platforms, and networks into one or more aggregation storage sites like an AWS S3 bucket, Azure Log Analytics Workspace, or GCP Cloud Storage for further analysis. The log aggregation platform should be able to handle the following requirements in addition to the regular security requirements identified above for shared services.

Large volume of data: with the growth of services being used and the size of the cloud footprint, data ingestion can grow up to 500GB/day for very large implementations.
Integrity: the data should typically be stored as write-once-read-many (WORM) to ensure that the integrity of sensitive audit data is maintained.
Expiry: old data should automatically be removed from the platform to ensure that only the appropriate data is maintained over time.
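A rough sizing and expiry calculation can be sketched in Python as follows. It combines the 500GB/day ingestion figure with the 180-day retention example mentioned earlier, and assumes a steady ingestion rate with no compression or tiering; both function names are hypothetical.

```python
from datetime import date, timedelta


def retained_volume_gb(daily_gb: float, retention_days: int) -> float:
    """Steady-state storage needed, assuming constant ingestion and no compression."""
    return daily_gb * retention_days


def is_expired(written_on: date, today: date, retention_days: int) -> bool:
    """Return True when a log object is older than the retention period."""
    return today - written_on > timedelta(days=retention_days)


# 500 GB/day kept for 180 days is 90,000 GB, i.e. roughly 90 TB.
print(retained_volume_gb(500, 180))

print(is_expired(date(2024, 1, 1), date(2024, 6, 29), 180))
print(is_expired(date(2024, 1, 1), date(2024, 6, 30), 180))
```

In practice, expiry is usually delegated to the storage platform's lifecycle or retention settings, and the WORM lock period should be aligned with the retention period so that expiry and immutability do not conflict.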

This article tries to cover various security considerations while building the cloud foundation within an enterprise. This is an ongoing exercise that I will try to continuously improve upon.

Also published at https://medium.com/jhash/enterprise-cloud-security-foundation-f2cdeb0c84a4