From on-prem to AWS to ECS and beyond. The past 5 years at Arthrex Digital Media.

Written by cleblanc87 | Published 2017/12/03
Tech Story Tags: docker | aws | devops | infrastructure-as-code | microservices


This December marks 5 years for me at Arthrex within the Digital Media department (which didn’t even exist when I started). Arthrex is a healthcare company specializing in orthopedic devices, hardware, and techniques. The Digital Media Team is specifically focused on surgeon and patient education, surgery outcome tracking, and more.

While reflecting on this 5 year milestone, I felt it was an infrastructure evolution story worth sharing: a story of our initial migration from on-prem VMs managed by another internal team to the AWS cloud. Then you can follow along with our migration from an EC2-centric deployment to our current ECS approach. Finally, we will take a look into the near future and cover our plans for 2018. I will gloss over deprecated/retired technologies we have used and go into more detail as we approach our current deployment.

Day 1: December 17, 2012, four short days before the supposed end of the world. We had a small handful of web applications we were responsible for, written in either ColdFusion or Flex. Our subject of study for this era will be arthrex.com, which was hosted on two Windows 2008 VMs and one bare metal Windows system (affectionately known as "node a", b, and c). These app servers were load balanced by a Cisco NetScaler. The datastore for all of our apps at the time was a SQL Server 2008 instance set up for manual failover across data centers.

Software and system patches, along with deployments, were done manually. At worst, they required dropping a node from the load balancer, patching, and rebooting the box before re-adding it to the load balancer, all by hand, zero automation. At the very least, the application server would need to be restarted during deploys.

Over time these nodes started to show their age, mostly during deployments, which at the time had a bad habit of happening late in the day. Anytime a node had to be restarted, we were not certain it would come back up in a timely manner. Couple that with requests backing up at the load balancer and overloading the very slow (at the time) app servers once they were ready to serve and back in the load balancer. We ended up having to pre-warm our nodes before adding them back, coordinating over a WebEx to bring them in and out of the load balanced group.

All of the resources we just covered, along with DNS, email, and monitoring (or lack thereof), were managed by an infrastructure team outside of our department. Conflicting priorities in a growing company and differing technical ideologies made it difficult to make changes, plan deploys, or even troubleshoot at times.

Things needed to change, time to take our fate into our own hands.

Let’s skip ahead to Q4 2013, one year later, and the context of surgicaloutcomesystems.com (SOS), an internally developed Rails application ready for deployment. This application was a rewrite of the then on-prem Flex application containing PHI. With it moving from Flex to Rails, the question was: how do we host it? Do we submit a help desk ticket along with an email to the infrastructure team and be at their whim? After much internal debate, discussions with AWS, and many meetings, the business case was made and agreed upon to move to AWS. As far as I know, we were one of the first companies to sign a BAA with AWS in regards to HIPAA compliance. Certainly, at least, SOS would become the first Arthrex application hosted in the cloud.

The stage was set, red tape cut, and handcuffs removed. The order: one Rails application hosted in a highly available, HIPAA-compliant manner on AWS. Infrastructure as code is a core tenet of our DevOps team. For configuration management, masterless Puppet would sync each system with our desired state. Orchestration was done via nested CloudFormation stacks partitioned logically into networking, compute, datastore, etc. For CI and deployment jobs, we spun up a Jenkins server; on it we also installed Hubot for ChatOps-driven deployments in HipChat. For monitoring, we brought in New Relic. OpenVPN would provide any private network access we needed for dev, QA, DBA, and ops.

Our VPC design utilized public subnets with Internet Gateways for ELBs, and private subnets behind EC2 NAT instances for our application and database servers. To follow AWS best practices for high availability, we deployed our EC2 instances across multiple AZs* in Auto Scaling groups built from prebaked AMIs, meaning no more pet nodes a, b, and the beloved node c. Being in an Auto Scaling group on EC2, these application nodes had to be stateless, with no local session stores, because now they could disappear at any time, unlike on-prem.

VPC Design (simplified):

During any production code change, once it had been QA tested and signed off on, we would do a canary deploy to a single node. Once that looked good, we baked a new AMI from that node and updated a CloudFormation parameter with the new AMI ID, which triggered a rolling deploy of our Auto Scaling groups. To kick off this fully automated process you would run hubot deploy production to do the canary, and hubot ship production to bake the AMI and update CloudFormation.
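Behind those Hubot commands, the heavy lifting boils down to an AMI bake and a CloudFormation parameter update. A minimal sketch of the idea with the AWS CLI, assuming a hypothetical stack name, parameter key, and instance ID (the real automation lived in Jenkins jobs):

# Bake a new AMI from the canary node
AMI_ID=$(aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "sos-app-$(date +%Y%m%d%H%M)" \
  --query ImageId --output text)

# Wait for the image to become available
aws ec2 wait image-available --image-ids "$AMI_ID"

# Point the stack at the new AMI; the Auto Scaling group's UpdatePolicy
# handles the rolling replacement of instances
aws cloudformation update-stack \
  --stack-name sos-app \
  --use-previous-template \
  --parameters ParameterKey=AppAmiId,ParameterValue="$AMI_ID"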

*A cross-region deployment is still something we are looking to accomplish, and the new cross-region VPC peering makes that a lot more exciting!

To check the compliance box, we were limited to a subset of AWS services, some with certain restrictions, and some of these requirements still stand today. Any EC2 resources had to live within a dedicated VPC; all PHI had to be encrypted at rest and in transit; ELBs could only be run in TCP/SSL mode, making capturing the requester's true remote IP much more difficult than it should have been; there was no RDS support; the list goes on.
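For reference, a common workaround for the true-remote-IP problem on TCP/SSL ELBs (not necessarily what we settled on at the time) is to enable Proxy Protocol on the classic ELB and parse the prepended header at the proxy or app tier; a sketch with the AWS CLI and an illustrative ELB name:

# Create a Proxy Protocol policy and attach it to the backend instance port
aws elb create-load-balancer-policy \
  --load-balancer-name sos-elb \
  --policy-name EnableProxyProtocol \
  --policy-type-name ProxyProtocolPolicyType \
  --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true

aws elb set-load-balancer-policies-for-backend-server \
  --load-balancer-name sos-elb \
  --instance-port 443 \
  --policy-names EnableProxyProtocol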

The data migration plan was ready for moving from the old schema to the new. The DNS was ready to be updated. The infrastructure, monitoring, and deployment process were all in place. Early in 2014, SOS was relaunched on AWS, and it continues running the same code base to this day; however, the infrastructure and process have changed drastically.

The infrastructure architecture pattern used for SOS was applied to all greenfield projects taken on by the team. We had pre-processable CloudFormation templates driven by YAML, letting you declare applications more succinctly and easily follow the pattern. Where we could, we also moved some websites to S3 static sites. By the end of 2014 we had a solid foothold on AWS, but we still had a proportionally large footprint on-prem, including arthrex.com with all of its ever-growing aches and pains.

We began to deploy and migrate more supporting enterprise tools to the cloud as well, such as Jira, Confluence, and GitHub, and given we had many users at our corporate office, it made sense to set up an AWS VPN connection to the office. We also segmented our overall deployment into multiple VPCs for logical separation of duties and connected each new VPC to the office via a VPN connection. This evolved into a multi-account strategy organized by duty, where we use IAM roles to control privileges and access across accounts.
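The cross-account piece is plain IAM role assumption; a minimal sketch with the AWS CLI, using a hypothetical account ID and role name:

# From a trusted account, assume a role in the target account
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/ops-admin \
  --role-session-name ops-session
# The response contains temporary credentials (AccessKeyId, SecretAccessKey,
# SessionToken) scoped to the permissions of that role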

2015: enter our container adoption era. As a DevOps team of one, it took a while to find a way to fit containers into our workflow and to figure out how to run a production Docker cluster. The experimentation spanned K8s, Deis, Flynn, Panamax, ECS, etc. At the time, some of these were immature, or running them meant DIY etcd or Ceph management and HA architecting… something that, after a bit of experimentation, I discovered was not worth the overhead or stress, at least when there was a nice open source managed AWS ECS offering called Convox that worked right out of the box for both local development and production quality deployments.

Convox AWS Architecture:

Convox provides a core preprocessed CloudFormation template that provisions the above architecture (or close to it; the diagram is dated) for you, and it runs its own API within AWS ECS.

Convox’s awesome convox start command looks in your cwd for both a Dockerfile and a docker-compose.yml file. It supports a subset of Docker Compose v2 to describe both local development and deployment. convox start will run a docker-compose up behind the scenes, hook up code syncing, and attach your application's stdout logs and route them to your terminal.
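In practice, local development becomes a single command from the project root; a minimal usage sketch (directory name illustrative):

cd my-app       # contains a Dockerfile and docker-compose.yml
convox start    # builds images, runs docker-compose up, syncs code, and streams logs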

To get your app up on AWS, you first create it with convox apps create my-app. You can then deploy it via convox deploy -a my-app -f docker-compose.prod.yml, which prompts Convox to create a new CloudFormation stack. We use multiple docker-compose.*.yml files to control which container or port configuration we want for different purposes. This new app stack creates a load balancer exposing the ports defined in your docker-compose.yml file on a per-process basis. It also creates an ECS task definition and an ECS service built from your Dockerfile, attached to the provisioned load balancer. A few other AWS services are leveraged, but these are the heaviest lifters.
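Putting those pieces together, a first deploy of a new app looks roughly like this (app name illustrative; flags as described above):

convox apps create my-app                           # registers the new app with the Rack
convox deploy -a my-app -f docker-compose.prod.yml  # builds images, creates/updates the app stack (ELB, ECS task definition, service)
convox apps info -a my-app                          # shows status and the generated endpoint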

Initially I had to add support for private networking and a custom VPC CIDR design so we could deploy production loads and connect our VPNs. Aside from that, out of the box it has offered everything we have needed to get all of our applications containerized and ready to use standard CI pipelines across the board. With that said, we had found our container platform and began the arduous process of containerizing both the remaining legacy on-prem applications and the existing EC2-based applications. From early 2016 on, all new development was Docker based. To this day, Convox has been a great platform to both operate and deploy to, and it has gone a long way toward speeding up our initial container adoption given limited resources.

For service discovery with Convox/ECS, we use Kong and kongfig to declare our APIs. Kong is an open source API gateway; we run one Kong node for dev and one cluster for prod. We use it for DNS for web apps and as a true API gateway for microservices, including header injection, JWT authentication, and more. We keep service discovery at this layer, agnostic and declarative, to ensure it is reproducible and consistent across applications.

We run Kong in three environments: local, development, and production. Each has its own set of API consumers, JWT secrets, and API endpoints reachable via hostname-based routing. This lets us map addresses and ports to a DNS name, which is exactly what we needed to address our individual containers without having to worry about ports: everything can use :443 and :80 regardless of what is exposed on our ELBs. Kong itself is, of course, deployed as a container on our Convox local or ECS cluster.

Each Kong environment gets two DNS entries in Route 53 mapped to its ELB: one directly to the Kong API, i.e. mydevdomain.com, and a wildcard entry for any APIs the node will be hosting, i.e. *.mydevdomain.com, which covers Kong APIs exposed at myservice.mydevdomain.com or mybetterservice.mydevdomain.com without registering them directly in Route 53.
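Under the hood, an API registration in Kong 0.9 is essentially an entry keyed on the request host. A hand-rolled sketch of what kongfig declares for us, with illustrative names and run against a locally exposed admin port (8001):

# Register an API with the Kong admin API, routed by Host header
curl -i -X POST http://localhost:8001/apis/ \
  --data "name=myservice" \
  --data "request_host=myservice.mydevdomain.com" \
  --data "upstream_url=http://my-app-elb.us-east-1.elb.amazonaws.com"

# Traffic hitting the wildcard DNS entry is now routed to the right upstream
curl -i https://myservice.mydevdomain.com/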

Taking a further look at local development, we run a suite of containers to provide greater dev-prod parity, namely Kong and its supporting containers, Postgres, and Kong Dashboard. Kong Dashboard is used to view and troubleshoot local and deployed Kongs. This suite is described in a docker-compose application that also provisions a user-defined network, with a docker.sock mounted. We hide this behind a local CLI, but the internals at work here are as follows:

Create or launch the application, named "arthrex", and detach the process:

docker-compose --project-name arthrex up -d --force-recreate

And here is the corresponding docker-compose.yml:

version: '2'
services:
  kong-database:
    image: postgres:9.4
    ports:
      - 5432:5432
    networks:
      dev:
        aliases:
          - kong-database.arthrex.xyz
  kong:
    image: kong:0.9.3
    ports:
      - 80:8000
      - 443:8443
      - 8001:8001
      - 7946:7946
      - 7946:7946/udp
    environment:
      - KONG_PG_HOST=kong-database.arthrex.xyz
      - KONG_PG_DATABASE=postgres
      - KONG_PG_USER=postgres
      - KONG_PG_PASSWORD=postgres
    restart: always
    security_opt:
      - seccomp:unconfined
    networks:
      dev:
        aliases:
          - kong.arthrex.xyz
  kongdashboard:
    image: pgbi/kong-dashboard:v2.0.0
    ports:
      - 3070:8080
    networks:
      dev:
        aliases:
          - kong-dashboard.arthrex.xyz
  mailcatcher:
    image: yappabe/mailcatcher:latest
    ports:
      - 1025:1025
      - 1080:1080
    networks:
      dev:
        aliases:
          - mailcatcher.arthrex.xyz
networks:
  dev:
    driver: bridge

Now, let’s take a look at a sample application bootstrapped with kongfigure for self-declaration on convox start and convox deploy.

version: "2"
services:
  kongfigure:
    build: ./kongfigure
    environment:
      - WWW_NAME
      - KONG_HOST
      - WWW_VIRTUAL_HOST
      - API_URL
    links:
      - www
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
  www:
    build: ./www
    environment:
      - API_URL
    ports:
      - 4200
networks:
  dev:
    external:
      name: arthrex_dev

./kongfigure/Dockerfile

FROM arthrex/kongfigure:0.3

./kongfigure/kongfig.yml

apis:

When the application starts, kongfigure executes its baked-in ONBUILD command, kongfig apply. This registers the application's load balancer with Kong using the local kongfig.yml file. The $KONG_HOST variable dictates which Kong environment we are targeting: local, development, or production. Convox populates $WWW_NAME and the other variables via an integration with the docker-compose.yml links section.

Kongfigure also manages the application's private network configuration, specified in our docker-compose.yml file as "arthrex_dev"; this network exists only locally. This is just another approach driven by infrastructure as code: here we have service routing declared in code.
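The kongfigure image itself is thin; conceptually, its baked-in step boils down to running kongfig against the target Kong admin API with the app's declaration, something like the following (flags and port illustrative, not our exact image internals):

# Conceptual equivalent of the ONBUILD step in arthrex/kongfigure
kongfig apply \
  --path kongfig.yml \
  --host "$KONG_HOST:8001"   # local, development, or production Kong admin API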

By the end of 2016 we had migrated our last app off on-prem and into Docker, and by early 2017 all of our applications were running on Convox, greatly simplifying and standardizing the operation of our Rails, Go, ColdFusion, Node, and other apps. Our current footprint is ~35 Docker nodes, ~130 application instances, and over 300 containers.

2017 has been another year of reworking our infrastructure to be easier to use and operate. This has been greatly accelerated by the addition of two DevOps team members, so now we actually have a real team, not an army of one! Some big enhancements this year include standardization of environment variables, migration away from MSSQL to PostgreSQL, PRs that now create and deploy themselves into on-demand containers for isolated testing, the introduction of Vault as a single source of truth for our configuration, and more!

Vault, by HashiCorp, is now our central configuration and secret store. We store numerous types of data within it, including third party tool credentials, database creds, server creds, application configuration and secrets; the list goes on. One of the main reasons we set up such a store was programmatic access to application configuration to power our automation plans.
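Day to day this is a straightforward key/value store; a sketch of the kind of reads and writes involved, with hypothetical paths (our actual layout differs):

# Store an application's configuration as key/value pairs
vault write secret/apps/myapp/production \
  DATABASE_URL=postgres://user:pass@db.internal:5432/myapp \
  NEW_RELIC_LICENSE_KEY=xxxx

# Programmatic reads for automation: a single field, or the full entry as JSON
vault read -field=DATABASE_URL secret/apps/myapp/production
vault read -format=json secret/apps/myapp/production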

My new(ish) colleague, senior DevOps engineer Braxton Beyer, has been busy replacing our HipChat/Hubot/Jenkins based deployment scheme with GitHub hook driven CircleCI. This new platform has already been through several iterations, from CircleCI 1.0, to 2.0, to a custom Docker build image hosting a generic build pipeline written in Ruby. The Ruby pipeline allows for standardization of deployment across all of our applications, as the pipeline's contact point with our apps is at the Convox level, which is agnostic about what it is deploying. From an ops perspective, there is now only one deployment process to learn, and none of it hands-on. From a dev perspective, GitHub PRs now power deployments and serve as the interface to them. Looking ahead, this standard pipeline is the perfect place to begin enforcing code quality checks, linting, static analysis, and other custom checks related to the deployability of a service or application, all in one place and for all applicable apps at once.

A key step in this pipeline is the realization of a long-held idea: creating on-demand environments from GitHub PRs. Anytime a new PR is opened on one of our repos, CircleCI kicks off the process of provisioning a new Convox/ECS application.

Following our conventions, we then inject the variables $APP_NAME and $APP_ENVIRONMENT into the build pipeline, based on the GitHub branch or PR name. We then pull a templated env file from Vault and evaluate it, e.g.:

MY_SECRET=thisissecret
VIRTUAL_HOST=$APP_NAME.$APP_ENVIRONMENT.mydevdomain.com
NEW_RELIC_NAME=$APP_NAME-$APP_ENVIRONMENT

Given APP_NAME=myapp and APP_ENVIRONMENT=shiny-feature, this env file becomes:

MY_SECRET=thisissecret
VIRTUAL_HOST=myapp.shiny-feature.mydevdomain.com
NEW_RELIC_NAME=myapp-shiny-feature

Next, we inject the environment into Convox via convox env set. We then deploy the application via convox deploy, and from there your app starts, kongfigure runs and declares your application's routing configuration, and HipChat/GitHub are notified of the build success along with the newly generated endpoint. Quick smoke testing, full QA testing, or developer review can then happen against that live version of the code without having to clone and run it locally. These environments are temporary and are destroyed on PR merge.
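Stitched together, the per-PR provisioning is conceptually a handful of CLI calls. A simplified sketch with hypothetical Vault paths and app names (the real implementation is the Ruby pipeline mentioned above):

export APP_NAME=myapp
export APP_ENVIRONMENT=shiny-feature          # derived from the branch/PR name
APP="$APP_NAME-$APP_ENVIRONMENT"

# Create the on-demand Convox/ECS application
convox apps create "$APP"

# Pull the templated env file from Vault and evaluate the variables
vault read -field=template secret/apps/$APP_NAME/env-template | envsubst > .env

# Push the evaluated environment and deploy
convox env set -a "$APP" $(cat .env)
convox deploy -a "$APP" -f docker-compose.prod.yml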

This GitHub PR pipeline is the most robust of all, and since it is a superset of our standard development CI and production deployment processes, our entire infrastructure as code picture is getting very DRY.

To expose application configuration control to developers, they each have access to Vault via the vault-web UI and can update the entries for their application instance as JSON objects. We run a daemon on our cluster that ensures the desired application configuration state is met on our Convox cluster. The evaluated env file generated by our PR pipeline is also written back into Vault to provide access to the new app's configuration.

So here we are, wrapping up 2017, still on-boarding our older applications to the PR pipeline process and beginning to look beyond our Convox deployment to the future. AWS managed EKS looks very exciting; the control plane features Kubernetes offers are second to none, and a managed offering on AWS is a big win. Our experience with Convox and ECS will give us a big head start in vetting this new platform and, if it is ready, spinning it up for use.

We are also looking to take further steps in defining key application metadata used to provision and drive the entire automation pipeline, from concept to post-deployment monitoring, for new and existing applications, lowering the time to launch for new services while keeping our legacy applications on board with current internal best practices. This top level metadata would further drive configuration for downstream integrations, infrastructure, and applications.

If any of this sounds appealing to you, Arthrex is hiring for many positions, from dev to DevOps and beyond; email me at cleblanc@arthrex.com.

