Bill Rastello · 2022-02-11 · ecs, aws, platform

Migrating from EC2 to ECS Services and Tasks

Our old system architecture here at Aha! has served us well. On top of RDS, ElastiCache, and other AWS services, we had hundreds of EC2 instances running Unicorn to serve web traffic. We used nginx and AWS ELBs for load balancing, Resque / resque-scheduler for workers, and a few other EC2 instances that ran miscellaneous services. These EC2 instances were all managed by AWS OpsWorks and Chef. This worked great for us for years, but started to introduce some challenges as we grew. Patching servers was an elaborate process, autoscaling was slow, and if deploys via Capistrano went wrong, we could be left in an inconsistent state. Worse, a bad Capistrano deploy had the potential to cause an outage. After evaluating our options, we decided to migrate our applications to ECS.

Why ECS

ECS is Amazon's container orchestration service. You can run your Docker containers on your own EC2 instances, or on AWS Fargate, which is a serverless option. When we evaluated running Docker containers on ECS, the following really stood out to us:

Scalability. When scaling up, we used to have to wait for a large number of EC2 instances to start, be provisioned by OpsWorks, and then deploy via Capistrano manually. Now we can have ECS start and stop containers as needed, which takes significantly less time and uses already-built Docker images to ensure that we're deploying the code we're expecting to deploy. New EC2 instances join the cluster and automatically start hosting containers.
Security. Docker isolates applications from one another using kernel features such as namespaces and cgroups. Using ECS on top of it allows us to set up IAM roles and security group rules, which can be unique for each individual ECS task. This gives us great power in isolating our resources.
Reproducible environments. When we deploy some code to staging, the Docker container is set up the exact same way that it's run in production. OS-level changes don't affect our application environments.
Potential cost savings. Instead of needing to run so many EC2 instances, we can run multiple ECS tasks on an EC2 instance. We then can allow these tasks to temporarily consume more memory (see memory vs. memory reservation) when needed. Before, we needed the EC2 instances to be sized to support the potential maximum amount of memory the application could consume.
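To make the memory point concrete, here is a small sketch of the arithmetic. In an ECS container definition, `memoryReservation` is the soft limit that is subtracted from the instance's available memory at placement time, while `memory` is the hard limit a container can burst up to. The instance size, container name, and both limits below are hypothetical numbers chosen for illustration, not our real configuration.

```python
# Hypothetical container definition fragment (values in MiB).
container_definition = {
    "name": "web",              # hypothetical container name
    "memoryReservation": 1024,  # soft limit: what the scheduler reserves
    "memory": 3072,             # hard limit: the container is killed above this
}

INSTANCE_MIB = 15 * 1024  # hypothetical memory available on one container instance


def tasks_per_instance(instance_mib: int, per_task_mib: int) -> int:
    """How many tasks fit if each one claims per_task_mib of memory."""
    return instance_mib // per_task_mib


# Placement by soft limit: tasks are packed by what they usually need.
by_reservation = tasks_per_instance(INSTANCE_MIB, container_definition["memoryReservation"])

# Sizing for the worst case, as we had to do with one app per EC2 instance.
by_hard_limit = tasks_per_instance(INSTANCE_MIB, container_definition["memory"])

print(f"packed by reservation: {by_reservation}, sized for peak: {by_hard_limit}")
```

The gap between the two numbers is the potential saving, on the assumption that not every task bursts to its hard limit at the same time.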

Why not EKS / Kubernetes?

While Kubernetes and EKS are great solutions for orchestrating containers, we felt that the solution we came up with didn't need the additional Kubernetes layer on top of everything else. The big difference for us was that ECS allows for better IAM and networking integration, which allows us to set restrictive permissions and network access for each individual service. If we weren't already so heavily invested in AWS this wouldn't be as much of a benefit; but we have gotten a lot of mileage out of AWS' offerings and services.

New architecture

After some work, we were able to successfully migrate our infrastructure over from EC2 to ECS. On average, we have about 100 EC2 container instances that run the ECS agent. Across these container instances, we have a handful of ECS services - one for unicorns, a few for our various resque worker types, and ones for the additional web-facing applications. These services together run well over 500 ECS tasks. We also run some standalone tasks, which are not part of any service, with higher memory limits for jobs that we know require the additional resources. We run similar ad-hoc tasks as part of a deploy (more on that later).

We chose CodeDeploy to roll out new versions of the ECS services, using a blue/green deployment type. Blue/green CodeDeploys do not move traffic to the new code unless it has been healthy for a set amount of time. This means that a problem with the code or environment that is about to be deployed does not cause an outage; the health checks protect us from sending traffic to a badly configured revision. Since moving to ECS, our uptime has been well above 5 9's.

We chose CodeBuild to build the Docker images that CodeDeploy uses to roll out new code. Our main application's Docker image is very large (over 1GB). While CodeBuild does have a variety of caching options, none of them are optimized for large images, so a lot of time is spent either retrieving the existing Docker image from a cache, or just rebuilding the entire image from scratch. AWS is working on implementing support for Docker BuildKit's cache manifest, which will speed up CodeBuild jobs significantly since we will only pull the layers needed during build time. We are also looking into other ways to cache information, like using EFS to store compressed versions of directories that are often static.

All of this is managed via Terraform. We were previously using CloudFormation to manage our ECS clusters, but we had to implement some workarounds in order to have true blue/green CodeDeploys. CloudFormation has some built-in support for blue/green deploys of ECS when you do a stack update, but we wanted the deploy process to be totally separate from any stack updates. This is not an issue with Terraform.

Deploying

One of the biggest challenges during the migration to ECS was the deploy process. Coming from using Capistrano for deploys, we are used to seeing Rails migrations output during the deploy process. That is especially critical so we can monitor any in-flight migrations and abort them if, for some reason, they are taking longer than we anticipate. This also lets us immediately see the reasons for failed migrations.

The default deploy process for ECS when using CloudFormation involves doing a CloudFormation stack update, which starts up new containers, waits for them to be healthy, then shuts the old containers down immediately. There is little in the way of monitoring of this process (you need to view the events for the specific service that's being updated), and if something goes wrong, there's no quick way to roll back to old code. While this would work for some of our smaller services that don't have any customer-facing components, we wanted something more robust: the ability to monitor the deploy, keep a record of deploys, and roll back code quickly and efficiently.

CodeDeploy ends up doing much of what we wanted out of the box - it keeps a record of every deploy, lets us monitor the stage of the deploy, and allows us to quickly roll back code or abort in-progress deploys if anything goes wrong. However, it doesn't let us monitor (or even easily see the result of) migrations. We also have other tasks we want to execute - such as uploading assets to our CDN and tagging the release in Datadog and Sentry.

The solution - a deploy script

To get the exact results we needed, we built a deploy script which does the following:

Runs any steps that need to be executed before migrations begin by starting ECS tasks that run these steps. The deploy script waits for these to complete before continuing.
Runs migrations across all of our separate production environments. This is done by starting an ECS task that has all of our Rails code on it, but has the CMD in the Dockerfile set to just /bin/bash. Once the task has started, the deploy script will SSH to the underlying EC2 instance that is running this ECS task, then run the equivalent of docker exec -t container /bin/bash -c 'run_migrations.sh'. This means that the migration output is streamed to the deployer's terminal so the deploy can be monitored. A separate thread is spawned for each production environment so all migration logs are printed to the screen as they are executed (with a prefix to make it clear what environment the log is associated with). If something goes wrong and a migration exits with a non-zero exit code, all migrations are aborted. Additionally, if the deployer presses ctrl+c, all migrations are halted as well.
- Note: the deploy script was written before Amazon ECS Exec was introduced early in 2021. If we were writing this deploy script now, we would likely use Amazon ECS Exec to connect to the containers, as opposed to SSHing to EC2 + running docker exec.
Starts the CodeDeploy deploys. We have a separate deploy for each production environment / Unicorns / worker types. The deploy script then monitors the status of all of these CodeDeploy deploys. If something goes wrong during any of these CodeDeploys (health check fails, new ECS tasks don't start, etc.), or the deployer does a ctrl+c, the deploy script will automatically trigger a rollback for each deploy.
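The migration step above can be sketched roughly as follows. This is a simplified illustration, not our actual deploy script: `run_all` and its argument shape are hypothetical, and each command stands in for whatever the script really runs (in our case, SSH to the instance followed by docker exec). It shows the three behaviours described: one thread per environment, prefixed output, and aborting everything on the first non-zero exit or on ctrl+c.

```python
import subprocess
import threading


def run_all(commands: dict[str, list[str]]) -> list[str]:
    """Run one command per environment in parallel.

    Returns the name of the first environment that failed (empty list on success).
    """
    abort = threading.Event()
    lock = threading.Lock()
    procs: list[subprocess.Popen] = []
    failures: list[str] = []

    def terminate_all() -> None:
        # Caller must hold the lock.
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()

    def worker(env_name: str, command: list[str]) -> None:
        with lock:
            if abort.is_set():
                return  # another environment already failed; do not start
            proc = subprocess.Popen(
                command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
            )
            procs.append(proc)
        # Prefix every output line so interleaved logs stay readable.
        for line in proc.stdout:
            print(f"[{env_name}] {line.rstrip()}", flush=True)
        if proc.wait() != 0:
            with lock:
                if not abort.is_set():
                    failures.append(env_name)
                    abort.set()
                terminate_all()

    threads = [
        threading.Thread(target=worker, args=(name, cmd)) for name, cmd in commands.items()
    ]
    try:
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    except KeyboardInterrupt:
        # ctrl+c from the deployer: stop every in-flight migration.
        with lock:
            abort.set()
            terminate_all()
        raise
    return failures
```

A caller would treat a non-empty return value as a reason to stop the deploy before any CodeDeploy deployment is started.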

Once all CodeDeploy deploys have completed, the script immediately terminates the old resque workers, instead of having both the old and new workers run for an extended period of time. Workers are considered successfully running if they start and stay running - we implemented a very simple TCP health check service which runs in the worker's container so that CodeDeploy (and the various listeners and load balancers) know that the workers are running.
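A TCP health check of this kind only needs to accept connections; a load balancer treats a successful connect as healthy. The following is a minimal sketch of the idea rather than the service we run. The function name, default host, and port handling are assumptions made for the example.

```python
import socket
import threading


def start_health_listener(host: str = "0.0.0.0", port: int = 0) -> socket.socket:
    """Accept and immediately close TCP connections in a background thread.

    Passing port=0 lets the OS pick a free port (handy for local testing);
    a real container would listen on the port its target group checks.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()

    def accept_loop() -> None:
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            conn.close()  # a completed connect is all the check needs

    # Daemon thread: the listener dies with the worker process, so a crashed
    # worker stops answering health checks.
    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

Because the listener lives inside the worker's process, it only says "the process is up", which matches the "start and stay running" definition of healthy used here.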

At this point, the script runs anything that needs to be executed after the CodeDeploy deploys have finished (Slack notifications, etc.)
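The monitoring and rollback behaviour of the CodeDeploy step can be sketched as a polling loop. This is an illustrative outline under assumptions, not our script: `watch_deployments` is a hypothetical helper, and the two callables are injected so the logic can be read without AWS access. With boto3 they could wrap the CodeDeploy client's `get_deployment` and `stop_deployment` calls.

```python
import time
from typing import Callable, Iterable

FAILED_STATUSES = ("Failed", "Stopped")  # CodeDeploy terminal failure states


def watch_deployments(
    deployment_ids: Iterable[str],
    get_status: Callable[[str], str],
    stop_and_rollback: Callable[[str], None],
    poll_seconds: float = 5.0,
) -> bool:
    """Poll every deployment until all succeed.

    On any failure, or on ctrl+c, request a rollback for each deployment
    and return False. Returns True when every deployment has succeeded.
    """
    ids = list(deployment_ids)
    pending = set(ids)
    try:
        while pending:
            for dep_id in sorted(pending):
                status = get_status(dep_id)
                if status == "Succeeded":
                    pending.discard(dep_id)
                elif status in FAILED_STATUSES:
                    raise RuntimeError(f"deployment {dep_id} ended as {status}")
            if pending:
                time.sleep(poll_seconds)
    except (RuntimeError, KeyboardInterrupt):
        # Roll back everything, not just the deployment that failed. The
        # callable is assumed to cope with deployments already finished.
        for dep_id in ids:
            stop_and_rollback(dep_id)
        return False
    return True
```

Rolling back every deployment together keeps all environments and worker types on the same code version, which is the property the deploy script is trying to protect.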

We've found this deploy system satisfies our needs well at this point, and also has helped catch bad code before it reaches production. However, we do eventually want to take things further, and use something like CodePipeline to handle this process. We would still need to work out how to have visibility into running migrations, but this will help simplify the deploy process, and rely less on a script that runs on the deployer's development machine.

Resque workers, long-running jobs, ECS, and deploying

One of the biggest challenges we encountered was supporting long-running resque jobs. When a CodeDeploy deploy finishes, the old instances are terminated after a set amount of time (or immediately, if desired). By default, when an ECS task is told to shut down, it's sent a SIGTERM, then a SIGKILL after a 30 second grace period. This doesn't work well for resque jobs, since a SIGTERM or a SIGKILL would stop any currently-processing jobs. Changing the SIGTERM to a SIGQUIT instead is easy enough; our wrapper script traps the SIGTERM and sends a SIGQUIT instead.
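The signal translation can be illustrated with a small wrapper. This sketch is written in Python for illustration and is not our wrapper script; the function names are hypothetical. The wrapper starts the worker as a child process, and when ECS delivers SIGTERM it forwards SIGQUIT instead, which resque treats as "finish the current job, then exit".

```python
import signal
import subprocess


def forward_sigterm_as_sigquit(child: subprocess.Popen) -> None:
    """Install a SIGTERM handler that sends SIGQUIT to the child instead.

    Must be called from the main thread, as Python only allows signal
    handlers to be installed there.
    """

    def handler(signum, frame):
        # send_signal does nothing if the child has already exited.
        child.send_signal(signal.SIGQUIT)

    signal.signal(signal.SIGTERM, handler)


def run_worker(command: list[str]) -> int:
    """Start the worker, translate SIGTERM to SIGQUIT, and wait for it."""
    child = subprocess.Popen(command)
    forward_sigterm_as_sigquit(child)
    return child.wait()
```

The wrapper has to be the process that receives the signal from ECS (the container's main process) for this to work; if a shell sits in between, the SIGTERM may never reach the handler.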

Handling the automatic SIGKILL after a specified timeout is a bit more difficult. You can configure the timeout to be as long as you want - however, this leaves the ECS task with a desired status of STOPPED while its last known status is still RUNNING. At the time of this writing, anytime there is a task that's in this state, no additional ECS tasks can start on that container instance. ECS will assign a pending task to this 'stuck' instance, and wait until the stopping task finally does stop. We couldn't let a stopping job block a container instance like this.

The solution

After some experimenting, we settled on the following: during a deploy, we tell the old instances to terminate immediately after the new instances are up and healthy. This sends a SIGQUIT to all of the old running resque workers. We configured the SIGKILL timeout to be 3 minutes - this is a reasonable trade-off, since our old unicorn containers would still be live at this point to serve as a backup.

For jobs which could take longer than 3 minutes to finish, we created special resque queues. Our job wrapper middleware intercepts the enqueue action, and additionally creates a new ECS task to handle the request. These new ECS workers continue processing until their current job is finished, or until they have been alive for 5 minutes, whichever is longer. We also keep some of these workers idle to respond to requests immediately, reducing processing latency.
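The lifetime rule for these on-demand workers ("until the current job is finished or five minutes have passed, whichever is longer") reduces to a small predicate. The sketch below is only an illustration of that rule; the function and constant names are hypothetical.

```python
MIN_LIFETIME_SECONDS = 5 * 60  # workers stay alive at least this long


def should_exit(started_at: float, now: float, job_in_progress: bool) -> bool:
    """Return True once the worker may shut down.

    A worker never exits mid-job, and never before its minimum lifetime,
    so it exits at the later of the two moments.
    """
    alive_for = now - started_at
    return (not job_in_progress) and alive_for >= MIN_LIFETIME_SECONDS


# A worker loop would call this between jobs, for example with time.monotonic().
print(should_exit(started_at=0.0, now=400.0, job_in_progress=False))
```

Keeping a worker around for the minimum lifetime lets it pick up further jobs from the queue instead of paying the task start-up cost for each one.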

This architecture has worked well for us, and also let us introduce additional kinds of workers, like ones that need to consume more memory than a standard job, or run cron jobs. If needed, we can also run some cluster instances with more memory than CPU power to run these 'jumbo' jobs.

Conclusion

Once we figured out the solution to workers during our deploy process, we went live. The best part is, nobody noticed - to our customers and the rest of the Aha! team, everything worked exactly as before. Now that we're on ECS, we've been able to easily scale up to handle unexpected load, quickly roll back deploys of code that didn't work as expected, and even enhance the developer experience at Aha! by introducing dynamic staging environments based on ECS.

Interested in working on fun projects like this? Check out our careers page to see our current openings in Engineering.