Steve Lamotte · 2022-04-14 · aws, backend, devops, dynamic staging, ec2, ecs, rails, rds, redis, resque, ruby, slack, staging, terraform, extending the monolith

Building a dynamic staging platform


Our team at Aha! recently gained the ability to quickly create and destroy dynamic staging environments.

Our platform team maintains several general-purpose staging environments that our engineering and product teams use to test new features. These staging environments closely mimic our production infrastructure — but at a greatly reduced scale. As our team has grown, demand for these environments has increased, making our stagings somewhat of a hot commodity at times.

We've given a lot of thought to how best to address this. One obvious option would have been to simply create some new staging environments. This ultimately just kicks the can down the road because we would find ourselves in the same position in a few months. And since there are times when our stagings are not fully utilized, having a bunch of idle resources is wasteful. We really needed something more dynamic.

And so the notion of dynamic staging environments was conceived. This allowed anyone to quickly and easily spin up a brand new test environment on-demand, deploy their feature branch, test it, and then destroy it when no longer needed.

Planning environment design

We wanted these dynamic staging environments to be relatively lightweight. They are not a resource-for-resource mirroring of our production infrastructure the way our normal staging environments are. This allows them to be quickly created, updated, and destroyed — and also helps keep costs down.

Engineers at Aha! use our home-grown ops utility to initiate deployments, SSH to remote machines, and perform various other DevOps tasks. This utility has been updated over the years as our infrastructure has changed, so it seemed natural that we should add new commands to manage our dynamic stagings. We wanted this to be as simple as an engineer running `ops ds create` in their feature branch.

Like many companies, Aha! uses Slack for internal communications. We have a Slack bot named "stagingbot" that we use to manage our staging environments, with commands such as claiming and releasing a staging and deploying a branch to a staging. To allow non-engineers to use dynamic stagings as well, we would need to update stagingbot to manage them.

Since we use AWS Elastic Container Service (ECS) for our container orchestration, our dynamic stagings would also run under ECS. Rather than create a whole new set of infrastructure for each new dynamic staging, we selected one of our existing staging environments to serve as the dynamic staging host environment. This allows dynamic stagings to share certain resources that are owned by the host environment — in particular, the database instance, Redis/ElastiCache instance, and certain other Aha! compute resources. This approach further helps achieve the goal of being able to quickly create and destroy dynamic stagings.

This design came with some drawbacks. For example, the shared database instance means database migrations cannot be applied without potentially breaking other dynamic staging environments. Even so, it would provide a very effective solution for the vast majority of our testing needs while freeing up the legacy staging environments for specialized testing such as infrastructure changes. We could also add the ability to create a new RDS instance later if we did want to test migrations.

Staging environment preparation

Like most web-based SaaS applications, Aha! relies on DNS to direct customers to the right services. Our staging environments' services are coordinated by DNS subdomains that tell Aha! which load balancer to hit. Once the application receives a request, the host header tells us which account should be used.

For dynamic stagings, we updated our DNS to point all "dynamic" subdomains at a new EC2 application load balancer. Then when new dynamic stagings are created, a listener rule is added to this load balancer to direct traffic to the new ECS service.

This new load balancer is a convenient demarcation point between dynamic stagings and the existing staging environments. It greatly simplifies the Terraform changes required to augment the host environment to begin creating dynamic stagings.

The first implementation

Our first implementation was intended to quickly relieve the pressure caused by staging contention. It has several limitations compared to a normal staging environment, but it provides an ideal test environment for at least 80% of our use cases. It also gets the team used to considering dynamic stagings as their default option for testing.

Resource creation

There are six steps in the workflow for standing up a new dynamic staging:

1. Create a new ECS task definition

In ECS, a task definition specifies how a service's task (roughly analogous to a Docker container) will be run. It includes things like the IAM task role, which specifies what AWS APIs may be invoked by the task, networking details such as the network mode, and runtime parameters such as the container image to run, environment variables, and memory/CPU constraints.

To keep things simple, we fetch the task definition for the web service that is currently running on the host environment and tweak certain parameters to make it run as a dynamic staging service. This includes:

The service name/family
Its memory footprint — We size dynamic stagings smaller than regular staging services.
An environment variable with the name of the dynamic staging — This environment variable lets the application know that it's running as a dynamic staging and instructs it to tweak things like URLs.
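The tweaks above can be sketched as a pure transformation of the task definition hash. This is a minimal illustration rather than the actual ops code: the variable name `DYNAMIC_STAGING_NAME`, the `-web` family suffix, and the function name are all assumptions. With the AWS SDK for Ruby, the input would typically come from `describe_task_definition(...).task_definition.to_h` and the result would be passed to `register_task_definition`.

```ruby
# Fields that ECS returns when describing a task definition but that are not
# accepted when registering a new one.
READ_ONLY_FIELDS = %i[
  task_definition_arn revision status requires_attributes
  compatibilities registered_at registered_by deregistered_at
].freeze

# Hypothetical variable name; the real one is not given in this post.
STAGING_NAME_VAR = "DYNAMIC_STAGING_NAME"

# Returns a new task definition hash derived from the host environment's web
# service definition. The input hash is left unmodified.
def dynamic_staging_task_definition(host_definition, staging_name:, memory_mib:)
  definition = host_definition.reject { |key, _| READ_ONLY_FIELDS.include?(key) }

  containers = definition.fetch(:container_definitions).map do |container|
    # Drop any existing value for the variable, then append ours.
    environment = Array(container[:environment]).reject { |e| e[:name] == STAGING_NAME_VAR }
    environment += [{ name: STAGING_NAME_VAR, value: staging_name }]

    # This sketch shrinks every container to the same smaller footprint.
    container.merge(memory: memory_mib, environment: environment)
  end

  definition.merge(family: "#{staging_name}-web", container_definitions: containers)
end
```

Keeping this step free of AWS calls makes it easy to unit test the rewrite separately from the registration call.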

2. Create a new load balancer target group

When an ECS service is configured, you can specify a target group. As each task for your service starts up, it is registered with the target group, which allows it to start serving traffic.

The target group periodically performs health checks on each task, and ECS replaces any that become unhealthy. A target group also acts as an endpoint for our ECS services when associated with a listener rule.

3. Create a new load balancer listener rule

We then create a new listener rule and attach it to the load balancer. A listener rule can perform a variety of checks on incoming requests and route traffic accordingly.

In our case, we create a host header condition (i.e., the DNS name in the web request that is sent in the HTTP request headers) to associate the dynamic staging's DNS name with the new ECS service via the target group created in the previous step.
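As a rough sketch, the arguments for such a rule could be built as follows. The function name and the example hostname are assumptions. The hash shape follows what `Aws::ElasticLoadBalancingV2::Client#create_rule` accepts in the AWS SDK for Ruby.

```ruby
# Builds the arguments for a listener rule that forwards requests whose Host
# header matches the dynamic staging's DNS name to its target group.
def listener_rule_params(listener_arn:, target_group_arn:, hostname:, priority:)
  {
    listener_arn: listener_arn,
    priority: priority, # must be unique among the listener's rules
    conditions: [
      { field: "host-header", host_header_config: { values: [hostname] } }
    ],
    actions: [
      { type: "forward", target_group_arn: target_group_arn }
    ]
  }
end

params = listener_rule_params(
  listener_arn: "arn:aws:elasticloadbalancing:...:listener/...", # placeholder
  target_group_arn: "arn:aws:elasticloadbalancing:...:targetgroup/...", # placeholder
  hostname: "my-feature.dynamic.example.com", # hypothetical DNS name
  priority: 100
)
```

Because each rule on a listener needs its own priority, a real implementation would also need a way to pick an unused value for every new dynamic staging.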

4. Create a new ECS service

The previous steps provide the prerequisites needed to stand up the actual ECS service.

When creating the service, we reference the task definition that we created earlier as well as the load balancer target group. ECS will create the service and start up the requested number of tasks, which is one for our dynamic stagings. These do not see much load so there's no need to scale out or provide redundancy.

Once the service has started and it is healthy, it can start accepting requests. This can take a few minutes.
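A sketch of this step might look like the following, assuming `ecs` behaves like `Aws::ECS::Client`. The function name, the `-web` suffix, and the container name and port are illustrative assumptions. Launch type and network configuration are omitted for brevity.

```ruby
# Creates the web service for a dynamic staging and blocks until ECS reports
# it as stable. Returns the name of the new service.
def create_web_service(ecs, cluster:, staging_name:, task_definition_arn:,
                       target_group_arn:, container_name: "web", container_port: 3000)
  service_name = "#{staging_name}-web"

  ecs.create_service(
    cluster: cluster,
    service_name: service_name,
    task_definition: task_definition_arn,
    desired_count: 1, # a single task; no scaling or redundancy needed
    load_balancers: [
      {
        target_group_arn: target_group_arn,
        container_name: container_name, # hypothetical container name
        container_port: container_port  # hypothetical port
      }
    ]
  )

  # The SDK's :services_stable waiter polls until the deployment has settled,
  # which is the part that can take a few minutes.
  ecs.wait_until(:services_stable, cluster: cluster, services: [service_name])
  service_name
end
```

Passing the client in as an argument keeps the function easy to exercise with a stub in tests.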

5. Create a new Aha! account

Since dynamic stagings share the host staging environment's database, we create a new Aha! account in that database that points at the new dynamic staging's subdomain. This was easy to implement because we already had a method to create a random account in staging, something frequently used during development. I modified the code slightly to handle dynamic stagings and made it available to call from the ops command via a web API that is only available in the host environment.

While the ECS service is initializing, we invoke this API and pass it the name of the new dynamic staging environment. A new account is created in Aha! along with a new user with full access to this account. The details are returned to the ops script.

6. Share new dynamic staging details

Once the ECS service is healthy, we output details about the new environment including its URL and user credentials. The entire process takes less than five minutes.

Passing around one-off user credentials to everyone who wants to use a dynamic staging can become inconvenient, so we enabled SSO in the new account. This allows any Aha! user to quickly and easily log in to any dynamic staging.

Limitations

One of the goals of this first iteration was to create something for testing the majority of our features. Our shared database model meant migrations could not be tested without endangering other dynamic staging environments. Up until this point, we simply did not run migrations when creating a dynamic staging environment.

Background jobs are another limitation. Since only the Aha! web service runs in a dynamic staging, the host environment's workers would process any Resque jobs that were sent to the shared Redis instance. If your branch hadn't updated any background-able methods, this would be no big deal. But if you were hoping to test changes to these methods, you would be out of luck.

Company-wide dynamic staging

While the first implementation of our dynamic stagings worked well enough and was adopted by many at Aha!, the limitations caused by the shared database and other resources prevented it from being accepted as everyone's default staging solution. My next task was clear: dynamic stagings for the masses!

Dedicated databases

The biggest barrier to entry was clearly the shared database. Not being able to test features that had migrations was a severe restriction, so we decided to spin up an AWS RDS instance based on the latest hourly snapshot of our host environment's database.

This operation adds just a few minutes to the dynamic staging creation process, so it's a pretty attractive solution. We also chose a small db.t3.micro instance class for the database since, for the intended use case of testing by one developer or a small team, it would not be under significant load.

Our Customer Success, marketing, and engineering teams deal with trial and signup flows. They have special test data configured in staging, for example, to prepare screenshots for updates about product enhancements. To make dynamic stagings more attractive to them, we provide the option to select any staging as the hourly snapshot source. When one of these other stagings is used as the source, we have to update the dedicated database's master account credentials to match the host environment's credentials, otherwise our dynamic stagings would not be able to connect to it. Unfortunately this operation (Aws::RDS::Client#modify_db_instance when using the Ruby SDK) requires that the initial database create process finishes completely (i.e., the instance status is "available") before it can be invoked. Then we need another wait for that call to finish so the credentials are confirmed to be updated. This adds up to 10 extra minutes to the processing but the flexibility to choose the data source is well worth the wait.
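The restore, wait, modify, wait sequence could be sketched roughly as below, assuming `rds` behaves like `Aws::RDS::Client`. The function name and parameters are illustrative, and how the snapshot identifier is chosen is left out.

```ruby
# Restores a small dedicated instance from a snapshot and, if a password is
# given, resets the master credentials once the instance is available.
def create_dedicated_database(rds, instance_id:, snapshot_id:, master_password: nil)
  rds.restore_db_instance_from_db_snapshot(
    db_instance_identifier: instance_id,
    db_snapshot_identifier: snapshot_id,
    db_instance_class: "db.t3.micro"
  )

  # The instance cannot be modified until the restore has finished.
  rds.wait_until(:db_instance_available, db_instance_identifier: instance_id)
  return unless master_password

  rds.modify_db_instance(
    db_instance_identifier: instance_id,
    master_user_password: master_password,
    apply_immediately: true
  )

  # Second wait so the new credentials are in place before anything connects.
  # The status may take a moment to leave "available" after the modify call,
  # so a real implementation may need a more careful check here.
  rds.wait_until(:db_instance_available, db_instance_identifier: instance_id)
end
```

The credential reset is optional in this sketch because it is only needed when the snapshot comes from a staging other than the host environment.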

Once we know the hostname of the new database, we set an environment variable AHA_DB_INSTANCE_ID in the task definition. At runtime, we check that environment variable to determine whether we want to connect to a non-default database instance. If so, we tweak the database connection string to point at the dedicated database instance:

  # If an environment variable is present to point us at a specific database
  # instance, update the target environment variable accordingly.
  def maybe_override_database_id(var_name)
    database_id = ENV["AHA_DB_INSTANCE_ID"]
    return unless database_id

    # Replace the first part of the instance's hostname
    ENV[var_name] = ENV[var_name].sub(/@.*?\./, "@#{database_id}.")
  end
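To see what the substitution does, here is the same regular expression applied to a made-up connection URL. The helper name, credentials, and hostname are hypothetical. It works because RDS endpoints within one account and Region share everything after the instance identifier.

```ruby
# Same substitution as above, written as a pure function for illustration.
def override_database_host(url, database_id)
  # Replaces everything from "@" up to the first "." with the new identifier.
  url.sub(/@.*?\./, "@#{database_id}.")
end

url = "postgres://app:secret@staging-host.abc123.us-east-1.rds.amazonaws.com:5432/aha"
puts override_database_host(url, "ds-my-feature")
# prints postgres://app:secret@ds-my-feature.abc123.us-east-1.rds.amazonaws.com:5432/aha
```

Only the first label of the hostname changes, so the credentials, port, and database name in the URL are untouched.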

Worker processes

Like many Ruby on Rails apps, we use background jobs in Aha! to help provide a robust and responsive user interface. We use Resque backed by Redis for queuing up these jobs in the web app for subsequent processing by worker processes.

Similar to the shared database we initially employed, the shared Redis instance made it difficult to test background jobs because all dynamic stagings plus the host environment would be enqueuing and dequeuing jobs on the same Redis instance.

The solution to this was actually quite simple: namespaces. By prefixing Resque queue names with the name of the dynamic staging, each environment would see only the jobs that it cares about.

Similar to what we did for dedicated databases, an environment variable AHA_REDIS_NAMESPACE was added to the task definitions containing the name of the dynamic staging environment. Then in our Resque initializer, I simply added a conditional to the existing code to check for the environment variable and optionally set a namespace:

Resque.redis = begin
  redis = Redis.new(url: configatron.resque_url, thread_safe: true)

  # Use a Redis namespace if so configured e.g. for dynamic stagings
  namespace = ENV["AHA_REDIS_NAMESPACE"]
  if Rails.env.production? || namespace.blank?
    redis
  else
    Redis::Namespace.new("#{namespace}-resque", redis: redis)
  end
end

Since this code is executed by both the web app and the worker service, they will use the same Resque queues. Too easy!
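The isolation comes from key prefixing: Redis::Namespace prepends the namespace and a colon to every key it touches. The toy class below is not the gem's implementation, just a hash-backed illustration of that idea with made-up staging names.

```ruby
# Toy stand-in for a namespaced Redis client: every key is stored under
# "<namespace>:<key>" in a shared backing store.
class NamespacedStore
  def initialize(namespace, store)
    @namespace = namespace
    @store = store
  end

  def push(key, value)
    (@store[full_key(key)] ||= []) << value
  end

  def pop(key)
    (@store[full_key(key)] || []).shift
  end

  private

  def full_key(key)
    "#{@namespace}:#{key}"
  end
end

shared = {} # plays the role of the shared Redis instance
feature_a = NamespacedStore.new("ds-feature-a-resque", shared)
feature_b = NamespacedStore.new("ds-feature-b-resque", shared)

feature_a.push("queue:default", "ExportJob")
p feature_b.pop("queue:default") # => nil
p feature_a.pop("queue:default") # => "ExportJob"
p shared.keys                    # => ["ds-feature-a-resque:queue:default"]
```

Each environment reads and writes only under its own prefix, so a worker for one dynamic staging never sees jobs enqueued by another.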

Creating the ECS service for the worker is similar to the web app's service — only simpler. Because it runs silently in the background listening for Resque jobs, it does not require a target group or listener rules. We simply clone the host environment's worker task definition, tweak it in a similar fashion to how we modified the web service task definition (including setting the AHA_REDIS_NAMESPACE environment variable), and create the service.

What's next?

These recent changes have truly turned dynamic stagings into first-class staging environments. We expect that they will cover at least 95% of our testing use cases, leaving the legacy staging environments primarily for specialized infrastructure work.

There are other esoteric limitations that we know about but we've decided to accept those for now. We can always turn on a new stack using Terraform if we need to test something obscure in a way that legacy staging environments cannot replicate. Since rolling out these last few changes and seeing increased adoption by our teams, I've already received feedback on several issues that were easy to fix. On a personal note, it's very rewarding to see the rate of adoption of dynamic stagings at Aha! increase the way it has, and it validates all the time and effort I've put into this project.

One of the features I'm most looking forward to adding is an extension to Aha! Develop that lets you create, update, and remove dynamic stagings while viewing your features and requirements. This is something that a group of us at Aha! started working on during a recent hackathon, and it would make dynamic stagings even easier to use.

Our team at Aha! strives to make development a joy by creating tools that enhance and simplify day-to-day work. I have worked at many companies where this was not a priority and the difference in developer happiness is clear. Consider taking some time to identify pain points within your own organization's technical processes. You may find that building a script to automate frequently performed manual tasks makes everyone's job easier.

If you're like me and love building innovative technical solutions that help empower others, you should come work with us.
