2024-04-08

Using calculated attributes to help users surface delivery risks

by Michael Shiel

Feature delivery can be impacted by many factors. Some are readily visible, but others are more nuanced — or buried in mountains of data.

What if we could show developers and project managers, in real time, the factors that affect feature delivery? And what if that information were part of the record itself, so anyone viewing the record or running a report could see what to keep in mind to keep the team's delivery on track?

What are delivery risks?

We refer to these individual risk factors as delivery risks in Aha! software. During the design phase of this feature, we built up a list of the factors we felt were most important to ensuring reliable delivery. These were based on customer feedback as well as our own internal experience.

Some of the delivery risks we identified were:

  • Not having a team assigned to a work item
  • Dependencies not being started (at least one linked record not yet having an In progress workflow status)
  • Workflow status not changing for a specific number of days

Additionally, you can customize which delivery risks are enabled per workspace, as well as fine-tune some of the values used to calculate whether a risk is active (e.g., how many days counts as "late" for my team?).

Deciding on an approach

With our risks defined, we needed an approach for surfacing them to the user in a real-time, efficient manner. We knew we wanted delivery risks shown in our record drawers. We also wanted them in reports, which could mean large amounts of data and computation for the more complex risk factors.

Data representation

Let's start with a simple delivery risk: No team assigned. The data required to make the decision about whether this risk factor is active or not lives with the record itself, so it's "free" to load and wouldn't need to be stored. Luckily for us, a majority of the delivery risks we identified fall into this category. They are trivial to implement and virtually free to load or calculate.

Complex risk factors

For a more complex risk factor, let's take a look at the Work stalled delivery risk. Here, we're concerned with a fixed relationship structure within Aha! between epics, features, and requirements: Epics have features, and features have requirements. We define the work as stalled when:

  • The record's own workflow status hasn't changed in N days (unless it has child records in progress, in which case ...)
  • Any of its in-progress child records have workflow statuses that haven't changed in N days

These bubble up to their parent records. For example, an epic might be stalled because it has a feature that has a requirement whose status hasn't changed recently.
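
To make the definition concrete, here is a hypothetical in-memory sketch of that recursion. The production check runs in SQL, as we'll see, and names like children, in_progress?, and workflow_status_changed_at are illustrative only:

# Hypothetical sketch of the Work stalled recursion. The real check runs
# in SQL; these method names are illustrative, not our actual API.
STALLED_AFTER_DAYS = 14 # in practice, configurable per workspace

def stalled?(record)
  in_progress_children = record.children.select(&:in_progress?)
  if in_progress_children.any?
    # A parent with in-progress children is stalled only if one of them is,
    # which is how a stalled requirement bubbles up to its epic.
    in_progress_children.any? { |child| stalled?(child) }
  else
    # ActiveSupport's .days.ago, since this is a Rails codebase
    record.workflow_status_changed_at < STALLED_AFTER_DAYS.days.ago
  end
end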

The bottom-up approach

The approach that first sprang to mind was doing the calculations at the point where the data changes. We could check for any parent records whenever a status changes and do the necessary work at that moment. That could work for some of our risk factor definitions. But the catch we saw with Work stalled was that, being time-based, it can become a risk without any change in the underlying data.

One way we considered solving this was with a scheduled job that runs periodically, say overnight or hourly. The downside is that you compromise to some degree on the real-time aspect. I have to wait for this job to run to see the risk — that's not something I want to think about as a user. We'd also be concentrating a large amount of work at one point in time (for every record, check all of its children's statuses, and so on). And that's just for this one risk factor.

The top-down approach

What if we instead flipped the problem on its head, taking the perspective of looking "down" from the place where we want to see the output of the risk factor? That leads us to calculating each risk factor every time the record loads or a report runs. This definitely gives us the real-time responsiveness we're looking for — but it does come with some valid performance concerns.

Generated columns and triggers

We use PostgreSQL as our database, and it comes with an amazing feature that allows you to define generated columns. Unfortunately for us, there was a catch here: Some of the delivery risks we wanted to define have dependencies on other records (e.g., other tables), and a generated column can only refer to the current row.
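
As a sketch of what that would have looked like for a row-local risk (hypothetical; Rails 7 migration syntax, which we didn't ship):

# Sketch of a generated column for a row-local risk. Postgres computes
# and stores the value alongside the row on every write.
add_column :features, :no_team, :virtual,
  type: :boolean, as: "team_id IS NULL", stored: true

# Postgres rejects a generation expression that reads another table or
# calls a non-immutable function like now(), which rules this approach
# out for risks like Work stalled.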

Another database-centric approach could be to leverage triggers to recalculate a given risk factor whenever the record changes. However, triggers only fire on writes, so this still wouldn't allow recalculating based purely on the passage of time without some kind of scheduled job.
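
To illustrate that limitation, here is a sketch of what a trigger migration might have looked like (hypothetical; we didn't build this):

# Sketch of the trigger approach we ruled out: it recalculates on every
# INSERT/UPDATE, but a time-based risk like Work stalled can activate
# without any write ever happening.
execute <<~SQL
  CREATE FUNCTION recalculate_no_team() RETURNS trigger AS $$
  BEGIN
    NEW.no_team := NEW.team_id IS NULL;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER features_recalculate_no_team
    BEFORE INSERT OR UPDATE ON features
    FOR EACH ROW EXECUTE FUNCTION recalculate_no_team();
SQL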

Equations

Aha! already has a framework for calculating fields from equations and other data in our reporting system. This was a promising avenue at first: The system automatically tracks dependent records based on the formula used and only updates when they change. Unfortunately, we ended up ruling out this approach too — the equation system has no concept of variables outside the data itself, such as the current date or time. Because it is driven purely by changes to other fields, supporting time-based updates would mean introducing some kind of scheduling component running at an arbitrary granularity, adding complexity and computational overhead.

Our solution: Virtual attributes

Aha! has a calculated_attributes gem that we wrote and open-sourced to solve similar problems. As a bonus, we already had it in place on many of our models. It works by patching ActiveRecord to include custom SQL statements as part of the normal model query process. By calling the calculated(:calculated, :fields, :here) method on a relation, those fields are automatically executed and added to the model data when it loads.

Prior to this work, we were mostly using it for counts or relatively simple calculations. But because it's driven by plain old SQL statements, there was no reason we couldn't use it to execute more complex queries.

Some quick prototyping on the simpler fields showed promising results right away:

# Simple risks read columns on the record itself, so the Arel is trivial
calculated :no_team, -> { arel_table[:team_id].eq(nil) }
calculated ...

# Collect the calculated values into a hash for callers
def risk_factors
  slice(:no_team, ...)
end

This would allow us to write arbitrary SQL to define all of the desired delivery risks.
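
For instance, a cross-table risk could drop down to a raw SQL fragment (a hypothetical sketch; record_links is an invented table name, though the workflow_statuses check mirrors the real queries shown below):

# Hypothetical cross-table risk as a raw SQL fragment, assuming the gem
# accepts any Arel node. The fragment is evaluated per row in the SELECT,
# so it can consult other tables — which generated columns could not.
calculated :dependency_not_started, -> {
  Arel.sql(<<~SQL.squish)
    EXISTS (
      SELECT 1
      FROM record_links rl
      JOIN features child ON child.id = rl.child_record_id
      JOIN workflow_statuses ws ON ws.id = child.workflow_status_id
      WHERE rl.parent_record_id = features.id
        AND coalesce(ws.internal_meaning, 0) < 20
    )
  SQL
}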

Query complexity and performance

Calculating all of these fields every time the record loads would inherently make those queries more complex, so we set out to establish a baseline and see just how much more complex they would get. The queries below are simplified and only include risk factors related to a lack of assignee and scheduling.

Here is loading one feature without risks:

SELECT "features"."id", ...
FROM "features"
LIMIT 1;

And here is loading one feature, including the two risk factors mentioned above:

SELECT "features"."id", ...,
  (
    "rf_features_prc"."no_assignee" = TRUE
    AND "features"."assigned_to_user_id" IS NULL
    AND (
      SELECT
        TRUE
      FROM
        workflow_statuses ws
      WHERE
        ws.id = features.workflow_status_id
        AND coalesce(ws.internal_meaning, 0) < 20
    )
  ) AS no_assignee,
  (
    "rf_features_prc"."not_scheduled" = TRUE
    AND "features"."start_date" IS NULL
    AND "features"."due_date" IS NULL
    ...
  ) AS not_scheduled
FROM "features"
LEFT JOIN projects rf_features_projects ON rf_features_projects.id = features.project_id
  ...
LIMIT 1;

Altogether, we've added two subselects and a few joins. This seems reasonable looking at just the SQL, but let's also take a look at the query plans to get deeper insight.

First, here is the regular feature load:

 Limit  (cost=0.05..0.16 rows=1 width=279) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Seq Scan on features  (cost=0.05..1001.00 rows=9003 width=279) (actual time=0.020..0.020 rows=1 loops=1)
         Filter: (project_id = ANY ('{...}'::bigint[]))
 Planning Time: 0.217 ms
 Execution Time: 0.052 ms
(5 rows)

That's about as simple as it gets for a query. But what about including the calculated risk factors?

 Limit  (cost=0.34..17.51 rows=1 width=281) (actual time=0.079..0.081 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.34..154647.50 rows=9003 width=281) (actual time=0.079..0.080 rows=1 loops=1)
         ->  Nested Loop Left Join  (cost=0.20..3807.43 rows=9003 width=282) (actual time=0.050..0.051 rows=1 loops=1)
               Join Filter: (rf_features_rr.riskable_id = features.id)
               Rows Removed by Join Filter: 8
               ->  Nested Loop Left Join  (cost=0.20..2455.79 rows=9003 width=282) (actual time=0.034..0.034 rows=1 loops=1)
                     ->  Seq Scan on features  (cost=0.05..1001.00 rows=9003 width=279) (actual time=0.026..0.026 rows=1 loops=1)
                           Filter: (project_id = ANY ('{...}'::bigint[]))
                     ->  Index Scan using index_project_risks_configurations_on_project_id on project_risks_configurations rf_features_prc  (cost=0.14..0.16 rows=1 width=11) (actual time=0.005..0.005 rows=1 loops=1)
                           Index Cond: (project_id = features.project_id)
                           Filter: enabled
               ->  Materialize  (cost=0.00..1.21 rows=10 width=8) (actual time=0.010..0.014 rows=9 loops=1)
                     ->  Seq Scan on record_risks rf_features_rr  (cost=0.00..1.16 rows=10 width=8) (actual time=0.006..0.008 rows=9 loops=1)
                           Filter: ((riskable_type)::text = 'Feature'::text)
                           Rows Removed by Filter: 3
         ->  Index Scan using releases_pkey on releases rf_features_releases  (cost=0.14..0.16 rows=1 width=9) (actual time=0.009..0.009 rows=1 loops=1)
               Index Cond: (id = features.release_id)
         SubPlan 1
           ->  Index Scan using workflow_statuses_pkey on workflow_statuses ws  (cost=0.28..8.30 rows=1 width=1) (actual time=0.012..0.013 rows=1 loops=1)
                 Index Cond: (id = features.workflow_status_id)
                 Filter: (COALESCE(internal_meaning, 0) < 20)
         SubPlan 2
           ->  Index Scan using workflow_statuses_pkey on workflow_statuses ws_1  (cost=0.28..8.30 rows=1 width=1) (never executed)
                 Index Cond: (id = features.workflow_status_id)
                 Filter: (COALESCE(internal_meaning, 0) < 20)
 Planning Time: 1.554 ms
 Execution Time: 0.338 ms
(27 rows)

OK! We cranked up the complexity quite a bit there, but the execution time is still well within reason (at least running Postgres 14 on an M1 Pro laptop). These experiments gave us confidence that this was a viable approach — but also that we should still be cautious about loading delivery risks that aren't strictly necessary.

When do risks load?

There are three places you can see delivery risks within Aha! software:

  • In the drawer for a record (if the delivery risk item has been added to the layout)
  • On the card for a record (if one of the delivery risk's fields has been added to the card layout)
  • In a report (if you're filtering by delivery risk or if you've added a delivery risk column to the report)

Notice that all of these places are conditional: Each has a component that makes loading delivery risk information optional. We can use this to avoid paying the performance penalty when loading these views. Due to the way our drawer layouts are implemented, the data naturally isn't loaded if the delivery risks item isn't in the layout, so that case is handled for free.

Similarly for the card data: When we generate the query that loads the board, we have access to the card layout and can decide whether we need to include the risk factors.

Reporting is the best case of all, in that we know exactly which risk factors need to be loaded. When you select a delivery risk in a filter or add a delivery risk column to the report, the SQL for just that risk factor can be dynamically inserted into our query builder, keeping performance as tight as possible.
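
In code, that can be as simple as passing only the requested fields to calculated (a sketch; the real report integration is more involved):

# Sketch: a report referencing a single delivery risk only computes that
# one calculated field, instead of all 17.
requested = [:no_team]
Feature.calculated(*requested).limit(100).map { |f| f.slice(*requested) }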

The final implementation

After reviewing the options available to us, the only solution that ticked all of our boxes was calculating the fields for each object as it loads, using the calculated_attributes gem.

We had proofs of concept for most of the other delivery risks fairly quickly, then abstracted them into a concern and built a DSL to clean up their definitions.

Defining a new risk is now as simple as calling risk_factor:

risk_factor :no_assignee, arel_table[:assigned_to_user_id].eq(nil)

risk_factor :not_scheduled, arel_table[:start_date].eq(nil).and(arel_table[:due_date].eq(nil)), except: Requirement

The risk_factor method tracks the risk factors in an object (active_risk_factors) on the class, allowing us to build a predefined scope that handles calling calculated with all of the delivery risks automatically passed as parameters:

scope :with_risk_factors, -> {
  HasRiskFactors.joins_for_class(klass).reduce(current_scope || self) do |acc, join|
    acc.joins(join)
  end.calculated(*active_risk_factors.keys)
}

Now, whenever we'd like to load a feature with the active risk factors, it's as simple as:

Feature.with_risk_factors.find("DEMO-123").risk_factors

=> { :no_assignee => true, :not_scheduled => false }

Local benchmarking and performance

Here are some benchmarks on loading features with and without delivery risks, performed on my MacBook Pro (M1 Pro CPU) running Postgres 14. Units are seconds. All means "no limit," which is 8,998 features in my local database. Each row was run for 10,000 iterations and averaged. These runs include all 17 of the delivery risks available in Aha! — so the query and plan are even more complex than the examples above.
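
The harness looked something like this (a sketch; the exact script isn't reproduced here, and the Benchmark totals are divided by the iteration count to get the per-query averages):

require "benchmark"

# Sketch of the benchmark harness: each scenario loads features with and
# without delivery risks, 10,000 times per row.
ITERATIONS = 10_000

Benchmark.bm(24) do |x|
  { "1" => 1, "10" => 10, "100" => 100, "1000" => 1000, "All" => nil }.each do |label, limit|
    x.report("#{label} Features") do
      ITERATIONS.times { Feature.limit(limit).to_a }
    end
    x.report("#{label} Features with risks") do
      ITERATIONS.times { Feature.with_risk_factors.limit(limit).to_a }
    end
  end
end

And the results: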

                               user     system      total        real
1    Feature               0.021903   0.000080   0.021983 (  0.022046)
10   Features              0.019104   0.000032   0.019136 (  0.019164)
100  Features              0.018840   0.000026   0.018866 (  0.018901)
1000 Features              0.018174   0.000018   0.018192 (  0.018203)
All  Features              0.017293   0.000034   0.017327 (  0.017340)
1    Feature with risks    0.617923   0.000863   0.618786 (  0.619724)
10   Features with risks   0.601085   0.001258   0.602343 (  0.603338)
100  Features with risks   0.588217   0.000949   0.589166 (  0.590493)
1000 Features with risks   0.589340   0.000919   0.590259 (  0.591504)
All  Features with risks   0.589590   0.000745   0.590335 (  0.591605)

From this, we can see that loading the delivery risks is meaningfully slower. However, we should also take into account the complexity of the additional data being requested. Perhaps the most interesting insight is that there doesn't appear to be any correlation between the number of records loaded and the query time: a testament to Postgres' raw performance. This gives us confidence that performance will be stable on even the largest workflow boards and reports.


Aha! is happy, healthy, and hiring. Join us!

This feature took a lot of tinkering and guidance from other engineers. Without such a knowledgeable and helpful team, I would not have arrived at this clean solution. If you want to build lovable software with this talented and growing team, apply for an open role.

Michael Shiel

Michael is a senior software engineer at Aha! — the world’s #1 product development software. He is based in Calgary, Canada.
