shaky.sh

Readable SQL queries with Common Table Expressions

When I'm writing ad hoc SQL queries, I often build them up slowly as I work through the problem at hand; usually, they take the form of several nested SELECT queries. However, that's rarely very readable, not great for sharing with the rest of the team, or for referring back to later. Recently, I've started using WITH queries to clean up my queries.

An Example

Let's look at a simple (and not too contrived, I hope) example. Let's say I'm investigating a state change in my users: I want to know how quickly they move from pending to active. If you want to follow along, check out the SQL Fiddle.

Let's start by finding all active users.

SELECT *
FROM users
WHERE state = 'active';

The result:

| id |  state | activation_method |                  created_at |
|----|--------|-------------------|-----------------------------|
| 2 | active | email | 2021-04-01T13:06:22.943296Z |
| 3 | active | text | 2021-04-06T11:06:22.943296Z |
| 4 | active | text | 2021-04-02T13:06:22.943296Z |
| 5 | active | email | 2021-04-03T13:06:22.943296Z |

Then, let's find user_events for those users:

SELECT *
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active');

Now we have a list of events where the state of currently-active users changed.

| id | user_id |             action |   state |                  created_at |
|----|---------|--------------------|---------|-----------------------------|
| 2 | 2 | registered-account | pending | 2021-04-01T12:57:36.903034Z |
| 3 | 2 | activated-account | active | 2021-04-03T12:57:36.903034Z |
| 4 | 3 | registered-account | pending | 2021-04-06T10:57:36.903034Z |
| 5 | 3 | activated-account | active | 2021-04-06T12:57:36.903034Z |
| 6 | 4 | registered-account | pending | 2021-04-02T12:57:36.903034Z |
| 8 | 4 | activated-account | active | 2021-04-02T16:57:36.903034Z |
| 7 | 5 | registered-account | pending | 2021-04-04T12:57:36.903034Z |
| 9 | 5 | activated-account | active | 2021-04-06T12:57:36.903034Z |

Let's use a window function to find the activation delay. We'll take the difference between registration time and activation time:

SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active');

Nice! Now we can see how long each user took to activate.

| user_id |                               activation_delay |
|---------|------------------------------------------------|
| 2 | (null) |
| 2 | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |
| 3 | (null) |
| 3 | 0 years 0 mons 0 days 2 hours 0 mins 0.00 secs |
| 4 | (null) |
| 4 | 0 years 0 mons 0 days 4 hours 0 mins 0.00 secs |
| 5 | (null) |
| 5 | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |

Finally, let's filter out the extra rows, join this with the users table to get the activation method, and then group and average by that activation method:

SELECT
activation_method,
avg(activation_delay)
FROM
(SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active')) AS t
JOIN users ON users.id = t.user_id
WHERE activation_delay IS NOT NULL
GROUP BY activation_method;

And we're done!

| activation_method |                                            avg |
|-------------------|------------------------------------------------|
| email | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |
| text | 0 years 0 mons 0 days 3 hours 0 mins 0.00 secs |

We can see that users who activate via text message activate much sooner. Our Product Owner or Marketing Director would probably be interested in that.

Adding Readability with WITH

So, how do Common Table Expressions (AKA WITH queries) make this query better?

Here's the idea: think of each SELECT in the above query as creating a temporary table, that only lives for the duration of the query. Instead of nesting those, we can pull them out, alias them, and move them all to the top. I like this for two reasons:

  1. This gives you the opportunity to assign names to each SELECT query, which acts like code-as-documentation.
  2. This allows you to organize your query as you would normally read: from top to bottom. With nested SELECTs, you have to read from inside out.

So what does this look like?

WITH
alias1 as (select_query1),
alias2 as (select_query2),
...
SELECT ...

And of course, you can use previous "temporary tables" within subsequent ones.

So, here's how to clean up the above query with with:

WITH
active_user_ids as (
SELECT id FROM users WHERE state = 'active'
),
user_activation_delays as (
SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT * from active_user_ids)
AND state IN ('pending', 'active')
)
SELECT
activation_method,
avg(activation_delay)
FROM user_activation_delays
JOIN users ON users.id = user_activation_delays.user_id
WHERE activation_delay IS NOT NULL
GROUP BY activation_method;

We've pulled out the two inner SELECTs and given them names, so it's very clear what their purpose is. Also, we can read this top-to-bottom and see the story behind how we've stitched this data together. Finally, notice that we can treat those "temporary tables" just like normal tables: joining, filtering, grouping.