Readable SQL queries with Common Table Expressions
When I'm writing ad hoc SQL queries, I often build them up slowly as I work
through the problem at hand; usually, they take the form of several nested
SELECT
queries. However, that's rarely very readable, not great for sharing
with the rest of the team, or for referring back to later. Recently, I've
started using WITH
queries to clean up my queries.
An Example
Let's look at a simple (and not too contrived, I hope) example. Let's say I'm
investigating a state change in my users: I want to know how quickly they move
from pending
to active
. If you want to follow along, check out the SQL
Fiddle.
Let's start by finding all active users.
SELECT *
FROM users
WHERE state = 'active';
The result:
| id | state | activation_method | created_at |
|----|--------|-------------------|-----------------------------|
| 2 | active | email | 2021-04-01T13:06:22.943296Z |
| 3 | active | text | 2021-04-06T11:06:22.943296Z |
| 4 | active | text | 2021-04-02T13:06:22.943296Z |
| 5 | active | email | 2021-04-03T13:06:22.943296Z |
Then, let's find user_events
for those users:
SELECT *
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active');
Now we have a list of events where the state of currently-active users changed.
| id | user_id | action | state | created_at |
|----|---------|--------------------|---------|-----------------------------|
| 2 | 2 | registered-account | pending | 2021-04-01T12:57:36.903034Z |
| 3 | 2 | activated-account | active | 2021-04-03T12:57:36.903034Z |
| 4 | 3 | registered-account | pending | 2021-04-06T10:57:36.903034Z |
| 5 | 3 | activated-account | active | 2021-04-06T12:57:36.903034Z |
| 6 | 4 | registered-account | pending | 2021-04-02T12:57:36.903034Z |
| 8 | 4 | activated-account | active | 2021-04-02T16:57:36.903034Z |
| 7 | 5 | registered-account | pending | 2021-04-04T12:57:36.903034Z |
| 9 | 5 | activated-account | active | 2021-04-06T12:57:36.903034Z |
Let's use a window function to find the activation delay. We'll take the difference between registration time and activation time:
SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active');
Nice! Now we can see how long each user took to activate.
| user_id | activation_delay |
|---------|------------------------------------------------|
| 2 | (null) |
| 2 | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |
| 3 | (null) |
| 3 | 0 years 0 mons 0 days 2 hours 0 mins 0.00 secs |
| 4 | (null) |
| 4 | 0 years 0 mons 0 days 4 hours 0 mins 0.00 secs |
| 5 | (null) |
| 5 | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |
Finally, let's filter out the extra rows, join this with the users
table to get
the activation method, and then group and average by that activation method:
SELECT
activation_method,
avg(activation_delay)
FROM
(SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT id FROM users WHERE state = 'active')
AND state IN ('pending', 'active')) AS t
JOIN users ON users.id = t.user_id
WHERE activation_delay IS NOT NULL
GROUP BY activation_method;
And we're done!
| activation_method | avg |
|-------------------|------------------------------------------------|
| email | 0 years 0 mons 2 days 0 hours 0 mins 0.00 secs |
| text | 0 years 0 mons 0 days 3 hours 0 mins 0.00 secs |
We can see that users who activate via text message activate much sooner. Our Product Owner or Marketing Director would probably be interested in that.
Adding Readability with WITH
So, how do Common Table
Expressions (AKA
WITH
queries) make this query better?
Here's the idea: think of each SELECT
in the above query as creating a
temporary table, that only lives for the duration of the query. Instead of
nesting those, we can pull them out, alias them, and move them all to the top. I
like this for two reasons:
- This gives you the opportunity to assign names to each
SELECT
query, which acts like code-as-documentation. - This allows you to organize your query as you would normally read: from top
to bottom. With nested
SELECT
s, you have to read from inside out.
So what does this look like?
WITH
alias1 as (select_query1),
alias2 as (select_query2),
...
SELECT ...
And of course, you can use previous "temporary tables" within subsequent ones.
So, here's how to clean up the above query with with
:
WITH
active_user_ids as (
SELECT id FROM users WHERE state = 'active'
),
user_activation_delays as (
SELECT
user_id,
created_at - LAG(created_at, 1) OVER (PARTITION BY user_id ORDER BY created_at) as activation_delay
FROM user_events
WHERE user_id in (SELECT * from active_user_ids)
AND state IN ('pending', 'active')
)
SELECT
activation_method,
avg(activation_delay)
FROM user_activation_delays
JOIN users ON users.id = user_activation_delays.user_id
WHERE activation_delay IS NOT NULL
GROUP BY activation_method;
We've pulled out the two inner SELECT
s and given them names, so it's very
clear what their purpose is. Also, we can read this top-to-bottom and see the
story behind how we've stitched this data together. Finally, notice that we can
treat those "temporary tables" just like normal tables: joining, filtering,
grouping.