A kind of map/reduce using CTE's
I was once tasked with writing a pretty complicated query to quantify Watsi’s weekly growth. We wanted to take into account monthly donations though. How does one do that when monthly donations are all charged on the first of the month? How do we include tips to our overhead fund?
I recognize there are all sorts of ways to do this. Hell, I could have
written this in Ruby
, and I almost did, but we used Heroku’s
Dataclips which allowed us to get near-realtime results[1].
Let’s take a slightly simpler example: A small business sells widgets. Widgets are sold for all different prices, and you can even set up a recurring order that is charged on the first of the month.
Imagine we have a very simple table purchases
that looks something
like this:
name | type |
---|---|
id | int |
amount | int |
created_at | timestamp without time zone |
refunded_at | timestamp without time zone |
recurring | bool |
Some sample data:
id | amount | created_at | refunded_at | recurring |
---|---|---|---|---|
1 | 1300 | 2014-08-01 08:00:00 | null | true |
2 | 900 | 2014-08-03 13:30:00 | null | false |
3 | 1300 | 2014-08-24 23:30:00 | 2014-08-24 23:31:00 | false |
3 | 1400 | 2014-08-24 17:30:00 | null | false |
4 | 1700 | 2014-09-01 08:00:00 | null | true |
5 | 1200 | 2014-09-03 17:27:00 | null | false |
So, given this info, we calculate our weekly amount as:
(daily totals + avg daily totals from recurring) / # of weeks in month
Note: I’m going to ignore time zones for brevity, but assume I’m setting
everything to America/Los_Angeles
.
In any other database I would have had to either use temp tables
or sub-selects
but PostgreSQL has what are called Common Table Expressions
or WITH
queries.
WITH
queries allow us to turn sub-selects
into named tables without the need
to generate actual temp tables
.
select
foo,
bar
from
(
select
foo,
bar_id
from
foos
where
something > 1
) as some_foos inner join bars on
some_foos.bar_id = bars.id
can be turned into this:
with some_foos as (
select
foo,
bar_id
from
foos
where
something > 1
)
select
foo,
bar
from
some_foos inner join bars on
some_foos.bar_id = bars.id
This trivial example doesn’t do the greatest job in showing the power of CTE’s, but the final query below will.
OK, so let’s see some SQL:
with daily_purchases as (
select
date(date_trunc('day', created_at)) as day,
sum(amount)::money / 100 as purchased
from
purchases
where
refunded_at is null
and
recurring = false
group by
1
order by
1
)
which will yield results like so:
day | purchased |
---|---|
2014-08-03 | $9.00 |
2014-08-24 | $14.00 |
2014-09-03 | $12.00 |
Now let’s add the month to this result set, which wasn’t possible before
because of the group by
clause.
daily_purchases_with_month as (
select
date(date_trunc('month', day)) as month,
day,
purchased
from
daily_purchases
)
month | day | purchased |
---|---|---|
2014-08-01 | 2014-08-03 | $9.00 |
2014-08-01 | 2014-08-24 | $14.00 |
2014-09-01 | 2014-09-03 | $12.00 |
Now we need to calculate each month’s recurring revenue:
per_month_recurring as (
select
date(date_trunc('month', created_at)) as month,
sum(amount)::money / 100 as purchased
from
purchases
where
recurring = true
group by
1
order by
1
)
month | purchased |
---|---|
2014-08-01 | $13.00 |
2014-09-01 | $17.00 |
But now we have to average in the recurring into each day. How can we handle the different number of days in a month? With another query:
per_day_recurring as (
select
date_part('days',
date_trunc('month', month)
+ '1 month'::interval
- date_trunc('month', month)
) as days_per_month,
month,
purchased
from
per_month_recurring
)
days_per_month | month | purchased |
---|---|---|
31 | 2014-08-01 | $13.00 |
30 | 2014-09-01 | $17.00 |
August has 31 days, September has 30. This approach handles leap years in February too.
We now need to find the actual average per day based on the month:
average_per_day_recurring as (
select
month,
purchased / days_per_month as recurring_purchases_per_day
from
per_day_recurring
)
So now we have:
month | purchased |
---|---|
2014-08-01 | $0.42 |
2014-09-01 | $0.57 |
Now it’s time to start mashing these tables together:
totals as (
select
dpwm.day,
dpwm.purchased,
apdr.recurring_purchases_per_day
from
daily_purchases_with_month dpwm left outer join average_per_day_recurring apdr on
dpwm.month = apdr.month
)
day | purchased | recurring_purchases_per_day |
---|---|---|
2014-08-03 | $9.00 | $0.42 |
2014-08-24 | $14.00 | $0.42 |
2014-09-03 | $12.00 | $0.57 |
OK, we’re close! We have daily totals
and avg daily totals from recurring
.
Last calculation we need to do is add up the amounts and group by
the week.
This is the reduction I alluded to above:
reduced as (
select
date(date_trunc('week', day)) as week,
sum(purchased + recurring_purchases_per_day) as total_purchased
from
totals
group by
1
order by
1
)
Finally, we get our actual weekly totals:
week | total_purchased |
---|---|
2014-07-28 | $9.42 |
2014-08-18 | $14.42 |
2014-09-01 | $12.57 |
Notice that the start of the week shifted. Take heed!
The final query
In all its glory, or something:
with daily_purchases as (
select
date(date_trunc('day', created_at)) as day,
sum(amount)::money / 100 as purchased
from
purchases
where
refunded_at is null
and
recurring = false
group by
1
order by
1
), daily_purchases_with_month as (
select
date(date_trunc('month', day)) as month,
day,
purchased
from
daily_purchases
), per_month_recurring as (
select
date(date_trunc('month', created_at)) as month,
sum(amount)::money / 100 as purchased
from
purchases
where
recurring = true
group by
1
order by
1
), per_day_recurring as (
select
date_part('days',
date_trunc('month', month)
+ '1 month'::interval
- date_trunc('month', month)
) as days_per_month,
month,
purchased
from
per_month_recurring
), average_per_day_recurring as (
select
month,
purchased / days_per_month as recurring_purchases_per_day
from
per_day_recurring
), totals as (
select
dpwm.day,
dpwm.purchased,
apdr.recurring_purchases_per_day
from
daily_purchases_with_month dpwm left outer join average_per_day_recurring apdr on
dpwm.month = apdr.month
), reduced as (
select
date(date_trunc('week', day)) as week,
sum(purchased + recurring_purchases_per_day) as total_purchased
from
totals
group by
1
order by
1
)
select
*
from
reduced
I hope you enjoyed this venture into using CTE’s to build up a simple result from a decently complex, for SQL, report. I’ve grown quite fond of CTE’s and luckily I used PostgreSQL exclusively for my last several projects. The expressive power to compose SQL queries like this is one of the many things I love about PostgreSQL.
Discussion over on hn.
ps: My friend Jacob asked how well this performs. From what I can tell it performs really well, but I haven’t had the chance to benchmark it yet. I’m hoping to write up how it performs 1) with more data and 2) against an implementation in Ruby.
[1] I say “near-realtime” because Heroku caches the results for some period.
- Jesse