A kind of map/reduce using CTEs
I was once tasked with writing a pretty complicated query to quantify Watsi’s weekly growth. We wanted to take monthly donations into account, though. How does one do that when monthly donations are all charged on the first of the month? How do we include tips to our overhead fund?
I recognize there are all sorts of ways to do this. Hell, I could have written this in Ruby, and I almost did, but we used Heroku’s Dataclips, which allowed us to get near-realtime results[1].
Let’s take a slightly simpler example: A small business sells widgets. Widgets are sold for all different prices, and you can even set up a recurring order that is charged on the first of the month.
Imagine we have a very simple table, `purchases`, that looks something like this:
name | type |
---|---|
id | int |
amount | int |
created_at | timestamp without time zone |
refunded_at | timestamp without time zone |
recurring | bool |
Some sample data:
id | amount | created_at | refunded_at | recurring |
---|---|---|---|---|
1 | 1300 | 2014-08-01 08:00:00 | null | true |
2 | 900 | 2014-08-03 13:30:00 | null | false |
3 | 1300 | 2014-08-24 23:30:00 | 2014-08-24 23:31:00 | false |
4 | 1400 | 2014-08-24 17:30:00 | null | false |
5 | 1700 | 2014-09-01 08:00:00 | null | true |
6 | 1200 | 2014-09-03 17:27:00 | null | false |
So, given this info, we calculate our weekly amount as:
sum over each week of (daily totals + each month’s recurring total / # of days in that month)
Note: I’m going to ignore time zones for brevity, but assume I’m setting everything to `America/Los_Angeles`.
In any other database I would have had to use either temp tables or sub-selects, but PostgreSQL has what are called Common Table Expressions, or `WITH` queries. `WITH` queries allow us to turn sub-selects into named tables without the need to generate actual temp tables.
A query written around a sub-select can be turned into an equivalent `WITH` query.
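The original before-and-after snippet didn’t survive in this copy of the post, so here’s a minimal sketch of the transformation using the `purchases` table above (the `daily_purchases` name is just illustrative):

```sql
-- With a sub-select, the derived table only exists inline:
SELECT day, purchased
FROM (
    SELECT created_at::date AS day, sum(amount) AS purchased
    FROM purchases
    GROUP BY created_at::date
) AS daily_purchases;

-- The same query as a CTE: the derived table gets a name up front,
-- and the final SELECT reads like it's querying a real table.
WITH daily_purchases AS (
    SELECT created_at::date AS day, sum(amount) AS purchased
    FROM purchases
    GROUP BY created_at::date
)
SELECT day, purchased
FROM daily_purchases;
```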
This trivial example doesn’t do the greatest job of showing the power of CTEs, but the final query below will.
OK, so let’s see some SQL. First, we total up each day’s one-time (non-recurring, non-refunded) purchases:
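The query itself wasn’t preserved in this copy of the post, so here’s a sketch that produces this shape of result. I’m assuming `amount` is stored in cents and formatting it into dollars with `to_char`; the original display formatting may have differed:

```sql
WITH daily_purchases AS (
    -- one-time purchases that were never refunded, totaled per day
    SELECT created_at::date AS day,
           sum(amount) AS purchased
    FROM purchases
    WHERE NOT recurring
      AND refunded_at IS NULL
    GROUP BY 1
)
SELECT day,
       to_char(purchased / 100.0, 'FM"$"9,990.00') AS purchased
FROM daily_purchases
ORDER BY day;
```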
This yields results like so:
day | purchased |
---|---|
2014-08-03 | $9.00 |
2014-08-24 | $14.00 |
2014-09-03 | $12.00 |
Now let’s add the month to this result set, which wasn’t possible before because of the `group by` clause.
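A sketch of that version: the only change is grouping by `date_trunc('month', …)` as well (again assuming cents and my own formatting):

```sql
WITH daily_purchases AS (
    SELECT date_trunc('month', created_at)::date AS month,
           created_at::date AS day,
           sum(amount) AS purchased
    FROM purchases
    WHERE NOT recurring
      AND refunded_at IS NULL
    GROUP BY 1, 2   -- now grouped by month AND day
)
SELECT month, day,
       to_char(purchased / 100.0, 'FM"$"9,990.00') AS purchased
FROM daily_purchases
ORDER BY day;
```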
month | day | purchased |
---|---|---|
2014-08-01 | 2014-08-03 | $9.00 |
2014-08-01 | 2014-08-24 | $14.00 |
2014-09-01 | 2014-09-03 | $12.00 |
Now we need to calculate each month’s recurring revenue:
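A sketch of this step: it mirrors the daily query, but flips the filter to `recurring` rows and groups by month only:

```sql
WITH recurring_purchases AS (
    -- recurring purchases, totaled per month
    SELECT date_trunc('month', created_at)::date AS month,
           sum(amount) AS purchased
    FROM purchases
    WHERE recurring
      AND refunded_at IS NULL
    GROUP BY 1
)
SELECT month,
       to_char(purchased / 100.0, 'FM"$"9,990.00') AS purchased
FROM recurring_purchases
ORDER BY month;
```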
month | purchased |
---|---|
2014-08-01 | $13.00 |
2014-09-01 | $17.00 |
But now we have to average the recurring revenue into each day. How can we handle the different number of days in each month? With another query:
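One way to get the day count (a sketch; the original may have computed it differently): jump one month forward from the first of the month, step one day back to land on the month’s last day, and extract its day-of-month:

```sql
WITH recurring_purchases AS (
    SELECT date_trunc('month', created_at)::date AS month,
           sum(amount) AS purchased
    FROM purchases
    WHERE recurring
      AND refunded_at IS NULL
    GROUP BY 1
)
SELECT extract(day FROM month + interval '1 month' - interval '1 day')::int
           AS days_per_month,   -- day-of-month of the month's last day
       month,
       to_char(purchased / 100.0, 'FM"$"9,990.00') AS purchased
FROM recurring_purchases
ORDER BY month;
```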
days_per_month | month | purchased |
---|---|---|
31 | 2014-08-01 | $13.00 |
30 | 2014-09-01 | $17.00 |
August has 31 days; September has 30. This approach handles leap-year Februaries too.
We now need to find the actual average per day based on the month:
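Dividing each month’s recurring total by its day count gives the per-day average. A sketch (the `::numeric` cast matters, since integer division would truncate the cents):

```sql
WITH recurring_purchases AS (
    SELECT date_trunc('month', created_at)::date AS month,
           sum(amount) AS purchased
    FROM purchases
    WHERE recurring
      AND refunded_at IS NULL
    GROUP BY 1
), recurring_purchases_per_day AS (
    -- spread the month's recurring total (cents) across its days
    SELECT month,
           purchased::numeric /
               extract(day FROM month + interval '1 month' - interval '1 day')
               AS purchased
    FROM recurring_purchases
)
SELECT month,
       to_char(purchased / 100.0, 'FM"$"990.00') AS purchased
FROM recurring_purchases_per_day
ORDER BY month;
```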
So now we have:
month | purchased |
---|---|
2014-08-01 | $0.42 |
2014-09-01 | $0.57 |
Now it’s time to start mashing these tables together:
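A sketch of the mashing: join each day’s one-time total to its month’s per-day recurring average, using the month column we added earlier:

```sql
WITH daily_purchases AS (
    SELECT date_trunc('month', created_at)::date AS month,
           created_at::date AS day,
           sum(amount) AS purchased
    FROM purchases
    WHERE NOT recurring
      AND refunded_at IS NULL
    GROUP BY 1, 2
), recurring_purchases AS (
    SELECT date_trunc('month', created_at)::date AS month,
           sum(amount) AS purchased
    FROM purchases
    WHERE recurring
      AND refunded_at IS NULL
    GROUP BY 1
), recurring_purchases_per_day AS (
    SELECT month,
           purchased::numeric /
               extract(day FROM month + interval '1 month' - interval '1 day')
               AS purchased
    FROM recurring_purchases
)
SELECT d.day,
       to_char(d.purchased / 100.0, 'FM"$"9,990.00') AS purchased,
       to_char(r.purchased / 100.0, 'FM"$"990.00') AS recurring_purchases_per_day
FROM daily_purchases d
JOIN recurring_purchases_per_day r USING (month)
ORDER BY d.day;
```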
day | purchased | recurring_purchases_per_day |
---|---|---|
2014-08-03 | $9.00 | $0.42 |
2014-08-24 | $14.00 | $0.42 |
2014-09-03 | $12.00 | $0.57 |
OK, we’re close! We have daily totals and avg daily totals from recurring.
The last calculation we need is to add up the amounts and group by the week. This is the reduction I alluded to above:
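The reduction itself is just a `date_trunc('week', …)` and a `sum` over the joined rows. As a fragment, assuming the joined per-day result above is available as a CTE I’ll call `daily_with_recurring`, with both amounts still in cents:

```sql
SELECT date_trunc('week', day)::date AS week,
       to_char(sum(purchased + recurring_purchases_per_day) / 100.0,
               'FM"$"9,990.00') AS total_purchased
FROM daily_with_recurring
GROUP BY 1
ORDER BY 1;
```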
Finally, we get our actual weekly totals:
week | total_purchased |
---|---|
2014-07-28 | $9.42 |
2014-08-18 | $14.42 |
2014-09-01 | $12.57 |
Notice that the start of the week shifted: PostgreSQL’s `date_trunc('week', …)` buckets dates into ISO weeks, which start on Monday. Take heed!
The final query
In all its glory, or something:
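The original query wasn’t preserved in this copy of the post, so here’s a reconstruction that chains every step above into one `WITH` query and reproduces the sample results. Amounts are assumed to be cents, and all the CTE names are my own:

```sql
WITH daily_purchases AS (
    -- one-time purchases, totaled per day (keep the month for the join)
    SELECT date_trunc('month', created_at)::date AS month,
           created_at::date AS day,
           sum(amount) AS purchased
    FROM purchases
    WHERE NOT recurring
      AND refunded_at IS NULL
    GROUP BY 1, 2
), recurring_purchases AS (
    -- recurring purchases, totaled per month
    SELECT date_trunc('month', created_at)::date AS month,
           sum(amount) AS purchased
    FROM purchases
    WHERE recurring
      AND refunded_at IS NULL
    GROUP BY 1
), recurring_purchases_per_day AS (
    -- spread each month's recurring total across its days
    SELECT month,
           purchased::numeric /
               extract(day FROM month + interval '1 month' - interval '1 day')
               AS purchased
    FROM recurring_purchases
), daily_with_recurring AS (
    -- the "map": each day's one-time total plus its share of recurring
    SELECT d.day,
           d.purchased + r.purchased AS purchased
    FROM daily_purchases d
    JOIN recurring_purchases_per_day r USING (month)
)
-- the "reduce": roll the days up into weeks
SELECT date_trunc('week', day)::date AS week,
       to_char(sum(purchased) / 100.0, 'FM"$"9,990.00') AS total_purchased
FROM daily_with_recurring
GROUP BY 1
ORDER BY 1;
```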
I hope you enjoyed this venture into using CTEs to build up a simple result from a decently complex (for SQL) report. I’ve grown quite fond of CTEs, and luckily I used PostgreSQL exclusively for my last several projects. The expressive power to compose SQL queries like this is one of the many things I love about PostgreSQL.
Discussion over on hn.
ps: My friend Jacob asked how well this performs. From what I can tell it performs really well, but I haven’t had the chance to benchmark it yet. I’m hoping to write up how it performs 1) with more data and 2) against an implementation in Ruby.
[1] I say “near-realtime” because Heroku caches the results for some period.
- Jesse