I’m here to talk to you about how we write, run, and manage complex queries in Redshift at Aggregate Knowledge.To give you some background, our use of Redshift has been a substantial departure from how we generated reports in the past. Up until recently, all of our reports were generated through a streaming aggregation system.
But, generating one report in particular stumped us in a streaming setting.Multitouch attribution.What is it? It’s a way of answering the most important questions as an advertiser, which is how much should I pay for an ad?How do you answer the question: how much is a facebook ad worth? This is non-trivial.The way you answer this question is by looking at each user’s history up to a purchase and collecting statistics about the different ads the user has seen, and what order they’ve seen them in.
And those of you who feel like this is ringing a bell, you’re right, this is just the same behavioral analytics problem you see in any business. Take a user’s activity, look at it as a timeline, and try to understand what is preventing people from buying.
Market basket is a similar report in spirit.
And really they’re all addressing the same fundamental questions:Do we care that there are more or one type of activity or that the activity happened closer to the purchase?Do we care about the distribution of the different types of activity in time? In distinct type?Do we care that the activity was tightly clustered or evenly spread out?
And honestly, this is a pretty well understood problem in modern variants of SQL.
Here it is. You use a window function over the user’s ID and you’re done. Right?
So I know what the problem is and I know how to write the SQL to answer it, so why am I here?
Well, lots of data makes for lots of problems. And to boot the devil is in the details when you’re turning a query snippet into a product.
So how do we tame these crazy queries and data volumes?The answer is Redshift. We use Redshift to tackle these problems and we’re quite happy with how it’s treated us.So for the rest of the talk, I’m going to tell you about how we did that and what we’ve learned about these kinds of workloads.
What we found was that these three things were the critical parts of making Redshift work best for us.
First thing we had to do was turn the idea of a report like MTA into something executable and that meant writing some very complex queries.And what weighed heavily on us was that all the analytics SQL we’d seen in the past was a tangled mess, hundreds of lines long, with no documentation or testing.So we reminded ourselves that SQL is code.
And we need to apply our standard engineering rigor to our SQL. Redshift has a couple of features that make it very easy to deliver clean, logical, tested code.
The first is common table expressions. These allow me to factor my queries into digestible chunks that can be composed logically.
And to give you an idea of how helpful common table expressions can be I have one of our MTA queries here. It’s 500 lines long.With common table expressions, I can break this complex logic into smaller, reusable, testable chunks of about 8 to 12 lines.I can then take those pieces individually and debug them and test them, and once I’m confident in the correctness of each piece, I can layer them together to form the complex business logic I need.And this is simply applying that engineering rigor I mentioned before, to a place that was once like the wild west.
The second feature that allows us to manage the complexity of these queries is window functions.They allow us to express very advanced analytics in a concise and clear way.
In MTA we use window functions to answer questions likewhere in a user’s timeline a given ad impression fallshow many ads does a user see per sessionhow do users go from one site to another,And even more subtle analyses like “how many different sites did a user see before this site?”
And all of it in one or two lines a piece. This really makes our code much clearer and shorter.
So now we have the tools and SQL to express a complex report like MTA, but because of the scale of our data, we need to do some work to make it go as fast as possible.We found a few important techniques for organizing our data that allowed us to extract great performance from Redshift.
And simply, they are just playing to Redshift’s strengths, which are those of any MPP system.MTA has to join many billion row tables together, over multiple logical phases, which means that we need to make sure the data is laid out to facilitate that.
We have a couple of tricks that we use to do that, but the most useful one has been storing multiple representations of our data, each one sorted to optimize performance of a certain class of query.This has very little cost in Redshift because Redshift nodes come with a lot of storage, and they have excellent IO. The cost of loading data and materializing different sort orders is minimal compared to the performance improvements we see.
Now, clean queries that run fast are one thing, but building a product is another. In order to provide business value on top of the technical value, we need to optimize for the humans that write, debug, and ship these reports.
Redshift and the tooling around it gives us the ability to remove significant operational roadblocks that we would encounter with on-premises database vendors.
For instance, the complexity of MTA really showed us how important it is to have a very flexible operational environment.Specifically, the excellent bandwidth to and from S3 gave us the ability to materialize intermediate results from our queries and debug them without worrying about scratch space.We could also quickly and easily take snapshots before any major changes which really freed us up to take risks and experiment with different versions of queries and data layouts.In general, having great operational tools like the Redshift dashboard, plus the elastic scaling of well-integrated AWS services, allowed us to focus on developer productivity and getting a product out as fast as possible.
But just because you have all these easy tools to grow your use of redshift doesn’t mean that you can’t be frugal.
We used the ability to launch multiple clusters at once to benchmark and quantify the cost in dollars and seconds of our reports. This makes P&L and capacity planning much, much easier.Combine that with the fact that I can turn clusters on and off with minimal effort, and suddenly the lifecycles of all those clusters I launched are now manageable and can be tailored to their use case. QA clusters only need to be up during business hours. The dashboards are easy enough to manage that the devs that launch clusters are perfectly capable of shutting them down.And just a final point to wrap it up, Redshift’s stability, scalability, and reliability makes it so easy to quantify query performance in money and time that it gets you out of the habit of participating in the rat race of how many rows or how many nodes or how many jobs, and refocuses your thinking on“how much does it cost to deliver this report?”“how long will the customer have to wait?”“how do I enable my developers?”“how do I focus on minimizing technical debt?”It helps you focus on the business value you’re here to create.
-- Position in timeline
SUM(1) OVER (PARTITION BY user_id
ORDER BY record_date DESC ROWS UNBOUNDED PRECEDING)
-- Event count in timeline
SUM(1) OVER (PARTITION BY user_id
ORDER BY record_date DESC BETWEEN UNBOUNDED PRECEDING AND
-- Transition matrix of sites
LAG(site_name) OVER (PARTITION BY user_id
ORDER BY record_date DESC)
-- Unique sites in timeline, up to now
COUNT(DISTINCT site_name) OVER (PARTITION BY user_id
ORDER BY record_date DESC
ROWS UNBOUNDED PRECEDING)
Compact but expressive.
Simple to reason about.