How BigQuery broke my heart
Gabe Hamilton
Reporting Solutions Smackdown
We are evaluating replacements for SQL Server for our
Reporting & Business Intelligence backend.
Many TBs of data.
The closer a solution is to SQL, the less report migration we need to do.
We like saving money.
Solutions we've been testing
Redshift
BigQuery
CouchDB
MongoDB
Cassandra
Teradata
Oracle
Plus various changes to our design
Some of these are necessary for certain technologies.
Denormalization
Sharding strategies
Nested data
Tune our existing Star Schema and Tables
BigQuery is
A massively parallel datastore
Columnar
Queries are SQL Select statements
Uses a tree structure to distribute queries across nodes
How many nodes?
10,000 nodes!
And what price?
3.5 cents/GB
Resource Pricing
Query cost is per GB of data in the columns processed.
Interactive Queries: $0.035 per GB
Batch Queries: $0.02 per GB
Storage: $0.12 per GB/month
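As a sanity check on that arithmetic (a trivial sketch using the rates listed above; modern SQL dialects let you SELECT constant expressions), the math can be written as a query:

SELECT
  200 * 0.035 AS interactive_cost_usd,  -- 200 GB scanned interactively: $7.00
  200 * 0.020 AS batch_cost_usd         -- the same scan as a batch query: $4.00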
Which is great for our big queries
A gnarly query that scans 200 GB of data costs about $7 in BigQuery
(200 GB × $0.035/GB).
If that same query takes 2 hours to run on a $60/hr cluster of a
competing technology, a single run costs $120.
It's a little more complicated because, in theory, several of those
queries could run simultaneously on the competing tech.
Still, even with four queries sharing that cluster, BigQuery is about
4× cheaper, plus the speed improvement.
Example: Github data from past year
3.5 GB Table
SELECT type, COUNT(*) AS num FROM [publicdata:samples.github_timeline]
GROUP BY type ORDER BY num DESC;
Query complete (1.1s elapsed, 75.0 MB processed)
Event Type num
PushEvent 2,686,723
CreateEvent 964,830
WatchEvent 581,029
IssueCommentEvent 507,724
GistEvent 366,643
IssuesEvent 305,479
ForkEvent 180,712
PullRequestEvent 173,204
FollowEvent 156,427
GollumEvent 104,808
Cost: $0.0026
or about four for a penny
It was love at first type.
But Then...
Reality
Uploaded our test dataset, which is 250 GB.
Docs are good, tools are good.
Hurdle 1: only one JOIN per query.
OK, rewrite as ugly nested selects (sketched below)...
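For example (a hypothetical sketch with made-up inventory, store, and product tables, not our actual schema), you bury one join inside a subselect so that each SELECT level contains only a single JOIN:

SELECT p.ProductName, s.StoreName, s.InventoryQuantity
FROM (
  -- first join: inventory facts to stores
  SELECT i.ProductId, st.StoreName, i.InventoryQuantity
  FROM BigDataTest.inventory i
  JOIN BigDataTest.stores st ON i.StoreId = st.StoreId
) s
-- second join: the subselect's rows to products
JOIN BigDataTest.products p ON s.ProductId = p.ProductId

Each additional level of nesting buys you one more join, at some cost in readability.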
Result
Round 2
No problem, I had seen that joins were somewhat experimental.
Try the denormalized version of the data.
SELECT ProductId, StoreId, ProductSizeId, InventoryDate,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate
The first error message helpfully says: try GROUP EACH BY.
(EACH was a legacy-SQL hint that the grouping would produce a large
number of distinct groups.)
Final Result
It's not you, it's me
The documentation had some semi-useful information:
Because the system is interactive, queries that produce a large number of
groups might fail. The use of the TOP function instead of GROUP BY might
solve the problem.
However, the BigQuery TOP function only operates on one column.
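For reference, TOP in the legacy syntax takes a single field plus a row count, roughly like this (a sketch; the column choice is illustrative):

SELECT TOP(ProductId, 1000), COUNT(*)
FROM BigDataTest.denorm

There is no way to hand it a compound key like (ProductId, StoreId, ProductSizeId, InventoryDate), which is what our grouping needs.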
At this point I had jumped through enough hoops. I posted
on Stack Overflow, the official support channel according to
the docs, and have gotten no response.
Epilogue
Simplifying my query down to two grouping columns did get it to run,
with a LIMIT clause.
SELECT ProductId, StoreId,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId
LIMIT 1000
Query complete (4.5s elapsed, 28.1 GB processed)
Without a LIMIT, it gives: Error: Response too large to return.
Perhaps there is still hope for me and BigQuery...
Me
Like this talk?
@gabehamilton
My Twitter feed is just technical stuff.
or slideshare.net/gabehamilton
