How BigQuery broke my heart
Gabe Hamilton
Reporting Solutions Smackdown
We are evaluating replacements for SQL Server for our
Reporting & Business Intelligence backend.
Many TBs of data.
The closer a solution is to SQL, the less report migration we need to do.
We like saving money.
Solutions we've been testing
Redshift
BigQuery
CouchDB
MongoDB
Cassandra
Teradata
Oracle
Plus various changes to our design
Some of these are necessary for certain technologies.
Denormalization
Sharding strategies
Nested data
Tune our existing Star Schema and Tables
BigQuery is
A massively parallel datastore
Columnar
Queries are SQL Select statements
Uses a tree structure to distribute queries across nodes
How many nodes?
10,000 nodes!
And what price?
3.5 cents/GB
Resource Pricing
Query cost is per GB of data in the columns processed.
Interactive Queries: $0.035 per GB
Batch Queries: $0.02 per GB
Storage: $0.12 per GB/month
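As a sanity check on that arithmetic (a trivial sketch using the rates listed above; modern SQL dialects let you SELECT constant expressions), the math can be written as a query:

SELECT
  200 * 0.035 AS interactive_cost_usd,  -- 200 GB scanned interactively: $7.00
  200 * 0.020 AS batch_cost_usd         -- the same scan as a batch query: $4.00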
Which is great for our big queries
A gnarly query that scans 200 GB of data costs about $7 in BigQuery
(200 GB × $0.035/GB).
If that same query takes 2 hours to run on a $60/hr cluster of a
competing technology, a single run costs $120.
It's a little more complicated because, in theory, several of those
queries could run simultaneously on the competing tech.
Still, even with four queries sharing that cluster, BigQuery is about
4× cheaper, plus the speed improvement.
Example: Github data from past year
3.5 GB Table
SELECT type, COUNT(*) AS num FROM [publicdata:samples.github_timeline]
GROUP BY type ORDER BY num DESC;
Query complete (1.1s elapsed, 75.0 MB processed)
Event Type num
PushEvent 2,686,723
CreateEvent 964,830
WatchEvent 581,029
IssueCommentEvent 507,724
GistEvent 366,643
IssuesEvent 305,479
ForkEvent 180,712
PullRequestEvent 173,204
FollowEvent 156,427
GollumEvent 104,808
Cost: $0.0026
or about four for a penny
It was love at first type.
But Then...
Reality
Uploaded our test dataset, which is 250 GB.
Docs are good, tools are good.
Hurdle 1: only one JOIN per query.
OK, rewrite as ugly nested selects (sketched below)...
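For example (a hypothetical sketch with made-up inventory, store, and product tables, not our actual schema), you bury one join inside a subselect so that each SELECT level contains only a single JOIN:

SELECT p.ProductName, s.StoreName, s.InventoryQuantity
FROM (
  -- first join: inventory facts to stores
  SELECT i.ProductId, st.StoreName, i.InventoryQuantity
  FROM BigDataTest.inventory i
  JOIN BigDataTest.stores st ON i.StoreId = st.StoreId
) s
-- second join: the subselect's rows to products
JOIN BigDataTest.products p ON s.ProductId = p.ProductId

Each additional level of nesting buys you one more join, at some cost in readability.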
Result
Round 2
No problem, I had seen that joins were somewhat experimental.
Try the denormalized version of the data.
SELECT ProductId, StoreId, ProductSizeId, InventoryDate,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId, ProductSizeId, InventoryDate
The first error message helpfully says: try GROUP EACH BY.
(EACH was a legacy-SQL hint that the grouping would produce a large
number of distinct groups.)
Final Result
It's not you, it's me
The documentation had some semi-useful information:
Because the system is interactive, queries that produce a large number of
groups might fail. The use of the TOP function instead of GROUP BY might
solve the problem.
However, the BigQuery TOP function only operates on one column.
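For reference, TOP in the legacy syntax takes a single field plus a row count, roughly like this (a sketch; the column choice is illustrative):

SELECT TOP(ProductId, 1000), COUNT(*)
FROM BigDataTest.denorm

There is no way to hand it a compound key like (ProductId, StoreId, ProductSizeId, InventoryDate), which is what our grouping needs.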
At this point I had jumped through enough hoops. I posted
on Stack Overflow, the official support channel according to
the docs, and have gotten no response.
Epilogue
Simplifying my query down to two grouping columns did get it to run,
with a LIMIT clause.
SELECT ProductId, StoreId,
  AVG(InventoryQuantity) AS InventoryQuantity
FROM BigDataTest.denorm
GROUP EACH BY ProductId, StoreId
LIMIT 1000
Query complete (4.5s elapsed, 28.1 GB processed)
Without a LIMIT, it gives: Error: Response too large to return.
Perhaps there is still hope for me and BigQuery...
Me
Like this talk?
@gabehamilton
My Twitter feed is just technical stuff.
or slideshare.net/gabehamilton
