Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani


  1. Crunching Data with BigQuery
     Fast analysis of Big Data
     Jordan Tigani, Software Engineer
  2. [Slide filled with binary-encoded ASCII text; the legible portion decodes to "Answer to the Ultimate Question of Life, the Universe, and Everyt..."]
  3. Big Data at Google
     72 hours
     100 million gigabytes
  4. SELECT
       kick_ass_product_plan AS strategy,
       AVG(kicking_factor) AS awesomeness
     FROM lots_of_data
     GROUP BY strategy
  5. +-------------+----------------+
     | strategy    | awesomeness    |
     +-------------+----------------+
     | "Forty-two" | 1000000.01     |
     +-------------+----------------+
     1 row in result set (10.2 s)
     Scanned 100GB
  6. Regular expressions on 13 billion rows...
  7. 13 billion rows
     1 TB of data in 4 tables
     FAST!
  8. Google's Internal Technology: Dremel
  9. MapReduce is Flexible but Heavy
     • Master constructs the plan and begins spinning up workers
     • Mappers read and write to distributed storage
     • Map => Shuffle => Reduce
     • Reducers read and write to distributed storage
     [Diagram: Master, Mappers, Reducer, Distributed Storage]
  10. MapReduce is Flexible but Heavy
      [Diagram: Stage 1 and Stage 2, each with a Master, Mappers, and a Reducer, sharing Distributed Storage]
  11. Dremel vs MapReduce
      • MapReduce
        o Flexible batch processing
        o High overall throughput
        o High latency
      • Dremel
        o Optimized for interactive SQL queries
        o Very low latency
  12. Dremel Architecture
      • Partial Reduction
      • Diskless data flow
      • Long lived shared serving tree
      • Columnar Storage
      [Diagram: Mixer 0 at the root, two Mixer 1 nodes, four Leaf nodes, Distributed Storage]
  13. Simple Query
      SELECT
        state, COUNT(*) count_babies
      FROM [publicdata:samples.natality]
      WHERE year >= 1980 AND year < 1990
      GROUP BY state
      ORDER BY count_babies DESC
      LIMIT 10
  14. [Diagram: how the Simple Query runs on the serving tree]
      Mixer 0: LIMIT 10; ORDER BY count_babies DESC; COUNT(*) GROUP BY state
      Mixer 1 (two nodes): COUNT(*) GROUP BY state; each sends O(50 states) to Mixer 0
      Leaf (four nodes): COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990; each sends O(50 states) to its Mixer 1
      Distributed Storage: SELECT state, year; O(Rows ~140M)
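The serving-tree diagram can be sketched in a few lines of Python. This is only an illustration of partial reduction, not Dremel's implementation: the sample rows, the round-robin sharding, and the function names `leaf`, `mixer`, and `run_query` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample rows of (state, year); the real data lives in
# [publicdata:samples.natality].
ROWS = [("CA", 1981), ("CA", 1985), ("TX", 1983), ("NY", 1979),
        ("TX", 1989), ("CA", 1990), ("NY", 1984), ("CA", 1988)]


def leaf(rows):
    """Leaf: apply the WHERE filter and compute a partial COUNT(*) GROUP BY state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)


def mixer(partials):
    """Mixer: merge partial counts; output is bounded by the number of states."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_query(rows, num_leaves=4, limit=10):
    # Split the table across leaves (round-robin is an assumption of this sketch).
    shards = [rows[i::num_leaves] for i in range(num_leaves)]
    leaf_results = [leaf(shard) for shard in shards]
    # Two intermediate mixers, each merging half of the leaves.
    half = num_leaves // 2
    level1 = [mixer(leaf_results[:half]), mixer(leaf_results[half:])]
    # The root mixer merges, then applies ORDER BY count DESC and LIMIT.
    root = mixer(level1)
    return sorted(root.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


if __name__ == "__main__":
    print(run_query(ROWS))
```

Each level only ever passes up one count per state, which is why the intermediate results stay small no matter how many rows the leaves scan.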
  15. Modeling Data
  16. Example: Daily Weather Station Data
      weather_station_data
      station  lat    long    mean_temp  humidity  timestamp   year  month  day
      9384     33.57  86.75   89.3       .35       1351005129  2011  04     19
      2857     36.77  119.72  78.5       .24       1351005135  2011  04     19
      3475     40.77  73.98   68         .35       1351015930  2011  04     19
      etc...
  17. Example: Daily Weather Station Data (CSV)
      station, lat, long, mean_temp, year, mon, day
      999999, 36.624, -116.023, 63.6, 2009, 10, 9
      911904, 20.963, -156.675, 83.4, 2009, 10, 9
      916890, -18133, 178433, 76.9, 2009, 10, 9
      943320, -20678, 139488, 73.8, 2009, 10, 9
  18. Organizing BigQuery Tables
      [Diagram: Your Source Data split into one table per day: October 22, October 23, October 24]
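One way to keep per-day tables manageable is to generate their names from dates. The sketch below assumes the naming pattern of the `logs.oct_24_2012_song_activities` table used elsewhere in the deck; the helper names and the zero-padded day are assumptions, not anything BigQuery requires.

```python
from datetime import date, timedelta

# Explicit abbreviations avoid locale-dependent strftime("%b") output.
MONTH_ABBR = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]


def daily_table_name(day, dataset="logs", suffix="song_activities"):
    """Build a per-day table name such as logs.oct_24_2012_song_activities."""
    month = MONTH_ABBR[day.month - 1]
    return f"{dataset}.{month}_{day.day:02d}_{day.year}_{suffix}"


def tables_for_range(start, end):
    """List the table names for every day from start to end, inclusive."""
    days = (end - start).days
    return [daily_table_name(start + timedelta(days=i)) for i in range(days + 1)]


def from_clause(start, end):
    """Comma-separated bracketed tables, as in the deck's multi-table FROM clauses."""
    return ", ".join(f"[{name}]" for name in tables_for_range(start, end))


if __name__ == "__main__":
    print(from_clause(date(2012, 10, 22), date(2012, 10, 24)))
```

Querying a date range then means listing only the tables for those days, so older data is never scanned.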
  19. Modeling Event Data: Social Music Store
      logs.oct_24_2012_song_activities
      USERNAME  ACTIVITY  Cost  SONG           ARTIST      TIMESTAMP
      Michael   LISTEN          Too Close      Alex Clare  1351065562
      Michael   LISTEN          Gangnam Style  PSY         1351105150
      Jim       LISTEN          Complications  Deadmau5    1351075720
      Michael   PURCHASE  0.99  Gangnam Style  PSY         1351115962
  20. Users Who Listened to More than 10 Songs/Day
      SELECT
        UserId, COUNT(*) as ListenActivities
      FROM [logs.oct_24_2012_song_activities]
      GROUP EACH BY UserId
      HAVING ListenActivities > 10
  21. How Many Songs Listened to Total by Listeners of PSY?
      SELECT
        UserId, count(*) as ListenActivities
      FROM [logs.oct_24_2012_song_activities]
      WHERE UserId IN (
        SELECT UserId
        FROM [logs.oct_24_2012_song_activities]
        WHERE artist = 'PSY')
      GROUP EACH BY UserId
      HAVING ListenActivities > 10
  22. Modeling Event Data: Nested and Repeated Values (JSON)
      {"UserID": "Michael",
       "Listens": [
         {"TrackId": 1234, "Title": "Gangnam Style",
          "Artist": "PSY", "Timestamp": 1351075700},
         {"TrackId": 1234, "Title": "Alex Clare",
          "Artist": "Alex Clare", "Timestamp": 1351075700}
       ],
       "Purchases": [
         {"Track": 2345, "Title": "Gangnam Style",
          "Artist": "PSY", "Timestamp": 1351075700, "Cost": 0.99}
       ]}
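To show how the flat event table relates to the nested form, here is a minimal sketch that groups flat rows into one record per user and writes them as newline-delimited JSON. The field names follow the slides loosely; `nest_events` and `to_ndjson` are hypothetical helpers, and the sample rows are copied from the flat table.

```python
import json

# Flat event rows, modeled on the social music store table.
EVENTS = [
    {"UserID": "Michael", "Activity": "LISTEN", "Title": "Too Close",
     "Artist": "Alex Clare", "Timestamp": 1351065562},
    {"UserID": "Michael", "Activity": "LISTEN", "Title": "Gangnam Style",
     "Artist": "PSY", "Timestamp": 1351105150},
    {"UserID": "Jim", "Activity": "LISTEN", "Title": "Complications",
     "Artist": "Deadmau5", "Timestamp": 1351075720},
    {"UserID": "Michael", "Activity": "PURCHASE", "Title": "Gangnam Style",
     "Artist": "PSY", "Timestamp": 1351115962, "Cost": 0.99},
]


def nest_events(events):
    """Group flat events into one record per user with repeated child fields."""
    users = {}
    for event in events:
        record = users.setdefault(
            event["UserID"],
            {"UserID": event["UserID"], "Listens": [], "Purchases": []})
        # Drop the columns that are now implied by the record structure.
        item = {k: v for k, v in event.items() if k not in ("UserID", "Activity")}
        key = "Listens" if event["Activity"] == "LISTEN" else "Purchases"
        record[key].append(item)
    return list(users.values())


def to_ndjson(records):
    """Serialize one JSON object per line (newline-delimited JSON)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


if __name__ == "__main__":
    print(to_ndjson(nest_events(EVENTS)))
```

Each output line is a complete JSON object, which is the newline-delimited JSON format the deck lists as a source data format.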
  23. Which Users Have Listened to Beyonce?
      SELECT
        UserID,
        COUNT(ListenActivities.artist) WITHIN RECORD AS song_count
      FROM [logs.oct_24_2012_songactivities]
      WHERE UserID IN (
        SELECT UserID
        FROM [logs.oct_24_2012_songactivities]
        WHERE ListenActivities.artist = 'Beyonce');
  24. What Position are PSY songs in our Users' Daily Playlists?
      SELECT
        UserID,
        POSITION(ListenActivities.artist)
      FROM [sample_music_logs.oct_24_2012_songactivities]
      WHERE ListenActivities.artist = 'PSY';
  25. Average Position of Songs by PSY in All Daily Playlists?
      SELECT
        AVG(POSITION(ListenActivities.artist))
      FROM
        [sample_music_logs.oct_24_2012_songactivities],
        [sample_music_logs.oct_23_2012_songactivities],
        /* etc... */
      WHERE ListenActivities.artist = 'PSY';
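The nested-record queries can be mimicked in plain Python to make their intent concrete. This sketch assumes that `POSITION` yields the one-based index of an element within its repeated field and that `COUNT(...) WITHIN RECORD` counts elements per record; the sample records and function names are invented for illustration.

```python
# Hypothetical nested records: one per user, with a repeated ListenActivities field.
RECORDS = [
    {"UserID": "Michael",
     "ListenActivities": [{"artist": "Alex Clare"}, {"artist": "PSY"}, {"artist": "PSY"}]},
    {"UserID": "Jim",
     "ListenActivities": [{"artist": "Deadmau5"}, {"artist": "PSY"}]},
]


def positions_of(records, artist):
    """(UserID, one-based position) for every repeated element matching the artist."""
    return [(record["UserID"], index)
            for record in records
            for index, item in enumerate(record["ListenActivities"], start=1)
            if item["artist"] == artist]


def average_position(records, artist):
    """Mean position across all matches, or None when nothing matches."""
    positions = [position for _, position in positions_of(records, artist)]
    return sum(positions) / len(positions) if positions else None


def count_within_record(records):
    """Per-record element count, analogous to COUNT(...) WITHIN RECORD."""
    return [(record["UserID"], len(record["ListenActivities"])) for record in records]


if __name__ == "__main__":
    print(positions_of(RECORDS, "PSY"))
    print(average_position(RECORDS, "PSY"))
    print(count_within_record(RECORDS))
```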
  26. Summary: Choosing a BigQuery Data Model
      • "Shard" your Data Using Multiple Tables
      • Source Data Files
        • CSV format
        • Newline-delimited JSON
      • Using Nested and Repeated Records
        • Simplify Some Types of Queries
        • Often Matches Document Database Models
  27. Developing with BigQuery
  28. Upload Your Data
      [Diagram: Google Cloud Storage and BigQuery]
  29. Load your Data into BigQuery
      "jobReference":{
        "projectId":"605902584318"},
      "configuration":{
        "load":{
          "destinationTable":{
            "projectId":"605902584318",
            "datasetId":"my_dataset",
            "tableId":"widget_sales"},
          "sourceUris":[
            "gs://widget-sales-data/2012080100.csv"],
          "schema":{
            "fields":[{
              "name":"widget",
              "type":"string"},
              ...
      POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
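The load request body can be assembled programmatically before it is POSTed. The sketch below only builds the JSON shown on the slide and makes no network call; `build_load_job` and `jobs_url` are hypothetical helpers, and only the `widget` field is included because the rest of the schema is elided on the slide.

```python
import json


def build_load_job(project_id, dataset_id, table_id, source_uris, fields):
    """Assemble a load-job body with the same nesting as the slide."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {
                    "fields": [{"name": name, "type": kind} for name, kind in fields]
                },
            }
        },
    }


def jobs_url(project_id):
    """URL the slide POSTs the job to."""
    return f"https://www.googleapis.com/bigquery/v2/projects/{project_id}/jobs"


if __name__ == "__main__":
    job = build_load_job(
        "605902584318", "my_dataset", "widget_sales",
        ["gs://widget-sales-data/2012080100.csv"],
        [("widget", "string")],  # remaining fields are elided on the slide
    )
    print(jobs_url("605902584318"))
    print(json.dumps(job, indent=2))
```

Sending the request would additionally need an authorized HTTP client, which is outside the scope of this sketch.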
  30. Query Away!
      "jobReference":{
        "projectId":"605902584318",
        "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
        "maxResults":100,
        "apiVersion":"v2"}
      POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
  31. Libraries
      • Python
      • JavaScript
      • Java
      • Go
      • .NET
      • PHP
      • Ruby
      • Objective-C
  32. Libraries - Example JavaScript Query
      var request = gapi.client.bigquery.jobs.query({
        projectId: project_id,
        timeoutMs: 30000,
        // The query must be passed as a string.
        query: "SELECT state, AVG(mother_age) AS theav " +
               "FROM [publicdata:samples.natality] " +
               "WHERE year=2000 AND ever_born=1 " +
               "GROUP BY state ORDER BY theav DESC;"
      });
      request.execute(function(response) {
        console.log(response);
        $.each(response.result.rows, function(i, item) {
          ...
  33. Custom Code and the Google Chart Tools API
  34. Google Spreadsheets
  35. Commercial Visualization Tools
  36. Demo: Using BigQuery on BigQuery
  37. BigQuery - Aggregate Big Data Analysis in Seconds
      • Full table scans FAST
      • Aggregate Queries on Massive Datasets
      • Supports Flat and Nested/Repeated Data Models
      • It's an API
      Get started now: http://developers.google.com/bigquery/
  38. SELECT questions FROM audience
      SELECT Thank You! FROM jordan
      http://developers.google.com/bigquery
  39. Schema definition
      birth_record: parent_id_mother, parent_id_father, plurality, is_male, race, weight
      parents: id, race, age, cigarette_use, state
  40. Schema definition
      birth_record: mother_race, mother_age, mother_cigarette_use, mother_state, father_race, father_age, father_cigarette_use, father_state, plurality, is_male, race, weight
  41. Tools to prepare your data
      • App Engine MapReduce
      • Commercial ETL tools
        • Pervasive
        • Informatica
        • Talend
      • UNIX command-line
  42. Schema definition - sharding
      Tables: birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016
      Fields in each table: mother_race, mother_age, mother_cigarette_use, mother_state, father_race, father_age, father_cigarette_use, father_state, plurality, is_male, race, weight
  43. Visualizing your Data
  44. BigQuery architecture
  45. "If you do a table scan over a 1TB table, you're going to have a bad time."
      Anonymous 16th century Italian Philosopher-Monk
  46. Goal: Perform a 1 TB table scan in 1 second
      Parallelize Parallelize Parallelize!
      • Reading 1 TB / second from disk:
        • 10k+ disks
      • Processing 1 TB / sec:
        • 5k processors
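The disk and processor counts follow from simple division. The sketch below works out the per-unit throughput they imply, assuming a decimal terabyte (10^12 bytes) and an even spread of work; the function name is made up for the example.

```python
TB = 10**12  # decimal terabyte, an assumption of this sketch


def per_unit_rate_mb_s(total_bytes, seconds, units):
    """Throughput each unit must sustain, in MB/s, if the work is spread evenly."""
    return total_bytes / seconds / units / 10**6


# 1 TB in 1 second across 10k disks and 5k processors.
disk_rate = per_unit_rate_mb_s(TB, 1, 10_000)
cpu_rate = per_unit_rate_mb_s(TB, 1, 5_000)

if __name__ == "__main__":
    print(f"each disk reads {disk_rate:.0f} MB/s")
    print(f"each processor handles {cpu_rate:.0f} MB/s")
```

Under these assumptions each disk has to deliver about 100 MB/s and each processor about 200 MB/s, which is why the slide's answer is massive parallelism rather than faster hardware.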
  47. Data access: Column Store
      [Diagram: Record Oriented Storage vs. Column Oriented Storage]
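The difference between the two layouts can be illustrated with a toy example. This sketch counts how many stored values an average over one field has to touch in each layout; the sample rows borrow from the weather station table, and the in-memory dictionaries are only a stand-in for real on-disk formats.

```python
# Record-oriented layout: each row keeps all of its fields together.
ROWS = [
    {"station": 9384, "lat": 33.57, "mean_temp": 89.3},
    {"station": 2857, "lat": 36.77, "mean_temp": 78.5},
    {"station": 3475, "lat": 40.77, "mean_temp": 68.0},
]


def to_columns(rows):
    """Column-oriented layout: one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def avg_temp_row_store(rows):
    """Average mean_temp; every field of every row is read along the way."""
    touched = 0
    total = 0.0
    for row in rows:
        touched += len(row)  # the whole record comes off disk
        total += row["mean_temp"]
    return total / len(rows), touched


def avg_temp_column_store(columns):
    """Average mean_temp; only that one column is read."""
    column = columns["mean_temp"]
    return sum(column) / len(column), len(column)


if __name__ == "__main__":
    print(avg_temp_row_store(ROWS))
    print(avg_temp_column_store(to_columns(ROWS)))
```

With three fields per row, the row layout touches three times as many values for the same answer; the gap widens as tables get wider.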
  48. BigQuery Architecture
      [Diagram: Mixer 0 at the root; three Mixer 1 nodes covering Shard 0-8, Shard 9-16, and Shard 17-24; leaf shards (Shard 0, Shard 10, Shard 12, Shard 20, Shard 24 shown); Distributed Storage (e.g. GFS)]
  49. Running your Queries
  50. BigQuery SQL Example: Simple aggregates
      SELECT COUNT(foo), MAX(foo), STDDEV(foo)
      FROM ...
  51. BigQuery SQL Example: Complex Processing
      SELECT ... FROM ....
      WHERE REGEXP_MATCH(url, ".com$")
        AND user CONTAINS 'test'
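One detail of the pattern `.com$` is worth checking: an unescaped dot matches any character. The sketch below uses Python's `re` module as a stand-in and assumes `REGEXP_MATCH` performs an unanchored search; the URLs are made-up examples.

```python
import re

UNESCAPED = re.compile(r".com$")   # "." matches any single character
ESCAPED = re.compile(r"\.com$")    # "\." matches only a literal dot


def regexp_match(pattern, text):
    """Unanchored search, which is how this sketch assumes REGEXP_MATCH behaves."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for url in ("http://example.com", "http://example.org/telecom"):
        print(url, regexp_match(UNESCAPED, url), regexp_match(ESCAPED, url))
```

If only URLs ending in a literal ".com" should match, escaping the dot is the safer choice.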
  52. BigQuery SQL Example: Nested SELECT
      SELECT COUNT(*) FROM
        (SELECT foo ..... )
      GROUP BY foo
  53. BigQuery SQL Example: Small JOIN
      SELECT huge_table.foo
      FROM huge_table
      JOIN small_table
      ON small_table.foo = huge_table.foo
  54. BigQuery Architecture: Small Join
      [Diagram: Mixer 0 at the root; two Mixer 1 nodes covering Shard 0-8 and Shard 17-24; leaf shards (Shard 0, Shard 20, Shard 24 shown); Distributed Storage (e.g. GFS)]
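A small join is commonly implemented by sending a copy of the small table to every worker so that each shard of the huge table can be joined locally. The slides do not spell out the mechanism, so the sketch below is an assumption about the general technique (a broadcast join), with invented data and function names, and it assumes join keys in the small table are unique.

```python
def leaf_join(shard_rows, small_lookup):
    """Each leaf joins its own shard against a local copy of the small table."""
    return [row["foo"] for row in shard_rows if row["foo"] in small_lookup]


def small_join(huge_shards, small_table):
    """SELECT huge_table.foo ... JOIN small_table ON small_table.foo = huge_table.foo."""
    # The small table fits in memory, so its keys are broadcast to every leaf.
    small_lookup = {row["foo"] for row in small_table}
    results = []
    for shard in huge_shards:
        results.extend(leaf_join(shard, small_lookup))
    return results


if __name__ == "__main__":
    huge_shards = [[{"foo": 1}, {"foo": 2}], [{"foo": 3}, {"foo": 2}]]
    small_table = [{"foo": 2}, {"foo": 3}]
    print(small_join(huge_shards, small_table))
```

No rows of the huge table move between workers, which is what keeps this kind of join cheap as long as one side stays small.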
  55. Other new features!
  56. Batch queries!
      • Don't need interactive queries for some jobs?
        • priority: "BATCH"
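The slide gives only the setting `priority: "BATCH"`. As a rough sketch of where it might go, the helper below places it under `configuration.query` in a job body, which matches my understanding of the v2 jobs resource but should be treated as an assumption; `build_query_job` is a hypothetical name and nothing is sent over the network.

```python
import json


def build_query_job(project_id, sql, batch=False):
    """Assemble a query-job body; the nesting under configuration.query is assumed."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "query": {
                "query": sql,
                # BATCH lets the service run the query when resources are free.
                "priority": "BATCH" if batch else "INTERACTIVE",
            }
        },
    }


if __name__ == "__main__":
    sql = "SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales"
    print(json.dumps(build_query_job("605902584318", sql, batch=True), indent=2))
```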
  57. That's it
      • API
      • Column-based datastore
      • Full table scans FAST
      • Aggregates
      • Commercial tool support
      • Use cases
  58. SELECT questions FROM audience
      SELECT Thank You! FROM ryan
      http://developers.google.com/bigquery
      @ryguyrg
      http://profiles.google.com/ryan.boyd
  59. Data access: Column Store
      [Diagram: Record Oriented Storage vs. Column Oriented Storage]
  60. A Little Later ...
      Row  wp_namespace  Revs
      1    0             53697002
      2    1             6151228
      3    3             5519859
      4    4             4184389
      5    2             3108562
      6    10            1052044
      7    6             877417
      8    14            838940
      9    5             651749
      10   11            192534
      11   100           148135
      Underlying table:
      • Wikipedia page revision records
      • Rows: 314 million
      • Byte size: 35.7 GB
      Query Stats:
      • Scanned 7G of data
      • <5 seconds
      • ~ 100M rows scanned / second
  61. [Diagram: query execution on the serving tree]
      Mixer 0: ORDER BY Revs DESC; COUNT (revision_id) GROUP BY wp_namespace
      Mixer 1 (two nodes): COUNT (revision_id) GROUP BY wp_namespace
      Leaf (four nodes): COUNT (revision_id) GROUP BY wp_namespace WHERE timestamp > CUTOFF
      Distributed Storage: SELECT wp_namespace, revision_id; 10 GB / s
  62. "Multi-stage" Query
      SELECT
        LogEdits, COUNT(contributor_id) Contributors
      FROM (
        SELECT
          contributor_id,
          INTEGER(LOG10(COUNT(revision_id))) LogEdits
        FROM [publicdata:samples.wikipedia]
        GROUP EACH BY contributor_id)
      GROUP BY LogEdits
      ORDER BY LogEdits DESC
  63. [Diagram: execution of the "multi-stage" query]
      Mixer 0: ORDER BY LogEdits DESC; COUNT(contributor_id) GROUP BY LogEdits
      Mixer 1 (two nodes): COUNT(contributor_id) GROUP BY LogEdits
      Shuffler (two nodes): COUNT(contributor_id) GROUP BY LogEdits; SELECT LE, Id
      Leaf (two nodes): COUNT(*) GROUP BY contributor_id; shuffle by contributor_id (N^2, GB/s)
      Distributed Storage: SELECT contributor_id
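The two stages of the query can be traced in plain Python. This sketch counts revisions per contributor, buckets each contributor by the integer part of log10 of that count, and then counts contributors per bucket; the sample data and the function name are invented, and `int(math.log10(n))` stands in for `INTEGER(LOG10(...))`.

```python
import math
from collections import Counter


def log_edit_histogram(revisions):
    """revisions: iterable of (contributor_id, revision_id) pairs."""
    # Stage 1 (inner query): COUNT(revision_id) GROUP EACH BY contributor_id.
    edits = Counter(contributor for contributor, _revision in revisions)
    # Stage 2 (outer query): COUNT(contributor_id) GROUP BY LogEdits.
    buckets = Counter(int(math.log10(count)) for count in edits.values())
    # ORDER BY LogEdits DESC.
    return sorted(buckets.items(), reverse=True)


if __name__ == "__main__":
    sample = ([("a", 0)]
              + [("b", i) for i in range(12)]
              + [("c", i) for i in range(150)]
              + [("d", i) for i in range(3)])
    print(log_edit_histogram(sample))
```

The first stage has one group per contributor, which is the high-cardinality case that `GROUP EACH BY` and the shuffle are meant for; the second stage has only a handful of groups.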
  64. When to use EACH
      • Shuffle definitely adds some overhead
      • Poor query performance if used incorrectly
      • GROUP BY
        o Groups << Rows => Unbalanced load
        o Example: GROUP BY state
      • GROUP EACH BY
        o Groups ~ Rows
        o Example: GROUP BY user_id
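The shuffle behind `GROUP EACH BY` can be sketched as hash partitioning: rows with the same key always land on the same shuffler, so each shuffler can finish its groups independently. This is an illustration of the general technique under that assumption, not BigQuery's implementation; `zlib.crc32` is used only to get a deterministic hash, and the names are invented.

```python
import zlib
from collections import Counter


def shuffle(keys, num_shufflers):
    """Route each key to a shuffler chosen by a deterministic hash of the key."""
    buckets = [[] for _ in range(num_shufflers)]
    for key in keys:
        buckets[zlib.crc32(key.encode("utf-8")) % num_shufflers].append(key)
    return buckets


def group_each_by(keys, num_shufflers=4):
    """COUNT(*) per key, computed independently on each shuffler."""
    results = {}
    for bucket in shuffle(keys, num_shufflers):
        # Keys are disjoint across buckets, so the partial results never collide.
        results.update(Counter(bucket))
    return results


if __name__ == "__main__":
    print(group_each_by(["u1", "u2", "u1", "u3", "u2", "u1"]))
```

Moving every row to its shuffler is the overhead the slide warns about; it pays off only when there are too many groups for a single node to hold.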
