Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with BigQuery
Fast analysis of Big Data

                            Jordan Tigani, Software Engineer
[Slide background: a stream of binary digits]
Big Data at Google




      72 hours

      100 million gigabytes
SELECT
  kick_ass_product_plan AS strategy,
  AVG(kicking_factor) AS awesomeness
FROM
  lots_of_data
GROUP BY
  strategy
+-------------+----------------+
| strategy    | awesomeness    |
+-------------+----------------+
| "Forty-two" | 1000000.01     |
+-------------+----------------+
1 row in result set (10.2 s)
Scanned 100GB
Regular expressions on 13 billion rows...
13 Billion rows
1 TB of data in 4 tables
FAST!
Google's Internal Technology:
Dremel
MapReduce is Flexible but Heavy

                                     •   Master constructs the plan and
               Mapper      Mapper        begins spinning up workers

                                     •   Mappers read and write to
                                         distributed storage
    Master     Distributed Storage

                                     •   Map => Shuffle => Reduce


                     Reducer
                                     •   Reducers read and write to
                                         distributed storage
MapReduce is Flexible but Heavy

                  Stage 1                    Stage 2

               Mapper      Mapper       Mapper        Mapper




    Master                  Distributed Storage                Master




                        Reducer             Reducer
Dremel vs MapReduce

•   MapReduce
    o Flexible batch processing
    o High overall throughput
    o High latency

•   Dremel
    o Optimized for interactive SQL queries
    o Very low latency
Dremel Architecture
                        Mixer 0


                                                      •   Partial Reduction
       Mixer 1                           Mixer 1
                                                      •   Diskless data flow

                                                      •   Long lived shared serving tree
Leaf             Leaf             Leaf         Leaf



                                                      •   Columnar Storage

             Distributed Storage
Simple Query
SELECT
    state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
    year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
LIMIT 10
                                                      ORDER BY count_babies DESC
                        Mixer 0
                                                      COUNT(*)
                                                      GROUP BY state


                                                                       O(50 states)
                                                                       O(50 states)
       Mixer 1                           Mixer 1      COUNT(*)
                                                      GROUP BY state


                                                                       O(50 states)
                                                      COUNT(*)
Leaf             Leaf             Leaf         Leaf
                                                      GROUP BY state
                                                      WHERE year >= 1980 and year < 1990


                                                                    O(Rows ~140M)
             Distributed Storage
                                                      SELECT state, year
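The serving tree above can be mimicked in a few lines of Python. This is only a toy sketch, not Dremel's implementation: the sample rows, the split into four slices, and the function names leaf, mixer, and root are assumptions made for illustration. It shows why each level above the leaves only handles O(50 states) of data.

```python
from collections import Counter

# Hypothetical sample rows (state, year); the real table has ~140M rows.
ROWS = [("CA", 1981), ("CA", 1985), ("TX", 1983), ("NY", 1979),
        ("TX", 1989), ("CA", 1990), ("NY", 1984), ("CA", 1988)]

def leaf(rows):
    """Leaf: scan its slice, apply the WHERE filter, count per state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)

def mixer(partials):
    """Mixer: merge partial counts; output size is bounded by the number of states."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def root(partials, limit=10):
    """Mixer 0: final merge, then ORDER BY count DESC and LIMIT."""
    merged = mixer(partials)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

# Four leaves, two per intermediate mixer, as in the diagram.
slices = [ROWS[0:2], ROWS[2:4], ROWS[4:6], ROWS[6:8]]
leaf_results = [leaf(s) for s in slices]
mixer1_a = mixer(leaf_results[:2])
mixer1_b = mixer(leaf_results[2:])
result = root([mixer1_a, mixer1_b])
print(result)
```

Because COUNT(*) can be merged by simple addition, the partial results stay small at every level no matter how many rows the leaves scan.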
Modeling Data
Example: Daily Weather Station Data


                            weather_station_data
station lat    long    mean_temp   humidity   timestamp    year   month   day
9384     33.57 86.75   89.3        .35        1351005129   2011   04      19
2857     36.77 119.72 78.5         .24        1351005135   2011   04      19
3475     40.77 73.98   68          .35        1351015930   2011   04      19
etc...
Example: Daily Weather Station Data

station,   lat,      long,      mean_temp,   year,   mon,   day
999999,    36.624,   -116.023,  63.6,        2009,   10,    9
911904,    20.963,   -156.675,  83.4,        2009,   10,    9
916890,    -18.133,  178.433,   76.9,        2009,   10,    9
943320,    -20.678,  139.488,   73.8,        2009,   10,    9




                            CSV
Organizing BigQuery Tables

                             October 22




                             October 23



   Your Source
      Data                   October 24
Modeling Event Data: Social Music Store


                    logs.oct_24_2012_song_activities
USERNAME   ACTIVITY       Cost    SONG            ARTIST       TIMESTAMP
Michael    LISTEN                 Too Close       Alex Clare   1351065562
Michael    LISTEN                 Gangnam Style   PSY          1351105150
Jim        LISTEN                 Complications   Deadmau5     1351075720
Michael    PURCHASE       0.99    Gangnam Style   PSY          1351115962
Users Who Listened to More than 10 Songs/Day
SELECT
  UserId, COUNT(*) as ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
GROUP EACH BY
  UserId
HAVING
  ListenActivities > 10
How Many Songs Listened to Total by Listeners of PSY?
SELECT
  UserId, count(*) as ListenActivities
FROM
  [logs.oct_24_2012_song_activities]
WHERE UserId IN (
     SELECT
       UserId
     FROM
       [logs.oct_24_2012_song_activities]
     WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
  ListenActivities > 10
Modeling Event Data: Nested and Repeated Values
{"UserID" : "Michael",
 "Listens":   [
     {"TrackId":1234,"Title":"Gangnam Style",
        "Artist":"PSY","Timestamp":1351075700},
     {"TrackId":1234,"Title":"Alex Clare",
        "Artist":"Alex Clare","Timestamp":1351075700}
  ],
  "Purchases": [
     {"Track":2345,"Title":"Gangnam Style",
        "Artist":"PSY","Timestamp":1351075700,"Cost":0.99}
  ]}


                         JSON
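A flat event log like the Social Music Store table can be folded into this nested shape before loading. The following is a minimal sketch, assuming the input arrives as tuples in the column order of that table; the helper names nest_events and to_ndjson are hypothetical, and track IDs are omitted because the flat table does not carry them.

```python
import json

# Flat events taken from the earlier table:
# (username, activity, cost, song, artist, timestamp).
EVENTS = [
    ("Michael", "LISTEN", None, "Too Close", "Alex Clare", 1351065562),
    ("Michael", "LISTEN", None, "Gangnam Style", "PSY", 1351105150),
    ("Jim", "LISTEN", None, "Complications", "Deadmau5", 1351075720),
    ("Michael", "PURCHASE", 0.99, "Gangnam Style", "PSY", 1351115962),
]

def nest_events(events):
    """Group flat events into one record per user with repeated fields."""
    users = {}
    for user, activity, cost, song, artist, ts in events:
        record = users.setdefault(
            user, {"UserID": user, "Listens": [], "Purchases": []})
        item = {"Title": song, "Artist": artist, "Timestamp": ts}
        if activity == "LISTEN":
            record["Listens"].append(item)
        elif activity == "PURCHASE":
            item["Cost"] = cost
            record["Purchases"].append(item)
    return list(users.values())

def to_ndjson(records):
    """Serialize as newline-delimited JSON: one complete object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

records = nest_events(EVENTS)
ndjson = to_ndjson(records)
print(ndjson)
```

Each output line is a complete JSON object, which is the newline-delimited form listed among the source data formats.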
Which Users Have Listened to Beyonce?
SELECT
  UserID,
  COUNT(ListenActivities.artist) WITHIN RECORD
    AS song_count
FROM
  [logs.oct_24_2012_songactivities]
WHERE
  UserID IN (SELECT UserID
             FROM [logs.oct_24_2012_songactivities]
             WHERE ListenActivities.artist = 'Beyonce');
What Position are PSY songs in our Users' Daily Playlists?
SELECT
  UserID,
  POSITION(ListenActivities.artist)
FROM
  [sample_music_logs.oct_24_2012_songactivities]
WHERE
  ListenActivities.artist = 'PSY';
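The two nested-data functions used in these queries can be read as ordinary per-record operations. The sketch below emulates them in plain Python on made-up records; it is not BigQuery code, and it assumes POSITION is one-based, as the legacy SQL documentation describes. The helper names count_within_record and positions_of are hypothetical.

```python
# Hypothetical nested records; field names follow the queries above.
RECORDS = [
    {"UserID": "Michael",
     "ListenActivities": [{"artist": "Alex Clare"}, {"artist": "PSY"},
                          {"artist": "PSY"}]},
    {"UserID": "Jim",
     "ListenActivities": [{"artist": "Deadmau5"}]},
]

def count_within_record(record, field="ListenActivities", key="artist"):
    """Like COUNT(field.key) WITHIN RECORD: count non-null values per record."""
    return sum(1 for item in record[field] if item.get(key) is not None)

def positions_of(record, artist, field="ListenActivities"):
    """Like POSITION(...): 1-based positions of the matching repeated values."""
    return [i for i, item in enumerate(record[field], start=1)
            if item["artist"] == artist]

song_counts = {r["UserID"]: count_within_record(r) for r in RECORDS}
psy_positions = {r["UserID"]: positions_of(r, "PSY") for r in RECORDS}
print(song_counts)
print(psy_positions)
```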
Average Position of Songs by PSY in All Daily Playlists?
SELECT
  AVG(POSITION(ListenActivities.artist))
FROM
  [sample_music_logs.oct_24_2012_songactivities],
  [sample_music_logs.oct_23_2012_songactivities],
  /* etc... */
WHERE
  ListenActivities.artist = 'PSY';
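In this legacy SQL dialect a comma-separated table list in FROM is read as a union of the tables, so querying many daily shards means generating that list. A minimal sketch of doing so follows; the naming pattern is inferred from the two table names shown, and the helpers daily_table and build_query are hypothetical.

```python
from datetime import date, timedelta

# Month abbreviations spelled out to avoid locale-dependent formatting.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

def daily_table(day, dataset="sample_music_logs"):
    """Bracketed name of the per-day table, e.g. oct_24_2012_songactivities."""
    return "[{}.{}_{:02d}_{}_songactivities]".format(
        dataset, MONTHS[day.month - 1], day.day, day.year)

def build_query(first_day, num_days):
    """Build the average-position query over num_days consecutive daily tables."""
    days = [first_day + timedelta(days=i) for i in range(num_days)]
    tables = ",\n  ".join(daily_table(d) for d in days)
    return ("SELECT\n  AVG(POSITION(ListenActivities.artist))\n"
            "FROM\n  " + tables + "\n"
            "WHERE\n  ListenActivities.artist = 'PSY';")

query = build_query(date(2012, 10, 23), 2)
print(query)
```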
Summary: Choosing a BigQuery Data Model
• "Shard" your Data Using Multiple Tables
• Source Data Files
  • CSV format
  • Newline-delimited JSON
• Using Nested and Repeated Records
  • Simplify Some Types of Queries
  • Often Matches Document Database Models
Developing with BigQuery
Upload Your Data




                   Google Cloud
                                  BigQuery
                     Storage
Load your Data into BigQuery
"jobReference":{
   "projectId":"605902584318"},
"configuration":{
   "load":{
      "destinationTable":{
         "projectId":"605902584318",
         "datasetId":"my_dataset",
         "tableId":"widget_sales"},
      "sourceUris":[
         "gs://widget-sales-data/2012080100.csv"],
      "schema":{
         "fields":[{
               "name":"widget",
               "type":"string"},
                                         ...

POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
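The request body above is plain JSON, so it can be assembled programmatically. The sketch below only builds and prints the payload; it sends nothing and handles no authentication. The helper names build_load_job and jobs_url are hypothetical, and the schema contains only the single field visible on the slide because the rest is elided there.

```python
import json

def build_load_job(project_id, dataset_id, table_id, source_uris, fields):
    """Assemble a load-job request body shaped like the example above."""
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {"fields": list(fields)},
            }
        },
    }

def jobs_url(project_id):
    """URL the slide posts the job to."""
    return "https://www.googleapis.com/bigquery/v2/projects/{}/jobs".format(
        project_id)

body = build_load_job(
    "605902584318", "my_dataset", "widget_sales",
    ["gs://widget-sales-data/2012080100.csv"],
    # Only this field is shown on the slide; the remaining schema is elided.
    [{"name": "widget", "type": "string"}],
)
payload = json.dumps(body, indent=2)
print("POST", jobs_url("605902584318"))
print(payload)
```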
Query Away!


"jobReference":{
    "projectId":"605902584318",
    "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count
         FROM widget_sales",
    "maxResults":100,
    "apiVersion":"v2"
}



POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
Libraries


•   Python   •   JavaScript
•   Java     •   Go
•   .NET     •   PHP
•   Ruby     •   Objective-C
Libraries - Example JavaScript Query

var request = gapi.client.bigquery.jobs.query({
    'projectId': project_id,
    'timeoutMs': '30000',
    // Build the query by concatenation: a quoted JavaScript string
    // cannot span several source lines.
    'query': 'SELECT state, AVG(mother_age) AS theav ' +
             'FROM [publicdata:samples.natality] ' +
             'WHERE year=2000 AND ever_born=1 ' +
             'GROUP BY state ' +
             'ORDER BY theav DESC;'
});

request.execute(function(response) {
    console.log(response);
    $.each(response.result.rows, function(i, item) {
    ...
Custom Code and the Google Chart Tools API
Google Spreadsheets
Commercial Visualization Tools
Demo: Using BigQuery on BigQuery
BigQuery - Aggregate Big Data Analysis in Seconds

• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API

      Get started now:
      http://developers.google.com/bigquery/
SELECT questions FROM audience

SELECT 'Thank You!'
FROM jordan

http://developers.google.com/bigquery
Schema definition




           birth_record         parents
         parent_id_mother   id
         parent_id_father   race
         plurality          age
         is_male            cigarette_use
         race               state
         weight
Schema definition

                         birth_record
                    mother_race
                    mother_age
                    mother_cigarette_use
                    mother_state
                    father_race
                    father_age
                    father_cigarette_use
                    father_state
                    plurality
                    is_male
                    race
                    weight
Tools to prepare your data

• App Engine MapReduce
• Commercial ETL tools
  • Pervasive
  • Informatica
  • Talend
• UNIX command-line
Schema definition - sharding
 birth_record_2011      birth_record_2012
mother_race            mother_race
mother_age             mother_age
mother_cigarette_use   mother_cigarette_use
mother_state           mother_state
father_race            father_race
father_age             father_age
father_cigarette_use   father_cigarette_use
father_state           father_state
plurality              plurality
is_male                is_male
race                   race
weight                 weight

 birth_record_2013      birth_record_2014      birth_record_2015      birth_record_2016
Visualizing your Data
BigQuery architecture
“ If you do a table scan over a 1TB table,
  you're going to have a bad time. ”


 Anonymous
 16th century Italian Philosopher-Monk
Goal: Perform a 1 TB table scan in 1 second
Parallelize Parallelize Parallelize!


• Reading 1 TB / second from disk:
  • 10k+ disks
• Processing 1 TB / second:
  • 5k processors
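The per-device rates implied by these figures follow from simple division. A short worked sketch, assuming a decimal terabyte (10^12 bytes), which the slide does not specify:

```python
TB = 10**12  # bytes; decimal terabyte is an assumption, the slide says only "1 TB"
MB = 10**6

def per_unit_rate(total_bytes_per_sec, units):
    """Throughput each disk or processor must sustain, in MB/s."""
    return total_bytes_per_sec / units / MB

disk_rate = per_unit_rate(1 * TB, 10_000)  # 10k disks
cpu_rate = per_unit_rate(1 * TB, 5_000)    # 5k processors
print(disk_rate, cpu_rate)
```

Under that assumption each disk has to deliver about 100 MB/s and each processor has to handle about 200 MB/s.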
Data access: Column Store




 Record Oriented Storage    Column Oriented Storage
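The difference between the two layouts can be illustrated with a toy in-memory model. This is a sketch only: the counting of "values read" is a simplification of real disk I/O, and the helper names are hypothetical. The sample values come from the weather station example earlier in the deck.

```python
# Sample records using columns from the weather station example.
ROWS = [
    {"station": 9384, "mean_temp": 89.3, "humidity": 0.35},
    {"station": 2857, "mean_temp": 78.5, "humidity": 0.24},
    {"station": 3475, "mean_temp": 68.0, "humidity": 0.35},
]

def to_columns(rows):
    """Pivot record-oriented rows into one contiguous list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def avg_row_store(rows, column):
    """Row store: every field of every record is touched to read one column."""
    values_read = sum(len(row) for row in rows)
    total = sum(row[column] for row in rows)
    return total / len(rows), values_read

def avg_column_store(columns, column):
    """Column store: only the requested column is touched."""
    values = columns[column]
    return sum(values) / len(values), len(values)

columns = to_columns(ROWS)
row_avg, row_reads = avg_row_store(ROWS, "mean_temp")
col_avg, col_reads = avg_column_store(columns, "mean_temp")
print(row_reads, col_reads)
```

Both layouts give the same average, but the column layout touches a third of the values here, and the gap widens as tables gain columns.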
BigQuery Architecture
                                                  Mixer 0




          Mixer 1                           Mixer 1                    Mixer 1
          Shard 0-8                         Shard 9-16                 Shard 17-24




Shard 0                          Shard 10                   Shard 12     Shard 20    Shard 24




Distributed Storage (e.g. GFS)
Running your Queries
BigQuery SQL Example: Simple aggregates




SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...
BigQuery SQL Example: Complex Processing




SELECT ... FROM ....
WHERE REGEXP_MATCH(url, ".com$")
  AND user CONTAINS 'test'
BigQuery SQL Example: Nested SELECT

SELECT COUNT(*) FROM
  (SELECT foo ..... )
GROUP BY foo
BigQuery SQL Example: Small JOIN



SELECT huge_table.foo
FROM huge_table
JOIN small_table
ON small_table.foo = huge_table.foo
BigQuery Architecture: Small Join
                                 Mixer 0




             Mixer 1                                  Mixer 1
             Shard 0-8                                Shard 17-24




             Shard 0                       Shard 20                 Shard 24




Distributed Storage (e.g. GFS)
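The slide does not spell out how the small join runs. One plausible reading, sketched below as an assumption rather than a description of BigQuery's internals, is a broadcast join: the small table is copied to every shard and each shard joins its own slice of the huge table locally. The data and function names are made up, and the sketch assumes the keys in small_table are unique.

```python
# Hypothetical data: huge_table is split across shards, small_table is not.
HUGE_SHARDS = [
    [{"foo": 1}, {"foo": 2}],
    [{"foo": 3}, {"foo": 1}],
    [{"foo": 4}],
]
SMALL_TABLE = [{"foo": 1}, {"foo": 3}]

def shard_join(shard_rows, small_keys):
    """Each shard joins its slice of huge_table against its copy of small_table."""
    return [row["foo"] for row in shard_rows if row["foo"] in small_keys]

def broadcast_join(shards, small_table):
    """Send small_table's keys to every shard, join locally, concatenate results."""
    small_keys = {row["foo"] for row in small_table}  # assumes unique keys
    results = []
    for shard in shards:
        results.extend(shard_join(shard, small_keys))
    return results

joined = broadcast_join(HUGE_SHARDS, SMALL_TABLE)
print(joined)
```

No rows of the huge table move between shards in this scheme, which is why it only works while the second table stays small.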
Other new features!
Batch queries!

• Don't need interactive queries for some jobs?
  • priority: "BATCH"
That's it

• API
• Column-based datastore
• Full table scans FAST
• Aggregates
• Commercial tool support
• Use cases
SELECT questions FROM audience

SELECT 'Thank You!'
FROM ryan

http://developers.google.com/bigquery

@ryguyrg          http://profiles.google.com/ryan.boyd
A Little Later ...
Row   wp_namespace   Revs
1     0              53697002
2     1              6151228
3     3              5519859
4     4              4184389
5     2              3108562
6     10             1052044
7     6              877417
8     14             838940
9     5              651749
10    11             192534
11    100            148135

Underlying table:
 • Wikipedia page revision records
 • Rows: 314 million
 • Byte size: 35.7 GB
Query Stats:
 • Scanned 7 GB of data
 • <5 seconds
 • ~ 100M rows scanned / second
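The query statistics are consistent with one another, as a quick check shows. The inputs below are the slide's own figures; treating the rows-per-second rate as exact is a simplification.

```python
rows_total = 314e6     # rows in the underlying table
table_gb = 35.7        # total table size
scanned_gb = 7.0       # data actually scanned by the query
rows_per_sec = 100e6   # approximate scan rate from the slide

scan_fraction = scanned_gb / table_gb
est_seconds = rows_total / rows_per_sec
print(round(scan_fraction, 3), round(est_seconds, 2))
```

Only about a fifth of the table's bytes are read, and 314 million rows at roughly 100 million rows per second comes to a little over 3 seconds, in line with the reported "<5 seconds".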
ORDER BY Revs DESC
                        Mixer 0
                                                      COUNT (revision_id)
                                                      GROUP BY wp_namespace




       Mixer 1                           Mixer 1
                                                      COUNT (revision_id)
                                                      GROUP BY wp_namespace




Leaf             Leaf             Leaf         Leaf   COUNT (revision_id)
                                                      GROUP BY wp_namespace
                                                      WHERE timestamp > CUTOFF

                                                                       10 GB / s

             Distributed Storage
                                                      SELECT wp_namespace, revision_id
"Multi-stage" Query
SELECT
  LogEdits, COUNT(contributor_id) Contributors
FROM (
  SELECT
    contributor_id,
    INTEGER(LOG10(COUNT(revision_id))) LogEdits
  FROM [publicdata:samples.wikipedia]
  GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
                        Mixer 0                       COUNT(contributor_id)
                                                      GROUP BY LogEdits




       Mixer 1                       Mixer 1
                                                      COUNT(contributor_id)
                                                      GROUP BY LogEdits




                                                     COUNT(contributor_id)
Leaf             Leaf         Shuffler    Shuffler   GROUP BY LogEdits         N^2    Shuffle by
                                                     SELECT LE, Id             GB/s   contributor_id
                                                     COUNT(*)
                                                     GROUP BY contributor_id


             Distributed Storage
                                                      SELECT contributor_id
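The shuffle step in this diagram can be sketched as hash partitioning: every row for a given contributor_id is routed to the same shuffler, so each shuffler can finish the inner GROUP EACH BY on its own keys. The code below is a toy model under that assumption, with made-up data and hypothetical function names; it is not how BigQuery is implemented.

```python
import math
from collections import Counter

# Hypothetical revisions: one contributor_id per revision row.
REVISIONS = ["a"] * 100 + ["b"] * 12 + ["c"] * 10 + ["d"] * 3 + ["e"]

def shuffle(rows, num_shufflers):
    """Route each row by key hash so a given key lands on exactly one shuffler."""
    buckets = [[] for _ in range(num_shufflers)]
    for key in rows:
        buckets[hash(key) % num_shufflers].append(key)
    return buckets

def shuffler_stage(bucket):
    """Inner query: count edits per contributor, then bucket by LogEdits."""
    edits = Counter(bucket)
    return Counter(int(math.log10(n)) for n in edits.values())

def mixer_stage(partials):
    """Outer query: merge per-shuffler histograms, ORDER BY LogEdits DESC."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return sorted(total.items(), reverse=True)

buckets = shuffle(REVISIONS, num_shufflers=2)
histogram = mixer_stage(shuffler_stage(b) for b in buckets)
print(histogram)
```

Which shuffler receives which key varies between runs, but the final histogram does not, because no contributor is ever split across shufflers.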
When to use EACH

•   Shuffle definitely adds some overhead
•   Poor query performance if used incorrectly

•   GROUP BY
    o Groups << Rows => Unbalanced load
    o Example: GROUP BY state

•   GROUP EACH BY
    o Groups ~ Rows
    o Example: GROUP EACH BY user_id

More Related Content

What's hot

TDC2016SP - Trilha BigData
TDC2016SP - Trilha BigDataTDC2016SP - Trilha BigData
TDC2016SP - Trilha BigData
tdc-globalcode
 
Redshift VS BigQuery
Redshift VS BigQueryRedshift VS BigQuery
Redshift VS BigQuery
Kostas Pardalis
 
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
Big Data Spain
 
BigQuery for Beginners
BigQuery for BeginnersBigQuery for Beginners
BigQuery for Beginners
Better&Stronger
 
Big query
Big queryBig query
Big query
Tanvi Parikh
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
DoiT International
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
Márton Kodok
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.com
Alex Van Boxel
 
30 days of google cloud event
30 days of google cloud event30 days of google cloud event
30 days of google cloud event
PreetyKhatkar
 
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIsGDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
Patrick Chanezon
 
BigQuery implementation
BigQuery implementationBigQuery implementation
BigQuery implementation
Simon Su
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure data
Treasure Data, Inc.
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
Ryuji Tamagawa
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
DoiT International
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
Eduardo Silva Pereira
 
Webinar: Live Data Visualisation with Tableau and MongoDB
Webinar: Live Data Visualisation with Tableau and MongoDBWebinar: Live Data Visualisation with Tableau and MongoDB
Webinar: Live Data Visualisation with Tableau and MongoDB
MongoDB
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
Treasure Data, Inc.
 
Real Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at WishReal Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at Wish
MongoDB
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
ObjectRocket
 
Hands On: Javascript SDK
Hands On: Javascript SDKHands On: Javascript SDK
Hands On: Javascript SDK
Treasure Data, Inc.
 

What's hot (20)

TDC2016SP - Trilha BigData
TDC2016SP - Trilha BigDataTDC2016SP - Trilha BigData
TDC2016SP - Trilha BigData
 
Redshift VS BigQuery
Redshift VS BigQueryRedshift VS BigQuery
Redshift VS BigQuery
 
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
 
BigQuery for Beginners
BigQuery for BeginnersBigQuery for Beginners
BigQuery for Beginners
 
Big query
Big queryBig query
Big query
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
 
Google Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.comGoogle Cloud Platform at Vente-Exclusive.com
Google Cloud Platform at Vente-Exclusive.com
 
30 days of google cloud event
30 days of google cloud event30 days of google cloud event
30 days of google cloud event
 
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIsGDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
GDD Brazil 2010 - Google Storage, Bigquery and Prediction APIs
 
BigQuery implementation
BigQuery implementationBigQuery implementation
BigQuery implementation
 
Augmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure dataAugmenting Mongo DB with treasure data
Augmenting Mongo DB with treasure data
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Unifying Events and Logs into the Cloud
Unifying Events and Logs into the CloudUnifying Events and Logs into the Cloud
Unifying Events and Logs into the Cloud
 
Webinar: Live Data Visualisation with Tableau and MongoDB
Webinar: Live Data Visualisation with Tableau and MongoDBWebinar: Live Data Visualisation with Tableau and MongoDB
Webinar: Live Data Visualisation with Tableau and MongoDB
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
Real Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at WishReal Time Data Analytics with MongoDB and Fluentd at Wish
Real Time Data Analytics with MongoDB and Fluentd at Wish
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Hands On: Javascript SDK
Hands On: Javascript SDKHands On: Javascript SDK
Hands On: Javascript SDK
 

Similar to Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Future of Data Meetup
 
Data Access Patterns
Data Access PatternsData Access Patterns
Data Access Patterns
Amazon Web Services
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
Julian Hyde
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
Amazon Web Services
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTO
Riccardo Zamana
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
Amazon Web Services
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Amazon Web Services
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
Julian Hyde
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
DataWorks Summit
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Grant Ingersoll
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
Mathias Herberts
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Migrating To PostgreSQL
Migrating To PostgreSQLMigrating To PostgreSQL
Migrating To PostgreSQL
Grant Fritchey
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
Amazon Web Services
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for you
Luc Bors
 
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftSRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
Amazon Web Services
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
Amazon Web Services
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
University of Washington
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
Amazon Web Services
 

Similar to Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012 (20)

Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
Don't reengineer, reimagine: Hive buzzing with Druid's magic potion
 
Data Access Patterns
Data Access PatternsData Access Patterns
Data Access Patterns
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTO
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Migrating To PostgreSQL
Migrating To PostgreSQLMigrating To PostgreSQL
Migrating To PostgreSQL
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for you
 
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftSRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 

Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012

  • 2. Crunching Data with BigQuery Fast analysis of Big Data Jordan Tigani, Software Engineer
  • 4. Big Data at Google 72 hours 100 million gigabytes
  • 5. SELECT kick_ass_product_plan AS strategy, AVG(kicking_factor) AS awesomeness FROM lots_of_data GROUP BY strategy
  • 6. Query result:
      +-------------+----------------+
      | strategy    | awesomeness    |
      +-------------+----------------+
      | "Forty-two" | 1000000.01     |
      +-------------+----------------+
      1 row in result set (10.2 s)
      Scanned 100GB
  • 9. Regular expressions on 13 billion rows...
  • 10. 13 Billion rows 1 TB of data in 4 tables FAST!
  • 12. MapReduce is Flexible but Heavy • Master constructs the plan and begins spinning up workers • Mappers read and write to distributed storage • Map => Shuffle => Reduce • Reducers read and write to distributed storage (Diagram: Master, two Mappers, Reducer, Distributed Storage)
  • 13. MapReduce is Flexible but Heavy Stage 1 Stage 2 Mapper Mapper Mapper Mapper Master Distributed Storage Master Reducer Reducer
  • 14. Dremel vs MapReduce • MapReduce o Flexible batch processing o High overall throughput o High latency • Dremel o Optimized for interactive SQL queries o Very low latency
  • 15. Dremel Architecture • Partial Reduction • Diskless data flow • Long lived shared serving tree • Columnar Storage (Diagram: Mixer 0 at the root, two Mixer 1 nodes, four Leaf nodes, Distributed Storage)
  • 16. Simple Query SELECT state, COUNT(*) count_babies FROM [publicdata:samples.natality] WHERE year >= 1980 AND year < 1990 GROUP BY state ORDER BY count_babies DESC LIMIT 10
  • 17. Execution tree for the simple query • Mixer 0: LIMIT 10, ORDER BY count_babies DESC, COUNT(*) GROUP BY state; receives O(50 states) from each child • Mixer 1 (x2): COUNT(*) GROUP BY state; receives O(50 states) • Leaf (x4): COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990; reads O(Rows ~140M) • Distributed Storage: SELECT state, year
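The serving-tree flow on this slide can be sketched in a few lines of Python. This is only an illustration of partial reduction, not Dremel's implementation: the shard contents, the function names (`leaf_scan`, `mixer_merge`, `root_finish`), and the tie-break on state name are all assumptions made for the example.

```python
from collections import Counter


def leaf_scan(rows):
    """Leaf: apply the WHERE filter and count rows per state (a partial result)."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)


def mixer_merge(partials):
    """Mixer: add up the per-state partial counts received from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def root_finish(counts, limit=10):
    """Root mixer: ORDER BY count DESC, then LIMIT.

    Ties are broken by state name, which is a choice made for this sketch.
    """
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical toy shards of (state, year) rows, one list per leaf.
shards = [
    [("CA", 1981), ("TX", 1985), ("CA", 1979)],
    [("CA", 1989), ("NY", 1983), ("TX", 1990)],
    [("NY", 1984), ("CA", 1982)],
    [("TX", 1986), ("NY", 1995)],
]

leaf_results = [leaf_scan(shard) for shard in shards]
level1 = [mixer_merge(leaf_results[:2]), mixer_merge(leaf_results[2:])]
top_states = root_finish(mixer_merge(level1))
print(top_states)  # [('CA', 3), ('NY', 2), ('TX', 2)]
```

Each level only ever passes up one count per state, which is why the slide labels the upper edges O(50 states) while the leaves read O(rows).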
  • 19. Example: Daily Weather Station Data (table weather_station_data)
      station | lat | long | mean_temp | humidity | timestamp | year | month | day
      9384 | 33.57 | 86.75 | 89.3 | .35 | 1351005129 | 2011 | 04 | 19
      2857 | 36.77 | 119.72 | 78.5 | .24 | 1351005135 | 2011 | 04 | 19
      3475 | 40.77 | 73.98 | 68 | .35 | 1351015930 | 2011 | 04 | 19
      etc...
  • 20. Example: Daily Weather Station Data (CSV)
      station, lat, long, mean_temp, year, mon, day
      999999, 36.624, -116.023, 63.6, 2009, 10, 9
      911904, 20.963, -156.675, 83.4, 2009, 10, 9
      916890, -18133, 178433, 76.9, 2009, 10, 9
      943320, -20678, 139488, 73.8, 2009, 10, 9
  • 21. Organizing BigQuery Tables October 22 October 23 Your Source Data October 24
  • 23. Modeling Event Data: Social Music Store (table logs.oct_24_2012_song_activities)
      USERNAME | ACTIVITY | Cost | SONG | ARTIST | TIMESTAMP
      Michael | LISTEN | | Too Close | Alex Clare | 1351065562
      Michael | LISTEN | | Gangnam Style | PSY | 1351105150
      Jim | LISTEN | | Complications | Deadmau5 | 1351075720
      Michael | PURCHASE | 0.99 | Gangnam Style | PSY | 1351115962
  • 24. Users Who Listened to More than 10 Songs/Day SELECT UserId, COUNT(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] GROUP EACH BY UserId HAVING ListenActivities > 10
  • 25. How Many Songs Listened to Total by Listeners of PSY? SELECT UserId, count(*) as ListenActivities FROM [logs.oct_24_2012_song_activities] WHERE UserId IN ( SELECT UserId FROM [logs.oct_24_2012_song_activities] WHERE artist = 'PSY') GROUP EACH BY UserId HAVING ListenActivities > 10
  • 26. Modeling Event Data: Nested and Repeated Values (JSON) {"UserID": "Michael", "Listens": [ {"TrackId": 1234, "Title": "Gangnam Style", "Artist": "PSY", "Timestamp": 1351075700}, {"TrackId": 1234, "Title": "Alex Clare", "Artist": "Alex Clare", "Timestamp": 1351075700} ], "Purchases": [ {"Track": 2345, "Title": "Gangnam Style", "Artist": "PSY", "Timestamp": 1351075700, "Cost": 0.99} ]}
  • 27. Which Users Have Listened to Beyonce? SELECT UserID, COUNT(ListenActivities.artist) WITHIN RECORD AS song_count FROM [logs.oct_24_2012_songactivities] WHERE UserID IN (SELECT UserID FROM [logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'Beyonce');
  • 28. What Position are PSY songs in our Users' Daily Playlists? SELECT UserID, POSITION(ListenActivities.artist) FROM [sample_music_logs.oct_24_2012_songactivities] WHERE ListenActivities.artist = 'PSY';
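To make the nested-record queries more concrete, here is a rough Python analogue of counting within a record and of finding an item's position in a repeated field. It assumes newline-delimited JSON with the field names from the JSON slide (`UserID`, `Listens`, `Artist`); the helper name `artist_positions` and the sample records are illustrative, and positions are counted from 1.

```python
import json

# Two hypothetical newline-delimited JSON records, one user per line.
ndjson = "\n".join([
    json.dumps({"UserID": "Michael", "Listens": [
        {"Title": "Gangnam Style", "Artist": "PSY"},
        {"Title": "Alex Clare", "Artist": "Alex Clare"},
    ]}),
    json.dumps({"UserID": "Jim", "Listens": [
        {"Title": "Complications", "Artist": "Deadmau5"},
    ]}),
])


def artist_positions(line, artist="PSY"):
    """Return (user, listens in this record, 1-based positions of the artist)."""
    record = json.loads(line)
    listens = record["Listens"]
    positions = [
        index
        for index, listen in enumerate(listens, start=1)
        if listen["Artist"] == artist
    ]
    return record["UserID"], len(listens), positions


results = [artist_positions(line) for line in ndjson.splitlines()]
for user, song_count, positions in results:
    print(user, song_count, positions)
```

The per-record count plays the role of an aggregate scoped to one record, and the index list plays the role of the position lookup; the real query engine evaluates both without flattening the repeated field first.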
  • 29. Average Position of Songs by PSY in All Daily Playlists? SELECT AVG(POSITION(ListenActivities.artist)) FROM [sample_music_logs.oct_24_2012_songactivities], [sample_music_logs.oct_23_2012_songactivities], /* etc... */ WHERE ListenActivities.artist = 'PSY';
  • 30. Summary: Choosing a BigQuery Data Model • "Shard" your Data Using Multiple Tables • Source Data Files • CSV format • Newline-delimited JSON • Using Nested and Repeated Records • Simplify Some Types of Queries • Often Matches Document Database Models
  • 33. Upload Your Data Google Cloud BigQuery Storage
  • 34. Load your Data into BigQuery "jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ... POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
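As a sketch of how the load request above might be assembled in code, the following builds the same JSON body as a Python dictionary and serializes it. No request is sent. The helper name `build_load_job` is hypothetical, and only the single schema field visible on the slide is included, since the rest of the schema is elided there.

```python
import json

PROJECT_ID = "605902584318"  # project ID as shown on the slide


def build_load_job(dataset_id, table_id, source_uris, fields):
    """Assemble a load-job body with the structure shown on the slide."""
    return {
        "jobReference": {"projectId": PROJECT_ID},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "sourceUris": list(source_uris),
                "schema": {"fields": list(fields)},
            }
        },
    }


job = build_load_job(
    "my_dataset",
    "widget_sales",
    ["gs://widget-sales-data/2012080100.csv"],
    # Only the field visible on the slide; the remaining fields are elided there.
    [{"name": "widget", "type": "string"}],
)
url = "https://www.googleapis.com/bigquery/v2/projects/%s/jobs" % PROJECT_ID
body = json.dumps(job, indent=2)
print("POST", url)
print(body)
```

An authenticated HTTP client would then POST `body` to `url`; authentication is outside the scope of this sketch.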
  • 35. Query Away! "jobReference":{ "projectId":"605902584318", "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2" } POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
  • 36. Libraries • Python • JavaScript • Java • Go • .NET • PHP • Ruby • Objective-C
  • 37. Libraries - Example JavaScript Query var request = gapi.client.bigquery.jobs.query({ 'projectId': project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;' }); request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...
  • 38. Custom Code and the Google Chart Tools API
  • 41. Demo: Using BigQuery on BigQuery
  • 42. BigQuery - Aggregate Big Data Analysis in Seconds • Full table scans FAST • Aggregate Queries on Massive Datasets • Supports Flat and Nested/Repeated Data Models • It's an API Get started now: http://developers.google.com/bigquery/
  • 43. SELECT questions FROM audience SELECT 'Thank You!' FROM jordan http://developers.google.com/bigquery
  • 45. Schema definition birth_record parents parent_id_mother id parent_id_father race plurality age is_male cigarette_use race state weight
  • 46. Schema definition birth_record mother_race mother_age mother_cigarette_use mother_state father_race father_age father_cigarette_use father_state plurality is_male race weight
  • 47. Tools to prepare your data • App Engine MapReduce • Commercial ETL tools • Pervasive • Informatica • Talend • UNIX command-line
  • 48. Schema definition - sharding • Tables: birth_record_2011, birth_record_2012, birth_record_2013, birth_record_2014, birth_record_2015, birth_record_2016 • Columns in each table: mother_race, mother_age, mother_cigarette_use, mother_state, father_race, father_age, father_cigarette_use, father_state, plurality, is_male, race, weight
  • 51. “If you do a table scan over a 1TB table, you're going to have a bad time.” Anonymous 16th century Italian Philosopher-Monk
  • 52. Goal: Perform a 1 TB table scan in 1 second. Parallelize Parallelize Parallelize! • Reading 1 TB / second from disk: 10k+ disks • Processing 1 TB / sec: 5k processors
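The arithmetic behind these figures is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and takes the slide's round numbers of 10,000 disks and 5,000 processors at face value; the function name is made up for the example.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def per_unit_rate_mb(total_bytes_per_sec, units):
    """Throughput each unit must sustain, in MB/s, to reach the total rate."""
    return total_bytes_per_sec / units / MB


disk_rate = per_unit_rate_mb(TB, 10_000)
cpu_rate = per_unit_rate_mb(TB, 5_000)
print(f"each disk must read about {disk_rate:.0f} MB/s")
print(f"each processor must handle about {cpu_rate:.0f} MB/s")
```

Under these assumptions each disk has to deliver roughly 100 MB/s and each processor roughly 200 MB/s, which is why the goal is only reachable by spreading the scan across thousands of machines.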
  • 53. Data access: Column Store Record Oriented Storage Column Oriented Storage
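A small Python sketch can show why column-oriented storage helps aggregate queries. It is a toy model, not BigQuery's storage format: the same three weather rows are held once as a list of records and once as one list per column, and "values touched" is a stand-in for bytes read from disk.

```python
# Sample values taken from the weather station example earlier in the deck.
rows = [
    {"station": 9384, "mean_temp": 89.3, "humidity": 0.35},
    {"station": 2857, "mean_temp": 78.5, "humidity": 0.24},
    {"station": 3475, "mean_temp": 68.0, "humidity": 0.35},
]

# Column-oriented layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def avg_row_store(records, field):
    """Reading a record pulls in every field, even if only one is needed."""
    touched = sum(len(record) for record in records)
    average = sum(record[field] for record in records) / len(records)
    return average, touched


def avg_column_store(cols, field):
    """Only the requested column is read."""
    values = cols[field]
    return sum(values) / len(values), len(values)


row_avg, row_touched = avg_row_store(rows, "mean_temp")
col_avg, col_touched = avg_column_store(columns, "mean_temp")
print(row_touched, col_touched)  # 9 values versus 3 values
```

Both layouts give the same average, but the column layout touches a third of the values here; with wide tables and queries over a few columns the gap grows accordingly.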
  • 54. BigQuery Architecture Mixer 0 Mixer 1 Mixer 1 Mixer 1 Shard 0-8 Shard 9-16 Shard 17-24 Shard 0 Shard 10 Shard 12 Shard 20 Shard 24 Distributed Storage (e.g. GFS)
  • 56. BigQuery SQL Example: Simple aggregates SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ...
  • 57. BigQuery SQL Example: Complex Processing SELECT ... FROM .... WHERE REGEXP_MATCH(url, ".com$") AND user CONTAINS 'test'
  • 58. BigQuery SQL Example: Nested SELECT SELECT COUNT(*) FROM (SELECT foo ..... ) GROUP BY foo
  • 59. BigQuery SQL Example: Small JOIN SELECT huge_table.foo FROM huge_table JOIN small_table ON small_table.foo = huge_table.foo
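One common way to run a join like this is to send a copy of the small table to every shard of the huge table, so each shard can join locally. Whether BigQuery does exactly this is an assumption here; the sketch below only illustrates the idea, with made-up shard contents and a hypothetical helper `shard_join`. It also assumes the join key is unique in the small table.

```python
def shard_join(huge_shard, small_keys):
    """Join one shard of the huge table against the in-memory small table."""
    return [row["foo"] for row in huge_shard if row["foo"] in small_keys]


# Hypothetical data: a tiny lookup table and three shards of the huge table.
small_table = [{"foo": "a"}, {"foo": "c"}]
huge_shards = [
    [{"foo": "a"}, {"foo": "b"}],
    [{"foo": "c"}, {"foo": "a"}],
    [{"foo": "d"}],
]

# The small table is turned into a set once and handed to every shard.
# This relies on the assumption that its keys are unique.
small_keys = {row["foo"] for row in small_table}

joined = [
    foo
    for shard in huge_shards
    for foo in shard_join(shard, small_keys)
]
print(joined)  # ['a', 'c', 'a']
```

The huge table never moves between workers in this scheme; only the small table is copied, which is what makes the approach cheap when one side of the join is small.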
  • 60. BigQuery Architecture: Small Join Mixer 0 Mixer 1 Mixer 1 Shard 0-8 Shard 17-24 Shard 0 Shard 20 Shard 24 Distributed Storage (e.g. GFS)
  • 62. Batch queries! • Don't need interactive queries for some jobs? • priority: "BATCH"
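A request body with batch priority might be assembled as in the sketch below. The helper name `build_query_job` is hypothetical, and placing `priority` under `configuration.query` reflects the v2 jobs API as I understand it, so treat that placement as an assumption to verify against the API reference. The query text is reused from the earlier "Query Away!" slide, and nothing is sent over the network.

```python
import json


def build_query_job(project_id, sql, batch=False):
    """Assemble a query-job body; batch=True requests non-interactive priority."""
    query_config = {"query": sql}
    if batch:
        # Assumed location of the priority field within the job configuration.
        query_config["priority"] = "BATCH"
    return {
        "jobReference": {"projectId": project_id},
        "configuration": {"query": query_config},
    }


batch_job = build_query_job(
    "605902584318",
    "SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales",
    batch=True,
)
interactive_job = build_query_job("605902584318", "SELECT 1")
print(json.dumps(batch_job, indent=2))
```

Leaving `batch` at its default omits the field entirely, so the job runs with whatever priority the service applies by default.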
  • 63. That's it • API • Column-based datastore • Full table scans FAST • Aggregates • Commercial tool support • Use cases
  • 64. SELECT questions FROM audience SELECT 'Thank You!' FROM ryan http://developers.google.com/bigquery @ryguyrg http://profiles.google.com/ryan.boyd
  • 66. Data access: Column Store Record Oriented Storage Column Oriented Storage
  • 67. A Little Later ... Underlying table: • Wikipedia page revision records • Rows: 314 million • Byte size: 35.7 GB Query Stats: • Scanned 7G of data • <5 seconds • ~ 100M rows scanned / second
      Row | wp_namespace | Revs
      1 | 0 | 53697002
      2 | 1 | 6151228
      3 | 3 | 5519859
      4 | 4 | 4184389
      5 | 2 | 3108562
      6 | 10 | 1052044
      7 | 6 | 877417
      8 | 14 | 838940
      9 | 5 | 651749
      10 | 11 | 192534
      11 | 100 | 148135
  • 68. Execution tree • Mixer 0: ORDER BY Revs DESC, COUNT (revision_id) GROUP BY wp_namespace • Mixer 1 (x2): COUNT (revision_id) GROUP BY wp_namespace • Leaf (x4): COUNT (revision_id) GROUP BY wp_namespace WHERE timestamp > CUTOFF • Distributed Storage: SELECT wp_namespace, revision_id (10 GB / s)
  • 69. "Multi-stage" Query SELECT LogEdits, COUNT(contributor_id) Contributors FROM ( SELECT contributor_id, INTEGER(LOG10(COUNT(revision_id))) LogEdits FROM [publicdata:samples.wikipedia] GROUP EACH BY contributor_id) GROUP BY LogEdits ORDER BY LogEdits DESC
  • 70. Execution tree for the "multi-stage" query • Mixer 0: ORDER BY LogEdits DESC, COUNT(contributor_id) GROUP BY LogEdits • Mixer 1 (x2): COUNT(contributor_id) GROUP BY LogEdits • Shuffler (x2): COUNT(contributor_id) GROUP BY LogEdits; SELECT LE, Id; COUNT(*) GROUP BY contributor_id • N^2 shuffle by contributor_id (GB/s) between Leaf and Shuffler • Leaf (x2) • Distributed Storage: SELECT contributor_id
  • 71. When to use EACH • Shuffle definitely adds some overhead • Poor query performance if used incorrectly • GROUP BY o Groups << Rows => Unbalanced load o Example: GROUP BY state • GROUP EACH BY o Groups ~ Rows o Example: GROUP BY user_id
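The shuffle step can be illustrated with a toy version of the "multi-stage" query from earlier in the deck. This is a sketch under assumptions, not BigQuery's implementation: rows are routed by `contributor_id % num_shufflers` (a stand-in for a real hash), the leaf outputs are invented, and the function names are hypothetical.

```python
import math
from collections import Counter


def shuffle_by_key(leaf_outputs, num_shufflers):
    """Route every row to the shuffler that owns its contributor_id."""
    buckets = [[] for _ in range(num_shufflers)]
    for rows in leaf_outputs:
        for contributor_id in rows:
            buckets[contributor_id % num_shufflers].append(contributor_id)
    return buckets


def inner_stage(bucket):
    """Per shuffler: COUNT(*) GROUP BY contributor_id, then INTEGER(LOG10(count))."""
    counts = Counter(bucket)
    return [(cid, int(math.log10(n))) for cid, n in counts.items()]


def outer_stage(per_shuffler):
    """Mixers: COUNT(contributor_id) GROUP BY LogEdits, ORDER BY LogEdits DESC."""
    contributors = Counter()
    for pairs in per_shuffler:
        for _cid, log_edits in pairs:
            contributors[log_edits] += 1
    return sorted(contributors.items(), reverse=True)


# Hypothetical leaf outputs: one contributor_id per revision row.
leaf_a = [1, 1, 1, 1, 2, 3]
leaf_b = [1, 1, 1, 1, 1, 1, 2, 4]

buckets = shuffle_by_key([leaf_a, leaf_b], num_shufflers=2)
result = outer_stage(inner_stage(bucket) for bucket in buckets)
print(result)  # [(1, 1), (0, 3)]
```

Because every row for a given contributor lands on the same shuffler, each shuffler can finish the inner GROUP BY on its own, and only the small per-LogEdits counts travel up to the mixers. The cost is the all-to-all data movement between leaves and shufflers, which is the overhead the slide warns about.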