SlideShare a Scribd company logo
VLDB'10, Singapore Sep 14, 2010 Sergey Melnik Andrey Gubarev Jing Jing Long  Geoffrey Romer Shiva Shivakumar    Matt Tolton  Theo Vassilakis    Dremel: Interactive Analysis of Web-Scale Datasets
Speed matters Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Spam Trends Detection Web Dashboards Network Optimization Interactive Tools
Example: data exploration Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Runs a MapReduce to extract billions of signals from web pages  Googler Alice DEFINE TABLE t AS /path/to/data/* SELECT TOP(signal, 100), COUNT(*) FROM t . . . More MR-based processing on her data (FlumeJava  [PLDI'10] , Sawzall  [Sci.Pr.'05] ) 1 2 3 Ad hoc SQL against Dremel
Dremel system ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
[object Object],[object Object],Why call it Dremel Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Widely used inside Google ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Outline ,[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Records  vs.  columns A B C D E * * * . . . . . . r 1 r 2 r 1 r 2 r 1 r 2 r 1 r 2 Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Challenge: preserve structure, reconstruct from a subset of fields Read less, cheaper decompression
Nested data model message  Document { required  int64  DocId;  [1,1] optional group  Links { repeated  int64  Backward;  [0,*] repeated  int64  Forward;   } repeated group  Name { repeated group  Language { required  string  Code; optional  string  Country;  [0,1]   } optional  string  Url;   } } DocId:  10 Links Forward:  20 Forward:  40 Forward:  60 Name  Language  Code:  'en-us' Country:  'us' Language Code:  'en' Url:  'http://A' Name Url:  'http://B' Name Language Code:  'en-gb' Country:  'gb' r 1 DocId:  20 Links Backward:  10 Backward:  30 Forward:  80 Name Url:  'http://C' r 2 Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 http://code.google.com/apis/protocolbuffers multiplicity:
Column-striped representation DocId Name.Url Name.Language.Code Name.Language.Country Links.Backward Links.Forward Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 value r d 10 0 0 20 0 0 value r d http://A 0 2 http://B 1 2 NULL 1 1 http://C 0 2 value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1 value r d us 0 3 NULL 2 2 NULL 1 1 gb 1 3 NULL 0 1 value r d 20 0 2 40 1 2 60 1 2 80 0 2 value r d NULL 0 1 10 0 2 30 1 2
Repetition and definition levels DocId:  10 Links Forward:  20 Forward:  40 Forward:  60 Name  Language  Code:  ' en-us ' Country:  'us' Language Code:  ' en ' Url:  'http://A' Name Url:  'http://B' Name Language Code:  'en-gb' Country:  'gb' r 1 DocId:  20 Links Backward:  10 Backward:  30 Forward:  80 Name Url:  'http://C' r 2 Name.Language.Code r : At what repeated field in the field's path   the value has repeated d : How many fields in paths that could be   undefined (opt. or rep.) are actually present Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 record (r=0) has repeated r=2 r=1 Language  (r=2) has repeated (non-repeating) value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1
Record assembly FSM Name.Language.Country Name.Language.Code Links.Backward Links.Forward Name.Url DocId 1 0 1 0 0,1,2 2 0,1 1 0 0 For record-oriented data processing (e.g., MapReduce) Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Transitions labeled with repetition levels
Reading two fields DocId Name.Language.Country 1,2 0 0 DocId:  10 Name Language  Country:  'us' Language Name Name Language Country:  'gb' DocId:  20 Name s 1 s 2 Structure of parent fields is preserved. Useful for queries like  /Name[3]/Language[1]/Country Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Both Dremel and MapReduce can read same columnar data
Outline ,[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Query processing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
SQL dialect for nested data Id:  10 Name Cnt:  2 Language  Str:  'http://A,en-us' Str:  'http://A,en' Name Cnt:  0 t 1 SELECT  DocId  AS  Id, COUNT(Name.Language.Code)  WITHIN  Name  AS  Cnt,   Name.Url +  ','  + Name.Language.Code  AS  Str FROM  t WHERE  REGEXP(Name.Url,  '^http' )  AND  DocId <  20 ; message  QueryResult { required  int64  Id; repeated group  Name { optional  uint64  Cnt; repeated group  Language { optional  string  Str;   }   } } Output table Output schema Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 No record assembly during query processing
Serving tree storage layer (e.g., GFS) . . . . . . . . . leaf servers (with local storage) intermediate servers root server client ,[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 [Dean WSDM'09] histogram of response times
Example: count() SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/1, /gfs/2, …, /gfs/100000} SELECT A, SUM(c) FROM ( R 1 1  UNION ALL  R 1 10 ) GROUP BY A SELECT A, COUNT(B) AS c FROM T 1 1 GROUP BY A T 1 1 = {/gfs/1, …, /gfs/10000} SELECT A, COUNT(B) AS c FROM T 1 2 GROUP BY A T 1 2 = {/gfs/10001, …, /gfs/20000} SELECT A, COUNT(B) AS c FROM T 3 1 GROUP BY A T 3 1 = {/gfs/1} . . . 0 1 3 R 1 1 R 1 2 Data access ops Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 . . . . . .
Outline ,[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Experiments Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 ,[object Object],[object Object],[object Object],Table name Number of records Size (unrepl., compressed) Number of fields Data center Repl. factor T1 85 billion 87 TB 270 A 3× T2 24 billion 13 TB 530 A 3× T3 4 billion 70 TB 1200 A 3× T4 1+ trillion 105 TB 50 B 3× T5 1+ trillion 20 TB 30 B 2×
Read from disk columns records objects from records from columns ( a ) read +   decompress ( b ) assemble    records ( c ) parse as   C++ objects ( d ) read +   decompress ( e ) parse as   C++ objects time (sec) number of fields &quot;cold&quot; time on local disk, averaged over 30 runs Table partition: 375 MB (compressed), 300K rows, 125 columns Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 2-4x overhead of using records 10x speedup using columnar storage
MR and Dremel execution ,[object Object],[object Object],execution time (sec) on  3000 nodes   ,[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Q1: 87 TB 0.5 TB 0.5 TB MR overheads: launch jobs, schedule 0.5M tasks, assemble records Avg # of terms in txtField in 85 billion record table T1
Impact of serving tree depth execution time (sec) Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 SELECT country, SUM(item.amount) FROM T2 GROUP BY country SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain Q2: Q3: 40 billion nested items (returns 100s of records) (returns 1M records)
Scalability execution time (sec) number of leaf servers Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 SELECT TOP(aid, 20), COUNT(*) FROM T4 Q5 on a trillion-row table T4:
Outline ,[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
Interactive speed execution time (sec) percentage of queries Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Most queries complete under 10 sec Monthly query workload of one 3000-node Dremel instance
Observations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10
BigQuery: powered by Dremel Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 http://code.google.com/apis/bigquery/ 1. Upload 2. Process Upload your data to Google Storage Import to tables Run queries 3. Act Your Data BigQuery Your Apps

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query language
Courtney Robinson
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
Frederik Engelen
 
Apache Solr
Apache SolrApache Solr
Apache Solr
Minh Tran
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
Davin Abraham
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
Heman Hosainpana
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Amazon Web Services
 
Nginx cheat sheet
Nginx cheat sheetNginx cheat sheet
Nginx cheat sheet
Lam Hoang
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
Bill Liu
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Julien Le Dem
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
Aniket Mokashi
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
Anil Gupta
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
Apache Apex
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
Data Con LA
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
MongoDB: Advance concepts - Replication and Sharding
MongoDB: Advance concepts - Replication and ShardingMongoDB: Advance concepts - Replication and Sharding
MongoDB: Advance concepts - Replication and Sharding
Knoldus Inc.
 

What's hot (20)

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Cql – cassandra query language
Cql – cassandra query languageCql – cassandra query language
Cql – cassandra query language
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Nginx cheat sheet
Nginx cheat sheetNginx cheat sheet
Nginx cheat sheet
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
MongoDB: Advance concepts - Replication and Sharding
MongoDB: Advance concepts - Replication and ShardingMongoDB: Advance concepts - Replication and Sharding
MongoDB: Advance concepts - Replication and Sharding
 

Viewers also liked

Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913
jasonfrantz
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
Maria Stylianou
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
Vicente Orjales
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
Andy Petrella
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
Gabriel Hamilton
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
DataWorks Summit/Hadoop Summit
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Li &amp; fung ltd
Li &amp; fung ltdLi &amp; fung ltd
Li &amp; fung ltd
Priyanka Chowdhury
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2
Paul Pivec
 
Medicaid vs. Medicare in Oregon
Medicaid vs. Medicare in OregonMedicaid vs. Medicare in Oregon
Medicaid vs. Medicare in Oregon
Richard Schneider
 
Storytime updated ppt
Storytime updated pptStorytime updated ppt
Storytime updated ppt
nolenlib
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
ShareThis
 
2nd Easter A(2)
2nd Easter A(2)2nd Easter A(2)
2nd Easter A(2)
Rey An Yatco
 
Torishima Pump general catalogue, Venezuela, Central America, Caribbean
Torishima Pump general catalogue, Venezuela, Central America, CaribbeanTorishima Pump general catalogue, Venezuela, Central America, Caribbean
Torishima Pump general catalogue, Venezuela, Central America, Caribbean
P&L International Trading
 

Viewers also liked (20)

Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913
 
Google's Dremel
Google's DremelGoogle's Dremel
Google's Dremel
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
How BigQuery broke my heart
How BigQuery broke my heartHow BigQuery broke my heart
How BigQuery broke my heart
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Li &amp; fung ltd
Li &amp; fung ltdLi &amp; fung ltd
Li &amp; fung ltd
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2
 
Medicaid vs. Medicare in Oregon
Medicaid vs. Medicare in OregonMedicaid vs. Medicare in Oregon
Medicaid vs. Medicare in Oregon
 
Storytime updated ppt
Storytime updated pptStorytime updated ppt
Storytime updated ppt
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
 
2nd Easter A(2)
2nd Easter A(2)2nd Easter A(2)
2nd Easter A(2)
 
Torishima Pump general catalogue, Venezuela, Central America, Caribbean
Torishima Pump general catalogue, Venezuela, Central America, CaribbeanTorishima Pump general catalogue, Venezuela, Central America, Caribbean
Torishima Pump general catalogue, Venezuela, Central America, Caribbean
 

Similar to Dremel: Interactive Analysis of Web-Scale Datasets

Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analytics
Viet-Trung TRAN
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
Ruben Verborgh
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Get Value from Your Data
Get Value from Your DataGet Value from Your Data
Get Value from Your Data
Danilo Poccia
 
User biglm
User biglmUser biglm
User biglm
johnatan pladott
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
Steve Min
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Serban Tanasa
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystem
Grzegorz Kolpuc
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
DataWorks Summit
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
Joydeep Sen Sarma
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
Eelco Visser
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
ragho
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Denodo
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
source{d}
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Amazon Web Services
 

Similar to Dremel: Interactive Analysis of Web-Scale Datasets (20)

Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analytics
 
Querying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern FragmentsQuerying federations 
of Triple Pattern Fragments
Querying federations 
of Triple Pattern Fragments
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Get Value from Your Data
Get Value from Your DataGet Value from Your Data
Get Value from Your Data
 
User biglm
User biglmUser biglm
User biglm
 
[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)[SSA] 04.sql on hadoop(2014.02.05)
[SSA] 04.sql on hadoop(2014.02.05)
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Intro to hadoop ecosystem
Intro to hadoop ecosystemIntro to hadoop ecosystem
Intro to hadoop ecosystem
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Dancing with the Elephant
Dancing with the ElephantDancing with the Elephant
Dancing with the Elephant
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
TI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific LanguagesTI1220 Lecture 14: Domain-Specific Languages
TI1220 Lecture 14: Domain-Specific Languages
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
 

Recently uploaded

ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Akanksha trivedi rama nursing college kanpur.
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
Community pharmacy- Social and preventive pharmacy UNIT 5
Community pharmacy- Social and preventive pharmacy UNIT 5Community pharmacy- Social and preventive pharmacy UNIT 5
Community pharmacy- Social and preventive pharmacy UNIT 5
sayalidalavi006
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
Nicholas Montgomery
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
TechSoup
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
NgcHiNguyn25
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 

Recently uploaded (20)

ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama UniversityNatural birth techniques - Mrs.Akanksha Trivedi Rama University
Natural birth techniques - Mrs.Akanksha Trivedi Rama University
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
Community pharmacy- Social and preventive pharmacy UNIT 5
Community pharmacy- Social and preventive pharmacy UNIT 5Community pharmacy- Social and preventive pharmacy UNIT 5
Community pharmacy- Social and preventive pharmacy UNIT 5
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 

Dremel: Interactive Analysis of Web-Scale Datasets

  • 1. VLDB'10, Singapore Sep 14, 2010 Sergey Melnik Andrey Gubarev Jing Jing Long Geoffrey Romer Shiva Shivakumar   Matt Tolton Theo Vassilakis Dremel: Interactive Analysis of Web-Scale Datasets
  • 2. Speed matters Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Spam Trends Detection Web Dashboards Network Optimization Interactive Tools
  • 3. Example: data exploration Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Runs a MapReduce to extract billions of signals from web pages Googler Alice DEFINE TABLE t AS /path/to/data/* SELECT TOP(signal, 100), COUNT(*) FROM t . . . More MR-based processing on her data (FlumeJava [PLDI'10] , Sawzall [Sci.Pr.'05] ) 1 2 3 Ad hoc SQL against Dremel
  • 4.
  • 5.
  • 6.
  • 7.
  • 8. Records vs. columns A B C D E * * * . . . . . . r 1 r 2 r 1 r 2 r 1 r 2 r 1 r 2 Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Challenge: preserve structure, reconstruct from a subset of fields Read less, cheaper decompression
  • 9. Nested data model message Document { required int64 DocId; [1,1] optional group Links { repeated int64 Backward; [0,*] repeated int64 Forward; } repeated group Name { repeated group Language { required string Code; optional string Country; [0,1] } optional string Url; } } DocId: 10 Links Forward: 20 Forward: 40 Forward: 60 Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A' Name Url: 'http://B' Name Language Code: 'en-gb' Country: 'gb' r 1 DocId: 20 Links Backward: 10 Backward: 30 Forward: 80 Name Url: 'http://C' r 2 Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 http://code.google.com/apis/protocolbuffers multiplicity:
  • 10. Column-striped representation DocId Name.Url Name.Language.Code Name.Language.Country Links.Backward Links.Forward Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 value r d 10 0 0 20 0 0 value r d http://A 0 2 http://B 1 2 NULL 1 1 http://C 0 2 value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1 value r d us 0 3 NULL 2 2 NULL 1 1 gb 1 3 NULL 0 1 value r d 20 0 2 40 1 2 60 1 2 80 0 2 value r d NULL 0 1 10 0 2 30 1 2
  • 11. Repetition and definition levels DocId: 10 Links Forward: 20 Forward: 40 Forward: 60 Name Language Code: ' en-us ' Country: 'us' Language Code: ' en ' Url: 'http://A' Name Url: 'http://B' Name Language Code: 'en-gb' Country: 'gb' r 1 DocId: 20 Links Backward: 10 Backward: 30 Forward: 80 Name Url: 'http://C' r 2 Name.Language.Code r : At what repeated field in the field's path the value has repeated d : How many fields in paths that could be undefined (opt. or rep.) are actually present Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 record (r=0) has repeated r=2 r=1 Language (r=2) has repeated (non-repeating) value r d en-us 0 2 en 2 2 NULL 1 1 en-gb 1 2 NULL 0 1
  • 12. Record assembly FSM Name.Language.Country Name.Language.Code Links.Backward Links.Forward Name.Url DocId 1 0 1 0 0,1,2 2 0,1 1 0 0 For record-oriented data processing (e.g., MapReduce) Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Transitions labeled with repetition levels
  • 13. Reading two fields DocId Name.Language.Country 1,2 0 0 DocId: 10 Name Language Country: 'us' Language Name Name Language Country: 'gb' DocId: 20 Name s 1 s 2 Structure of parent fields is preserved. Useful for queries like /Name[3]/Language[1]/Country Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Both Dremel and MapReduce can read same columnar data
  • 14.
  • 15.
  • 16. SQL dialect for nested data Id: 10 Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en' Name Cnt: 0 t 1 SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS Str FROM t WHERE REGEXP(Name.Url, '^http' ) AND DocId < 20 ; message QueryResult { required int64 Id; repeated group Name { optional uint64 Cnt; repeated group Language { optional string Str; } } } Output table Output schema Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 No record assembly during query processing
  • 17.
  • 18. Example: count() SELECT A, COUNT(B) FROM T GROUP BY A T = {/gfs/1, /gfs/2, …, /gfs/100000} SELECT A, SUM(c) FROM ( R 1 1 UNION ALL R 1 10 ) GROUP BY A SELECT A, COUNT(B) AS c FROM T 1 1 GROUP BY A T 1 1 = {/gfs/1, …, /gfs/10000} SELECT A, COUNT(B) AS c FROM T 1 2 GROUP BY A T 1 2 = {/gfs/10001, …, /gfs/20000} SELECT A, COUNT(B) AS c FROM T 3 1 GROUP BY A T 3 1 = {/gfs/1} . . . 0 1 3 R 1 1 R 1 2 Data access ops Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 . . . . . .
  • 19.
  • 20.
  • 21. Read from disk columns records objects from records from columns ( a ) read + decompress ( b ) assemble   records ( c ) parse as C++ objects ( d ) read + decompress ( e ) parse as C++ objects time (sec) number of fields &quot;cold&quot; time on local disk, averaged over 30 runs Table partition: 375 MB (compressed), 300K rows, 125 columns Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 2-4x overhead of using records 10x speedup using columnar storage
  • 22.
  • 23. Impact of serving tree depth execution time (sec) Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 SELECT country, SUM(item.amount) FROM T2 GROUP BY country SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS ’.net’ GROUP BY domain Q2: Q3: 40 billion nested items (returns 100s of records) (returns 1M records)
  • 24. Scalability execution time (sec) number of leaf servers Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 SELECT TOP(aid, 20), COUNT(*) FROM T4 Q5 on a trillion-row table T4:
  • 25.
  • 26. Interactive speed execution time (sec) percentage of queries Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 Most queries complete under 10 sec Monthly query workload of one 3000-node Dremel instance
  • 27.
  • 28. BigQuery: powered by Dremel Dremel: Interactive Analysis of Web-Scale Datasets. VLDB'10 http://code.google.com/apis/bigquery/ 1. Upload 2. Process Upload your data to Google Storage Import to tables Run queries 3. Act Your Data BigQuery Your Apps