Query Generation Across
Multiple Data Stores
Hiral Patel
Who am I?
● Sr Principal Architect/Director of Engineering at
Yahoo - Gemini Reporting
● Big Data at GridX, Klout, eBay/Shopping.com,
Ask.com, and HP using
Hadoop/HBase/Hive/Pig/Oozie/Ab Initio/Oracle/DB2
Agenda
● What and Why
● Evolving Query Generation
● Why not Kylin or Lens?
● Results
● What’s Next?
● Data Warehouse / OLAP Queries
○ Star Schema
■ Dimensions - reference for
measures, denormalized
■ Facts - measures
○ Snowflake Schema
■ Normalized dimensions
What kind of query?
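A star-schema reporting query typically joins a fact table to a dimension table and aggregates a measure. The following is a minimal sketch using Python's built-in sqlite3 module; the `advertiser` dimension, the trimmed-down `ad_stats` fact table, and the sample rows are hypothetical and only illustrate the shape of such a query.

```python
import sqlite3


def spend_by_advertiser(conn):
    """Join the fact table to its dimension and aggregate the measure."""
    return conn.execute(
        """
        SELECT d.name, SUM(f.spend) AS spend
        FROM ad_stats f                               -- fact: holds the measure
        JOIN advertiser d ON d.id = f.advertiser_id   -- dimension: reference data
        GROUP BY d.name
        ORDER BY d.name
        """
    ).fetchall()


# Hypothetical in-memory data set, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE advertiser (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ad_stats (ad_id INTEGER, advertiser_id INTEGER, spend REAL);
    INSERT INTO advertiser VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO ad_stats VALUES (10, 1, 2.5), (11, 1, 1.5), (12, 2, 4.0);
    """
)
rows = spend_by_advertiser(conn)
```

In a snowflake schema the dimension itself would be normalized, so the same report would need further joins from `advertiser` to its parent tables.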
● OLAP cube is a method of storing data in
a multidimensional form that is optimized
for reporting queries across dimensions
What do we mean by OLAP Cube?
Table Name      Columns
ad_stats        ad_id, ad_grp_id, campaign_id, advertiser_id, spend
ad_grp_stats    ad_grp_id, campaign_id, advertiser_id, spend
campaign_stats  campaign_id, advertiser_id, spend
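One way to read the three tables above is as rollups of the same measure at different granularities. A query generator could then answer a request from the coarsest table that still has every requested column. The sketch below assumes that fewer columns means a coarser, cheaper rollup; that rule and the function name `pick_table` are illustrative assumptions, not the talk's implementation.

```python
# Rollup tables from the slide, keyed by table name.
TABLES = {
    "ad_stats": {"ad_id", "ad_grp_id", "campaign_id", "advertiser_id", "spend"},
    "ad_grp_stats": {"ad_grp_id", "campaign_id", "advertiser_id", "spend"},
    "campaign_stats": {"campaign_id", "advertiser_id", "spend"},
}


def pick_table(requested):
    """Return the table with the fewest columns that covers the request."""
    requested = set(requested)
    candidates = [
        (len(columns), name)
        for name, columns in TABLES.items()
        if requested <= columns  # table must contain every requested column
    ]
    if not candidates:
        raise ValueError(f"no table covers {sorted(requested)}")
    return min(candidates)[1]
```

For example, a request for `campaign_id` and `spend` would be served from `campaign_stats`, while any request that includes `ad_id` can only be served from `ad_stats`.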
● Fields/Columns
● Data Store/Engine
● Dimension Driven
● Fact Driven
Terminology
● Centralized reporting system
● Multiple use cases
● Simple interface
Why query generation?
● Druid
● Apache Spark
● PrestoDB
● Apache Drill
● Kudu
● Impala
● BigQuery
What do you choose?
● MemSQL
● Redshift/ParAccel
● Vertica
● Netezza/IBM
● Greenplum
● Teradata
● Exadata/Oracle RAC
● Evolving Technology
○ Start simple
○ Scale the business
○ Use the right tool for the job
○ Mixture of vertical and horizontal
scaling
○ Support incremental migration
○ Cost of migration
Why multiple data stores?
● Dimension / Metadata Interface
Evolving Query Generation - Take 1
● Fact / Stats interface
Evolving Query Generation - Take 1
● Challenges
○ Difficult to scale
○ Not generic enough
○ Not easy to optimize
Evolving Query Generation - Take 1
● SQL-like DSL for cube definitions
Evolving Query Generation - Take 2
● Annotation-based constraint definitions
Evolving Query Generation - Take 2
● Easily define rollups
Evolving Query Generation - Take 2
● SQL construction by inspecting definitions
● Easier to optimize the query at construction
time
● Engine-specific SQL
Evolving Query Generation - Take 2
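Engine-specific SQL construction can be illustrated with a row limit, which Hive and Oracle spell differently. This is a minimal sketch: `build_sql` and its parameters are hypothetical, and the Oracle branch assumes Oracle 12c or later, where `FETCH FIRST n ROWS ONLY` is available.

```python
def build_sql(engine, table, dims, metric, limit):
    """Build an aggregate query, rendering the row limit per engine."""
    select = ", ".join(dims + [f"SUM({metric}) AS {metric}"])
    sql = f"SELECT {select} FROM {table} GROUP BY {', '.join(dims)}"
    if engine == "hive":
        return f"{sql} LIMIT {limit}"
    if engine == "oracle":
        # Assumes Oracle 12c+; older versions need a ROWNUM wrapper instead.
        return f"{sql} FETCH FIRST {limit} ROWS ONLY"
    raise ValueError(f"unsupported engine: {engine}")


hive_sql = build_sql("hive", "campaign_stats", ["campaign_id"], "spend", 10)
oracle_sql = build_sql("oracle", "campaign_stats", ["campaign_id"], "spend", 10)
```

Because the query is assembled from a definition rather than written by hand, the generator is the single place where such dialect differences need to be handled.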
● Challenges
○ No intelligence for selecting a data
store beyond available columns
○ Difficult to extend
○ Annotations encouraged arbitrary special-casing
and duplication
Evolving Query Generation - Take 2
● Dimension table definition
Evolving Query Generation - Take 3
● Fact table definition
Evolving Query Generation - Take 3
● Cube definition
Evolving Query Generation - Take 3
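The definitions shown on these slides did not survive in the text, so the following is only a hypothetical sketch of how dimension, fact, and cube definitions might relate to one another. Every class and field name here is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DimensionTable:
    name: str
    engine: str            # data store that hosts this table
    primary_key: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class FactTable:
    name: str
    engine: str
    foreign_keys: FrozenSet[str]   # keys that reference dimensions
    measures: FrozenSet[str]


@dataclass(frozen=True)
class Cube:
    name: str
    facts: Tuple[FactTable, ...]
    dimensions: Tuple[DimensionTable, ...]

    def engines_for(self, measure):
        """List the engines that can serve a given measure."""
        return sorted({f.engine for f in self.facts if measure in f.measures})


# Hypothetical cube whose fact data lives in two engines.
cube = Cube(
    name="campaign_cube",
    facts=(
        FactTable("campaign_stats", "oracle", frozenset({"advertiser_id"}), frozenset({"spend"})),
        FactTable("campaign_stats", "druid", frozenset({"advertiser_id"}), frozenset({"spend"})),
    ),
    dimensions=(
        DimensionTable("advertiser", "oracle", "id", frozenset({"id", "name"})),
    ),
)
```

Keeping table definitions separate from the cube that groups them is one way to let the same logical cube span several engines.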
● Easier to add new data stores/engines
through generalization and better
separation of concerns
Evolving Query Generation - Take 3
● Cost-based engine selection with
pluggable cost estimators
○ Dimension cost - due to join cardinality
○ Fact cost - due to number of rows
scanned
Evolving Query Generation - Take 3
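Cost-based engine selection with pluggable estimators can be sketched as a mapping from engine name to a cost function, with the cheapest engine winning. The weights below are invented for illustration and do not reflect measured behaviour of any engine; `Request`, `select_engine`, and `ESTIMATORS` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    rows_scanned: int       # drives the fact cost
    join_cardinality: int   # drives the dimension cost


CostEstimator = Callable[[Request], float]

# Hypothetical weights: one engine scans cheaply but joins expensively,
# the other is the reverse.
ESTIMATORS: Dict[str, CostEstimator] = {
    "druid": lambda r: r.rows_scanned * 0.001 + r.join_cardinality * 1.0,
    "oracle": lambda r: r.rows_scanned * 0.01 + r.join_cardinality * 0.01,
}


def select_engine(request, estimators=ESTIMATORS):
    """Return the engine whose estimator reports the lowest cost."""
    return min(estimators, key=lambda engine: estimators[engine](request))


scan_heavy = select_engine(Request(rows_scanned=1_000_000, join_cardinality=10))
join_heavy = select_engine(Request(rows_scanned=1_000, join_cardinality=100_000))
```

Since estimators are plain callables, a new data store can be supported by registering another entry without touching the selection logic.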
● Partitioning-aware definitions with
pluggable partitioning scheme
Evolving Query Generation - Take 3
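A pluggable partitioning scheme can be thought of as a function that maps a requested date range to the partition values a query must touch. The sketch below assumes daily partitions named `YYYYMMDD`; that naming convention, the column name `dt`, and both function names are assumptions for illustration.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """List daily partition values covering start..end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def partition_predicate(column, start, end, scheme=daily_partitions):
    """Render a filter on the partition column using the given scheme."""
    values = ", ".join(f"'{p}'" for p in scheme(start, end))
    return f"{column} IN ({values})"


predicate = partition_predicate("dt", date(2016, 11, 1), date(2016, 11, 3))
```

Swapping `scheme` for an hourly or monthly function would change the generated predicate without changing the query builder.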
● Versioning of cube definitions
● Bucket testing of new definitions
○ User list
○ Internal users
○ Dry run
○ External users
● Time-zone-aware definitions with
pluggable time provider
Evolving Query Generation - Take 3
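Bucket testing of a new cube definition version needs a stable way to decide which users see it. A minimal sketch, assuming users are identified by a string id: an explicit user list always gets the new version, and everyone else is assigned by a hash so the same user lands in the same bucket on every request. The function name, the bucket labels, and the percentage parameter are hypothetical.

```python
import hashlib


def bucket_for(user_id, percent_on_new, allow_list=frozenset()):
    """Return "new" or "current" for a user, deterministically."""
    if user_id in allow_list:
        return "new"  # explicit user list takes priority
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % 100  # stable slot in 0..99
    return "new" if slot < percent_on_new else "current"
```

A dry run could use the same assignment but generate the query for the new version without returning its results to the user.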
● Querying across multiple engines
Evolving Query Generation - Take 3
● Kylin - end-to-end product for managing
your OLAP needs, ANSI SQL
● Lens - manages definitions and query
lifecycle, Cube QL
Why not Kylin or Lens?
● Library/Framework approach built upon
Star/Snowflake schema data model
● Easy to customize and optimize
generators
● Simple JSON interface
Why not Kylin or Lens?
Evolving Query Generation - JSON Input
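The JSON request shown on this slide is not preserved in the text. As a purely hypothetical illustration of what a simple JSON interface could look like, the sketch below parses a request naming a cube and the fields wanted, and rejects fields the cube does not define. The keys `cube` and `fields` are invented for this example.

```python
import json

# Hypothetical registry of cube name -> known fields.
CUBE_FIELDS = {"campaign_stats": {"campaign_id", "advertiser_id", "spend"}}


def parse_request(raw):
    """Parse and validate a JSON reporting request."""
    request = json.loads(raw)
    cube = request["cube"]
    if cube not in CUBE_FIELDS:
        raise ValueError(f"unknown cube: {cube}")
    unknown = set(request["fields"]) - CUBE_FIELDS[cube]
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return request


request = parse_request('{"cube": "campaign_stats", "fields": ["campaign_id", "spend"]}')
```

Validating against the cube definition up front lets errors be reported before any engine-specific query is generated.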
● Druid - number of cores
● Oracle - hints
● Hive/Oracle - predicate pushdown
Evolving Query Generation - Optimizations
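Predicate pushdown in a generated query means applying the fact filter inside the fact subquery, before the join, so fewer rows reach the join. A minimal sketch, run against SQLite only to show that the generated SQL is valid; the table names, sample rows, and `pushed_down_sql` are hypothetical, and the string interpolation is for illustration only (a real generator would need to guard against injection).

```python
import sqlite3


def pushed_down_sql(fact, dim, key, fact_filter):
    """Filter and aggregate the fact table first, then join the dimension."""
    inner = (
        f"SELECT {key}, SUM(spend) AS spend FROM {fact} "
        f"WHERE {fact_filter} GROUP BY {key}"
    )
    return (
        f"SELECT d.name, f.spend FROM ({inner}) f "
        f"JOIN {dim} d ON d.id = f.{key} ORDER BY d.name"
    )


sql = pushed_down_sql("ad_stats", "advertiser", "advertiser_id", "spend > 1.0")

# Hypothetical data to exercise the generated statement.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE advertiser (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ad_stats (ad_id INTEGER, advertiser_id INTEGER, spend REAL);
    INSERT INTO advertiser VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO ad_stats VALUES (10, 1, 2.5), (11, 1, 0.5), (12, 2, 4.0);
    """
)
rows = conn.execute(sql).fetchall()
```

The 0.5 row is dropped inside the subquery, so it never takes part in the aggregation or the join.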
● Millions of OLAP queries per day
● 30+ cube definitions across 3 data stores
(Hive, Oracle, Druid)
● Current query generation is 3x faster than
previous version
● 20% less code, more features, better
validation and error handling
Evolving Query Generation - Results
● Integration with Caravel (of Airbnb fame)
Evolving Query Generation - Results
● Add Fact/Dim view support
● Add Fact/Fact join support
● Resource availability for engine selection
● Data availability for engine selection
● Open source (should we?)
Future Work
Team/Contributors
● Shengyao Qian
● Pranav Bhole
● Jian Shen
● Surabhi Pandit
● Pavan Arakere Badarinath
● Shravana Krishnamurthy
● Ravi Chotrani
● Narayanan Krishnamoorthy
● Priyanka Gupta
● Raghu Kumar
● Rashmi Prabhu
● Vivek Chauhan
● Aleksey Sanin
● Seshasai Kuchimanchi
● Santhosh Joshi
● Kurt Maegerle
● Parveen Kumar
● Remesh Balakrishnan
● Himanshu Gupta
Evolving Query Generation - Show Cube Definition
Hiral Patel
github: patelh
email: hiral@yahoo-inc.com
Questions?

Query generation across multiple data stores [SBTB 2016]
