Query Generation Across
Multiple Data Stores
Hiral Patel
Who am I?
● Sr Principal Architect/Director of Engineering at
Yahoo - Gemini Reporting
● Big Data at GridX, Klout, eBay/Shopping.com,
Ask.com, and HP using
Hadoop/HBase/Hive/Pig/Oozie/Ab Initio/Oracle/DB2
Agenda
● What and Why
● Evolving Query Generation
● Why not Kylin or Lens?
● Results
● What’s Next?
● Data Warehouse / OLAP Queries
○ Star Schema
■ Dimensions - reference for
measures, denormalized
■ Facts - measures
○ Snowflake Schema
■ Normalized dimensions
What kind of query?
● OLAP cube is a method of storing data in
a multidimensional form that is optimized
for reporting queries across dimensions
What do we mean by OLAP Cube?
Table Name        Columns
ad_stats          ad_id, ad_grp_id, campaign_id, advertiser_id, spend
ad_grp_stats      ad_grp_id, campaign_id, advertiser_id, spend
campaign_stats    campaign_id, advertiser_id, spend
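
Each rollup table above keeps fewer dimension columns than the one before it, so a query can be served by the most aggregated table that still contains every requested column. A minimal Python sketch of that idea, using the three tables from the slide; the `pick_table` helper and the "fewest columns wins" rule are illustrative assumptions, not the implementation described in the talk.

```python
# Rollup tables from the slide, listed with their columns.
TABLES = {
    "ad_stats": ["ad_id", "ad_grp_id", "campaign_id", "advertiser_id", "spend"],
    "ad_grp_stats": ["ad_grp_id", "campaign_id", "advertiser_id", "spend"],
    "campaign_stats": ["campaign_id", "advertiser_id", "spend"],
}


def pick_table(requested):
    """Return the table with the fewest columns that covers all requested fields."""
    candidates = [
        name for name, cols in TABLES.items() if set(requested) <= set(cols)
    ]
    if not candidates:
        raise ValueError(f"no table covers {sorted(requested)}")
    # Fewer columns is used here as a stand-in for "more aggregated".
    return min(candidates, key=lambda name: len(TABLES[name]))
```

For example, a request for `campaign_id` and `spend` can use `campaign_stats`, while any request that includes `ad_id` has to fall back to `ad_stats`.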
● Fields/Columns
● Data Store/Engine
● Dimension Driven
● Fact Driven
Terminology
● Centralized reporting system
● Multiple use cases
● Simple interface
Why query generation?
● Druid
● Apache Spark
● PrestoDB
● Apache Drill
● Kudu
● Impala
● BigQuery
What do you choose?
● MemSQL
● Redshift/ParAccel
● Vertica
● Netezza/IBM
● Greenplum
● Teradata
● Exadata/Oracle RAC
● Evolving Technology
○ Start simple
○ Scale the Business
○ Use the right tool for the job
○ Mixture of vertical and horizontal
scaling
○ Support incremental migration
○ Cost of migration
Why multiple data stores?
● Dimension / Metadata Interface
Evolving Query Generation - Take 1
● Fact / Stats interface
Evolving Query Generation - Take 1
● Challenges
○ Difficult to scale
○ Not generic enough
○ Not easy to optimize
Evolving Query Generation - Take 1
● SQL-like DSL for cube definitions
Evolving Query Generation - Take 2
● Annotation-based constraint definitions
Evolving Query Generation - Take 2
● Easily define rollups
Evolving Query Generation - Take 2
● SQL construction by inspecting definitions
● Easier to optimize query at construction
time
● Engine specific SQL
Evolving Query Generation - Take 2
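
The idea of building SQL by inspecting a definition, then emitting engine-specific syntax, can be sketched as follows. This is a hedged illustration only: the `Cube` class, the `build_sql` function, and the choice of row limiting as the dialect difference are assumptions made for the example, not the DSL or generator shown in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cube:
    """Hypothetical cube definition: one table, its dimensions and metrics."""
    table: str
    dimensions: tuple
    metrics: tuple


def build_sql(cube, dims, metrics, engine, limit=None):
    """Build a GROUP BY query after validating fields against the cube definition."""
    unknown = [f for f in dims if f not in cube.dimensions]
    unknown += [f for f in metrics if f not in cube.metrics]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    select = list(dims) + [f"SUM({m}) AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(select)} FROM {cube.table}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    if limit is not None:
        # Row limiting is one place where SQL dialects differ.
        if engine == "oracle":
            sql += f" FETCH FIRST {limit} ROWS ONLY"  # Oracle 12c and later
        elif engine == "hive":
            sql += f" LIMIT {limit}"
        else:
            raise ValueError(f"unsupported engine: {engine}")
    return sql
```

Because the field check runs before any SQL is assembled, invalid requests fail at construction time rather than at the data store.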
● Challenges
○ No intelligence for selecting a data
store beyond available columns
○ Difficult to extend
○ Annotations promoted arbitrary special
casing, duplication
Evolving Query Generation - Take 2
● Dimension table definition
Evolving Query Generation - Take 3
● Fact table definition
Evolving Query Generation - Take 3
● Cube definition
Evolving Query Generation - Take 3
● Easier to add new data stores/engines
through generalization and better
separation of concerns
Evolving Query Generation - Take 3
● Cost-based engine selection with
pluggable cost estimators
○ Dimension cost - due to join cardinality
○ Fact cost - due to number of rows
scanned
Evolving Query Generation - Take 3
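
Cost-based selection with pluggable estimators can be pictured as one cost function per engine, with the lowest estimate winning. The sketch below assumes the two cost inputs named on the slide (rows scanned and join cardinality); the `QueryStats` type, the `select_engine` function, and the weights are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class QueryStats:
    fact_rows_scanned: int  # estimated rows read from the fact table
    join_cardinality: int   # estimated dimension rows joined


# A cost estimator maps query statistics to a unitless cost (lower is better).
CostEstimator = Callable[[QueryStats], float]


def select_engine(stats: QueryStats, estimators: Dict[str, CostEstimator]) -> str:
    """Return the engine whose estimator reports the lowest cost."""
    if not estimators:
        raise ValueError("no engines registered")
    return min(estimators, key=lambda engine: estimators[engine](stats))


# Hypothetical weights: one engine scans cheaply, the other joins cheaply.
ESTIMATORS: Dict[str, CostEstimator] = {
    "druid": lambda s: 0.001 * s.fact_rows_scanned + 1.0 * s.join_cardinality,
    "oracle": lambda s: 0.01 * s.fact_rows_scanned + 0.01 * s.join_cardinality,
}
```

Adding a data store then means registering one more estimator, which keeps the selection logic itself unchanged.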
● Partitioning-aware definitions with
pluggable partitioning scheme
Evolving Query Generation - Take 3
● Versioning of cube definitions
● Bucket testing of new definitions
○ User list
○ Internal users
○ Dry run
○ External users
● Timezone-aware definitions with
pluggable time provider
Evolving Query Generation - Take 3
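
Bucket testing of a new cube definition needs a stable way to decide which version a given user sees. One possible sketch, assuming an explicit user list plus a percentage rollout; the function names, the hashing scheme, and the "new"/"current" labels are assumptions for illustration, not the mechanism used in the talk.

```python
import hashlib


def bucket_for(user_id: str, percent_new: int) -> str:
    """Deterministically assign a user to the 'new' or 'current' cube version."""
    if not 0 <= percent_new <= 100:
        raise ValueError("percent_new must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % 100  # stable slot in 0..99
    return "new" if slot < percent_new else "current"


def version_for(user_id: str, user_list: set, percent_new: int) -> str:
    """Users on the explicit list always get the new version; others are hashed."""
    if user_id in user_list:
        return "new"
    return bucket_for(user_id, percent_new)
```

Hashing the user ID rather than drawing a random number keeps each user on the same version across requests.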
Evolving Query Generation - Take 3
● Querying across multiple engines
Evolving Query Generation - Take 3
● Kylin - end to end product for managing
your OLAP needs, ANSI-SQL
● Lens - manages definitions and query
lifecycle, Cube QL
Why not Kylin or Lens?
● Library/Framework approach built upon
Star/Snowflake schema data model
● Easy to customize and optimize
generators
● Simple JSON interface
Why not Kylin or Lens?
Evolving Query Generation - JSON Input
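
The JSON input shown on this slide is not preserved in the text. The sketch below is therefore a guess at what a simple JSON report request could look like: every key name (`cube`, `fields`, `filters`) and the `parse_request` helper are hypothetical, not the actual interface.

```python
import json

# Hypothetical request shape; the key names are assumptions.
REQUEST = """
{
  "cube": "campaign_stats",
  "fields": ["advertiser_id", "spend"],
  "filters": [{"field": "advertiser_id", "op": "=", "value": 42}]
}
"""

REQUIRED_KEYS = {"cube", "fields"}


def parse_request(text):
    """Parse a JSON report request and check that required keys are present."""
    request = json.loads(text)
    missing = REQUIRED_KEYS - request.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    request.setdefault("filters", [])  # filters are optional
    return request
```

The caller names only a cube and fields; which data store answers the request is left to the generator.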
● Druid - # of cores
● Oracle - Hints
● Hive/Oracle - predicate push down
Evolving Query Generation - Optimizations
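
Predicate push down, in this context, means applying a filter inside the fact subquery so fewer rows reach the join. A minimal sketch under stated assumptions: the `campaign` dimension table, its `status` column, and both helper functions are hypothetical, and values are interpolated directly only to keep the example short (real code should use bind parameters).

```python
def push_down(filters, fact_columns):
    """Split filters into those the fact subquery can apply and the rest."""
    fact_side = [f for f in filters if f[0] in fact_columns]
    outer = [f for f in filters if f[0] not in fact_columns]
    return fact_side, outer


def render(fact_table, dim_table, join_key, filters, fact_columns):
    """Render a fact/dimension join with fact filters pushed into the subquery."""
    fact_side, outer = push_down(filters, fact_columns)
    inner = f"SELECT * FROM {fact_table}"
    if fact_side:
        inner += " WHERE " + " AND ".join(f"{c} {op} {v}" for c, op, v in fact_side)
    sql = (
        f"SELECT * FROM ({inner}) f "
        f"JOIN {dim_table} d ON f.{join_key} = d.{join_key}"
    )
    if outer:
        # Remaining filters are assumed to reference dimension columns.
        sql += " WHERE " + " AND ".join(f"d.{c} {op} {v}" for c, op, v in outer)
    return sql
```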
● Millions of OLAP queries per day
● 30+ cube definitions across 3 data stores
(Hive, Oracle, Druid)
● Current query generation is 3x faster than
previous version
● 20% less code, more features, better
validation and error handling
Evolving Query Generation - Results
● Integration with Caravel (of Airbnb fame)
Evolving Query Generation - Results
● Add Fact/Dim view support
● Add Fact/Fact join support
● Resource availability for engine selection
● Data availability for engine selection
● Open source (should we?)
Future Work
Team/Contributors
● Shengyao Qian
● Pranav Bhole
● Jian Shen
● Surabhi Pandit
● Pavan Arakere Badarinath
● Shravana Krishnamurthy
● Ravi Chotrani
● Narayanan Krishnamoorthy
● Priyanka Gupta
● Raghu Kumar
● Rashmi Prabhu
● Vivek Chauhan
● Aleksey Sanin
● Seshasai Kuchimanchi
● Santhosh Joshi
● Kurt Maegerle
● Parveen Kumar
● Remesh Balakrishnan
● Himanshu Gupta
Evolving Query Generation - Show Cube
Definition
Hiral Patel
github: patelh
email: hiral@yahoo-inc.com
Questions?

Query generation across multiple data stores [SBTB 2016]
