Your SlideShare is downloading. ×
HQL over Tiered Data Warehouse
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

HQL over Tiered Data Warehouse

425
views

Published on

Published in: Technology

0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
425
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Inmobi provides marketplace, where it buys the space on mobile from publishers and sells it to advertisers, meanwhile it acquires users.
  • Adhoc querying system
    Adhoc and Batch queries
    Scheduled queries
    Based on Hadoop Mapreduce
    Provides UI and custom api
    Data is stored in HDFS

    Dashboard system
    Canned reports
    Interactive and adhoc queries
    Provides UI and Custom api
    Data is stored in columnar DWH

    Customer facing system
    Face to the outside world (Advertisers and publishers)
    Interactive and adhoc queries
    Provides UI and custom api
    Data is stored in relational DB, Postgres


    Conventional columnar databases (RDBMS) systems lend themselves well for interactive SQL queries over reasonably small datasets in the order of 10-100s of GB, while hadoop based warehouses operate well over large datasets in the order of TBs and PBs and scales fairly linearly. Though there have been some improvements recently in storage structures in the Hadoop warehouses such as ORC, queries over hadoop still typically adopts a full scan approach. Choosing between these different data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
  • Inmobi has 130TB hadoop warehouse and 5TB SQL warehouse. Let us see an example of reporting page. This is the dashboard a publisher sees.
  • Individually all the systems we just saw work really great! They provide best time responses to user queries.

    Disparate user experience because of multiple reporting systems
    Involves a learning curve for systems and their api
    Disparate data storage systems causing inability to scale
    Altering schema involves different systems
    Data discovery
    Cannot leverage data in other systems
    Not leveraging community around
    Cannot experiment with new storage/execution engine out of the box

  • Column Measure : name, type, default aggregate, format string, start date, end date
    Expression Measure : Associated Expression

    Simple Dimension: name, type, start date, end date
    Referenced Dimension : Referencing table and column
    Hierarchical Dimension :hierarchy
    Expression Dimension : Associated expression
  • The grammar is subset of HQL

    Resolve candidate dimension tables and the storage tables .
    Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid.
    Resolve fact storage tables for the queried time range.
    Automatically resolve joins using the relationships between cubes and dimension.
    Automatically add aggregate functions to measures.
    Add expression to group by clause, if projected; and project group by clause, if it is not.
  • Resolve candidate dimension tables and the storage tables .
    Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid.
    Resolve fact storage tables for the queried time range.
    Automatically resolve joins using the relationships between cubes and dimension.
    Automatically add aggregate functions to measures.
    Add expression to group by clause, if projected; and project group by clause, if it is not.
  • Transcript

    • 1. Grill HQL Tiered Data Warehouse Amareshwari Sriramadasu Suma Shivaprasad
    • 2. Amareshwari Sriramadasu • Engineer at Inmobi • Working in Hadoop and eco systems since 2007 • Apache Hadoop – PMC • Apache Hive– Committer
    • 3. Suma Shivaprasad • Engineer at Inmobi • Working in Hadoop and eco systems since 2011
    • 4. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 5. Digital advertising at Inmobi Courtesy: http://www.liesdamnedlies.com/ Owns & Sells Real estate on digital inventory Has reach to users Wants to target Users Brings money Market place Consumer
    • 6. Publisher/Advertiser specific analytics Understanding trends & inference Feedback to improve Ad Relevance in Real Time Sizing & Estimation (Ex: Inventory sizing, Forecasting) Tracking metrics and Anamoly detection Debugging / Postmortem of issues (troubleshooting) Analytics Use cases Advertisers and publishers Account/Regional managers Executive team Business/Product analysts Data scientists Engineering systems Developers
    • 7. • Canned/Dashboard queries • Adhoc queries • Interactive/Batch queries • Scheduled queries • Infer insights through ML algorithms Categorize the use cases
    • 8. Adhoc querying system Internal Dashboards Customer facing Dashboards and Reporting Analytics systems at Inmobi
    • 9. Analytics Warehouses at Inmobi and Scale  Billions of Ad Requests and Impressions per day  170 TB Hadoop warehouse  5 TB SQL Columnar Datawarehouse  Hbase cluster  DBMS(Postgres, Mysql)  Spark/Shark in near future
    • 10. Why both Hadoop and SQL warehouse? Canned Adhoc Response Times IO (Input Records) Adhoc Canned Adhoc Query Engine Query Dashboard queries are mostly canned and Interactive Adhoc queries can be Interactive or batch depending on the data volumes and query complexity
    • 11. • Disparate user experience • Disparate data storage systems • Schema management across storages • Data discovery • Not leveraging ‘SQL on Hadoop’ community Problems
    • 12. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 13. Associates structure to data Provides Metastore and catalog service – Hcatalog Provides pluggable storage interface Accepts SQL like queries HQL is widely adopted language by systems like Shark, Impala Has strong apache community Data warehouse features like facts, dimensions Logical table associated with multiple physical storages Pluggable execution engine Interactive queries Query lifecycle management Query quota management Scheduling queries Plugging in Machine learning models WhatdoesHiveprovide WhatismissinginHive Apache Hive to the rescue
    • 14. Data Layout – OLAP Cube Dimensions, Measures Cost
    • 15. Data Layout – Snowflake Aggr Factk : measures (mak <= ma(k-1)), dims (dak < da(k-1)) ….. Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw Fact: measures (mr), dimensions(dr) Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1
    • 16. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 17. Data Model Cube Storage Fact Table Physical Fact tables Dimension Table Physical Dimension tables
    • 18. Data Model - Cube Dimension • Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension Measure • Column Measure • Expression Measure Cube Measures Dimensions Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
    • 19. Data Model – Storage Storage • Name • End point • Properties • Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
    • 20. Data Model – Fact Table Fact table Cube Fact table Storage FactTable • Columns • Cube that it belongs • Storages on which it is present and the associated update periods
    • 21. Data Model – Dimension table DimensionTable • Columns • Dimension references • Storages on which it is present and associated snapshot dump period, if any. Cube Dimension table Dimension table Dimension table Storage
    • 22. Data Model – Storage tables and partitions Storagetable • Belongs to fact/dimension • Associated storage descriptor • Partitioned by columns • Naming convention – storage name followed by fact/dimension name • Partition can override its storage descriptor • Fact storage table Fact table • Dimension storage table Dimension table
    • 23. Data Model Aggrk Fact Table ….. Aggr2 Fact Table Aggr1 Fact Table Raw FactTable Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1 CUBE
    • 24. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 25. CUBE SELECT [DISTINCT] select_expr, select_expr, ... FROM cube_table_reference WHERE [where_condition AND] TIME_RANGE_IN(colName , from, to) [GROUP BY col_list] [HAVING having_expr] [ORDER BY colList] [LIMIT number] cube_table_reference: cube_table_factor | join_table join_table: cube_table_reference JOIN cube_table_factor [join_condition] | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition] cube_table_factor: cube_name [alias] | ( cube_table_reference ) join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression colOrder: ( ASC | DESC ) colList : colName colOrder? (',' colName colOrder?)* Queries on OLAP cubes
    • 26. Resolve candidate tables and storages Automatically resolve joins Resolve aggregates and groupby expressions Allows SQL over Cube QL Queries can span multiple storages Accepts multi time range queries All Hive QL features Query features
    • 27. • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100 cube select name, stateid from citytable limit 100 Example query
    • 28. Example query • SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03- 10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03- 10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03- 10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03- 10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12- 00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest') GROUP BY(citytable.name) cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
    • 29. Available in Hive • Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages Available in Grill • Pluggable execution engine for HQL • Query life cycle management • Scheduling queries Where is it available
    • 30. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 31. GRILL Unify the Query layer for Adhoc/Canned Batch/Interactive Reports on single Interface
    • 32. Grill Architecture
    • 33. Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support Pluggable execution engine
    • 34. OLAP Cube QL query Rewrite query for available execution engine’s supported storages Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query Cube query with multiple execution engines
    • 35. Grill Server
    • 36. Grill server – code status
    • 37. • Richer statistics • Query scheduler service • Query caching • Authentication and authorization • Machine learning through Grill • User query quota management Grill roadmap
    • 38. • Number of queries - 700 to 900 per day • Number of dimension tables - 125 • Number of fact tables – 30 • Number cubes – 22 • Size of the data • Total size – 170 TB • Dimension data – 400 MB compressed per hour • Raw data - 1.2 TB per day • Aggregated facts- 53GB per day Data ware house statistics
    • 39. Jira reference • https://issues.apache.org/jira/browse/HIVE-4115 Github source for Hive • https://github.com/InMobi/hive Github source for grill • https://github.com/InMobi/grill Documentation • http://inmobi.github.io/grill Mailing lists • grill-users@googlegroups.com • grill-dev@googlegroups.com References
    • 40. Analytics at Inmobi - Problem areas and Motivation Why Apache Hive OLAP Model in Apache Hive Query examples Grill – Unified analytics Demo Agenda
    • 41. Demo
    • 42. Thank you! • amareshwari@apache.org • suma.shivaprasad@inmobi.com