Datacubes in Apache Hive at ApacheCon

Introduces OLAP cubes in Apache Hive and Grill, a unified analytics platform

Published in: Engineering


  1. Data cubes in Apache Hive • Amareshwari Sriramadasu • Jaideep Dhok
  2. Amareshwari Sriramadasu • Engineer at InMobi • Working in Hadoop and its ecosystem since 2007 • Apache Hadoop – PMC • Apache Hive – Committer
  3. Jaideep Dhok • Engineer at InMobi • Working in distributed systems and the Hadoop ecosystem since 2007 • Apache Hive and Hadoop Contributor
  4. Agenda: Background • Why Apache Hive • Data cubes in Apache Hive • Queries on data cubes with examples • Grill
  5. Agenda: Background • Why Apache Hive • Data cubes in Hive • Queries on data cubes with examples • Grill
  6. About InMobi: Global mobile technology company enabling developers and publishers to monetize, and advertisers to engage and acquire users, at scale
  7. Digital advertising – Intro (courtesy: http://www.liesdamnedlies.com/): Owns & sells real estate on digital inventory • Has reach to users • Wants to target users • Brings money • Marketplace • Consumer
  8. Hadoop @ InMobi – Factual Reporting & Analytics: 130 TB Hadoop warehouse • 5 TB SQL warehouse
  9. Use cases – Users: Advertisers and publishers • Regional officers • Account managers • Executive team • Business analysts • Product analysts • Data scientists • Engineering systems • Developers
  10. Use cases: Sharing reports/data with customers (account level) • Understanding trends through data and exploration (analysis) • Debugging / postmortem of issues (troubleshooting) • Sizing and estimation (e.g. inventory, reach) • Summary of product lines, geographies, network (e.g. revenue by geo) • Sales/revenue targets vs. actuals • Tracking campaign performance • Tracking any metrics on a real-time basis
  11. Use cases – Categorize the use cases: Batch queries • Ad hoc queries • Interactive queries • Scheduled reports • Infer insights through ML algorithms
  12. Analytics systems at InMobi: Ad hoc querying system • Dashboard system • Customer-facing system
  13. Problems: Disparate user experience • Disparate data storage systems causing inability to scale • Schema management • Not leveraging community around
  14. Agenda: Background • Why Apache Hive • Data cubes in Hive • Queries on data cubes with examples • Grill
  15. Apache Hive to the rescue. What does Hive provide: Associates structure to data • Provides Metastore and catalog service (HCatalog) • Provides pluggable storage interface • Accepts SQL-like batch queries • HQL is a widely adopted language, used by systems like Shark and Impala • Has a strong Apache community. What is missing in Hive: Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages • Interactive queries • Scheduling queries • UI
  16. Agenda: Background • Why Apache Hive • Data cubes in Apache Hive • Queries on data cubes with examples • Grill
  17. Data Layout – Fact data (pyramid, top to bottom): Aggrk: measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) • ... • Aggr2: measures (ma2 <= ma1), dimensions (da2 < da1) • Aggr1: measures (ma1 <= mr), dimensions (da1 < dr) • Raw data: measures (mr), dimensions (dr). Other side of pyramid is aggregated at timed dimension
  18. Data Layout – snowflake (pyramid, top to bottom): Aggrk: measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) • ... • Aggr2: measures (ma2 <= ma1), dimensions (da2 < da1) • Aggr1: measures (ma1 <= mr), dimensions (da1 < dr) • Raw data: measures (mr), dimensions (dr). Dimension tables: Dim1 • Dim2 • Dim2_1 • Dim3 • Dim4 • Dim4_1
  19. Data Model: Cube • Storage • Fact Table • Physical Fact tables • Dimension Table • Physical Dimension tables
  20. Data Model – Cube: A cube consists of measures and dimensions. Dimension: Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension. Measure: Column Measure • Expression Measure. Note: Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
  21. Data Model – Storage: Name • End point • Properties • Ex: ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
  22. Data Model – Fact Table: Columns • Cube that it belongs to • Storages on which it is present and the associated update periods (diagram labels: Cube, Fact table, Storage)
  23. Data Model – Dimension table: Columns • Dimension references • Storages on which it is present and associated snapshot dump period, if any (diagram labels: Cube, Dimension table, Storage)
  24. Data Model – Storage tables and partitions: Storage table: Belongs to fact/dimension • Associated storage descriptor • Partitioned by columns • Naming convention: storage name followed by fact/dimension name • Partition can override its storage descriptor. Diagram labels: Fact storage table, Fact table • Dimension storage table, Dimension table
  25. Agenda: Background • Why Apache Hive • Data cubes in Hive • Queries on data cubes with examples • Grill
  26. Queries on Data cubes:
          CUBE SELECT [DISTINCT] select_expr, select_expr, ...
          FROM cube_table_reference
          WHERE [where_condition AND] TIME_RANGE_IN(colName, from, to)
          [GROUP BY col_list]
          [HAVING having_expr]
          [ORDER BY colList]
          [LIMIT number]
          cube_table_reference: cube_table_factor | join_table
          join_table: cube_table_reference JOIN cube_table_factor [join_condition]
                    | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]
          cube_table_factor: cube_name [alias] | ( cube_table_reference )
          join_condition: ON equality_expression ( AND equality_expression )*
          equality_expression: expression = expression
          colOrder: ( ASC | DESC )
          colList: colName colOrder? (',' colName colOrder?)*
  27. Query features: Resolve candidate tables and storages • Automatically resolve joins • Resolve aggregates and group-by expressions • Allows SQL over Cube QL • Queries can span multiple storages • Accepts multi-time-range queries • All Hive QL features
  28. Example query. Cube query:
          cube select name, stateid from citytable limit 100
      Rewritten Hive queries:
          SELECT (citytable.name), (citytable.stateid) FROM c2_citytable citytable LIMIT 100
          SELECT (citytable.name), (citytable.stateid) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
  29. Example query. Cube query:
          cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03', '2014-03-12-03')
      Rewritten Hive query:
          SELECT (citytable.name), sum((testcube.msr2))
          FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid) = (citytable.id))
          WHERE ((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05')
              OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08')
              OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11')
              OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14')
              OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17')
              OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20')
              OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23')
              OR (testcube.dt='2014-03-11')
              OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR (testcube.dt='2014-03-12-02'))
            AND (citytable.dt = 'latest')
          GROUP BY (citytable.name)
  30. Where is it available. Available in Hive: Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages. Available in Grill: Pluggable execution engine for HQL • Query history, caching • Scheduling queries
  31. Agenda: Background • Why Apache Hive • Data cubes in Hive • Queries on data cubes with examples • Grill
  32. Grill: Unified analytics platform at InMobi • Supports multiple execution engines • Supports multiple storages • Provides analytics on the system • Provides query history
  33. Grill Architecture
  34. Pluggable execution engine: Implements an interface: execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support
  35. Cube query with multiple execution engines: Cube QL query • Rewrite query for available execution engine's supported storages • Get cost of the rewritten query from each execution engine • Pick the execution engine with least cost and fire the query
  36. Grill Server
  37. Grill server – code status
  38. Data warehouse statistics: Number of queries: 700 to 900 per day • Number of dimension tables: 125 • Number of fact tables: 24 • Number of cubes: 15 • Size of the data: Total size: 136 TB • Dimension data: 400 MB compressed per hour • Raw data: 1.2 TB per day • Aggregated facts: 53 GB per day
  39. References: Jira reference: https://issues.apache.org/jira/browse/HIVE-4115 • GitHub source for Hive: https://github.com/InMobi/hive • GitHub source for Grill: https://github.com/InMobi/grill • Documentation: http://inmobi.github.io/grill
  40. Thank you! • amareshwari@apache.org • jaideep.dhok@inmobi.com
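
The rewritten query on slide 29 covers the requested time range with hourly partitions at the edges and a single daily partition for the full day in between. The following Python sketch reproduces that expansion under stated assumptions: the function names `expand_time_range` and `partition_filter` are hypothetical, the end of the range is treated as exclusive, and only hourly and daily partitions are considered. It illustrates the idea and is not the Hive or Grill implementation.

```python
from datetime import datetime, timedelta

HOUR_FMT = "%Y-%m-%d-%H"  # hourly partition value, e.g. 2014-03-10-03
DAY_FMT = "%Y-%m-%d"      # daily partition value, e.g. 2014-03-11


def expand_time_range(start, end):
    """Cover [start, end) with daily partitions where a whole day fits,
    and hourly partitions otherwise (assumed behaviour, see lead-in)."""
    cur = datetime.strptime(start, HOUR_FMT)
    stop = datetime.strptime(end, HOUR_FMT)
    parts = []
    while cur < stop:
        whole_day_fits = cur.hour == 0 and cur + timedelta(days=1) <= stop
        if whole_day_fits:
            parts.append(cur.strftime(DAY_FMT))
            cur += timedelta(days=1)
        else:
            parts.append(cur.strftime(HOUR_FMT))
            cur += timedelta(hours=1)
    return parts


def partition_filter(alias, column, parts):
    """Build the OR-ed partition predicate in the style of the rewritten query."""
    terms = [f"({alias}.{column}='{p}')" for p in parts]
    return "(" + " OR ".join(terms) + ")"


if __name__ == "__main__":
    parts = expand_time_range("2014-03-10-03", "2014-03-12-03")
    print(len(parts))
    print(partition_filter("testcube", "dt", parts))
```

For the range on slide 29 this yields 21 hourly partitions for 2014-03-10, the daily partition 2014-03-11, and three hourly partitions for 2014-03-12, matching the partition list in the rewritten query.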
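
Slides 34 and 35 describe a pluggable execution engine and a selection step: rewrite the cube query for each engine, ask each engine for a cost, and run the query on the cheapest one. The Python sketch below shows that flow. Everything in it is an assumption made for illustration: the class names `ExecutionEngine` and `FixedCostEngine`, the method names `rewrite` and `estimate_cost`, and the fixed cost values do not come from Grill's code base.

```python
from abc import ABC, abstractmethod


class ExecutionEngine(ABC):
    """Hypothetical engine interface, loosely modeled on slide 34."""

    def __init__(self, name, storages):
        self.name = name
        self.storages = set(storages)  # storages this engine can query

    @abstractmethod
    def rewrite(self, cube_query):
        """Rewrite a cube query for this engine's supported storages."""

    @abstractmethod
    def estimate_cost(self, engine_query):
        """Return a comparable cost for the rewritten query."""

    @abstractmethod
    def execute(self, engine_query):
        """Run the rewritten query and return its result."""


class FixedCostEngine(ExecutionEngine):
    """Toy engine with a constant cost, for demonstration only."""

    def __init__(self, name, storages, cost):
        super().__init__(name, storages)
        self.cost = cost

    def rewrite(self, cube_query):
        storage = sorted(self.storages)[0]
        return f"/* {self.name} on {storage} */ {cube_query}"

    def estimate_cost(self, engine_query):
        return self.cost

    def execute(self, engine_query):
        return f"{self.name} ran: {engine_query}"


def run_on_cheapest(cube_query, engines):
    """Rewrite per engine, compare costs, run on the least-cost engine."""
    candidates = []
    for engine in engines:
        rewritten = engine.rewrite(cube_query)
        candidates.append((engine.estimate_cost(rewritten), engine, rewritten))
    _, engine, rewritten = min(candidates, key=lambda c: c[0])
    return engine.name, engine.execute(rewritten)


if __name__ == "__main__":
    engines = [
        FixedCostEngine("hive", ["ProdCluster"], cost=10.0),
        FixedCostEngine("jdbc", ["Postgres1"], cost=2.5),
    ]
    print(run_on_cheapest("cube select name from citytable", engines))
```

Keeping the cost estimate behind the engine interface means a new engine can be added without touching the selection logic, which is the point of the pluggable design on slide 34.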
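
Slide 17 lays out fact data as a pyramid in which each aggregate holds a subset of the measures and dimensions of the level below it, and slide 27 lists "resolve candidate tables" as a query feature. A minimal Python sketch of that resolution follows. The `FactTable` type, the table and column names, and the tie-breaking rule (prefer the candidate with the fewest dimensions, on the assumption that it is the most aggregated and therefore smallest) are all illustrative assumptions, not the rule used by Hive or Grill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTable:
    """Hypothetical description of one level of the fact pyramid."""
    name: str
    measures: frozenset
    dimensions: frozenset


def candidate_facts(facts, measures, dimensions):
    """Keep only fact tables that contain every queried measure and dimension."""
    needed_measures = set(measures)
    needed_dimensions = set(dimensions)
    return [
        fact for fact in facts
        if needed_measures <= fact.measures and needed_dimensions <= fact.dimensions
    ]


def pick_fact(facts, measures, dimensions):
    """Choose the candidate with the fewest dimensions (assumed cheapest)."""
    candidates = candidate_facts(facts, measures, dimensions)
    if not candidates:
        raise ValueError("no fact table can answer this query")
    return min(candidates, key=lambda fact: len(fact.dimensions))


# Example pyramid with made-up column names.
FACTS = [
    FactTable("raw", frozenset({"msr1", "msr2"}),
              frozenset({"cityid", "stateid", "countryid"})),
    FactTable("aggr1", frozenset({"msr1", "msr2"}),
              frozenset({"cityid", "stateid"})),
    FactTable("aggr2", frozenset({"msr2"}), frozenset({"stateid"})),
]

if __name__ == "__main__":
    print(pick_fact(FACTS, ["msr2"], ["cityid"]).name)
    print(pick_fact(FACTS, ["msr2"], ["stateid"]).name)
```

A query on `msr2` by `cityid` cannot use `aggr2`, which lacks that dimension, so the sketch falls back to `aggr1`; a query on `msr2` by `stateid` can use the smallest aggregate.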
