Grill
Unified Analytics platform
Amareshwari Sriramadasu
Amareshwari
Sriramadasu
• Engineer at Inmobi
• Working in Hadoop and eco
systems since 2007
• Apache Hadoop – PMC
• Apache...
Background
Analytics at Inmobi
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
Background
Analytics at Inmobi
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Global Mobile technology company enabling
- Developers & Publishers to monetize
- Advertisers to engage and acquire users
...
Digital advertising – Intro
Courtesy: http://www.liesdamnedlies.com/
Owns & Sells
Real estate
on digital
inventory
Has rea...
Hadoop @ InMobi – Factual Reporting & Analytics
 130 TB Hadoop warehouse
 5 TB SQL warehouse
Users
• Advertisers and publishers
• Regional officers
• Account managers
• Executive team
• Business analysts
• Product a...
Reporting
Understand trends
Debugging / Postmortem of issues (troubleshooting)
Sizing & Estimation (Ex: inventory, reach)
...
Categorize the use cases
• Batch queries
• Adhoc queries
• Interactive queries
• Scheduled reports
• Infer insights throug...
Adhoc querying system Dashboard system
Customer facing system
Analytics systems at Inmobi
• Disparate user experience
• Disparate data storage systems causing inability to scale
• Schema management
• Not leveragi...
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Associates structure to data
Provides Metastore and catalog
service – Hcatalog
Provides pluggable storage
interface
Accept...
Background
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
Data Layout – Fact data
Aggrk : measures (mak
<= ma(k-1)), dimensions
(dak < da(k-1))
…..
Aggr2 : measures (ma2 <= ma1),
d...
Data Layout – snowflake
Aggrk : measures
(mak <= ma(k-
1)), dimensions (dak
< da(k-1))
…..
Aggr2 : measures (ma2 <=
ma1), ...
Data Model
Cube Storage Fact Table
Physical
Fact tables
Dimension
Table
Physical
Dimension
tables
Data Model - Cube
Dimension
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Ti...
Data Model – Storage
Storage
Name
End point
Properties
Ex : ProdCluster,
StagingCluster, Postgres1,
HBase1, HBase2
Data Model – Fact Table
Fact
table
Cube
Fact
table
Storage
Fact Table
Columns
Cube that it belongs
Storages on which it is...
Data Model – Dimension table
Dimension Table
Columns
Dimension references
Storages on which it is present
and associated s...
Data Model – Storage tables and partitions
Storage table
Belongs to fact/dimension
Associated storage
descriptor
Partition...
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
CUBE SELECT [DISTINCT]
select_expr, select_expr, ...
FROM cube_table_reference
WHERE [where_condition AND]
TIME_RANGE_IN(c...
• SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable
citytable LIMIT 100
• SELECT ( citytable . name )...
Example query
• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER
JOIN c1_citytable citytable ...
Available in Hive
• Data warehouse features
like facts, dimensions
• Logical table associated
with multiple physical
stora...
Background
Why Apache Hive
Data cubes in Hive
Queries on data cubes with examples
Grill
Agenda
Unified analytics platform at Inmobi
• Supports multiple execution engines
• Supports multiple storages
Provides analytics...
Grill Architecture
Implements an interface
• execute
• explain
• executeAsynchronously
• fetchResults
• Specify all storages it can support
P...
Cube QL query
Rewrite query for available execution engine’s
supported storages
Get cost of the rewritten query from each
...
Grill Server
Grill server – code status
• Number of queries - 700 to 900 per day
• Number of dimension tables - 125
• Number of fact tables – 24
• Number cubes – ...
Jira reference
• https://issues.apache.org/jira/browse/HIVE-4115
Github source for Hive
• https://github.com/InMobi/hive
G...
Thank you!
• amareshwari@apache.org
Upcoming SlideShare
Loading in...5
×

Grill at bigdata-cloud conf

198

Published on

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
198
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
5
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • Inmobi is global mobile technology company, which allows app developers and other publishers to use their space on the mobile and monetize. And allows advertisers to engage and acquire users. And all this at scale – InMobiserves few billions of ad impressions per day.
  • Inmobi provides marketplace, where it buys the space on mobile from publishers and sells it to advertisers, meanwhile it acquires users.
  • Inmobi has 130TB hadoop warehouse and 5TB SQL warehouse. Let us see an example of reporting page. This is the dashboard a publisher sees.
  • Users of analytics system
  • Adhoc querying systemAdhoc and Batch queriesScheduled queriesBased on HadoopMapreduceProvides UI and custom apiData is stored in HDFSDashboard systemCanned reportsInteractive and adhoc queriesProvides UI and Custom apiData is stored in columnar DWHCustomer facing systemFace to the outside world (Advertisers and publishers)Interactive and adhoc queriesProvides UI and custom apiData is stored in relational DB, PostgresConventional columnar databases (RDBMS) systems lend themselves well for interactive SQL queries over reasonably small datasets in the order of 10-100s of GB, while hadoop based warehouses operate well over large datasets in the order of TBs and PBs and scales fairly linearly. Though there have been some improvements recently in storage structures in the Hadoop warehouses such as ORC, queries over hadoop still typically adopts a full scan approach. Choosing between these different data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
  • Individually all the systems we just saw work really great! They provide best time responses to user queries.Disparate user experience because of multiple reporting systemsInvolves a learning curve for systems and their apiDisparate data storage systems causing inability to scaleAltering schema involves different systemsData discoveryCannot leverage data in other systemsNot leveraging community aroundCannot experiment with new storage/execution engine out of the box
  • Now let us see why Inmobi wants to use Apache Hive
  • Column Measure : name, type, default aggregate, format string, start date, end dateExpression Measure : Associated ExpressionSimple Dimension: name, type, start date, end dateReferenced Dimension : Referencing table and columnHierarchical Dimension :hierarchyExpression Dimension : Associated expression
  • The grammar is subset of HQLResolve candidate dimension tables and the storage tables .Resolve the candidate fact tables which can answer the query, pick the ones from top of the pyramid.Resolve fact storage tables for the queried time range.Automatically resolve joins using the relationships between cubes and dimension.Automatically add aggregate functions to measures.Add expression to group by clause, if projected; and project group by clause, if it is not.
  • Grill at bigdata-cloud conf

    1. 1. Grill Unified Analytics platform Amareshwari Sriramadasu
    2. 2. Amareshwari Sriramadasu • Engineer at Inmobi • Working in Hadoop and eco systems since 2007 • Apache Hadoop – PMC • Apache Hive– Committer
    3. 3. Background Analytics at Inmobi Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
    4. 4. Background Analytics at Inmobi Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
    5. 5. Global Mobile technology company enabling - Developers & Publishers to monetize - Advertisers to engage and acquire users @ Scale About InMobi
    6. 6. Digital advertising – Intro Courtesy: http://www.liesdamnedlies.com/ Owns & Sells Real estate on digital inventory Has reach to users Wants to target Users Brings money Market place Consumer
    7. 7. Hadoop @ InMobi – Factual Reporting & Analytics  130 TB Hadoop warehouse  5 TB SQL warehouse
    8. 8. Users • Advertisers and publishers • Regional officers • Account managers • Executive team • Business analysts • Product analysts • Data scientists • Engineering systems • Developers Use cases
    9. 9. Reporting Understand trends Debugging / Postmortem of issues (troubleshooting) Sizing & Estimation (Ex: inventory, reach) Summary of Product lines, Geographies, Network (Ex: Rev by Geo) Use cases
    10. 10. Categorize the use cases • Batch queries • Adhoc queries • Interactive queries • Scheduled reports • Infer insights through ML algorithms Use cases
    11. 11. Adhoc querying system Dashboard system Customer facing system Analytics systems at Inmobi
    12. 12. • Disparate user experience • Disparate data storage systems causing inability to scale • Schema management • Not leveraging community around Problems
    13. 13. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
    14. 14. Associates structure to data Provides Metastore and catalog service – Hcatalog Provides pluggable storage interface Accepts SQL like batch queries HQL is widely adopted language by systems like Shark, Impala Has strong apache community Data warehouse features like facts, dimensions Logical table associated with multiple physical storages Interactive queries Scheduling queries UI WhatdoesHiveprovide WhatismissinginHive Apache Hive to the rescue
    15. 15. Background Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
    16. 16. Data Layout – Fact data Aggrk : measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) ….. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw data : measures (mr), dimensions(dr) Other side of pyramid is aggregated at timed dimension
    17. 17. Data Layout – snowflake Aggrk : measures (mak <= ma(k- 1)), dimensions (dak < da(k-1)) ….. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw data : measures (mr), dimensions(dr) Dim2_1 Dim3 Dim2 Dim4_1 Dim4 Dim1
    18. 18. Data Model Cube Storage Fact Table Physical Fact tables Dimension Table Physical Dimension tables
    19. 19. Data Model - Cube Dimension • Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension Measure • Column Measure • Expression Measure Cube Measures Dimensions Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
    20. 20. Data Model – Storage Storage Name End point Properties Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
    21. 21. Data Model – Fact Table Fact table Cube Fact table Storage Fact Table Columns Cube that it belongs Storages on which it is present and the associated update periods
    22. 22. Data Model – Dimension table Dimension Table Columns Dimension references Storages on which it is present and associated snapshot dump period, if any. Cube Dimension table Dimension table Dimension table Storage
    23. 23. Data Model – Storage tables and partitions Storage table Belongs to fact/dimension Associated storage descriptor Partitioned by columns Naming convention – storage name followed by fact/dimension name Partition can override its storage descriptor • Fact storage table Fact table • Dimension storage table Dimension table
    24. 24. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
    25. 25. CUBE SELECT [DISTINCT] select_expr, select_expr, ... FROM cube_table_reference WHERE [where_condition AND] TIME_RANGE_IN(colName , from, to) [GROUP BY col_list] [HAVING having_expr] [ORDER BY colList] [LIMIT number] cube_table_reference: cube_table_factor | join_table join_table: cube_table_reference JOIN cube_table_factor [join_condition] | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition] cube_table_factor: cube_name [alias] | ( cube_table_reference ) join_condition: ON equality_expression ( AND equality_expression )* equality_expression: expression = expression colOrder: ( ASC | DESC ) colList : colName colOrder? (',' colName colOrder?)* Queries on Data cubes
    26. 26. • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100 cube select name, stateid from citytable limit 100 Example query
    27. 27. Example query • SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03- 10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03- 10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03- 10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03- 10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12- 00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest') GROUP BY(citytable.name) cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)
    28. 28. Available in Hive • Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages Available in Grill • Pluggable execution engine for HQL • Query history, caching • Scheduling queries Where is it available
    29. 29. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
    30. 30. Unified analytics platform at Inmobi • Supports multiple execution engines • Supports multiple storages Provides analytics on the system Provides query history Grill
    31. 31. Grill Architecture
    32. 32. Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support Pluggable execution engine
    33. 33. Cube QL query Rewrite query for available execution engine’s supported storages Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query Cube query with multiple execution engines
    34. 34. Grill Server
    35. 35. Grill server – code status
    36. 36. • Number of queries - 700 to 900 per day • Number of dimension tables - 125 • Number of fact tables – 24 • Number cubes – 15 • Size of the data • Total size – 136 TB • Dimension data – 400 MB compressed per hour • Raw data - 1.2 TB per day • Aggregated facts- 53GB per day Data ware house statistics
    37. 37. Jira reference • https://issues.apache.org/jira/browse/HIVE-4115 Github source for Hive • https://github.com/InMobi/hive Github source for grill • https://github.com/InMobi/grill Documentation • http://inmobi.github.io/grill References
    38. 38. Thank you! • amareshwari@apache.org
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×