Grill
Unified Analytics platform
Amareshwari Sriramadasu
Amareshwari Sriramadasu
• Engineer at InMobi
• Working on Hadoop and its ecosystem since 2007
• Apache Hadoop – PMC
• Apache Hive – Committer
Background
Analytics at InMobi
Why Apache Hive
Data cubes in Apache Hive
Queries on data cubes with examples
Grill
Agenda
Global Mobile technology company enabling
- Developers & Publishers to monetize
- Advertisers to engage and acquire users
@ Scale
About InMobi
Digital advertising – Intro
Courtesy: http://www.liesdamnedlies.com/
• Owns and sells real estate on digital inventory
• Has reach to users
• Wants to target users
• Brings money
• Marketplace
• Consumer
Hadoop @ InMobi – Factual Reporting & Analytics
• 130 TB Hadoop warehouse
• 5 TB SQL warehouse
Users
• Advertisers and publishers
• Regional officers
• Account managers
• Executive team
• Business analysts
• Product analysts
• Data scientists
• Engineering systems
• Developers
Use cases
Reporting
Understand trends
Debugging / Postmortem of issues (troubleshooting)
Sizing & Estimation (Ex: inventory, reach)
Summary of Product lines, Geographies, Network (Ex: Rev by Geo)
Use cases
Categorize the use cases
• Batch queries
• Ad hoc queries
• Interactive queries
• Scheduled reports
• Infer insights through ML algorithms
Use cases
• Ad hoc querying system
• Dashboard system
• Customer-facing system
Analytics systems at InMobi
• Disparate user experience
• Disparate data storage systems causing inability to scale
• Schema management
• Not leveraging community around
Problems
What does Hive provide
• Associates structure to data
• Provides Metastore and catalog service – HCatalog
• Provides pluggable storage interface
• Accepts SQL-like batch queries
• HQL is a widely adopted language, used by systems like Shark and Impala
• Has a strong Apache community
What is missing in Hive
• Data warehouse features like facts, dimensions
• Logical table associated with multiple physical storages
• Interactive queries
• Scheduling queries
• UI
Apache Hive to the rescue
Data Layout – Fact data
Aggr_k : measures (m_ak <= m_a(k-1)), dimensions (d_ak < d_a(k-1))
…..
Aggr_2 : measures (m_a2 <= m_a1), dimensions (d_a2 < d_a1)
Aggr_1 : measures (m_a1 <= m_r), dimensions (d_a1 < d_r)
Raw data : measures (m_r), dimensions (d_r)
The other side of the pyramid is aggregated along the timed dimension
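The pyramid can be read as a chain of subset relations: each aggregate keeps a subset of the measures and a strict subset of the dimensions of the level below it. The sketch below checks that property and shows one possible use of it, picking the most aggregated level that still contains the columns a query needs. That selection rule is an assumption for illustration and is not stated on the slide; the level names and column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactLevel:
    name: str
    measures: frozenset
    dimensions: frozenset


# Hypothetical pyramid, listed from raw data (bottom) to the top aggregate.
PYRAMID = [
    FactLevel("raw", frozenset({"msr1", "msr2", "msr3"}),
              frozenset({"cityid", "stateid", "deviceid"})),
    FactLevel("aggr1", frozenset({"msr1", "msr2"}),
              frozenset({"cityid", "stateid"})),
    FactLevel("aggr2", frozenset({"msr2"}), frozenset({"stateid"})),
]


def is_valid_pyramid(levels):
    """Each level has measures <= and dimensions < those of the level below."""
    for lower, upper in zip(levels, levels[1:]):
        if not (upper.measures <= lower.measures
                and upper.dimensions < lower.dimensions):
            return False
    return True


def most_aggregated_level(levels, measures, dimensions):
    """Return the name of the highest level covering the requested columns."""
    for level in reversed(levels):
        if set(measures) <= level.measures and set(dimensions) <= level.dimensions:
            return level.name
    return None
```

With this toy data, a query on `msr2` by `stateid` can be served from `aggr2`, while a query that needs `deviceid` has to fall back to `raw`.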
Data Layout – snowflake
Aggr_k : measures (m_ak <= m_a(k-1)), dimensions (d_ak < d_a(k-1))
…..
Aggr_2 : measures (m_a2 <= m_a1), dimensions (d_a2 < d_a1)
Aggr_1 : measures (m_a1 <= m_r), dimensions (d_a1 < d_r)
Raw data : measures (m_r), dimensions (d_r)
Dimension tables: Dim1, Dim2, Dim2_1, Dim3, Dim4, Dim4_1
Data Model
• Cube
• Storage
• Fact Table
• Physical fact tables
• Dimension Table
• Physical dimension tables
Data Model - Cube
Dimension
• Simple Dimension
• Referenced Dimension
• Hierarchical Dimension
• Expression Dimension
• Timed dimension
Measure
• Column Measure
• Expression Measure
Cube
Measures Dimensions
Note : Some of the concepts are borrowed from
http://community.pentaho.com/projects/mondrian/
Data Model – Storage
Storage
• Name
• End point
• Properties
Ex: ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
Data Model – Fact Table
Fact Table
• Columns
• Cube that it belongs to
• Storages on which it is present and the associated update periods
[Diagram: Cube, Fact tables, Storage]
Data Model – Dimension table
Dimension Table
• Columns
• Dimension references
• Storages on which it is present and the associated snapshot dump period, if any
[Diagram: Cube, Dimension tables, Storage]
Data Model – Storage tables and partitions
Storage table
• Belongs to a fact/dimension
• Has an associated storage descriptor
• Partitioned by columns
• Naming convention – storage name followed by fact/dimension name
• A partition can override its storage descriptor
• Fact storage table
Fact table
• Dimension storage table
Dimension table
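The naming convention above (storage name followed by fact/dimension name) can be sketched as follows. The underscore separator is inferred from table names such as `c1_citytable` and `c2_testfact` in the example queries; the `FactTable` class, its field names, and the update period labels are hypothetical and only mirror the attributes listed on the Fact Table slide.

```python
from dataclasses import dataclass, field


@dataclass
class FactTable:
    name: str
    cube: str
    columns: list
    # storage name -> update periods available on that storage (labels assumed)
    storage_update_periods: dict = field(default_factory=dict)


def storage_table_name(storage, table):
    """Storage name followed by the fact/dimension name, joined by '_'."""
    return f"{storage}_{table}"


def storage_tables(fact):
    """Physical storage table names for every storage the fact is present on."""
    return [storage_table_name(s, fact.name) for s in fact.storage_update_periods]


testfact = FactTable(
    name="testfact",
    cube="testcube",
    columns=["cityid", "msr2"],
    storage_update_periods={"c1": ["HOURLY", "DAILY"], "c2": ["HOURLY"]},
)
print(storage_tables(testfact))
```

One logical fact table thus maps to one physical table per storage, which is what lets a cube query be rewritten against whichever storage is available.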
CUBE SELECT [DISTINCT] select_expr, select_expr, ...
FROM cube_table_reference
WHERE [where_condition AND] TIME_RANGE_IN(colName, from, to)
[GROUP BY col_list]
[HAVING having_expr]
[ORDER BY colList]
[LIMIT number]
cube_table_reference:
    cube_table_factor
  | join_table
join_table:
    cube_table_reference JOIN cube_table_factor [join_condition]
  | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]
cube_table_factor:
    cube_name [alias]
  | ( cube_table_reference )
join_condition:
    ON equality_expression ( AND equality_expression )*
equality_expression:
    expression = expression
colOrder: ( ASC | DESC )
colList: colName colOrder? (',' colName colOrder?)*
Queries on Data cubes
• SELECT (citytable.name), (citytable.stateid) FROM c2_citytable citytable LIMIT 100
• SELECT (citytable.name), (citytable.stateid) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
cube select name, stateid from citytable limit 100
Example query
Example query
• SELECT (citytable.name), sum((testcube.msr2))
  FROM c2_testfact testcube
  INNER JOIN c1_citytable citytable ON ((testcube.cityid) = (citytable.id))
  WHERE ((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05')
    OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08')
    OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11')
    OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14')
    OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17')
    OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20')
    OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23')
    OR (testcube.dt='2014-03-11')
    OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR (testcube.dt='2014-03-12-02'))
    AND (citytable.dt = 'latest')
  GROUP BY (citytable.name)
cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03', '2014-03-12-03')
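The partition list in the rewritten query follows a recognizable pattern: hourly partitions for the partial days at both ends of the range and a single daily partition for the full day in between. A minimal sketch of that expansion is shown below. It assumes the end of the range is exclusive (the slide's filter stops at `2014-03-12-02`) and that only hourly and daily partitions exist; the function names are hypothetical and this is not Grill's actual rewriter.

```python
from datetime import datetime, timedelta

HOUR_FMT = "%Y-%m-%d-%H"
DAY_FMT = "%Y-%m-%d"


def cover_time_range(start, end):
    """Partition values covering [start, end), preferring daily over hourly."""
    cur = datetime.strptime(start, HOUR_FMT)
    stop = datetime.strptime(end, HOUR_FMT)
    parts = []
    while cur < stop:
        next_day = cur + timedelta(days=1)
        if cur.hour == 0 and next_day <= stop:
            # A whole day fits inside the range: use one daily partition.
            parts.append(cur.strftime(DAY_FMT))
            cur = next_day
        else:
            parts.append(cur.strftime(HOUR_FMT))
            cur += timedelta(hours=1)
    return parts


def partition_filter(alias, column, parts):
    """Build the OR-ed partition predicate used in the rewritten query."""
    return " OR ".join(f"({alias}.{column}='{p}')" for p in parts)


parts = cover_time_range("2014-03-10-03", "2014-03-12-03")
print(partition_filter("testcube", "dt", parts))
```

For the range in the example, this yields 21 hourly partitions for 2014-03-10, the daily partition `2014-03-11`, and three hourly partitions for 2014-03-12.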
Available in Hive
• Data warehouse features
like facts, dimensions
• Logical table associated
with multiple physical
storages
Available in Grill
• Pluggable execution engine for
HQL
• Query history, caching
• Scheduling queries
Where is it available
Unified analytics platform at InMobi
• Supports multiple execution engines
• Supports multiple storages
Provides analytics on the system
Provides query history
Grill
Grill Architecture
Implements an interface
• execute
• explain
• executeAsynchronously
• fetchResults
• Specify all storages it can support
Pluggable execution engine
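The interface listed above could look roughly like the following abstract class. The method names follow the slide; the signatures, return types, and the toy `EchoEngine` are assumptions made for illustration and do not reproduce Grill's actual (Java) interface.

```python
import itertools
from abc import ABC, abstractmethod


class ExecutionEngine(ABC):
    """Hypothetical shape of a pluggable execution engine."""

    @abstractmethod
    def supported_storages(self):
        """Names of all storages this engine can query."""

    @abstractmethod
    def explain(self, query):
        """Return a plan description for the query."""

    @abstractmethod
    def execute(self, query):
        """Run the query and return its rows."""

    @abstractmethod
    def execute_asynchronously(self, query):
        """Submit the query and return a handle for fetching results later."""

    @abstractmethod
    def fetch_results(self, handle):
        """Return the rows of a previously submitted query."""


class EchoEngine(ExecutionEngine):
    """Toy engine that returns the query text as its only row."""

    def __init__(self, storages):
        self._storages = set(storages)
        self._results = {}
        self._ids = itertools.count(1)

    def supported_storages(self):
        return set(self._storages)

    def explain(self, query):
        return f"echo plan for: {query}"

    def execute(self, query):
        return [(query,)]

    def execute_asynchronously(self, query):
        handle = f"q{next(self._ids)}"
        self._results[handle] = self.execute(query)
        return handle

    def fetch_results(self, handle):
        return self._results.pop(handle)
```

A caller would submit with `execute_asynchronously`, keep the returned handle, and later pass it to `fetch_results`.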
Cube QL query
Rewrite the query for each available execution engine's supported storages
Get the cost of the rewritten query from each execution engine
Pick the execution engine with the least cost and fire the query
Cube query with multiple execution engines
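The selection steps above can be sketched as a small loop: for each engine, take the query rewritten for a storage that engine supports, ask the engine for a cost, and keep the cheapest. In this sketch the rewritten queries are supplied as a plain mapping and the cost functions are constants; both are stand-ins, and the engine names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Engine:
    name: str
    storages: frozenset
    cost: Callable[[str], float]  # hypothetical cost estimate for a query


def pick_engine(candidates, engines):
    """Return (cost, engine name, rewritten query) with the least cost.

    candidates maps a storage name to the query rewritten for that storage.
    Returns None when no engine supports any candidate storage.
    """
    best = None
    for engine in engines:
        for storage in sorted(engine.storages.intersection(candidates)):
            rewritten = candidates[storage]
            cost = engine.cost(rewritten)
            if best is None or cost < best[0]:
                best = (cost, engine.name, rewritten)
    return best


candidates = {
    "c1": "SELECT name FROM c1_citytable citytable WHERE citytable.dt = 'latest'",
    "c2": "SELECT name FROM c2_citytable citytable",
}
engines = [
    Engine("engine_a", frozenset({"c1"}), lambda q: 100.0),
    Engine("engine_b", frozenset({"c2"}), lambda q: 10.0),
]
print(pick_engine(candidates, engines))
```

The chosen engine would then be handed the rewritten query for execution.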
Grill Server
Grill server – code status
• Number of queries – 700 to 900 per day
• Number of dimension tables – 125
• Number of fact tables – 24
• Number of cubes – 15
• Size of the data
  • Total size – 136 TB
  • Dimension data – 400 MB compressed per hour
  • Raw data – 1.2 TB per day
  • Aggregated facts – 53 GB per day
Data warehouse statistics
Jira reference
• https://issues.apache.org/jira/browse/HIVE-4115
Github source for Hive
• https://github.com/InMobi/hive
Github source for grill
• https://github.com/InMobi/grill
Documentation
• http://inmobi.github.io/grill
References
Thank you!
• amareshwari@apache.org

Grill at bigdata-cloud conf

  • 2. Amareshwari Sriramadasu • Engineer at Inmobi • Working in Hadoop and its ecosystem since 2007 • Apache Hadoop – PMC • Apache Hive – Committer
  • 3. Background Analytics at Inmobi Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
  • 4. Background Analytics at Inmobi Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 5. Global Mobile technology company enabling - Developers & Publishers to monetize - Advertisers to engage and acquire users @ Scale About InMobi
  • 6. Digital advertising – Intro Courtesy: http://www.liesdamnedlies.com/ Owns & Sells Real estate on digital inventory Has reach to users Wants to target Users Brings money Market place Consumer
  • 7. Hadoop @ InMobi – Factual Reporting & Analytics • 130 TB Hadoop warehouse • 5 TB SQL warehouse
  • 8. Users • Advertisers and publishers • Regional officers • Account managers • Executive team • Business analysts • Product analysts • Data scientists • Engineering systems • Developers Use cases
  • 9. Reporting Understand trends Debugging / Postmortem of issues (troubleshooting) Sizing & Estimation (Ex: inventory, reach) Summary of Product lines, Geographies, Network (Ex: Rev by Geo) Use cases
  • 10. Categorize the use cases • Batch queries • Adhoc queries • Interactive queries • Scheduled reports • Infer insights through ML algorithms Use cases
  • 11. Adhoc querying system Dashboard system Customer facing system Analytics systems at Inmobi
  • 12. • Disparate user experience • Disparate data storage systems causing inability to scale • Schema management • Not leveraging community around Problems
  • 13. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 14. Apache Hive to the rescue. What does Hive provide: associates structure to data; provides Metastore and catalog service (HCatalog); provides pluggable storage interface; accepts SQL-like batch queries; HQL is a widely adopted language by systems like Shark and Impala; has a strong Apache community. What is missing in Hive: data warehouse features like facts and dimensions; logical table associated with multiple physical storages; interactive queries; scheduling queries; UI.
  • 15. Background Why Apache Hive Data cubes in Apache Hive Queries on data cubes with examples Grill Agenda
  • 16. Data Layout – Fact data (pyramid, top to bottom):
      Aggr_k : measures (m_ak <= m_a(k-1)), dimensions (d_ak < d_a(k-1))
      …..
      Aggr_2 : measures (m_a2 <= m_a1), dimensions (d_a2 < d_a1)
      Aggr_1 : measures (m_a1 <= m_r), dimensions (d_a1 < d_r)
      Raw data : measures (m_r), dimensions (d_r)
      Other side of pyramid is aggregated at timed dimension
  • 17. Data Layout – snowflake (same pyramid, joined to dimension tables):
      Aggr_k : measures (m_ak <= m_a(k-1)), dimensions (d_ak < d_a(k-1))
      …..
      Aggr_2 : measures (m_a2 <= m_a1), dimensions (d_a2 < d_a1)
      Aggr_1 : measures (m_a1 <= m_r), dimensions (d_a1 < d_r)
      Raw data : measures (m_r), dimensions (d_r)
      Dimension tables in the diagram: Dim1, Dim2, Dim2_1, Dim3, Dim4, Dim4_1
  • 18. Data Model Cube Storage Fact Table Physical Fact tables Dimension Table Physical Dimension tables
  • 19. Data Model - Cube Dimension • Simple Dimension • Referenced Dimension • Hierarchical Dimension • Expression Dimension • Timed dimension Measure • Column Measure • Expression Measure Cube Measures Dimensions Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/
  • 20. Data Model – Storage Storage Name End point Properties Ex : ProdCluster, StagingCluster, Postgres1, HBase1, HBase2
  • 21. Data Model – Fact Table Fact table Cube Fact table Storage Fact Table Columns Cube that it belongs Storages on which it is present and the associated update periods
  • 22. Data Model – Dimension table Dimension Table Columns Dimension references Storages on which it is present and associated snapshot dump period, if any. Cube Dimension table Dimension table Dimension table Storage
  • 23. Data Model – Storage tables and partitions Storage table Belongs to fact/dimension Associated storage descriptor Partitioned by columns Naming convention – storage name followed by fact/dimension name Partition can override its storage descriptor • Fact storage table Fact table • Dimension storage table Dimension table
  • 24. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 25. Queries on Data cubes
      CUBE SELECT [DISTINCT] select_expr, select_expr, ...
      FROM cube_table_reference
      WHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)
      [GROUP BY col_list]
      [HAVING having_expr]
      [ORDER BY colList]
      [LIMIT number]

      cube_table_reference:
        cube_table_factor
        | join_table
      join_table:
        cube_table_reference JOIN cube_table_factor [join_condition]
        | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN cube_table_reference [join_condition]
      cube_table_factor:
        cube_name [alias]
        | ( cube_table_reference )
      join_condition:
        ON equality_expression ( AND equality_expression )*
      equality_expression:
        expression = expression
      colOrder: ( ASC | DESC )
      colList : colName colOrder? (',' colName colOrder?)*
  • 26. Example query
      cube select name, stateid from citytable limit 100
      Rewritten queries shown on the slide:
      • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100
      • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100
  • 27. Example query
      cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03', '2014-03-12-03')
      Rewritten query shown on the slide:
      • SELECT (citytable.name), sum((testcube.msr2))
        FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid) = (citytable.id))
        WHERE ((testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05')
        OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08')
        OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11')
        OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14')
        OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17')
        OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20')
        OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23')
        OR (testcube.dt='2014-03-11')
        OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12-01') OR (testcube.dt='2014-03-12-02'))
        AND (citytable.dt = 'latest')
        GROUP BY (citytable.name)
  • 28. Available in Hive • Data warehouse features like facts, dimensions • Logical table associated with multiple physical storages Available in Grill • Pluggable execution engine for HQL • Query history, caching • Scheduling queries Where is it available
  • 29. Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Grill Agenda
  • 30. Unified analytics platform at Inmobi • Supports multiple execution engines • Supports multiple storages Provides analytics on the system Provides query history Grill
  • 32. Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support Pluggable execution engine
  • 33. Cube QL query Rewrite query for available execution engine’s supported storages Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query Cube query with multiple execution engines
  • 35. Grill server – code status
  • 36. • Number of queries – 700 to 900 per day • Number of dimension tables – 125 • Number of fact tables – 24 • Number of cubes – 15 • Size of the data • Total size – 136 TB • Dimension data – 400 MB compressed per hour • Raw data – 1.2 TB per day • Aggregated facts – 53 GB per day Data warehouse statistics
  • 37. Jira reference • https://issues.apache.org/jira/browse/HIVE-4115 Github source for Hive • https://github.com/InMobi/hive Github source for grill • https://github.com/InMobi/grill Documentation • http://inmobi.github.io/grill References
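
The engine selection described in slides 32 and 33 (rewrite the cube query for each engine's supported storages, ask each engine for a cost, run on the cheapest) can be sketched roughly as below. This is only an illustration under stated assumptions: the `Engine` class, its `rewrite` and `estimate_cost` methods, the per-storage cost table and the engine names are hypothetical and are not Grill's actual interface. The storage names are taken from the examples on slide 20.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Hypothetical stand-in for a pluggable execution engine (not Grill's API)."""
    name: str
    storages: set            # storages this engine declares it can query
    cost_per_storage: dict   # made-up cost model: storage name -> estimated cost

    def rewrite(self, cube_query, candidate_storages):
        """Return (storage, rewritten_query), or None if no candidate storage is supported."""
        usable = self.storages & set(candidate_storages)
        if not usable:
            return None
        storage = min(usable, key=lambda s: self.cost_per_storage[s])
        # Placeholder rewrite: a real rewriter would resolve storage tables and partitions.
        return storage, f"/* {self.name} on {storage} */ {cube_query}"

    def estimate_cost(self, storage):
        """Return this engine's estimated cost for running against the given storage."""
        return self.cost_per_storage[storage]


def pick_engine(cube_query, candidate_storages, engines):
    """Rewrite per engine, collect costs, and return (engine, rewritten_query) with least cost."""
    best = None
    for engine in engines:
        plan = engine.rewrite(cube_query, candidate_storages)
        if plan is None:
            continue  # this engine cannot answer the query from any candidate storage
        storage, rewritten = plan
        cost = engine.estimate_cost(storage)
        if best is None or cost < best[0]:
            best = (cost, engine, rewritten)
    if best is None:
        raise ValueError("no engine supports any candidate storage")
    return best[1], best[2]


engines = [
    Engine("hive", {"ProdCluster", "StagingCluster"}, {"ProdCluster": 100.0, "StagingCluster": 150.0}),
    Engine("jdbc", {"Postgres1"}, {"Postgres1": 5.0}),
]
query = "cube select name, stateid from citytable limit 100"
chosen, rewritten = pick_engine(query, {"ProdCluster", "Postgres1"}, engines)
print(chosen.name, "->", rewritten)
```

How each engine computes its cost is not described in the slides, so the lookup table here only stands in for whatever estimate a real engine would return.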

Editor's Notes

  1. InMobi is a global mobile technology company that lets app developers and other publishers monetize their space on mobile, and lets advertisers engage and acquire users. All this happens at scale: InMobi serves a few billion ad impressions per day.
  2. InMobi provides a marketplace: it buys space on mobile from publishers and sells it to advertisers, while acquiring users.
  3. InMobi has a 130 TB Hadoop warehouse and a 5 TB SQL warehouse. Let us see an example of a reporting page. This is the dashboard a publisher sees.
  4. Users of the analytics system
  5. Adhoc querying system
     • Adhoc and batch queries
     • Scheduled queries
     • Based on Hadoop MapReduce
     • Provides UI and custom API
     • Data is stored in HDFS
     Dashboard system
     • Canned reports
     • Interactive and adhoc queries
     • Provides UI and custom API
     • Data is stored in columnar DWH
     Customer facing system
     • Face to the outside world (advertisers and publishers)
     • Interactive and adhoc queries
     • Provides UI and custom API
     • Data is stored in relational DB, Postgres
     Conventional columnar database (RDBMS) systems lend themselves well to interactive SQL queries over reasonably small datasets, on the order of 10s to 100s of GB, while Hadoop-based warehouses operate well over large datasets, on the order of TBs and PBs, and scale fairly linearly. Though storage structures in Hadoop warehouses have improved recently (ORC, for example), queries over Hadoop still typically adopt a full-scan approach. Choosing between these data stores based on cost of storage, concurrency, scalability and performance is fairly complex and not easy for most users.
  6. Individually, all the systems we just saw work really well. They provide the best response times to user queries. The problems:
     • Disparate user experience because of multiple reporting systems
     • Involves a learning curve for the systems and their APIs
     • Disparate data storage systems causing inability to scale
     • Altering schema involves different systems
     • Data discovery
     • Cannot leverage data in other systems
     • Not leveraging community around
     • Cannot experiment with a new storage/execution engine out of the box
  7. Now let us see why InMobi wants to use Apache Hive.
  8. Measures and dimensions carry the following attributes:
     • Column Measure: name, type, default aggregate, format string, start date, end date
     • Expression Measure: associated expression
     • Simple Dimension: name, type, start date, end date
     • Referenced Dimension: referencing table and column
     • Hierarchical Dimension: hierarchy
     • Expression Dimension: associated expression
  9. The grammar is a subset of HQL. A cube query is resolved as follows:
     • Resolve candidate dimension tables and their storage tables.
     • Resolve the candidate fact tables which can answer the query; pick the ones from the top of the pyramid.
     • Resolve fact storage tables for the queried time range.
     • Automatically resolve joins using the relationships between cubes and dimensions.
     • Automatically add aggregate functions to measures.
     • Add an expression to the group by clause if it is projected, and project a group by expression if it is not.
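
The time-range step in note 9 can be illustrated with the example on slide 27, where the range '2014-03-10-03' to '2014-03-12-03' becomes a list of hourly partitions plus one daily partition for the full day in between. The sketch below reproduces that expansion under some assumptions that are not stated in the slides: the end of the range is treated as exclusive (which matches the partition list on the slide), only hourly and daily partitions exist, and the function names `partitions_for_range` and `where_clause` are made up for illustration. It is not Grill's actual resolver.

```python
from datetime import datetime, timedelta

HOUR_FMT = "%Y-%m-%d-%H"  # hourly partition value, e.g. 2014-03-10-03
DAY_FMT = "%Y-%m-%d"      # daily partition value, e.g. 2014-03-11


def partitions_for_range(start, end):
    """Cover [start, end) with daily partitions where a whole day fits, hourly ones otherwise."""
    cur = datetime.strptime(start, HOUR_FMT)
    stop = datetime.strptime(end, HOUR_FMT)
    parts = []
    while cur < stop:
        if cur.hour == 0 and cur + timedelta(days=1) <= stop:
            # A full calendar day lies inside the range: use the coarser daily partition.
            parts.append(cur.strftime(DAY_FMT))
            cur += timedelta(days=1)
        else:
            parts.append(cur.strftime(HOUR_FMT))
            cur += timedelta(hours=1)
    return parts


def where_clause(alias, col, parts):
    """Build the OR-ed partition filter in the style of the rewritten query on slide 27."""
    return " OR ".join(f"({alias}.{col}='{p}')" for p in parts)


parts = partitions_for_range("2014-03-10-03", "2014-03-12-03")
print(len(parts))
print(where_clause("testcube", "dt", parts))
```

Preferring the daily partition whenever a whole day fits keeps the filter short and reads pre-aggregated data; which update periods are actually available would depend on the fact table and storage, as described on slide 21.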