2. Scale
Large data sizes:
~3B records per day (~2.5 TB of uncompressed JSON).
~100 primary and ~300 derived dimensions.
~50 measures.
Analysis horizon: years.
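The stated volumes imply a rough per-record size. A back-of-the-envelope check (the per-record figure is derived here, not stated on the slide; decimal terabytes assumed):

```python
# Numbers from the slide: ~3B records/day, ~2.5 TB/day of uncompressed JSON.
records_per_day = 3_000_000_000
bytes_per_day = 2.5e12            # decimal TB assumed

# Derived (not stated in the deck): average uncompressed record size.
avg_record_size = bytes_per_day / records_per_day
print(f"~{avg_record_size:.0f} bytes per uncompressed JSON record")  # ~833 bytes
```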
3. Scope
Highly dynamic data and analytic needs:
Frequent addition of new dimensions.
Very dynamic query patterns.
Both canned and ad-hoc reports.
Multiple phase-shifted large data streams.
Different kinds of consumers: sales, analysts, execs, machines.
4. The Beginnings: Perl to MR (Hadoop)
Logs summarized using Perl at low volumes (order of hundreds of thousands).
Perl could not handle increased volumes (millions). (Q2, 2010)
MR jobs to aggregate logs and populate the DB (3-machine cluster).
DB views increased; creating MR jobs was time consuming, error prone, and hard. (Q3, 2010)
5. Solving for Pipeline - Pig
MR: a new job per need; known by few.
Pig: well suited for medium-complexity pipeline jobs.
Data gets aggregated using Pig and pushed to the DB for analytics.
6. Analytics gets complex
Business evolved; complex analytics needed.
DB suffers 'limited angle view' problems: proliferation of materialized views.
Hive: not mature (early 2011); too resource-hungry on small/medium clusters; lots of flux; not optimal; difficult to fix things and add features.
Back to Pig: a team of engineers writing ad-hoc Pig scripts for the business. Performance only as good as the person writing the query; very low productivity.
7. Realization
Frequently, 'tools' don't work as intended: too much customization and constant tuning.
Difficult to absorb the dynamics of the data.
Too generic, and not optimal for our data models and cluster size.
Parts of the required stack are difficult to integrate and maintain.
Pig is not suited for analytics by business users; too much technical knowledge needed.
8. Yoda
Developed an in-house system to satisfy ad-hoc analytics.
Complete stack (ETL, query processor, query builder, visualization) on top of Hadoop, for processing logs & analytics. (Q1, 2011)
SQL-like operations: Select, Sum, Avg, Min, Max, Count, Distinct, Decode, Expressions, GroupBy, Where, Having, UDF, UDAF, etc.
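The SQL-like operations above can be sketched in miniature. This is an illustrative Python model of Select/Sum/GroupBy/Where/Having over log records, not Yoda's actual API; the field names are hypothetical:

```python
from collections import defaultdict

def run_query(rows, group_by, measure, where=None, having=None):
    """SELECT group_by, SUM(measure) FROM rows WHERE ... GROUP BY ... HAVING ..."""
    groups = defaultdict(float)
    for row in rows:
        if where and not where(row):     # WHERE: filter individual records
            continue
        groups[row[group_by]] += row[measure]   # GROUP BY + SUM aggregate
    # HAVING: filter on the aggregated values, not the raw records
    return {k: v for k, v in groups.items() if having is None or having(v)}

logs = [
    {"country": "US", "clicks": 10},
    {"country": "US", "clicks": 5},
    {"country": "IN", "clicks": 2},
]
print(run_query(logs, "country", "clicks", having=lambda total: total > 3))
# {'US': 15.0}
```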
9. Yoda cont.
Heavily optimized storage and queries for the data model.
All the fact data streams and metadata in a coherent, seamless view.
Platform: UI as well as API (to embed the functionality in other apps).
10. Life of a Query
UI: Validate; convert to protobuf; transmit JSON; collect data; format and output CSV; update status; notify user.
Optimizations: Select metadata -> query metadata; fact promotions: select cube; create joins; estimate cost; GroupBy/Where priority; select optimal grain; determine split size.
Driver: Filter push-down; optimize query via reorganization; generate MR spec.
Mapper: Filter at record reconstruction; apply fact filters; perform join; dim filter; Select/Group; partial aggregation.
Reducer: Do aggregate; apply formula; Having; Top N.
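The driver -> mapper -> reducer flow above can be simulated end to end. A minimal sketch, assuming hypothetical record fields and a simplified "MR spec"; none of these names come from Yoda itself:

```python
from collections import defaultdict

def driver(query):
    # Driver: push the filter down into the map phase and emit an "MR spec".
    return {"filter": query["where"], "key": query["group_by"],
            "measure": query["measure"], "having": query["having"],
            "top_n": query["top_n"]}

def mapper(spec, records):
    # Mapper: apply pushed-down filters, then partial aggregation per shard.
    partial = defaultdict(int)
    for r in records:
        if spec["filter"](r):
            partial[r[spec["key"]]] += r[spec["measure"]]
    return partial

def reducer(spec, partials):
    # Reducer: merge partial aggregates, apply HAVING, then Top-N.
    totals = defaultdict(int)
    for p in partials:
        for k, v in p.items():
            totals[k] += v
    kept = {k: v for k, v in totals.items() if spec["having"](v)}
    return sorted(kept.items(), key=lambda kv: -kv[1])[:spec["top_n"]]

spec = driver({"where": lambda r: r["imps"] > 0, "group_by": "site",
               "measure": "imps", "having": lambda v: v >= 2, "top_n": 2})
shard1 = mapper(spec, [{"site": "a", "imps": 1}, {"site": "b", "imps": 3}])
shard2 = mapper(spec, [{"site": "a", "imps": 2}, {"site": "c", "imps": 0}])
print(reducer(spec, [shard1, shard2]))  # [('a', 3), ('b', 3)]
```

The key property, matching the slide, is that partial aggregation happens per mapper shard, so only small per-group totals cross the shuffle boundary.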
11. What worked
Efficiency in modeling and joins:
Solid data modeling. Wasteful to perform joins on the fly; single-stage MR to both group and join.
Map-side metadata joins: efficient horizontal, vertical & filtered data load.
Pre-join metadata once.
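A map-side (broadcast) join works because the pre-joined metadata is small enough to load once per mapper, so fact records are joined without a shuffle. A sketch with illustrative, hypothetical schema names:

```python
# Small dimension table, pre-joined once and broadcast to every mapper.
DIM = {1: "US", 2: "IN"}

def map_side_join(fact_records, dim=DIM):
    """Join each fact record against the in-memory dimension table."""
    for rec in fact_records:
        country = dim.get(rec["geo_id"])
        if country is None:        # filtered load: drop unjoinable rows early
            continue
        yield {**rec, "country": country}

facts = [{"geo_id": 1, "imps": 10}, {"geo_id": 9, "imps": 4}]
print(list(map_side_join(facts)))
# [{'geo_id': 1, 'imps': 10, 'country': 'US'}]
```

Dropping unjoinable rows inside the mapper is what makes the "filtered data load" cheap: bad rows never reach the shuffle or the reducer.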
12. What worked cont.
Simplicity: transparent cube and aggregate selection (no From or Join clause).
Ability to absorb data dynamics.
Intuitive query builder.
Analytics, not 'just' query.
Support for 'scheduled' ad-hoc queries.
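Transparent cube selection (no From or Join clause) can be modeled as: pick the cheapest pre-aggregated cube whose dimensions cover the query's dimensions. A minimal sketch; the cube names, dimension sets, and row counts are hypothetical:

```python
# name -> (dimensions available in the cube, approximate row count)
CUBES = {
    "daily_geo":      ({"day", "country"}, 1e6),
    "daily_geo_site": ({"day", "country", "site"}, 1e8),
    "raw":            ({"day", "country", "site", "user"}, 3e9),
}

def select_cube(query_dims):
    """Return the smallest cube that can answer a query over query_dims."""
    candidates = [(rows, name) for name, (dims, rows) in CUBES.items()
                  if query_dims <= dims]          # cube must cover the query
    return min(candidates)[1]                     # cheapest covering cube

print(select_cube({"day", "country"}))   # daily_geo
print(select_cube({"day", "site"}))      # daily_geo_site
```

The user never names a table; the planner answers from the cheapest cube that covers the requested dimensions, which is what makes the From/Join clause unnecessary.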