Analytics @ Inmobi
      Gaurav Agarwal
    gaurav@inmobi.com
Scale
Large data sizes
 ~3B records per day ≈ 2.5T of
 uncompressed JSON.
 ~100 primary, ~300 derived
 dimensions.
 ~50 measures.
 Analysis horizon: years.
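
To put the figures above in perspective, here is a rough back-of-envelope sketch. It reads the slide's "2.5T" as 2.5 terabytes (10^12 bytes) and assumes traffic is spread evenly over the day; both are assumptions, not statements from the deck.

```python
# Figures from the slide; "T" is assumed to mean 10**12 bytes.
RECORDS_PER_DAY = 3_000_000_000
BYTES_PER_DAY = 2.5e12
SECONDS_PER_DAY = 24 * 60 * 60

# Average size of one uncompressed JSON record.
avg_record_bytes = BYTES_PER_DAY / RECORDS_PER_DAY

# Average ingest rate, assuming a flat traffic profile (real traffic has peaks).
records_per_second = RECORDS_PER_DAY / SECONDS_PER_DAY

# Raw volume accumulated over a one-year analysis horizon.
bytes_per_year = BYTES_PER_DAY * 365

print(f"average record size: ~{avg_record_bytes:.0f} bytes")
print(f"average ingest rate: ~{records_per_second:,.0f} records/s")
print(f"one year of raw logs: ~{bytes_per_year / 1e12:.1f} TB uncompressed")
```

Under these assumptions a record averages a little over 800 bytes, the pipeline sees roughly 35,000 records per second on average, and a single year of raw logs is on the order of 900 TB uncompressed.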
Scope
Highly Dynamic Data and Analytic needs
  Frequent addition of newer dimensions.
  Very dynamic query patterns.
  Both canned and ad-hoc reports.
  Multiple phase-shifted large data
  streams.
  Different kinds of consumers: sales,
  analysts, execs, machines.
The Beginnings: Perl to MR (Hadoop)
Logs summarized using Perl. Low volumes
(order of hundreds of thousands).
Perl could not handle increased volumes
(millions). (2Q, 2010)
MR jobs to aggregate logs and populate the DB
(3-machine cluster).
DB views increased; creating MR jobs became time
consuming, error prone and hard. (3Q, 2010)
Solving for Pipeline - Pig
MR: New job per need; known by few.
Pig: Well suited for medium complexity
pipeline jobs.
Data gets aggregated using Pig and
pushed to DB for analytics.
Analytics gets complex
Business evolved; complex analytics needed.
DB suffers ‘limited angle view’ problems.
Proliferation of materialized views.
Hive: not mature (early 2011), too resource-hungry
on small/medium clusters, a lot of flux, not
optimal, difficult to fix things and add features.
Back to Pig: Team of engineers writing ad-hoc Pig
scripts for business; performance only as good as the
person writing the query; very low productivity.
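
The "proliferation of materialized views" mentioned above can be made concrete with a small count. This is an illustrative sketch only: it assumes one materialized view per distinct set of group-by dimensions and uses the "~100 primary" dimensions figure from the Scale slide; the deck does not say how its views were actually defined.

```python
from math import comb

PRIMARY_DIMENSIONS = 100  # "~100 primary" dimensions from the Scale slide

# Number of distinct group-by column sets of a given size.
for k in (1, 2, 3):
    print(f"{k}-dimension views: {comb(PRIMARY_DIMENSIONS, k):,}")

# Every non-empty subset of dimensions is a potential view.
all_views = 2 ** PRIMARY_DIMENSIONS - 1
print(f"all possible views: ~{all_views:.2e}")
```

Even restricting reports to three dimensions at a time gives well over a hundred thousand candidate views, which suggests why a fixed set of pre-built DB views only ever covers a limited angle of the data.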
Realization
Frequently ‘tools’ don’t work as intended. Too
much customization and constant tuning.
Difficult to absorb the dynamics of the data.
Too generic and not optimal for our data
models and cluster size.
Parts of the required stack – difficult to
integrate and maintain.
Pig not suited for analytics by business. Too
much technical knowledge needed.
Yoda
Developed an in-house system to satisfy
ad-hoc analytics.
Complete Stack (ETL, Query Processor,
Query Builder, Visualization) on top of
Hadoop, for processing logs & analytics.
(Q1, 2011)
SQL-like operations such as Select, Sum, Avg,
Min, Max, Count, Distinct, Decode,
Expressions, GroupBy, Where, Having,
UDF, UDAF etc.
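
As a rough illustration of how a few of these operations compose (Where, GroupBy, Sum, Having), here is a minimal in-memory sketch. The `run_query` function, the record fields and the query shape are all hypothetical; the deck does not show Yoda's actual query format or API.

```python
from collections import defaultdict

# Hypothetical fact records; field names are illustrative, not Inmobi's schema.
records = [
    {"country": "IN", "os": "android", "clicks": 3, "impressions": 100},
    {"country": "IN", "os": "ios", "clicks": 1, "impressions": 40},
    {"country": "US", "os": "android", "clicks": 5, "impressions": 90},
    {"country": "US", "os": "android", "clicks": 0, "impressions": 10},
]

def run_query(rows, group_by, sums, where=None, having=None):
    """Evaluate Where -> GroupBy + Sum -> Having over in-memory rows."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Where: drop rows before any grouping happens.
        if where is not None and not where(row):
            continue
        key = tuple(row[d] for d in group_by)
        for m in sums:
            totals[key][m] += row[m]
    result = []
    for key, agg in totals.items():
        out = dict(zip(group_by, key))
        out.update(agg)
        # Having: filter on the aggregated values.
        if having is None or having(out):
            result.append(out)
    return result

report = run_query(
    records,
    group_by=["country"],
    sums=["clicks", "impressions"],
    where=lambda r: r["os"] == "android",
    having=lambda r: r["impressions"] >= 100,
)
print(report)
```

Note that the caller names only dimensions, measures and filters; there is no table name or join in the request.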
Yoda cont..
Heavily optimized storage and
queries for the data model.
All the fact data streams and
metadata in a coherent, seamless
view.
Platform: UI as well as API (to
embed the functionality in other
apps).
Life of a Query
  UI: convert to protobuf; transmit JSON.
  Optimizations: metadata -> fact promotions (GroupBy, Where).
  Validate: validate query; select cube; select optimal grain.
  Select: select metadata; create joins; estimate cost;
  select priority; determine split size.
  (Driver): optimize query via reorganization; generate MR spec.
  (Mapper): filter push down at record reconstruction; fact filters;
  join; dim filter; select/group; partial aggregation.
  (Reducer): do aggregate; apply formula; perform Having; top N.
  Collect data: format and output CSV; update status; notify user.
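
The mapper and reducer stages listed above can be sketched in plain Python as follows. This is a simplified single-process illustration, not Yoda's implementation: the `mapper` and `reducer` functions, the `SITE_DIM` table, the CTR formula and all field names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical dimension metadata, small enough to hold in memory in each mapper.
SITE_DIM = {1: {"publisher": "p1", "country": "IN"},
            2: {"publisher": "p2", "country": "US"},
            3: {"publisher": "p3", "country": "IN"}}

def mapper(fact_rows, min_impressions, country):
    """Fact filter -> join -> dim filter -> partial aggregation for one split."""
    partial = defaultdict(lambda: [0, 0])  # publisher -> [impressions, clicks]
    for row in fact_rows:
        if row["impressions"] < min_impressions:      # fact filter, applied first
            continue
        dim = SITE_DIM.get(row["site_id"])            # map-side join
        if dim is None or dim["country"] != country:  # dim filter
            continue
        acc = partial[dim["publisher"]]               # group + partial aggregate
        acc[0] += row["impressions"]
        acc[1] += row["clicks"]
    return dict(partial)

def reducer(partials, min_ctr, top_n):
    """Final aggregate -> formula (CTR) -> Having -> Top N."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (imp, clk) in partial.items():
            total[key][0] += imp
            total[key][1] += clk
    rows = [(key, clk / imp) for key, (imp, clk) in total.items() if imp > 0]
    rows = [r for r in rows if r[1] >= min_ctr]       # Having on the formula
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top_n]

# Two input splits, each processed by its own mapper call.
split_a = [{"site_id": 1, "impressions": 100, "clicks": 5},
           {"site_id": 2, "impressions": 200, "clicks": 8}]
split_b = [{"site_id": 1, "impressions": 100, "clicks": 3},
           {"site_id": 3, "impressions": 50, "clicks": 1},
           {"site_id": 3, "impressions": 5, "clicks": 1}]

partials = [mapper(s, min_impressions=10, country="IN") for s in (split_a, split_b)]
top = reducer(partials, min_ctr=0.01, top_n=2)
print(top)
```

Filtering before the join and aggregating partially inside the mapper both reduce the amount of data handed to the reducer, which is presumably the motivation for the ordering shown on the slide.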
What worked
Efficiency in modeling and joins
  Solid data modeling. Wasteful to
  perform joins on the fly. Single-stage
  MR to both group and join.
  Map side metadata joins – efficient
  horizontal, vertical & filtered data
  load.
  Pre-join metadata once.
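
One way to read "pre-join metadata once" is to flatten the normalized metadata tables into a single lookup before the job runs, so each fact record needs only one dictionary access. The sketch below assumes a site -> publisher -> country hierarchy and interprets "vertical" load as fetching only the requested columns; the table names, fields and that interpretation are hypothetical.

```python
# Hypothetical normalized metadata tables (names and fields are illustrative).
sites = {101: {"publisher_id": 7, "category": "games"},
         102: {"publisher_id": 8, "category": "news"}}
publishers = {7: {"name": "acme", "country_id": 1},
              8: {"name": "globex", "country_id": 2}}
countries = {1: {"code": "IN"}, 2: {"code": "US"}}

def prejoin(sites, publishers, countries):
    """Flatten site -> publisher -> country into one row per site."""
    flat = {}
    for site_id, site in sites.items():
        pub = publishers[site["publisher_id"]]
        flat[site_id] = {
            "category": site["category"],
            "publisher": pub["name"],
            "country": countries[pub["country_id"]]["code"],
        }
    return flat

# Built once up front, then reused for every fact record.
SITE_LOOKUP = prejoin(sites, publishers, countries)

def enrich(fact_row, wanted):
    """Attach only the requested dimension columns to a fact row."""
    dim = SITE_LOOKUP[fact_row["site_id"]]
    return {**fact_row, **{col: dim[col] for col in wanted}}

row = enrich({"site_id": 101, "clicks": 4}, wanted=["country"])
print(row)
```

The multi-table walk happens once per site rather than once per fact record, which matters when the fact stream runs to billions of rows per day.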
What worked cont..
Simplicity: Transparent Cube and
Aggregate selection (no From or Join
clause).
Ability to absorb data dynamics.
Intuitive query builder.
Analytics - not ‘just’ query.
Support for ‘scheduled’ ad-hoc queries.
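
Transparent cube and aggregate selection could work roughly as sketched below: the user names only dimensions, and the system picks the cheapest pre-aggregated data set that covers them. The `CUBES` catalog, the `select_cube` function and the use of row count as the cost measure are assumptions for illustration; the deck does not describe Yoda's actual selection logic.

```python
# Hypothetical catalog of pre-aggregated cubes: dimensions covered and row count.
CUBES = [
    {"name": "raw_facts", "dims": {"day", "country", "os", "site", "ad"},
     "rows": 3_000_000_000},
    {"name": "daily_country_os", "dims": {"day", "country", "os"}, "rows": 50_000},
    {"name": "daily_country", "dims": {"day", "country"}, "rows": 250},
]

def select_cube(query_dims, cubes=CUBES):
    """Pick the smallest cube whose dimensions cover every queried dimension."""
    candidates = [c for c in cubes if set(query_dims) <= c["dims"]]
    if not candidates:
        raise ValueError(f"no cube covers dimensions {sorted(query_dims)}")
    # Row count stands in for scan cost in this sketch.
    return min(candidates, key=lambda c: c["rows"])

print(select_cube({"country"})["name"])
print(select_cube({"country", "os"})["name"])
print(select_cube({"site"})["name"])
```

Because the choice is made from the requested dimensions alone, the query needs no From or Join clause, and adding a new aggregate to the catalog speeds up existing queries without changing them.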
Demo + QA

gaurav@inmobi.com

Analytics Pipeline at Inmobi
