SlideShare a Scribd company logo
Analytics @ Inmobi
      Gaurav Agarwal
    gaurav@inmobi.com
Scale
Large data sizes
 ~3B records per day ~= 2.5T
 uncompressed json
 ~100 primary, ~300 derived
 dimensions.
 ~50 measures.
 Analysis horizon -years
Scope
Highly Dynamic Data and Analytic needs
  Frequent addition of newer dimensions.
  Very dynamic query patterns.
  Both canned and ad-hoc reports.
  Multiple phase-shifted large data
  streams.
  Different kind of consumers – sales,
  analyst, execs, machines.
The Beginnings: Perl to MR
           (Hadoop)..
Logs summarized using perl. Low volumes
(order of hundred thousand).
Perl could not handle increased volumes
(millions). (2Q, 2010)
MR jobs to aggregate logs and populate DB (3
machine cluster)
DB views increased ; creating MR jobs time
consuming, error prone and hard. (3Q, 2010)
Solving for Pipeline - Pig
MR: New job per need ; known by few.
Pig: Well suited for medium complexity
pipeline jobs.
Data gets aggregated using Pig and
pushed to DB for analytics.
Analytics gets complex
Business evolved; complex analytics needed.
DB suffers ‘limited angle view’ problems.
Proliferation of materialized views.
Hive: not mature (early 2011), too much resource
on small/medium clusters, lot of flux, not
optimal, difficult to fix things and add features.
Back to Pig: Team of engineers writing ad-hoc pig
scripts for business; Performance only as good as
person writing the query – very low productivity.
Realization
Frequently ‘tools’ don’t work as intended. Too
much customizations and constant tuning.
Difficult to absorb the dynamics of the data.
Too generic and not optimal for our data
models and cluster size.
Parts of the required stack – difficult to
integrate and maintain.
Pig not suited for analytics by business. Too
much technical knowledge needed.
Yoda
Developed in-house system to satisfy ad-
hoc analytics.
Complete Stack (ETL, Query Processor,
Query Builder, Visualization) on top of
Hadoop, for processing logs & analytics.
(Q1, 2011)
SQL like operations like Select, Sum, Avg
Min, Max, Count, Distinct, Decode,
Expressions, GroupBy, Where, Having,
Decode, UDF, UDAF etc.
Yoda cont..
Heavily optimize storage and
queries for the data model.
All the fact data streams and
metadata in a coherent, seamless
view.
Platform–UI as well as API (to
embed the functionality it in other
apps).
Life of a Query
     UI        Optimizations     Validate         Select
Convert to     Metadata->        Query            Metadata
protbuf        Fact                               Create Joins
               promotions:       Select Cube      Estimate cost
Transmit
               GroupBy                            Select Priority
Json                             Select Optimal   Determine
               Where
                                 grain            Split size




Collect data   (Reducer)       (Mapper)           (Driver)
               Do Aggregate    Filter Push Down
Format and                     at record          Optimize
output CSV     Apply Formula   reconstruction.    query via
                               Fact filters.      reorganizatio
Update         Perform         Join.              n
status         Having          Dim Filter.
                               Select/Group       Generate MR
Notify user    Top N           Partial            spec.
                               aggregation
What worked
Efficiency in modeling and joins
  Solid data modeling. Wasteful to
  perform joins on the fly. Single-stage
  MR to both group and join.
  Map side metadata joins – efficient
  horizontal, vertical & filtered data
  load.
  Pre-join metadata once.
What worked cont..
Simplicity: Transparent Cube and
Aggregate selection (no From or Join
clause).
Ability to absorb data dynamics.
Intuitive query builder.
Analytics - not ‘just’ query.
Support for ‘scheduled’ ad-hoc queries.
Demo + QA

gaurav@inmobi.com

More Related Content

What's hot

Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
Rommel Garcia
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Jonathan Seidman
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
Amazon Web Services
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
Attunity
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
IJDKP
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Amazon Web Services
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource Allocation
Sujith Jay Nair
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
Venu Anuganti
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
eldariof
 
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Renato Bonomini
 
MS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data miningMS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data mining
DataminingTools Inc
 
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configurationRfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
LeMeniz Infotech
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET Journal
 
About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for Hadoop
Lynn Langit
 

What's hot (14)

Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011Distributed Data Analysis with Hadoop and R - Strangeloop 2011
Distributed Data Analysis with Hadoop and R - Strangeloop 2011
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
Which data should you move to Hadoop?
Which data should you move to Hadoop?Which data should you move to Hadoop?
Which data should you move to Hadoop?
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Topology Aware Resource Allocation
Topology Aware Resource AllocationTopology Aware Resource Allocation
Topology Aware Resource Allocation
 
Designing Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQLDesigning Scalable Data Warehouse Using MySQL
Designing Scalable Data Warehouse Using MySQL
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ...
 
MS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data miningMS SQL SERVER: Olap cubes and data mining
MS SQL SERVER: Olap cubes and data mining
 
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configurationRfhoc a random forest approach to auto-tuning hadoop’s configuration
Rfhoc a random forest approach to auto-tuning hadoop’s configuration
 
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...IRJET-  	  Hadoop based Frequent Closed Item-Sets for Association Rules form ...
IRJET- Hadoop based Frequent Closed Item-Sets for Association Rules form ...
 
About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for Hadoop
 

Similar to Yoda fifth elephant

Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
DataWorks Summit/Hadoop Summit
 
BigData
BigDataBigData
BigData
Shankar R
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Yahoo Developer Network
 
Analysis Services Best Practices From Large Deployments
Analysis Services   Best Practices From Large DeploymentsAnalysis Services   Best Practices From Large Deployments
Analysis Services Best Practices From Large Deployments
rsnarayanan
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
Yang Li
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
t_ivanov
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
Ashraf Uddin
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
Karthik K Iyengar
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
DataWorks Summit
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Grant Ingersoll
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
Yann Cluchey
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
DataWorks Summit
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
Tony Ng
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
kvnnrao
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
camunda services GmbH
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010
Aditya Varun Chadha
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
Hoa Le
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
Giridhar Addepalli
 

Similar to Yoda fifth elephant (20)

Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
BigData
BigDataBigData
BigData
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Analysis Services Best Practices From Large Deployments
Analysis Services   Best Practices From Large DeploymentsAnalysis Services   Best Practices From Large Deployments
Analysis Services Best Practices From Large Deployments
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop ClustersWBDB 2014 Benchmarking Virtualized Hadoop Clusters
WBDB 2014 Benchmarking Virtualized Hadoop Clusters
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01Adtech scala-performance-tuning-150323223738-conversion-gate01
Adtech scala-performance-tuning-150323223738-conversion-gate01
 

Recently uploaded

Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
Steven Carlson
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
sunilverma7884
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Torry Harris
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
Anant Gupta
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Nicolás Lopéz
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
BrainSell Technologies
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
Shiv Technolabs
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
Priyanka Aash
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
Brian Pichman
 
The importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT StandardizationThe importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT Standardization
Axel Rennoch
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
SynapseIndia
 
Figma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdfFigma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdf
Management Institute of Skills Development
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
aakash malhotra
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 

Recently uploaded (20)

Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdfAcumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
Acumatica vs. Sage Intacct vs. NetSuite _ NOW CFO.pdf
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
 
The importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT StandardizationThe importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT Standardization
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
Figma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdfFigma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdf
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 

Yoda fifth elephant

  • 1. Analytics @ Inmobi Gaurav Agarwal gaurav@inmobi.com
  • 2. Scale Large data sizes ~3B records per day ~= 2.5T uncompressed json ~100 primary, ~300 derived dimensions. ~50 measures. Analysis horizon -years
  • 3. Scope Highly Dynamic Data and Analytic needs Frequent addition of newer dimensions. Very dynamic query patterns. Both canned and ad-hoc reports. Multiple phase-shifted large data streams. Different kind of consumers – sales, analyst, execs, machines.
  • 4. The Beginnings: Perl to MR (Hadoop).. Logs summarized using perl. Low volumes (order of hundred thousand). Perl could not handle increased volumes (millions). (2Q, 2010) MR jobs to aggregate logs and populate DB (3 machine cluster) DB views increased ; creating MR jobs time consuming, error prone and hard. (3Q, 2010)
  • 5. Solving for Pipeline - Pig MR: New job per need ; known by few. Pig: Well suited for medium complexity pipeline jobs. Data gets aggregated using Pig and pushed to DB for analytics.
  • 6. Analytics gets complex Business evolved; complex analytics needed. DB suffers ‘limited angle view’ problems. Proliferation of materialized views. Hive: not mature (early 2011), too much resource on small/medium clusters, lot of flux, not optimal, difficult to fix things and add features. Back to Pig: Team of engineers writing ad-hoc pig scripts for business; Performance only as good as person writing the query – very low productivity.
  • 7. Realization Frequently ‘tools’ don’t work as intended. Too much customizations and constant tuning. Difficult to absorb the dynamics of the data. Too generic and not optimal for our data models and cluster size. Parts of the required stack – difficult to integrate and maintain. Pig not suited for analytics by business. Too much technical knowledge needed.
  • 8. Yoda Developed in-house system to satisfy ad- hoc analytics. Complete Stack (ETL, Query Processor, Query Builder, Visualization) on top of Hadoop, for processing logs & analytics. (Q1, 2011) SQL like operations like Select, Sum, Avg Min, Max, Count, Distinct, Decode, Expressions, GroupBy, Where, Having, Decode, UDF, UDAF etc.
  • 9. Yoda cont.. Heavily optimize storage and queries for the data model. All the fact data streams and metadata in a coherent, seamless view. Platform–UI as well as API (to embed the functionality it in other apps).
  • 10. Life of a Query UI Optimizations Validate Select Convert to Metadata-> Query Metadata protbuf Fact Create Joins promotions: Select Cube Estimate cost Transmit GroupBy Select Priority Json Select Optimal Determine Where grain Split size Collect data (Reducer) (Mapper) (Driver) Do Aggregate Filter Push Down Format and at record Optimize output CSV Apply Formula reconstruction. query via Fact filters. reorganizatio Update Perform Join. n status Having Dim Filter. Select/Group Generate MR Notify user Top N Partial spec. aggregation
  • 11. What worked Efficiency in modeling and joins Solid data modeling. Wasteful to perform joins on the fly. Single-stage MR to both group and join. Map side metadata joins – efficient horizontal, vertical & filtered data load. Pre-join metadata once.
  • 12. What worked cont.. Simplicity: Transparent Cube and Aggregate selection (no From or Join clause). Ability to absorb data dynamics. Intuitive query builder. Analytics - not ‘just’ query. Support for ‘scheduled’ ad-hoc queries.