Democratizing Data Science using Apache Spark, Hive and Druid
● Pushkar Priyadarshi
● Igor Yurinok
● Michael Dreibelbis
Intro
● The game studio produces massive mobile games that break down linguistic and geographic barriers by uniting an unprecedented number of global players in one gaming world. Games are played in 180+ countries.
● The performance marketing platform Cognant enables marketing for our internal games as well as external businesses over 250+ channels. It merges extensive mobile ad buying expertise with a live data platform, not only to deliver true ROI on mobile marketing spend but also to eliminate endless fraud and tiresome make-goods in the process.
Machine Zone (mz.com)
● 40 billion messages/day
● Kafka cluster handling 250+ topics over 4k partitions
● 3 Hadoop clusters, the largest spanning 300 nodes
● 5 PB of unreplicated data in the Hadoop ecosystem
● Ads published on 100k apps in nearly 200 countries, serving an average of 750 million impressions a day and peaking at 1B/day
● Data from 300 distinct sources
● Druid cluster containing 30+ data sources holding 50 TB of data
Data @ MZ
● Data Ingestion
○ Ingest raw data from external entities
● Data Normalization
○ Normalize data using transformation framework
● Model Generation
○ Create Model using model generation framework
● Generate predictions
● Second layer of Intelligence
○ Campaign Initialization
○ Campaign Optimization
● Data Service Framework
Overview
Data Ingestion
[Diagram: sources S3, FTP, REST and Email; components Reader, Delegator and Writer; destination RAW Store]
Data Ingestion (cont’d)
● DataReaders extract data from various types of sources
○ S3 - Reporting data read from an Amazon S3 bucket
○ REST - Reporting data fetched from an HTTP endpoint
○ FTP - Similar to S3, but loads from a file system
○ Email - Scans an inbox and extracts valid reports
● DataWriters output data to HDFS
○ Hive external tables
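The reader/writer split above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MZ implementation: the class names (`DataReader`, `DataWriter`, `InMemoryReader`, `ListWriter`, `Delegator`) are hypothetical, the in-memory reader stands in for the S3/REST/FTP/email readers, and the list-backed writer stands in for the HDFS writer. The role given to the delegator (fan-in from readers to a writer) is also an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


class DataReader(ABC):
    """Hypothetical interface: one implementation per source type."""

    @abstractmethod
    def read(self) -> Iterator[Record]:
        ...


class DataWriter(ABC):
    """Hypothetical interface: the deck's writers target HDFS."""

    @abstractmethod
    def write(self, records: Iterable[Record]) -> int:
        ...


class InMemoryReader(DataReader):
    """Stand-in for a real source; yields pre-loaded report rows."""

    def __init__(self, source: str, rows: List[Record]) -> None:
        self.source = source
        self.rows = rows

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            # Tag each row with its origin so later steps can trace it.
            yield {**row, "source": self.source}


class ListWriter(DataWriter):
    """Stand-in for an HDFS writer; collects rows in a list."""

    def __init__(self) -> None:
        self.stored: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.stored)
        self.stored.extend(records)
        return len(self.stored) - before


class Delegator:
    """Pulls from every registered reader and hands the rows to one writer."""

    def __init__(self, readers: List[DataReader], writer: DataWriter) -> None:
        self.readers = readers
        self.writer = writer

    def run(self) -> int:
        # Returns the total number of rows written across all readers.
        return sum(self.writer.write(reader.read()) for reader in self.readers)


readers = [
    InMemoryReader("s3", [{"campaign": "a", "spend": "10.5"}]),
    InMemoryReader("rest", [{"campaign": "b", "spend": "3.0"},
                            {"campaign": "c", "spend": "7.25"}]),
]
writer = ListWriter()
written = Delegator(readers, writer).run()
```

Keeping readers and writers behind small interfaces means a new source type only needs a new reader class; the delegator and writer stay unchanged.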
Data Normalization
[Diagram: Rule Based Transformation Engine, with labels RT, RAW, Rules Loader, Rules Store, HDFS, Rules Parser, Rules Applier and Druid]
● Streaming (real-time) data source
○ Kafka + Spark Streaming => Tranquility => Druid
● Batch (historical backfill) raw data source
○ Spark => Druid
● Rule-based transformation engine (normalizer)
○ Built using Apache Spark
○ Custom DDL for defining column transformation rules
Data Normalization (cont’d)
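A rule-based normalizer of this kind can be sketched as a parse step followed by an apply step. The deck does not show the custom DDL, so the rule syntax `target = operation(source)` and the operation names below are invented for illustration. The deck's engine runs on Apache Spark; this sketch uses plain Python dictionaries as rows to stay self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical operations a rule may reference; the real rule language
# is not shown in the deck.
OPERATIONS: Dict[str, Callable[[str], object]] = {
    "upper": lambda v: v.upper(),
    "to_float": float,
    "trim": lambda v: v.strip(),
}

Rule = Tuple[str, str, str]  # (target column, operation name, source column)


def parse_rules(text: str) -> List[Rule]:
    """Parse lines of the made-up form `target = operation(source)`."""
    rules: List[Rule] = []
    for line in text.strip().splitlines():
        target, expr = (part.strip() for part in line.split("=", 1))
        op, source = expr.rstrip(")").split("(", 1)
        if op not in OPERATIONS:
            raise ValueError(f"unknown operation: {op}")
        rules.append((target, op, source.strip()))
    return rules


def apply_rules(row: Dict[str, str], rules: List[Rule]) -> Dict[str, object]:
    """Build a normalized row containing only the columns the rules define."""
    return {target: OPERATIONS[op](row[source]) for target, op, source in rules}


RULES_TEXT = """
country = upper(geo)
spend_usd = to_float(cost)
partner = trim(network)
"""

rules = parse_rules(RULES_TEXT)
raw_row = {"geo": "us", "cost": "12.50", "network": "  adnet "}
normalized = apply_rules(raw_row, rules)
```

Because the rules are data rather than code, a new source can be onboarded by writing rules instead of a new transformation job.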
● Machine Learning Pipeline based on Apache Spark ML
○ Feature Engineering
○ Model Training
○ Predictions
○ Model Testing/Tuning
○ Model Deployment
MLPlatform
● Feature Engineering extensions
○ Aggregator => NumericAggregator
● Perform aggregate transformation on input Dataset
MLPlatform (cont’d)
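An aggregate transformation over an input dataset might look like the following. The deck gives no details about `NumericAggregator`, so the function name, its parameters and the naming of output features here are assumptions; the sketch only shows the general idea of grouping rows by a key and deriving numeric features per group.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Aggregate functions available to the sketch (an illustrative selection).
AGGREGATES: Dict[str, Callable[[List[float]], float]] = {
    "sum": sum,
    "mean": mean,
    "max": max,
}


def numeric_aggregate(
    rows: Iterable[Dict[str, object]],
    group_col: str,
    value_col: str,
    functions: List[str],
) -> Dict[object, Dict[str, float]]:
    """Group rows by `group_col` and aggregate `value_col` per group."""
    grouped: Dict[object, List[float]] = defaultdict(list)
    for row in rows:
        grouped[row[group_col]].append(float(row[value_col]))
    return {
        key: {f"{value_col}_{name}": AGGREGATES[name](values) for name in functions}
        for key, values in grouped.items()
    }


rows = [
    {"campaign": "a", "installs": 10},
    {"campaign": "a", "installs": 30},
    {"campaign": "b", "installs": 5},
]
features = numeric_aggregate(rows, "campaign", "installs", ["sum", "mean"])
```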
● Feature Engineering extensions
○ ParallelCountVectorizer
■ Compute CountVectorizer per input column
○ ParallelIDF
■ Compute IDF per input column
MLPlatform (cont’d)
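The per-column idea can be sketched without Spark: fit term counts and IDF weights for each input column independently, so the columns can be processed in parallel. The function names below are hypothetical and the thread pool is only a stand-in for however the deck's extensions parallelize the work. The IDF uses the smoothed form log((m + 1) / (df + 1)) that Spark ML documents for its `IDF` estimator, where m is the number of rows and df the number of rows containing the term.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Column = List[List[str]]  # one token list per row


def count_vectors(column: Column) -> List[Dict[str, int]]:
    """Term counts per row, the core of a count vectorizer."""
    return [dict(Counter(tokens)) for tokens in column]


def idf(column: Column) -> Dict[str, float]:
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(column)
    doc_freq = Counter(term for tokens in column for term in set(tokens))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def fit_per_column(columns: Dict[str, Column]) -> Dict[str, dict]:
    """Fit each input column independently, one task per column."""

    def fit(name: str) -> dict:
        return {"counts": count_vectors(columns[name]), "idf": idf(columns[name])}

    with ThreadPoolExecutor() as pool:
        results = pool.map(fit, columns)
    return dict(zip(columns, results))


columns = {
    "genres": [["rpg", "strategy"], ["rpg"]],
    "tags": [["pvp", "pvp", "guild"], ["guild"]],
}
fitted = fit_per_column(columns)
```

A term that appears in every row (here `rpg`) gets an IDF of zero, so it contributes nothing to a TF-IDF feature.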
● Feature Engineering extensions
○ DAGPipeline
■ Support multi-input dataset DAG based feature extraction
MLPlatform (cont’d)
DAGModel generated: [diagram of a DAG with nodes n1, n2, n3 and n4]
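A DAG-based pipeline runs each stage once all of its inputs are available. The sketch below illustrates that ordering with Python's standard `graphlib` module (Python 3.9+). The edges between n1..n4 are an assumption, since the extracted slide text keeps the node names but not how they are connected, and the stage functions are placeholders for real feature extractors.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Assumed edges for illustration only: node -> list of its input nodes.
DEPENDENCIES: Dict[str, List[str]] = {
    "n1": [],
    "n2": [],
    "n3": ["n1", "n2"],
    "n4": ["n3"],
}

# Placeholder stages: n1 and n2 load inputs, n3 combines them, n4 rescales.
STAGES: Dict[str, Callable[..., List[float]]] = {
    "n1": lambda: [1.0, 2.0],
    "n2": lambda: [10.0, 20.0],
    "n3": lambda a, b: [x + y for x, y in zip(a, b)],
    "n4": lambda c: [v / max(c) for v in c],
}


def run_dag(
    dependencies: Dict[str, List[str]],
    stages: Dict[str, Callable[..., List[float]]],
) -> Dict[str, List[float]]:
    """Run every stage after its inputs, passing parent outputs as arguments."""
    outputs: Dict[str, List[float]] = {}
    for node in TopologicalSorter(dependencies).static_order():
        inputs = [outputs[parent] for parent in dependencies[node]]
        outputs[node] = stages[node](*inputs)
    return outputs


outputs = run_dag(DEPENDENCIES, STAGES)
```

Unlike a linear pipeline, this layout lets two independent datasets (here n1 and n2) be prepared separately and joined at a later stage.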
● Model Testing/Tuning
○ Feature Store
■ Rapid iterative model testing
○ Configurable Split-Testing
○ Model Store
■ Based on SparkML MLWritable
● Predictions
○ Can be generated using any version of model
○ Compared across model implementations
MLPlatform (cont’d)
● Predictions using Apache Zeppelin based visualization layer
○ Notebooks allow for rapid testing and model iteration
○ Graphing library allows for instant visual feedback
MLPlatform (cont’d)
What is the output of the ML models?
● Predictions
What is the business value of that?
● Not much
What does the business need?
● Translate predictions into ad partner instructions
Second Layer of Intelligence
A partner instruction is a command that a partner can or should execute:
● Create a new campaign
● Update Budget
● Update Bid
● Update Targeting
● Update Creative Asset
What Are Partner Instructions?
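One way to represent such instructions is a small typed record that serializes to JSON. The five instruction kinds come from the list above; everything else in this sketch (the class and field names, the payload shape, the example partner and campaign identifiers) is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict


class InstructionKind(Enum):
    """The five commands listed on the slide."""

    CREATE_CAMPAIGN = "create_campaign"
    UPDATE_BUDGET = "update_budget"
    UPDATE_BID = "update_bid"
    UPDATE_TARGETING = "update_targeting"
    UPDATE_CREATIVE = "update_creative"


@dataclass(frozen=True)
class PartnerInstruction:
    """Hypothetical record for one command sent to an ad partner."""

    partner: str
    campaign_id: str
    kind: InstructionKind
    payload: Dict[str, object] = field(default_factory=dict)

    def to_json(self) -> str:
        data = asdict(self)
        data["kind"] = self.kind.value  # enums are not JSON serializable as-is
        return json.dumps(data, sort_keys=True)


instruction = PartnerInstruction(
    partner="example-partner",
    campaign_id="cmp-1",
    kind=InstructionKind.UPDATE_BID,
    payload={"bid_usd": 1.25},
)
encoded = instruction.to_json()
```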
Campaign Initialization:
● Bid
○ Finds the best possible bid to create campaigns
● Budget
○ Splits total budget between partners
● Targeting
○ Generates sets of possible targeting groups (Gender, Age, GEO)
● Creative
○ Generates and assigns creatives
Campaign Initialization Process
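The budget step can be illustrated with a proportional split. The deck does not say how MZ divides the budget, so the rule below (share proportional to a predicted score per partner, partners with a non-positive score excluded) and the function name `split_budget` are assumptions chosen for a simple example.

```python
from typing import Dict


def split_budget(total: float, predicted_scores: Dict[str, float]) -> Dict[str, float]:
    """Split `total` across partners in proportion to a predicted score."""
    if total < 0:
        raise ValueError("total budget must be non-negative")
    # Partners with a non-positive prediction receive no budget.
    positive = {p: s for p, s in predicted_scores.items() if s > 0}
    if not positive:
        raise ValueError("need at least one partner with a positive score")
    score_sum = sum(positive.values())
    # Rounded to cents; rounding can leave a small remainder unallocated.
    return {p: round(total * s / score_sum, 2) for p, s in positive.items()}


allocation = split_budget(
    1000.0, {"partner_a": 3.0, "partner_b": 1.0, "partner_c": 0.0}
)
```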
Campaign Optimization:
● Bid
○ Increase or decrease bids per campaign based on performance prediction
● Budget
○ Increase, decrease and reshuffle budget across partners/campaigns
● Targeting
○ Update targeting based on performance
● Creative
○ Reassign creatives based on performance
Campaign Optimization Process
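A bid update driven by a performance prediction could be as simple as a bounded step rule. This is a sketch under assumptions, not the optimizer described in the talk: the predicted-ROI input, the target, the 10% step and the bid bounds are all hypothetical parameters.

```python
def adjust_bid(
    current_bid: float,
    predicted_roi: float,
    target_roi: float = 1.0,
    step: float = 0.10,
    min_bid: float = 0.05,
    max_bid: float = 5.0,
) -> float:
    """Raise the bid when predicted ROI beats the target, lower it otherwise."""
    if predicted_roi > target_roi:
        proposed = current_bid * (1 + step)
    elif predicted_roi < target_roi:
        proposed = current_bid * (1 - step)
    else:
        proposed = current_bid
    # Clamp to the allowed range so one update cannot run away.
    return round(min(max(proposed, min_bid), max_bid), 4)


raised_bid = adjust_bid(1.00, predicted_roi=1.4)
lowered_bid = adjust_bid(1.00, predicted_roi=0.6)
capped_bid = adjust_bid(4.90, predicted_roi=2.0)
```

The output of such a rule would then be wrapped in an "Update Bid" partner instruction rather than applied directly.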
Campaign Initialization/Optimization Process
[Diagram: ML Output (Predictions), Historical Data, Initializer, Optimizer, Ad Partner Instructions, Ad Partner]
Where to store metadata for Data Pipelines?
Where to store Ad Partner Instructions?
How to deliver Ad Partner Instructions?
Data Service Framework
Possible Microservices:
● Ad Partner Data Service
● Campaign Data Service
● ASP Data Service
● Ad Partner Instruction Service
Data Service Framework (cont’d)
Technologies:
● REST API
● Spring Boot
● OpenShift / Kubernetes
● Gradle + Jenkins Pipelines for CI/CD
Data Service Framework (cont’d)
Connect All Components Together
[Diagram: Data Ingestion, Data Normalization, MLPlatform, Data Services, Ad Partner]
Questions?