MZ is re-inventing how the entire world experiences data via our mobile games division MZ Games Studios, our digital marketing division Cognant, and our live data platform division Satori.
The growing need for data science capabilities across the organization requires an architecture that democratizes building these applications and disseminates insight from their outcomes to the wider organization.
Attend this session to learn how we built a platform for data science using Spark, Hive, and Druid specifically for our performance marketing division, Cognant. This platform powers several data science applications, such as fraud detection and bid optimization, at large scale.
We will share lessons learned over the past three years of building this platform, walking through some of the actual data science applications built on top of it.
Attendees from ML engineering and data science backgrounds can gain deep insight from our experience building this platform.
Speakers
Pushkar Priyadarshi, Director of Engineering, Machine Zone Inc.
Igor Yurinok, Staff Software Engineer, MZ
3. ● Game studio produces massive mobile games that break down linguistic and geographic barriers by uniting an unprecedented number of global players in one gaming world. Games are played in 180+ countries.
● Performance marketing platform Cognant enables marketing for our internal games as well as external businesses over 250+ channels. It merges extensive mobile ad buying expertise with a live data platform to deliver not only true ROI on mobile marketing spend but also eliminate endless fraud and tiresome make-goods in the process.
Machine Zone (mz.com)
4. ● 40 billion messages/day
● Kafka cluster handling 250+ topics over 4K partitions
● 3 Hadoop clusters, the largest spanning 300 nodes
● 5 PB of unreplicated data in the Hadoop ecosystem
● Ads published on 100K apps in nearly 200 countries, serving on average 750 million impressions a day, peaking at 1B/day
● Data from 300 distinct sources
● Druid cluster containing 30+ data sources holding 50 TB of data
Data @ MZ
5. ● Data Ingestion
○ Ingest raw data from external entities
● Data Normalization
○ Normalize data using transformation framework
● Model Generation
○ Create Model using model generation framework
● Prediction Generation
● Second Layer of Intelligence
○ Campaign Initialization
○ Campaign Optimization
● Data Service Framework
Overview
7. Data Ingestion (cont’d)
● DataReaders extract data from various types of sources
○ S3 - reporting data accessed from Amazon S3 buckets
○ REST - reporting data pulled from HTTP endpoints
○ FTP - similar to S3; loads reports from a remote file system
○ Email - scans inboxes and extracts valid reports
● DataWriters output data to HDFS
○ Hive external tables
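The DataReader/DataWriter split above can be sketched as a pair of small interfaces. This is a minimal, illustrative Python sketch, not MZ's actual code; the class and method names (`DataReader`, `DataWriter`, `ingest`, the in-memory writer standing in for HDFS) are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable


class DataReader(ABC):
    """Extracts raw report rows from an external source (S3, REST, FTP, Email)."""

    @abstractmethod
    def read(self) -> Iterable[dict]:
        ...


class DataWriter(ABC):
    """Persists rows to the warehouse (e.g. HDFS files behind a Hive external table)."""

    @abstractmethod
    def write(self, rows: Iterable[dict]) -> int:
        ...


class RestReader(DataReader):
    """Reads reporting data from an HTTP endpoint; the fetch is stubbed out here."""

    def __init__(self, fetch: Callable[[], Iterable[dict]]):
        # In a real deployment `fetch` would wrap an HTTP client call.
        self._fetch = fetch

    def read(self) -> Iterable[dict]:
        return self._fetch()


class InMemoryWriter(DataWriter):
    """Stand-in for an HDFS writer, useful for testing the pipeline wiring."""

    def __init__(self):
        self.rows = []

    def write(self, rows: Iterable[dict]) -> int:
        self.rows.extend(rows)
        return len(self.rows)


def ingest(reader: DataReader, writer: DataWriter) -> int:
    """One ingestion run: read everything from the source, write it to the sink."""
    return writer.write(reader.read())
```

New source types then only require a new `DataReader` implementation; the downstream write path stays unchanged.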
9. ● Streaming real-time data sources
○ Kafka + Spark Streaming => Tranquility => Druid
● Batch historical backfill of raw data sources
○ Spark => Druid
● Rule based transformation engine (normalizer)
○ Built using Apache Spark
○ Custom DDL for defining column transformation rules
Data Normalization (cont’d)
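The idea of a rule-based normalizer can be illustrated with a small sketch. The real engine runs on Spark and reads rules from a custom DDL; here plain Python dicts and functions stand in for both, and all column names and conversions are illustrative assumptions.

```python
# Each rule maps an output column to a transformation over the raw row.
# In the real engine, rules come from a custom DDL and execute on Spark.
RULES = {
    "country": lambda row: row["geo"].strip().upper(),
    # assumed example: spend reported in micros, normalized to USD
    "spend_usd": lambda row: round(float(row["spend"]) / 1_000_000, 2),
    "clicks": lambda row: int(row.get("clicks", 0)),
}


def normalize(row: dict, rules: dict = RULES) -> dict:
    """Apply every column rule to one raw record, producing a normalized record."""
    return {col: fn(row) for col, fn in rules.items()}
```

Because the rules are data rather than code, adding a new source only means adding new rule definitions, not redeploying the engine.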
10. ● Machine Learning Pipeline based on Apache Spark ML
○ Feature Engineering
○ Model Training
○ Predictions
○ Model Testing/Tuning
○ Model Deployment
MLPlatform
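The stage sequence above follows the Spark ML model of chained pipeline stages. A minimal Python sketch of that pattern, with hypothetical `Stage`/`Pipeline`/`MaxScaler` names standing in for Spark ML's Estimator/Transformer split:

```python
class Stage:
    """One pipeline stage, modeled loosely on Spark ML's Transformer/Estimator split."""

    def fit(self, data):
        return self  # stateless stages act as their own fitted model

    def transform(self, data):
        raise NotImplementedError


class MaxScaler(Stage):
    """Estimator-like stage: learns the max during fit, scales during transform."""

    def fit(self, data):
        self.max_ = max(data)
        return self

    def transform(self, data):
        return [x / self.max_ for x in data]


class Pipeline:
    """Runs stages in order: fit each on the current data, then transform with it."""

    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data
```

Feature engineering, training, and prediction each become stages, so the same pipeline object covers testing, tuning, and deployment.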
13. ● Feature Engineering extensions
○ DAGPipeline
■ Support multi-input dataset DAG based feature extraction
MLPlatform (cont’d)
(Diagram: example feature-extraction DAG over nodes n1-n4, and the DAGModel generated from it.)
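DAG-based feature extraction over multiple input datasets can be sketched with a topological walk. This is an illustrative Python sketch only, assuming a node graph like the n1-n4 example on the slide; the `run_dag` helper and the node functions are hypothetical.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def run_dag(nodes, edges, inputs):
    """Execute feature-extraction nodes in dependency order.

    nodes:  name -> fn(*parent_outputs)
    edges:  name -> tuple of parent names (empty tuple = source node fed by `inputs`)
    """
    order = TopologicalSorter(edges).static_order()
    out = {}
    for name in order:
        parents = edges.get(name, ())
        args = [out[p] for p in parents] if parents else [inputs[name]]
        out[name] = nodes[name](*args)
    return out


# Example DAG mirroring the slide: n1 and n2 are source datasets,
# n3 joins their features, n4 derives a final feature from n3.
nodes = {
    "n1": lambda d: [x * 2 for x in d],
    "n2": lambda d: [x + 1 for x in d],
    "n3": lambda a, b: [x + y for x, y in zip(a, b)],
    "n4": lambda d: sum(d),
}
edges = {"n1": (), "n2": (), "n3": ("n1", "n2"), "n4": ("n3",)}
```

Multi-input extraction then reduces to declaring edges; the engine decides execution order.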
14. ● Model Testing/Tuning
○ Feature Store
■ Rapid iterative model testing
○ Configurable Split-Testing
○ Model Store
■ Based on SparkML MLWritable
● Predictions
○ Can be generated using any version of model
○ Compared across model implementations
MLPlatform (cont’d)
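A model store that lets predictions be generated from any model version and compared across implementations can be sketched as follows. The real store builds on SparkML's MLWritable; this in-memory `ModelStore` and the `compare` helper are illustrative assumptions.

```python
class ModelStore:
    """Versioned model registry sketch (the real one persists via SparkML MLWritable)."""

    def __init__(self):
        self._versions = {}

    def save(self, name, version, model):
        self._versions[(name, version)] = model

    def load(self, name, version):
        return self._versions[(name, version)]


def compare(store, name, versions, features):
    """Generate predictions from several model versions for side-by-side review."""
    return {v: store.load(name, v)(features) for v in versions}
```

Keeping every version addressable makes rapid iterative testing and split-testing a lookup rather than a retrain.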
15. ● Predictions explored through an Apache Zeppelin-based visualization layer
○ Notebooks allow for rapid testing and model iteration
○ Graphing library allows for instant visual feedback
MLPlatform (cont’d)
16. What is the output of ML models?
● Predictions
What is the business value of predictions alone?
● Not much
What does the business need?
● Predictions translated into ad partner instructions
Second Layer of Intelligence
17. A partner instruction is a command that a partner can/should execute:
● Create a new campaign
● Update Budget
● Update Bid
● Update Targeting
● Update Creative Asset
What are Partner Instructions?
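The instruction types listed above suggest a simple command data structure. A minimal sketch, assuming hypothetical field names (`partner`, `campaign_id`, `params`) that are not specified in the talk:

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The instruction types a partner can execute, per the slide."""
    CREATE_CAMPAIGN = "create_campaign"
    UPDATE_BUDGET = "update_budget"
    UPDATE_BID = "update_bid"
    UPDATE_TARGETING = "update_targeting"
    UPDATE_CREATIVE = "update_creative"


@dataclass(frozen=True)
class PartnerInstruction:
    """One immutable command for an ad partner to execute."""
    partner: str
    campaign_id: str
    action: Action
    params: dict = field(default_factory=dict)
```

Making instructions plain immutable records keeps them easy to store, audit, and replay through a delivery service.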
18. Campaign Initialization:
● Bid
○ Finds the best possible bid to create campaigns
● Budget
○ Splits total budget between partners
● Targeting
○ Generates sets of possible targeting groups (Gender, Age, GEO)
● Creative
○ Generates and assigns creatives
Campaign Initialization Process
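Two of the initialization steps above, budget splitting and targeting-group generation, can be sketched directly. This is an illustrative Python sketch; the weighting scheme and the (gender, age, geo) enumeration are assumptions, not the production logic.

```python
from itertools import product


def split_budget(total: float, partner_weights: dict) -> dict:
    """Split the total budget across partners in proportion to predicted performance."""
    weight_sum = sum(partner_weights.values())
    return {p: round(total * w / weight_sum, 2) for p, w in partner_weights.items()}


def targeting_groups(genders, ages, geos):
    """Enumerate candidate (gender, age, geo) targeting groups."""
    return list(product(genders, ages, geos))
```

Initialization then reduces to scoring each candidate group and attaching a starting bid and budget share.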
19. Campaign Optimization:
● Bid
○ Increases or decreases bids per campaign based on performance predictions
● Budget
○ Increases, decreases, and reshuffles budget across partners/campaigns
● Targeting
○ Updates targeting based on performance
● Creative
○ Reassigns creatives based on performance
Campaign Optimization Process
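The bid step of the optimization loop can be sketched as a prediction-driven adjustment. A minimal sketch, assuming a simple fixed-step policy; the step size, ROI comparison, and bid floor are illustrative, not the actual optimizer.

```python
def adjust_bid(current_bid: float, predicted_roi: float, target_roi: float,
               step: float = 0.10, floor: float = 0.01) -> float:
    """Raise the bid when the campaign beats its ROI target, lower it otherwise.

    The bid never drops below `floor`, so a campaign is throttled, not killed.
    """
    if predicted_roi > target_roi:
        return round(current_bid * (1 + step), 4)
    if predicted_roi < target_roi:
        return round(max(floor, current_bid * (1 - step)), 4)
    return current_bid
```

The output of a step like this is exactly an "Update Bid" partner instruction, closing the loop from prediction to partner action.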
21. Where to store metadata for Data Pipelines?
Where to store Ad Partner Instructions?
How to deliver Ad Partner Instructions?
Data Service Framework
22. Possible Microservices:
● Ad Partner Data Service
● Campaign Data Service
● ASP Data Service
● Ad Partner Instruction Service
Data Service Framework (cont’d)
23. Technologies:
● REST API
● Spring Boot
● OpenShift Kubernetes
● Gradle + Jenkins Pipelines for CI/CD
Data Service Framework (cont’d)
24. Connect All Components Together
Data Ingestion → Data Normalization → MLPlatform → Ad Partner Data Services