1. Engaging with Caserta to
ADVANCE YOUR BUSINESS
September 26th, 2017
November 15th, 2017
Maxwell Goldbas, Director of Caserta Innovation Labs
Multi-Touch
Attribution Modeling
with Spark
2. • Who am I?
• Raised on the Upper West Side
• Data Engineer
• Director, Caserta Innovation Labs
• Topics today
• Multi-touch attribution
• Data science with Spark
Introduction
3. • Caserta recently did a cloud migration
• Large media client
• Client could not join us today
• Client was not familiar with Spark
• Hesitant to change to open source code
• We want to demonstrate its power
Background
4. • Client: Which consumer touch points
drive engagement in rewards program?
• Snail Mail
• Texts
• Member Events
• Email
• Site Activity
• Caserta: Get client excited about our
Infrastructure
• Identity Resolution
• Unified Data Source
Objectives
5. • Databricks
• User Access to Data Lake
• Several Spark Clusters
• Graph Dataframes
• AWS
• Data Lake in S3
• Redshift
• EC2 for Clusters
• Caserta
• Airflow
• Docker
• RabbitMQ
Infrastructure
6. • Get data into a usable format
• Required knowledge:
• Number of touch points that occurred between
consecutive conversions
• Impact each touch point had on the final conversion
• Pull all engagements
• Pull distinct conversions by individual key, event type,
date
• A conversion is an engagement with the rewards program
• Multiple conversions by the same person on the same day
should not create noise
• 15 billion rows of event data
Preparation
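The de-duplication step above can be sketched in plain Python. This is a minimal illustration, not the client pipeline: the sample records and the `distinct_conversions` helper are hypothetical, and it runs on a list rather than on 15 billion rows.

```python
from datetime import datetime

# Hypothetical event records: (individual_key, event_type, timestamp).
events = [
    (1, "Email", datetime(2017, 3, 1, 9, 0)),
    (1, "Conversion", datetime(2017, 3, 2, 10, 0)),
    (1, "Conversion", datetime(2017, 3, 2, 18, 30)),  # same person, same day
    (2, "Text", datetime(2017, 3, 2, 8, 0)),
    (2, "Conversion", datetime(2017, 3, 3, 12, 0)),
]

def distinct_conversions(rows):
    """Keep one conversion per (individual key, event type, date)."""
    seen = set()
    kept = []
    # Walk the events in time order so the earliest conversion of a day wins.
    for key, event_type, ts in sorted(rows, key=lambda r: r[2]):
        if event_type != "Conversion":
            continue
        dedupe_key = (key, event_type, ts.date())
        if dedupe_key not in seen:
            seen.add(dedupe_key)
            kept.append(dedupe_key)
    return kept

conversions = distinct_conversions(events)
for row in conversions:
    print(row)
```

On a Spark DataFrame the same idea would typically be expressed with `dropDuplicates` on the three key columns, assuming the timestamp has already been truncated to a date.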
7. Process
Events
Paths
Models
• Order events by individual
• Flag each conversion event
• Flag each new individual
• Start a new path at each conversion flag and each new-individual flag
• Group touch points into paths
• Build Models from Paths
8. Conversion Paths – Event Data
Individual Key   Activity Type Key   Conversion   New User   Conversion Path
1                Email               0            0          0
1                Text                0            0          0
1                Conversion          1            0          0
1                Email               0            0          1
2                Text                0            1          2
2                Conversion          1            0          2
2                Text                0            0          3
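A minimal sketch of how the flags and the Conversion Path column in this table could be derived, assuming the events are already ordered by individual and then by time. The `assign_paths` helper is hypothetical; the path counter is a running sum that moves forward after a conversion or when the individual changes.

```python
# Events from the table, already ordered by individual and then by time.
events = [
    (1, "Email"), (1, "Text"), (1, "Conversion"), (1, "Email"),
    (2, "Text"), (2, "Conversion"), (2, "Text"),
]

def assign_paths(ordered_events):
    """Return (key, activity, conversion, new_user, conversion_path) rows."""
    rows = []
    path = 0
    prev_key = None
    prev_conversion = 0
    for key, activity in ordered_events:
        conversion = 1 if activity == "Conversion" else 0
        # The very first row is not flagged, matching the table above.
        new_user = 1 if prev_key is not None and key != prev_key else 0
        # The conversion itself closes its path; the next event opens a new one.
        path += prev_conversion + new_user
        rows.append((key, activity, conversion, new_user, path))
        prev_key, prev_conversion = key, conversion
    return rows

rows = assign_paths(events)
for row in rows:
    print(row)
```

When an individual's last event is a conversion, the counter advances by two at the next individual, so path IDs stay unique but are not necessarily consecutive. In Spark this would usually be a `lag` plus a running `sum` over an ordered window rather than a Python loop.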
11. Conversion Paths – Conversion Data
Features                      Label
Total Emails   Total Texts    Converted?
2              1              1
1              0              0
0              1              1
0              1              0
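Grouping touch points into one feature row per path can be sketched as follows. The input is the seven events shown on the event-data slide, so the first path here counts one email and one text, not the two emails listed in the table on this slide. The `build_training_rows` helper is hypothetical.

```python
from collections import Counter, defaultdict

# (activity, conversion_path) pairs from the event-data slide.
path_events = [
    ("Email", 0), ("Text", 0), ("Conversion", 0), ("Email", 1),
    ("Text", 2), ("Conversion", 2), ("Text", 3),
]

def build_training_rows(rows):
    """Return (total_emails, total_texts, converted) for each path."""
    counts = defaultdict(Counter)
    for activity, path in rows:
        counts[path][activity] += 1
    training = []
    for path in sorted(counts):
        c = counts[path]
        # A path is labeled as converted if it contains a conversion event.
        training.append((c["Email"], c["Text"], 1 if c["Conversion"] else 0))
    return training

training_rows = build_training_rows(path_events)
for row in training_rows:
    print(row)
```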
12. • Darling child of data science
• Flexible, easy to use, accurate
• Prediction for whether or not a certain
number of events will lead to a
conversion
• Each conversion should carry the
number of touch points that led to it
• Results:
• Email and Web Traffic are king
First Model: Logistic Regression
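To make the model concrete, here is a small logistic regression fitted by gradient descent on the four feature rows from the conversion-data slide. It is a toy sketch in plain Python, not the model built for the client: the learning rate, step count, and helper names are assumptions, and four rows are far too few to draw any conclusion about channels.

```python
import math

# Features (total emails, total texts) and label (converted?) from the slide.
X = [(2, 1), (1, 0), (0, 1), (0, 1)]
y = [1, 0, 1, 0]

def predict(weights, bias, features):
    """Probability of conversion under the logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, X, y):
    """Average negative log-likelihood of the labels."""
    total = 0.0
    for features, label in zip(X, y):
        p = predict(weights, bias, features)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(y)

def fit(X, y, learning_rate=0.1, steps=500):
    """Batch gradient descent on the average log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict(weights, bias, features) - label
            grad_b += error
            for i, x in enumerate(features):
                grad_w[i] += error * x
        bias -= learning_rate * grad_b / len(y)
        weights = [w - learning_rate * g / len(y) for w, g in zip(weights, grad_w)]
    return weights, bias

weights, bias = fit(X, y)
print("email weight:", weights[0], "text weight:", weights[1], "bias:", bias)
print("loss:", log_loss(weights, bias, X, y))
```

The fitted weights play the role of per-channel attribution scores: a larger positive weight means more touches on that channel raise the predicted chance of conversion.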
13. • Logistic regression does not take the time between
engagements and conversions into account
• 1,000 ads over a year do not have 10 times the impact
of 100 ads in a week
• Survival analysis to the rescue
• Offset the total number of ads by the
duration over which they were seen
• Highest Survival Rate – Web Traffic
• The steeper the curve, the more powerful
the ad
First Model is Wrong: Survival Analysis
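The slides do not say which survival estimator was used. As one common choice, a Kaplan-Meier curve can be sketched as below, where "survival" means not having converted yet. The durations and conversion flags are hypothetical, and the `kaplan_meier` helper is written for illustration only.

```python
def kaplan_meier(durations, converted):
    """Return [(time, survival probability)] at each time with a conversion.

    durations: time until conversion, or until observation ended.
    converted: 1 if the individual converted at that time, 0 if censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, c in zip(durations, converted) if d == t and c)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Share of individuals still at risk who did not convert at time t.
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        # Converted and censored individuals both leave the risk set.
        at_risk -= leaving
    return curve

# Hypothetical days-to-conversion for one channel (0 = never converted).
durations = [1, 2, 2, 3, 4]
converted = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, converted)
for t, s in curve:
    print(t, round(s, 3))
```

Fitting one curve per channel and comparing them gives the reading on the slide: a curve that drops faster means individuals exposed to that channel convert sooner.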
15. • Reduce touch points in a long conversion path
• Web traffic activity was affected the most
• More messages means each one is easier to forget
• Less impact
• Multiply the number of events by the probability of
converting after that many events within their
duration
• Results:
• Email and Events are king
Second Model: Discrete-Time Survival Model Based on
Conversion Paths
16. • Survival Analysis is currently univariate
• A multivariate model could demonstrate
covariance
• Did not have social media data
• Use deep learning
• Account for correlation across channels
• Add parameter for heavy web users,
balance between offline and online focus
Further Analysis
17. • Parallelism is good
• Use Redshift and Spark
• Watch your bottlenecks
• Actions like show and count can cost precious
time
• Bottlenecks can be mitigated by using fewer,
bigger instances
• Survival Analysis gave us a good amount of
data
• Duration of time before someone would
convert based on a channel
• Caching helped for frequently accessed data
Notes