2. We are SmarterHQ
SmarterHQ is the leading multi-channel behavioral marketing
platform, empowering B2C marketers to personalize individual
customer interactions in real time. We work with some of the
world’s largest brands, such as Bloomingdale’s, Santander Bank,
Carrentals.com and Finish Line, to drive phenomenal business
results. Forbes has recognized us as technology that will push
B2C companies into a new era of personalization, and Forrester’s
Total Economic Impact study found that we deliver 667% ROI.
3. So Let’s build our models!!!
Easy enough: choose our favorite algorithm (in our case Logistic Regression, since we are
aiming for eventual near-real-time scoring).
Model building and input data filtering use Standard Deviation, Correlation and
Lasso-LARS
We use Python libraries (SCIKIT and pySQL libraries) to automate gathering
the data and delivering it to the server for model building!
This was all developed and perfected prior to Jan 2015 (a scant 6 months at
SmarterHQ)
Recently expanded to include Affinity Analysis for interaction-term building and
Product Recommendations
So what is the problem???!!! What have I not told you?
6. Data Gathering
Digital Sources:
• Tag a website, mobile app, etc.
Product views, customer ids, email addresses, products carted, products purchased, loyalty ids
• Streams to Redshift in as little as 5 minutes.
• Incremental batches on Redshift take ~5 minutes to run, so data latency is as little as 10 minutes
OMS:
• Daily feeds worked out with the client:
Customer ids, loyalty ids, products, order totals, email addresses, refunds, cancellations, shipping info
• Processed once a day
Product:
• Product ids, client based marketing categories
7. StoreFront Infrastructure Design
Properties:
Modular in design
Highly parallel
Concurrent writing
Processes are daemonized
Python apps supporting the infrastructure
A typical day for every customer:
Web load (240x/day): WEB streaming -> SQS -> Kinesis -> Lambda -> S3 -> Redshift
OMS (1x/day): ETL from Client -> Informatica -> S3 -> Redshift
Product Feeds (1x/day): ETL from Client -> Informatica -> S3 -> Redshift
9. Entities!
• Everyone has a definition of what a customer is!!! How do we represent that customer in the data
that we have? If I ask for all of the purchase information from customer X then how can I get it
reliably and quickly?
• Entities are data-driven constructs that are the data representation of a customer, location,
marketing campaign, etc.
• Defined by exact matching (Really want to go to Fuzzy land!)
Email addresses, loyalty ids, order ids, customer names, other customer ids
Require at least 2 pieces to match (except for web-only data, which gives email-based entities!)
Example:
10. Entity Mechanics
Build Entities using Graph Theory
The set of all possible data elements to be linked is the Vertex set
Use the data to build connections between Vertices: these are the Edges!
The set of all such connections is the Edge set; each set of connected vertices is an Entity
Use a graph traversal algorithm (Breadth First Search or Depth First Search) to build out the graphs
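A minimal sketch of the idea, assuming entities are the connected components of the vertex/edge graph. The function name `build_entities` and the identifier strings are hypothetical, and this in-memory version is only an illustration, not the production process:

```python
from collections import deque

def build_entities(vertices, edges):
    """Group vertices into entities (connected components) with BFS."""
    # Every vertex gets an adjacency set, so unlinked vertices
    # still come out as single-vertex (atomic) entities.
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    entities = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            vertex = queue.popleft()
            component.add(vertex)
            for neighbor in adjacency[vertex]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        entities.append(component)
    return entities

# Hypothetical identifiers, for illustration only.
vertices = ["email:sarah@example.com", "loyalty:123", "order:INV1215",
            "email:other@example.com", "order:INV9999", "loyalty:777"]
edges = [("email:sarah@example.com", "loyalty:123"),
         ("loyalty:123", "order:INV1215"),
         ("email:other@example.com", "order:INV9999")]
entities = build_entities(vertices, edges)
```

Here the first three vertices collapse into one entity, the next two into another, and `loyalty:777` stays an atomic entity on its own.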
11. OMS:
1. Person Identifier fields (name, email address, customer ids, order ids)
2. Parse the Email field (use a regular expression based on the RFC 5322 standard to filter out
improperly formatted emails) and get the email user id
3. Algorithm: exact match on at least 2 fields (common names and email user names make single-
point matches unreliable)
Could expand to 1-point matches by using a frequency analysis to allow them only for less common
names or email addresses
Digital:
Personal Identifier fields (email address, order id, loyalty ids)
1. Exact match on at least two of order id, email address or loyalty id to corresponding OMS entity
2. Next do digital email based entities (1 point matches)
Entities with both OMS Retail and Digital vertices are Cross-Channel Entities!
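The matching steps can be sketched roughly as follows. The regular expression is a deliberately simplified stand-in (it is not the full RFC 5322 grammar), the field names and helper functions are hypothetical, and the pairwise comparison is only practical for tiny examples:

```python
import re
from itertools import combinations

# Simplified email pattern for illustration; NOT the full RFC 5322 grammar.
EMAIL_RE = re.compile(
    r"^([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)$"
)

def email_user(email):
    """Return the lower-cased user part of a well-formed email, else None."""
    match = EMAIL_RE.match(email.strip()) if email else None
    return match.group(1).lower() if match else None

def shared_fields(rec_a, rec_b, fields):
    """Count identifier fields on which two records match exactly."""
    return sum(
        1 for f in fields
        if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
    )

def match_edges(records, fields, min_matches=2):
    """Yield index pairs of records agreeing on at least min_matches fields."""
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if shared_fields(a, b, fields) >= min_matches:
            yield (i, j)

# Hypothetical OMS records.
records = [
    {"name": "sarah hall", "email_user": email_user("SarahHall@example.com"),
     "customer_id": "C1"},
    {"name": "sarah hall", "email_user": email_user("sarahhall@example.com"),
     "customer_id": None},
    {"name": "sarah hall", "email_user": email_user("not an email"),
     "customer_id": "C7"},
]
fields = ("name", "email_user", "customer_id")
edges = list(match_edges(records, fields))
```

Only the first two records are linked: they share both name and email user, while the third shares a (common) name alone.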
13. Asset Quality/Visit Quality
Measures the expected value based on history of products viewed online
Suppose an Entity “Sarah” views 3 products X, Y and Z.
Asset Quality (AQ) is #purchases * Price / #views
Today Sarah’s AQ:
Product   Price    # views   # purchases   Asset Quality
X         $5.00    220       23            $0.52
Y         $10.00   342       45            $1.32
Z         $15.00   122       5             $0.61
Visit Quality (VQ) is the sum of Asset Quality over the products viewed in a visit,
e.g. $0.52 + $1.32 + $0.61 = $2.45
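As a quick check of the arithmetic, a small sketch using the three products from the slide (the function name `asset_quality` is hypothetical; treating zero views as zero quality is an assumption):

```python
def asset_quality(price, views, purchases):
    """AQ = #purchases * price / #views (0.0 when there are no views)."""
    return purchases * price / views if views else 0.0

# (price, views, purchases) for the products Sarah viewed.
products = {
    "X": (5.00, 220, 23),
    "Y": (10.00, 342, 45),
    "Z": (15.00, 122, 5),
}
aq = {name: asset_quality(*row) for name, row in products.items()}

# Visit Quality is the sum of AQ over the products viewed in the visit.
vq = sum(aq.values())
```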
14. Engagement
A week-long Engagement with a 50% decay rate (each day's Engagement is that day's Visit
Quality plus 50% of the previous day's Engagement):
Day   Visit Quality   Engagement
1     $10.98          $10.98
2     $0              $5.49
3     $0              $2.75
4     $0              $1.37
5     $3.46           $4.15
6     $0              $2.07
7     $2.45           $3.49
[Chart: VQ and Engagement in dollars ($) by day]
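The Engagement column is consistent with a simple recurrence: today's Visit Quality plus the decayed value of yesterday's Engagement. A minimal sketch under that assumption (the function name is hypothetical):

```python
def engagement_series(visit_quality, decay=0.5):
    """Engagement[t] = VQ[t] + decay * Engagement[t-1], starting from 0."""
    series = []
    previous = 0.0
    for vq in visit_quality:
        previous = vq + decay * previous
        series.append(previous)
    return series

# Daily Visit Quality from the week shown on the slide.
daily_vq = [10.98, 0, 0, 0, 3.46, 0, 2.45]
engagement = engagement_series(daily_vq)
```

The resulting values agree with the Engagement column to within rounding to the cent.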
15. Product Recommendations
Association Rules with monthly customer sessions
• N1: Count the number of times products appear in pairs (over a month for a customer)
• N2: Count the number of times each product (Antecedent, N2A, or Consequent, N2C) appears over a month for a
customer
• N3: Count the number of customers in a month
Compute
• Antecedent Support (N2A / N3)
• Consequent Support (N2C / N3)
• Rule Confidence (N1 / N2A)
• Lift ((N1 / N2A) / (N2C / N3))
All of this is done in the database, daily, over the most recent month!
16. Recommendation Example
Antecedent: Mens Air Jordan City Collection NYC T-Shirt N2A = 384
Consequent: Mens Air Jordan Retro 10 NYC Basketball Shoes N2C = 9770
Rule Occurrence: N1 = 114
Transaction Count: N3 = 780,005
Antecedent Support (N2A / N3) = 384 / 780,005 = 0.00049
Consequent Support (N2C / N3) = 9,770 / 780,005 = 0.0125
Rule Confidence (N1 / N2A) = 114 / 384 = 0.297
Lift ((N1 / N2A) / (N2C / N3)) = Rule Confidence / Consequent Support = 23.7
Customers are 23.7x more likely than average to purchase the Air Jordan Retro 10 NYC shoes
after buying the Air Jordan City Collection NYC T-Shirt
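The same numbers can be reproduced with a few lines of code. This is only a sketch of the formulas (the slides say the real computation runs in the database), and the function name `rule_metrics` is hypothetical:

```python
def rule_metrics(n1, n2a, n2c, n3):
    """Support, confidence and lift from the N1, N2A, N2C and N3 counts."""
    antecedent_support = n2a / n3
    consequent_support = n2c / n3
    confidence = n1 / n2a
    lift = confidence / consequent_support
    return {
        "antecedent_support": antecedent_support,
        "consequent_support": consequent_support,
        "confidence": confidence,
        "lift": lift,
    }

# Counts from the T-shirt -> shoes example.
metrics = rule_metrics(n1=114, n2a=384, n2c=9770, n3=780_005)
```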
17. RFML
Recency: the number of days since the last visit or purchase by a shopper.
Frequency: the number of visits or purchases within a time period of interest.
Monetary: the total dollar spend of a shopper within the time period of interest.
Latency: the average number of days between visits or purchases within the time period of interest.
Recency and Latency are computed once a day
Frequency and Monetary are computed on demand
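A minimal sketch of the four measures for one shopper. The dates and the `rfml` function are hypothetical (only the three amounts echo the invoice amounts shown later in the deck), and the events are assumed to be already filtered to the time period of interest:

```python
from datetime import date

def rfml(events, as_of):
    """events: list of (date, amount) visits or purchases in the period."""
    if not events:
        return None
    days = sorted(d for d, _ in events)
    # Gaps in days between consecutive events.
    gaps = [(b - a).days for a, b in zip(days, days[1:])]
    return {
        "recency": (as_of - days[-1]).days,
        "frequency": len(events),
        "monetary": sum(amount for _, amount in events),
        "latency": sum(gaps) / len(gaps) if gaps else None,
    }

# Hypothetical purchase dates.
events = [
    (date(2015, 1, 1), 103.98),
    (date(2015, 1, 11), 50.45),
    (date(2015, 1, 31), 123.87),
]
result = rfml(events, as_of=date(2015, 2, 5))
```

With these assumed dates the shopper has a recency of 5 days, a frequency of 3 and a latency of 15 days.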
18. Predictive Models
GOAL: Predict Days to Next Purchase and Days to Next Visit for the thresholds <= 1, 3, 7 and 15
days and the intervals 15-30, 31-60 and 61-90 days
216 input fields (Engagement, average order value, average session value, session count, asset
count and many more, plus interactions)
Build models on 6M records at an entity level
Model Building Process:
6M records (Redshift)
-> Python pyETL library
-> Variable Reduction (Variance, Correlation and Lasso-LARS variable reduction)
-> Build Models (Parallel!!)
-> Model Tests (ROC AUC, Regression Coefficients)
-> Upload model & results to SQL
-> Models ready to Deploy
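To illustrate the first two variable-reduction filters, here is a small pure-Python sketch: drop near-constant inputs, then drop any input that is highly correlated with one already kept. The thresholds, column names and function names are assumptions, and the Lasso-LARS step is left out because it needs a modelling library:

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def reduce_variables(columns, min_std=1e-6, max_corr=0.95):
    """columns: dict of name -> list of values. Returns the kept names."""
    # Step 1: drop near-constant inputs (standard deviation filter).
    varying = [n for n, v in columns.items() if pstdev(v) > min_std]
    # Step 2: keep an input only if it is not highly correlated
    # with an input that was already kept.
    kept = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[k])) <= max_corr
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical input fields.
columns = {
    "engagement": [1.0, 2.0, 3.0, 4.0],
    "engagement_x2": [2.0, 4.0, 6.0, 8.0],   # duplicate information
    "constant": [7.0, 7.0, 7.0, 7.0],        # no variance
    "session_count": [3.0, 1.0, 4.0, 1.0],
}
kept = reduce_variables(columns)
```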
Model scoring is handled directly in SQL.
Can score hundreds of millions of records in minutes!
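Scoring a logistic regression is just a weighted sum passed through the logistic function, which is why it translates so easily into a single SQL expression. A sketch with made-up coefficients; the table and column names (`entity_features`, `entity_id`) are hypothetical and the generated SQL is only an illustration:

```python
import math

def score(intercept, coefficients, features):
    """Logistic regression probability for one record."""
    z = intercept + sum(
        coef * features.get(name, 0.0)
        for name, coef in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

def scoring_sql(intercept, coefficients, table):
    """Render the same formula as a SQL SELECT (illustrative only)."""
    terms = " + ".join(
        f"{coef!r} * {name}" for name, coef in coefficients.items()
    )
    return (
        f"SELECT entity_id, "
        f"1.0 / (1.0 + EXP(-({intercept!r} + {terms}))) AS score "
        f"FROM {table}"
    )

# Made-up model, for illustration only.
intercept = -2.0
coefficients = {"engagement": 0.5, "session_count": 0.1}

p = score(intercept, coefficients, {"engagement": 3.49, "session_count": 4})
sql = scoring_sql(intercept, coefficients, "entity_features")
```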
19. Example Big A$$ Client
Athletic retailer, 2 years of data, $1.6B in sales per year
A typical day adds 50,000 transactions; a typical batch delivers about 20,000 records every 6 minutes!
Database size: 866 GB compressed, which equates to 2.5 TB uncompressed
Total daily run time: 3 hours (rebuilds from scratch); batch run time: 5 minutes!
Vertex Set: 253,449,334
Entity Set: 203,531,275
There are 50 million non-atomic equivalence classes!
These account for $850M, or ~53%, of the sales
(these customers are the known repeat customers)
These are the customers we can target, as we have richer information about their repeated
browsing.
20. This is StoreFront Personalization
Customer: SARAH
Sales Channel: Website, Mobile App, In-Store, Call Center, 3rd Party
Annual Spend: $4,500
Transactional History
• Online: INV 1215 $103.98
• Store: INV 4672 $50.45
• Store: INV 8500 $123.87 [etc]
Email Addresses
• Transactional: sarahhall@gmail.com
• Account: shall@home.com
• Promotional: sarahh@yahoo.com
Category, Brand, Product
• Category Affinity: Kid’s, Women’s, Running
• Brand Affinity: Nike
Cross-Channel: Email, Website, Mobile, Display, Social