Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

Semi-Supervised Learning on Hadoop to understand user behaviors
Hadoop Summit Amsterdam
2-3 April 2014

Florian Douetteau
@fdouetteau
www.dataiku.com
Data Science
Studio

Motivation
• CxO
– Page Views, Unique Visitors, Dollars, Subscriptions
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What matters on a website?

Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase

The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
USER

Question
How can we measure and drive user satisfaction on a large website with very diverse usage patterns?

The Problem
• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• …
BEHAVIOUR DIVERSITY
AVERAGED METRICS WOULD HIDE IMPORTANT VARIATIONS ON SPECIFIC SEGMENTS

Subproblem 1: Hard Segments
• Segment users by number of visits per month
– > 20 days per month -> Engaged Users
• Segment by converted or not
• Segment by country

Subproblem 2: Hard Metrics
• Newspaper
– Time spent on the website
– log(Number of page views) + Number of actions
• Search engine
– Click ratio
• E-Commerce
– Conversion ratio

Limits
• Hard Segments -> MISSING PART OF THE REALITY
• Hard Metrics -> ARGUING BETWEEN TEAMS

Semi-Supervised Learning
• Supervised Learning: all labeled data as training data -> Model
• Unsupervised Learning: all unlabeled data as training data -> Model
• Semi-Supervised Learning: some labeled data plus lots of unlabeled data as training data -> Model

½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
Statistical Knowledge
Text Structure (Unsupervised)
Aligned Corpus (Supervised)

½ SL Applied to Web Sessions
Lots of customer sessions
Not much concrete customer feedback
Subscription

Semi-Supervised Learning
3 Approaches
• Generative models, e.g. Gaussian fits
– Assume all data follows a Gaussian distribution with parameter X
– Find the X that best fits the distribution of both labeled and unlabeled data
• Fits with costs
– Supervised learning with a cost function that also captures a distance between points, derived from the structure of the unlabeled data
• Ad hoc: combine unsupervised, then supervised
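A minimal sketch of the ad hoc approach (cluster first, then use the few labels), in plain Python. The toy features, the two segment names and the fixed initial centroids are assumptions made for illustration; a real pipeline would more likely rely on a library clusterer.

```python
import math
from collections import Counter, defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations, starting from the given initial centroids."""
    for _ in range(n_iter):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(c) / len(groups[i]) for c in zip(*groups[i])) if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids

def label_clusters(labeled, centroids):
    """Give each cluster the majority label of the labeled points that fall in it."""
    votes = defaultdict(Counter)
    for point, label in labeled:
        votes[nearest(point, centroids)][label] += 1
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}

# Hypothetical toy data: (pages viewed, comments posted) per session.
unlabeled = [(1, 0), (2, 0), (1, 1), (9, 8), (10, 9), (8, 9)]
labeled = [((2, 1), "newcomer"), ((9, 9), "commenter")]

# Unsupervised step on ALL points, supervised step on the labeled ones only.
points = unlabeled + [p for p, _ in labeled]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
cluster_label = label_clusters(labeled, centroids)

def predict(point):
    """Segment of a new session: the label attached to its nearest cluster."""
    return cluster_label.get(nearest(point, centroids))
```

The unlabeled points shape the clusters; the labeled points only name them, which is why a handful of labels can go a long way.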

Clustering+Supervised in practice
Unlabeled training data points in grey
Labeled training data points in color

½ SL : Fit to the underlying structure

Our Approach
1. (Lots of) data preparation to build meaningful user sessions
2. Cluster sessions, then have end users validate and tag those clusters
3. Create predictive user satisfaction metrics
4. Follow those metrics!

Data Prep: Overview
RAW DATA
• Step 1: Build Sessions (Pig)
• Step 2: Parse IP/Time/.. (custom Python)
• Step 3: Parse Sequences (Hive or custom Python)
• Step 4: Build user-level stats (Hive)
READY FOR ML

Step 1. Build Session
• Use Hive (or Pig)
• Group log lines into “sessions”
• Depending on the variable
– IP, Device -> Keep only one value per session
– URL, Event -> Create an ordered array that represents the sequence of events in the session
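The grouping logic can be sketched in plain Python before porting it to Hive or Pig. The 30-minute inactivity gap, the `(cookie, timestamp, ip, url)` row layout and the sample log lines are assumptions for illustration, not values from the talk.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # seconds of inactivity that closes a session (assumed)

def build_sessions(log_lines, gap=SESSION_GAP):
    """Group (cookie, timestamp, ip, url) rows into sessions.

    IP is reduced to a single value per session (the first one seen);
    URLs become an ordered list of the events in the session.
    """
    sessions = []
    rows = sorted(log_lines, key=itemgetter(0, 1))  # by cookie, then by time
    for cookie, hits in groupby(rows, key=itemgetter(0)):
        current = None
        for _, ts, ip, url in hits:
            if current is None or ts - current["end"] > gap:
                current = {"cookie": cookie, "ip": ip, "start": ts, "end": ts, "urls": []}
                sessions.append(current)
            current["end"] = ts
            current["urls"].append(url)
    return sessions

# Hypothetical log lines.
logs = [
    ("c1", 0, "1.2.3.4", "/home"),
    ("c1", 60, "1.2.3.4", "/products"),
    ("c2", 30, "5.6.7.8", "/blog"),
    ("c1", 7200, "1.2.3.4", "/home"),
]
sessions = build_sessions(logs)
```

Here cookie `c1` yields two sessions, because its last hit comes two hours after the previous one.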

Step 2: Basic Features
• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time -> Day or night?
Python + Hadoop Streaming
Option 1 Option 2
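A rough sketch of the per-line feature extraction that a Python mapper could run under Hadoop Streaming. The `GEO` lookup table is a hypothetical stand-in for a real GeoIP database, and the user-agent rules and night hours are simplistic assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical IP-prefix -> (country, UTC offset in hours) table,
# standing in for a real GeoIP database.
GEO = {"81.": ("FR", 1), "17.": ("US", -5)}

def device_from_user_agent(user_agent):
    """Very rough device class from substrings of the User-Agent header."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua:
        return "robot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def basic_features(ip, user_agent, epoch_seconds):
    """Country and device, plus the user's local hour and a day/night flag."""
    country, offset = next((v for k, v in GEO.items() if ip.startswith(k)), ("??", 0))
    local = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc) + timedelta(hours=offset)
    return {
        "country": country,
        "device": device_from_user_agent(user_agent),
        "hour": local.hour,
        "night": local.hour < 7 or local.hour >= 22,  # assumed definition of night
    }
```

In a streaming job, the mapper would split each tab-separated input line, call `basic_features`, and print the enriched line.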

Extracted Data
[Figure: the ORIGINAL columns plus three NEW columns: Country (from IP), Device (from User-Agent), Hour (from Country & Time)]

Step 3: Session Signals
• Simple Signals
– Number of Page Views
– Time Spent
– Etc.
• Limitation
– These signals alone might not do much to differentiate behaviours

More Elaborate: N-Grams Model
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A,T,T,A,.. | AT,TT,TA,AG,.. | ATT,TTA,TAG,..
NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,me,e_,_l,li,.. | som,ome,me_,e_l,_li,lik,..
NLP (word) | Word | ..some like it hot… | some,like,it,hot | some-like,like-it,it-hot | some-like-it,like-it-hot

N-Grams Model For Sessions
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
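As a small illustration, session n-grams can be produced with a sliding window over the ordered list of page views. The function name and the `-` separator are choices made for this sketch.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, each joined with '-'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]
bigrams = ngrams(session, 2)   # pairs of consecutive page views
trigrams = ngrams(session, 3)  # triples of consecutive page views
```

A session shorter than `n` simply yields an empty list.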

Session N-Grams Analytics
Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-coming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-coming | sport
/politics/home | politics-home | politics
Detailed tokens feed the fine-grain n-grams; simple tokens feed the coarse-grain n-grams.
Important Tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two levels of tokens: detailed, and category only
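One possible way to derive the two token levels from raw events, as a sketch only. The exact tokenization rules used in the talk are not given, so the handling of query strings, `click=` events and the bare `/` path are assumptions, and campaign parameters are expected to be supplied already tokenized.

```python
from urllib.parse import urlsplit, parse_qs

def tokenize(event):
    """Return (detailed_token, simple_token) for a URL path or a 'click=<host>' event."""
    if event.startswith("click="):
        labels = event.split("=", 1)[1].split(".")
        site = labels[-2] if len(labels) >= 2 else labels[0]
        return f"click-{site}", "click"
    url = urlsplit(event)
    # Assumption: a bare "/" path is treated as the home page.
    segments = [s for s in url.path.split("/") if s] or ["home"]
    params = [v.lower().replace(" ", "+")
              for values in parse_qs(url.query).values() for v in values]
    return "-".join(segments + params), segments[0]

def session_tokens(first_referrer, events, detailed=False):
    """Token sequence for one session, with the first referrer as FIRST TOKEN."""
    level = 0 if detailed else 1
    return [first_referrer] + [tokenize(e)[level] for e in events]
```

Calling `session_tokens` twice, once with `detailed=True` and once without, gives the two sequences from which fine-grain and coarse-grain n-grams can be built.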

How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Log-Likelihood Ratio) metric
• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session
• Hint: set the frequency limit so that more than 90% of sessions can be described by a non-detailed n-gram
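For reference, a sketch of the log-likelihood ratio (the G² statistic) on a 2x2 contingency table, written in the entropy form. How the four counts are defined is an assumption here: for example, sessions that contain or do not contain a given n-gram, inside and outside a group of interest.

```python
import math

def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return _xlogx(sum(counts)) - sum(_xlogx(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 table.

    k11: in group, with n-gram      k12: in group, without n-gram
    k21: outside group, with n-gram k22: outside group, without n-gram
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))
```

A score near zero means the n-gram is spread evenly across both groups; a large score means its presence is strongly associated with one of them.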

Step 4. Cohort-like data
• Per cookie, compute metrics
– Number of days since first visit
– Number of visits in the last 30 days
– Average session time
– …
• Join this information back onto the sessions
• Easily achieved with a HiveQL query
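The same aggregation sketched in plain Python, to make the three metrics concrete. The `(cookie, start_ts, duration_seconds)` input layout is an assumption; in the pipeline described here this would be a HiveQL GROUP BY on the cookie.

```python
from collections import defaultdict

DAY = 86400  # seconds

def cookie_stats(sessions, now):
    """Per-cookie metrics from (cookie, start_ts, duration_seconds) tuples."""
    by_cookie = defaultdict(list)
    for cookie, start, duration in sessions:
        by_cookie[cookie].append((start, duration))
    stats = {}
    for cookie, visits in by_cookie.items():
        first = min(start for start, _ in visits)
        stats[cookie] = {
            "days_since_first_visit": (now - first) // DAY,
            "visits_last_30_days": sum(1 for start, _ in visits if now - start <= 30 * DAY),
            "avg_session_seconds": sum(d for _, d in visits) / len(visits),
        }
    return stats

# Hypothetical sessions, with timestamps expressed in seconds.
example = [("c1", 0, 100), ("c1", 40 * DAY, 300), ("c2", 45 * DAY, 60)]
stats = cookie_stats(example, now=50 * DAY)
```

Each session row can then be enriched by looking up its cookie in `stats`.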

Machine Learning for HDFS Data
Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | 10GB (1 SERVER RAM)
H2O | Separate Cluster | 1 (kMeans) | Medium | 100GB – 1TB (CLUSTER RAM)
Open Source R + Hadoop | Varies | Varies | Varies | Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | 1GB (1 Server RAM in R)
Spark + MLLib | Separate Cluster | 1 | Medium | 100GB – 1TB (CLUSTER RAM)

How big is our data here?
RAW DATA (5 TB) -> Step 1: Build Sessions -> Step 2: Parse IP/Time/.. -> Step 3: Parse Sequences -> Step 4: Build user-level stats -> READY FOR ML (10 GB)
Uncompressed data size, for one year's worth of logs on a website with 10 million unique visitors per month

Clustering With Scikit on HDFS
1. Use Pydoop to pull the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. Use IPython to draw some graphs
5. Enjoy

Clustering & Cluster Sampling
Take a balanced number of samples
in each cluster, close to the centroid
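A minimal sketch of that sampling step: for each cluster, keep the few sessions closest to the centroid so that humans label the most typical examples. The function name and arguments are illustrative; `assignments` and `centroids` would come from the clustering step.

```python
import math
from collections import defaultdict

def sample_near_centroids(points, assignments, centroids, per_cluster):
    """Indices of the `per_cluster` points nearest to each cluster's centroid."""
    by_cluster = defaultdict(list)
    for idx, (point, cluster) in enumerate(zip(points, assignments)):
        by_cluster[cluster].append((math.dist(point, centroids[cluster]), idx))
    return {
        cluster: [idx for _, idx in sorted(members)[:per_cluster]]
        for cluster, members in by_cluster.items()
    }

# Hypothetical clustered points.
points = [(0, 0), (1, 1), (5, 5), (10, 10), (11, 11), (20, 20)]
assignments = [0, 0, 0, 1, 1, 1]
centroids = [(2, 2), (12, 12)]
to_label = sample_near_centroids(points, assignments, centroids, per_cluster=2)
```

Taking the same number per cluster keeps small segments from being drowned out by large ones in the labelled set.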

Labelling
Visualizing Sessions
[Figure: timeline of one session, with events at 0'00, 0'12, 1'04, 1'45 and 3'02, labelled "Search for a specific Topic"]
“I can guess what this guy was doing!”

Labelling
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)

What if ?
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)

Supervised Learning
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
Independently of the clusters, use the labeled examples to train a classifier that assigns each session to one of the predefined segments

Supervised Learning: e.g. in Python
• Load the data and the labels in Python (pandas)
• Fit a model on the labeled sessions
• Save the model in HDFS (Python pickle)
• Run the model against all the data (Hadoop Streaming)
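The four steps can be sketched end to end with the standard library only. A nearest-centroid rule stands in for whatever classifier is actually fitted, and the feature values, segment names and tab-separated line format are assumptions for illustration.

```python
import pickle
from collections import defaultdict

def fit(features, labels):
    """Nearest-centroid 'model': the mean feature vector of each segment."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(features, labels):
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(x)), x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Segment whose centroid is closest to x (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))

def apply_to_lines(model, lines):
    """Mapper body: 'session_id<TAB>f1<TAB>f2...' in, 'session_id<TAB>segment' out."""
    for line in lines:
        session_id, *values = line.rstrip("\n").split("\t")
        yield f"{session_id}\t{predict(model, [float(v) for v in values])}"

# Hypothetical labeled sessions: (pages viewed, comments posted).
X = [[1, 0], [2, 1], [9, 8], [10, 9]]
y = ["newcomer", "newcomer", "commenter", "commenter"]

blob = pickle.dumps(fit(X, y))  # in practice: written to a file stored in HDFS
model = pickle.loads(blob)      # in practice: loaded once at mapper start-up
```

In a streaming job, the mapper would load the pickled model once and then print each line yielded by `apply_to_lines(model, sys.stdin)`.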
We’ve got a tool to help you do that in Data Science Studio.
He’s called the Doctor and he’s fun to use!

Compute Metrics Per Segments
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
938k sessions
938k sessions
0.3€ per session
738k sessions
0.83€ per session
68k sessions
0.3€ per session
1k sessions
0€ per session
0€ acquisition costs

User Satisfaction Metrics
• Future-Based Metrics
– Is the user likely to subscribe/pay in the future?
• Expressed Opinion
– Does the user seem satisfied, judging from their behaviour?

Opinion-Based Training For User Satisfaction
• Use user feedback as “labels” to build a model of satisfaction
• “Predict” a satisfaction score for sessions that have no feedback
Session Data (100 million sessions) + Feedback (10,000 feedbacks) -> Scored Sessions
HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
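One simple way to act on that hypothesis is a nearest-neighbour score: a session without feedback inherits the similarity-weighted average score of the most similar sessions that have feedback. This is a sketch only; Jaccard similarity over n-gram sets and the value of `k` are assumptions, and the talk does not say which model was used.

```python
def jaccard(a, b):
    """Share of n-grams that two sessions have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_satisfaction(session_ngrams, feedback, k=3):
    """Similarity-weighted mean score of the k most similar sessions with feedback.

    feedback: list of (ngrams, score) pairs, with score in [0, 1].
    Returns None when no session with feedback shares any n-gram.
    """
    ranked = sorted(
        ((jaccard(session_ngrams, grams), score) for grams, score in feedback),
        reverse=True,
    )[:k]
    weight = sum(sim for sim, _ in ranked)
    if weight == 0:
        return None
    return sum(sim * score for sim, score in ranked) / weight

# Hypothetical sessions that received explicit feedback.
feedback = [
    (["home", "search", "sport"], 0.9),
    (["home", "search"], 0.8),
    (["politics"], 0.1),
]
score = predict_satisfaction(["home", "search", "sport"], feedback, k=2)
```

With 100 million sessions a brute-force comparison like this would not scale; it only illustrates the idea of propagating scores along similar navigation patterns.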

Compute Metrics Per Segments
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
0.3€ per session
13k sessions
1.3€ per session
938k sessions
938k sessions
0.3€ per session
738k sessions
0.83€ per session
68k sessions
0.3€ per session
1k sessions
0€ per session
0€ acquisition costs
SATISFACTION SCORE 0.87
SATISFACTION SCORE 0.37
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12

Data in Time: Smoothing
In red: the base metric
In blue: the smoothed metric
RAW DATA MAY VARY A LOT FROM DAY TO DAY; IT WILL SCARE PEOPLE

Exponential Smoothing In Hive
SELECT segment,
  moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))
FROM stats
GROUP BY segment
These factors determine how heavily you smooth, and over how many days
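To make the smoothing itself concrete, here is basic exponential smoothing in Python. It does not reproduce the parameters of the `moving_avg` call above; the single `alpha` factor is the textbook formulation, used here as an assumption.

```python
def exponential_smoothing(values, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1].

    alpha close to 1 follows the raw metric closely;
    alpha close to 0 smooths heavily.
    """
    smoothed = []
    for x in values:
        smoothed.append(x if not smoothed else alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical daily satisfaction values for one segment.
daily = [1.0, 0.0, 1.0]
smooth = exponential_smoothing(daily, alpha=0.5)
```

A single bad day moves the smoothed curve only part of the way, which is what keeps the dashboard from scaring people.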

Final : Follow Smoothed Satisfaction
• Search for a specific Topic
• Newcomer from Google News
• Foreigner Discovering The Site
• Fan that loves to comment
• Home Page Wanderer
• Dark Bot (Competitor?)
Follow Satisfaction Metric Per Segment
“Damn, our latest release has diverging effects on segments”

Thank You !
Florian Douetteau
@fdouetteau
Questions now or later:
florian.douetteau@dataiku.com
dataiku.com
