½ S L using to turn
into
Semi-Supervised Learning on
Hadoop to understand user
behaviors
Hadoop Summit Amsterdam
2-3 April 2014
Florian Douetteau
@fdouetteau
www.dataiku.com
Data Science
Studio

Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours


Presentation made at Hadoop Summit Amsterdam 2014.


Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

  1. 1. ½ S L using to turn into
  2. 2. Semi-Supervised Learning on Hadoop to understand user behaviors Hadoop Summit Amsterdam 2-3 April 2014
  3. 3. Florian Douetteau @fdouetteau www.dataiku.com Data Science Studio
  4. 4. Motivation • CxO – Page Views, Unique Visitors, Dollars, Subscriptions • Editor / Product Manager – Time Spent, Comments • Users – Content What matters on a website?
  5. 5. Key Usage Metrics • Publisher – Time Spent on Page – Number of pages seen – Number of comments – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – In-App Purchase
  6. 6. The Quest for the Missing Proxy • Publisher – Time Spent on Page – Number of pages seen – Number of comments – User Satisfaction – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – User Satisfaction – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – User Satisfaction – In-App Purchase
  7. 7. Question How to measure and drive user satisfaction on a large website with very diverse usage patterns?
  8. 8. The Problem • Newcomers from Google News • People coming from Twitter and Facebook posts • People coming to the website almost every day • People who love to comment • Foreigners • Robots • People fond of the sport section only • … BEHAVIOUR DIVERSITY: THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
  9. 9. SubProblem 1: Hard Segments • Segment users by number of visits per month – > 20 days per month → Engaged Users • Segment by converted or not • Segment by country
  10. 10. Subproblem 2: Hard Metrics • Newspaper: Time Spent on the website → log(Number of page views) + Number of actions • Search engine: Click Ratio • E-Commerce: Conversion Ratio
  11. 11. Limits Hard Segments → MISSING PART OF THE REALITY Hard Metrics → ARGUING BETWEEN TEAMS
  12. 12. Semi-Supervised Learning • Supervised Learning: All Labeled Data → Training Data → Model • Unsupervised Learning: All Unlabeled Data → Training Data → Model • Semi-Supervised Learning: Some Labeled Data + Lots of Unlabeled Data → Training Data → Model
  13. 13. ½ SL – Natural Language Processing I hope I’ll enjoy Amsterdam, and not only because of Hadoop Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop Statistical Knowledge  Text Structure (Unsupervised) Aligned Corpus (Supervised)
  14. 14. ½ SL Applied to Web Sessions Lots of customer sessions Not so many concrete customer feedbacks Subscription
  15. 15. Semi-Supervised Learning 3 Approaches • Generative Models, e.g. Gaussian fits – Assume all data fits a Gaussian distribution with parameter X – Find the X that best fits the distribution of both labeled and unlabeled data • Fits with costs – Supervised learning with a cost function that captures a distance between points derived from the structure of the unlabeled data • Ad-hoc: Combine unsupervised, then supervised
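The ad-hoc approach (unsupervised first, then supervised) can be illustrated with a minimal sketch. It assumes cluster ids have already been computed by some clustering step and that only a few points carry a label; the function name `propagate_labels` and the majority-vote rule are illustrative assumptions, not something stated in the slides.

```python
from collections import Counter, defaultdict


def propagate_labels(cluster_ids, labels):
    """Give every unlabeled point the majority label of its cluster.

    cluster_ids: cluster id of each point (from any clustering step).
    labels: known label of each point, or None when unlabeled.
    Points in clusters without any labeled member stay None.
    """
    votes = defaultdict(Counter)
    for cid, lab in zip(cluster_ids, labels):
        if lab is not None:
            votes[cid][lab] += 1
    majority = {cid: c.most_common(1)[0][0] for cid, c in votes.items()}
    return [lab if lab is not None else majority.get(cid)
            for cid, lab in zip(cluster_ids, labels)]


cluster_ids = [0, 0, 0, 1, 1, 1, 2]
labels = ["fan", None, "fan", None, "bot", None, None]
propagated = propagate_labels(cluster_ids, labels)
```

Cluster 2 has no labeled member, so its point stays unlabeled; in practice that is a signal to collect a label for that cluster.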
  16. 16. Clustering+Supervised in practice Unlabeled training data points in grey Labeled training data points in color
  17. 17. Supervised Learning Only
  18. 18. ½ SL : Fit to the underlying structure
  19. 19. Our Approach 1. (Lots of) data preparation to build meaningful user sessions 2. Cluster sessions and have end users validate/tag those clusters 3. Create Predictive User Satisfaction Metrics 4. Follow those metrics!
  20. 20. Data Prep: Overview • Step 1: Build Sessions (Pig) • Step 2: Parse IP/Time/.. (Custom Python) • Step 3: Parse Sequences (Hive or custom Python) • Step 4: Build user-level stats (Hive) RAW DATA → READY FOR ML
  21. 21. Step 1. Build Session • Use Hive (or Pig) • Group into “Session” • Depending on the variable – IP, Device → Select only one per log – URL, Event → Create an ordered array that represents the sequence of events in the session
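The slide does this step in Hive or Pig; the grouping logic itself can be sketched in plain Python. The 30-minute inactivity timeout and the `(visitor_id, timestamp, ip, url)` record layout are assumptions made for illustration, since the slides do not say how session boundaries are chosen.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def build_sessions(events, gap=SESSION_GAP_SECONDS):
    """Group (visitor_id, timestamp, ip, url) log lines into sessions.

    Each session keeps a single IP (the first one seen) and an ordered
    list of URLs representing the sequence of events.
    """
    by_visitor = defaultdict(list)
    for visitor_id, ts, ip, url in events:
        by_visitor[visitor_id].append((ts, ip, url))

    sessions = []
    for visitor_id, rows in by_visitor.items():
        rows.sort(key=lambda r: r[0])  # order events in time
        current, last_ts = None, None
        for ts, ip, url in rows:
            # Start a new session on the first event or after a long gap.
            if current is None or ts - last_ts > gap:
                current = {"visitor": visitor_id, "start": ts, "ip": ip, "urls": []}
                sessions.append(current)
            current["urls"].append(url)
            last_ts = ts
    return sessions


events = [
    ("u1", 0, "1.2.3.4", "/home"),
    ("u1", 60, "1.2.3.4", "/products"),
    ("u1", 7200, "1.2.3.4", "/blog"),  # two hours later: new session
    ("u2", 30, "5.6.7.8", "/home"),
]
sessions = build_sessions(events)
```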
  22. 22. Step 2 : Basic Feature • IP Address → Location, City • User-Agent → Device • Timestamp → User Time → Day or night ? Python + Hadoop Streaming Option 1 Option 2
  23. 23. Extracted Data [Figure: sample rows showing the ORIGINAL log columns next to the NEW columns: Country (from IP), Device (from User-Agent), Hour (from Country & Time)]
  24. 24. Step 3: Session Signals • Simple Signals – Number of Page Views – Time Spent ….. – Etc… • Limitation → It might not help that much to differentiate behaviour
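A minimal sketch of the kind of per-record function that could run inside a Hadoop Streaming mapper for this step. The substring heuristics, the 7:00 to 21:00 "day" window and the explicit UTC offset are illustrative assumptions; IP-to-location lookup is left out because it needs an external GeoIP database.

```python
from datetime import datetime, timedelta, timezone


def device_from_user_agent(user_agent):
    """Very rough device class from a User-Agent string (heuristic only)."""
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


def local_hour(utc_timestamp, utc_offset_hours):
    """Hour of day in the user's local time, from a UTC epoch timestamp.

    The offset would normally be derived from the country found via the IP.
    """
    utc = datetime.fromtimestamp(utc_timestamp, tz=timezone.utc)
    return (utc + timedelta(hours=utc_offset_hours)).hour


def day_or_night(hour):
    """Assumed split: 07:00 to 20:59 counts as day."""
    return "day" if 7 <= hour < 21 else "night"
```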
  25. 25. More Elaborate: N-Grams Model
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
      DNA | Base Pair | …ATTAGCAT.. | A, T, T, A | AT, TT, TA, AG, | ATT, TTA, TAG, ..
      NLP (character) | Character | ..some like it hot… | s, o, m, e, l, i, t.. | so, om, el, li, it | som, ome, me_, _li, lik, ..
      NLP (word) | Word | ..some like it hot… | some, like, it | some-like, like-it | some-like-it, like-it-hot
  26. 26. N-Grams Model For Sessions
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
      DNA | Base Pair | …ATTAGCAT.. | A, T, T, A | AT, TT, TA, AG, | ATT, TTA, TAG, ..
      NLP (character) | Character | ..some like it hot… | s, o, m, e, l, i, t.. | so, om, el, li, it | som, ome, me_, _li, lik, ..
      NLP (word) | Word | ..some like it hot… | some, like, it | some-like, like-it | some-like-it, like-it-hot
      Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home /products, /products /trynow, /trynow /blog | /home-/products-/trynow, /products-/trynow-/blog
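The web-session row of the table can be reproduced with a few lines of Python. This is only a sketch of the n-gram idea applied to a page sequence; the helper name `ngrams` is an assumption (the slides use a Hive UDF for the real computation).

```python
def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


pages = ["/home", "/products", "/trynow", "/blog"]
bigrams = ngrams(pages, 2)
trigrams = ngrams(pages, 3)
```

A session shorter than `n` simply yields no n-gram of that order.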
  27. 27. Session N-Grams Analytics
      Campaign / URL / Event | Detailed Token (N-Grams Fine Grain) | Simple Token (N-Grams Coarse Grain)
      utm=google_search | google-search-my-site | google-search
      /home | home | home
      /search?q=baseball | search-baseball | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player-comming | sport
      /search?q=Mick+JONES | search-mick+jones | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player/comming | sport
      /politics/home | politics-home | politics
      Important Tricks: • Incorporate the first referrer / marketing campaign as FIRST TOKEN • Build two levels of tokens: detailed, and category only
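A rough sketch of how the two token levels could be derived for site-relative URLs. The rules below (join path segments with "-", append the `q` search term, use the first path segment as the category) are assumptions inferred from the table rows for `/home`, `/search?q=…` and `/politics/home`; campaign and click events would need their own rules and are not covered.

```python
from urllib.parse import parse_qs, urlsplit


def url_tokens(url):
    """Return (detailed_token, simple_token) for a site-relative URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return ("home", "home")  # assumed convention for the root page
    detailed = list(segments)
    query = parse_qs(parts.query).get("q")
    if query:
        # Keep the search term in the detailed token only.
        detailed.append(query[0].lower().replace(" ", "+"))
    return ("-".join(detailed), segments[0])


urls = ["/home", "/search?q=baseball", "/search?q=Mick+JONES", "/politics/home"]
tokens = [url_tokens(u) for u in urls]
```

The referrer or campaign token would then be prepended to each session's token list before computing n-grams, as the first trick suggests.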
  28. 28. How To In Practice • Hive query using the n-grams UDF • Compute the LLR (Log-Likelihood Ratio) metrics • Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session • Hint: Set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
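The slides do not spell out how the LLR is computed. A common choice is Dunning's log-likelihood ratio on a 2x2 contingency table, sketched here under the assumption that the table counts sessions with or without a given n-gram against sessions inside or outside some group of interest.

```python
from math import log


def _x_log_x(x):
    return 0.0 if x == 0 else x * log(x)


def _entropy(*counts):
    """Unnormalised Shannon entropy of a set of counts."""
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table.

    k11: sessions with the n-gram, in the group
    k12: sessions with the n-gram, outside the group
    k21: sessions without the n-gram, in the group
    k22: sessions without the n-gram, outside the group
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(10, 10, 10, 10)
associated = log_likelihood_ratio(100, 0, 0, 100)
```

A score near zero means the n-gram is no more frequent in the group than elsewhere; large scores flag n-grams worth keeping as features.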
  29. 29. Step 4. Cohort-like data • Per cookie compute metrics – Nb. Days since first visit – Nb visits in the last 30 days – Average session time – … • Reintegrate this information • Easily achieved with a HiveQL query
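The slide computes these per-cookie metrics with a HiveQL query; the same aggregation is sketched below in Python for illustration. The `(cookie, day_number, duration_seconds)` record layout and the integer day numbering are assumptions.

```python
from collections import defaultdict


def cohort_metrics(sessions, today):
    """Per-cookie metrics from (cookie, day_number, duration_seconds) rows."""
    per_cookie = defaultdict(list)
    for cookie, day, duration in sessions:
        per_cookie[cookie].append((day, duration))

    metrics = {}
    for cookie, rows in per_cookie.items():
        days = [d for d, _ in rows]
        metrics[cookie] = {
            "days_since_first_visit": today - min(days),
            "visits_last_30_days": sum(1 for d in days if today - d < 30),
            "avg_session_seconds": sum(t for _, t in rows) / len(rows),
        }
    return metrics


rows = [("c1", 0, 60), ("c1", 80, 120), ("c1", 95, 180), ("c2", 99, 30)]
metrics = cohort_metrics(rows, today=100)
```

These values are then joined back onto each session so that the clustering sees both session-level and user-level features.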
  30. 30. Machine Learning for HDFS Data
      Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
      Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
      Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
      H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
      Open Source R + Hadoop | Varies | Varies | Varies | Varies
      Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
      Spark + MLLib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
  31. 31. How Big is our data here? Step 1 Build Sessions → Step 2 Parse IP/Time/.. → Step 3 Parse Sequences → Step 4 Build user-level stats RAW DATA (5 TB) → READY FOR ML (10 GB) Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month
  32. 32. Clustering With Scikit on HDFS 1. Use Pydoop to get the data onto the training server 2. Use pandas to read the data and transform it to numerical features 3. KMeans().fit() 4. IPython to draw some graphs 5. Enjoy
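Step 3 of this recipe might look like the sketch below once the data is in memory. The synthetic feature matrix (page views, minutes spent, comments), the choice of three clusters and the scaling step are all assumptions made for illustration; the Pydoop transfer and the pandas preparation are left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical session features: [page views, minutes spent, comments].
X = np.vstack([
    rng.normal([2, 1, 0], 0.3, size=(100, 3)),    # short visits
    rng.normal([10, 15, 0], 0.3, size=(100, 3)),  # long reading sessions
    rng.normal([5, 8, 6], 0.3, size=(100, 3)),    # commenting sessions
])

# Scale features so that no single unit dominates the distance.
X_scaled = StandardScaler().fit_transform(X)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = model.labels_
```

On real session data the number of clusters is not known in advance and is usually chosen by inspecting the resulting groups.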
  33. 33. Session Data
  34. 34. Clustering
  35. 35. Clustering & Cluster Sampling Take a balanced number of samples in each cluster, close to the centroid
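Picking the samples to show to the people doing the labelling can be sketched as follows, assuming the feature matrix, cluster labels and centroids are NumPy arrays. The function name and the default of five samples per cluster are illustrative.

```python
import numpy as np


def sample_near_centroids(X, labels, centers, per_cluster=5):
    """Return, per cluster, the row indices of the points closest to its centroid."""
    picked = {}
    for c, center in enumerate(centers):
        members = np.flatnonzero(labels == c)
        dist = np.linalg.norm(X[members] - center, axis=1)
        # Keep the same number of samples for every cluster (balanced).
        picked[c] = members[np.argsort(dist)[:per_cluster]].tolist()
    return picked


X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0],
              [10.0, 0.0], [10.2, 0.0], [13.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
picked = sample_near_centroids(X, labels, centers, per_cluster=2)
```

Sampling near the centroid favours typical sessions, which tend to be easier for a human to name than borderline ones.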
  36. 36. Labelling: Visualizing Sessions [Figure: timeline of one session with events at 0’00, 0’12, 1’04, 1’45 and 3’02, labelled “Search for a specific Topic”] I can guess what this guy was doing!
  37. 37. Labelling Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?)
  38. 38. What if ? Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?)
  39. 39. Supervised Learning Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) Independently of the clusters, use the labeled examples to classify each session into the predefined segments
  40. 40. Supervised Learning : e.g. in python • Load the data and the labels in Python (pandas) • Fit a model on the labeled sessions • Save the model in HDFS (Python pickle) • Run the model against all the data (Hadoop Streaming) We’ve got a tool to help you do that in Data Science Studio. It’s called the Doctor and it’s fun to use!
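The four bullets can be sketched end to end on toy data. The random forest, the two synthetic segments and the in-memory pickle are assumptions made for illustration; in the setup described, the pickled bytes would be written to HDFS and loaded by each Hadoop Streaming task.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical labeled sessions: two well-separated segments.
X_labeled = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y_labeled = np.array(["newcomer"] * 20 + ["fan"] * 20)

# Fit a model on the labeled sessions.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_labeled, y_labeled)

# Serialise the model (the bytes would be stored in HDFS).
blob = pickle.dumps(clf)

# Later, in the scoring job: load the model and classify all sessions.
restored = pickle.loads(blob)
X_all = np.array([[0.1, -0.2], [4.8, 5.3]])
predicted = restored.predict(X_all)
```

Pickled scikit-learn models should be loaded with the same library version that produced them.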
  41. 41. Compute Metrics Per Segment Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs
  42. 42. User Satisfaction Metrics • Future-Based Metrics – Will the user most likely subscribe/pay in the future? • Expressed-Opinion – Does the user look satisfied, judging from their behaviour?
  43. 43. Opinion-Based Training For User Satisfaction Use user feedback as “labels” to build a model of satisfaction “Predict” a satisfaction score for sessions without feedback Session Data (100 million sessions) + Feedback (10,000 feedbacks) → Scored Sessions HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS THEY HAVE SIMILAR USER SATISFACTION LEVELS
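The hypothesis on this slide maps naturally onto a nearest-neighbour model: score a session with the average satisfaction of the most similar sessions that did leave feedback. The sketch below assumes numeric session features and uses k-nearest-neighbour averaging as a stand-in; the slides do not say which model was actually used.

```python
import numpy as np


def knn_satisfaction(X_feedback, scores, X_sessions, k=3):
    """Predict satisfaction as the mean score of the k nearest feedback sessions."""
    X_feedback = np.asarray(X_feedback, dtype=float)
    scores = np.asarray(scores, dtype=float)
    X_sessions = np.asarray(X_sessions, dtype=float)
    # Pairwise Euclidean distances, shape (n_sessions, n_feedback).
    dist = np.linalg.norm(X_sessions[:, None, :] - X_feedback[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return scores[nearest].mean(axis=1)


X_feedback = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
scores = [0.9, 0.8, 1.0, 0.1, 0.2, 0.3]
X_sessions = [[0.2, 0.2], [10.5, 10.5]]
predicted_scores = knn_satisfaction(X_feedback, scores, X_sessions)
```

The full pairwise distance matrix only fits in memory for small batches, so at the scale quoted on the slide the scoring would have to run batch by batch.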
  44. 44. Compute Metrics Per Segment Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs SATISFACTION SCORE 0.87 SATISFACTION SCORE 0.37 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12 SATISFACTION SCORE 0.28 SATISFACTION SCORE 0.12
  45. 45. Data in Time: Smoothing In Red: The Base Metric In Blue: The smoothed metric RAW DATA MAY VARY A LOT FROM DAY TO DAY IT WILL SCARE PEOPLE
  46. 46. Exponential Smoothing In Hive SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01')) FROM stats GROUP BY segment These factors determine whether you smooth a lot or not, and over how many days
  47. 47. Final : Follow Smoothed Satisfaction Search for a specific Topic Newcomer from Google News Foreigner Discovering The Site Fan that loves to comment Home Page Wanderer Dark Bot (Competitor?) Follow Satisfaction Metric Per Segment Damn, our latest release has diverging effects on segments
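The exact meaning of the `moving_avg` UDF arguments is not given in the slides. As a point of comparison, simple exponential smoothing of a daily series can be sketched in a few lines of Python; the `alpha` parameter here is an assumption and does not correspond to any particular UDF argument.

```python
def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1).

    A small alpha smooths heavily; alpha = 1 returns the raw series.
    """
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # initialise with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


daily_satisfaction = [0, 10]
smooth = exponential_smoothing(daily_satisfaction, alpha=0.5)
```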
  48. 48. Thank You ! Florian Douetteau @fdouetteau Questions now or later: florian.douetteau@dataiku.com dataiku.com
