Dataiku - Hadoop Summit - Semi-supervised learning with Hadoop for understanding user web behaviours
Presentation made at Hadoop Summit Amsterdam 2014.


Published in: Technology

Transcript

  • 1. ½ S L using to turn into
  • 2. Semi-Supervised Learning on Hadoop to understand user behaviors. Hadoop Summit Amsterdam, 2-3 April 2014
  • 3. Florian Douetteau @fdouetteau www.dataiku.com Data Science Studio
  • 4. Motivation: What matters on a website? • CxO – Page Views, Unique Visitors, Dollars, Subscriptions • Editor / Product Manager – Time Spent, Comments • Users – Content
  • 5. Key Usage Metrics • Publisher – Time Spent on Page – Number of pages seen – Number of comments – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – In-App Purchase
  • 6. The Quest for the Missing Proxy • Publisher – Time Spent on Page – Number of pages seen – Number of comments – User Satisfaction – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – User Satisfaction – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – User Satisfaction – In-App Purchase
  • 7. Question: How to measure and drive user satisfaction on a large website with very diverse usage patterns?
  • 8. The Problem: BEHAVIOUR DIVERSITY • Newcomers from Google News • People coming from Twitter and Facebook posts • People coming to the website almost every day • People who love to comment • Foreigners • Robots • People fond of the sport section only • … • AVERAGED METRICS WOULD HIDE IMPORTANT VARIATIONS ON SPECIFIC SEGMENTS
  • 9. Subproblem 1: Hard Segments • Segment users by number of visits per month – > 20 days per month → Engaged Users • Segment by converted ("transformed") or not • Segment by country
  • 10. Subproblem 2: Hard Metrics • Newspaper: Time spent on the website → log(Number of page views) + Number of actions • Search engine: Click ratio • E-Commerce: Conversion ("transformation") ratio
  • 11. Limits • Hard Segments → MISSING PART OF THE REALITY • Hard Metrics → ARGUING BETWEEN TEAMS
  • 12. Semi-Supervised Learning • Supervised Learning: training data is all labeled data → Model • Unsupervised Learning: training data is all unlabeled data → Model • Semi-Supervised Learning: training data is some labeled data plus lots of unlabeled data → Model
  • 13. ½ SL – Natural Language Processing • "I hope I’ll enjoy Amsterdam, and not only because of Hadoop" / "Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop" • Statistical Knowledge → Text Structure (Unsupervised) • Aligned Corpus (Supervised)
  • 14. ½ SL Applied to Web Sessions • Lots of customer sessions • Not so much concrete customer feedback • Subscription
  • 15. Semi-Supervised Learning: 3 Approaches • Generative models, e.g. Gaussian fits – Assume all data follows a Gaussian distribution with parameter X – Find the X that best fits the distribution of both the labeled and the unlabeled data • Fits with costs – Supervised learning with a cost function that captures a distance between points derived from the structure of the unlabeled data • Ad hoc: combine unsupervised, then supervised
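The ad hoc approach on slide 15 (unsupervised first, then supervised) can be sketched in its simplest form as label propagation inside clusters. This is only an illustration, not the speaker's implementation: the function name `propagate_labels` and the sample labels are hypothetical, and the cluster ids are assumed to come from any clustering step.

```python
from collections import Counter

def propagate_labels(cluster_ids, known_labels):
    """Give every point the majority label of its cluster.

    cluster_ids:  list with one cluster id per data point (from any clustering).
    known_labels: dict {point index: label} for the few labeled points.
    Points in a cluster without any labeled member get None.
    """
    votes = {}
    for idx, label in known_labels.items():
        votes.setdefault(cluster_ids[idx], Counter())[label] += 1
    majority = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [majority.get(c) for c in cluster_ids]

# Hypothetical example: six sessions in three clusters, three of them labeled.
cluster_ids = [0, 0, 0, 1, 1, 2]
known_labels = {0: "reader", 2: "reader", 4: "bot"}
labels = propagate_labels(cluster_ids, known_labels)
```

Clusters with no labeled member stay unlabeled (`None`), which is a signal that more samples from that cluster need to be tagged.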
  • 16. Clustering+Supervised in practice Unlabeled training data points in grey Labeled training data points in color
  • 17. Supervised Learning Only
  • 18. ½ SL : Fit to the underlying structure
  • 19. Our Approach 1. (Lots of) data preparation to build meaningful user sessions 2. Cluster the sessions and have end users validate/tag those clusters 3. Create predictive user satisfaction metrics 4. Follow those metrics!
  • 20. Data Prep: Overview (RAW DATA → READY FOR ML) • Step 1: Build Sessions (Pig) • Step 2: Parse IP/Time/.. (custom Python (or )) • Step 3: Parse Sequences (Hive or custom Python) • Step 4: Build user-level stats (Hive)
  • 21. Step 1. Build Session • Use Hive (or Pig) • Group into “Session” • Depending on the variable – IP, Device → Select only one per log – URL, Event → Create an ordered array that represents the sequence of events in the session
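The grouping logic of Step 1 can be illustrated outside Hive or Pig with a few lines of plain Python. This is a sketch under assumptions the slides do not state: hits carry a `visitor` key, a session ends after 30 minutes of inactivity (`SESSION_GAP`), and the field names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # seconds of inactivity that closes a session (assumed value)

def build_sessions(rows):
    """rows: iterable of dicts with keys visitor, ts (epoch seconds), ip, device, url."""
    sessions = []
    ordered = sorted(rows, key=itemgetter("visitor", "ts"))
    for visitor, hits in groupby(ordered, key=itemgetter("visitor")):
        current = None
        for hit in hits:
            if current is None or hit["ts"] - current["end"] > SESSION_GAP:
                # IP and device: keep a single value per session (the first one seen).
                current = {"visitor": visitor, "ip": hit["ip"], "device": hit["device"],
                           "start": hit["ts"], "end": hit["ts"], "urls": []}
                sessions.append(current)
            current["end"] = hit["ts"]
            # URLs: ordered array representing the sequence of events.
            current["urls"].append(hit["url"])
    return sessions
```

In a real pipeline the same grouping would be expressed as a GROUP BY with a collected, time-ordered array, as the slide suggests.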
  • 22. Step 2: Basic Features • IP Address → Location, City • User-Agent → Device • Timestamp → User Time → Day or night? • Python + Hadoop Streaming Option 1 Option 2
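As an illustration of the "Timestamp → User Time → Day or night?" feature, here is a minimal sketch. The country-to-offset table `UTC_OFFSET_HOURS` is hypothetical (fixed offsets, no daylight saving), and the 7:00 to 21:00 "day" window is an assumption; a real job would derive the country from a GeoIP lookup, which is not shown.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed UTC offsets per country code, ignoring daylight saving time.
UTC_OFFSET_HOURS = {"FR": 1, "NL": 1, "US": -5}

def local_hour(epoch_seconds, country):
    """Hour of day in the visitor's (approximate) local time."""
    offset = timezone(timedelta(hours=UTC_OFFSET_HOURS.get(country, 0)))
    return datetime.fromtimestamp(epoch_seconds, tz=offset).hour

def day_or_night(hour):
    """Assumed convention: 07:00 to 20:59 counts as day."""
    return "day" if 7 <= hour < 21 else "night"
```

With Hadoop Streaming, these functions would be called from a mapper script that reads log lines on standard input and writes the enriched lines to standard output.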
  • 23. Extracted Data: the ORIGINAL columns plus three NEW ones • Country (from IP) • Device (from User-Agent) • Hour (from Country & Time)
  • 24. Step 3: Session Signals • Simple Signals – Number of Page Views – Time Spent – Etc. • Limitation → They might not help that much to differentiate behaviours
  • 25. More Elaborate: N-Grams Model
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
      DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
      NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
      NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it, like-it-hot
  • 26. N-Grams Model For Sessions
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
      DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
      NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
      NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it, like-it-hot
      Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home /products, /products /trynow, /trynow /blog | /home-/products-/trynow, /products-/trynow-/blog
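The web-session row of the table can be reproduced with a generic n-gram helper. A minimal sketch in plain Python; the function name `ngrams` is just illustrative and is not the Hive UDF mentioned later in the deck.

```python
def ngrams(sequence, n):
    """All contiguous runs of n items, in order, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than `n` simply yields no n-grams.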
  • 27. Session N-Grams Analytics
      Campaign / URL / Event | Detailed Token | Simple Token
      utm=google_search | google-search-my-site | google-search
      /home | home | home
      /search?q=baseball | search-baseball | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player-comming | sport
      /search?q=Mick+JONES | search-mick+jones | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player/comming | sport
      /politics/home | politics-home | politics
      Detailed tokens → fine-grain n-grams; simple tokens → coarse-grain n-grams
      Important tricks: • Incorporate the first referrer / marketing campaign as FIRST TOKEN • Build two levels of tokens: detailed, and category only
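The two-level tokenization can be sketched as below. The parsing rules are assumptions chosen to resemble the table, not the speaker's actual rules, so some outputs differ (for example the campaign yields `google-search` at both levels); `to_tokens` and `session_tokens` are hypothetical names.

```python
def to_tokens(event):
    """Return (detailed_token, simple_token) for one campaign, URL or click event."""
    if event.startswith("utm="):
        campaign = event[len("utm="):].replace("_", "-")
        return campaign, campaign
    if event.startswith("click="):
        host = event[len("click="):]
        if host.startswith("www."):
            host = host[len("www."):]
        return "click-" + host.split(".")[0], "click"
    if event.startswith("/search?q="):
        return "search-" + event[len("/search?q="):].lower(), "search"
    parts = [p for p in event.split("/") if p]
    if not parts:  # bare "/" has no path segment
        return "root", "root"
    return "-".join(parts), parts[0]

def session_tokens(campaign, events):
    """Put the referrer / campaign first, then tokenize every event at both levels."""
    pairs = [to_tokens(e) for e in [campaign] + list(events)]
    return [d for d, _ in pairs], [s for _, s in pairs]
```

The detailed list feeds the fine-grain n-grams and the simple list feeds the coarse-grain ones.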
  • 28. How To In Practice • Hive query using the n-grams UDF • Compute the LLR (Log-Likelihood Ratio) metrics • Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session • Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
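The slide does not say which counts the LLR is computed on. A common choice is the log-likelihood ratio (G² statistic) of a 2×2 contingency table, for example sessions with and without an n-gram in two groups of sessions. The sketch below assumes that formulation; it is not taken from the talk.

```python
from math import log

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts.

    k11: group A sessions containing the n-gram, k12: group A sessions without it,
    k21: group B sessions containing it,         k22: group B sessions without it.
    """
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2), (k21, row2, col1), (k22, row2, col2))
    # Empty cells contribute nothing (limit of k * log(k) as k goes to 0).
    return 2.0 * sum(k * log(k * total / (r * c)) for k, r, c in cells if k > 0)
```

A score near zero means the n-gram is equally frequent in both groups; larger scores flag n-grams whose frequency differs more than chance would suggest.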
  • 29. Step 4. Cohort-like data • Per cookie, compute metrics – Number of days since first visit – Number of visits in the last 30 days – Average session time – … • Reintegrate this information into the session data • Easily achieved with a HiveQL query
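The per-cookie metrics are a plain group-and-aggregate. The talk does this in HiveQL; the following plain-Python sketch only illustrates the logic, with a hypothetical input layout of `(cookie, start_ts, duration_seconds)` tuples.

```python
from collections import defaultdict

DAY = 86400  # seconds

def cohort_stats(sessions, now):
    """sessions: iterable of (cookie, start_ts, duration_seconds); now: epoch seconds."""
    by_cookie = defaultdict(list)
    for cookie, start, duration in sessions:
        by_cookie[cookie].append((start, duration))
    stats = {}
    for cookie, visits in by_cookie.items():
        first_visit = min(start for start, _ in visits)
        stats[cookie] = {
            "days_since_first_visit": (now - first_visit) // DAY,
            "visits_last_30_days": sum(1 for start, _ in visits if now - start <= 30 * DAY),
            "avg_session_seconds": sum(d for _, d in visits) / len(visits),
        }
    return stats
```

Each session row can then be enriched by looking up its cookie in the returned dictionary.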
  • 30. Machine Learning for HDFS Data
      Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
      Apache Mahout | MapReduce | ~10 available | Expert | TERABYTES
      Python (Scikit+Pandas+…) | Out for training / In for apply | ~20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
      H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
      Open Source R + Hadoop | Varies | Varies | Varies | Varies
      Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
      Spark + MLLib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
  • 31. How big is our data here? Step 1 Build Sessions → Step 2 Parse IP/Time/.. → Step 3 Parse Sequences → Step 4 Build user-level stats • Uncompressed data size, for one year's worth of logs on a website with 10 million unique visitors per month: RAW DATA 5 TB → READY FOR ML 10 GB
  • 32. Clustering With Scikit on HDFS 1. Use Pydoop to get the data onto the training server 2. Use pandas to read the data and transform it to numerical features 3. KMeans().fit() 4. IPython to draw some graphs 5. Enjoy or
  • 33. Session Data
  • 34. Clustering
  • 35. Clustering & Cluster Sampling Take a balanced number of samples in each cluster, close to the centroid
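Picking the same number of sessions closest to each centroid can be sketched as follows. The inputs (feature vectors, cluster assignments, centroids) are assumed to come from a previous clustering step such as k-means; the function name is hypothetical and `math.dist` requires Python 3.8 or later.

```python
from math import dist  # Euclidean distance, Python 3.8+

def sample_near_centroids(points, assignments, centroids, per_cluster):
    """Return {cluster id: indices of the per_cluster points closest to its centroid}."""
    members = {}
    for idx, cluster in enumerate(assignments):
        members.setdefault(cluster, []).append(idx)
    return {
        cluster: sorted(idxs, key=lambda i: dist(points[i], centroids[cluster]))[:per_cluster]
        for cluster, idxs in members.items()
    }
```

Taking the same count from every cluster keeps small but interesting segments (for example bots) from being drowned out by the dominant ones when people label the samples.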
  • 36. Labelling: Visualizing Sessions • Session timeline: 0’00, 0’12, 1’04, 1’45, 3’02 • Label: Search for a specific Topic • "I can guess what this guy was doing!"
  • 37. Labelling • Search for a specific Topic • Newcomer from Google News • Foreigner Discovering The Site • Fan that loves to comment • Home Page Wanderer • Dark Bot (Competitor?)
  • 38. What if? • Search for a specific Topic • Newcomer from Google News • Foreigner Discovering The Site • Fan that loves to comment • Home Page Wanderer • Dark Bot (Competitor?)
  • 39. Supervised Learning • Search for a specific Topic • Newcomer from Google News • Foreigner Discovering The Site • Fan that loves to comment • Home Page Wanderer • Dark Bot (Competitor?) • Independently of the clusters, use the labeled examples to classify each session into the predefined segments
  • 40. Supervised Learning: e.g. in Python • Load the data and the labels in Python (pandas) • Fit a model on the labeled sessions • Save the model in HDFS (Python pickle) • Run the model against all the data (Hadoop Streaming) • We’ve got a tool to help you do that in Data Science Studio. It’s called the Doctor and it’s fun to use!
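The fit, pickle, apply cycle can be sketched with the standard library only. A nearest-centroid classifier stands in here for whatever model is actually fitted (the talk uses scikit-learn), the tab-separated line layout is an assumption, and the HDFS upload and the Hadoop Streaming job submission are not shown.

```python
import pickle
from math import dist  # Python 3.8+

def fit_centroids(features, labels):
    """Train a nearest-centroid model: {label: mean feature vector}."""
    sums, counts = {}, {}
    for vector, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, vector):
    """Label of the closest centroid."""
    return min(model, key=lambda label: dist(vector, model[label]))

def score_lines(lines, model):
    """Streaming-style mapper body: 'session_id<TAB>f1<TAB>f2...' in, 'session_id<TAB>label' out."""
    for line in lines:
        session_id, *values = line.rstrip("\n").split("\t")
        yield session_id + "\t" + predict(model, [float(v) for v in values])

# Fit on the few labeled sessions, then serialize the model as it would be shipped to the cluster.
model = fit_centroids([[1, 0], [2, 0], [0, 9], [0, 11]], ["wanderer", "wanderer", "fan", "fan"])
restored = pickle.loads(pickle.dumps(model))
```

In a streaming job, the mapper would load the pickled model once at start-up and call `score_lines(sys.stdin, model)`, printing each result.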
  • 41. Compute Metrics Per Segment • Segments: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?) • Figures on the slide: 0.3€ per session, 0.23€ acquisition costs; 13k sessions; 1.3€ per session, 0.23€ acquisition costs; 938k sessions; 938k sessions; 0.3€ per session, 0.23€ acquisition costs; 738k sessions; 0.83€ per session, 0.73€ acquisition costs; 68k sessions; 0.3€ per session, 1.23€ acquisition costs; 1k sessions; 0€ per session, 0€ acquisition costs
  • 42. User Satisfaction Metrics • Future-Based Metrics – Is the user likely to subscribe/pay in the future? • Expressed Opinion – Does the user seem satisfied, judging from their behaviour?
  • 43. Opinion-Based Training For User Satisfaction • Use user feedback as “labels” to build a model of satisfaction • “Predict” a satisfaction score for sessions without feedback • Session Data (100 million sessions) + Feedback (10,000 feedbacks) → Scored Sessions • HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
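The hypothesis on slide 43 (similar navigation patterns imply similar satisfaction) maps naturally onto a nearest-neighbour estimate. The sketch below is one possible reading, not the model used in the talk: it averages the feedback scores of the `k` most similar labeled sessions, with hypothetical feature vectors.

```python
from math import dist  # Python 3.8+

def predict_satisfaction(session, labeled, k=3):
    """Average feedback score of the k labeled sessions closest to `session`.

    labeled: non-empty list of (feature_vector, satisfaction_score) pairs.
    """
    nearest = sorted(labeled, key=lambda item: dist(session, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical labeled sessions: two feature values each, score between 0 and 1.
labeled = [((0, 0), 1.0), ((0, 1), 0.8), ((1, 0), 0.9), ((10, 10), 0.1), ((10, 11), 0.2)]
score = predict_satisfaction((0.2, 0.2), labeled, k=3)
```

With about 10,000 feedbacks against 100 million sessions, any such model should be checked on held-out feedback before its scores are reported per segment.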
  • 44. Compute Metrics Per Segment • Segments: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?) • Figures on the slide: 0.3€ per session, 0.23€ acquisition costs; 13k sessions; 1.3€ per session, 0.23€ acquisition costs; 938k sessions; 938k sessions; 0.3€ per session, 0.23€ acquisition costs; 738k sessions; 0.83€ per session, 0.73€ acquisition costs; 68k sessions; 0.3€ per session, 1.23€ acquisition costs; 1k sessions; 0€ per session, 0€ acquisition costs • SATISFACTION SCORES: 0.87; 0.37; 0.28; 0.12; 0.28; 0.12
  • 45. Data in Time: Smoothing • In red: the base metric • In blue: the smoothed metric • RAW DATA MAY VARY A LOT FROM DAY TO DAY: IT WILL SCARE PEOPLE
  • 46. Exponential Smoothing In Hive: SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-15-01', '2014-01-01')) FROM stats GROUP BY segment • The numeric factors determine whether you smooth a lot or not, and over how many days
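The slide does not document the parameters of the `moving_avg` UDF, so the sketch below does not try to reproduce it. It shows textbook simple exponential smoothing, where a single factor `alpha` (an assumed name) controls how strongly the daily metric is smoothed.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    alpha close to 1 follows the raw data; alpha close to 0 smooths heavily.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)  # the first point has nothing to be averaged with
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

daily_satisfaction = [1.0, 0.0, 1.0]
smooth = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

Applied per segment, this yields the kind of smoothed curve described on slide 45, where daily swings are damped.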
  • 47. Final: Follow Smoothed Satisfaction • Search for a specific Topic • Newcomer from Google News • Foreigner Discovering The Site • Fan that loves to comment • Home Page Wanderer • Dark Bot (Competitor?) • Follow the satisfaction metric per segment • "Damn, our latest release has diverging effects on segments"
  • 48. Thank You ! Florian Douetteau @fdouetteau Questions now or later: florian.douetteau@dataiku.com dataiku.com
