Semi-Supervised Learning on
Hadoop to understand user
behaviors
Hadoop Summit Amsterdam
2-3 April 2014
Florian Douetteau
@fdouetteau
www.dataiku.com
Data Science
Studio
Motivation
• CxO
– Page Views, Unique Visitors, Dollars, Subscriptions
• Editor / Product Manager
– Time Spent, Comments
• Users
– Content
What matters on a website?
Key Usage Metrics
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– In-App Purchase
The Quest for the Missing Proxy
• Publisher
– Time Spent on Page
– Number of pages seen
– Number of comments
– User Satisfaction
– Move to Subscription
• Search Engine
– Click on first hits / re-click
– Rephrasing ratio
– User Satisfaction
– Will come back tomorrow
– Click on Advertising
• Online Game
– Time spent in the game
– Level Progress
– User Satisfaction
– In-App Purchase
Question
How can we measure and drive user satisfaction on a large website with very diverse usage patterns?
The Problem
• Newcomers from Google News
• People coming from Twitter and Facebook posts
• People coming to the website almost every day
• People who love to comment
• Foreigners
• Robots
• People fond of the sport section only
• …
BEHAVIOUR DIVERSITY
AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
SubProblem 1: Hard Segments
• Segment users by number of visits per month
– > 20 days per month -> Engaged Users
• Segment by transformed (converted) or not
• Segment by country
Subproblem 2: Hard Metrics
• Newspaper
– Time spent on the website × log(Number of page views) + Number of actions
• Search engine
– Click ratio
• E-Commerce
– Transformation (conversion) ratio
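A hand-built metric of this kind can be sketched as follows. This is only an illustration: it assumes the operator between "time spent" and "log(page views)" is a multiplication, that time is measured in seconds, and that the logarithm is natural. The function names are hypothetical.

```python
import math

def newspaper_engagement(time_spent_s, page_views, actions):
    """Hand-tuned score: time spent x log(page views) + number of actions."""
    # Guard against log(0) for sessions with no recorded page view.
    if page_views < 1:
        return float(actions)
    return time_spent_s * math.log(page_views) + actions

def ratio(numerator, denominator):
    """Click ratio or transformation ratio; 0.0 when there is no traffic."""
    return numerator / denominator if denominator else 0.0
```

With this formula a single-page session scores only its number of actions, however long the visit, which is the kind of side effect that makes teams argue about hard metrics.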
Limits
• Hard Segments -> MISSING PART OF THE REALITY
• Hard Metrics -> ARGUING BETWEEN TEAMS
Semi-Supervised Learning
Training Data -> Learning -> Model
• All Labeled Data -> Supervised Learning -> Model
• All Unlabeled Data -> Unsupervised Learning -> Model
• Some Labeled Data + Lots of Unlabeled Data -> Semi-Supervised Learning -> Model
½ SL – Natural Language Processing
I hope I’ll enjoy Amsterdam, and not only because of Hadoop
Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop
• Statistical knowledge of text structure (unsupervised)
• Aligned corpus (supervised)
½ SL Applied to Web Sessions
• Lots of customer sessions
• Not so much concrete customer feedback (e.g. subscription)
Semi-Supervised Learning
3 Approaches
• Generative models, e.g. Gaussian fits
– Assume all data fits a Gaussian distribution with parameter X
– Find the X that best fits the distribution of both labeled and unlabeled data
• Fits with costs
– Supervised learning with a cost function that captures a distance between points, related to the structure of the unlabeled data
• Ad hoc: combine unsupervised, then supervised
Clustering+Supervised in practice
Unlabeled training data points in grey
Labeled training data points in color
Supervised Learning Only
½ SL : Fit to the underlying structure
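The "clustering + supervised" idea can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it uses a tiny hand-rolled k-means (a stand-in for a library such as scikit-learn) and a simple "cluster-then-label" rule where each cluster inherits the majority label of its labeled members. The points, label names and function names are made up.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Tiny k-means; centroids start on the first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assign

def cluster_then_label(points, labels, k):
    """labels[i] is a string for labeled points and None for unlabeled ones."""
    _, assign = kmeans(points, k)
    votes = {}
    for a, lab in zip(assign, labels):
        if lab is not None:
            votes.setdefault(a, Counter())[lab] += 1
    # Each cluster inherits the majority label of its labeled members.
    cluster_label = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    return [cluster_label.get(a) for a in assign]

# Two labeled points, four unlabeled ones (hypothetical 2-D session features).
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
labels = ["casual", "engaged", None, None, None, None]
predicted = cluster_then_label(points, labels, k=2)
```

Points in a cluster with no labeled member get `None`, which is a reasonable signal that more labels are needed for that region.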
Our Approach
1. (Lots of) data preparation to build meaningful user sessions
2. Cluster sessions, and have end users validate/tag those clusters
3. Create predictive user satisfaction metrics
4. Follow those metrics!
Data Prep: Overview
RAW DATA
Step 1: Build Sessions (Pig)
Step 2: Parse IP/Time/.. (Custom Python)
Step 3: Parse Sequences (Hive or custom Python)
Step 4: Build user-level stats (Hive)
READY FOR ML
Step 1. Build Session
• Use Hive (or Pig)
• Group log lines into “sessions”
• Depending on the variable
– IP, Device -> Select only one value per session
– URL, Event -> Create an ordered array that represents the sequence of events in the session
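The grouping logic can be sketched in plain Python (the talk does this step in Hive or Pig). The 30-minute inactivity timeout and the field names `cookie`, `ts`, `ip`, `device`, `url` are assumptions made for the illustration.

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP_S = 30 * 60  # assumed inactivity timeout between two sessions

def build_sessions(events):
    """events: iterable of dicts with cookie, ts (epoch seconds), ip, device, url."""
    sessions = []
    ordered = sorted(events, key=itemgetter("cookie", "ts"))
    for cookie, hits in groupby(ordered, key=itemgetter("cookie")):
        current = None
        for hit in hits:
            if current is None or hit["ts"] - current["end"] > SESSION_GAP_S:
                # New session: keep a single IP/device value (the first seen).
                current = {"cookie": cookie, "ip": hit["ip"],
                           "device": hit["device"], "start": hit["ts"],
                           "end": hit["ts"], "urls": []}
                sessions.append(current)
            current["end"] = hit["ts"]
            current["urls"].append(hit["url"])  # ordered sequence of events
    return sessions
```

Each session keeps one IP and one device, plus the ordered array of URLs that later feeds the sequence analysis.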
Step 2 : Basic Feature
• IP Address -> Location, City
• User-Agent -> Device
• Timestamp -> User Time -> Day or night?
Python + Hadoop Streaming
Extracted data: ORIGINAL columns plus NEW columns
• Country from IP
• Device from User-Agent
• Hour from Country & Time
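A streaming-style enrichment step might look like the sketch below. Everything specific here is an assumption for illustration: the IP-prefix table stands in for a real GeoIP database, the UTC offsets are fixed and ignore daylight saving time, the user-agent rules are crude, and "day" is taken to be 07:00 to 22:00 local time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup tables; a real job would use a GeoIP database instead.
COUNTRY_BY_PREFIX = {"81.": "FR", "145.": "NL"}
UTC_OFFSET_H = {"FR": 1, "NL": 1}  # fixed offsets, daylight saving ignored

def country_from_ip(ip):
    for prefix, country in COUNTRY_BY_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return "UNKNOWN"

def device_from_user_agent(user_agent):
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def local_hour(epoch_s, country):
    utc = datetime.fromtimestamp(epoch_s, tz=timezone.utc)
    return (utc + timedelta(hours=UTC_OFFSET_H.get(country, 0))).hour

def enrich(line):
    """Map one tab-separated line (ip, user_agent, epoch) to original + new columns."""
    ip, user_agent, epoch = line.rstrip("\n").split("\t")
    country = country_from_ip(ip)
    hour = local_hour(int(epoch), country)
    period = "day" if 7 <= hour < 22 else "night"
    return "\t".join([ip, user_agent, epoch, country,
                      device_from_user_agent(user_agent), str(hour), period])
```

Under Hadoop Streaming the mapper would simply read lines from standard input and print `enrich(line)` for each one.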
Step 3: Session Signals
• Simple signals
– Number of page views
– Time spent
– Etc.
• Limitation
– They might not help that much to differentiate behaviours
More Elaborate: N-Grams Model
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
DNA | Base Pair | …ATTAGCAT.. | A, T, T, A | AT, TT, TA, AG, .. | ATT, TTA, TAG, ..
NLP (character) | Character | ..some like it hot… | s, o, m, e, l, i, t, .. | so, om, el, li, it | som, ome, me_, _li, lik, ..
NLP (word) | Word | ..some like it hot… | some, like, it | some-like, like-it | some-like-it, like-it-hot
N-Grams Model For Sessions
Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
Web Sessions | Page View | [/home, /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home-/products, /products-/trynow, /trynow-/blog | /home-/products-/trynow, /products-/trynow-/blog
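Extracting n-grams from a session is a sliding window over the ordered page views. A minimal sketch, using the sample session from the table (the function name and the "-" separator are choices made for the example):

```python
def ngrams(sequence, n):
    """Return the contiguous n-grams of a sequence, each joined with '-'."""
    return ["-".join(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than `n` simply yields no n-gram of that size.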
Session N-Grams Analytics
Campaign / URL / Event | Detailed Token | Simple Token
utm=google_search | google-search-my-site | google-search
/home | home | home
/search?q=baseball | search-baseball | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/search?q=Mick+JONES | search-mick+jones | search
click=www.nfl.com | click-nfl | click
/sport/new-player-com.. | sport/new-player-comming | sport
/politics/home | politics-home | politics
Important tricks:
• Incorporate the first referrer / marketing campaign as FIRST TOKEN
• Build two levels of tokens: detailed, and category only
N-Grams Fine Grain | N-Grams Coarse Grain
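The two-level tokenization can be sketched as below. The rules are guesses that reproduce most rows of the table (search, click, home, section pages); they are not the speaker's actual code. In particular the sketch joins all path segments with "-" and uses the same value for the detailed and simple campaign token.

```python
from urllib.parse import urlsplit, parse_qs

def tokens(event):
    """Return (detailed_token, simple_token) for a URL or a 'click=host' event."""
    if event.startswith("click="):
        host = event.split("=", 1)[1]
        parts = [p for p in host.split(".") if p != "www"]
        return "click-" + parts[0], "click"
    url = urlsplit(event)
    segments = [s for s in url.path.split("/") if s]
    if not segments:
        return "home", "home"
    simple = segments[0]  # category-only token: first path segment
    query = parse_qs(url.query).get("q")
    if query:
        return simple + "-" + query[0].lower().replace(" ", "+"), simple
    return "-".join(segments), simple

def session_tokens(campaign, events):
    """Campaign/referrer goes first, then one (detailed, simple) pair per event."""
    first = campaign.replace("_", "-")
    return [(first, first)] + [tokens(e) for e in events]
```

Running n-gram extraction once on the detailed tokens and once on the simple tokens gives the fine-grain and coarse-grain features.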
How To In Practice
• Hive query using the n-grams UDF
• Compute the LLR (Log-Likelihood Ratio) metric
• Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session
• Hint: set the frequency limit so that > 90% of sessions can be described by a non-detailed n-gram
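Reading LLR as the log-likelihood ratio (G² statistic) of a 2x2 contingency table, one way to compute it is sketched below. What the table counts is an assumption: for example `k11` could be the sessions of one group containing the n-gram, `k12` the sessions of that group without it, and `k21`, `k22` the same counts for all other sessions.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) of a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # 0 * log(0) is taken as 0
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A score near zero means the n-gram is spread evenly across both groups; larger scores flag n-grams that are characteristic of one group.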
Step 4. Cohort-like data
• Compute metrics per cookie
– Nb. of days since first visit
– Nb. of visits in the last 30 days
– Average session time
– …
• Reintegrate this information into the session data
• Easily achieved with a HiveQL query
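The talk computes these per-cookie statistics in HiveQL; the same aggregation is sketched here in Python for one cookie. The input shape (a list of `(day, duration_s)` pairs) and the output key names are assumptions.

```python
from datetime import date

def cohort_stats(sessions, today):
    """sessions: list of (day: date, duration_s) pairs for ONE cookie."""
    days = [d for d, _ in sessions]
    return {
        "days_since_first_visit": (today - min(days)).days,
        "visits_last_30_days": sum(1 for d in days if 0 <= (today - d).days < 30),
        "avg_session_time_s": sum(t for _, t in sessions) / len(sessions),
    }

history = [(date(2014, 1, 1), 60), (date(2014, 3, 20), 120), (date(2014, 4, 1), 180)]
stats = cohort_stats(history, today=date(2014, 4, 2))
```

The resulting values would then be attached as extra columns to every session of that cookie.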
Machine Learning for HDFS Data
Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
Open Source R + Hadoop | Varies | Varies | Varies | Varies
Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
Spark + MLLib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
How big is our data here?
RAW DATA (5 TB) -> Step 1: Build Sessions -> Step 2: Parse IP/Time/.. -> Step 3: Parse Sequences -> Step 4: Build user-level stats -> READY FOR ML (10 GB)
Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month
Clustering With Scikit on HDFS
1. Use Pydoop to get the data onto the training server
2. Use pandas to read the data and transform it to numerical features
3. KMeans().fit()
4. Use IPython to draw some graphs
5. Enjoy
Session Data -> Clustering
Clustering & Cluster Sampling
Take a balanced number of samples
in each cluster, close to the centroid
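Picking a balanced sample near each centroid can be sketched as follows, assuming the clustering step already produced one cluster id per point and one centroid per cluster. The function name and the Euclidean distance are choices made for the example.

```python
import math

def sample_near_centroids(points, assignments, centroids, per_cluster):
    """Return, per cluster id, the indices of the points closest to its centroid."""
    by_cluster = {}
    for idx, cluster in enumerate(assignments):
        by_cluster.setdefault(cluster, []).append(idx)
    return {
        cluster: sorted(
            members,
            key=lambda i: math.dist(points[i], centroids[cluster]),
        )[:per_cluster]
        for cluster, members in by_cluster.items()
    }

points = [(0, 0), (1, 1), (0.1, 0), (9, 9), (10, 10), (10, 10.2)]
assignments = [0, 0, 0, 1, 1, 1]
centroids = [(0, 0), (10, 10)]
to_label = sample_near_centroids(points, assignments, centroids, per_cluster=2)
```

The returned indices are the sessions to show to the people doing the labelling, with the same number of samples for small and large clusters.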
Labelling: Visualizing Sessions
[Figure: timeline of one session, with events at 0’00, 0’12, 1’04, 1’45 and 3’02]
Search for a specific topic
I can guess what this guy was doing!
Labelling
• Search for a specific topic
• Newcomer from Google News
• Foreigner discovering the site
• Fan that loves to comment
• Home page wanderer
• Dark bot (competitor?)
What if ?
Supervised Learning
Independently of the clusters, use the labeled examples to classify each session into the predefined segments
Supervised Learning : e.g. in python
• Load the data and the labels in Python (pandas)
• Fit a model on the labeled sessions
• Save the model in HDFS (Python pickle)
• Run the model against all the data (Hadoop Streaming)
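The fit / pickle / apply flow can be sketched with the standard library only. The nearest-centroid classifier is a deliberately simple stand-in for whatever model is actually trained, and the two features (page views, seconds spent) and the label names are hypothetical.

```python
import pickle

def fit_centroids(rows, labels):
    """'Model' = mean feature vector per label (a stand-in for a real classifier)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))
    return min(centroids, key=sq_dist)

def score_lines(model_bytes, lines):
    """Streaming-style mapper body: 'session_id<TAB>f1<TAB>f2' in, id + label out."""
    centroids = pickle.loads(model_bytes)
    for line in lines:
        session_id, *features = line.rstrip("\n").split("\t")
        yield session_id + "\t" + predict(centroids, [float(f) for f in features])

model = fit_centroids([[1, 30], [2, 45], [40, 900], [35, 1200]],
                      ["newcomer", "newcomer", "fan", "fan"])
blob = pickle.dumps(model)  # in the flow above, these bytes would be stored in HDFS
scored = list(score_lines(blob, ["s1\t3\t60", "s2\t38\t1000"]))
```

Pickled files should only be loaded from a trusted location, since unpickling can execute arbitrary code.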
We’ve got a tool to help you
do that in Data Science Studio
He’s called the Doctor and he’s
fun to use !
Compute Metrics Per Segment
• Search for a specific topic
• Newcomer from Google News
• Foreigner discovering the site
• Fan that loves to comment
• Home page wanderer
• Dark bot (competitor?)
0.3€ per session
0.23€ acquisition costs
13k sessions
1.3€ per session
0.23€ acquisition costs
938k sessions
938k sessions
0.3€ per session
0.23€ acquisition costs
738k sessions
0.83€ per session
0.73€ acquisition costs
68k sessions
0.3€ per session
1.23€ acquisition costs
1k sessions
0€ per session
0€ acquisition costs
User Satisfaction Metrics
• Future-Based Metrics
– Will the user most likely subscribe/pay in the future?
• Expressed-Opinion Metrics
– Does the user seem satisfied, judging from their behaviour?
Opinion-Based Training For User Satisfaction
User feedback as “labels” to build a model of satisfaction
“Predict” a satisfaction score for sessions without feedback
Session Data (100 million sessions) + Feedback (10,000 feedbacks) -> Scored Sessions
HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS, THEY HAVE SIMILAR USER SATISFACTION LEVELS
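The hypothesis that similar navigation implies similar satisfaction can be illustrated with a nearest-neighbour scorer. This is a sketch under assumptions, not the model used in the talk: sessions are represented as sets of n-gram tokens, similarity is the Jaccard index, and the score is a similarity-weighted mean over the `k` most similar sessions that have feedback.

```python
def jaccard(a, b):
    """Similarity between two sets of n-gram tokens (1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_satisfaction(session, labeled, k=3):
    """labeled: list of (token_set, score). Similarity-weighted mean of k nearest."""
    nearest = sorted(labeled, key=lambda item: jaccard(session, item[0]),
                     reverse=True)[:k]
    weights = [jaccard(session, tokens) for tokens, _ in nearest]
    if sum(weights) == 0:
        # Nothing similar: fall back to the global mean of the feedback scores.
        return sum(score for _, score in labeled) / len(labeled)
    return sum(w * s for w, (_, s) in zip(weights, nearest)) / sum(weights)

feedback = [({"home", "search"}, 0.9), ({"home", "sport"}, 0.5), ({"politics"}, 0.1)]
score = predict_satisfaction({"home", "search"}, feedback, k=1)
```

With only a few thousand feedbacks for millions of sessions, the quality of the session features matters far more than the choice of scorer.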
Compute Metrics Per Segment
Same segments and per-session figures as on the previous "Compute Metrics Per Segment" slide, now with a satisfaction score per segment:
SATISFACTION SCORE 0.87
SATISFACTION SCORE 0.37
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
SATISFACTION SCORE 0.28
SATISFACTION SCORE 0.12
Data in Time: Smoothing
In Red: The base metric
In Blue: The smoothed metric
RAW DATA MAY VARY A LOT FROM DAY TO DAY
IT WILL SCARE PEOPLE
Exponential Smoothing In Hive
SELECT segment,
  moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01'))
FROM stats
GROUP BY segment
These factors determine how much you smooth, and over how many days
Final : Follow Smoothed Satisfaction
Follow Satisfaction Metric Per Segment
Damn, our latest release has diverging effects on segments!
Thank You !
Florian Douetteau
@fdouetteau
Questions now or later:
florian.douetteau@dataiku.com
dataiku.com

 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 

Dataiku hadoop summit - semi-supervised learning with hadoop for understanding user web behaviours

  • 1. ½ S L using to turn into
  • 2. Semi-Supervised Learning on Hadoop to understand user behaviors. Hadoop Summit Amsterdam, 2-3 April 2014
  • 4. Motivation • CxO – Page Views, Unique Visitors, Dollars, Subscription • Editor / Product Manager – Time Spent, Comments • Users – Content. What matters on a web site?
  • 5. Key Usage Metrics • Publisher – Time Spent on Page – Number of pages seen – Number of comments – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – In-App Purchase
  • 6. The Quest for the Missing Proxy • Publisher – Time Spent on Page – Number of pages seen – Number of comments – User Satisfaction – Move to Subscription • Search Engine – Click on first hits / re-click – Rephrasing ratio – User Satisfaction – Will come back tomorrow – Click on Advertising • Online Game – Time spent in the game – Level Progress – User Satisfaction – In-App Purchase. USER
  • 7. Question: How to measure and drive user satisfaction on a large web site with very diverse usage patterns?
  • 8. The Problem: Newcomers from Google News; People coming from Twitter and Facebook posts; People coming to the website almost each and every day; People who love to comment; Foreigners; Robots; People fond of the sport section only; …. ….. BEHAVIOUR DIVERSITY: THE AVERAGED METRICS WOULD HIDE IMPORTANT VARIATION ON SPECIFIC SEGMENTS
  • 9. Subproblem 1: Hard Segments • Segment users by number of visits per month – > 20 days per month -> Engaged Users • Segment by converted or not • Segment by country
  • 10. Subproblem 2: Hard Metrics • Newspaper Time Spent on the website  log(Number of page views) + Number of actions • Search engine Click Ratio Click ratio • E-Commerce  Conversion Ratio
  • 11. Limits: Hard Segments → MISSING PART OF THE REALITY. Hard Metrics → ARGUMENTS BETWEEN TEAMS
  • 12. Semi-Supervised Learning. Supervised Learning: all labeled training data → Model. Unsupervised Learning: all unlabeled training data → Model. Semi-Supervised Learning: some labeled data + lots of unlabeled data → Model.
  • 13. ½ SL – Natural Language Processing I hope I’ll enjoy Amsterdam, and not only because of Hadoop Je pense bien passer du bon temps à Amsterdam, et pas seulement grâce à Hadoop Statistical Knowledge  Text Structure (Unsupervised) Aligned Corpus (Supervised)
  • 14. ½ SL Applied to Web Sessions: Lots of customer sessions. Not so much concrete customer feedback. Subscription
  • 15. Semi-Supervised Learning: 3 Approaches • Generative models, e.g. Gaussian fits – All data fits a Gaussian distribution with parameter X – Find the X that best fits the distribution of both labeled and unlabeled data • Fits with costs – Supervised learning with a cost function that captures a distance between points derived from the structure of the unlabeled data • Ad-hoc: Combine unsupervised, then supervised
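The ad-hoc approach (unsupervised first, then supervised) can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the toy 2-D points, the segment names "casual" and "engaged", and the helper names `nearest`, `kmeans`, `label_clusters` and `predict` are all assumptions made for the example. It clusters every point with plain k-means, then gives each cluster the majority label of the few labeled points that fall into it.

```python
import math
from collections import Counter, defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations, starting from the centroids passed in."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep empty ones in place.
        centroids = [
            tuple(sum(axis) / len(groups[i]) for axis in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids

def label_clusters(centroids, labeled):
    """Give each cluster the majority label of the labeled points inside it."""
    votes = defaultdict(Counter)
    for point, label in labeled:
        votes[nearest(point, centroids)][label] += 1
    return {i: counter.most_common(1)[0][0] for i, counter in votes.items()}

# Toy data (assumed): many unlabeled points, very few labeled ones.
unlabeled = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labeled = [((0.1, 0.1), "casual"), ((5.0, 5.0), "engaged")]

all_points = unlabeled + [p for p, _ in labeled]
centroids = kmeans(all_points, centroids=[unlabeled[0], unlabeled[3]])
cluster_label = label_clusters(centroids, labeled)

def predict(point):
    """Label of the cluster the point falls in (None if that cluster has no labels)."""
    return cluster_label.get(nearest(point, centroids))
```

Calling `predict((0.3, 0.2))` returns the label of the cluster around the origin. A cluster that received no labeled point yields `None`, which is a signal that more labeling is needed there.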
  • 16. Clustering+Supervised in practice Unlabeled training data points in grey Labeled training data points in color
  • 18. ½ SL : Fit to the underlying structure
  • 19. Our Approach 1. (Lots of) data preparation to build meaningful user sessions 2. Cluster sessions and have end users validate/tag those clusters 3. Create predictive user satisfaction metrics 4. Follow those metrics!
  • 20. Data Prep: Overview. RAW DATA → Step 1: Build Sessions (Pig) → Step 2: Parse IP/Time/.. (Custom Python (or )) → Step 3: Parse Sequences (Hive or custom Python) → Step 4: Build user-level stats (Hive) → READY FOR ML
  • 21. Step 1. Build Session • Use Hive (or Pig) • Group into “Session” • Depending on the variable – IP, Device → Select only one per log – URL, Event → Create an ordered array that represents the sequence of events in the session
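The slide does this step in Hive or Pig; the grouping logic itself can be sketched in plain Python. Everything specific here is an assumption for illustration: the log field names (`cookie`, `ts`, `ip`, `device`, `url`), the 30-minute inactivity timeout, and the function name `build_sessions`. Per session it keeps a single IP and device and an ordered list of URLs, as the slide describes.

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session (assumed value)

def build_sessions(logs, timeout=SESSION_TIMEOUT):
    """Group raw hits into sessions: one IP/device per session, ordered URL list."""
    sessions = []
    ordered = sorted(logs, key=itemgetter("cookie", "ts"))
    for cookie, hits in groupby(ordered, key=itemgetter("cookie")):
        current = None
        for hit in hits:
            # Start a new session on the first hit or after a long gap.
            if current is None or hit["ts"] - current["end"] > timeout:
                current = {
                    "cookie": cookie,
                    "ip": hit["ip"],          # keep only the first value seen
                    "device": hit["device"],  # keep only the first value seen
                    "start": hit["ts"],
                    "end": hit["ts"],
                    "urls": [],
                }
                sessions.append(current)
            current["end"] = hit["ts"]
            current["urls"].append(hit["url"])
    return sessions

logs = [
    {"cookie": "a", "ts": 60, "ip": "1.1.1.1", "device": "desktop", "url": "/products"},
    {"cookie": "a", "ts": 0, "ip": "1.1.1.1", "device": "desktop", "url": "/home"},
    {"cookie": "a", "ts": 3660, "ip": "1.1.1.1", "device": "desktop", "url": "/blog"},
    {"cookie": "b", "ts": 10, "ip": "2.2.2.2", "device": "mobile", "url": "/home"},
]
sessions = build_sessions(logs)
```

With these sample hits, cookie `a` produces two sessions because the last hit arrives an hour after the previous one.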
  • 22. Step 2: Basic Features • IP Address → Location, City • User-Agent → Device • Timestamp → User Time → Day or night? Python + Hadoop Streaming Option 1 Option 2
  • 23. Extracted Data: the ORIGINAL columns plus three NEW columns: Country (from IP), Device (from User-Agent), Hour (from Country & Time)
  • 24. Step 3: Session Signals • Simple Signals – Number of Page Views – Time Spent ….. – Etc… • Limitation → They might not help that much to differentiate behaviours
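A rough sketch of these basic features in Python follows. It assumes the country has already been resolved from the IP address (a real pipeline would use a geolocation database, which is not shown), uses a tiny hand-written UTC offset table that ignores daylight saving time, and defines "day" as 07:00 to 21:00. The offset table, the thresholds, and the function names are illustrative assumptions, not taken from the deck.

```python
from datetime import datetime, timedelta, timezone

# Illustrative fixed offsets only; daylight saving time is ignored on purpose.
UTC_OFFSET_HOURS = {"FR": 1, "JP": 9, "BR": -3}

def device_from_user_agent(user_agent):
    """Very coarse device guess based on substrings of the User-Agent header."""
    ua = user_agent.lower()
    if "bot" in ua or "spider" in ua or "crawl" in ua:
        return "bot"
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "mobi" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"

def local_hour(epoch_seconds, country):
    """Hour of the day in the user's (approximate) local time."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return (utc + timedelta(hours=UTC_OFFSET_HOURS.get(country, 0))).hour

def day_or_night(hour):
    """Assumed convention: 07:00 to 20:59 counts as day."""
    return "day" if 7 <= hour < 21 else "night"

ts = 1396476000  # 2014-04-02 22:00:00 UTC
features = {
    "device": device_from_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 7_0) Mobile Safari"),
    "hour": local_hour(ts, "FR"),
    "period": day_or_night(local_hour(ts, "FR")),
}
```

Functions of this shape can be applied line by line in a Hadoop Streaming mapper, which is one of the two options the slide mentions.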
  • 25. More Elaborate: N-Grams Model
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Protein | Amino Acid | Cys-Gly-Leu | Cys, Gly, Leu | Cys-Gly, Gly-Leu | Cys-Gly-Leu
      DNA | Base Pair | …ATTAGCAT.. | A,T,T,A | AT,TT,TA,AG, | ATT,TTA,TAG,..
      NLP (character) | Character | ..some like it hot… | s,o,m,e,l,i,t.. | so,om,el,li,it | som,ome,me_,_li,lik,..
      NLP (word) | Word | ..some like it hot… | some,like,it | some-like,like-it | some-like-it, like-it-hot
  • 26. N-Grams Model For Sessions (same table as above, plus one row for web sessions)
      Field | Unit | Sample | 1-Gram | 2-Gram | 3-Gram
      Web Sessions | Page View | [/home , /products, /trynow, /blog] | /home, /products, /trynow, /blog | /home /products, /products /trynow, /trynow /blog | /home-/products-/trynow, /products-/trynow-/blog
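The web-session row of the table can be reproduced with a short sliding-window helper. The function name `ngrams` is an illustrative choice; the sample session is the one from the slide.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

session = ["/home", "/products", "/trynow", "/blog"]

unigrams = ngrams(session, 1)
bigrams = ngrams(session, 2)
trigrams = ngrams(session, 3)
```

A session shorter than `n` simply yields no n-grams, so very short visits contribute only 1-grams.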
  • 27. Session N-Grams Analytics
      Campaign / URL / Event | Detailed Token (fine-grain n-grams) | Simple Token (coarse-grain n-grams)
      utm=google_search | google-search-my-site | google-search
      /home | home | home
      /search?q=baseball | search-baseball | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player-comming | sport
      /search?q=Mick+JONES | search-mick+jones | search
      click=www.nfl.com | click-nfl | click
      /sport/new-player-com.. | sport/new-player/comming | sport
      /politics/home | politics-home | politics
      Important Tricks: • Incorporate the first referrer / marketing campaign as FIRST TOKEN • Build two levels of tokens: detailed, and category only
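One possible way to derive the two token levels is sketched below. The parsing rules are assumptions reverse-engineered from a few rows of the table (search, click, and simple path rows); the function name `tokens_for` and the `"root"` fallback are hypothetical. For campaign events the slide's detailed token carries extra information (`my-site`) that cannot be derived from the event string shown, so the sketch returns the same token at both levels.

```python
from urllib.parse import parse_qs, urlsplit

def tokens_for(event):
    """Return (detailed_token, simple_token) for one raw URL or event string."""
    if event.startswith("utm="):
        campaign = event.split("=", 1)[1].replace("_", "-").lower()
        return campaign, campaign
    if event.startswith("click="):
        host = event.split("=", 1)[1].lower()
        parts = [p for p in host.split(".") if p not in ("www", "com")]
        return "click-" + "-".join(parts), "click"
    url = urlsplit(event)
    segments = [s for s in url.path.split("/") if s]
    if not segments:
        return "root", "root"  # hypothetical token for the bare "/" path
    # Keep the search terms in the detailed token only.
    query_terms = [q.replace(" ", "+") for q in parse_qs(url.query).get("q", [])]
    detailed = "-".join(segments + query_terms).lower()
    return detailed, segments[0].lower()

# The campaign comes first, following the "first token" trick from the slide.
events = ["utm=google_search", "/home", "/search?q=baseball", "click=www.nfl.com"]
detailed_tokens = [tokens_for(e)[0] for e in events]
simple_tokens = [tokens_for(e)[1] for e in events]
```

The two resulting sequences can then be fed separately to an n-gram extractor to get fine-grain and coarse-grain n-grams.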
  • 28. How To In Practice • Hive query using the n-grams UDF • Compute the LLR (Log-Likelihood Ratio) metric • Keep the most frequent n-grams of each type (detailed / non-detailed) as features for the session • Hint: Set the frequency limit so that more than 90% of sessions can be described by a non-detailed n-gram
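The slide does not spell out which counts go into the LLR. A common form is the log-likelihood ratio (G² statistic) on a 2x2 contingency table, sketched here under the assumption that the table counts sessions with and without a given n-gram, split by some outcome such as subscription. The function name `llr` and the argument layout are illustrative.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 table of counts.

    Assumed layout: k11 = sessions with the n-gram and the outcome,
    k12 = with the n-gram, without the outcome,
    k21 = without the n-gram, with the outcome,
    k22 = neither.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    # Sum of observed * ln(observed / expected); empty cells contribute nothing.
    return 2.0 * sum(k * math.log(k * total / (r * c)) for k, r, c in cells if k > 0)

independent = llr(10, 10, 10, 10)   # n-gram and outcome unrelated
associated = llr(60, 40, 40, 60)    # mild association
```

A score near zero means the n-gram carries no information about the outcome; larger scores flag n-grams worth keeping as features.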
  • 29. Step 4. Cohort-like data • Per cookie compute metrics – Nb. Days since first visit – Nb visits in the last 30 days – Average session time – … • Reintegrate this information • Easily achieved with a HiveQL query
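The slide computes these per-cookie metrics with a HiveQL query; the same aggregation is sketched here in Python for readability. The session field names (`cookie`, `day`, `duration`) and the function name `cohort_stats` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta

def cohort_stats(sessions, today):
    """Per-cookie metrics: age in days, visits in the last 30 days, mean session time."""
    by_cookie = defaultdict(list)
    for s in sessions:
        by_cookie[s["cookie"]].append(s)
    window_start = today - timedelta(days=30)
    stats = {}
    for cookie, rows in by_cookie.items():
        stats[cookie] = {
            "days_since_first_visit": (today - min(r["day"] for r in rows)).days,
            "visits_last_30_days": sum(1 for r in rows if window_start < r["day"] <= today),
            "avg_session_seconds": sum(r["duration"] for r in rows) / len(rows),
        }
    return stats

sessions = [
    {"cookie": "a", "day": date(2014, 1, 1), "duration": 120},
    {"cookie": "a", "day": date(2014, 3, 20), "duration": 60},
    {"cookie": "a", "day": date(2014, 4, 1), "duration": 180},
    {"cookie": "b", "day": date(2014, 4, 2), "duration": 30},
]
stats = cohort_stats(sessions, today=date(2014, 4, 2))
```

Joining `stats` back onto each session by cookie is the "reintegrate" step mentioned on the slide.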
  • 30. Machine Learning for HDFS Data
      Tool | Kind | Algorithms for clustering | Simplicity | TRAIN set size
      Apache Mahout | MapReduce | ~ 10 available | Expert | TERABYTES
      Python (Scikit+Pandas+…) | Out for training / In for apply | ~ 20 available (including bi-clustering) | Medium | (10GB) 1 SERVER RAM
      H2O | Separate Cluster | 1 (kMeans) | Medium | (100GB – 1TB) CLUSTER RAM
      Open Source R + Hadoop | Varies | Varies | Varies | Varies
      Open Source R + Pattern (Cascading) | Out for training / In for apply | > 3 | Medium | (1GB) 1 Server RAM in R
      Spark + MLlib | Separate Cluster | 1 | Medium | (100GB – 1TB) CLUSTER RAM
  • 31. How big is our data here? RAW DATA (5 TB) → Step 1 Build Sessions → Step 2 Parse IP/Time/.. → Step 3 Parse Sequences → Step 4 Build user-level stats → READY FOR ML (10 GB). Uncompressed data size, for 1 year's worth of logs on a website with 10 million unique visitors per month.
  • 32. Clustering With Scikit on HDFS 1. Use Pydoop to get the data onto the training server 2. Use pandas to read the data and transform it to numerical features 3. KMeans().fit() 4. IPython to draw some graphs 5. Enjoy or
  • 35. Clustering & Cluster Sampling Take a balanced number of samples in each cluster, close to the centroid
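Picking the same number of samples from each cluster, closest to its centroid, can be sketched as follows. The inputs are assumed to come from a clustering step that has already run: `points` are the feature vectors, `assignments` the cluster index of each point, and `centroids` the cluster centers. The function name `sample_near_centroids` is illustrative.

```python
import math
from collections import defaultdict

def sample_near_centroids(points, assignments, centroids, per_cluster):
    """For each cluster, return the indices of the points nearest to its centroid."""
    by_cluster = defaultdict(list)
    for idx, (point, cluster) in enumerate(zip(points, assignments)):
        by_cluster[cluster].append((math.dist(point, centroids[cluster]), idx))
    return {
        cluster: [idx for _, idx in sorted(members)[:per_cluster]]
        for cluster, members in by_cluster.items()
    }

points = [(0, 0), (0, 1), (0, 3), (10, 10), (10, 12)]
assignments = [0, 0, 0, 1, 1]
centroids = [(0, 1), (10, 10)]
to_label = sample_near_centroids(points, assignments, centroids, per_cluster=2)
```

The returned indices point at the sessions to show to the people doing the labelling; taking the same count per cluster keeps small segments from being drowned out by large ones.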
  • 36. Labelling: Visualizing Sessions (session timeline: 0’00, 0’12, 1’04, 1’45, 3’02). Label: Search for a specific Topic. I can guess what this guy was doing !!!
  • 37. Labelling: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?)
  • 38. What if? Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?)
  • 39. Supervised Learning: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?). Independently from the clusters, use the labeled examples to classify each session into the predefined segments
  • 40. Supervised Learning: e.g. in Python • Load the data and the labels in Python (pandas) • Fit a model on the labeled sessions • Save the model in HDFS (Python pickle) • Run the model against all the data (Hadoop Streaming). We’ve got a tool to help you do that in Data Science Studio. It’s called the Doctor and it’s fun to use!
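The four steps above can be sketched end to end with the standard library only. The slide does not name a model, so a nearest-centroid classifier is swapped in here purely for illustration; the feature values, the segment names, the tab-separated line format and the function names (`fit_centroids`, `classify`, `run_mapper`) are all assumptions. In a real job the pickled bytes would be written to HDFS and the mapper would read from standard input.

```python
import io
import math
import pickle

def fit_centroids(rows, labels):
    """Train: one mean feature vector (centroid) per segment label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}

def classify(model, row):
    """Predict: the segment whose centroid is closest to the session."""
    return min(model, key=lambda label: math.dist(row, model[label]))

def run_mapper(model, lines, out):
    """Streaming-style mapper: 'session_id<TAB>f1<TAB>f2' in, 'session_id<TAB>segment' out."""
    for line in lines:
        session_id, *features = line.rstrip("\n").split("\t")
        out.write(f"{session_id}\t{classify(model, tuple(map(float, features)))}\n")

# 1-2. Load labeled sessions (toy features: page views, comments) and fit.
labeled_rows = [(1.0, 0.0), (2.0, 0.0), (30.0, 12.0), (28.0, 10.0)]
labels = ["newcomer", "newcomer", "fan", "fan"]
model = fit_centroids(labeled_rows, labels)

# 3. Save and reload the model with pickle (bytes would go to HDFS).
restored = pickle.loads(pickle.dumps(model))

# 4. Score every session, one input line at a time.
out = io.StringIO()
run_mapper(restored, ["s1\t1.5\t0.0\n", "s2\t25.0\t9.0\n"], out)
scored = out.getvalue()
```

The model is kept as a plain dictionary so that the pickle can be loaded on any worker without shipping extra class definitions.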
  • 41. Compute Metrics Per Segment: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?). 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs
  • 42. User Satisfaction Metrics • Future-Based Metrics – Will the user most likely subscribe/pay in the future? • Expressed Opinion – Does the user seem satisfied, judging from their behaviour?
  • 43. Opinion-Based Training For User Satisfaction: Use user feedback as “labels” to build a model of satisfaction, then “predict” a satisfaction score for sessions without feedback. Session Data (100 million sessions) + Feedback (10,000 feedbacks) → Scored Sessions. HYPOTHESIS: IF TWO USERS HAVE SIMILAR NAVIGATION PATTERNS THEY HAVE SIMILAR USER SATISFACTION LEVELS
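The hypothesis on this slide (similar navigation, similar satisfaction) maps naturally onto a nearest-neighbour estimate, sketched below. The deck does not say which model was used, so k-nearest-neighbour averaging is an assumption chosen because it states the hypothesis most directly; the feature vectors, scores, `k=3` and the function name `predict_satisfaction` are illustrative.

```python
import math

def predict_satisfaction(session, feedback, k=3):
    """Average the feedback scores of the k sessions most similar to `session`."""
    nearest = sorted(feedback, key=lambda item: math.dist(session, item[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# (session features, satisfaction score given by the user); toy values.
feedback = [
    ((1, 0), 0.2), ((2, 1), 0.3), ((1, 1), 0.1),
    ((20, 8), 0.9), ((22, 9), 0.8), ((21, 7), 1.0),
]

low = predict_satisfaction((1.5, 0.5), feedback)
high = predict_satisfaction((21, 8), feedback)
```

With millions of sessions and only thousands of feedbacks, the few labeled sessions act as anchors and every other session inherits a score from its closest anchors.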
  • 44. Compute Metrics Per Segment: Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?). 0.3€ per session 0.23€ acquisition costs 13k sessions 1.3€ per session 0.23€ acquisition costs 938k sessions 938k sessions 0.3€ per session 0.23€ acquisition costs 738k sessions 0.83€ per session 0.73€ acquisition costs 68k sessions 0.3€ per session 1.23€ acquisition costs 1k sessions 0€ per session 0€ acquisition costs. SATISFACTION SCORE 0.87; SATISFACTION SCORE 0.37; SATISFACTION SCORE 0.28; SATISFACTION SCORE 0.12; SATISFACTION SCORE 0.28; SATISFACTION SCORE 0.12
  • 45. Data in Time: Smoothing. In red: the base metric. In blue: the smoothed metric. RAW DATA MAY VARY A LOT FROM DAY TO DAY; IT WILL SCARE PEOPLE
  • 46. Exponential Smoothing In Hive: SELECT segment, moving_avg(day, satisfaction, 15, 1.52, 15, DATEDIFF('2014-01-15', '2014-01-01')) FROM stats GROUP BY segment. These factors determine whether you smooth a lot or not, and over how many days
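The exact behaviour of the `moving_avg` UDF and the meaning of its numeric arguments are not documented in the deck. As a point of comparison, textbook simple exponential smoothing looks like this in Python; the function name and the `alpha` parameter are the standard formulation, not a port of the UDF.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

daily_satisfaction = [1.0, 0.0, 1.0]
smooth = exponential_smoothing(daily_satisfaction, alpha=0.5)
```

A small `alpha` smooths heavily and reacts slowly to changes; an `alpha` close to 1 follows the raw daily values almost exactly.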
  • 47. Final: Follow Smoothed Satisfaction. Search for a specific Topic; Newcomer from Google News; Foreigner Discovering The Site; Fan that loves to comment; Home Page Wanderer; Dark Bot (Competitor?). Follow the satisfaction metric per segment. Damn, our latest release has diverging effects on segments
  • 48. Thank You ! Florian Douetteau @fdouetteau Questions now or later: florian.douetteau@dataiku.com dataiku.com