Recommender systems are paramount for e-business companies. There is an increasing need to take all available user information into account to tailor the best product offer. One such source of information is the content that the user actually sees: the image of the product.
When it comes to hostels, some people can be more attracted by pictures of the room, the building or even the nearby beach.
In this talk, we will describe how we improved the recommender system of an e-business vacation retailer using the content of images. We’ll explain how to leverage open datasets and pre-trained deep learning models to derive user taste information. This transfer learning approach enables companies to use state-of-the-art machine learning methods without having deep learning expertise.
From Labelling Open data images to building a private recommender system
1. From Labelling Open Data Images to Building a Private Recommender System
A transfer learning application
2. Outline
• Introduction
• Iterative building of a recommender system
• Labelling images
AKA: Pragmatic Deep learning for “Dummies”
• Post processing
AKA: Using Images information for BI on steroids
• Results & Conclusion
3. Dataiku
• Founded in 2013
• 60 + employees
• Paris, New-York, London, San Francisco
Data science software publisher, maker of Dataiku DSS
DESIGN
• PREPARE: Load and prepare your data
• MODEL: Build your models
• ANALYSE: Visualize and share your work
PRODUCTION
• AUTOMATE: Re-execute your workflow at ease
• MONITOR: Follow your production environment
• SCORE: Get predictions in real time
4. Key Figures
• E-business vacation retailer
• Founded in 2006. 500M revenue in 2015.
• 18 million customers.
• Hundreds of sales every day
-> recommendation engine
• Sale image is paramount
5. VPG specificities
• Sales are very temporary
-> Unlike Amazon / PriceMinister / Cdiscount
-> Some classical recommender systems fail
-> Sales are event-linked (Christmas, ski, summer)
• Expensive products
-> Few recurrent buyers
-> Appearance counts a lot
• Few recurrent buyers
-> Classical approaches fail.
-> Less signal. Visit information is paramount.
-> Users are less inclined to browse a lot (first 4-10 sales)
6. A data science workflow
Six steps to a predictive model
[Diagram: Business Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment. The cycle repeats over Iteration 1, Iteration 2, ..., Iteration n, turning Dataset 1, Dataset 2, ..., Dataset n into scored datasets.]
Creating a predictive model is a highly iterative process.
Data Science Studio enables its users to create and manage these projects end-to-end.
This process is not industry-specific and can be applied to many use cases.
Adapted from the CRISP-DM methodology
11. One Meta Model to Rule Them All
• Recommenders as features
• Machine learning to optimize purchasing probability
Describe -> Recommend -> Combine
12. One Meta Model to Rule Them All
• Negative sampling
• Take all purchase tuples : (user, product, timestamp) -> 1
• Select 5 sales open at the same date that the user did not buy -> 0
• The model directly optimizes purchasing probability
• Machine learning model
• Features : recommender systems.
• Logistic Regression
Regularizing effect : we don’t want to overfit leaks.
• Reranking approach.
Similar to Google or Yandex (Kaggle challenge)
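The negative sampling step described on this slide can be sketched as follows. This is a minimal illustration, not the production code: the toy `sales` and `purchases` data, the function name `build_training_set` and the fixed random seed are all assumptions made for the example. Each purchase becomes a positive row, and up to 5 sales that were open on the same date and never bought by that user become negative rows.

```python
import random
from datetime import date

# Hypothetical toy data: sale id -> (opening date, closing date).
sales = {
    "s1": (date(2016, 1, 1), date(2016, 1, 31)),
    "s2": (date(2016, 1, 1), date(2016, 1, 31)),
    "s3": (date(2016, 1, 1), date(2016, 1, 31)),
    "s4": (date(2016, 1, 1), date(2016, 1, 31)),
    "s5": (date(2016, 1, 1), date(2016, 1, 31)),
    "s6": (date(2016, 1, 1), date(2016, 1, 31)),
    "s7": (date(2016, 1, 1), date(2016, 1, 31)),
    "s8": (date(2016, 2, 1), date(2016, 2, 28)),  # not open in January
}

# Hypothetical purchase tuples: (user, sale, purchase date).
purchases = [
    ("u1", "s1", date(2016, 1, 5)),
    ("u2", "s2", date(2016, 1, 6)),
]


def build_training_set(purchases, sales, n_negatives=5, seed=0):
    """Return rows (user, sale, day, label) with label 1 = bought, 0 = sampled negative."""
    rng = random.Random(seed)

    # Everything each user ever bought, so that negatives are never purchased sales.
    bought = {}
    for user, sale, _ in purchases:
        bought.setdefault(user, set()).add(sale)

    rows = []
    for user, sale, day in purchases:
        rows.append((user, sale, day, 1))
        # Candidate negatives: sales open on the purchase date, not bought by the user.
        candidates = sorted(
            s for s, (start, end) in sales.items()
            if start <= day <= end and s not in bought[user]
        )
        for negative in rng.sample(candidates, min(n_negatives, len(candidates))):
            rows.append((user, negative, day, 0))
    return rows


rows = build_training_set(purchases, sales)
```

In the approach described above, each row would then be completed with the scores of the individual recommender systems as features before fitting the logistic regression; that part is not shown here.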
13. One Meta Model to Rule Them All
• Going further ?
• Predict the visit ?
- Would let us take more information into account
- Many people browse randomly
• Learning to rank on target:
2 bought, 1 visited, 0 elsewhere
• Impact of this on top 10 sales ?
• Limitations :
• Highly dependent on the ranking displayed
- which we don’t have
- may overfit old man-made rules.
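The graded target mentioned on this slide (2 bought, 1 visited, 0 elsewhere) could be built along these lines. It is only a sketch: the function name `relevance` and the toy sets of (user, sale) pairs are assumptions for illustration, and the learning-to-rank model that would consume this target is not shown.

```python
def relevance(user, sale, purchased_pairs, visited_pairs):
    """Graded target for learning to rank: 2 = bought, 1 = visited, 0 = elsewhere."""
    if (user, sale) in purchased_pairs:
        return 2
    if (user, sale) in visited_pairs:
        return 1
    return 0


# Hypothetical interaction logs as sets of (user, sale) pairs.
purchased_pairs = {("u1", "s1")}
visited_pairs = {("u1", "s1"), ("u1", "s2")}

targets = {
    sale: relevance("u1", sale, purchased_pairs, visited_pairs)
    for sale in ["s1", "s2", "s3"]
}
```

A purchase takes precedence over a visit, since a bought sale has normally been visited as well.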
14. Recommender system for Home Page Ordering
Inputs: Customer visits, Purchases, Sales Images, Sales information
• Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data
• Recommendation Engines: generation of recommendations based on user behaviour
• Meta Model: the meta model combines recommendations to directly optimize purchasing probability
• Optimization of home display: every customer is shown the 10 sales they are most likely to buy
Batch scoring every night
+7% revenue (A/B testing)
15. Why use Image ?
We want to distinguish « Sun and Beach » from « Ski »
A picture is worth a thousand words
16. Integrating Image Information
Sales Images -> Labelling Model -> image labels, for example:
• Pool + Palm Trees + Hotel + Mountains
• Pool + Forest + Hotel + Sea
• Sea + Beach + Forest + Hotel
Image labels and sales descriptions feed the CONTENT BASED Recommender System
17. Image Labelling For Recommendation Engine
Pragmatic Deep learning for “Dummies”
18. Using Deep Learning models
Common Issues
• “I don’t have a GPU server”
• “I don’t have a deep learning expert”
• “I don’t have labelled data” (or too few)
• “I don’t have the time to wait for model training”
• “I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
19. Pragmatic Deep Learning Cheat Sheet
[Flowchart with Y/N branches]
• Questions: Do you have labels ? Many ? Are you sure ? Is there a similar database ? Is there a pre-trained model ?
• Outcomes: Train DL model, Transfer Learning, Create your own, Use it !
20. “I don’t have labelled data (or too few)”
-> Is there similar data ?
Solution 1 : Pre trained models
• PLACES DATABASE: 205 categories, 2.5 M images
• SUN DATABASE: 307 categories, 110 K images
• VPG
21. Solution 1 : Pre trained models
If there is open data, there is an open pre trained model !
• Kudos to the community
• Check the licensing
Example with Places (Caffe Model Zoo):
• tower: 0.53, skyscraper: 0.26
• swimming_pool/outdoor: 0.65, inn/outdoor: 0.06
22. Solution 2 : Transfer Learning
“I want to add information of SUN database”
“But I have only 100 K images”
If you know how to recognize… after a little bit of training… you will be able to recognize
Transfer Learning
Use a network that knows how to see
• As a feature generator / transformer
• To be updated for the new problem
23. Solution 2 : Transfer Learning
Not limited to images !
Word2Vec: Use large text corpora
• For grammar learning
• For synonym learning
If you know sentiment for:
• “This wine tastes great” -> 1
• “The most disgusting cheese ever” -> 0
And you know synonyms and grammar (word2vec), it’s easy to classify:
• “This cheese tasted awful”
• “The best wine in town”
Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on Knowledge and Data Engineering 22.10 (2010): 1345-1359.
24. Solution 2 : Transfer Learning
Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson
http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
25. Solution 2 : Transfer Learning
• Few labeled data, similar data: use network as transformer, create simple model
• Few labeled data, not so similar data: troubles. Simple model on shallow layers ? Or get other data
• Lots of labeled data, similar data: fine tune several layers depending on size of data
• Lots of labeled data, not so similar data: retrain new network with existing architecture
SUN vs Places dataset :)
VPG:
• No labeled data
• Similar data ?
Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson
http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
26. Solution 2 : Transfer Learning
Leverage existing knowledge !
• PLACES DATABASE -> Training (optional) -> Pre-trained model: VGG16 (Caffe Model Zoo), e.g. tower: 0.53, skyscraper: 0.26
• Transferred data: last convolutional layer features
• SUN DATABASE -> Re-training -> Re-trained model: 2 fully connected layers (TensorFlow)
• VOYAGE PRIVE: images to label
• Hardware labels on the diagram: GPU, CPU, GPU
Accuracy: 72%, Top-5 Acc: 90% > state of the art on dataset alone
27. Solution 3 : Generating your own large (or not) dataset
• Create Label Set
• Easy : Man VS Woman ?
• Harder : all relevant information in my images
• Manually select all words in a corpus (ex Wordnet)
• Use Search Engines
• Augment search terms
• Get URLs and images from search term
• Deduplicate
• Validate with Mechanical Turk
• Exclude incorrect images
• Evaluate human performance
29. Solution 4 : What about APIs ?
• Price
• Their cost: often rather cheap. Ex: 100 K requests for less than $300
• VS the cost of redeveloping (probably not as well)
• Full Database scoring
• APIs often limit the number of queries per month.
• Make sure you can avoid the cold start problem
• Stability
• Use model versioning
• Avoid covariate shift, distribution drift
30. What about APIs ? Use for generating labels !
• How to :
• Score part of the database for training
• Train a model
• Score your entire database
• But I have only 5000 requests ?
-> Use Transfer Learning !
• Stealing models
Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint arXiv:1609.02943 (2016).
(Or don’t, it’s illegal)
31. What about APIs ? Use for generating labels !
Experiment:
• 5000 requests on API
-> 4500 for training
-> 500 for validation
• Transfer learning with MIT Places Pre-trained Model
• Scikit learn Multilabel model
• One Vs the Rest
• Untuned Logistic regression
(Or don’t, it’s illegal)
(demo, not used in any real project)
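The experiment setup above can be sketched with scikit-learn as follows. This is a hedged illustration only: the feature matrix and the labels are synthetic stand-ins (random features instead of the Places network output, synthetic labels instead of API responses), and the sizes `n_features` and `n_labels` are arbitrary assumptions. Only the 4500/500 split and the one-vs-rest logistic regression come from the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_train, n_valid = 4500, 500          # split given on the slide
n_features, n_labels = 64, 6          # arbitrary sizes for the sketch

# Stand-in for features produced by the pre-trained network (random here).
X = rng.normal(size=(n_train + n_valid, n_features))

# Stand-in for the labels returned by the API: synthetic, linearly related to X.
true_weights = rng.normal(size=(n_features, n_labels))
Y = (X @ true_weights > 0).astype(int)   # multilabel indicator matrix

X_train, Y_train = X[:n_train], Y[:n_train]
X_valid, Y_valid = X[n_train:], Y[n_train:]

# One binary logistic regression per label, default regularization;
# max_iter is raised only to avoid convergence warnings.
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_train, Y_train)

Y_pred = model.predict(X_valid)                    # shape (n_valid, n_labels)
label_accuracy = (Y_pred == Y_valid).mean()        # accuracy averaged over labels
```

With real data, `X` would hold the activations extracted from the pre-trained network for each image and `Y` the labels obtained from the 5000 API requests.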
32. What about APIs ? Results
Accuracy: 95, Recall: 80, Precision: 75

Image 1:
Label      Probability   Label       Probability
landscape  1.0000        sunset      0.9998
sky        1.0000        no person   0.9996
outdoors   1.0000        water       0.9990
nature     1.0000        park        0.9849
rock       1.0000        river       0.9678
travel     1.0000        scenic      0.8031

Image 2:
Label      Probability   Label       Probability
beach      1.0000        ocean       1.0000
summer     1.0000        relaxation  1.0000
sand       1.0000        island      1.0000
tropical   1.0000        idyllic     1.0000
travel     1.0000        seashore    0.9998
seascape   1.0000        water       0.9997
(demo, not used in any real project)
33. Post Treatment
(Or how we transfer the labelling information)
Using Images information for BI on steroids
34. Labels post-processing
Voyage Privé images -> Deep/Transfer Learning models -> 5-10 tags per image
• 2s/image with CPU
• x20 speed up with GPU
Classification problem
• Only have probabilities of each class
• Selecting based on probability threshold fails
• Keeping all information is not sparse
-> we keep 5 labels and probabilities per image
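Keeping the 5 most probable labels per image, rather than thresholding, can be sketched as below. The function name `top_k_labels` and the example scores are assumptions for illustration; only the first two scores echo the Places example shown earlier in the deck.

```python
def top_k_labels(probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical class probabilities for one image.
image_scores = {
    "tower": 0.53,
    "skyscraper": 0.26,
    "office_building": 0.08,
    "bridge": 0.05,
    "river": 0.04,
    "viaduct": 0.02,
    "harbor": 0.02,
}

kept = top_k_labels(image_scores)
```

Unlike a fixed probability threshold, this always yields the same number of labels per image, whether the model is confident or not.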
36. Topic extraction with Non-Negative Matrix Factorization
• Non Negative Matrix factorization (NMF)
X = WH
• X : image x tags, non negative
• W : image x theme
• H : theme x tag
(scikit learn implementation)
• Most represented Themes
• Swimming-pool_Apartment_Putting-green
• Ocean_Coast_SandBar
• Coast_SeaCliff_RockArch
• Beach_Coast_BoardWalk
• Bridge_Viaduct_River
• Palace_BuildingFacade_Mansion
• Castle_Mansion_Monastery
• HotelRoom_Bedroom_DormRoom
• Dimension Reduction
• 200x200 pixels -> 600 tags -> 30 themes
• Faster content based filtering
• Image is often a sparse combination of themes
Faster content based filtering
• Each theme has the same explanatory power
Balanced vector for content based
• Explainability
Each theme corresponds to a few labels
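The factorization X = WH can be tried on a toy image x tag matrix with scikit-learn's `NMF`, as sketched below. The tag names, the matrix values, the number of themes (3 instead of 30) and the `init`, `random_state` and `max_iter` settings are all assumptions chosen for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical tags and image x tag probabilities (non-negative).
tags = ["beach", "coast", "ocean", "hotel_room", "bedroom", "swimming_pool"]
X = np.array([
    [0.6, 0.3, 0.1, 0.0, 0.0, 0.0],
    [0.5, 0.2, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.4, 0.1],
    [0.1, 0.0, 0.0, 0.0, 0.1, 0.8],
])

# Factorize X (image x tag) into W (image x theme) and H (theme x tag).
model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(X)
H = model.components_

# Name each theme after its three strongest tags, as on the slide.
theme_names = [
    "_".join(tags[j] for j in np.argsort(row)[::-1][:3])
    for row in H
]
```

Each row of `W` is then the compact theme vector of one image, which is what a content-based recommender would compare.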
37. Image content detection
Topic scores determine the importance of topics in an image
Image 1:
TOPIC                                          TOPIC SCORE (%)
Golf course – Fairway – Putting green          31
Hotel – Inn – Apartment building outdoor       30
Swimming pool – Lido Deck – Hot tub outdoor    22
Beach – Coast – Harbor                         17

Image 2:
TOPIC                                          TOPIC SCORE (%)
Tower – Skyscraper – Office building           62
Bridge – River – Viaduct                       38
38. Note on model performance
• Image labels are used for similarity
Calling a grass field “putting green”:
• Is not important if all grass fields are called this way.
• Would be if we had lots of golf trip sales.
• Improving the NN performance ?
• Labels are used in NMF and reduced to themes
• Themes are used to calculate similarities for CB recommenders
• CB recommenders are used as a feature in the meta model
• The meta model gives probabilities of purchase = order
• Users only check 10 sales…
-> how much does online performance change for 1% of accuracy ?
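The step where themes are used to compute similarities for the content-based recommender might look like the sketch below. Cosine similarity is an assumption here (the slides do not name the measure), and the sale names, theme vectors and helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two theme vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(sale_id, sale_themes):
    """Rank all other sales by similarity of their theme vectors to the given sale."""
    reference = sale_themes[sale_id]
    scores = [
        (other, cosine_similarity(reference, themes))
        for other, themes in sale_themes.items()
        if other != sale_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical theme vectors (beach, pool, mountain) for three sales.
sale_themes = {
    "beach_resort": [0.8, 0.2, 0.0],
    "ski_chalet": [0.0, 0.1, 0.9],
    "island_hotel": [0.7, 0.3, 0.0],
}

ranking = most_similar("beach_resort", sale_themes)
```

Because only relative similarity matters, a systematic labelling error (every grass field called “putting green”) leaves this ranking largely unchanged, which is the point made on the slide.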
40. Results ?
All Visits :
• Mostly France
• Pool displayed
First Recommendation
• Fails to display pools
Only Images ?
• Pool all around the world
Third column = Right Mix
41. Results ?
All Visits :
• Spain
• Sun & Beach
• Pool displayed
First Recommendation
• Displays nature…
Only Images ?
• Pool all around the world
Third = Right Mix
• Get the bungalow feature !
42. Conclusion
• Do iterative data science !
• Start simple and grow
• Validate each step
• Image labelling = BI on steroids
• Deep Learning ?
• Is there existing data ?
• Is there a pre-trained model ?
• Transfer Learning
• Cheaper, faster
• Any Data Scientist can do it
• What’s Next ?