From Labelling Open Data Images to
Building a Private Recommender System
A transfer learning application
Outline
•  Introduction
•  Iterative building of a recommender system
•  Labelling images
AKA: Pragmatic Deep learning for “Dummies”
•  Post processing
AKA: Using image information for BI on steroids
•  Results & Conclusion
Dataiku
•  Founded in 2013
•  60+ employees
•  Paris, New York, London, San Francisco
Data science software vendor, publisher of Dataiku DSS
DESIGN
•  PREPARE: load and prepare your data
•  MODEL: build your models
•  ANALYSE: visualize and share your work
PRODUCTION
•  AUTOMATE: re-execute your workflow at ease
•  MONITOR: follow your production environment
•  SCORE: get predictions in real time
Key Figures
•  E-business vacation retailer
•  Founded in 2006. 500M revenue in 2015.
•  18 million customers.
•  Hundreds of sales every day
-> recommendation engine
•  The sale image is paramount
VPG specificities
•  Sales are very temporary
-> Unlike Amazon / PriceMinister / Cdiscount
-> Some classical recommender systems fail
-> Sales are event-linked (Christmas, ski, summer)
•  Expensive products
-> Few recurrent buyers
-> Appearance counts a lot
•  Few recurrent buyers
-> Classical approaches fail.
-> Less signal. Visit information is paramount.
-> Users are less inclined to browse a lot (first 4-10 sales)
A data science workflow
Six steps to a predictive model
•  Steps in the diagram: Business Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment
•  Iteration 1, Iteration 2, …, Iteration n work on Dataset 1, Dataset 2, …, Dataset n and produce scored datasets
Creating a predictive model is a highly iterative process.
Data Science Studio enables its users to create and manage these projects end to end.
This process is not industry-specific and can be applied to many use cases.
Adapted from the CRISP-DM methodology
Iterative Building of a Recommender System
Basic Recommendation Engines
Other Factors
One Meta Model to Rule Them All
Recommenders as features
Machine learning to optimize purchasing probability
Combine
Recommend
Describe
One Meta Model to Rule Them All
•  Negative sampling
•  Take all purchase tuples : (user, product, timestamp) -> 1
•  Select 5 sales open on the same date that the user did not buy -> 0
•  The model directly optimizes purchasing probability
•  Machine learning model
•  Features : recommender systems.
•  Logistic Regression
Regularizing effect : we don’t want to overfit leaks.
•  Reranking approach.
Similar to Google or Yandex (Kaggle challenge)
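The negative sampling step above can be sketched as follows. This is a minimal illustration, not the production code: the function name `build_training_rows`, the toy sale names, and the use of calendar dates instead of full timestamps are all assumptions. It also assumes "did not buy" means the user never bought that sale.

```python
import random
from datetime import date

def build_training_rows(purchases, open_sales_by_date, n_negatives=5, seed=0):
    """Return (user, sale, day, label) rows: 1 for purchases, 0 for sampled negatives."""
    rng = random.Random(seed)
    rows = []
    for user, sale, day in purchases:
        rows.append((user, sale, day, 1))
        # Negatives: sales open on the same day that this user never bought.
        bought = {s for u, s, _ in purchases if u == user}
        candidates = sorted(open_sales_by_date[day] - bought)
        for negative in rng.sample(candidates, min(n_negatives, len(candidates))):
            rows.append((user, negative, day, 0))
    return rows

# Hypothetical toy data.
purchases = [("alice", "ski_alps", date(2016, 12, 1)),
             ("bob", "beach_bali", date(2016, 12, 1))]
open_sales_by_date = {
    date(2016, 12, 1): {"ski_alps", "beach_bali", "spa_prague", "city_berlin",
                        "safari_kenya", "temple_cambodia", "golf_algarve"},
}
rows = build_training_rows(purchases, open_sales_by_date)
```

Each row would then be joined with the recommender scores, which serve as the features of the logistic regression.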
One Meta Model to Rule Them All
•  Going further ?
•  Predict the visit ?
-  Would allow taking more information into account
-  Many people browse randomly
•  Learning to rank on target:
2 bought, 1 visited, 0 elsewhere
•  Impact of this on top 10 sales ?
•  Limitations :
•  Highly dependent on the ranking displayed
- which we don’t have
- may overfit old hand-made rules.
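The graded target mentioned above (2 bought, 1 visited, 0 elsewhere) could be built like this. It is only a sketch: the `relevance` helper and the toy sets of (user, sale) pairs are hypothetical.

```python
def relevance(user, sale, bought, visited):
    """Graded target: 2 if bought, 1 if only visited, 0 otherwise."""
    if (user, sale) in bought:
        return 2
    if (user, sale) in visited:
        return 1
    return 0

# Hypothetical interaction logs as sets of (user, sale) pairs.
bought = {("alice", "ski_alps")}
visited = {("alice", "ski_alps"), ("alice", "spa_prague")}
displayed = ["ski_alps", "spa_prague", "beach_bali"]
targets = [relevance("alice", sale, bought, visited) for sale in displayed]
```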
Recommender system for Home Page Ordering
Inputs: customer visits, purchases, sales images, sales information
•  Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data
•  Recommendation engines: generation of recommendations based on user behaviour
•  Meta model: combines recommendations to directly optimize purchasing probability
•  Optimization of home display: every customer is shown the 10 sales they are most likely to buy
Batch scoring every night
+7% revenue (A/B testing)
Why use images ?
We want to distinguish
« Sun and Beach » from « Ski »
A picture is worth a thousand words
Sales Images
Integrating Image Information
Labelling Model
Pool + Palm Trees
Hotel + Mountains
Pool + Forest + Hotel + Sea
Sea + Beach + Forest + Hotel
Sales descriptions
Content-based recommender system
Image Labelling For Recommendation Engine
Pragmatic Deep Learning for “Dummies”
Using Deep Learning models
Common Issues
“I don’t have a GPU server” “I don’t have a deep learning expert”
“I don’t have labelled data” (or too few) “I don’t have the time to wait for model training”
“I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
Pragmatic Deep Learning Cheat Sheet
Flowchart (Y/N branches):
•  Questions: Do you have labels ? Many ? Are you sure ? Is there a similar database ? Is there a pre-trained model ?
•  Outcomes: Train DL model, Transfer Learning, Create your own, Use it !
“I don’t have (or few) labelled data”
-> Is there similar data ?
Solution 1 : Pre trained models
•  Places database: 205 categories, 2.5 M images
•  SUN database: 397 categories, 110 K images
•  VPG images, example predictions:
tower: 0.53
skyscraper: 0.26
swimming_pool/outdoor: 0.65
inn/outdoor: 0.06
Solution 1 : Pre trained models
If there is open data, there is an open pre trained model !
•  Kudos to the community
•  Check the licensing
Example with Places (Caffe Model Zoo) :
Solution 2 : Transfer Learning
“I want to add information of SUN database”
“But I have only 100 K images”
If you know how to recognize… after a little bit of training… you will be able to recognize
Transfer Learning
Use a network that knows how to see
•  As a feature generator / transformer
•  To be updated for the new problem
Solution 2 : Transfer Learning
Not limited to images !
Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and
data engineering 22.10 (2010): 1345-1359.
If you know the sentiment for:
•  “This wine tastes great” -> 1
•  “The most disgusting cheese ever” -> 0
And you know synonyms and grammar (word2vec)
•  Word2Vec: use large text corpora for grammar learning and synonym learning
It’s easy to classify:
•  “This cheese tasted awful”
•  “The best wine in town”
Solution 2 : Transfer Learning
Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson, http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
Solution 2 : Transfer Learning
Few labeled data:
•  Similar data: use the network as a transformer, create a simple model
•  Not so similar data: troubles. Simple model on shallow layers ? Or get other data
Lots of labeled data:
•  Similar data: fine-tune several layers, depending on the size of the data
•  Not so similar data: retrain a new network with an existing architecture
Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson, http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
SUN vs Places dataset :)
VPG :
•  No labeled data
•  Similar data
?
•  Places database -> training (optional) -> pre-trained model: VGG16 (Caffe Model Zoo)
•  Transferred data: last convolutional layer features
•  SUN database -> re-training -> re-trained model: TensorFlow, 2 fully connected layers
•  Voyage Privé images -> labels, e.g. tower: 0.53, skyscraper: 0.26
•  Hardware shown in the diagram: GPU, CPU, GPU
Leverage existing knowledge !
Solution 2 : Transfer Learning
Accuracy: 72%, Top-5 accuracy: 90% > state of the art on dataset alone
Solution 3 : Generating your own large (or not) dataset
•  Create Label Set
•  Easy : Man vs Woman ?
•  Harder : all relevant information in my images
•  Manually select all words in a corpus (e.g. WordNet)
•  Use Search Engines
•  Augment search terms
•  Get URLs and images from search term
•  Deduplicate
•  Validate with Mechanical Turk
•  Exclude incorrect images
•  Evaluate human performance
Solution 4 : What about APIs ?
•  Price
•  Their cost
often rather cheap. Ex: 100 K requests for less than $300
•  VS the cost of redeveloping (probably not as well)
•  Full database scoring
•  APIs often limit the number of queries per month.
•  Make sure you can avoid the cold start problem
•  Stability
•  Use model versioning
•  Avoid covariate shift, distribution drift
What about APIs ? Use for generating labels !
•  How to :
•  Score part of the database for training
•  Train a model
•  Score your entire database
•  But I have only 5000 requests ?
-> Use Transfer Learning !
•  Stealing models
Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint
arXiv:1609.02943 (2016).
(Or don’t, it’s illegal)
What about APIs ? Use for generating labels !
Experiment:
•  5000 requests on API
-> 4500 for training
-> 500 for validation
•  Transfer learning with the MIT Places pre-trained model
•  scikit-learn multilabel model
•  One-vs-the-rest
•  Untuned logistic regression
(demo, not used in any real project)
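To make the one-vs-the-rest idea concrete, here is a small pure-Python stand-in. It is not the scikit-learn model used in the experiment: the gradient-descent logistic regression, the 2-dimensional toy "transferred features" and the label names are all assumptions chosen to keep the sketch self-contained.

```python
import math

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for large |z|.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_binary(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent for one binary logistic regression."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fit_one_vs_rest(X, Y, labels):
    """One independent binary model per label (Y[i] is the label set of sample i)."""
    return {lab: fit_binary(X, [1.0 if lab in yi else 0.0 for yi in Y])
            for lab in labels}

def predict_proba(models, x):
    return {lab: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for lab, (w, b) in models.items()}

# Toy stand-ins for transferred network features and API-generated labels.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.8]]
Y = [{"beach"}, {"beach"}, {"mountain"}, {"mountain"}, {"beach", "mountain"}]
models = fit_one_vs_rest(X, Y, ["beach", "mountain"])
probs = predict_proba(models, [1.0, 0.05])
```

Because each label gets its own binary model, an image can receive several labels at once, which is what a multilabel tagging task needs.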
What about APIs ? Results
Accuracy: 95%
Recall: 80%
Precision: 75%
Example image 1:
Label        Probability
landscape    1.0000
sky          1.0000
outdoors     1.0000
nature       1.0000
rock         1.0000
travel       1.0000
sunset       0.9998
no person    0.9996
water        0.9990
park         0.9849
river        0.9678
scenic       0.8031
Example image 2:
Label        Probability
beach        1.0000
summer       1.0000
sand         1.0000
tropical     1.0000
travel       1.0000
seascape     1.0000
ocean        1.0000
relaxation   1.0000
island       1.0000
idyllic      1.0000
seashore     0.9998
water        0.9997
(demo, not used in any real project)
Post Treatment
(Or how we transfer the labelling information)
Using image information for BI on steroids
Classification problem
•  We only have a probability for each class
•  Selecting labels with a probability threshold fails
•  Keeping every probability is not sparse
-> we keep 5 labels and their probabilities per image
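Keeping the 5 most probable labels per image can be sketched in a few lines. The helper name `top_k_labels` is hypothetical, and only the `swimming_pool/outdoor` and `inn/outdoor` scores come from the slides; the other values are made up for illustration.

```python
def top_k_labels(probabilities, k=5):
    """Keep the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Class probabilities for one image (mostly hypothetical values).
scores = {"swimming_pool/outdoor": 0.65, "inn/outdoor": 0.06, "hotel/outdoor": 0.12,
          "beach": 0.08, "coast": 0.04, "tower": 0.03, "skyscraper": 0.02}
kept = top_k_labels(scores)
```

Unlike a fixed probability threshold, this always yields the same number of labels per image, whatever the confidence of the network.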
Labels post-processing
Deep/Transfer Learning models
5-10 tags per image
•  2s/image with CPU
•  x20 speed-up with GPU
Voyage Privé images
Labels post-processing
Issue with our approach:
Complementary information vs redundant information
Solution : Matrix Factorization
Topic extraction with Non-Negative Matrix Factorization
•  Non Negative Matrix factorization (NMF)
X = WH
•  X : images x tags, non-negative
•  W : images x themes
•  H : themes x tags
(scikit-learn implementation)
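The factorization X = WH can be illustrated with the classic multiplicative update rules. This pure-Python sketch is a stand-in for the scikit-learn implementation, not a copy of it; the tiny tag matrix, the tag names and the choice of 2 themes are assumptions.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5

def nmf(X, n_themes, n_iter=500, seed=0, eps=1e-9):
    """Factor X (images x tags) into W (images x themes) and H (themes x tags)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(n_themes)] for _ in X]
    H = [[rng.random() + 0.1 for _ in X[0]] for _ in range(n_themes)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Hypothetical tag probabilities: rows are images, columns are tags.
tags = ["beach", "coast", "ocean", "hotel_room", "bedroom"]
X = [[0.9, 0.8, 0.7, 0.0, 0.0],
     [0.8, 0.9, 0.6, 0.0, 0.1],
     [0.0, 0.0, 0.1, 0.9, 0.8],
     [0.1, 0.0, 0.0, 0.8, 0.9]]
W, H = nmf(X, n_themes=2)
error = frobenius_error(X, W, H)
# Index of the strongest theme for each image.
dominant = [max(range(2), key=lambda k: row[k]) for row in W]
```

Each row of H lists the tags that make up a theme, and each row of W gives the theme weights of one image.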
•  Most represented Themes
•  Swimming-pool_Apartment_Putting-green
•  Ocean_Coast_SandBar
•  Coast_SeaCliff_RockArch
•  Beach_Coast_BoardWalk
•  Bridge_Viaduct_River
•  Palace_BuildingFacade_Mansion
•  Castle_Mansion_Monastery
•  HotelRoom_Bedroom_DormRoom
•  Dimension reduction
•  200x200 pixels -> 600 tags => 30 themes
•  Faster content-based filtering
•  An image is often a sparse combination of themes
Faster content-based filtering
•  Each theme has the same explanatory power
Balanced vector for content-based filtering
•  Explainability
Each theme corresponds to a few labels
Image content detection
Topic scores determine the importance of topics in an image
Example image 1:
TOPIC: TOPIC SCORE (%)
Golf course – Fairway – Putting green: 31
Hotel – Inn – Apartment building outdoor: 30
Swimming pool – Lido deck – Hot tub outdoor: 22
Beach – Coast – Harbor: 17
Example image 2:
TOPIC: TOPIC SCORE (%)
Tower – Skyscraper – Office building: 62
Bridge – River – Viaduct: 38
Note on model performance
•  Image labels are used for similarity
Calling a grass field a “putting green”:
•  Is not important if all grass fields are labelled this way.
•  Would be if we had lots of golf-trip sales.
•  Improving the NN performance ?
•  Labels are used in NMF and reduced to themes
•  Themes are used to calculate similarities for CB recommenders
•  CB recommenders are used as a feature in the meta model
•  The meta model gives purchase probabilities = display order
•  Users only check 10 sales…
-> how much does online performance change for a 1% gain in accuracy ?
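The similarity step in this chain can be sketched with cosine similarity over theme vectors. Cosine is an assumption here (the slides do not name the similarity measure), and the sale names and theme scores are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(target, theme_vectors, top_n=2):
    """Rank the other sales by theme similarity to the target sale."""
    scores = [(cosine(theme_vectors[target], vec), sale)
              for sale, vec in theme_vectors.items() if sale != target]
    return [sale for _, sale in sorted(scores, reverse=True)[:top_n]]

theme_vectors = {  # hypothetical theme scores: [pool, beach, tower, bridge]
    "resort_a": [0.6, 0.4, 0.0, 0.0],
    "resort_b": [0.5, 0.5, 0.0, 0.0],
    "city_c":   [0.0, 0.0, 0.62, 0.38],
    "coast_d":  [0.1, 0.9, 0.0, 0.0],
}
neighbours = most_similar("resort_a", theme_vectors)
```

A consistently mislabelled theme does not change these rankings, which is why raw labelling accuracy matters less than it first seems.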
Results
Results ?
All Visits :
•  Mostly France
•  Pool displayed
First Recommendation
•  Fails to display pools
Only Images ?
•  Pool all around the world
Third column = Right Mix
Results ?
All Visits :
•  Spain
•  Sun & Beach
•  Pool displayed
First Recommendation
•  Displays nature…
Only Images ?
•  Pool all around the world
Third = Right Mix
•  Get the bungalow feature !
•  Do iterative data science !
•  Start simple and grow
•  Validate each step
•  Image labelling = BI on steroids
•  Deep Learning ?
•  Is there existing data ?
•  Is there a pre-trained model ?
•  Transfer Learning
•  Cheaper, faster
•  Any Data Scientist can do it
•  What’s Next ?
Conclusion
Learned along the way
For ski sales, showing indoor pictures performs better
What’s next ?
•  Comparison proposed/visited vacation
$\mathrm{Attractiveness}(\mathrm{tag}) = \dfrac{\text{visited offers containing tag}}{\text{proposed offers containing tag}}$
VPG offers database: Ocean 67%, Bedroom 33%
Visits database: Ocean 33%, Bedroom 67%
•  Voyage Privé offers database = baseline
•  « Bedroom » attractiveness = 0.67/0.33 ≈ 2
•  « Ocean » attractiveness = 0.33/0.67 ≈ 0.5
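The attractiveness ratio can be computed directly from the two databases. In this sketch the function names and the three-offer toy databases are hypothetical; they are sized so that the shares match the 67% / 33% split above.

```python
from collections import Counter

def tag_share(offers):
    """Fraction of offers containing each tag (each offer is a set of tags)."""
    counts = Counter(tag for tags in offers for tag in tags)
    return {tag: n / len(offers) for tag, n in counts.items()}

def attractiveness(visited, proposed):
    """Share of visited offers with a tag divided by the share of proposed offers."""
    visited_share, proposed_share = tag_share(visited), tag_share(proposed)
    return {tag: visited_share.get(tag, 0.0) / share
            for tag, share in proposed_share.items()}

proposed = [{"ocean"}, {"ocean"}, {"bedroom"}]   # baseline: 67% ocean, 33% bedroom
visited = [{"bedroom"}, {"bedroom"}, {"ocean"}]  # visits: 33% ocean, 67% bedroom
scores = attractiveness(visited, proposed)
```

A value above 1 means a tag is visited more often than its share of the catalogue would suggest.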
What’s Next ?
Kenya
Prague
Berlin
Cambodia
What’s Next ? Customize the Image !
Kenya
Prague
Berlin
Cambodia
Thank you for your attention !

 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 

Recently uploaded (20)

RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 

From Labelling Open data images to building a private recommender system

  • 1. From Labelling Open Data Images to Building a Private Recommender System A transfer learning application
  • 2. Outline •  Introduction •  Iterative building of a recommender system •  Labelling images AKA: Pragmatic Deep learning for “Dummies” •  Post processing AKA: Using Images information for BI on steroids •  Results & Conclusion
  • 3. Dataiku •  Founded in 2013 •  60+ employees •  Paris, New York, London, San Francisco Data Science Software Editor of Dataiku DSS DESIGN: PREPARE (Load and prepare your data), MODEL (Build your models), ANALYSE (Visualize and share your work) PRODUCTION: AUTOMATE (Re-execute your workflow at ease), MONITOR (Follow your production environment), SCORE (Get predictions in real time)
  • 4. Key Figures •  E-business vacation retailer •  Founded in 2006. 500M revenue in 2015. •  18 million clients. •  Hundreds of sales every day -> recommendation engine •  Sale image is paramount
  • 5. VPG specificities •  Sales are very temporary -> Unlike Amazon / Price Minister / Cdiscount -> Some classical recommender systems fail -> Sales are event linked (Christmas, ski, summer) •  Expensive products -> Few recurrent buyers -> Appearance counts a lot •  Few recurrent buyers -> Classical approaches fail. -> Less signal. Visit information paramount. -> Less inclined to browse a lot (4-10 first sales)
  • 6. A data science workflow Six steps to a predictive model: Business Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment. Iteration 1, Iteration 2, ... Iteration n: Dataset 1, Dataset 2, ... Dataset n -> Scored dataset. Creating a predictive model is a highly iterative process. Data Science Studio enables its users to create and manage these projects from end to end. This process is not industry specific, and can be applied to many use cases. Adapted from the CRISP-DM methodology
  • 8. Iterative Building of a Recommender System
  • 11. One Meta Model to Rule Them All Recommenders as features. Machine learning to optimize purchasing probability. Combine, Recommend, Describe
  • 12. One Meta Model to Rule Them All •  Negative sampling •  Take all purchase tuples : (user, product, timestamp) -> 1 •  Select 5 sales open at the same date that the user did not buy -> 0 •  The model directly optimizes purchasing probability •  Machine learning model •  Features : recommender systems. •  Logistic Regression. Regularizing effect : we don’t want to overfit leaks. •  Reranking approach. Similar to Google or Yandex (Kaggle challenge)
  • 13. One Meta Model to Rule Them All •  Going further ? •  Predict the visit ? -  Would allow taking more information into account -  Many people browse randomly •  Learning to rank on target: 2 bought, 1 visited, 0 elsewhere •  Impact of this on top 10 sales ? •  Limitations : •  Highly dependent on the ranking displayed, which we don’t have; may overfit old man-made rules.
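The negative sampling step on slide 12 can be sketched as follows. This is a minimal illustration, not the presenters' code: the function name `build_training_rows`, the toy data, and the reading of "did not buy" as "never bought by this user" are all assumptions.

```python
import random
from datetime import date

def build_training_rows(purchases, sales, n_negatives=5, seed=0):
    """Return (user, sale_id, label) rows: 1 for purchases, 0 for sampled negatives."""
    rng = random.Random(seed)
    rows = []
    for user, sale_id, day in purchases:
        rows.append((user, sale_id, 1))
        bought_by_user = {s for u, s, _ in purchases if u == user}
        # Candidate negatives: sales open on the purchase date that this user never bought.
        candidates = sorted(
            sid for sid, (start, end) in sales.items()
            if start <= day <= end and sid not in bought_by_user
        )
        for negative in rng.sample(candidates, min(n_negatives, len(candidates))):
            rows.append((user, negative, 0))
    return rows

# Hypothetical toy data: eight sales open in January 2016, one closed earlier.
sales = {f"sale_{i}": (date(2016, 1, 1), date(2016, 1, 10)) for i in range(8)}
sales["sale_old"] = (date(2015, 6, 1), date(2015, 6, 10))
purchases = [
    ("alice", "sale_0", date(2016, 1, 5)),
    ("bob", "sale_3", date(2016, 1, 2)),
]
rows = build_training_rows(purchases, sales)
```

Each row would then be joined with the recommender scores for that (user, sale) pair, which serve as features for the logistic regression.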
  • 14. Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data (customer visits, purchases, sales information, sales images). Recommendation Engines: generation of recommendations based on user behaviour. Meta Model: the meta model combines recommendations to directly optimize purchasing probability. Optimization of home display: every customer is shown the 10 sales they are most likely to buy. Recommender system for Home Page Ordering: batch scoring every night, +7% revenue (A/B testing)
  • 15. Why use Image ? We want to distinguish « Sun and Beach » from « Ski ». A picture is worth a thousand words
  • 16. Integrating Image Information Sales Images -> Labelling Model -> labels (Pool + Palm Trees; Hotel + Mountains; Pool + Forest + Hotel + Sea; Sea + Beach + Forest + Hotel), combined with Sales descriptions -> CONTENT BASED Recommender System
  • 17. Image Labelling For Recommendation Engine Pragmatic Deep learning for “Dummies”
  • 18. Using Deep Learning models Common Issues “I don’t have a GPU server” “I don’t have a deep learning expert” “I don’t have labelled data” (or too few) “I don’t have the time to wait for model training” “I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
  • 19. Pragmatic Deep Learning Cheat Sheet (decision flowchart with Y/N branches): Do you have labels ? Many ? Are you sure ? Train DL model. Transfer Learning. Is there a similar database ? Is there a pre-trained model ? Create your own. Use it !
  • 20. “I don’t have (or few) labelled data” -> Is there similar data ? Solution 1 : Pre trained models. Places database: 205 categories, 2.5 M images. SUN database: 307 categories, 110 K images. VPG
  • 21. Solution 1 : Pre trained models If there is open data, there is an open pre-trained model ! •  Kudos to the community •  Check the licensing Example with Places (Caffe Model Zoo): tower: 0.53, skyscraper: 0.26; swimming_pool/outdoor: 0.65, inn/outdoor: 0.06
  • 22. Solution 2 : Transfer Learning “I want to add information of SUN database” “But I have only 100 K images” If you know how to recognize… after a little bit of training… you will be able to recognize Transfer Learning Use a network that knows how to see •  As a feature generator / transformer •  To be updated for the new problem
  • 23. Solution 2 : Transfer Learning Not limited to images ! Word2Vec: use large text corpora •  For grammar learning •  For synonym learning. If you know the sentiment for “This wine tastes great” (1) and “The most disgusting cheese ever” (0), and you know synonyms and grammar (word2vec), it’s easy to classify “This cheese tasted awful” and “The best wine in town”. Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.
  • 24. Solution 2 : Transfer Learning Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
  • 25. Solution 2 : Transfer Learning. Lots of labeled data: similar data -> fine tune (several layers, depending on size of data); not so similar data -> retrain new network with existing architecture. Few labeled data: similar data -> use network as transformer, create simple model; not so similar data -> troubles: simple model on shallow layers ? Or get other data. SUN VS Places dataset :) VPG : •  No labeled data •  Similar data ? Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
  • 26. Solution 2 : Transfer Learning Leverage existing knowledge ! Places database -> Training (optional) -> Pre-trained model: VGG16 (Caffe Model Zoo). SUN database -> Transferred data: last convolutional layer features -> Re-training -> Re-trained model: TensorFlow, 2 fully connected layers. Voyage Privé -> tower: 0.53, skyscraper: 0.26. (Hardware labels on the stages: GPU, CPU, GPU.) Accuracy: 72%, Top-5 Acc: 90% > state of the art on dataset alone
  • 27. Solution 3 : Generating your own large (or not) dataset •  Create Label Set •  Easy : Man VS Woman ? •  Harder : all relevant information in my images •  Manually select all words in a corpus (ex Wordnet) •  Use Search Engines •  Augment search terms •  Get URLs and images from search term •  Deduplicate •  Validate with Mechanical Turk •  Exclude incorrect images •  Evaluate human performance
  • 28. Solution 4 : What about APIs ?
  • 29. Solution 4 : What about APIs ? •  Price •  Their cost is often rather low. Ex: 100 K requests for less than $300 •  VS the cost of redeveloping (probably not as well) •  Full database scoring •  APIs often limit the number of queries per month. •  Make sure you can avoid the cold start problem •  Stability •  Use model versioning •  Avoid covariate shift, distribution drift
  • 30. What about APIs ? Use for generating labels ! •  How to : •  Score part of the database for training •  Train a model •  Score your entire database •  But I have only 5000 requests ? -> Use Transfer Learning ! •  Stealing models Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint arXiv:1609.02943 (2016). (Or don’t, it’s illegal)
  • 31. What about APIs ? Use for generating labels ! (Or don’t, it’s illegal) Experiment (demo, not used in any real project): •  5000 requests on API -> 4500 for training -> 500 for validation •  Transfer learning with MIT Places pre-trained model •  scikit-learn multilabel model •  One Vs the Rest •  Untuned logistic regression
  • 32. What about APIs ? Results (demo, not used in any real project): Accuracy 95, Recall 80, Precision 75. Image 1 (label: probability): landscape 1.0000, sky 1.0000, outdoors 1.0000, nature 1.0000, rock 1.0000, travel 1.0000, sunset 0.9998, no person 0.9996, water 0.9990, park 0.9849, river 0.9678, scenic 0.8031. Image 2 (label: probability): beach 1.0000, summer 1.0000, sand 1.0000, tropical 1.0000, travel 1.0000, seascape 1.0000, ocean 1.0000, relaxation 1.0000, island 1.0000, idyllic 1.0000, seashore 0.9998, water 0.9997
  • 33. Post Treatment (Or how we transfer the labelling information) Using Images information for BI on steroids
  • 34. Labels post-processing Voyage Privé images -> Deep/Transfer Learning models -> 5-10 tags per image •  2s/image with CPU •  x20 speed up with GPU. Classification problem •  Only have probabilities of each class •  Selecting based on probability threshold fails •  Keeping all information is not sparse -> we keep 5 labels and probabilities per image
  • 35. Labels post-processing Issue with our approach: •  Complementary information •  Redundant information Solution : Matrix Factorization
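Keeping only the five most probable labels per image, as slide 34 describes, could look like the sketch below. The helper name `top_k_labels` is hypothetical; the first two scores are taken from slide 21 and the remaining ones are invented for illustration.

```python
import heapq

def top_k_labels(probabilities, k=5):
    """Keep the k most probable labels of one image as (label, probability) pairs."""
    return heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])

# Class probabilities for one image (only the first two values come from the slides).
scores = {
    "swimming_pool/outdoor": 0.65,
    "inn/outdoor": 0.06,
    "hotel/outdoor": 0.12,
    "beach": 0.08,
    "coast": 0.05,
    "tower": 0.03,
    "skyscraper": 0.01,
}
kept = top_k_labels(scores)
```

Unlike a fixed probability threshold, this always yields the same number of labels per image, which keeps the image x tag matrix sparse and uniform.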
  • 36. Topic extraction with Non-Negative Matrix Factorization •  Non-Negative Matrix Factorization (NMF): X ≈ WH •  X : image x tag, non negative •  W : image x theme •  H : theme x tag (scikit-learn implementation) •  Most represented themes •  Swimming-pool_Apartment_Putting-green •  Ocean_Coast_SandBar •  Coast_SeaCliff_RockArch •  Beach_Coast_BoardWalk •  Bridge_Viaduct_River •  Palace_BuildingFacade_Mansion •  Castle_Mansion_Monastery •  HotelRoom_Bedroom_DormRoom •  Dimension reduction •  200x200 pixels -> 600 tags => 30 themes •  Faster content based filtering: an image is often a sparse combination of themes •  Balanced vector for content based: each theme has the same explanatory power •  Explainability: each theme corresponds to a few labels
  • 37. Image content detection Topic scores determine the importance of topics in an image. Image 1 (topic: topic score %): Golf course – Fairway – Putting green: 31; Hotel – Inn – Apartment building outdoor: 30; Swimming pool – Lido deck – Hot tub outdoor: 22; Beach – Coast – Harbor: 17. Image 2 (topic: topic score %): Tower – Skyscraper – Office building: 62; Bridge – River – Viaduct: 38
  • 38. Note on model performance •  Image labels are used for similarity. Calling a grass field “putting green”: •  Is not important if all grass fields are called this way. •  Would be if we had lots of golf trip sales. •  Improving the NN performance ? •  Labels are used in NMF and reduced to themes •  Themes are used to calculate similarities for CB recommenders •  CB recommenders are used as a feature in the meta model •  The meta model gives probabilities of purchase = order •  Users only check 10 sales… -> what is the change in online performance for 1% accuracy ?
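To make the factorization X ≈ WH concrete, here is a small NumPy sketch. The slides use the scikit-learn implementation; this block swaps in plain Lee-Seung multiplicative updates so that it stays dependency-light, and the toy matrix, tag names, and function name `nmf` are assumptions for illustration only.

```python
import numpy as np

def nmf(X, n_themes, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (image x tag) into W (image x theme) and H (theme x tag)."""
    rng = np.random.default_rng(seed)
    n_images, n_tags = X.shape
    W = rng.random((n_images, n_themes)) + eps
    H = rng.random((n_themes, n_tags)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy image x tag matrix; columns: pool, palm_tree, ski_slope, mountain.
X = np.array([
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.9],
    [0.1, 0.0, 0.9, 0.7],
])
W, H = nmf(X, n_themes=2)
error = np.linalg.norm(X - W @ H)  # Frobenius reconstruction error
```

Each row of W is the theme vector of one image, and each row of H lists the tags that make up a theme, which is what makes the themes readable.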
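The similarity step in this chain can be illustrated with theme vectors like those on slide 37. The slides do not say which similarity measure is used, so cosine similarity is an assumption here, as are the shortened theme names and the `beach_hotel` vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse theme vectors stored as dicts."""
    themes = set(a) | set(b)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in themes)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Theme scores from slide 37 (as fractions), with shortened theme names.
golf_resort = {"golf": 0.31, "hotel": 0.30, "pool": 0.22, "beach": 0.17}
city_trip = {"tower": 0.62, "bridge": 0.38}
# Hypothetical third sale, invented for comparison.
beach_hotel = {"hotel": 0.40, "pool": 0.25, "beach": 0.35}

sim_resort_city = cosine_similarity(golf_resort, city_trip)
sim_resort_beach = cosine_similarity(golf_resort, beach_hotel)
```

Because only the relative ordering of sales matters downstream, a consistently mislabelled tag changes little as long as similar images receive the same label.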
  • 40. Results ? All Visits : •  Mostly France •  Pool displayed First Recommendation •  Fails to display pools Only Images ? •  Pools all around the world Third column = Right Mix
  • 41. Results ? All Visits : •  Spain •  Sun & Beach •  Pool displayed First Recommendation •  Displays nature… Only Images ? •  Pool all around the world Third = Right Mix •  Get the bungalow feature !
  • 42. Conclusion •  Do iterative data science ! •  Start simple and grow •  Validate each step •  Image labelling = BI on steroids •  Deep Learning ? •  Is there existing data ? •  Is there a pre-trained model ? •  Transfer Learning •  Cheaper, faster •  Any Data Scientist can do it •  What’s Next ?
  • 43. Learned along the way For ski sales, showing indoor pictures performs better What’s next ? •  Comparison of proposed/visited vacations: Attractiveness(tag) = (visited offers containing tag) / (proposed offers containing tag). VPG offers database: Ocean 67%, Bedroom 33%. Visits database: Ocean 33%, Bedroom 67%. •  Voyage Privé offers database = baseline •  « Bedroom » attractiveness = 0.67/0.33 ≈ 2 •  « Ocean » attractiveness = 0.33/0.67 ≈ 0.5
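The attractiveness ratio on slide 43 can be worked through in a few lines. This is a sketch with hypothetical function names and toy offers, chosen so that the shares match the 67% / 33% split shown on the slide.

```python
from collections import Counter

def tag_share(offers):
    """Fraction of offers whose tag list contains each tag."""
    counts = Counter(tag for tags in offers for tag in set(tags))
    return {tag: n / len(offers) for tag, n in counts.items()}

def attractiveness(proposed_offers, visited_offers):
    """Share of visited offers containing a tag divided by its share among proposed offers."""
    proposed = tag_share(proposed_offers)
    visited = tag_share(visited_offers)
    return {tag: visited.get(tag, 0.0) / share for tag, share in proposed.items()}

# Toy data: 2 of 3 proposed offers show the ocean, but 2 of 3 visits go to bedroom pictures.
proposed_offers = [["ocean"], ["ocean"], ["bedroom"]]
visited_offers = [["ocean"], ["bedroom"], ["bedroom"]]
scores = attractiveness(proposed_offers, visited_offers)
```

A value above 1 means the tag is visited more often than its share of the catalogue would suggest; a value below 1 means the opposite.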
  • 44. Learned along the way For ski sales, showing indoor pictures performs better What’s next ?
  • 46. What’s Next ? Customize the Image ! Kenya Prague Berlin Cambodia
  • 47. Thank you for your attention !