SophiaConf 2018 - J. Rahajarison (My Little Adventure)

Presentation slides: Natural Language Processing and Machine Learning
  1. Smart recommendation engine of things to do in destination
     Natural Language Processing and Machine Learning
     How to automatically categorize tours and activities?
     July 2nd 2018
  2. Introduction
     MyLittleAdventure (@mylitadventure)
     Johnny RAHAJARISON (@brainstorm_me)
     johnny.rahajarison@mylittleadventure.com
  3. Agenda
     Introduction to machine learning
     Why is Natural Language Processing so hard?
     How do we process text?
     Let's try it out
     Go further
  4. What's Machine Learning?
     Software that does something without being explicitly programmed to, just by learning through examples.
     The same software can be used for various tasks.
     It learns from experience with respect to some task and performance measure, and improves with experience.
  5. Unsupervised algorithms
     Clustering / Anomaly detection
  6. Supervised algorithms
     Classification / Regression
  7. You said text, right?
  8. Obviously, you said text, not numbers
     Context, polysemy, synonyms, enantiosemy, neologisms, sarcasm, names, rare words, common sense, dialects, informal language / abbreviations
  9. Ambiguity?
     "I saw a man on a hill with a telescope."
  10. Ambiguity?
     "I saw a man on a hill with a telescope." (Did I use the telescope to see him, or does the man on the hill hold it?)
  11. Text should be prepared
  12. Let's clean our text first
     ✓ Tokenize sentences
     ✓ Tokenize words
     ✓ Transliterate
     ✓ Normalize
     ✓ Filter out punctuation, special characters, and stop words
     ✓ Use a stemmer and / or a lemmatizer ("be" = am, are, is; "vari" = variation, vary, varies, variables)
     Example output (the stemmed opening of Kafka's Metamorphosis):
     ['one', 'morn', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', 'He', 'lay', 'on', 'hi', 'armour-lik', 'back', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', 'hi', 'mani', 'leg', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'wave', 'about', 'helplessli', 'as', 'he', 'look', 'what', "'s", 'happen', 'to']
  13. A bag of words
     Each unique word in our dictionary will correspond to a feature:
     ["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]
     {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
     {131: 1, 132: 2, 133: 1, 134: 1, 135: 2, 136: 1, 137: 1}
     [1, 2, 1, 1, 2, 1, 1]
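In scikit-learn the same bag-of-words construction is one vectorizer call. A sketch; the {131: 1, ...} indices on the slide came from a larger shared dictionary, so the indices here will differ.

```python
# Bag-of-words with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies. Mary likes movies too."]

# a wider token pattern keeps short tokens such as "to", which the
# default pattern would drop
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)  # word -> feature index
print(counts.toarray()[0])     # one count per vocabulary word
```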
  14. TF-IDF
     TF (Term Frequency) = occurrences of a term / total words in each document
     IDF (Inverse Document Frequency) = log(count of documents / count of documents where the term appears)
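A sketch of the same weighting with scikit-learn, using a few activity labels from later slides as toy documents. Note that TfidfVectorizer applies a smoothed variant of the log(N / df) formula, so its numbers differ slightly from a hand computation.

```python
# TF-IDF weighting: one row per document, one column per vocabulary word.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Eiffel Tower with Dinner",
    "Gourmet tour of Paris",
    "Dinner cruise with Champagne",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # rarer terms get higher weight
```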
  15. Another way: use word embeddings
     Word embeddings capture relative meaning
     Use vectors to get a comprehensive geometry of words
  16. Another way: use word embeddings
     Paris - France + China = Beijing
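The classic embedding-arithmetic example, sketched with gensim and a small pre-trained GloVe model from gensim's downloader; the model used in the talk is not stated, so this choice is an assumption.

```python
# Word-vector arithmetic: the nearest word to
# vec("paris") - vec("france") + vec("china") is "beijing".
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-100")
print(vectors.most_similar(positive=["paris", "china"],
                           negative=["france"], topn=1))
```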
  17. Another way: use word embeddings
     Example of the "movies" vector (first components shown; the full embedding runs to hundreds of dimensions):
     movies -0.34582 0.057328 0.1328 0.22376 0.10161 0.52948 -0.30199 0.45676 -0.37643 -0.51857 0.67325 -0.012444 -0.099021 0.43823 -0.28905 -1.0183 ...
  18. Another way: use word embeddings
     Replacing bag-of-words counts with the embedding vector for "movies":
     {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
     {131: 1, 132: 2, 133: 1, 134: 1, 135: 2, 136: 1, 137: 1}
     [1, 2, 1, 1, 2, 1, 1]
     [[], 2*[], [], [], 2*[-0.34582, 0.057328, …, 0.22376, 0.10161], [], []]
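One common way to turn per-word vectors into a single document vector is a count-weighted average; a sketch of that idea, not necessarily the talk's exact scheme, again assuming the small GloVe model from gensim's downloader.

```python
# Average the embedding vectors of a document's tokens to get one
# dense feature vector for the whole document.
import numpy as np
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-100")
tokens = ["john", "likes", "to", "watch", "movies",
          "mary", "likes", "movies", "too"]

doc_vector = np.mean([vectors[t] for t in tokens if t in vectors], axis=0)
print(doc_vector.shape)  # (100,): one dense vector per document
```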
  19. Let's predict
  20. Recipe
     Prepare training / test data (files, database, cache, data flow)
     Select a model and its (hyper)parameters
     Train the algorithm
     Use or store your trained estimator
     Make predictions
     Measure (accuracy, precision)
     An end-to-end version of this recipe is sketched below.
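A minimal end-to-end sketch of the recipe with scikit-learn. The toy labels, the model choice, and the "food_classifier.joblib" filename are illustrative assumptions, not the talk's production setup.

```python
# Recipe: prepare data -> select model -> train -> store -> predict -> measure.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["Eiffel Tower with Dinner", "Gourmet tour of Paris",
         "Dinner cruise with Champagne", "Louvre Museum fast track",
         "Segway tour of city highlights", "Aquarium of Paris ticket"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = Food, 0 = not Food

# Prepare training / test data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels)

# Select a model and its (hyper)parameters, then train
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)

# Store the trained estimator, make predictions, measure
joblib.dump(model, "food_classifier.joblib")
print(model.predict(X_test), model.score(X_test, y_test))
```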
  21. Collect our training & test dataset

     Food Label | Vectorized
     Eiffel Tower with Dinner | [ 0., 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0., 0.5]
     Skip the line Eiffel Tower | [ 0., 0., 0., 0., 0., 0.3967171, 0., 0., 0., 0.47792296, 0., 0., 0., 0., 0., 0.47792296, 0.47792296, 0., 0., 0.3967171, 0., 0.]
     Louvre Museum fast track | [ 0., 0., 0., 0., 0., 0., 0.5, 0., 0., 0., 0.5, 0.5, 0., 0., 0., 0., 0., 0., 0., 0., 0.5, 0.]
     Gourmet tour of Paris | [ 0., 0., 0., 0., 0., 0., 0., 0.58910044, 0., 0., 0., 0., 0.41798437, 0.48900396, 0., 0., 0., 0., 0.48900396, 0., 0., 0.]
     Segway tour of city's highlights | [ 0., 0., 0.48838773, 0., 0., 0., 0., 0., 0.48838773, 0., 0., 0., 0.3465257, 0., 0.48838773, 0., 0., 0., 0.40540376, 0., 0., 0.]
     Dinner cruise with Champagne | [ 0., 0.54408243, 0., 0.54408243, 0.45163515, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.45163515]
     Aquarium of Paris ticket | [ 0.55967542, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.39710644, 0.46457866, 0., 0., 0., 0.55967542, 0., 0., 0., 0.]
     … | …
  22. Choose a classifier algorithm
  23. A few recommendations
     Naive Bayes / Logistic Regression
     Decision Trees
     Random Forest
     Gradient Boosting
     SVM
     Neural Networks
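Several of these classifiers share scikit-learn's fit / predict API, so they can be swapped behind the same TF-IDF features. A sketch with toy data, not a benchmark.

```python
# Try a few of the recommended classifiers behind one feature pipeline.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["Eiffel Tower with Dinner", "Gourmet tour of Paris",
         "Dinner cruise with Champagne", "Louvre Museum fast track",
         "Segway tour of city highlights", "Orsay dedicated entrance"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = Food

for clf in (MultinomialNB(), RandomForestClassifier(),
            GradientBoostingClassifier(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf).fit(texts, labels)
    print(type(clf).__name__, model.predict(["Dinner cruise on the Seine"]))
```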
  24. Let's measure (scores on both the training set and real data)

     Food Label | Prediction
     Eiffel Tower with Dinner | 0.83
     Gourmet tour of Paris | 0.96
     Dinner cruise with Champagne | 1.0
     Segway tour of city's highlights | 0.03
     Orsay dedicated entrance | 0.02
     3 course meal in Eiffel Tower | 0.97
     Cooking class in Paris | 0.89
     Moulin Rouge Paris dinner show | 0.91
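Scores like the 0.83 / 0.96 above read as class probabilities; in scikit-learn they come from predict_proba. A self-contained sketch with toy training data, not the talk's real dataset.

```python
# Train a probabilistic classifier, then score unseen activity labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["Eiffel Tower with Dinner", "Gourmet tour of Paris",
               "Dinner cruise with Champagne", "Louvre Museum fast track",
               "Segway tour of city highlights", "Orsay dedicated entrance"]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = Food

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_labels = ["3 course meal in Eiffel Tower", "Aquarium of Paris ticket"]
for label, p in zip(new_labels, model.predict_proba(new_labels)[:, 1]):
    print(f"{label}: {p:.2f}")  # estimated probability of the Food class
```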
  25. (image-only slide)
  26. Go further
  27. There is way more
     Cross-validation dataset
     N-grams
     Wrong user content
     Misspellings & typos
     Hard to get training data
     Harder languages or transliteration issues
     Memory / computing limitations
     Online learning & stacking
     (Cross-validation and n-grams are sketched below.)
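Two of these leads in one sketch: word n-grams as extra features and k-fold cross-validation for a more honest score. Toy data and an assumed setup, not the talk's.

```python
# N-grams + cross-validation with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["Eiffel Tower with Dinner", "Gourmet tour of Paris",
         "Dinner cruise with Champagne", "Cooking class in Paris",
         "Louvre Museum fast track", "Segway tour of city highlights",
         "Orsay dedicated entrance", "Aquarium of Paris ticket"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# ngram_range=(1, 2) adds bigrams such as "eiffel tower" to the features
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
print(cross_val_score(model, texts, labels, cv=4).mean())
```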
  28. Some resources
     https://www.slideshare.net/mylittleadventure/introduction-machine-learning-by-mylittleadventure
     Libraries: scikit-learn (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), NLTK (see the NLTK Book)
     Dataset: Stanford's GloVe
     Course: Andrew Ng (Coursera)
     Platform: https://bit.ly/2uL954v
  29. Thank you
     Questions?
     @mylitadventure / @brainstorm_me
     johnny.rahajarison@mylittleadventure.com
     July 2nd 2018
