Word embeddings as a service
François Scharffe (@lechatpito, http://www.twitter.com/lechatpito), 3Top (http://www.3top.com)
PyData NYC 2015
Outline of the talk
What is 3Top?
What are word embeddings?
How to implement a simple recommendation system for 3Top categories?
Rank Anything, Rank Everything
3Top is a ranking and recommendation platform
Rankings convey more information than star ratings
Who cares about 3 stars or less? I just want the best stuff
I'd rather trust my friends than read through reviews
If I have more than 3 items to rank, I can probably use a more precise
category
Not yet launched, but the site is up
Let's take a look at http://www.3top.com
Places: http://www.3top.com/category/1138/gyms-for-students-near-lower-east-side
Movies: http://www.3top.com/category/142/movies-about-wall-street
Anything really: http://www.3top.com/category/765/foods-named-after-people
Data & knowledge engineering at 3Top
Building a solid data engineering architecture before launching the site.
Natural language processing pipeline
Parsing categories
Currently using a parser we developed
About to switch to spaCy (http://spacy.io) from Matthew Honnibal. It's great, check it out!
Detecting named entities, locations
A large knowledge graph backed by an ontology
An itemization pipeline matching free text items to entities in the knowledge graph
Category recommendation
How are we going to build a simple recommendation system without any significant number of users, categories, or rankings?
Note the impressive figures:
Number of Users: 316
Number of Rankings: 2123
Number of Categories: 1316
Feel free to add a few rankings:
Wow! ;)
http://www.3top.com
Word embeddings?
Who hasn't heard about word2vec?
Word embeddings represent words in a high-dimensional space in such a way that words appearing in the same contexts are close in that space.
The dimensionality of the space is not that high, typically a few hundred dimensions.
Word embedding is a language modeling method, more precisely a distributed vector representation of words.
Compared to bag-of-words:
Dimensionality is low, and constant with respect to the vocabulary size
Depending on the training algorithm, partially trained models give partially good results
Compared to topic modeling:
Better granularity, the base element is a word
Phrase vectors can also be learned
What are word embedding models good at?
Modeling similarity between words
Allowing algebraic operations on word vectors
sim(tomato, beefsteak) < sim(apple, tomato) < sim(pear, apple)
v(Paris) - v(France) ~= v(Berlin) - v(Germany)
Examples
Examples here use a GloVe model (100 dimensions, 400k vocabulary, trained on Wikipedia and Gigaword, a news corpus).
In [3]: from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")
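Note: load_word2vec_format expects the word2vec text format, which is the raw GloVe format plus a header line giving the vocabulary size and dimensionality. If your GloVe file lacks that header, a minimal conversion sketch (file names here are just examples):

def add_word2vec_header(glove_path, out_path, dims):
    # Prepend the "<vocab_size> <dimensions>" header that gensim expects;
    # raw GloVe files do not have it.
    with open(glove_path) as f:
        lines = f.readlines()
    with open(out_path, "w") as out:
        out.write("{} {}\n".format(len(lines), dims))
        out.writelines(lines)

add_word2vec_header("glove.6B.100d.raw.txt", "glove.6B.100d.txt", 100)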
In [4]: model.most_similar("python", topn=10)
Out[4]: [(u'monty', 0.6886237859725952),
(u'php', 0.586538553237915),
(u'perl', 0.5784406661987305),
(u'cleese', 0.5446674823760986),
(u'flipper', 0.5112984776496887),
(u'ruby', 0.5066927671432495),
(u'spamalot', 0.505638837814331),
(u'javascript', 0.5030568838119507),
(u'reticulated', 0.4983375668525696),
(u'monkey', 0.49764129519462585)]
In [5]: model.most_similar_cosmul(positive=["python", "programming"], topn=5)
Out[5]: [(u'perl', 0.5658619999885559),
(u'scripting', 0.559501588344574),
(u'scripts', 0.5469149351119995),
(u'php', 0.5461974740028381),
(u'language', 0.5350533127784729)]
In [6]: model.most_similar_cosmul(positive=["python", "venomous"], topn=5)
Out[6]: [(u'scorpion', 0.5413044095039368),
(u'snakes', 0.5263831615447998),
(u'snake', 0.5222328901290894),
(u'spider', 0.5214570164680481),
(u'marsupial', 0.517005205154419)]
The classical example:
v(king) - v(man) + v(woman) -> v(queen)
In [7]: model.most_similar_cosmul(positive=["king", "woman"], negative=["man"])
Out[7]: [(u'queen', 0.8964556455612183),
(u'monarch', 0.8495977520942688),
(u'throne', 0.8447030782699585),
(u'princess', 0.8371668457984924),
(u'elizabeth', 0.835679292678833),
(u'daughter', 0.8348594903945923),
(u'prince', 0.8230059742927551),
(u'mother', 0.8154449462890625),
(u'margaret', 0.8147734999656677),
(u'father', 0.8100854158401489)]
Training a model
Very easy once you have a clean corpus
Great tools in Python
Tutorial on training a model using Gensim: http://rare-technologies.com/word2vec-tutorial/
Radim Řehůřek gave a talk last year at PyData Berlin about optimizations in Cython: https://www.youtube.com/watch?v=vU4TlwZzTfU
For GloVe: https://github.com/maciejkula/glove-python
Gensim word2vec implementation specifics:
Training time ~8 hours on 8 processors/8 threads to learn 600 dimensions on a 1.9B-word corpus
Memory requirements depend on the vocabulary size and on the number of dimensions:
3 matrices * 4 bytes (float) * |dimensions| * |vocabulary|
The GloVe implementation in Python (https://github.com/maciejkula/glove-python/) takes half the time but has quadratic memory usage. Check the pull requests for memory optimizations.
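Plugging numbers into that formula gives a quick feasibility check before training; a small sketch (the 2M-word vocabulary is just an illustrative figure):

def w2v_training_memory_gb(dimensions, vocabulary):
    # 3 matrices * 4 bytes (float32) per cell
    return 3 * 4.0 * dimensions * vocabulary / 1e9

print(w2v_training_memory_gb(600, 2000000))  # -> 14.4 (GB) for 600 dims, 2M words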
A good thing to know: a bigger training set does improve the quality of the model, even for specialized tasks.
As a consequence, you probably want to use a huge corpus. Good models are available.
Building your own model can be useful when you want to find out about the properties of your corpus, or when you want to compare different corpora, for example the evolution of language in a newspaper across different periods of time.
Finding a model
From https://github.com/3Top/word2vec-api/:

Model file | Number of dimensions
Google News (GoogleNews-vectors-negative300.bin.gz) | 300
Freebase IDs (https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) | 1000
Freebase names (https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing) | 1000
Wikipedia+Gigaword 5 (http://nlp.stanford.edu/data/glove.6B.zip) | 50/100/200/300
Common Crawl 42B (http://nlp.stanford.edu/data/glove.42B.300d.zip) | 300
Common Crawl 840B (http://nlp.stanford.edu/data/glove.840B.300d.zip) | 300
Twitter (2B Tweets) (http://www-nlp.stanford.edu/data/glove.twitter.27B.zip) | 25/50/100/200
Wikipedia dependency (http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) | 300
DBPedia vectors (https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent) | 1000
Building a recommendation engine for 3Top categories
By combining word vectors, we build category vectors.
In [8]: def build_category_vector(category):
    postags = get_postags(category)  # hypothetical helper; the real POS-tagging code is elided in the slides
    vector = []
    for tag in postags:
        if tag.tagged in ['NN', 'NNS', 'JJ', 'NNP', 'NNPS', 'NNDBN', 'VBG', 'CD']:  # Only keep meaningful words
            try:
                v = word2vec(tag.tagValue)  # Get the word vector
                if v.any():
                    vector.append(v)
            except:
                logger.debug("Word not found in corpus: %s" % tag.tagValue)
                tagset.add(tag.tagValue)  # Keep track of out-of-vocabulary words
    if vector:
        return matutils.unitvec(np.array(vector).mean(axis=0))  # Average the vectors and normalize to unit length
    else:
        return np.empty(300)
We store those vectors in a category space, and at page load time compute the most
similar categories for a given category.
Now let us look at the similarity method.
sim(c1, c2) = v(c1) · v(c2)
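Since build_category_vector returns unit vectors, cosine similarity reduces to a plain dot product over the matrix of category vectors. A minimal sketch of the lookup, under that assumption and not the actual CategorySimilarity code:

import numpy as np

def most_similar_category_indices(query_vec, space, n=5):
    # space: one unit-normalized category vector per row; query_vec: unit vector
    sims = space.dot(query_vec)        # cosine similarity == dot product on unit vectors
    return np.argsort(sims)[::-1][:n]  # indices of the n most similar categories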
In [21]: cs = CategorySimilarity()
# print(Category.objects.all().count())
category = Category.objects.get(category=u"Blue-collar beers that come in a can")
_ = [print(c) for c in cs.most_similar_categories(category, n=5)]
DEBUG Category space size (as found in the cache): 1125
Belgian Trappist Beers
Belgian Beer Cafe in NYC
Dark and Stormy Cocktail in NYC
Brands of Ginger Beer
Pink Drinks
In [10]: category = Category.objects.get(category=u"Italian Restaurants in NYC.")
_ = [print(c) for c in cs.most_similar_categories(category, n=5)]
Italian Restaurants in NY
Restaurants in Nyc
NYC Mexican Restaurants
Romanian Restaurants in NYC
Thai Restaurants in NYC
In [11]: category = Category.objects.get(category=u"Coen Brothers Movies")
_ = [print(c) for c in cs.most_similar_categories(category, n=10)]
Quentin Tarantino Movies
Martin Scorsese Films.
Movies Starring Creepy Children
Tim Burton Movies
Movies Starring Sean Penn
Pixar Movies
Godfather Movies
Berlin Indie Movie Theaters
Kubrick Movies
Harry Potter Movies
Our recommendation system uses the Common Crawl (42B words, 300 dimensions) model trained with GloVe.
It takes around 6 GB in memory... and this is a problem:
We run a Django server and 8 Celery workers on an EC2 T2 micro... That would be a lot of memory for that poor instance.
A word embedding service
We separate the word embedding model out as a service
Simple Flask server with a few primitives:
curl http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java
curl http://127.0.0.1:5000/word2vec/n_similarity?ws1=Python&ws1=programming&ws2=Java
curl http://127.0.0.1:5000/word2vec/model?word=Python
curl http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=queen&negative=man
Easy to set up:
python word2vec-api --model path/to/the/model [--host host --port 1234]
Get it at https://github.com/3Top/word2vec-api
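For illustration, a stripped-down sketch of one such endpoint; the actual server lives in the repository above, and the model path here is an assumption:

from flask import Flask, request, jsonify
from gensim.models import Word2Vec

app = Flask(__name__)
model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")  # example path

@app.route("/word2vec/similarity")
def similarity():
    # e.g. curl "http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java"
    return jsonify(similarity=float(model.similarity(request.args["w1"], request.args["w2"])))

if __name__ == "__main__":
    app.run(port=5000)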
Caching the vector space
As the number of categories increases, we do not want to rebuild category vectors and hit the database every time a recommendation is needed (every page access). Category vectors are also of significant size:
In [12]: print(np.fromstring(category.vector).nbytes)
2400
In [13]: print(u"... and for {} categories the space becomes large (~{}MB)".format(len(cs
.category_space.syn0), cs.category_space.syn0.nbytes/1000000))
... and for 1125 categories the space becomes large (~2MB)
We store vectors with the category object in MySQL, using a base64 encoding of the
numpy object. Let's look at it:
In [14]: print(category._vector[:1000] + "...")
oNl7j9l8hr/2FoHhEbSyv+GALkJ6JVU/rxm5pueouL80aF72QLiivzr0z1WKaZM/i3M5FvqQmD+hDmI
s7fitv+lL6cLFSbM/lSwwkBxv0D/sA+FgTxmhP5+lJZPGVLE/q8pBj07ukT9OceKjxl2jv4s4cA7RJI
c/JxVUF8afnr/RQcXciUyCP+M4N3mbtrS/Ngwo85uUor/+4vqargCyP7YnHbTSv3W/MHjrh6iHsr/1v
kzSmI+yP+bsS7E0B5u/JJ5iJAUNoT//xo0IJ3inP5/BwCfgWZ2/Q2r8q9Fuir/KdAOAr0OQPwzGTnUX
U5i/9uQD77+xuD/1QKbEaDWjvyqfSePd7XE/3RLqJXiOrz8ZyEDICd2UP2beFLiqPZy/rIb+8sFgqr+
ILyc3/5yoP5pL25IahpQ/4WpgCeuNZ7/ley+Tl9ygP+knz2odUHo/yBSdc5+Klj+GLgrafftvP2yiB7
6KBY2/z0RqB+1ri7+THdXBVVKvPzwZ2X+2HaA/oOThsHeidL/O7w8+bummv8Z8XCqeYas/WzQpioG6q
r+JaauGrie2P7+8NmNBN5o/0Ji6XFJMpj/xAtoHvg9PPyr3OOBXVaG/M2aCbN8dv79pANKgDzNrPy4X
XBNVi4S/WBuYYjWZlD8T/W3jLbOJPy3xHNTzarQ/MoOWx7aZtz/RDMwbryievwA5kQgazaO/3Ep0jVo
1rD+ns3oJ3iWUv4TlEPcAnJy/dHNcwygjnr/cMGYNKPbCP5E06afPWa6/mUHAC+8mjj+NwgyjQFB5v6
ffLvduuai/kBntVvsdpb8Yg3KzY/qev9r5son3VJg/h06aD0/IuD8NMHm7jGViv7o8zQzPd5U/esP4A
x6BrL8TOZuX+qGpP1WHNPzdQH0/7HXRMAqXmr9G8pkwjbenv3RvQppal7i/E5jWmLXSp792VpPxJeOj
PyEKhEhl324/1E00QnHdvr9Mg0Fohd+cP6UAj0X5R5g/2umwTF42...
In [15]: # a property method takes care of the decoding
def get_vector(self):
return base64.b64decode(self._vector)
def set_vector(self, value):
encoded = base64.b64encode(value)
self._vector = encoded
vector = property(get_vector, set_vector)
In [16]: np.fromstring(category.vector)[:100]
Out[16]: array([-0.01098032, -0.07306015, 0.00129067, -0.09632728, -0.03656199,
0.01895729, 0.02399054, -0.05853978, 0.07534443, 0.25678171,
0.03339623, 0.06769982, 0.01751063, -0.03782483, 0.01130069,
-0.02990636, 0.00893505, -0.08091137, -0.03629005, 0.07032291,
-0.00530989, -0.07238248, 0.07250362, -0.02639468, 0.03330246,
0.04583857, -0.02866316, -0.01290668, 0.0158832 , -0.02375447,
0.09646225, -0.03751686, 0.00437724, 0.06163383, 0.02037444,
-0.02757899, -0.05151945, 0.04807279, 0.02004282, -0.00287529,
0.03293298, 0.00642406, 0.02201318, 0.0039041 , -0.01417073,
-0.01338945, 0.06117504, 0.03147669, -0.00503775, -0.04474968,
0.05347914, -0.05220418, 0.086543 , 0.02560141, 0.04355104,
0.00094792, -0.03385424, -0.12154957, 0.00332025, -0.01003138,
0.02011569, 0.01254879, 0.07975696, 0.09218924, -0.02945207,
-0.03867418, 0.05509456, -0.0196757 , -0.02793886, -0.029431 ,
0.1481371 , -0.05927895, 0.0147227 , -0.00618005, -0.04828975,
-0.04124437, -0.03025203, 0.02376162, 0.09680647, -0.00224569,
0.02096485, -0.05567259, 0.05006393, 0.00714194, -0.0259668 ,
-0.04632226, -0.09605948, -0.04652946, 0.03884238, 0.00376863,
-0.12056644, 0.02819642, 0.02371206, 0.08286085, 0.08104846,
-0.03060514, -0.0313298 , -0.00715603, -0.05278924, 0.0031662 ])
In order to avoid issuing a few thousand SQL queries every time a page is loaded
we use Memcache to store the category space.
As the space is larger than 1 MB we store each vector with its own key (the
category Id). They share a common key prefix.
We directly store the numpy vectors through the Gensim API.
A separate key is used for the vocabulary indexes.
In [17]: def set_space_cache(space):
    sim.set(VOC, space.vocab)
    sim.set(IDX, space.index2word)
    sim.set_many({"{0}-{1}".format(VEC, i): space.syn0[i] for i in range(len(space.vocab))})
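The read path is the mirror image. A hypothetical sketch of the counterpart, get_space_cache (not shown in the slides), assuming sim follows the Django cache API with get/get_many:

def get_space_cache(space):
    space.vocab = sim.get(VOC)
    space.index2word = sim.get(IDX)
    keys = ["{0}-{1}".format(VEC, i) for i in range(len(space.vocab))]
    cached = sim.get_many(keys)
    space.syn0 = np.array([cached[k] for k in keys])  # rebuild the vector matrix in key order
    return space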
This per-key scheme also allows adding a category vector to the space without having to rebuild it, simply by stacking its vector in the cache and updating the cached space indexes.
In [18]: def add_last_vector_to_space_cache(space):
sim.set(VOC, space.vocab)
sim.set(IDX, space.index2word)
sim.set("{}-{}".format(VEC, len(space.vocab)-1), space.syn0[-1])
Updates
Each process gets its own copy of the vector space.
Whenever a category is added, the space is updated in cache.
Django signals are used to tell other processes to reload the space from cache.
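A hedged sketch of that wiring; the handler and the version key are assumptions, not the actual 3Top code:

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Category)
def on_category_saved(sender, instance, created, **kwargs):
    if created:
        # Push the new vector into the shared cache, then bump a
        # hypothetical version key that workers poll to trigger a reload.
        add_last_vector_to_space_cache(cs.category_space)
        sim.incr(SPACE_VERSION)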
Work in progress
We are about to add a few hundred thousand generated categories
The category space will become large in memory: 8 workers * 2.4 kB * 100,000 categories = 1.9 GB
Including entity vectors would improve results for names, places, etc.
Training on a specialized corpus of categories scraped from all over the web
Train a phrase2vec model on these categories
Resources
Tutorials & Applications
Instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
Word embeddings and RNNs: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
Word2vec gensim tutorial: http://rare-technologies.com/word2vec-tutorial/
Clothing style search: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
In digital humanities: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
In digital humanities, application to gender studies: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html
Document classification on Yelp reviews: http://nbviewer.ipython.org/github/taddylab/deepir/blob/master/w2v-inversion.ipynb
Resources
Academic Papers
Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and
documents." arXiv preprint arXiv:1405.4053 (2014).
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." (2014).
Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s
negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722
(2014).
In [19]: Thank you !
File "<ipython-input-19-f087ca1d6988>", line 1
Thank you !
^
SyntaxError: invalid syntax