Introduction to word embeddings with Python

Introduction to
word embeddings
Pavel Kalaidin
@facultyofwonder
Moscow Data Fest, September 12th, 2015
distributional hypothesis
лойс
good stuff, лойс
лойс for the song
out of principle, I won't give it a лойс
mutual лойсы
лойс if you agree
What is the meaning of лойс?
кек
кек, or what?
кек)))))))
well, you're a кек
What is the meaning of кек?
vectorial representations
of words
simple and flexible
platform for
understanding text and
probably not messing up
one-hot encoding?
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
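A minimal sketch of one-hot encoding, assuming a made-up five-word vocabulary. Every word gets a vector as long as the vocabulary with a single 1 in it, so any two different words are orthogonal and the encoding carries no notion of similarity.

```python
# Toy vocabulary; the words are illustrative only.
vocab = ["cat", "dog", "king", "queen", "song"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


cat, dog = one_hot("cat"), one_hot("dog")
print(cat)            # [1, 0, 0, 0, 0]
print(dot(cat, dog))  # 0: distinct words are always orthogonal
print(dot(cat, cat))  # 1
```

With a realistic vocabulary the vectors would have hundreds of thousands of entries, all but one of them zero.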
co-occurrence matrix
recall: word-document co-occurrence
matrix for LSA
credits: [x]
from entire document to
window (length 5-10)
still seems suboptimal ->
big, sparse, etc.
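A rough sketch of window-based co-occurrence counting, assuming a tiny made-up corpus and a symmetric window. The function name and the corpus are illustrative and are not taken from the talk.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count ordered (word, context) pairs that appear within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("i", "like")])    # 2
print(counts[("like", "i")])    # 2
print(counts[("deep", "nlp")])  # 0
```

Only the nonzero cells are stored here. A dense matrix would need one cell per pair of vocabulary words, which is why the full matrix gets big and sparse so quickly.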
lower dimensions, we
want dense vectors
(say, 25-1000)
How?
matrix factorization?
SVD of co-occurrence
matrix
lots of memory?
idea: directly learn low-dimensional vectors
here comes word2vec
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al: [paper]
idea: instead of capturing co-occurrence counts
predict surrounding words
Two models:
CBOW (continuous bag-of-words)
predicting the word given its context
skip-gram
predicting the context given a word
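To make the difference between the two models concrete, here is a small sketch that only builds the training examples each model would see. It does not train anything, and the function names and the window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram predicts each context word from the center."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window=2):
    """Yield (context_words, center) examples: CBOW predicts the center from its context."""
    for i, center in enumerate(tokens):
        context = [
            tokens[j]
            for j in range(max(0, i - window), min(len(tokens), i + window + 1))
            if j != i
        ]
        yield context, center


tokens = "the quick brown fox".split()
print(list(skipgram_pairs(tokens, window=1)))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
print(list(cbow_examples(tokens, window=1)))
# [(['quick'], 'the'), (['the', 'brown'], 'quick'),
#  (['quick', 'fox'], 'brown'), (['brown'], 'fox')]
```

Skip-gram produces one example per (word, neighbour) pair, while CBOW produces one example per position.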
Explained in great detail here, so we'll skip it for now. Also see: word2vec Parameter Learning Explained, Rong [paper]
CBOW: several times faster than skip-gram,
slightly better accuracy for frequent words
Skip-gram: works well with a small amount of
data, represents rare words or phrases well
Examples?
W_woman - W_man ≈ W_queen - W_king
classic example
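The analogy arithmetic can be sketched with hand-made toy vectors. This is an assumption for illustration: real embeddings are learned from a corpus and have far more dimensions. The query words are excluded from the candidates, which is the usual convention.

```python
import math

# Hand-written toy vectors, NOT learned embeddings.
# Dimensions are loosely "royalty", "gender", "is a fruit".
vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king"))  # queen
```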
<censored example>
word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling
Word-Embedding Method, Goldberg and Levy, 2014 [arxiv]
all done with gensim:
github.com/piskvorky/gensim/
...failing to take advantage of
the vast amount of repetition
in the data
so back to co-occurrences
GloVe for Global Vectors
Pennington et al, 2014: nlp.stanford.edu/pubs/glove.pdf
Ratios seem to cancel noise
The gist: model ratios with
vectors
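A small sketch of why ratios are informative, using made-up co-occurrence counts in the spirit of the ice/steam example from the GloVe paper. The numbers below are invented for illustration and are not the paper's figures.

```python
# Made-up co-occurrence counts X[word][context], for illustration only.
X = {
    "ice":   {"solid": 80, "gas": 2,  "water": 300, "fashion": 2},
    "steam": {"solid": 2,  "gas": 40, "water": 220, "fashion": 2},
}


def prob(word, context):
    """P(context | word) = X[word][context] / sum over k of X[word][k]."""
    row = X[word]
    return row[context] / sum(row.values())


def ratio(context, word_a="ice", word_b="steam"):
    """P(context | word_a) / P(context | word_b)."""
    return prob(word_a, context) / prob(word_b, context)


for context in ["solid", "gas", "water", "fashion"]:
    print(context, round(ratio(context), 3))
```

A context tied to the first word ("solid") gives a ratio well above 1, a context tied to the second ("gas") gives a ratio well below 1, and contexts related to both or to neither ("water", "fashion") stay near 1.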
The model
Preserving
linearity
Preventing mixing
dimensions
Restoring
symmetry, part 1
recall:
Restoring symmetry, part 2
Least squares problem it is now
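A minimal sketch of the resulting weighted least squares objective, following the form in Pennington et al.: the sum over nonzero counts of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2. The defaults x_max=100 and alpha=0.75 are the values reported in the paper. The data layout (a dict of counts, lists of vectors) and the example numbers are assumptions made here for illustration.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Weighted squared error between (w_i . w~_j + b_i + b~_j) and log X_ij."""
    total = 0.0
    for (i, j), x in cooc.items():
        pred = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j])) + b[i] + b_ctx[j]
        total += weight(x) * (pred - math.log(x)) ** 2
    return total


# Tiny hypothetical example: 2 words, 2-dimensional vectors, made-up counts.
cooc = {(0, 1): 10.0, (1, 0): 10.0}
W = [[0.1, 0.2], [0.3, 0.4]]      # word ("input") vectors
W_ctx = [[0.5, 0.6], [0.7, 0.8]]  # context vectors
b = [0.0, 0.0]
b_ctx = [0.0, 0.0]
print(glove_loss(cooc, W, W_ctx, b, b_ctx))
```

Only nonzero counts enter the sum, so the cost depends on the number of observed pairs, not on the full size of the matrix.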
SGD->AdaGrad
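A sketch of the AdaGrad update on a toy quadratic, not on the GloVe objective itself. Each parameter keeps a running sum of its squared gradients and divides its step by the square root of that sum. The learning rate, epsilon and the toy function are assumptions chosen for illustration.

```python
import math


def adagrad_step(params, grads, grad_sq_sum, lr=0.05, eps=1e-8):
    """Update params in place; each coordinate gets its own shrinking step size."""
    for k, g in enumerate(grads):
        grad_sq_sum[k] += g * g
        params[k] -= lr * g / (math.sqrt(grad_sq_sum[k]) + eps)


# Toy problem: minimise (p0 - 3)^2 + 10 * (p1 + 1)^2.
params = [0.0, 0.0]
grad_sq_sum = [0.0, 0.0]
for _ in range(2000):
    grads = [2 * (params[0] - 3), 20 * (params[1] + 1)]
    adagrad_step(params, grads, grad_sq_sum, lr=0.5)
print(params)  # close to [3, -1]
```

The per-coordinate scaling is what makes this attractive for embeddings: vectors of rare words receive few gradient updates, so they keep a comparatively large step size.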
ok, Python code
glove-python:
github.com/maciejkula/glove-python
two sets of vectors
input and context + bias
average/sum/drop
complexity O(|V|^2)
complexity O(|C|^0.8)
Evaluation: it works
#spb
#gatchina
#msk
#kyiv
#minsk
#helsinki
Compared to word2vec
#spb
#gatchina
#msk
#kyiv
#minsk
#helsinki
t-SNE:
github.com/oreillymedia/t-SNE-tutorial
seaborn:
stanford.edu/~mwaskom/software/seaborn/
Abusing models
music playlists:
github.com/mattdennewitz/playlist-to-vec
deep walk:
DeepWalk: Online Learning of Social
Representations [link]
user interests
Paragraph vectors: cs.stanford.edu/~quocle/paragraph_vector.pdf
predicting hashtags
interesting read: #TAGSPACE: Semantic
Embeddings from Hashtags [link]
RusVectōrēs: distributional semantic
models for Russian: ling.go.mail.ru/dsm/en/
corpus matters
building block for
bigger models
╰(*´︶`*)╯
</slides>