Hendrik Heuer

Stockholm NLP Meetup
word2vec 

From theory to practice
About me
Hendrik Heuer
hendrikheuer@gmail.com
http://hen-drik.de
@hen_drik
“You shall know a word by the company it keeps”
–J. R. Firth 1957
Quoted after Socher
Vectors are directions in space
Vectors can encode relationships
man is to woman as king is to ?
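The analogy can be tried out with plain vector arithmetic. The sketch below is only an illustration: the three-dimensional vectors are invented by hand (real word2vec vectors are learned from a corpus and typically have a few hundred dimensions), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Toy vectors, invented for illustration only (not learned from any corpus).
VECTORS = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.0, 0.2, 0.0],
}

def cosine(a, b):
    """Cosine similarity: compares the directions of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors=VECTORS):
    """Answer 'a is to b as c is to ?' via the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king"))  # prints "queen" for these toy vectors
```

With these hand-picked numbers the offset from "man" to "woman" is the same as the offset from "king" to "queen", which is the relationship the slide describes.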
word2vec
• by Mikolov, Sutskever, Chen, Corrado and Dean at Google
• NIPS 2013
• takes a text corpus as input and produces the word vectors as output
Sweden
Similar words
Harvard
Similar words
word2vec
• word meaning and relationships between words are encoded spatially
• two main learning algorithms in word2vec: continuous bag-of-words and continuous skip-gram
Goal
continuous bag-of-words
• predicting the current word based on the context
• order of words in the history does not influence the projection
• faster & more appropriate for larger corpora
Mikolov et al.
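To make the bag-of-words idea concrete, here is a minimal sketch of how training examples could be cut out of a sentence. It only covers the data preparation, not the neural network itself. The function name `cbow_pairs` and the fixed `window` parameter are assumptions made for this illustration.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) examples for CBOW-style training."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Sorting the context makes explicit that word order is ignored.
        yield sorted(left + right), target

sentence = "the quick brown fox jumps".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each example asks the model to predict the word in the middle from the unordered set of words around it.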
continuous skip-gram
• maximize classification of a word based on another word in the same sentence
• better word vectors for infrequent words, but slower to train
Mikolov et al.
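Skip-gram turns the prediction around: each word is used to predict the words near it. The sketch below shows how such (input, context) pairs could be generated. It is a simplification under stated assumptions: the name `skipgram_pairs` is hypothetical and the window size is fixed here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (input_word, context_word) examples for skip-gram-style training."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

sentence = "the quick brown fox jumps".split()
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

One sentence yields more skip-gram pairs than CBOW examples, which fits the observation that skip-gram is slower to train.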
Why it is awesome
• there is a fast open-source implementation
• can be used as features for natural language processing tasks and machine learning algorithms
Machine Translation
Quoted after Mikolov
Sentiment Analysis
Quoted after Socher
Recursive Neural Tensor Network
Image Descriptions
Quoted after Vinyals
Using word2vec
• Original: http://word2vec.googlecode.com/svn/trunk/
• C++11 version: https://github.com/jdeng/word2vec
• Python: http://radimrehurek.com/gensim/models/word2vec.html
• Java: https://github.com/ansjsun/word2vec_java
• Parallel Java: https://github.com/siegfang/word2vec
• CUDA version: https://github.com/whatupbiatch/cuda-word2vec
Quoted after Wang
Using it in Python
Usage
Training a model
Quoted after Řehůřek
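The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of training with gensim. The toy corpus and the wrapper function `train` are invented for illustration, and gensim is imported lazily, so it has to be installed before `train` is called.

```python
# Tiny in-memory corpus, invented for illustration: one token list per sentence.
sentences = [
    ["stockholm", "is", "the", "capital", "of", "sweden"],
    ["oslo", "is", "the", "capital", "of", "norway"],
    ["harvard", "is", "a", "university"],
]

def train(sentences, use_skipgram=False):
    """Train a word2vec model with gensim (assumed to be installed)."""
    from gensim.models import Word2Vec
    # min_count=1 keeps every word of this tiny corpus;
    # sg=1 selects skip-gram, sg=0 continuous bag-of-words.
    return Word2Vec(sentences, min_count=1, sg=1 if use_skipgram else 0)
```

Calling `train(sentences)` returns the trained model. A corpus this small will not produce meaningful vectors; it only shows the shape of the input.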
Training a model with iterator
Quoted after Řehůřek
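For corpora that do not fit into memory, sentences can be streamed from disk instead of held in a list. The class below is a sketch in the spirit of the gensim tutorial, not a copy of the slide's code. The name `SentenceIterator`, the one-sentence-per-line layout and the lower-casing are assumptions.

```python
import os

class SentenceIterator:
    """Stream tokenised sentences from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # A fresh generator is created on every call, so the corpus
        # can be iterated several times (training needs multiple passes).
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.lower().split()
                    if tokens:  # skip empty lines
                        yield tokens
```

An instance of this class can be passed to the training call in place of an in-memory list of sentences.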
Doing it in C
• Download the code:
  git clone https://github.com/h10r/word2vec-macosx-maverics.git
• Run 'make' to compile word2vec tool
• Run the demo scripts:
  ./demo-word.sh
  and
  ./demo-phrases.sh
Testing a model
Quoted after Řehůřek
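The slide's code is missing from the transcript. The sketch below shows three quick sanity checks that can be run on a trained gensim model. The function name `probe` and the example words are assumptions, and every word queried must be in the model's vocabulary, otherwise gensim raises an error.

```python
def probe(model):
    """Run three quick sanity checks on a trained gensim word2vec model."""
    vectors = model.wv  # the word vectors of the trained model
    return {
        # 'man is to woman as king is to ?'
        "analogy": vectors.most_similar(
            positive=["woman", "king"], negative=["man"], topn=1),
        # which word does not belong with the others?
        "odd_one_out": vectors.doesnt_match(
            ["breakfast", "cereal", "dinner", "lunch"]),
        # cosine similarity between two words
        "similarity": vectors.similarity("woman", "man"),
    }
```

On a model trained on a large, general corpus one would hope to see "queen" as the analogy answer, but the actual results depend entirely on the training data.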
word2vec t-SNE JSON
1. Find Word Embeddings
2. Dimensionality Reduction
3. Output
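The three steps can be sketched as follows. This is an illustration under assumptions, not the code behind the slides: the function names and the JSON layout are invented, step 1 is assumed to have already produced a list of words and their vectors from a trained model, and numpy and scikit-learn are imported lazily, so they must be installed before `reduce_to_2d` is called.

```python
import json

def reduce_to_2d(vectors):
    """Step 2: project high-dimensional vectors to 2-D with t-SNE."""
    import numpy as np
    from sklearn.manifold import TSNE
    # perplexity has to stay below the number of samples,
    # so this needs at least a handful of words.
    tsne = TSNE(n_components=2,
                perplexity=min(30, len(vectors) - 1),
                random_state=0)
    return tsne.fit_transform(np.asarray(vectors)).tolist()

def to_json(words, points):
    """Step 3: pair each word with its 2-D coordinates and serialise."""
    return json.dumps([
        {"word": w, "x": float(x), "y": float(y)}
        for w, (x, y) in zip(words, points)
    ])
```

The resulting JSON string can be written to a file and loaded by a plotting front end of one's choice.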
Should be in Macports
py27-scikit-learn @0.15.2 (python, science)
Discussion:
Can anybody here think of ways
this might help her or him?
Further Reading
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. Efficient Estimation of Word Representations in
Vector Space. In Proceedings of Workshop at ICLR, 2013.
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado,
and Jeffrey Dean. Distributed Representations of Words
and Phrases and their Compositionality. In Proceedings of
NIPS, 2013.
• Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig.
Linguistic Regularities in Continuous Space Word
Representations. In Proceedings of NAACL HLT, 2013.
• Richard Socher - Deep Learning for NLP (without Magic)
  http://lxmls.it.pt/2014/?page_id=5
• Wang - Introduction to Word2vec and its application to find predominant word senses
  http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf
• Tomas Mikolov, Quoc V. Le, Ilya Sutskever - Exploiting Similarities among Languages for Machine Translation
  http://arxiv.org/abs/1309.4168
• Title Image by Hans Arp
Further Coding
• word2vec
  https://code.google.com/p/word2vec/
• word2vec for Mac OS X Mavericks
  https://github.com/h10r/word2vec-macosx-maverics
• Gensim Python Library
  https://radimrehurek.com/gensim/index.html
• Gensim Tutorials
  https://radimrehurek.com/gensim/tutorial.html
• Scikit-Learn TSNE
  http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Announcement of 18th IEEE International Conference on Software Testing, Verif...
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
 
María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024María Carolina Martínez - eCommerce Day Colombia 2024
María Carolina Martínez - eCommerce Day Colombia 2024
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
 
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
 

word2vec - From theory to practice

  • 1. Hendrik Heuer, Stockholm NLP Meetup. word2vec: From theory to practice
  • 4. “You shall know a word by the company it keeps” (J. R. Firth, 1957). Quoted after Socher
  • 8. Vectors are directions in space. Vectors can encode relationships.
  • 9. man is to woman as king is to ?
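The analogy on this slide can be read as vector arithmetic: take the offset from "man" to "woman", add it to "king", and look for the nearest remaining word. The following is a minimal sketch of that idea, assuming hand-made 3-dimensional toy vectors (invented here for illustration, not learned by word2vec) and a hypothetical helper named `analogy`.

```python
import math

# Toy vectors, made up for this sketch. Real word2vec vectors are learned
# from a corpus and typically have a hundred or more dimensions.
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity: compares the directions of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "woman", "king")
print(answer)  # queen
```

With learned vectors the match is rarely exact, which is why the nearest neighbour by cosine similarity is used rather than an equality check.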
  • 14. word2vec • by Mikolov, Sutskever, Chen, Corrado and Dean at Google • NIPS 2013 (the related analogy results appeared at NAACL HLT 2013) • takes a text corpus as input and produces word vectors as output
  • 17. word2vec • word meaning and relationships between words are encoded spatially • two main learning algorithms in word2vec: continuous bag-of-words and continuous skip-gram
  • 18. Goal
  • 19. continuous bag-of-words • predicting the current word based on the context • order of words in the history does not influence the projection • faster & more appropriate for larger corpora Mikolov et al.
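To make the bag-of-words idea concrete, the sketch below builds (context, target) training examples and averages the context vectors, which is why word order in the context does not matter. It is an illustration under assumptions, not the word2vec implementation: the function names `cbow_examples` and `average`, the window size, and the 2-dimensional toy embeddings are all made up here.

```python
def cbow_examples(tokens, window=2):
    """Pair each word with the words around it: (context words, target)."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

def average(context, embeddings):
    """Project a context by averaging its word vectors (order-independent)."""
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][d] for w in context) / len(context)
            for d in range(dim)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
examples = cbow_examples(tokens, window=2)
context, target = examples[2]  # context around "sat"

# Toy embeddings, invented for this sketch.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 2.0], "on": [0.5, 0.5]}
projection = average(context, embeddings)
shuffled_projection = average(list(reversed(context)), embeddings)
```

Here `projection` and `shuffled_projection` are identical, which is the "order of words does not influence the projection" property from the slide. A real model would go on to predict `target` from this projection.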
  • 20. continuous skip-gram • maximize classification of a word based on another word in the same sentence • better word vectors for infrequent words, but slower to train Mikolov et al.
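Skip-gram turns the prediction around: each word is used to predict the words near it. A minimal sketch of how the (center, context) training pairs could be generated is shown below. The function name `skipgram_pairs` and the fixed window are assumptions for illustration; details of the reference implementation such as a randomly shrunk window and subsampling of frequent words are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["you", "shall", "know", "a", "word"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Every sentence yields several pairs per word, one per context position, which is one reason this architecture tends to train more slowly than continuous bag-of-words.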
  • 21. Why it is awesome • there is a fast open-source implementation • the word vectors can be used as features for natural language processing tasks and machine learning algorithms
  • 23. Sentiment Analysis • Recursive Neural Tensor Network Quoted after Socher
  • 25. Using word2vec • Original: http://word2vec.googlecode.com/svn/trunk/ • C++11 version: https://github.com/jdeng/word2vec • Python: http://radimrehurek.com/gensim/models/word2vec.html • Java: https://github.com/ansjsun/word2vec_java • Parallel Java: https://github.com/siegfang/word2vec • CUDA version: https://github.com/whatupbiatch/cuda-word2vec Quoted after Wang
  • 26. Using it in Python
  • 28. Usage
  • 29. Training a model Quoted after Řehůřek
  • 30. Training a model with iterator Quoted after Řehůřek
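The code on this slide did not survive in the transcript. As a rough sketch of the idea, training from an iterator usually means streaming sentences from disk one at a time rather than loading the whole corpus into memory. The class name `LineSentences`, the one-sentence-per-line file layout and the whitespace tokenizer are assumptions for illustration, not the code from the slide.

```python
import os
import tempfile

class LineSentences:
    """Stream a corpus lazily: one whitespace-tokenized sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every call makes the object restartable,
        # so a trainer can make several passes over the corpus.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Demo on a small temporary corpus file.
with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")
    with open(corpus_path, "w", encoding="utf-8") as handle:
        handle.write("You shall know a word\n\nby the company it keeps\n")

    sentences = LineSentences(corpus_path)
    first_pass = list(sentences)
    second_pass = list(sentences)

# With gensim installed, an object like this could be passed on as
# gensim.models.Word2Vec(sentences=sentences, min_count=1).
```

A class with `__iter__` is used instead of a plain generator because a generator is exhausted after one pass, while training typically reads the corpus more than once.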
  • 31. Doing it in C • Download the code: git clone https://github.com/h10r/word2vec-macosx-maverics.git • Run 'make' to compile the word2vec tool • Run the demo scripts: ./demo-word.sh and ./demo-phrases.sh
  • 34. Testing a model Quoted after Řehůřek
  • 40. 1. Find Word Embeddings (word2vec) 2. Dimensionality Reduction (t-SNE) 3. Output (JSON)
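The last step of this pipeline, writing the reduced coordinates out as JSON, can be sketched as follows. The transcript does not show the JSON layout used in the talk, so the field names `word`, `x` and `y`, the helper `to_json`, and the coordinate values are all hypothetical. The 2-D points are assumed to come from an earlier dimensionality-reduction step such as t-SNE.

```python
import json

def to_json(words, coordinates):
    """Serialize words with their 2-D positions for a plotting front end."""
    points = [
        {"word": word, "x": round(x, 4), "y": round(y, 4)}
        for word, (x, y) in zip(words, coordinates)
    ]
    # ensure_ascii=False keeps non-ASCII words readable in the output.
    return json.dumps(points, ensure_ascii=False)

# Made-up coordinates standing in for the output of step 2.
words = ["Sweden", "Norway", "Harvard"]
coordinates = [(0.12, -1.5), (0.2, -1.4), (3.0, 2.25)]

payload = to_json(words, coordinates)
print(payload)
```

A flat list of labelled points is only one possible layout. It is used here because it is straightforward to load in a browser-based scatter plot.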
  • 46. Should be in MacPorts: py27-scikit-learn @0.15.2 (python, science)
  • 53. Discussion: Can anybody here think of ways this might help her or him?
  • 54. Further Reading • Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013. • Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013. • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
  • 55. Further Reading • Richard Socher - Deep Learning for NLP (without Magic) http://lxmls.it.pt/2014/?page_id=5 • Wang - Introduction to Word2vec and its application to find predominant word senses http://compling.hss.ntu.edu.sg/courses/hg7017/pdf/word2vec%20and%20its%20application%20to%20wsd.pdf • Tomas Mikolov, Quoc V. Le, Ilya Sutskever - Exploiting Similarities among Languages for Machine Translation http://arxiv.org/abs/1309.4168 • Title Image by Hans Arp
  • 56. Further Coding • word2vec https://code.google.com/p/word2vec/ • word2vec for Mac OS X Mavericks https://github.com/h10r/word2vec-macosx-maverics • Gensim Python Library https://radimrehurek.com/gensim/index.html • Gensim Tutorials https://radimrehurek.com/gensim/tutorial.html • Scikit-Learn TSNE http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html