
Vector representation of words


ML-India Talk on Vector representation of words by Chhaya Methani


  1. Vector representations of words – Chhaya Methani
  2. Natural Language Processing • Computers do not understand human language • How do humans learn languages? • Grammar • Vocabulary • Repetition • Tone of the text • Neurons fire in the brain to make sense of multiple factors
  3. Early approaches in NLP • Grammar-based models - Parse trees resembling the structure of a language - The Brown Corpus was hand-tagged to derive a set of rules defining the English language • Statistical models - Learn from corpora and estimate probabilities - N-gram models treat language as a sequence of words: P(s) = P(w1, w2, w3, ..., wn) - Unigram model: P(w1) * P(w2) * ... * P(wn) - Bigram model: P(w1|<s>) * P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1) - Trigram model: P(w1|<s>,<s>) * P(w2|w1,<s>) * P(w3|w2,w1) * ... * P(wn|wn-1,wn-2)
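The n-gram probabilities above can be estimated by simple counting. A minimal bigram-model sketch over a hypothetical toy corpus (the sentences and `<s>` start markers are illustrative, not from the talk):

```python
from collections import Counter

# Toy corpus; <s> marks sentence starts, as in the bigram model above.
sentences = [["<s>", "the", "cat", "sat"], ["<s>", "the", "dog", "sat"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def sentence_prob(sentence):
    """P(s) under the bigram model: the product of P(w_i | w_{i-1})."""
    p = 1.0
    for prev, w in zip(sentence, sentence[1:]):
        p *= p_bigram(w, prev)
    return p
```

On this toy corpus, "the" always follows `<s>`, so P(the | <s>) = 1, while P(cat | the) = 0.5; the sparsity problem mentioned above appears as soon as a test sentence contains a bigram never seen in training, which gets probability zero.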
  4. Early approaches in NLP • Challenges - Grammar-based methods • Too many rules in each language • Languages evolve • Problems compound for short text - Statistical methods • Too many sequences • Suffer from data sparsity when learning meaningful patterns • Manual tagging needed to learn semantic patterns from the data
  5. Technology Hype Curve – Natural Language Processing, 2014 (Source: Gartner Hype Cycle 2014)
  6. Technology Hype Curve – Natural Language Processing, 2016
  7. Word representations • "Bag-of-words" instead of a sequence • One-hot representation of words in vector space: Banking = [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...] AND Debt = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 ...] • Is it a problem? Yes: the dot product of any two distinct one-hot vectors is zero, so Banking and Debt look completely unrelated
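The problem can be seen directly. A small sketch (the vocabulary indices 5 and 15 for "Banking" and "Debt" are hypothetical) showing that distinct one-hot vectors always have zero dot product:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

banking = one_hot(5, 20)   # hypothetical index for "Banking"
debt = one_hot(15, 20)     # hypothetical index for "Debt"

# Distinct one-hot vectors never share a non-zero coordinate,
# so their dot product (and hence cosine similarity) is always 0:
similarity = banking @ debt
```

No matter how semantically related two words are, this representation assigns them similarity zero, which motivates the dense representations on the next slides.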
  8. Distributional word representations • "You shall know a word by the company it keeps" (J. R. Firth, 1957) • [Figure: context words surrounding "Banking" that characterize its meaning] • In a distributional representation, each word is represented as a dense vector, e.g. Banking = [0.2 0.1 1.3 0.5 ...]
  9. Word representations • End goal is to cluster semantically similar words together • [Figure: 2-d plot contrasting a cluster of semantically similar words with semantically distant ones]
  10. Similarity of documents: Term-document matrix • Rows are words w1 ... wm, columns are documents d1 ... dn; entry xij is the frequency of the ith word wi in the jth document dj • Indexing scenario: given a document, retrieve the documents most similar to it • Here each word is represented by the documents it occurs in, i.e. the document ids form the word representations
  11. Similarity of words: Word-context matrix • Rows are words w1 ... wm, columns are contexts c1 ... cn • Given a word, represent it with the contexts it appears with • Context: a bag of words, phrase, pattern, sentence, or document defining a word • The word vector here is the set of contexts that the word appears with
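A minimal sketch of collecting word-context counts with a sliding window (the toy corpus and window size are assumptions for illustration):

```python
from collections import defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2  # context = words within +/- 2 positions

# Word-context matrix as nested counts: contexts[w][c] = co-occurrence count.
contexts = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            contexts[w][corpus[j]] += 1

# "cat" and "dog" never co-occur, yet they share contexts such as "sat":
shared = set(contexts["cat"]) & set(contexts["dog"])
```

Words with similar meanings accumulate similar context rows, which is exactly the distributional hypothesis from slide 8 made operational.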
  12. Similarity of relationships between words: Pair-pattern matrix • Pairs of words co-occurring alongside similar patterns tend to have a similar relationship between them • More explicit understanding of context; can be used to interpret exact action items • [Table: word pairs such as carpenter:wood and mason:stone as rows, patterns such as "X uses Y", "X is made from Y", "Y was used by X" as columns; pairs sharing the same patterns have similar semantic relations]
  13. The term-document matrix • The usual measure of similarity is the cosine of column vectors in a weighted term-document matrix • Used to index terms in Lucene • Applications: document clustering, document retrieval, document classification, essay grading, question answering, document segmentation
  14. The word-context matrix • Word-context matrices are best suited to measuring the semantic similarity of words • The "Semantic Vectors" package uses this matrix to find semantically similar words • Applications: word clustering, word similarity, word classification, textual advertising, query expansion, word sense disambiguation
  15. The pair-pattern matrix • Pair-pattern matrices are best suited to measuring the semantic similarity of word pairs and patterns • "Latent Relational Analysis" uses the pair-pattern matrix to find relationally similar word pairs • Applications: pattern similarity, relation similarity, relational classification, relational search, analogical mapping, automatic thesaurus generation
  16. Linguistic processing for vector space models • Step 1, Tokenization: handle punctuation and hyphenation; ignore high-frequency words • Step 2, Normalization: case folding and stemming; differs among languages • Step 3, Annotation: part-of-speech tagging and word-sense tagging; yields large gains in IR performance
  17. Mathematical processing for vector space models • Generate a matrix - Term-document, word-context, or pair-pattern? - Unigrams or phrases as the basic unit? - Compute matrix values from term frequencies in the corpus • Adjust the weights - Assign higher weights to important terms - Typically using tf-idf, length normalization, PMI, PPMI, etc. • Smooth the matrix - Retaining the full sparse matrix is computationally intensive - Use matrix properties to represent information in latent spaces • Measure the similarity of vectors - Assess similarity of rows to get scores - Cosine similarity, dot product, Jaccard, etc.
  18. Frequency matrix • Each entry records that a certain item occurred in a certain situation a certain number of times • Scan the corpus sequentially, recording events and frequencies in a hash table • Use this data structure to generate the sparse matrix
  19. Weighting the elements • Surprising events have higher information content than expected events • tf-idf (term frequency x inverse document frequency) • Length normalization (to remove the bias in favor of longer documents) • Pointwise Mutual Information (PMI) • Positive Pointwise Mutual Information (PPMI)
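A sketch of tf-idf weighting on a tiny hypothetical count matrix (the counts are made up), using the common idf = log(N / df) form; a term that occurs in every document carries no information and is weighted to zero:

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 1.0, 1.0]])   # last term appears in every document

n_docs = X.shape[1]
df = (X > 0).sum(axis=1)          # document frequency of each term
idf = np.log(n_docs / df)         # rarer terms get higher weight
tfidf = X * idf[:, None]          # tf-idf: term frequency x inverse doc freq
```

The ubiquitous third term ends up with weight 0 everywhere, while the term concentrated in one document is boosted, matching the "surprising events have higher information content" principle above.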
  20. PMI & PPMI • pmiij = log( pij / (pi* p*j) ), and ppmiij = max(pmiij, 0) • pij is the estimated probability that the word wi occurs in the context cj • pi* is the estimated probability of the word wi • p*j is the estimated probability of the context cj • If wi and cj are statistically independent, then pi* p*j = pij and thus pmiij is zero • For an interesting semantic relation, we expect pij to be larger than it would be under independence • Rare events have a higher value of PMI
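These definitions translate directly into a few array operations; a sketch on a made-up 2x2 co-occurrence table (the counts are illustrative only):

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: contexts).
F = np.array([[10.0, 0.0],
              [ 5.0, 5.0]])

total = F.sum()
p = F / total                        # joint probabilities p_ij
p_w = p.sum(axis=1, keepdims=True)   # word marginals p_i*
p_c = p.sum(axis=0, keepdims=True)   # context marginals p_*j

with np.errstate(divide="ignore"):   # log(0) = -inf for unseen pairs
    pmi = np.log(p / (p_w * p_c))    # zero when word and context independent
ppmi = np.maximum(pmi, 0.0)          # PPMI clamps negatives (and -inf) to 0
```

Pairs that co-occur more often than independence predicts get positive PPMI; never-seen pairs, whose PMI is minus infinity, are clamped to zero.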
  21. Matrix smoothing • Computing the similarity between all pairs of vectors is a computationally intensive task • Only vectors that share a non-zero coordinate need to be compared • Highly weighted dimensions co-occur frequently with only very few words and have very high associations with them • Keeping a PMI threshold and comparing only with the top-200 similar words has shown a great decrease in computation with only a slight decrease in similarity score • Methods for projecting to latent spaces that uncover semantic similarity: - SVD (Singular Value Decomposition) - Project into a random lower-dimensional space to get dense vectors - Use machine learning to learn the weights in the reduced vector space
  22. SVD (Singular Value Decomposition) • X = UΣV^T, where U and V are column-orthonormal matrices and Σ is a diagonal matrix of singular values • Truncate to the top k singular values: X̂ = UkΣkVk^T • X̂ minimizes ||X̂ - X||F over all matrices X̂ of rank k, where ||...||F denotes the Frobenius norm • Keep the top k singular vectors and drop the rest, yielding a vector subspace that captures the latent relationships between vectors • SVD assumes a Gaussian distribution of the data, which is not true in practice; other approaches assume different data distributions, e.g. topic models • SVD on the term-document matrix gives LSI (document similarity); on the word-context matrix, LSA (word similarity)
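The truncation step can be sketched with numpy on a toy word-context matrix (the matrix values are made up); the rows of UkΣk serve as dense k-dimensional word vectors:

```python
import numpy as np

# Toy word-context matrix; the last two words have identical contexts.
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Dense k-dimensional word vectors: rows of U_k Sigma_k.
word_vecs = U[:, :k] * s[:k]
```

Words with identical context rows (the last two here) map to identical low-dimensional vectors, illustrating how the latent space groups distributionally similar words.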
  23. SVD (Singular Value Decomposition) contd. • Ways of looking at SVD: latent meaning, noise reduction, high-order co-occurrence, sparsity reduction
  24. Neural networks for learning word representations • Neural networks are a class of supervised learning algorithms • Matrix weights can be optimized w.r.t. a given label/word • Input is a set of words projected onto D dimensions using a projection matrix, e.g. from a one-hot encoding • The hidden layer learns feature importance • The output layer has dimensionality V • Matrices: VxD, DxH and HxV • The number of unknown parameters grows with the number of hidden layers in the net
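A forward pass through this architecture can be sketched with the matrix shapes named above (the sizes, random weights, and tanh nonlinearity are illustrative assumptions, not the talk's exact model):

```python
import numpy as np

# Hypothetical sizes: vocabulary V, projection dim D, hidden dim H.
V, D, H = 1000, 50, 100
rng = np.random.default_rng(0)

P = rng.normal(size=(V, D))    # projection matrix (V x D): one-hot -> dense
W1 = rng.normal(size=(D, H))   # projection-to-hidden weights (D x H)
W2 = rng.normal(size=(H, V))   # hidden-to-output weights (H x V)

x = np.zeros(V)
x[42] = 1.0                    # one-hot input word (index 42 is arbitrary)

h = np.tanh(x @ P @ W1)        # hidden layer activations
scores = h @ W2                # one score per vocabulary word
```

Note that multiplying a one-hot vector by P simply selects one row of P, so the rows of the projection matrix are the learned word vectors.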
  25. word2vec • Unsupervised algorithm to learn word embeddings • Two neural net architectures • CBOW predicts the word given its context: P(wt | wt-n ... wt+n) • The skip-gram model predicts the context given a word: P(wt-n ... wt+n | wt)
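The skip-gram training data is just (center, context) pairs extracted with a sliding window; a minimal sketch (toy sentence and window size are assumptions):

```python
# Generating (center, context) training pairs for the skip-gram model.
corpus = "the quick brown fox jumps".split()
window = 1  # context = one word on each side

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))
```

CBOW uses the same window but inverts the pairing, feeding the surrounding context words in to predict the center word.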
  26. King - man + woman = Queen! • Learnt word embeddings are found to have interesting properties in the vector space • Possible to understand relationships between words given context
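The analogy works by vector arithmetic followed by a nearest-neighbor lookup; a sketch with hand-crafted 2-d embeddings (real embeddings are learned and high-dimensional, and real systems typically exclude the query words from the candidates):

```python
import numpy as np

# Hypothetical 2-d embeddings: one axis roughly encodes gender, one royalty.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
}

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c)."""
    target = vecs[a] - vecs[b] + vecs[c]
    return min(vecs, key=lambda w: np.linalg.norm(vecs[w] - target))

result = analogy("king", "man", "woman")
```

Subtracting "man" removes the gender component and adding "woman" replaces it, landing on "queen".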
  27. Related concepts
  28. Comparing the vectors • Vector distance can be computed using dot product, Jaccard similarity, cosine similarity, etc. • The most popular way to measure the similarity of vectors is to take their cosine • Cosine captures the idea that the length of the vectors is irrelevant; the important thing is the angle between the vectors • A measure of distance between vectors can easily be converted to a measure of similarity by inversion or subtraction
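The length-invariance of cosine, and one common distance-to-similarity conversion, can be sketched as follows (the 1/(1+d) conversion is one illustrative choice of "inversion"):

```python
import numpy as np

def cosine_similarity(u, v):
    """Angle-based similarity: ignores vector length, keeps only direction."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])

# Scaling a vector leaves cosine unchanged: length is irrelevant.
same_direction = cosine_similarity(u, 10.0 * u)

# A distance can be converted to a similarity, e.g. sim = 1 / (1 + dist):
dist = np.linalg.norm(u - 10.0 * u)
sim_from_dist = 1.0 / (1.0 + dist)
```

Euclidean distance penalizes the scaled copy heavily, while cosine still reports perfect similarity; that is exactly the behavior wanted when only word meaning, not frequency, should matter.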
  29. Criticisms/shortcomings of word embeddings • No intuitive way to represent phrases and sentences - The word vectors are usually simply added up - Some papers learn phrase and sentence representations directly to overcome this • Each word has a single vector averaged over all its contexts - e.g. Bush, the president, and bush, the plant, will have the same vector! • Word order is not preserved - Sequence models handle this well
  30. Conclusion • Vector representations are good for understanding correspondence between concepts/words • Language modeling needs both the word itself and its surrounding context • Might be powerful in combination with word sense disambiguation, language-specific grammars, etc.
  31. Thank You
  32. References [1] Turney, Peter D., and Patrick Pantel. "From frequency to meaning: Vector space models of semantics." Journal of Artificial Intelligence Research 37.1 (2010): 141-188. [2] Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." HLT-NAACL. Vol. 13. 2013. [3] Mikolov, Tomas. "Statistical language models based on neural networks." Presentation at Google, Mountain View, 2nd April 2012. [4] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). [5] Mikolov, Tomas, and Jeffrey Dean. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems (2013).