Text Mining for Lexicography




Presentation during Lorentz workshop 'The future of academic lexicography'



  2. ABOUT ME
     • Master in Natural Language Processing, 2002
     • PhD in Information Retrieval, 2010
     • Postdoc at Radboud University, 2009-2017
     • Assistant Professor at Leiden University
     • Leiden Institute of Advanced Computer Science
     • Data Science Research Programme
     • Research group: Text Mining and Retrieval (TMR)
     Suzan Verberne 2019
  3. BEFORE I START…
     • Who has a background in linguistics?
     • Who has experience with programming in Python?
     • Who is familiar with the vector space model?
     • Who is familiar with word embeddings?
     • Who is familiar with logistic regression?
     • Who is familiar with artificial neural networks?
  4. QUESTIONS
     • “Can Big Data Analytics solve the current bottle neck in the continuous updating of dictionaries?”
     • Text mining: automatic extraction of knowledge from text
       • Text = unstructured
       • Knowledge = structured
  5. TEXT MINING FOR LEXICOGRAPHY
     • “How can we automatically extract structured information from the constant stream of text data on the web and social media in particular?”
     • Structured lexical information:
       • Discovery and selection of new lemmas
       • New meanings of existing lemmas
       • Collocations / multi-word expressions
  6. TEXT MINING & LEXICOGRAPHY
     • Discovery and selection of new lemmas
       • Sørensen, N. H., & Nimb, S. (2018). Word2Dict – Lemma Selection and Dictionary Editing Assisted by Word Embeddings. In The XVIII EURALEX International Congress (p. 146).
     • Trend analysis: change in the meaning of existing words
       • Mitra, S., Mitra, R., Maity, S. K., Riedl, M., Biemann, C., Goyal, P., & Mukherjee, A. (2015). An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering, 21(5), 773-798.
     • Extraction of collocations / multi-word expressions
       • Nimb, S., Lorentzen, H., & Sørensen, N. H. (2019). Updating the dictionary: semantic change identification based on change in bigrams over time. Presented at the Workshop on collocations.
  8. TASK AND RESEARCH QUESTIONS (Sørensen & Nimb, 2018)
     • Example: Den Danske Ordbog (DDO)
     • Task: “augmenting the lemmata of an existing dictionary by adding either completely new or formerly neglected lemmas”
     • “How do you in a fast and consistent way compare new lemma candidates to already described lemmas within the same semantic field in order to ensure the consistency of the definitions?”
  9. TOOL FOR LEMMA SELECTION (Sørensen & Nimb, 2018)
  10. WORD2DICT (Sørensen & Nimb, 2018)
     • A lexicographic tool based on a word embedding model
     • Goal: “to present a number of words that are most semantically related to the lemma that the lexicographer is describing”
  11. WORD EMBEDDINGS
  12. WHERE TO START
     • Linguistics: distributional hypothesis
     • Data science: Vector Space Model (VSM)
  13. DISTRIBUTIONAL HYPOTHESIS
     • Harris, Z. (1954). Distributional structure. Word, 10(2-3), 146–162.
     • The context of a word defines its meaning
     • Words that occur in similar contexts tend to be similar
  14. VECTOR SPACE MODEL
     • Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
     • Documents and queries are represented in a vector space
     • The dimensions of that space are the words
  15. VECTOR SPACE MODEL
     • In the vector space model, we can model similarity as closeness
     • The closer two documents are in the space, the more similar they are
     • We can compute the similarity between two points/vectors using a metric for distance or angle
     • Most used metric: cosine similarity, the cosine of the angle θ between two vectors
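The cosine similarity named on this slide can be sketched in a few lines of plain Python. The three document vectors and the three-term vocabulary are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the vocabulary [bicycle, bike, bank]
doc_a = [2, 0, 1]
doc_b = [4, 0, 2]   # same direction as doc_a, twice as long
doc_c = [0, 3, 0]   # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0: no shared terms
```

Because the measure depends only on the angle, a document and a twice-as-long copy of it count as maximally similar, which is usually what one wants when documents differ in length.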
  16. VECTOR SPACE MODEL
     Linguistic issues with the vector space model:
     • Synonymy: multiple ways to refer to the same concept, e.g. bicycle and bike
     • Polysemy/homonymy: most words have more than one distinct meaning, e.g. bank, bass, chips
  17. VECTOR SPACE MODEL
     Computational issues with the vector space model:
     • The vector representations are high-dimensional (easily 10,000 dimensions, one for each term in the collection)
     • The vector representations are sparse (a given document contains only a fraction of those 10,000 terms; the other dimensions have a 0 value)
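To make the sparsity point concrete, here is a minimal sketch that builds term-count vectors for a made-up three-document collection. The documents and the helper name `to_vector` are illustrative assumptions; a real collection would have thousands of dimensions rather than fifteen.

```python
from collections import Counter

# Hypothetical mini-collection; a real one easily has 10,000+ distinct terms.
documents = [
    "the bank approved the loan",
    "she rode her bike along the river bank",
    "a bicycle is also called a bike",
]

# One dimension per distinct term in the collection.
vocabulary = sorted({term for doc in documents for term in doc.split()})

def to_vector(doc):
    """Term-count vector with one entry per vocabulary term."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc) for doc in documents]
for vec in vectors:
    print(f"{vec.count(0)} of {len(vec)} dimensions are zero")
```

Even in this tiny collection most entries of every vector are zero, and "bicycle" and "bike" end up in unrelated dimensions, which is the synonymy problem from the previous slide.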
  18. VECTOR SPACE MODEL
  19. VECTOR SPACE MODEL
  20. WORD EMBEDDINGS
     • Word embeddings are dense representations of words
  21. WORD EMBEDDINGS
     • Word embedding models represent (embed) words in a continuous vector space
     • The vector space is relatively low-dimensional (100-400 dimensions instead of tens of thousands)
     • Semantically and syntactically similar words are mapped to nearby points because the representations are learned from word occurrences in context (distributional hypothesis)
  22. PCA projection of a 320-dimensional vector space
  24. WORD2VEC
  25. WHAT IS WORD2VEC?
     • Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text
     • Intuition:
       • Train a classifier on a binary prediction task (on a text without labels!): “Is word w likely to show up near the word bicycle?”
       • We don’t actually care about this prediction task; instead we’ll take the learned classifier weights as the word embeddings
  26. WHERE DOES IT COME FROM
     • Neural network language model (NNLM) (Bengio et al., 2003)
     • Mikolov proposed to learn word vectors using a neural network with a single hidden layer (Mikolov et al., 2013): word2vec
     • Many neural architectures and models have been proposed for computing word vectors:
       • GloVe (2014) - Global Vectors for Word Representation
       • FastText (2017) - Enriching Word Vectors with Subword Information
       • ELMo (2018) - Deep contextualized word representations
       • BERT (2019) - Bidirectional Encoder Representations from Transformers
  27. WORD2VEC
     • Starting point: large collection (e.g. 10 million words)
     • First step: extract the vocabulary (e.g. 10,000 terms)
     • Goal: to represent each of these 10,000 terms as a dense, lower-dimensional vector (typically 100-400 dimensions)
     • Idea: to use the contexts of words to learn their meaning
  28. TRAINING WORD2VEC
     Training task: binary classification of words in the text
     1. Treat the target word and a neighboring context word as positive examples
     2. Randomly sample other words in the lexicon to get negative samples
     3. Train a classifier to distinguish those two cases
  29. TRAINING WORD2VEC (Jurafsky and Martin. Speech and Language Processing, 3rd edition, 2019)
     • This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances
     • Negative examples are artificially generated by randomly sampling other words from the lexicon
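How the positive and negative training instances come about can be sketched as follows. The sentence fragment and the small lexicon are assumed stand-ins for the example on the slide, the function names are hypothetical, and the uniform sampling is a simplification (word2vec itself samples negatives from a smoothed unigram distribution).

```python
import random

def positive_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token within +/- window positions."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

def negative_pairs(target, lexicon, k, rng):
    """Pair the target with k randomly sampled lexicon words (uniform sampling)."""
    candidates = [w for w in lexicon if w != target]
    return [(target, rng.choice(candidates)) for _ in range(k)]

# Assumed example sentence fragment and lexicon, for illustration only
tokens = "lemon a tablespoon of apricot jam a pinch".split()
lexicon = ["aardvark", "my", "where", "coaxial", "seven", "forever", "dear", "if"]

apricot_positives = [pair for pair in positive_pairs(tokens) if pair[0] == "apricot"]
apricot_negatives = negative_pairs(
    "apricot", lexicon, k=2 * len(apricot_positives), rng=random.Random(0)
)
print(apricot_positives)
print(apricot_negatives)
```

With a window of ±2 the target word apricot yields four positive pairs; here two negatives are drawn per positive, a ratio that is a tunable choice rather than a fixed part of the method.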
  30. TRAINING WORD2VEC
     • The classifier is a neural network with one hidden layer
     • A logistic (sigmoid) function turns the network's output score into a probability; the hidden layer itself is a linear projection
     • The learned weights of the hidden layer are the embeddings
     Figure labels: sparse vector, dense vector (embeddings)
  31. TRAINING WORD2VEC
     • The weights on the nodes in the hidden layer are initialized randomly and are updated while the model processes the collection
     • The outcome of the classification determines how much we adjust the current word vector
     • Gradually, the vectors converge to sensible descriptors (embeddings) for words
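The update described on these two slides can be illustrated with a single logistic-regression step on one (target, context) pair. This is a simplified sketch, not the word2vec reference implementation: the three-dimensional vectors, the learning rate, the word choices and the function name `sgd_step` are all assumptions, and small fixed values stand in for the random initialization so that the run is reproducible.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(target_vec, context_vec, label, lr=0.1):
    """One logistic-regression update for a (target, context) pair, in place.

    label is 1 for a positive pair and 0 for a negative sample.
    Returns the predicted probability before the update.
    """
    score = sum(t * c for t, c in zip(target_vec, context_vec))
    prob = sigmoid(score)
    error = prob - label  # gradient of the log loss with respect to the score
    for i in range(len(target_vec)):
        t, c = target_vec[i], context_vec[i]
        target_vec[i] -= lr * error * c
        context_vec[i] -= lr * error * t
    return prob

# Small fixed start values standing in for random initialization
apricot = [0.10, -0.20, 0.05]   # target word
jam = [-0.10, 0.10, 0.20]       # a true context word (positive example)
seven = [0.20, 0.10, -0.10]     # a sampled word (negative example)

for _ in range(200):
    p_jam = sgd_step(apricot, jam, label=1)
    p_seven = sgd_step(apricot, seven, label=0)

print(round(p_jam, 3), round(p_seven, 3))
```

After repeated updates the classifier assigns a high probability to the true context word and a low one to the sampled word; the vectors that made this possible are what gets kept as embeddings.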
  32. LANGUAGE MODELLING
     • The word prediction task is called language modelling
     • Traditional n-gram model: given the previous n-1 words, predict the next word
     • Neural language models can handle much longer histories, and they can generalize over contexts of similar words
     • The resulting embeddings are referred to as a language model
     • Importantly, the context classification is not an aim in itself: it is just an auxiliary task for learning vector representations that are useful for other tasks
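For contrast with the neural approach, a traditional count-based language model fits in a few lines. The sketch below is a bigram model, which conditions on a single previous word; the toy text and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Most frequent follower of prev, or None if prev was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Hypothetical toy text
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)

print(predict_next(model, "the"))
print(predict_next(model, "dog"))
```

The model has nothing to say about a word it has never seen as a history ("dog"), whereas a model built on embeddings can fall back on words that occur in similar contexts.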
  33. ADVANTAGES OF WORD2VEC
     • It scales
       • Train on billion-word corpora
       • In limited time
       • Possibility of parallel training
     • Pre-trained word embeddings trained by one can be used by others
       • For entirely different tasks
     • Incremental training
       • Train on one piece of data, save results, continue training later on
     • There is a Python module for it:
       • Gensim word2vec
  34. WHAT CAN YOU DO WITH IT?
  35. GENSIM WORD2VEC
     • Implementation in the Python package gensim:

         import gensim
         # sentences: an iterable of tokenized sentences (lists of strings)
         model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

     • size: the dimensionality of the feature vectors (common: 100, 200 or 320); this parameter is named vector_size in gensim 4.0 and later
     • window: the maximum distance between the current and predicted word within a sentence
     • min_count: minimum number of occurrences of a word in the corpus to be included in the model
     • workers: number of worker threads, for parallelization on multicore machines
  36. GENSIM WORD2VEC
     Sørensen, N. H., & Nimb, S. (2018):
     • We used the version of the word2vec algorithm implemented in the Gensim Python package
     • to train a model based on the Danish corpus used by the lexicographers of DDO
     • The corpus included at the time of the training roughly 920 million running words, mainly newswire, but also, material from magazines, transcripts from the Danish Parliament, and some fiction, among others, spanning the years 1982 to 2017
     • We trained the model with 500 features, a window size of five, a minimum occurrence of five for all types
     • The corpus included 6.3 million types, five million of which occurred less than five times
     • The training took roughly 18 hours on a 2017 MacBook Pro
  37. DO IT YOURSELF

         # In current gensim versions the query methods live on model.wv
         model.wv.most_similar('apple')
         # [('banana', 0.8571481704711914), ...]
         model.wv.doesnt_match("breakfast cereal dinner lunch".split())
         # 'cereal'
         model.wv.similarity('woman', 'man')
         # 0.73723527 (cosine similarity)
  38. WHAT CAN YOU DO WITH IT?
     • Mining knowledge about natural language
     • Improve NLP applications
  39. WHAT CAN YOU DO WITH IT?
     • Mining knowledge about natural language
       • Learning semantic and syntactic relations
  40. WHAT CAN YOU DO WITH IT?
     • A is to B as C is to ?
     • This is the famous example: vector(king) - vector(man) + vector(woman) = vector(queen)
     • Actually, what the original paper says is: if you subtract the vector for ‘man’ from the one for ‘king’ and add the vector for ‘woman’, the vector closest to the one you end up with turns out to be the one for ‘queen’
     • More interesting: France is to Paris as Germany is to …
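The "closest vector" formulation can be sketched with hand-made vectors. Everything here is a toy assumption: the three dimensions (roughly royalty, male, female), the numbers and the function names are invented for illustration, whereas real embeddings are learned and have hundreds of dimensions. Excluding the three input words from the candidates is common practice in implementations of this query.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """A is to B as C is to ?

    Returns the word whose vector is closest (by cosine) to
    vector(b) - vector(a) + vector(c), excluding the three input words.
    """
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hand-made toy vectors: [royalty, male, female]
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}

print(analogy(toy_vectors, "man", "king", "woman"))
```

The result is a nearest neighbour of the computed point, not an exact equality, which is the nuance the slide attributes to the original paper.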
  41. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  42. WHAT CAN YOU DO WITH IT?
     • A is to B as C is to ?
     • It also works for syntactic relations:
       • vector(biggest) - vector(big) + vector(small) = vector(smallest)
  43. WHAT CAN YOU DO WITH IT?
     • Mining knowledge about natural language
       • Learning semantic and syntactic relations
       • Selecting out-of-the-list words
         • Example: which word does not belong in [monkey, lion, dog, truck]
       • Selectional preferences
         • Example: predict typical verb-noun pairs: people as subject of eating is more likely than people as object of eating
       • Discover new words
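Selecting the out-of-the-list word can be sketched as finding the word that is least similar to the average of the group. This is one plausible way to do it, written from scratch rather than copied from any library; the two-dimensional toy vectors (roughly animal, vehicle) and the function names are assumptions for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the mean of the group."""
    unit = {w: normalize(vectors[w]) for w in words}
    dim = len(next(iter(unit.values())))
    mean = [sum(unit[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(unit[w], mean))

# Hand-made toy vectors: [animal, vehicle]
toy_vectors = {
    "monkey": [0.9, 0.1],
    "lion":   [0.8, 0.0],
    "dog":    [0.9, 0.2],
    "truck":  [0.1, 0.9],
}

print(odd_one_out(toy_vectors, ["monkey", "lion", "dog", "truck"]))
```

Normalizing each vector before averaging keeps one long vector from dominating the mean.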
  44. Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763), 95.
  45. WHAT CAN YOU DO WITH IT?
     • Improve NLP applications:
       • Sentence completion/text prediction/reply suggestion
       • Bilingual Word Embeddings for Machine Translation with LSTMs
       • (Near-)Synonym detection (query expansion)
       • Concept representation of texts
         • Example: Twitter sentiment classification
       • Document similarity
         • Example: cluster news articles per news event
  46. WORD EMBEDDINGS AS FEATURES
     • NLP models take word embeddings as low-level representation of words
       • Word embeddings as input for convolutional neural networks in text categorization
       • Word embeddings as input for recurrent neural networks in sequence labelling
     • Since 2018: word embeddings are used as language models that can be fine-tuned towards any natural language processing task
  47. CONCLUSIONS
  48. SUMMARY
     Text Mining for Lexicography
     • Discovery and selection of new lemmas
     • Word2Dict: tool for lemma selection (Sørensen & Nimb 2018)
     • Word embeddings
       • Distributional hypothesis
       • Vector space model
       • From sparse to dense representations
       • Neural language modelling
       • Practical use in the gensim package
  49. FURTHER READING
     • T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
     • model/
     • Visualisation of embeddings models:

Editor's Notes

  • Because the data we are interested in is text data and we want to mine knowledge from those text data
  • Figure 1: A search for ananasjuice (‘pineapple juice’). To the left (the top half of the interface) we see the most similar words according to the context in which they appear in a corpus. Frequency counts for each word and whether or not the word is included in DDO are also displayed. The frequency counts are color-coded for quicker visual decoding: the darker the color, the higher the frequency. To the right (the bottom half of the interface), definitions of the words already in the dictionary are shown, as well as their editorial status (e.g. “publiceret” (‘published’)) and the similarity score from the model (e.g. “0.75”, “0.71”; 1.0 equals identical).
  • One-hot encoding