Text Mining for Lexicography

WORKSHOP: ‘THE FUTURE OF ACADEMIC LEXICOGRAPHY’
TEXT MINING FOR
LEXICOGRAPHY
SUZAN VERBERNE 2019

ABOUT ME
 Master in Natural Language Processing, 2002
 PhD in Information Retrieval, 2010
 Postdoc at Radboud University, 2009-2017
 Assistant Professor at Leiden University
 Leiden Institute of Advanced Computer Science
 Data Science Research Programme
 Research group: Text Mining and Retrieval (TMR)

BEFORE I START…
 Who has a background in linguistics?
 Who has experience with programming in Python?
 Who is familiar with the vector space model?
 Who is familiar with word embeddings?
 Who is familiar with logistic regression?
 Who is familiar with artificial neural networks?

QUESTIONS
 “Can Big Data Analytics solve the current bottle neck in the
continuous updating of dictionaries?”
 Text mining: Automatic extraction of knowledge from text
 Text = unstructured
 Knowledge = structured

TEXT MINING FOR LEXICOGRAPHY
 “How can we automatically extract structured information from the
constant stream of text data on the web and social media in
particular?”
 Structured lexical information:
 Discovery and selection of new lemmas
 New meanings of existing lemmas
 Collocations / multi-word expressions

TEXT MINING & LEXICOGRAPHY
 Discovery and selection of new lemmas
 Sørensen, N. H., & Nimb, S. (2018). Word2Dict–Lemma Selection and
Dictionary Editing Assisted by Word Embeddings. In The XVIII EURALEX
International Congress (p. 146).
 Trend analysis: Change in the meaning of existing words
 Mitra, S., Mitra, R., Maity, S. K., Riedl, M., Biemann, C., Goyal, P., &
Mukherjee, A. (2015). An automatic approach to identify word sense
changes in text media across timescales. Natural Language Engineering,
21(5), 773-798.
 Extraction of collocations/multiword expressions
 Sanni Nimb, Henrik Lorentzen & Nicolai Hartvig Sørensen (2019): “Updating
the dictionary: semantic change identification based on change in bigrams
over time”. Presented in Workshop on collocations.
https://elex.link/elex2019/programme/workshop-on-collocations/

DISCOVERY AND SELECTION OF
NEW LEMMAS

TASK AND RESEARCH QUESTIONS
 Example: Den Danske Ordbog (DDO)
 Task: “augmenting the lemmata of an existing dictionary by adding
either completely new or formerly neglected lemmas”
 “How do you in a fast and consistent way compare new lemma
candidates to already described lemmas within the same semantic
field in order to ensure the consistency of the definitions?”
Sørensen, N. H., & Nimb, S. (2018)

TOOL FOR LEMMA SELECTION

WORD2DICT
 A lexicographic tool based on a word embedding model
 Goal: “to present a number of words that are most semantically
related to the lemma that the lexicographer is describing”

WORD EMBEDDINGS

WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)

DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word, 10(2–3): 146–162
 The context of a word defines its meaning
 Words that occur in similar contexts tend to be similar

VECTOR SPACE MODEL
 Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for
automatic indexing. Communications of the ACM, 18(11), 613-620.
 Documents and queries represented in a vector space
 Where the dimensions are the words
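A minimal sketch of this representation, assuming plain term counts as vector values (the slide does not say which weighting is used). The toy documents and the helper `to_vector` are invented for illustration.

```python
from collections import Counter

# Toy documents, invented for illustration (not from the slides).
docs = [
    "the bike is red",
    "the bicycle is red",
    "the bank is closed",
]

# The vocabulary defines the dimensions of the space: one per distinct word.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(text):
    """Represent a text as a list of term counts, one entry per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

# Documents and queries end up in the same space.
doc_vectors = [to_vector(doc) for doc in docs]
query_vector = to_vector("red bike")

print(vocabulary)
for doc, vec in zip(docs, doc_vectors):
    print(vec, doc)
print(query_vector, "(query)")
```

Note that "bike" and "bicycle" get separate dimensions here, so the first two documents look less alike than they are.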

VECTOR SPACE MODEL
 In the vector space model, we can model similarity as closeness
 The closer two documents are in the space, the more similar they
are
 We can compute the similarity between two points/vectors using
a metric for distance or angle
 Most used metric: cosine similarity, the cosine of the angle θ
between two vectors
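Cosine similarity can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the example vectors are illustrative; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides prescribe.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (lists of numbers)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

doc_a = [1, 1, 0, 0]
doc_b = [2, 2, 0, 0]  # same direction as doc_a, twice as long
doc_c = [0, 0, 1, 1]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0: orthogonal vectors
```

Because only the angle matters, a long document and a short one with the same term proportions count as maximally similar.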

VECTOR SPACE MODEL
Linguistic issues with the vector space model:
 synonymy: multiple ways to refer to the same concept, e.g. bicycle
and bike
 polysemy/homonymy: most words have more than one distinct
meaning, e.g. bank, bass, chips

VECTOR SPACE MODEL
 Computational issues with the vector space model:
 The vector representations are high-dimensional (easily 10,000
dimensions – one for each term in the collection)
 The vector representations are sparse (a given document only contains
a fraction of those 10,000 terms – the other dimensions have a 0 value)
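To make the sparsity concrete, here is a small sketch that stores only the non-zero dimensions of a document vector. The vocabulary size of 10,000 is the figure from the slide; the example sentence is invented.

```python
from collections import Counter

vocabulary_size = 10_000  # figure taken from the slide
document = "the bike is red and the bike is fast"  # invented example

# Sparse representation: keep only the terms that occur (non-zero dimensions).
sparse_vector = Counter(document.split())

non_zero = len(sparse_vector)
density = non_zero / vocabulary_size

print(dict(sparse_vector))
print(f"{non_zero} of {vocabulary_size} dimensions are non-zero ({density:.2%})")
```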


WORD EMBEDDINGS
 Word embeddings are dense representations of words

WORD EMBEDDINGS
 Word embedding models represent (embed) words in a
continuous vector space
 The vector space is relatively low-dimensional (100 – 400
dimensions instead of 10,000s)
 Semantically and syntactically similar words are mapped to nearby
points because the representations are learned from word
occurrences in context (Distributional Hypothesis)

[Figure: PCA projection of 320-dimensional vector space]

WHAT IS WORD2VEC?
 Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
 Intuition:
 Train a classifier on a binary prediction task (on a text without labels!):
“Is word w likely to show up near the word bicycle?”
 We don’t actually care about this prediction task; instead we’ll take the
learned classifier weights as the word embeddings

WHERE DOES IT COME FROM
 Neural network language model (NNLM) (Bengio et al., 2003)
 Mikolov proposed to learn word vectors using a neural network
with a single hidden layer (Mikolov et al., 2013)  word2vec
 Many neural architectures and models have been proposed for
computing word vectors
 GloVe (2014) - Global Vectors for Word Representation
 FastText (2017) - Enriching Word Vectors with Subword Information
 ELMo (2018) - Deep contextualized word representations
 BERT (2019) - Bidirectional Encoder Representations from Transformers

WORD2VEC
 Starting point: large collection (e.g. 10 million words)
 First step: extract the vocabulary (e.g. 10,000 terms)
 Goal: to represent each of these 10,000 terms as a dense, lower-
dimensional vector (typically 100-400 dimensions)
 Idea: to use the contexts of words to learn their meaning

TRAINING WORD2VEC
 Training task: binary classification of words in the text
1. Treat the target word and a neighboring context word as positive
examples
2. Randomly sample other words in the lexicon to get negative samples
3. Train a classifier to distinguish those two cases

TRAINING WORD2VEC
 This example has a target word t (apricot), and 4 context words in
the L = ±2 window, resulting in 4 positive training instances
 Negative examples are artificially generated:
Jurafsky and Martin. Speech and Language Processing (3rd edition, 2019)
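The construction of training instances can be sketched as follows. The sentence fragment and the noise words are assumed to be those of the cited Jurafsky and Martin chapter; the helper names are hypothetical. Negatives are drawn uniformly here for simplicity, whereas word2vec samples them according to a smoothed word-frequency distribution.

```python
import random

def positive_pairs(tokens, target_index, window=2):
    """Pair the target word with every word at most `window` positions away."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    target = tokens[target_index]
    return [(target, tokens[j]) for j in range(start, end) if j != target_index]

def negative_pairs(target, context_words, vocabulary, k, rng):
    """Pair the target with k randomly drawn words that are not in its context."""
    candidates = [w for w in vocabulary if w != target and w not in context_words]
    return [(target, rng.choice(candidates)) for _ in range(k)]

tokens = "lemon a tablespoon of apricot jam a pinch".split()
target_index = tokens.index("apricot")

# Four positive instances from the +/-2 window around "apricot".
positives = positive_pairs(tokens, target_index, window=2)
context_words = {context for _, context in positives}

# Illustrative noise vocabulary; two negatives per positive is an arbitrary choice.
noise_vocabulary = ["aardvark", "puddle", "where", "coaxial",
                    "twelve", "hello", "dear", "forever"]
rng = random.Random(0)
negatives = negative_pairs("apricot", context_words, noise_vocabulary,
                           k=2 * len(positives), rng=rng)

print("positive:", positives)
print("negative:", negatives)
```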

TRAINING WORD2VEC
 The classifier is a neural network with one hidden layer
 The logistic function turns the score of a (target, context) pair
into a probability; the hidden layer itself is a linear projection
 The regression weights are the embeddings
[Figure: sparse vector  dense vector (embeddings)]

TRAINING WORD2VEC
 The weights on the nodes in the hidden layer get random
initializations and get updated while the model processes the
collection
 The outcome of the classification determines whether we adjust
the current word vector
 Gradually, the vectors converge to sensible descriptors
(embeddings) for words
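A rough sketch of this update loop, under simplifying assumptions: one target word, one positive and one negative context word, 5-dimensional vectors, and plain stochastic gradient descent on the logistic loss. The names `train_pair` and `probability` and all hyperparameters are illustrative, not taken from word2vec's actual implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_pair(target_vec, context_vec, label, lr):
    """One gradient step of logistic regression on a (target, context) pair.

    label is 1 for an observed pair and 0 for a negative sample.
    Both vectors are updated in place; the size of the prediction
    error determines how far they move.
    """
    error = sigmoid(dot(target_vec, context_vec)) - label
    for i in range(len(target_vec)):
        t, c = target_vec[i], context_vec[i]
        target_vec[i] -= lr * error * c
        context_vec[i] -= lr * error * t

rng = random.Random(42)
dim = 5
vocab = ["apricot", "jam", "aardvark"]

# Random initialization of target and context vectors.
target_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
context_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def probability(target, context):
    """Predicted probability that `context` occurs near `target`."""
    return sigmoid(dot(target_vecs[target], context_vecs[context]))

before = (probability("apricot", "jam"), probability("apricot", "aardvark"))

for _ in range(500):
    train_pair(target_vecs["apricot"], context_vecs["jam"], 1, lr=0.1)       # positive
    train_pair(target_vecs["apricot"], context_vecs["aardvark"], 0, lr=0.1)  # negative

after = (probability("apricot", "jam"), probability("apricot", "aardvark"))
print("before training:", before)
print("after training: ", after)
```

After training, the vector in `target_vecs["apricot"]` plays the role of the embedding for that word.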

LANGUAGE MODELLING
 The word prediction task is called language modelling
 Traditional n-gram model: given the previous n−1 words, predict the next
word
 Neural language models can handle much longer histories, and they can
generalize over contexts of similar words
 The resulting embeddings are referred to as a language model
 It is important that the context classification here is not an aim in
itself: it is just an auxiliary task to learn vector representations good
for other tasks
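For contrast with the neural approach, a traditional count-based model can be sketched as a bigram model, in which the history is a single preceding word. The toy corpus and the function names are invented for illustration; there is no smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = [
    "the dictionary lists new words",
    "the dictionary lists new meanings",
    "the editor lists new words",
]

# Count how often each word follows each one-word history.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for previous, current in zip(tokens, tokens[1:]):
        next_word_counts[previous][current] += 1

def predict_next(previous):
    """Most frequent continuation of `previous`, or None if it was never seen."""
    counts = next_word_counts.get(previous)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def probability(previous, current):
    """Relative frequency estimate of P(current | previous)."""
    counts = next_word_counts.get(previous)
    total = sum(counts.values()) if counts else 0
    return counts[current] / total if total else 0.0

print(predict_next("new"))
print(probability("new", "words"))
print(predict_next("bicycle"))
```

A history that never occurred in the corpus ("bicycle") gives no prediction at all, which illustrates why generalizing over contexts of similar words is useful.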

ADVANTAGES OF WORD2VEC
 It scales
 Train on billion word corpora
 In limited time
 Possibility of parallel training
 Word embeddings pre-trained by one party can be reused by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save results, continue training later on
 There is a Python module for it:
 Gensim word2vec

WHAT CAN YOU DO WITH IT?

GENSIM WORD2VEC
 Implementation in Python package gensim
import gensim
# sentences: an iterable of tokenized sentences (lists of strings)
# vector_size was called size before gensim 4.0
model = gensim.models.Word2Vec(sentences, vector_size=100,
                               window=5, min_count=5,
                               workers=4)
vector_size: the dimensionality of the feature vectors (common: 100, 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: for parallelization with multicore machines

GENSIM WORD2VEC
Sørensen, N. H., & Nimb, S. (2018):
 We used the version of the word2vec algorithm implemented in the Gensim Python
package
 to train a model based on the Danish corpus used by the lexicographers of DDO
 The corpus included at the time of the training roughly 920 million running words,
mainly newswire, but also, material from magazines, transcripts from the Danish
Parliament, and some fiction, among others, spanning the years 1982 to 2017
 We trained the model with 500 features, a window size of five, a minimum occurrence
of five for all types
 The corpus included 6.3 million types, five million of which occurred less than five times
 The training took roughly 18 hours on a 2017 MacBook Pro

DO IT YOURSELF
model.wv.most_similar('apple')
 [('banana', 0.8571481704711914), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
 'cereal'
model.wv.similarity('woman', 'man')
 0.73723527 (cosine similarity)
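What these three calls compute can be approximated in plain Python on a handful of hand-made vectors. This is a simplified sketch, not gensim's code: the 3-dimensional vectors are invented, and `doesnt_match` is implemented here as "lowest cosine similarity to the mean of the length-normalized vectors".

```python
import math

# Invented toy embeddings; real ones would come from a trained model.
vectors = {
    "breakfast": [0.9, 0.1, 0.0],
    "lunch":     [0.8, 0.2, 0.1],
    "dinner":    [0.85, 0.15, 0.05],
    "cereal":    [0.3, 0.9, 0.1],
    "apple":     [0.1, 0.8, 0.5],
    "banana":    [0.1, 0.7, 0.6],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity(w1, w2):
    return cosine(vectors[w1], vectors[w2])

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(other, similarity(word, other)) for other in vectors if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

def doesnt_match(words):
    """Return the word least similar to the mean of the normalized vectors."""
    unit = {}
    for w in words:
        norm = math.sqrt(sum(x * x for x in vectors[w]))
        unit[w] = [x / norm for x in vectors[w]]
    dims = len(next(iter(unit.values())))
    mean = [sum(unit[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(unit[w], mean))

print(most_similar("apple", topn=2))
print(doesnt_match("breakfast cereal dinner lunch".split()))
print(similarity("breakfast", "lunch"))
```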

 Mining knowledge about natural language
 Improve NLP applications

 Learning semantic and syntactic relations

 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’
 More interesting:
France is to Paris as Germany is to …

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances
in Neural Information Processing Systems, pages 3111–3119, 2013.

 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
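The analogy procedure (subtract, add, then take the nearest remaining word) can be sketched on hand-made vectors. The 4-dimensional toy vectors below are constructed so that the arithmetic works out exactly; trained embeddings only approximate this. The function name `analogy` is illustrative.

```python
import math

# Invented toy vectors; dimensions loosely stand for
# [royalty, gender, size, superlative].
vectors = {
    "king":     [1.0,  1.0,  0.0, 0.0],
    "queen":    [1.0, -1.0,  0.0, 0.0],
    "man":      [0.1,  1.0,  0.0, 0.0],
    "woman":    [0.1, -1.0,  0.0, 0.0],
    "big":      [0.0,  0.0,  1.0, 0.0],
    "biggest":  [0.0,  0.0,  1.0, 1.0],
    "small":    [0.0,  0.0, -1.0, 0.0],
    "smallest": [0.0,  0.0, -1.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vector(b) - vector(a) + vector(c)."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

print(analogy("man", "king", "woman"))    # king - man + woman
print(analogy("big", "biggest", "small")) # biggest - big + small
```

Excluding the input words matters: without that step the nearest vector is often one of the inputs themselves.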

 Learning semantic and syntactic relations
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Discover new words

Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A.
(2019). Unsupervised word embeddings capture latent knowledge from materials
science literature. Nature, 571(7763), 95.
https://github.com/materialsintelligence/mat2vec

 Improve NLP applications:
 Sentence completion/text prediction/reply suggestion
 Bilingual Word Embeddings for Machine Translation with LSTMs
 (Near-)Synonym detection ( query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
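One simple way to get a document representation from word embeddings is to average the vectors of the words in the text and compare the averages with cosine similarity. This is a common baseline assumed here for illustration; the slides do not state which method is meant. The toy vectors and headlines are invented.

```python
import math

# Invented toy embeddings; dimensions loosely stand for [sports, politics, weather].
vectors = {
    "match":    [0.9, 0.1, 0.0],
    "goal":     [0.8, 0.0, 0.1],
    "team":     [0.9, 0.2, 0.0],
    "election": [0.1, 0.9, 0.0],
    "vote":     [0.0, 0.8, 0.1],
    "minister": [0.1, 0.9, 0.1],
    "wins":     [0.5, 0.5, 0.0],
}
DIMENSIONS = 3

def document_vector(text):
    """Average the embeddings of all known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * DIMENSIONS
    return [sum(vec[i] for vec in known) / len(known) for i in range(DIMENSIONS)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for texts without known words
    return dot / (norm_a * norm_b)

headlines = [
    "team wins match",
    "team scores late goal",
    "minister wins election vote",
]
doc_vecs = [document_vector(h) for h in headlines]

print(cosine(doc_vecs[0], doc_vecs[1]))  # two sports headlines
print(cosine(doc_vecs[0], doc_vecs[2]))  # sports versus politics
```

The two sports headlines share only the word "team", yet their averaged vectors are close, which is the kind of signal a clustering step could use.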

WORD EMBEDDINGS AS FEATURES
 NLP models take word embeddings as low-level
representation of words
 Word embeddings as input for convolutional neural
networks in text categorization
 Word embeddings as input for recurrent neural networks
in sequence labelling
 Since 2018: word embeddings are used as language
models that can be fine-tuned towards any natural
language processing task

CONCLUSIONS

SUMMARY
Text Mining for Lexicography
 Word2Dict: tool for lemma selection (Sørensen & Nimb 2018)
 Word embeddings
 Distributional hypothesis
 Vector space model
 From sparse to dense representations
 Neural language modelling
 Practical use in the gensim package

FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 https://rare-technologies.com/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
 http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
 Visualisation of embeddings models: https://projector.tensorflow.org/

 http://tmr.liacs.nl
