WORKSHOP: ‘THE FUTURE OF ACADEMIC LEXICOGRAPHY’
TEXT MINING FOR LEXICOGRAPHY
SUZAN VERBERNE 2019
ABOUT ME
 Master in Natural Language Processing, 2002
 PhD in Information Retrieval, 2010
 Postdoc at Radboud University, 2009-2017
 Assistant Professor at Leiden University
 Leiden Institute of Advanced Computer Science
 Data Science Research Programme
 Research group: Text Mining and Retrieval (TMR)
Suzan Verberne 2019
BEFORE I START…
 Who has a background in linguistics?
 Who has experience with programming in Python?
 Who is familiar with the vector space model?
 Who is familiar with word embeddings?
 Who is familiar with logistic regression?
 Who is familiar with artificial neural networks?
QUESTIONS
 “Can Big Data Analytics solve the current bottle neck in the
continuous updating of dictionaries?”
 Text mining: Automatic extraction of knowledge from text
 Text = unstructured
 Knowledge = structured
TEXT MINING FOR LEXICOGRAPHY
 “How can we automatically extract structured information from the
constant stream of text data on the web and social media in
particular?”
 Structured lexical information:
 Discovery and selection of new lemmas
 New meanings of existing lemmas
 Collocations / multi-word expressions
TEXT MINING & LEXICOGRAPHY
 Discovery and selection of new lemmas
 Sørensen, N. H., & Nimb, S. (2018). Word2Dict–Lemma Selection and
Dictionary Editing Assisted by Word Embeddings. In The XVIII EURALEX
International Congress (p. 146).
 Trend analysis: Change in the meaning of existing words
 Mitra, S., Mitra, R., Maity, S. K., Riedl, M., Biemann, C., Goyal, P., &
Mukherjee, A. (2015). An automatic approach to identify word sense
changes in text media across timescales. Natural Language Engineering,
21(5), 773-798.
 Extraction of collocations/multiword expressions
 Sanni Nimb, Henrik Lorentzen & Nicolai Hartvig Sørensen (2019): “Updating
the dictionary: semantic change identification based on change in bigrams
over time”. Presented in Workshop on collocations.
https://elex.link/elex2019/programme/workshop-on-collocations/
DISCOVERY AND SELECTION OF
NEW LEMMAS
TASK AND RESEARCH QUESTIONS
 Example: Den Danske Ordbog (DDO)
 Task: “augmenting the lemmata of an existing dictionary by adding
either completely new or formerly neglected lemmas”
 “How do you in a fast and consistent way compare new lemma
candidates to already described lemmas within the same semantic
field in order to ensure the consistency of the definitions?”
Sørensen, N. H., & Nimb, S. (2018)
TOOL FOR LEMMA SELECTION
Sørensen, N. H., & Nimb, S. (2018)
WORD2DICT
 A lexicographic tool based on a word embedding model
 Goal: “to present a number of words that are most semantically
related to the lemma that the lexicographer is describing”
Sørensen, N. H., & Nimb, S. (2018)
WORD EMBEDDINGS
WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)
DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word, 10(2-3), 146–162
 The context of a word defines its meaning
 Words that occur in similar contexts tend to be similar
VECTOR SPACE MODEL
 Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for
automatic indexing. Communications of the ACM, 18(11), 613-620.
 Documents and queries represented in a vector space
 Where the dimensions are the words
VECTOR SPACE MODEL
 In the vector space model, we can model similarity as closeness
 The closer two documents are in the space, the more similar they are
 We can compute the similarity between two points/vectors using a metric for distance or angle
 Most used metric: cosine similarity, the cosine of the angle θ between two vectors
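As a minimal sketch of the cosine metric named above, the function below computes the cosine of the angle between two vectors in plain Python. The three toy document vectors and the 4-term vocabulary they imply are assumptions made for illustration; they do not come from the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy term-count vectors over a hypothetical 4-term vocabulary
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same direction as doc_a, only longer
doc_c = [0, 0, 3, 1]  # shares no terms with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because the cosine depends only on the angle, doc_b scores as maximally similar to doc_a even though it is twice as long, while doc_c, which shares no terms with doc_a, scores zero.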
VECTOR SPACE MODEL
Linguistic issues with the vector space model:
 synonymy: multiple ways to refer to the same concept, e.g. bicycle
and bike
 polysemy/homonyms: most words have more than one distinct
meaning, e.g. bank, bass, chips
VECTOR SPACE MODEL
 Computational issues with the vector space model:
 The vector representations are high-dimensional (easily 10,000
dimensions – one for each term in the collection)
 The vector representations are sparse (a given document only contains
a fraction of those 10,000 terms – the other dimensions have a 0 value)
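The two computational issues above can be made concrete with a small sketch. The three example sentences are invented for illustration; a real collection would have a vocabulary of tens of thousands of terms, so the share of zeros per document would be far higher than here.

```python
from collections import Counter

# Hypothetical mini-collection (invented sentences)
docs = {
    "d1": "the bank raised the interest rate",
    "d2": "we sat on the river bank",
    "d3": "she rides her bike to the bank",
}

# One dimension per distinct term in the collection
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def to_vector(text):
    """Term-count vector of a document over the full vocabulary."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {name: to_vector(text) for name, text in docs.items()}

def sparsity(vector):
    """Fraction of dimensions that are zero."""
    return sum(1 for x in vector if x == 0) / len(vector)

sparsity_d1 = sparsity(vectors["d1"])
```

Even in this tiny collection, most dimensions of each document vector are zero, and every new term in the collection adds one more dimension to every vector.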
WORD EMBEDDINGS
 Word embeddings are dense representations of words
WORD EMBEDDINGS
 Word embedding models represent (embed) words in a continuous vector space
 The vector space is relatively low-dimensional (100-400 dimensions instead of tens of thousands)
 Semantically and syntactically similar words are mapped to nearby
points because the representations are learned from word
occurrences in context (Distributional Hypothesis)
[Figure: PCA projection of a 320-dimensional vector space]
WORD2VEC
WHAT IS WORD2VEC?
 Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
 Intuition:
 Train a classifier on a binary prediction task (on a text without labels!):
“Is word w likely to show up near the word bicycle?”
 We don’t actually care about this prediction task; instead we’ll take the
learned classifier weights as the word embeddings
WHERE DOES IT COME FROM
 Neural network language model (NNLM) (Bengio et al., 2003)
 Mikolov et al. (2013) proposed to learn word vectors using a neural network with a single hidden layer: word2vec
 Many neural architectures and models have been proposed for
computing word vectors
 GloVe (2014) - Global Vectors for Word Representation
 FastText (2017) - Enriching Word Vectors with Subword Information
 ELMo (2018) - Deep contextualized word representations
 BERT (2019) - Bidirectional Encoder Representations from Transformers
WORD2VEC
 Starting point: large collection (e.g. 10 million words)
 First step: extract the vocabulary (e.g. 10,000 terms)
 Goal: to represent each of these 10,000 terms as a dense, lower-dimensional vector (typically 100-400 dimensions)
 Idea: to use the contexts of words to learn their meaning
TRAINING WORD2VEC
 Training task: binary classification of words in the text
1. Treat the target word and a neighboring context word as positive
examples
2. Randomly sample other words in the lexicon to get negative samples
3. Train a classifier to distinguish those two cases
TRAINING WORD2VEC
 Example: a target word t (apricot) with 4 context words in the L = ±2 window gives 4 positive training instances
 Negative examples are artificially generated by pairing the target word with randomly sampled other words
Source: Jurafsky and Martin. Speech and Language Processing (3rd edition, 2019)
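The construction of positive and negative training instances can be sketched as follows. The token sequence follows the apricot example in Jurafsky and Martin; the extra vocabulary words, the helper names, and the choice of two negatives per positive are assumptions for illustration. The sketch samples negatives uniformly, whereas word2vec samples them from a smoothed unigram distribution (frequency raised to the power 0.75).

```python
import random

tokens = ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"]
# Hypothetical extra vocabulary items to draw negatives from
vocabulary = sorted(set(tokens) | {"aardvark", "puddle", "where", "coaxial"})

def positive_pairs(tokens, target_index, window=2):
    """(target, context) pairs for all words within +/- window of the target."""
    target = tokens[target_index]
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return [(target, tokens[i]) for i in range(lo, hi) if i != target_index]

def negative_pairs(target, k, vocabulary, rng):
    """k (target, random word) pairs; uniform sampling is a simplification."""
    candidates = [w for w in vocabulary if w != target]
    return [(target, rng.choice(candidates)) for _ in range(k)]

rng = random.Random(0)
positives = positive_pairs(tokens, tokens.index("apricot"))
negatives = negative_pairs("apricot", k=2 * len(positives),
                           vocabulary=vocabulary, rng=rng)
```

A randomly drawn negative can occasionally coincide with a true context word. In large vocabularies this is rare, and the sketch does not filter for it.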
TRAINING WORD2VEC
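 The classifier is a neural network with one hidden layer
 The hidden layer itself is linear; a logistic (sigmoid) function at the output turns the score for a word pair into a probability
 The learned weights that map the sparse input vector to the dense hidden layer are the embeddings
[Figure: sparse vector (input) mapped to a dense vector (embeddings)]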
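A minimal sketch of the classifier's decision, assuming the skip-gram formulation with negative sampling: the probability that a context word is a true neighbour of the target is the logistic function applied to the dot product of the two word vectors. The 3-dimensional vectors below are invented values, not trained embeddings.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def p_positive(target_vec, context_vec):
    """Estimated probability that the context word occurs near the target."""
    score = sum(t * c for t, c in zip(target_vec, context_vec))
    return sigmoid(score)

# Hypothetical embeddings (invented for illustration)
apricot = [0.9, 0.1, 0.4]
jam = [0.8, 0.2, 0.5]
aardvark = [-0.7, 0.6, -0.3]

p_jam = p_positive(apricot, jam)
p_aardvark = p_positive(apricot, aardvark)
```

Vectors that point in similar directions give a large dot product and hence a probability above 0.5. That is why words sharing contexts end up close together in the space.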
TRAINING WORD2VEC
 The weights on the nodes in the hidden layer get random
initializations and get updated while the model processes the
collection
 The outcome of the classification determines whether we adjust
the current word vector
 Gradually, the vectors converge to sensible descriptors
(embeddings) for words
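The update loop can be sketched as plain stochastic gradient descent on the logistic loss. This is an illustrative simplification under stated assumptions: one target word, one positive and one negative context word, 5 dimensions, a learning rate of 0.1, and hypothetical function names. It is not the optimised routine used in word2vec implementations.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_step(target, context, label, lr=0.1):
    """One in-place update for a (target, context) pair.

    label is 1 for an observed pair and 0 for a negative sample.
    """
    error = sigmoid(dot(target, context)) - label
    for i in range(len(target)):
        t_i, c_i = target[i], context[i]  # keep old values for both updates
        target[i] = t_i - lr * error * c_i
        context[i] = c_i - lr * error * t_i

# Random initialisation of the vectors
rng = random.Random(42)
dim = 5
apricot = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
jam = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
aardvark = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

p_jam_before = sigmoid(dot(apricot, jam))
p_aardvark_before = sigmoid(dot(apricot, aardvark))

for _ in range(200):
    sgd_step(apricot, jam, label=1)       # observed neighbour: pull together
    sgd_step(apricot, aardvark, label=0)  # negative sample: push apart

p_jam_after = sigmoid(dot(apricot, jam))
p_aardvark_after = sigmoid(dot(apricot, aardvark))
```

The size of each adjustment is proportional to the classification error, so a pair that is already classified correctly barely moves the vectors.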
LANGUAGE MODELLING
 The word prediction task is called language modelling
 Traditional n-gram model: given the previous n-1 words, predict the next word
 Neural language models can handle much longer histories, and they can generalize over contexts of similar words
 The resulting model, including its embeddings, is referred to as a language model
 It is important that the context classification here is not an aim in
itself: it is just an auxiliary task to learn vector representations good
for other tasks
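For comparison with the neural approach, a traditional count-based model can be sketched in a few lines. The example below is a bigram model (n = 2), which predicts the next word from the single previous word; the three training sentences are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus (invented sentences)
corpus = [
    "the lexicographer edits the dictionary",
    "the lexicographer adds a new lemma",
    "the lexicographer checks a new word",
]

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for previous, following in zip(tokens, tokens[1:]):
        bigram_counts[previous][following] += 1

def next_word_probabilities(previous):
    """Relative frequencies of the words observed after `previous`."""
    counts = bigram_counts[previous]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs_after_the = next_word_probabilities("the")
prediction = max(probs_after_the, key=probs_after_the.get)
```

A model like this assigns zero probability to any word pair it has not seen. The generalisation over contexts of similar words mentioned above is the main thing neural language models add.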
ADVANTAGES OF WORD2VEC
 It scales
 Train on billion word corpora
 In limited time
 Possibility of parallel training
 Word embeddings pre-trained by one party can be reused by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save results, continue training later on
 There is a Python module for it:
 Gensim word2vec
WHAT CAN YOU DO WITH IT?
GENSIM WORD2VEC
 Implementation in the Python package gensim
import gensim
# sentences: an iterable of tokenized sentences (each a list of strings)
# gensim 4.x names the dimensionality parameter vector_size;
# gensim 3.x and earlier called it size
model = gensim.models.Word2Vec(sentences, vector_size=100,
                               window=5, min_count=5,
                               workers=4)
vector_size: the dimensionality of the feature vectors (common: 100, 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: number of worker threads, for parallelization on multicore machines
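To illustrate the expected shape of the `sentences` input and the effect of `min_count`, here is a plain-Python sketch of vocabulary pruning. The helper `build_vocabulary` is hypothetical and only mimics the idea; it is not gensim's internal code, and the three tokenized sentences are invented.

```python
from collections import Counter

# Input format: an iterable of tokenized sentences (lists of strings)
sentences = [
    ["new", "words", "enter", "the", "dictionary"],
    ["the", "dictionary", "describes", "words"],
    ["the", "lexicographer", "selects", "new", "words"],
]

def build_vocabulary(sentences, min_count):
    """Keep only the types that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

vocab_all = build_vocabulary(sentences, min_count=1)
vocab_pruned = build_vocabulary(sentences, min_count=2)
```

Rare types have too few contexts to learn a reliable vector from, so a frequency threshold typically removes a large share of the types while removing only a small share of the running words.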
GENSIM WORD2VEC
Sørensen, N. H., & Nimb, S. (2018):
 We used the version of the word2vec algorithm implemented in the Gensim Python
package
 to train a model based on the Danish corpus used by the lexicographers of DDO
 The corpus included at the time of the training roughly 920 million running words,
mainly newswire, but also, material from magazines, transcripts from the Danish
Parliament, and some fiction, among others, spanning the years 1982 to 2017
 We trained the model with 500 features, a window size of five, a minimum occurrence
of five for all types
 The corpus included 6.3 million types, five million of which occurred less than five times
 The training took roughly 18 hours on a 2017 MacBook Pro
DO IT YOURSELF
# In gensim 4.x these methods live on model.wv (the KeyedVectors object)
model.wv.most_similar('apple')
→ [('banana', 0.8571481704711914), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
→ 'cereal'
model.wv.similarity('woman', 'man')  # cosine similarity
→ 0.73723527
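What these queries compute can be sketched without a trained model. The 3-dimensional vectors below are invented for illustration, and the two functions are simplified stand-ins for the idea behind the gensim methods of the same name, not their actual implementation.

```python
import math

# Hypothetical embeddings (invented values)
vectors = {
    "breakfast": [0.9, 0.1, 0.0],
    "lunch": [0.8, 0.2, 0.1],
    "dinner": [0.85, 0.15, 0.05],
    "cereal": [0.3, 0.9, 0.1],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def most_similar(word, topn=3):
    """Other words ranked by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

def doesnt_match(words):
    """The word whose vector is least similar to the mean of the group."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

nearest = most_similar("breakfast", topn=1)[0][0]
odd_one_out = doesnt_match(["breakfast", "cereal", "dinner", "lunch"])
```

Both queries reduce to cosine comparisons: a nearest-neighbour ranking in one case, and distance from the group average in the other.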
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Improve NLP applications
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Learning semantic and syntactic relations
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’
 More interesting:
France is to Paris as Germany is to …
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances
in Neural Information Processing Systems, pages 3111–3119, 2013.
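The analogy arithmetic can be sketched with toy vectors. The three dimensions (loosely: royalty, male, female) and all values are invented so that the arithmetic works out exactly; trained embeddings only approximate this. As is common practice in analogy evaluation, the three input words are excluded from the candidate answers.

```python
import math

# Hypothetical embeddings (invented values)
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via vector(b) - vector(a) + vector(c)."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

answer = analogy("man", "king", "woman")
```

With a trained gensim model, the corresponding query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`.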
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Learning semantic and syntactic relations
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Discover new words
Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A.
(2019). Unsupervised word embeddings capture latent knowledge from materials
science literature. Nature, 571(7763), 95.
https://github.com/materialsintelligence/mat2vec
WHAT CAN YOU DO WITH IT?
 Improve NLP applications:
 Sentence completion/text prediction/reply suggestion
 Bilingual Word Embeddings for Machine Translation with LSTMs
 (Near-)synonym detection (for query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
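One simple way to obtain a document representation from word embeddings is to average the vectors of the words in the document. This is a common baseline, sketched here under assumptions: the slides do not say which method is meant, and the 2-dimensional embeddings and example headlines are invented.

```python
import math

# Hypothetical embeddings (invented values)
embeddings = {
    "election": [0.9, 0.1],
    "vote": [0.8, 0.2],
    "minister": [0.85, 0.1],
    "match": [0.1, 0.9],
    "goal": [0.2, 0.8],
    "striker": [0.1, 0.85],
}

def document_vector(tokens):
    """Average of the embeddings of all known words; None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_politics_1 = document_vector("minister calls early election".split())
doc_politics_2 = document_vector("voters vote in election".split())
doc_sports = document_vector("striker scores late goal".split())

sim_same_topic = cosine(doc_politics_1, doc_politics_2)
sim_other_topic = cosine(doc_politics_1, doc_sports)
```

Unlike the sparse term vectors of the vector space model, these averaged vectors can place two documents close together even when they share few literal words.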
WORD EMBEDDINGS AS FEATURES
 NLP models take word embeddings as low-level
representation of words
 Word embeddings as input for convolutional neural
networks in text categorization
 Word embeddings as input for recurrent neural networks
in sequence labelling
 Since 2018: word embeddings are used as language
models that can be fine-tuned towards any natural
language processing task
CONCLUSIONS
SUMMARY
Text Mining for Lexicography
 Discovery and selection of new lemmas
 Word2Dict: tool for lemma selection (Sørensen & Nimb 2018)
 Word embeddings
 Distributional hypothesis
 Vector space model
 From sparse to dense representations
 Neural language modelling
 Practical use in the gensim package
FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 https://rare-technologies.com/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
 http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
 Visualisation of embeddings models: https://projector.tensorflow.org/
 http://tmr.liacs.nl


Editor's Notes

  • #5 Because the data we are interested in is text data and we want to mine knowledge from those text data
  • #10 Figure 1: A search for ananasjuice (‘pineapple juice’). To the left (the top-half of the interface) we see the most similar words according to the context in which they appear in a corpus. Frequency counts for each word and whether or not the word is included in DDO is also displayed. The frequency counts are color-coded for quicker visual decoding: the darker the color, the higher the frequency. To the right (the bottom-half of the interface), definitions of the words already in the dictionary are shown, as well as their editorial status (e.g. “publiceret” (‘published’)) and the similarity score from the model (e.g. “0.75”, “0.71” – 1.0 equals identical).
  • #20 One-hot encoding