Introduction to Text Analytics
and Natural Language
Processing
Nick Grattan
Application Architecture Director, Dassault Systèmes
PhD Student, Insight Centre for Data Analytics, University College Cork
www.3ds.com insight-centre.org/
Cork AI Meetup 15th March 2018
www.meetup.com/Cork-AI/
@NickGrattan
“You shall know a word by the
company it keeps”
J.R. Firth (1957)
Agenda
• Introduction to Text Analytics.
• Overview of common techniques and the types of problems commonly
solved.
• Traditional “Frequentist” text analysis
• Bag-of-Words and Vector Space Models (High Dimensional, Low Density)
• Measuring document / text similarity with distance metrics and clustering
documents.
• Hands-On: Document Clustering with Python, NLTK and Scipy.
• Word Embeddings with word2vec for semantic term analysis
• Unsupervised semantic analysis using a corpus of words
• Hands-On: Creating a semantic space with a Neural Network in TensorFlow
Natural Language Processing and Text
Analytics
• Natural Language Processing (NLP)
• Area of AI concerned with interactions between computers and human natural
language, to process or “understand” natural language
• Common tasks: speech recognition, natural language understanding & generation,
automatic summarization, part-of-speech tagging, disambiguation, named entity
recognition …
• To fully understand and represent the meaning of language is a difficult goal (AI-
Complete) [1]
• Text Analytics (Text Mining):
• The process or practice of examining large collections of written resources in order
to generate new information (Oxford English Dictionary)
• Transforms text to data for information discovery, establishing relationships, often
using NLP
Text Preparation
• Extract text from documents
• E.g. use “BeautifulSoup” in Python to process HTML/XML documents
• Process terms (words) from text
• Tokenisation – breaks text into discrete terms
• Stop Words – remove common words (“the”, “and” etc.)
• Stemming – Reduce words to their root or base form ("fishing", "fished", and
"fisher" => "fish")
• E.g. “NLTK” (Natural Language Toolkit) in Python
• All, some, or none of these techniques may be used, depending on the
application
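A minimal preparation sketch with NLTK (the sample sentence is invented, and exact stems depend on the stemmer chosen):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')       # tokeniser model ('punkt_tab' on newer NLTK releases)
nltk.download('stopwords')   # stop word lists

text = "The fisher fished while fishing near the river."

tokens = word_tokenize(text.lower())                             # tokenisation
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stops]   # stop word removal
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                         # stemming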
Bag-of-Words & Jaccard Similarity
• Bag-of-Words is the set of terms found
in a document, corpus etc.
• Jaccard Similarity between two Bag-of-
Words, A & B:
• Ratio of the Intersection length over the
Union length of the two sets
• ‘1’ – Identical, ‘0’ – Dissimilar
• Simple & quick to calculate
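A sketch of the calculation on two toy sentences:

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B|: 1 for identical sets, 0 for no shared terms
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the log".split()
print(jaccard_similarity(doc1, doc2))   # 3 shared terms / 7 in the union ≈ 0.43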
Term Frequencies (TF) & Vector Space Models
• Term Frequency (TF)
• Count of term
occurrences in a document
• Vector Space Model
• Dimension for each Term in Vocabulary
• Map documents into this space
Very high dimensionality, low density
For many documents, many dimensions
will be zero
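A hand-rolled sketch of term-frequency vectors (toy corpus; real vocabularies run to tens of thousands of dimensions):

from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = sorted({term for d in docs for term in d.split()})   # one dimension per term

def tf_vector(doc):
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]   # mostly zeros as the vocabulary grows

for d in docs:
    print(tf_vector(d))   # e.g. [1, 0, 0, 1, 1, 1, 2] for the first document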
Distance Measures
• Distance between two documents in a vector space model
• Two common measures: Euclidean and Cosine
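Both are available in SciPy, as used in the hands-on session; a sketch with two small TF vectors:

from scipy.spatial.distance import cosine, euclidean

u = [1, 0, 0, 1, 1, 1, 2]
v = [1, 1, 1, 0, 0, 0, 2]
print(euclidean(u, v))   # straight-line distance; sensitive to document length
print(cosine(u, v))      # 1 - cos(angle); length-independent, usually preferred for text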
Term Frequency / Inverse Document
Frequency (TF/IDF)
• Term Frequency / Inverse Document Frequency (TF/IDF)
• Reflects how important a word is to a document in a corpus
• Increases proportionally with the number of times the word appears in the
document
• Offset by the frequency of the word in the corpus
• Adjusts for words that appear more frequently in general
See: https://deeplearning4j.org/bagofwords-tf-idf
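One common variant of the weighting, sketched by hand (libraries differ in smoothing and normalisation details):

import math

def tf_idf(term, doc, docs):
    tf = doc.count(term)                     # term frequency in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return tf * math.log(len(docs) / df)     # rare terms score higher

docs = [d.split() for d in ["the cat sat", "the dog ran", "the cat ran"]]
print(tf_idf("cat", docs[0], docs))   # > 0: "cat" appears in 2 of 3 documents
print(tf_idf("the", docs[0], docs))   # 0: "the" appears in every document, log(1) = 0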
Distance Matrices & Clustering
• Square, symmetrical matrix with pair-wise distances between
documents in a corpus
• Used for clustering documents, e.g.
• K-Means clustering
• Hierarchical clustering (Ward algorithm commonly used)
See: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
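A minimal SciPy sketch on toy vectors (note Ward linkage is formally defined for Euclidean distances, though cosine is common for text):

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

vectors = [[1, 0, 0, 1, 1], [1, 0, 0, 1, 0], [0, 1, 1, 0, 0]]
condensed = pdist(vectors, metric='euclidean')     # condensed pairwise distance matrix
tree = linkage(condensed, method='ward')           # hierarchical cluster tree
print(fcluster(tree, t=2, criterion='maxclust'))   # e.g. [1 1 2]: first two documents group together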
Edit Distance
• Number of inserts / deletes / substitutions to
transform one document to another
• Weights may be applied to different types of
edit
• E.G. Terms that are semantically related may have
a lower weight
• Levenshtein Edit Distance may be solved using
Dynamic Programming
• Allows document alignments to be produced
• But: Expensive in time and space!
https://wordpress.com/read/feeds/71910664/posts/1718047915
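A dynamic-programming sketch; a and b can be strings or lists of terms, and the O(len(a) × len(b)) table is where the time and space cost comes from:

def levenshtein(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i   # i deletions
    for j in range(len(b) + 1):
        dp[0][j] = j   # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost (could be weighted)
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete
                           dp[i][j - 1] + 1,          # insert
                           dp[i - 1][j - 1] + cost)   # substitute or match
    return dp[-1][-1]

print(levenshtein("fished", "fishing"))   # 3: e->i, d->n, insert g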
Document Retrieval with MinHash & Locality-
Sensitive Hashing (LSH)
• Problem: How to retrieve similar documents from a large corpus
• MinHash:
• “Document Fingerprint” with n hash values (n ≈ 200)
• Characteristic: Similar documents have similar hash values
• Use Jaccard similarity to measure MinHash similarity, and hence document similarity
• Independent of document size, small storage and retrieval costs
• Locality-Sensitive Hashing (LSH):
• For large numbers of documents
• Organizes documents represented by MinHash into buckets
• Documents within a bucket are similar
• Reduces retrieval time, good for duplicate / near-duplicate document detection
etc.
https://nickgrattandatascience.wordpress.com/2013/11/12/minhash-implementation-in-c/
https://nickgrattandatascience.wordpress.com/2017/12/31/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c/
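A sketch of the idea, simulating the n hash functions by salting Python's built-in hash (stable only within one process; a production version would use proper hash functions, e.g. via the datasketch library):

def minhash(terms, n=200):
    # fingerprint[k] = minimum of the k-th hash over the document's term set
    return [min(hash((k, t)) for t in set(terms)) for k in range(n)]

def minhash_similarity(sig_a, sig_b):
    # fraction of matching positions estimates the Jaccard similarity of the term sets
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("the cat sat on the mat".split())
b = minhash("the cat sat on the log".split())
print(minhash_similarity(a, b))   # ≈ 0.67, the true Jaccard similarity (4 / 6)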
Natural Language Processing (NLP)
• Techniques described thus far are Text Analytical
• Numerical in nature, take little account of the meaning of
text
• Terms are numerically encoded symbols
• NLP attempts to understand text
• Semantics – The meaning of a word based on how / where
it’s used
• Part of Speech (POS) Tagging – Understanding the
construction of sentences, phrases etc.
• Word Relatedness & Concepts: Wordnet -
https://wordnet.princeton.edu/
E.g. Homonym Problem:
Words with the same spelling but
different meanings, depending
on how / where they are used
E.g. Disambiguation: “Like” as a verb (“Fruit
flies like to eat bananas”), “like” as a preposition
(“Fruit flies that look like a banana”)
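A quick check of the disambiguation example with NLTK's POS tagger (tag names follow the Penn Treebank set; the required resource name and exact tags vary with the NLTK release):

import nltk
nltk.download('averaged_perceptron_tagger')

print(nltk.pos_tag("Fruit flies like to eat bananas".split()))
print(nltk.pos_tag("Fruit flies that look like a banana".split()))
# "like" should tag as a verb (VBP) in the first sentence and a preposition (IN) in the second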
Word Embeddings – word2vec
• Unsupervised semantic analysis from
corpus of terms
• Define number of dimensions for the
semantic space (e.g. 300)
• Window: Define number of words before
/ after (e.g. 1, 2 or 5) the target word
• Generate Training Samples
• For each word, create parameters that
map the word into the semantic space
• The “Word Vector Lookup Table”
See: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
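A sketch of training-sample generation for the skip-gram variant, with a window of 2 (toy sentence):

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))   # (input word, context word to predict)
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split()))
# ... ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps'), ...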
Word Embeddings – word2vec
• Neural Network trained on samples
• Output layer discarded
• Keep the Hidden Layer Weight Matrix!
See: https://www.tensorflow.org/tutorials/word2vec
Also look at “gensim” for a word2vec implementation:
https://radimrehurek.com/gensim/models/word2vec.html
Two model architectures:
1. Continuous Bag of Words (CBOW) – predict the target word from its context
2. Skip-gram – predict the context words from the target word
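A minimal gensim sketch (parameter names follow gensim 4.x, where older releases used size= instead of vector_size=; a real model needs a large corpus, not this toy one):

from gensim.models import Word2Vec

sentences = [["fruit", "flies", "like", "bananas"],
             ["time", "flies", "like", "an", "arrow"]]
model = Word2Vec(sentences, vector_size=300, window=5, sg=1, min_count=1)   # sg=1: skip-gram

vec = model.wv["flies"]                 # 300-d vector: that word's row of the hidden layer weights
print(model.wv.most_similar("flies"))   # nearest neighbours in the semantic space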
Word Embedding - Visualisation
• Use Principal Component Analysis (PCA) to create 2d representation
of semantic vector space
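A sketch using scikit-learn's PCA and matplotlib, assuming a trained gensim model like the one above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model.wv.index_to_key)                           # gensim 4.x; model.wv.vocab in 3.x
coords = PCA(n_components=2).fit_transform(model.wv[words])   # project 300-d -> 2-d

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()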
RNN and NLP
• Recurrent Neural Networks (RNN) may
be used for generative models
• Once trained, they can generate text
with the same structure, syntax and
semantics as the training set
• For a bit of fun, see “The Unreasonable
Effectiveness of Recurrent Neural
Networks”
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
C code generated from an RNN trained on the Linux
code base. While it does not execute (!) it is
syntactically correct. The model, for example, has
learnt to match “{” “}” pairs and parentheses
Resources and References
[1] “Natural Language Processing with Deep Learning” – Christopher Manning et
al, Stanford University. https://www.youtube.com/watch?v=OQQ-W_63UgQ
• Lecture series including an excellent description of back propagation, word2vec and GloVe
[2] “Hands-On Machine Learning with Scikit-Learn & TensorFlow”, Aurélien
Géron (O'Reilly Media, 2017)
• Excellent introduction, Jupyter Notebooks available here: https://github.com/ageron
[3] “Speech and Language Processing”, Daniel Jurafsky & James H. Martin (2nd
Edition, Pearson Education 2009)
• In-depth introduction to NLP
[4] “Introduction to Information Retrieval”, Christopher Manning et al
(Cambridge University Press, 2008)
• Probabilistic models for text retrieval, TF/IDF, Vector Space, Support Vector Machines…
