A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).
Want to know more? Visit my website davidpaule.es
4. Outline
● What is Text Mining?
● Preparing Text Data: Preprocessing
● Text Data: How to represent?
● Topic Modelling: LDA
LEARN !!!!!
5. What is Text Mining?
Text Mining is the process of extracting high-quality information from large amounts of unstructured textual data, using computational techniques from:
● Information Retrieval
● Information Extraction
● Natural Language Processing
● Data Mining
7. Information Retrieval vs. Text Mining
● Information Retrieval (search engines): connects the right user with the right information.
● Text Mining (pattern discovery/mining): helps the user analyse the information and facilitates decision making.
8. Which features distinguish text data from other quantitative and relational data?
● Supervised/Unsupervised Learning Models
● Clustering
● Classification
These models need to be adapted to work with text data !!!!
9. Text Data Features
● High Dimensional
● Sparse
● Ambiguous
● Unstructured
● Noisy
10. How to represent text data?
● Word-Level
○ Bag of Words: isolated terms (see the sketch after this list)
● Semantically
○ Natural Language Processing: Syntactic Analysis
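To make the Bag of Words idea concrete, a minimal sketch in plain Python (the example sentence is made up):

```python
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.split())  # term -> frequency; word order is discarded
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```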
11. Preprocessing Text Data
'Hey!!!.....This is an exanple to be preprocessed by @Jorge in #UBDC :) Awesome !!....
http://catvideos.com'
● Clean punctuation or other non-meaningful characters (regexp)
'hey this is an exanple to be preprocessed by in ubdc awesome'
● Tokenize
['hey', 'this', 'is', 'an', 'exanple', 'to', 'be', 'preprocessed', 'by', 'in', 'ubdc', 'awesome']
● Remove stopwords
['exanple', 'preprocessed', 'ubdc', 'awesome']
● Spelling corrector
['example', 'preprocessed', 'ubdc', 'awesome']
● Stemming/Lemmatization (WordNet)
['exampl', 'preprocess', 'ubdc', 'awesom'] (stemming)
Lemmatization instead maps inflected forms to their dictionary form, e.g. 'are' -> 'be'
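A minimal sketch of this pipeline in Python, assuming NLTK is installed with its stopwords data downloaded; the regular expressions are simplified, the exact tokens kept depend on the stopword list used, and the spelling-correction step is left as a placeholder:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = ("Hey!!!.....This is an exanple to be preprocessed by @Jorge "
        "in #UBDC :) Awesome !!.... http://catvideos.com")

# 1. Clean: drop URLs and @mentions, then punctuation and '#', lowercase
text = re.sub(r"http\S+|@\w+", "", text)
text = re.sub(r"[^a-z\s]", "", text.lower())

# 2. Tokenize (simple whitespace split)
tokens = text.split()

# 3. Remove stopwords
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]

# 4. A spelling corrector (e.g. Norvig-style) would fix 'exanple' here

# 5. Stem the surviving tokens
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
```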
13. TF-IDF Weighting
TF-IDF is the product of two statistics:
1. TF: Term Frequency, how often term t occurs in document d.
2. IDF: Inverse Document Frequency, a measure of the discriminative power of a word with respect to a document in a collection.
Given a collection of N documents, with df(t) the number of documents containing term t, the TF-IDF is calculated as:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$
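A minimal pure-Python sketch of this formula; the toy documents are made up, and library implementations such as scikit-learn use smoothed variants:

```python
import math
from collections import Counter

docs = [["glasgow", "data", "centre"],
        ["urban", "data"],
        ["urban", "big", "data", "data"]]
N = len(docs)

def tfidf(term, doc):
    tf = Counter(doc)[term]            # raw term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log(N / df)

print(tfidf("data", docs[2]))   # appears everywhere -> weight 0 (log(3/3) = 0)
print(tfidf("urban", docs[2]))  # rarer term -> higher weight
```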
15. Word-Level Analysis Weakness
The Bag of Words representation does not take context into account.
The semantic approach uses Natural Language Processing to consider the overall context of a word in a sentence.
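To illustrate with two made-up sentences: opposite meanings can yield identical Bag of Words representations:

```python
from collections import Counter

a = "i like cats not dogs"
b = "i like dogs not cats"
print(Counter(a.split()) == Counter(b.split()))  # True: word order, and meaning, is lost
```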
16. Natural Language Processing (NLP)
● Key Idea: learn the language from data, as a human being does !!
● Tasks:
○ Named Entity Recognition (NER)
○ Part-Of-Speech Tagging (see the sketch after this list)
○ Parsing (Grammatical analysis)
○ Sentiment Analysis
○ …...
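A minimal sketch of one of these tasks, Part-Of-Speech tagging, assuming NLTK is installed with its tokenizer and tagger models downloaded (the sentence is made up):

```python
import nltk

sentence = "The Urban Big Data Centre is based in Glasgow."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('Urban', 'NNP'), ..., ('Glasgow', 'NNP'), ('.', '.')]
```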
18. What is Topic Modelling?
● Unsupervised learning method
● Analyses the words in the original text…..
● ….to annotate each document with thematic information.
● Models: LSI and LDA
Latent Dirichlet Allocation (LDA) is the most widely used !!!!
19. Latent Dirichlet Allocation
Given a number of topics and the model parameters, LDA estimates:
1. The distribution of topics per document.
2. The distribution of words per topic.
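A minimal training sketch with gensim, assuming it is installed; the tiny corpus and num_topics=2 are purely illustrative:

```python
from gensim import corpora, models

texts = [["urban", "transport", "glasgow"],
         ["topic", "model", "text"],
         ["transport", "data", "glasgow"],
         ["text", "mining", "model"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

print(lda.print_topics())                  # distribution of words per topic
print(lda.get_document_topics(corpus[0]))  # distribution of topics for one document
```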
20. Latent Dirichlet Allocation
1. Assumes the data are observations arising from a generative probabilistic process that includes hidden variables.
2. Infers the hidden structure using posterior inference.
3. Allocates new data into the estimated model.
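Step 3, continuing the hypothetical gensim sketch above: an unseen document is mapped into the estimated model's topic space:

```python
# reuses `dictionary` and `lda` from the training sketch above
new_doc = ["glasgow", "transport", "model"]
bow = dictionary.doc2bow(new_doc)
print(lda.get_document_topics(bow))  # inferred topic proportions for the new document
```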
24. Generative Process
Formal Definition:
β_k = topic k's distribution over the vocabulary
θ_d = topic proportions in document d
z_{d,n} = topic assignment for the nth word in document d
w_{d,n} = nth word in document d
The Generative Process is defined as the joint distribution of the hidden and observed variables:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})$$
26. Generative Model
[Figure: illustration of the generative model, from Steyvers & Griffiths (2007)]
Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of Latent Semantic Analysis 427.7 (2007): 424-440.
27. Inference Algorithm
Inference is the process of computing the posterior (conditional) distribution of the hidden topic structure given the observed words:
$$p(\beta, \theta, z \mid w) = \frac{p(\beta, \theta, z, w)}{p(w)}$$
The numerator is the joint distribution defined above; the denominator is the marginal probability of the observations, which sums over all possible ways to assign each observed word of the collection to one of the topics. This makes the posterior hard to compute exactly, so it is approximated, e.g. with Gibbs Sampling.
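For intuition only, a minimal collapsed Gibbs sampler for LDA sketched in Python with NumPy; the function name, the priors alpha/beta, and the toy corpus are illustrative assumptions, not any library's exact implementation:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V).
    Returns theta (documents x topics) and phi (topics x vocabulary)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))  # topic counts per document
    nkw = np.zeros((K, V))  # word counts per topic
    nk = np.zeros(K)        # total words assigned to each topic
    z = []                  # current topic assignment of every word
    for d, doc in enumerate(docs):          # random initialisation
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):                  # resample every assignment
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1  # remove word
                # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())            # sample new topic
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1  # add word back
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# toy run: 3 documents over a 4-word vocabulary, 2 topics
docs = [[0, 1, 2, 0], [2, 3, 3], [0, 0, 1]]
theta, phi = gibbs_lda(docs, V=4, K=2)
```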