A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).
Want to know more? Visit my website davidpaule.es
4. Outline
● What is Text Mining?
● Preparing Text Data: Preprocessing
● Text Data: How to represent?
● Topic Modelling: LDA
LEARN !!!!!
5. What is Text Mining?
Text Mining is the process of extracting high-quality information from large amounts of unstructured textual data, using computational techniques from:
● Information Retrieval
● Information Extraction
● Natural Language Processing
● Data Mining
7. Information Retrieval vs. Text Mining
● Information Retrieval (search engines): connects the right user with the right information.
● Text Mining (pattern discovery/mining): helps the user analyse the information and facilitates decision making.
8. Which features distinguish text data from other quantitative and relational data?
● Supervised/Unsupervised Learning Models
● Clustering
● Classification
These models need to be adapted to work with text data !!!!
9. Text Data Features
● High Dimensional
● Sparse
● Ambiguous
● Unstructured
● Noisy
10. How to represent text data?
● Word-Level
○ Bag of Words: isolated terms (see the sketch after this list)
● Semantically
○ Natural Language Processing: Syntactic Analysis
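To make the Bag of Words idea concrete, a minimal sketch in plain Python (the example sentence is made up):

```python
from collections import Counter

doc = "the cat sat on the mat"
bow = Counter(doc.split())  # term -> frequency; word order is discarded
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```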
11. Preprocessing Text Data
'Hey!!!.....This is an exanple to be preprocessed by @Jorge in #UBDC :) Awesome !!....
http://catvideos.com'
● Clean punctuation or other non-meaningful characters (regexp)
'hey this is an exanple to be preprocessed by in ubdc awesome'
● Tokenize
['hey', 'this', 'is', 'an', 'exanple', 'to', 'be', 'preprocessed', 'by', 'in', 'ubdc', 'awesome']
● Remove stopwords
['exanple', 'preprocessed', 'ubdc', 'awesome']
● Spelling corrector
['example', 'preprocessed', 'ubdc', 'awesome']
● Stemming/Lemmatization (WordNet)
['exampl', 'preprocess', 'ubdc', 'awesom'] (stemming)
Lemmatization instead maps inflected forms to their dictionary form, e.g. 'are' -> 'be'
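A minimal sketch of this pipeline in Python, assuming NLTK is installed with its stopwords data downloaded; the regular expressions are simplified, the exact tokens kept depend on the stopword list used, and the spelling-correction step is left as a placeholder:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = ("Hey!!!.....This is an exanple to be preprocessed by @Jorge "
        "in #UBDC :) Awesome !!.... http://catvideos.com")

# 1. Clean: drop URLs and @mentions, then punctuation and '#', lowercase
text = re.sub(r"http\S+|@\w+", "", text)
text = re.sub(r"[^a-z\s]", "", text.lower())

# 2. Tokenize (simple whitespace split)
tokens = text.split()

# 3. Remove stopwords
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]

# 4. A spelling corrector (e.g. Norvig-style) would fix 'exanple' here

# 5. Stem the surviving tokens
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
```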
13. TF-IDF Weighting
TF-IDF is the product of two statistics:
1. TF: Term Frequency, how often term t occurs in document d.
2. IDF: Inverse Document Frequency, a measure of the discriminative power of a word with respect to a document in a collection.
Given a collection of N documents, with df(t) the number of documents containing term t, the TF-IDF is calculated as:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$
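A minimal pure-Python sketch of this formula; the toy documents are made up, and library implementations such as scikit-learn use smoothed variants:

```python
import math
from collections import Counter

docs = [["glasgow", "data", "centre"],
        ["urban", "data"],
        ["urban", "big", "data", "data"]]
N = len(docs)

def tfidf(term, doc):
    tf = Counter(doc)[term]            # raw term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    return tf * math.log(N / df)

print(tfidf("data", docs[2]))   # appears everywhere -> weight 0 (log(3/3) = 0)
print(tfidf("urban", docs[2]))  # rarer term -> higher weight
```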
15. Word-Level Analysis Weakness
The Bag of Words representation does not take context into account.
The semantic approach uses Natural Language Processing to consider the overall context of a word in a sentence.
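To illustrate with two made-up sentences: opposite meanings can yield identical Bag of Words representations:

```python
from collections import Counter

a = "i like cats not dogs"
b = "i like dogs not cats"
print(Counter(a.split()) == Counter(b.split()))  # True: word order, and meaning, is lost
```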
16. Natural Language Processing (NLP)
● Key Idea: learn the language from data, as a human being does !!
● Tasks:
○ Named Entity Recognition (NER)
○ Part-Of-Speech Tagging (see the sketch after this list)
○ Parsing (Grammatical analysis)
○ Sentiment Analysis
○ …...
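A minimal sketch of one of these tasks, Part-Of-Speech tagging, assuming NLTK is installed with its tokenizer and tagger models downloaded (the sentence is made up):

```python
import nltk

sentence = "The Urban Big Data Centre is based in Glasgow."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('Urban', 'NNP'), ..., ('Glasgow', 'NNP'), ('.', '.')]
```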
18. What is Topic Modelling?
● Unsupervised learning method
● Analyses the words in the original text…..
● ….to annotate each document with thematic information.
● Models: LSI and LDA
Latent Dirichlet Allocation (LDA) is the most widely used !!!!
19. Latent Dirichlet Allocation
Given a number of topics and the model parameters, LDA estimates:
1. The distribution of topics per document.
2. The distribution of words per topic.
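A minimal training sketch with gensim, assuming it is installed; the tiny corpus and num_topics=2 are purely illustrative:

```python
from gensim import corpora, models

texts = [["urban", "transport", "glasgow"],
         ["topic", "model", "text"],
         ["transport", "data", "glasgow"],
         ["text", "mining", "model"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

print(lda.print_topics())                  # distribution of words per topic
print(lda.get_document_topics(corpus[0]))  # distribution of topics for one document
```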
20. Latent Dirichlet Allocation
1. Assumes the data are observations arising from a generative probabilistic process that includes hidden variables.
2. Infers the hidden structure using posterior inference.
3. Allocates new data into the estimated model.
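Step 3, continuing the hypothetical gensim sketch above: an unseen document is mapped into the estimated model's topic space:

```python
# reuses `dictionary` and `lda` from the training sketch above
new_doc = ["glasgow", "transport", "model"]
bow = dictionary.doc2bow(new_doc)
print(lda.get_document_topics(bow))  # inferred topic proportions for the new document
```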
24. Generative Process
Formal Definition:
β_k = topic k's distribution over the vocabulary
θ_d = topic proportions in document d
z_{d,n} = topic assignment for the nth word in document d
w_{d,n} = nth word in document d
The Generative Process is defined as the joint distribution of the hidden and observed variables:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})$$
26. Generative Model
[Figure: illustration of the generative model, from Steyvers & Griffiths (2007)]
Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of Latent Semantic Analysis 427.7 (2007): 424-440.
27. Inference Algorithm
Inference is the process of computing the posterior (conditional) distribution of the hidden topic structure given the observed words:
$$p(\beta, \theta, z \mid w) = \frac{p(\beta, \theta, z, w)}{p(w)}$$
The numerator is the joint distribution defined above; the denominator is the marginal probability of the observations, which sums over all possible ways to assign each observed word of the collection to one of the topics. This makes the posterior hard to compute exactly, so it is approximated, e.g. with Gibbs Sampling.
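For intuition only, a minimal collapsed Gibbs sampler for LDA sketched in Python with NumPy; the function name, the priors alpha/beta, and the toy corpus are illustrative assumptions, not any library's exact implementation:

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V).
    Returns theta (documents x topics) and phi (topics x vocabulary)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))  # topic counts per document
    nkw = np.zeros((K, V))  # word counts per topic
    nk = np.zeros(K)        # total words assigned to each topic
    z = []                  # current topic assignment of every word
    for d, doc in enumerate(docs):          # random initialisation
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):                  # resample every assignment
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1  # remove word
                # full conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())            # sample new topic
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1  # add word back
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# toy run: 3 documents over a 4-word vocabulary, 2 topics
docs = [[0, 1, 2, 0], [2, 3, 3], [0, 0, 1]]
theta, phi = gibbs_lda(docs, V=4, K=2)
```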