Introduction to Text Mining and Topic Modelling
by Jorge David Gonzalez Paule
j.gonzalez-paule.1@research.gla.ac.uk
Awesome
Practical
Outline
● What is Text Mining?
● Preparing Text Data: Preprocessing
● Text Data: How to represent?
● Topic Modelling: LDA
LEARN !!!!!
What is Text Mining?
The process of extracting high-quality information from large amounts of unstructured textual data, using computational techniques from:
● Information Retrieval
● Information Extraction
● Natural Language Processing
● Data Mining
Process Overview
Filtering and
organisation
Knowledge
Discovery
● Information Retrieval.
● Natural Language
Processing.
● Information Extraction
● Data Mining.
● Machine Learning.
● Prediction Models.
…….
Information Retrieval
Search engine: to connect the right user with the right information.
Text Mining
Pattern discovery/mining: to help the user analyse and facilitate decision making.
Which features distinguish text data from other quantitative and relational data?
● Supervised/Unsupervised Learning Models
● Clustering
● Classification
These models need to be adapted to work with text data!
Text Data Features
● High Dimensional
● Sparse
● Ambiguous
● Unstructured
● Noisy
How to represent text data?
● Word-Level
○ Bag of Words: Isolated terms
● Semantically
○ Natural Language Processing: Syntactic Analysis
Preprocessing Text Data
'Hey!!!.....This is an exanple to be preprocessed by @Jorge in #UBDC :) Awesome !!.... http://catvideos.com'
● Lowercase and clean punctuation, mentions, URLs and other non-meaningful characters (regexp)
'hey this is an exanple to be preprocessed by in ubdc awesome'
● Tokenize
['hey', 'this', 'is', 'an', 'exanple', 'to', 'be', 'preprocessed', 'by', 'in', 'ubdc', 'awesome']
● Remove stopwords
['exanple', 'preprocessed', 'ubdc', 'awesome']
● Spelling correction
['example', 'preprocessed', 'ubdc', 'awesome']
● Stemming/Lemmatization (WordNet)
Stemming: ['exampl', 'preprocess', 'ubdc', 'awesom']
Lemmatization: 'are' -> 'be'
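The first three steps above can be sketched in Python with the standard library alone. This is a minimal sketch, not the author's code: the regular expressions and the tiny `STOPWORDS` set are assumptions chosen so that the slide's example is reproduced, and real stopword lists are much larger. Spelling correction and stemming are left out because they usually rely on external libraries (for example, NLTK provides a `PorterStemmer`).

```python
import re

# Tiny illustrative stopword list (an assumption, picked to match the example).
STOPWORDS = {"hey", "this", "is", "an", "to", "be", "by", "in"}


def clean(text):
    """Lowercase the text, then drop URLs, @mentions and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits, '#', emoticons
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def tokenize(text):
    """Split the cleaned text on whitespace."""
    return text.split()


def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


raw = ("Hey!!!.....This is an exanple to be preprocessed by @Jorge "
       "in #UBDC :) Awesome !!.... http://catvideos.com")
cleaned = clean(raw)
tokens = tokenize(cleaned)
content_words = remove_stopwords(tokens)
print(content_words)
```

The misspelling 'exanple' is kept on purpose: it survives these steps and would only be repaired by the spelling-correction step.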
Word-Level: Vector Space Model
Term-Document Matrix
Documents
Corpus/Collection
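A term-document matrix has one row per term and one column per document, and each cell holds how often the term occurs in that document. A minimal Python sketch follows; the three-document corpus and all variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical, already preprocessed corpus: one token list per document.
docs = [
    ["text", "mining", "text"],
    ["topic", "modelling"],
    ["text", "topic"],
]

# Vocabulary: every distinct term in the collection, in a fixed order.
vocab = sorted({term for doc in docs for term in doc})

# Raw term counts per document.
counts = [Counter(doc) for doc in docs]

# Rows are terms, columns are documents.
matrix = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, matrix):
    print(term, row)
```

Most cells are zero even in this toy corpus, which illustrates why text data is described as high dimensional and sparse.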
TF-IDF Weighting
TF-IDF is the product of two statistics:
1. TF = Term Frequency.
2. IDF = Inverse Document Frequency.
a. A measure of the discriminative power of a word with respect to a document in a collection.
Given:
The TF-IDF is calculated as:
Word-Level: TF-IDF
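One common formulation, assumed here because several variants of the weighting exist, multiplies the raw count of term t in document d by log(N / df(t)), where N is the number of documents and df(t) is the number of documents that contain t. A minimal Python sketch under that assumption (function and variable names are illustrative):

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight(t, d) = tf(t, d) * log(N / df(t)), with raw counts as tf.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


corpus = [
    ["text", "mining", "text"],
    ["text", "topic", "modelling"],
    ["text", "topic"],
]
weights = tf_idf(corpus)
print(weights[0])
```

Under this variant a term that occurs in every document, such as "text" here, gets weight 0, while a term found in a single document gets the highest IDF.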
Word-Level Analysis Weakness
The Bag of Words representation does not take context into account.
The semantic approach uses Natural Language Processing to consider the overall context of a word in a sentence.
Natural Language Processing (NLP)
● Key Idea: Learn the language from data, as a human being does!
● Tasks:
○ Named Entity Recognition (NER)
○ Part-Of-Speech Tagging
○ Parsing (Grammatical analysis)
○ Sentiment Analysis
○ ...
Topic Modelling
What is Topic Modelling?
● Unsupervised learning method
● Analyses the words in the original texts to annotate each document with thematic information.
● Models: LSI and LDA
Latent Dirichlet Allocation is the most widely used!
Latent Dirichlet Allocation
LDA
Inputs: number of topics, model parameters
Outputs:
1. Distribution of topics per document.
2. Distribution of words per topic.
Latent Dirichlet Allocation
1. Assume the data are observations that arise from a generative probabilistic process that includes hidden variables.
2. Infer the hidden structure using posterior inference.
3. Allocate new data into the estimated model.
Probabilistic Generative Model
Hidden Variables
Joint Distribution
Prior Distributions
Generative Process
Posterior Distributions
LDA intuition
Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
“Documents exhibit multiple topics”
Generative Process
Word
Distributions
Topic Distributions
Choose through a
Dirichlet Distribution !!
REVERSE !!!!
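The generative story can be sketched in Python as follows: each topic draws a word distribution from a Dirichlet, each document draws its topic proportions from a Dirichlet, and every word is produced by first picking a topic and then picking a word from that topic. This is a sketch under stated assumptions, not the author's code: the function names, the symmetric hyperparameters `alpha` and `eta`, the fixed document length and the toy vocabulary are all illustrative. Inference then runs this process in reverse, recovering the hidden distributions from the observed words.

```python
import random


def dirichlet(alpha, rng):
    """Sample from a Dirichlet by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    """Generate documents by following the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Topic proportions for this document.
        theta = dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Topic assignment for this word position.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # Word drawn from the assigned topic's distribution.
            words.append(rng.choices(vocab, weights=topics[z])[0])
        corpus.append(words)
    return topics, corpus


topics, corpus = generate_corpus(
    vocab=["gene", "dna", "data", "model"], n_topics=2, n_docs=3, doc_len=8
)
for doc in corpus:
    print(doc)
```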
Generative Process
Formal Definition
$\beta_{1:K}$ = topic distributions over the vocabulary
$\theta_d$ = topic proportions in document d
$z_{d,n}$ = topic assignment for the nth word in document d
$w_{d,n}$ = nth word in document d
The Generative Process is defined as
the Joint Distribution of the hidden and observed variables.
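In the notation of Blei (2012), which the slides cite, the joint distribution factorises as shown below. The symbols are assumptions taken from that paper: $\beta_{1:K}$ for the K topics, $\theta_d$ for the topic proportions of document d, $z_{d,n}$ for the topic assignment of the nth word in document d, and $w_{d,n}$ for the observed word, with D documents of N words each.

```latex
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})
  = \prod_{i=1}^{K} p(\beta_i)
    \prod_{d=1}^{D} p(\theta_d)
    \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\,
           p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
```

Each factor mirrors one step of the generative process: draw the topics, draw the topic proportions of a document, then draw a topic assignment and a word for every position.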
Graphical Model
(Blei et al.)
Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427.7 (2007): 424-440.
Generative Model
Inference Algorithm
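The process of computing the posterior (conditional) distribution of the hidden topic structure given the observed words:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$
● Numerator: the joint distribution.
● Denominator: the marginal probability of the observations, which sums over all possible ways to assign each observed word of the collection to one of the topics.
Hard to compute -> approximation with Gibbs sampling
$\beta_{1:K}$ = topic distributions over the vocabulary
$\theta_d$ = topic proportions in document d
$z_{d,n}$ = topic assignment for the nth word in document d
$w_{d,n}$ = nth word in document d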
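A collapsed Gibbs sampler is one common way to carry out this approximation: it repeatedly resamples the topic assignment of each word conditioned on all the other assignments, and estimates the topic and document distributions from the resulting counts. The sketch below is illustrative rather than the author's implementation; the function name, the symmetric hyperparameters `alpha` and `eta`, the iteration count and the toy corpus of integer word ids are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates from the final counts.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    beta = [
        [(c + eta) / (topic_total[k] + vocab_size * eta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, beta


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, beta = lda_gibbs(toy_docs, n_topics=2, vocab_size=4)
print(theta)
```

Here `theta` holds the estimated topic proportions per document and `beta` the estimated word distribution per topic. A single final sample is used for simplicity; averaging over several samples is a common refinement.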
Real World Example
http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/
http://sarah-palin.herokuapp.com/
Questions
Resources
● http://videolectures.net/mlss09uk_blei_tm/
● Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
● Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427.7 (2007): 424-440.
● Charu C. Aggarwal and Cheng Xiang Zhai. 2012. Mining Text Data. Springer Publishing Company, Incorporated.
And…………
!!!!!!
