
Introduction to Text Mining and Topic Modelling


A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).

Want to know more? Visit my website davidpaule.es

Published in: Science


  1. Introduction to Text Mining and Topic Modelling, by Jorge David Gonzalez Paule (j.gonzalez-paule.1@research.gla.ac.uk)
  2. Awesome Practical
  3. Outline ● What is Text Mining? ● Preparing Text Data: Preprocessing ● Text Data: How to Represent It? ● Topic Modelling: LDA. LEARN!
  4. What is Text Mining? The process of extracting high-quality information from large amounts of unstructured textual data, using computational techniques from Information Retrieval, Information Extraction, Natural Language Processing and Data Mining.
  5. Process Overview: Filtering and Organisation, then Knowledge Discovery. ● Information Retrieval ● Natural Language Processing ● Information Extraction ● Data Mining ● Machine Learning ● Prediction Models …
  6. Information Retrieval (search engines): connect the right user with the right information. Text Mining (pattern discovery/mining): help the user analyse that information and facilitate decision making.
  7. Which features distinguish text data from other quantitative and relational data? Models such as ● Supervised/Unsupervised Learning ● Clustering ● Classification need to be adapted to work with text data!
  8. Text Data Features ● High Dimensional ● Sparse ● Ambiguous ● Unstructured ● Noisy
  9. How to represent text data? ● Word-Level ○ Bag of Words: isolated terms ● Semantically ○ Natural Language Processing: syntactic analysis
  10. Preprocessing Text Data 'Hey!!!.....This is an exanple to be preprocessed by @Jorge in #UBDC :) Awesome !!.... http://catvideos.com' ● Clean punctuation and other non-meaningful characters (regexp) 'hey this is an exanple to be preprocessed by in ubdc awesome' ● Tokenize ['hey', 'this', 'is', 'an', 'exanple', 'to', 'be', 'preprocessed', 'by', 'in', 'ubdc', 'awesome'] ● Remove stopwords ['exanple', 'preprocessed', 'ubdc', 'awesome'] ● Spelling corrector ['example', 'preprocessed', 'ubdc', 'awesome'] ● Stemming/Lemmatization (WordNet) ['exampl', 'preprocess', 'ubdc', 'awesom'], e.g. 'are' -> 'be'
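The pipeline above can be sketched in plain Python. This is a toy illustration, not a production pipeline: the stop-word list and the suffix-stripping stemmer are made-up stand-ins for NLTK's stopwords corpus and Porter/WordNet stemmers, and the spelling-correction step is omitted (so the input below already spells "example" correctly).

```python
import re

# Toy stop-word list: a stand-in for NLTK's stopwords corpus.
STOPWORDS = {'hey', 'this', 'is', 'an', 'to', 'be', 'by', 'in'}

def crude_stem(word):
    # Strip a few common suffixes (a real pipeline would use Porter/WordNet).
    for suffix in ('ing', 'ed', 'es', 'e'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r'http\S+', '', text)           # drop URLs
    text = re.sub(r'@\w+', '', text)              # drop @mentions
    text = re.sub(r'[^a-z\s]', '', text.lower())  # lowercase; letters only (also strips '#')
    tokens = text.split()                                # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    return [crude_stem(t) for t in tokens]               # stem

print(preprocess("Hey!!! This is an example to be preprocessed "
                 "by @Jorge in #UBDC :) Awesome!! http://catvideos.com"))
# → ['exampl', 'preprocess', 'ubdc', 'awesom']
```

Note the hashtag symbol is stripped but its word ('ubdc') survives, matching the slide's intermediate output.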
  11. Word-Level: Vector Space Model. The Term-Document Matrix represents the documents of a corpus/collection.
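A term-document matrix can be built directly from token counts. The corpus below is an invented toy example; rows are vocabulary terms, columns are documents, and each cell holds a raw term count.

```python
from collections import Counter

# Invented toy corpus of three tokenized documents.
corpus = [
    ['housing', 'price', 'glasgow'],
    ['price', 'data', 'data'],
    ['glasgow', 'transport', 'data'],
]

vocabulary = sorted({term for doc in corpus for term in doc})
counts = [Counter(doc) for doc in corpus]

# term_document[i][j] = count of vocabulary[i] in document j
term_document = [[c[term] for c in counts] for term in vocabulary]

for term, row in zip(vocabulary, term_document):
    print(f'{term:10s} {row}')
```

Each document is thus a (sparse, high-dimensional) column vector, which is exactly the Vector Space Model representation.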
  12. TF-IDF Weighting is the product of two statistics: 1. TF = Term Frequency. 2. IDF = Inverse Document Frequency, a measure of the discriminative power of a word with respect to a document in a collection. Given a term t, a document d and a collection of N documents, the TF-IDF is calculated as: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where df(t) is the number of documents containing t.
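The definition above can be computed from scratch in a few lines (the toy documents are invented for illustration; libraries such as scikit-learn provide tuned variants of this weighting):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF per (document, term): tf(t, d) * log(N / df(t))."""
    n_docs = len(corpus)
    df = Counter()                 # df[t] = number of documents containing t
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)          # raw term frequency in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [['price', 'price', 'glasgow'],
        ['price', 'data'],
        ['glasgow', 'data', 'transport']]
w = tf_idf(docs)
# 'transport' occurs in only 1 of 3 docs, so it gets the highest IDF, log(3);
# 'price' occurs in 2 of 3 docs, so its weight in doc 0 is 2 * log(3/2).
```

Words that appear in every document get IDF log(N/N) = 0, i.e. no discriminative power.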
  13. Word-Level: TF-IDF
  14. Word-Level Analysis: weaknesses. The Bag of Words representation does not take context into account. The semantic approach uses Natural Language Processing to consider the overall context of a word in a sentence.
  15. Natural Language Processing (NLP) ● Key Idea: learn the language from data, as a human being does! ● Tasks: ○ Named Entity Recognition (NER) ○ Part-of-Speech Tagging ○ Parsing (grammatical analysis) ○ Sentiment Analysis ○ …
  16. Topic Modelling
  17. What is Topic Modelling? ● An unsupervised learning method ● Analyses the words in the original text… ● …to annotate each document with thematic information. ● Models: LSI and LDA. Latent Dirichlet Allocation is the most widely used!
  18. Latent Dirichlet Allocation. Input: the documents and the number of topics. Output model parameters: 1. Distribution of topics per document. 2. Distribution of words per topic.
  19. Latent Dirichlet Allocation 1. Assumes data are observations that arise from a generative probabilistic process that includes hidden variables. 2. Infers the hidden structure using posterior inference. 3. Allocates new data into the estimated model.
  20. Probabilistic Generative Model: Hidden Variables, Joint Distribution, Prior Distributions, Generative Process, Posterior Distributions.
  21. LDA intuition: “Documents exhibit multiple topics”. Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84.
  22. Generative Process: choose the Word Distributions and the Topic Distributions through a Dirichlet Distribution! Inference runs this process in REVERSE!
  23. Generative Process: Formal Definition. β_k = topic k's distribution over the vocabulary; θ_d = topic proportions in document d; z_{d,n} = topic assignment for the nth word in document d; w_{d,n} = nth word in document d. The Generative Process is defined as the Joint Distribution of the hidden and observed variables: p(β, θ, z, w) = ∏_k p(β_k) ∏_d p(θ_d) ∏_n p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n}).
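The generative process can be simulated directly. The sketch below draws from a Dirichlet by normalizing independent Gamma draws (a standard identity), then generates a document by repeatedly picking a topic z from θ_d and a word w from the chosen topic's β. The vocabulary, number of topics K, and the symmetric priors alpha and eta are invented toy values.

```python
import random

random.seed(0)

def sample_dirichlet(alphas):
    # Draw from Dirichlet(alphas) by normalizing independent Gamma(a, 1) draws.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs):
    # Draw an index with the given probabilities (inverse-CDF sampling).
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Toy setup: K topics, a tiny vocabulary, symmetric priors alpha and eta.
K, vocab = 2, ['housing', 'price', 'transport', 'bus']
alpha, eta = 0.5, 0.5

# beta[k]: topic k's distribution over the vocabulary.
beta = [sample_dirichlet([eta] * len(vocab)) for _ in range(K)]

def generate_document(n_words):
    theta = sample_dirichlet([alpha] * K)  # theta_d: topic proportions
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta)      # topic assignment z_{d,n}
        w = sample_categorical(beta[z])    # observed word w_{d,n}
        words.append(vocab[w])
    return words

print(generate_document(8))
```

Inference (the next slides) reverses this: given only the generated words, recover plausible β and θ.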
  24. Graphical Model (Blei et al.)
  25. Generative Model. Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427.7 (2007): 424-440.
  26. Inference Algorithm. Inference is the process of computing the posterior/conditional distribution of the hidden topic structure: p(β, θ, z | w) = p(β, θ, z, w) / p(w). The numerator is the Joint Distribution; the denominator is the marginal probability of the observations, which sums over all possible ways to assign each observed word of the collection to one of the topics. This is hard to compute, so it is approximated, e.g. with Gibbs Sampling. (β = topic distributions over the vocabulary; θ_d = topic proportions in document d; z_{d,n} = topic assignment for the nth word in document d; w_{d,n} = nth word in document d.)
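As a sketch of that approximation, here is a minimal collapsed Gibbs sampler: it resamples each token's topic assignment from its conditional distribution given all other assignments, using only count tables (θ and β are integrated out and can be recovered from the counts afterwards). The hyperparameters and the toy corpus are illustrative assumptions, not values from the slides.

```python
import random

random.seed(1)

def collapsed_gibbs_lda(docs, K, alpha=0.1, eta=0.01, iters=200):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    # Count tables: topics per document, words per topic, tokens per topic.
    n_dk = [[0] * K for _ in docs]
    n_kw = [[0] * V for _ in range(K)]
    n_k = [0] * K
    z = []  # topic assignment for every token, initialized at random
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = random.randrange(K)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][widx[w]] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k, v = z[d][n], widx[w]
                # Remove this token's current assignment from the counts...
                n_dk[d][k] -= 1; n_kw[k][v] -= 1; n_k[k] -= 1
                # ...then resample from p(z = j | all other assignments).
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][v] + eta)
                           / (n_k[j] + V * eta) for j in range(K)]
                r, acc, k = random.random() * sum(weights), 0.0, K - 1
                for j, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        k = j
                        break
                z[d][n] = k
                n_dk[d][k] += 1; n_kw[k][v] += 1; n_k[k] += 1
    return z, n_kw, vocab

docs = [['housing', 'price', 'housing'], ['bus', 'train', 'bus'],
        ['price', 'housing'], ['train', 'bus']]
z, n_kw, vocab = collapsed_gibbs_lda(docs, K=2, iters=100)
```

On this toy corpus the two disjoint word groups ('housing'/'price' vs. 'bus'/'train') tend to separate into the two topics; production code would use a library such as gensim instead.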
  27. Real-World Example: http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/ http://sarah-palin.herokuapp.com/
  28. Questions
  29. Resources ● http://videolectures.net/mlss09uk_blei_tm/ ● Blei, David M. "Probabilistic topic models." Communications of the ACM 55.4 (2012): 77-84. ● Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427.7 (2007): 424-440. ● Charu C. Aggarwal and ChengXiang Zhai. 2012. Mining Text Data. Springer.
