Text Mining Analytics 101

A high-level introduction to text mining analytics: the building blocks and most commonly used techniques, with additional references and links for background reading and R code to get you started.

  1. Text Mining 101 (Manohar Swamynathan, August 2012)
  2. Agenda:
     - Text Mining Process Steps
     - Calculate Term Weight
     - Similarity Distance Measure
     - Common Text Mining Techniques
     - Appendix: required R packages for text mining; implemented examples (R code for obtaining and analyzing tweets; RTextTools ensemble classification)
     - References
  3. Step 1 – Data Assemble: gather text from the common data sources (flat files, social media, corporate databases) into a text corpus.
  4. Step 2 – Data Processing (R packages: tm, qdap, openNLP; Python packages: textmining, nltk). A sketch of the pipeline follows this list.
     - Explore the corpus through exploratory data analysis: understand the types of variables, their functions and permissible values. Some formats, including HTML and XML, contain tags and other data structures that provide extra metadata.
     - Convert text to lowercase: avoids distinguishing between words simply on case.
     - Remove numbers (if required): numbers may or may not be relevant to the analysis.
     - Remove punctuation: punctuation can provide grammatical context that supports understanding; for initial analyses it is often ignored, and later used to support the extraction of meaning.
     - Remove English stop words: stop words are common words in a language, such as "for", "of" and "are".
     - Remove own stop words (if required): instead of, or in addition to, the English stop words, remove domain-specific ones; the right choice may not become apparent until some analysis has been done.
     - Strip whitespace: eliminate any space beyond the single spaces that occur within sentences and between words.
     - Stemming: an algorithm removes common English word endings such as "es", "ed" and "'s"; for example, "computer" and "computers" become "comput".
     - Lemmatization: transform words to their dictionary base form; for example, "produce" and "produced" become "produce".
     - Remove sparse terms: we are often not interested in infrequent terms, so such "sparse" terms should be removed from the document term matrix.
     - Create the document term matrix: a matrix with documents as rows, terms as columns, and word frequency counts as cells.
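A minimal sketch of this pipeline with the tm package (the two-document corpus is made up for illustration; stemDocument needs the SnowballC package installed):

```r
library(tm)

# Hypothetical two-document corpus
docs <- c("Statistics skills and programming skills are equally important!",
          "Statistics skills and domain knowledge are important for analytics.")
corpus <- VCorpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # English stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces
corpus <- tm_map(corpus, stemDocument)                       # Porter stemming

dtm <- DocumentTermMatrix(corpus)    # documents as rows, terms as columns
dtm <- removeSparseTerms(dtm, 0.99)  # drop very sparse terms
inspect(dtm)
```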
  5. Step 3 – Data Visualization: frequency chart, word cloud, correlation plot.
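A hedged sketch of the three plot types, starting from the document term matrix dtm built in the previous sketch (the term "skill" passed to findAssocs is just an illustrative choice, and the correlation plot requires the Bioconductor package Rgraphviz):

```r
library(wordcloud)
library(RColorBrewer)

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Frequency chart of the top terms
barplot(head(freq, 10), las = 2, main = "Most frequent terms")

# Word cloud sized by term frequency
wordcloud(names(freq), freq, min.freq = 1, colors = brewer.pal(8, "Dark2"))

# Correlation plot: terms correlated with a chosen term, and a term graph
findAssocs(dtm, "skill", 0.8)
plot(dtm, corThreshold = 0.8)  # requires Rgraphviz
```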
  6. Step 4 – Models: clustering, classification and sentiment analysis of the documents.
  7. Calculate Term Weight (TF * IDF)
     Term frequency – how frequently does a term appear?
       TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
     Inverse document frequency – how important is a term?
       Document frequency DF = d (the number of documents containing a given term) / D (the size of the collection of documents). To normalize, take log(d/D); but since usually D > d, log(d/D) gives a negative value, so we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large and very small quantities can be compared smoothly.
       IDF(t) = log(total number of documents / number of documents containing term t)
     Example: assume 10 million documents overall and that the word "spindle" appears in one thousand of them; consider two documents of 100 words each in which "spindle" appears some number of times. The arithmetic is worked as R code after this table.

     | Document | "spindle" frequency | Total words | TF           | IDF                       | TF * IDF        |
     |----------|---------------------|-------------|--------------|---------------------------|-----------------|
     | 1        | 3                   | 100         | 3/100 = 0.03 | log(10,000,000/1,000) = 4 | 0.03 * 4 = 0.12 |
     | 2        | 30                  | 100         | 30/100 = 0.3 | log(10,000,000/1,000) = 4 | 0.3 * 4 = 1.2   |
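The same arithmetic in base R (base-10 log, matching the slide's numbers):

```r
# Collection: 10 million documents; "spindle" appears in 1,000 of them
idf <- log10(1e7 / 1e3)  # = 4

tf_doc1 <- 3 / 100       # "spindle" appears 3 times in a 100-word document
tf_doc2 <- 30 / 100      # "spindle" appears 30 times in a 100-word document

tf_doc1 * idf            # 0.12
tf_doc2 * idf            # 1.2
```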
  8. Similarity Distance Measure
     Example:
       Text 1: statistics skills and programming skills are equally important for analytics
       Text 2: statistics skills and domain knowledge are important for analytics
       Text 3: I like reading books and travelling
     Term-frequency vectors over the vocabulary (statistics, skills, and, programming, knowledge, are, equally, important, for, analytics, domain, I, like, reading, books, travelling):
       T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
       T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
       T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
     Cosine similarity takes a value between 0 and 1: the smaller the angle between two vectors, the bigger the cosine value/similarity. The computation is shown as runnable R below.
       Degree of similarity (T1, T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
       Degree of similarity (T1, T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
     Additional reading – detailed papers comparing the efficiency of different distance measures for text documents:
       1) http://home.iitk.ac.in/~spranjal/cs671/project/report.pdf
       2) http://users.dsic.upv.es/~prosso/resources/BarronEtAl_ICON09.pdf
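A runnable version of the slide's cosine computation (vectors copied from the table above):

```r
T1 <- c(1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 <- c(1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 <- c(0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

# Cosine of the angle between two term-frequency vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(T1, T2)  # ~0.77: the texts share most of their vocabulary
cosine(T1, T3)  # ~0.12: only "and" is shared
```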
  9. Common Text Mining Techniques: n-grams; shallow natural language processing; deep natural language processing.
  10. n-gram
      Definition:
        - An n-gram is a contiguous sequence of n items from a given sequence of text.
        - The items can be letters, words, syllables or base pairs, according to the application.
      Example – "defense attorney for liberty and montecito" (a generator sketch follows this slide):
        1-grams: defense | attorney | for | liberty | and | montecito
        2-grams: defense attorney | attorney for | for liberty | liberty and | and montecito
        3-grams: defense attorney for | attorney for liberty | for liberty and | liberty and montecito
        4-grams: defense attorney for liberty | attorney for liberty and | for liberty and montecito
        5-grams: defense attorney for liberty and | attorney for liberty and montecito
      Application:
        - Probabilistic language models that predict the next item in a sequence from the preceding (n − 1) items
        - Widely used in probability, communication theory, computational linguistics and biological sequence analysis
      Advantage:
        - Relatively simple; simply increasing n lets the model store more context
      Disadvantage:
        - The semantic value of the items is not considered
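A small base-R sketch that generates word n-grams for the example sentence:

```r
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  # Slide a window of length n across the word sequence
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

ngrams("defense attorney for liberty and montecito", 2)
# "defense attorney" "attorney for" "for liberty" "liberty and" "and montecito"
```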
  11. Shallow NLP Technique
      Definition:
        - Assign a syntactic label (noun, verb, etc.) to each chunk
        - Knowledge extraction from text through a semantic/syntactic analysis approach
      Application:
        - Taxonomy extraction (predefined terms and entities); entities include people, organizations, locations, times, dates, prices, genes, proteins, diseases and medicines
        - Concept extraction (the main idea or theme)
      Advantage:
        - Less noisy than n-grams
      Disadvantage:
        - Does not specify the role of items in the main sentence
  12. Shallow NLP Technique – worked example
      Sentence: "The driver from Europe crashed the car with the white bumper"
      Convert to lowercase and PoS-tag each 1-gram (a tagging sketch follows this slide):
        the/DT driver/NN from/IN europe/NNP crashed/VBD the/DT car/NN with/IN the/DT white/JJ bumper/NN
        (DT = determiner; NN = noun, singular or mass; IN = preposition or subordinating conjunction; NNP = proper noun, singular; VBD = verb, past tense; JJ = adjective)
      Concept extraction: remove the stop words and retain only the nouns and verbs.
      Bi-grams with nouns/verbs retained (with PoS): driver europe (NN NNP) | europe crashed (NNP VBD) | crashed car (VBD NN) | car white (NN JJ) | white bumper (JJ NN)
      3-grams with nouns/verbs retained (with PoS): driver europe crashed (NN NNP VBD) | europe crashed car (NNP VBD NN) | crashed car white (VBD NN JJ) | car white bumper (NN JJ NN)
      Conclusion:
        - 1-gram: reduced noise, but no clear context
        - Bi-gram and 3-gram: increased context, but some information loss
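A hedged sketch of the PoS-tagging step using the openNLP package (requires Java and the bundled openNLPdata English models):

```r
library(NLP)
library(openNLP)

s <- as.String("The driver from Europe crashed the car with the white bumper")

# Sentence and word annotations first, then PoS tags on top of them
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
a <- annotate(s, Maxent_POS_Tag_Annotator(), a)

words <- subset(a, type == "word")
tags  <- sapply(words$features, `[[`, "POS")
paste(s[words], tags, sep = "/")
# "The/DT" "driver/NN" "from/IN" "Europe/NNP" "crashed/VBD" ...
```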
  13. Deep NLP Technique
      Definition:
        - An extension of shallow NLP: detected relationships are expressed as complex constructions, which retains the context
        - Example relationships: located in, employed by, part of, married to
      Applications:
        - Developing features and representations appropriate for complex interpretation tasks
        - Fraud detection
        - Life science, e.g. predicting activities based on complex RNA-sequence data
      Example: the sentence above can be represented as triples (subject : predicate [modifier] : object) without losing the context:
        driver : crash : car
        driver : crash with : bumper
        driver : be from : Europe
  14. Techniques – Summary

      | Technique   | General steps | Pros | Cons |
      |-------------|---------------|------|------|
      | n-gram      | Convert to lowercase; remove punctuation; remove special characters | Simple technique | Extremely noisy |
      | Shallow NLP | PoS tagging; lemmatization (transform to dictionary base form, e.g. "produce" & "produced" become "produce"); stemming (transform to root word, e.g. "computer" & "computers" become "comput"; "product", "produce" & "produced" become "produc"); chunking (identify the phrasal constituents of a sentence, including noun/verb phrases, and split it into chunks of semantically related words) | Less noisy than n-grams | Computationally expensive; does not specify the internal structure or the role of words in the sentence |
      | Deep NLP    | Generate the syntactic relationship between each pair of words; extract subject, predicate, negation, object and named entities to form triples | Context of the sentence is retained | Sentence-level analysis is too structured |
  15. Appendix
  16. R – Text Mining Process Overview
      Step 1 – Data Assemble: build a corpus from the web, documents and databases
      Step 2 – Data Processing:
        2A – Explore the corpus through EDA
        2B – Convert text to lowercase
        2C – Remove: a) numbers (if required), b) punctuation, c) English stop words, d) own stop words (if required); e) strip whitespace; f) lemmatization/stemming; g) remove sparse terms
        2D – Create the document term matrix
      Step 3 – Visualization: frequency chart, word cloud, correlation plot
      Step 4 – Build model(s): clustering, classification, sentiment analysis
  17. R – Required packages for Text Mining

      | Package        | Category           | Description |
      |----------------|--------------------|-------------|
      | tm             | Text mining        | A framework for text mining applications |
      | topicmodels    | Topic modelling    | Fit topic models with Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM) |
      | wordcloud      | Visualization      | Plot a cloud comparing the frequencies of words across documents |
      | lda            | Topic modelling    | Fit topic models with Latent Dirichlet Allocation |
      | wordnet        | Text mining        | Interface to WordNet, a lexical database of English commonly used in linguistics and text mining |
      | RTextTools     | Text mining        | Automatic text classification via supervised learning |
      | qdap           | Sentiment analysis | Transcript analysis, text mining and natural language processing |
      | tm.plugin.dc   | Text mining        | A plug-in for package tm to support distributed text mining |
      | tm.plugin.mail | Text mining        | A plug-in for package tm to handle mail |
      | textir         | Text mining        | A suite of tools for inference about text documents and associated sentiment |
      | tau            | Text mining        | Utilities for text analysis |
      | textcat        | Text mining        | N-gram based text categorization |
      | SnowballC      | Text mining        | Word stemmer |
      | twitteR        | Text mining        | Provides an interface to the Twitter web API |
      | ROAuth         | Text mining        | Allows users to authenticate via OAuth to the server of their choice (such as Twitter) |
      | RColorBrewer   | Visualization      | Provides palettes for drawing nice maps shaded according to a variable |
      | ggplot2        | Visualization      | Graphing package implemented on top of R, inspired by Leland Wilkinson's seminal work, the Grammar of Graphics |
  18. Example 1 – Obtaining and analyzing tweets
      Objective: R code for analyzing tweets relating to #AAA2011 (text mining, topic modelling, network analysis, clustering and sentiment analysis).
      What does the code do? It details ten steps in the analysis and visualization of the tweets:
        1. Acquiring the raw Twitter data
        2. Calculating some basic statistics on the raw Twitter data
        3. Calculating some basic retweet statistics
        4. Calculating the ratio of retweets to tweets
        5. Calculating some basic statistics about URLs in tweets
        6. Basic text mining for token frequency and token association analysis (word cloud)
        7. Calculating sentiment scores of tweets, including on subsets containing tokens of interest
        8. Hierarchical clustering of tokens based on multiscale bootstrap resampling
        9. Topic modelling the tweet corpus using latent Dirichlet allocation
        10. Network analysis of tweeters based on retweets
      Code source: taken from https://github.com/benmarwick/AAA2011-Tweets, tweaked and extended where required so that it runs cleanly.
      How to run or test the code: from the accompanying Word document, copy the R code (highlighted in yellow) in the given sequence and paste it into your R console.
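A minimal sketch of the acquisition step (step 1) with the twitteR package. The credentials below are placeholders, and Twitter's API requires OAuth authentication; the original code authenticates via ROAuth, while newer twitteR versions provide setup_twitter_oauth:

```r
library(twitteR)

# Placeholder credentials: register an application with Twitter to obtain real ones
setup_twitter_oauth(consumer_key    = "YOUR_KEY",
                    consumer_secret = "YOUR_SECRET",
                    access_token    = "YOUR_TOKEN",
                    access_secret   = "YOUR_TOKEN_SECRET")

tweets <- searchTwitter("#AAA2011", n = 500)  # raw tweets for the hashtag
df <- twListToDF(tweets)                      # one row per tweet
head(df$text)
```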
  19. Example 2 – RTextTools: supervised learning for text classification using an ensemble
      RTextTools is a free, open-source R machine learning package for automatic text classification. The package includes nine algorithms for ensemble classification (SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks and maximum entropy), comprehensive analytics, and thorough documentation. Users can apply n-fold cross-validation to calculate each algorithm's accuracy on their dataset and determine which algorithms to use in their ensemble. (Using a four-ensemble agreement approach, Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document, the machine label matches the human label over 90% of the time; the rate is just 45% when only two algorithms agree.)
      Code source: the code is readily available for download and use from https://github.com/timjurka/RTextTools. It can be run without modification for testing, and is set up so that changes can easily be incorporated to suit our requirements.
      Additional reading: http://www.rtexttools.com/about-the-project.html
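A hedged sketch of the package's core workflow; the tiny labelled dataset is made up for illustration, and real use needs far more documents:

```r
library(RTextTools)

# Hypothetical labelled data: label 1 = tech, label 2 = leisure
texts  <- c(rep("statistics programming analytics models data", 5),
            rep("travel books reading holiday beaches", 5))
labels <- c(rep(1, 5), rep(2, 5))

dtm <- create_matrix(texts, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)
container <- create_container(dtm, labels,
                              trainSize = 1:8, testSize = 9:10,
                              virgin = FALSE)

# Train two of the nine available algorithms and classify the held-out documents
models    <- train_models(container, algorithms = c("SVM", "MAXENT"))
results   <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
```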
  20. References
      - Penn Treebank PoS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
      - Stanford InfoLab, "Finding Similar Items": http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
      - "Triplet Extraction from Sentences": http://ailab.ijs.si/delia_rusu/Papers/is_2007.pdf
      - "Shallow and Deep NLP Processing for Ontology Learning, a Quick Overview": http://azouaq.athabascau.ca/publications/Conferences,%20Workshops,%20Books/%5BBC2%5D_KDW_2010.pdf