Introduction to Text Mining and Analytics
Copyright © Ivy Professional School - 2009-10 (All Rights Reserved)
Wiki Definition
Text mining, also referred to as text data mining and roughly equivalent
to text analytics, is the process of deriving high-quality information from text.
Source of Text Data
Organizations today encounter textual data while running their day-to-day
business. Sources of this data include electronic text, call center logs,
social media, corporate documents, research papers, application forms,
service notes, emails, etc.
Unstructured Data
• “80% of business-relevant information originates in unstructured form, primarily
text.” (quoted in 2008)
• “Based on the industry’s current estimations, unstructured data will occupy 90%
of the data by volume in the entire digital space over the next decade.” (quoted in
2010)
Text Mining and Analytics
• Text analytics uses algorithms for turning free-form text (unstructured
data) into data that can be analyzed (structured data) by applying
statistical and machine learning methods, as well as Natural Language
Processing (NLP) techniques.
• Once structured data is obtained, the same mining and analytic
techniques used for any structured data can be applied.
• So the most significant part of text mining/analytics is converting
text into structured data.
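One common way to turn text into structured data is a bag-of-words count matrix, with one row per document and one column per vocabulary word. The sketch below is only illustrative: the two documents are invented, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Two invented example documents (not from any real corpus).
docs = [
    "text mining finds patterns in text",
    "analytics finds patterns in data",
]

# Naive whitespace tokenization; a real pipeline would clean the text first.
tokenized = [doc.split() for doc in docs]

# The vocabulary defines the columns of the structured table.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Each row holds the word counts of one document.
matrix = [[Counter(doc)[word] for word in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now an ordinary numeric record, so standard clustering or classification methods can be applied to it.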
Text Mining Paradigm
Text Mining Process Pipeline
• The process is essentially a linear pipeline.
• Feedback from the results of text mining might
affect earlier preprocessing steps (parsing, or even data
collection).
Converting Text into Structured Data
• A huge amount of preprocessing is required to convert text.
– Cleaning up ‘dirty’ texts
• Remove mark-up tags from web documents, encoded symbols such as emoticons/emojis,
extraneous strings such as “AHHHHHHHHHHHHHHHHHHHHH”
• Correct misspelled words.
– Tokenization
• Remove punctuation, normalize upper/lower case, etc.
– Sentence splitting
– Identifying multi-word expressions (e.g. “as well as”, “radio wave”) and Named Entities
(e.g. “Allied Waste”, “Super Mario Bros.”)
• Adding other linguistic information
– Parts-of-speech (e.g. noun, verb, adjective, adverb, preposition)
• Filtering non-significant/irrelevant words – to reduce dimensions
– Filtering non-content words using a stop-list (e.g. “the”, “a”, “an”, “and”)
– Combining tokens by stemming/lemmatizing or using synonyms
• Other NLP features/techniques, e.g. n-grams, syntax trees
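Several of the steps above (cleaning, tokenization, stop-word filtering, n-grams) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: the stop-list, the sample string, and the function names are invented here, and spelling correction, stemming and multi-word expressions are left out.

```python
import re

# Hypothetical stop-list for illustration; real stop-lists are much longer.
STOP_WORDS = {"the", "a", "an", "and"}

def clean(text):
    """Strip mark-up tags and collapse runs of 3+ repeated characters."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML/XML-style tags
    text = re.sub(r"(.)\1{2,}", r"\1", text)  # "AHHHHHHH" becomes "AH"
    return text

def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Filter out non-content words using the stop-list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented sample input.
raw = "<p>The radio wave and the signal are AHHHHHHH strong!</p>"
tokens = remove_stop_words(tokenize(clean(raw)))
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

Note that collapsing repeated characters is a crude heuristic: it would also alter legitimate strings with three identical characters in a row, so a real pipeline would need a more careful rule.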
Text Mining Applications
• Text Clustering
• Trend Analysis
Trend for the Term “text mining” from Google Trends
• Spam filtering
Text Mining – Sentiment Analysis
• Sentiment Analysis
The field of sentiment analysis deals
with the categorization (or classification)
of opinions expressed in textual
documents.
Sample Tweets:
• 14 days after #DeMonetisation, PM seeks opinion instead of
addressing the pain & anguish. This is called-Arrogate,
subjugate & dictate!
• Two months after RBI Governor changes, #DeMonetisation
happens. Can you imagine what will happen after CJI Thakur
retires on 4 January 2017?
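As a rough illustration of sentiment classification, the sketch below scores a text by counting words from two tiny word lists. The lexicons are hypothetical and hand-picked for this example; practical systems rely on much larger lexicons or on trained classifiers, and this is not presented as the method used in this deck.

```python
import re

# Hypothetical mini-lexicons, chosen only for illustration.
POSITIVE = {"good", "great", "relief", "support", "welcome"}
NEGATIVE = {"pain", "anguish", "arrogate", "subjugate", "dictate"}

def sentiment(text):
    """Label text by the difference between positive and negative word counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweet = (
    "14 days after #DeMonetisation, PM seeks opinion instead of "
    "addressing the pain & anguish. This is called-Arrogate, "
    "subjugate & dictate!"
)
label = sentiment(tweet)
print(label)
```

A word-count approach like this ignores negation and sarcasm, which is one reason classification models are usually preferred on tweets.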
Typical Text Pre-processing Methods
• Given a raw text (in a corpus), we typically pre-process the text by
applying one or more of the following methods:
1. Part-Of-Speech (POS) tagging – assign a POS to every word in a
sentence in the text
2. Named Entity Recognition (NER) – identify named entities (proper
nouns and some common nouns which are relevant in the domain of
the text)
3. Information Extraction (IE) – identify relations between phrases, and
extract the relevant/significant “information” described in the text
1. Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical class marker to each
word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
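The input/output pair above can be reproduced with a toy lookup tagger. The lexicon below is hypothetical and covers only this sentence; real taggers are statistical and use context, for instance to decide whether “lead” is a noun or a verb.

```python
# Hypothetical toy lexicon covering only the example sentence.
LEXICON = {
    "the": "Det",
    "lead": "N",
    "paint": "N",
    "is": "V",
    "unsafe": "Adj",
}

def pos_tag(sentence, default="N"):
    """Tag each whitespace-separated word by lexicon lookup (no context used)."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]

tagged = pos_tag("the lead paint is unsafe")
output = " ".join(f"{word}/{tag}" for word, tag in tagged)
print(output)
```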
2. Named Entity Recognition (NER)
• NER processes a text and identifies the named entities in each sentence,
e.g. “U.N. official Ekeus heads for Baghdad.”
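A minimal way to picture NER on the example sentence is a gazetteer (name list) lookup. The gazetteer entries and the ORG/PER/LOC labels below are assumptions made for this sketch; practical NER systems use trained sequence models and handle names they have never seen.

```python
# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {"U.N.": "ORG", "Ekeus": "PER", "Baghdad": "LOC"}

def find_entities(sentence):
    """Return (name, type) pairs for tokens found in the gazetteer."""
    entities = []
    for token in sentence.split():
        # Keep tokens such as "U.N." intact; otherwise drop trailing punctuation.
        candidate = token if token in GAZETTEER else token.rstrip(".,!?")
        if candidate in GAZETTEER:
            entities.append((candidate, GAZETTEER[candidate]))
    return entities

entities = find_entities("U.N. official Ekeus heads for Baghdad.")
print(entities)
```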
3. Information Extraction (IE)
• Identify specific pieces of information (data) in an
unstructured or semi-structured text
• Transform unstructured information in a corpus of texts or
web pages into a structured database (or templates)
• Applied to various types of text, e.g.
– Newspaper articles
– Scientific articles
– Web pages
– etc.
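Filling a template from free text can be sketched with a hand-written pattern. The example below pulls a (person, event, date) record out of a sentence from the sample tweets shown earlier; the regular expression and the field names are assumptions for illustration, and real IE systems combine many such patterns or learned extractors.

```python
import re

# Hypothetical pattern: capitalised name, the phrase "retires on", then a date
# written as day, month name, year.
RETIREMENT = re.compile(
    r"(?P<person>[A-Z]\w*(?: [A-Z]\w*)*) retires on "
    r"(?P<date>\d{1,2} [A-Z][a-z]+ \d{4})"
)

def extract_retirements(text):
    """Return one structured record per retirement statement found in text."""
    return [
        {"person": m.group("person"), "event": "retirement", "date": m.group("date")}
        for m in RETIREMENT.finditer(text)
    ]

text = "Can you imagine what will happen after CJI Thakur retires on 4 January 2017?"
records = extract_retirements(text)
print(records)
```

The resulting list of dictionaries is the kind of structured record that could be loaded into a database table.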
Overview
• Tokenization
• Bag of words
• N-Grams
• TF*IDF
• Topic modeling with LDA (Latent Dirichlet Allocation)

Analysing Demonetisation through Text Mining using Live Twitter Data!
