Analysing Demonetisation through Text Mining using Live Twitter Data!

Introduction to Text Mining and Analytics
Copyright © Ivy Professional School - 2009-10 (All Rights Reserved)

Wiki Definition
Text mining, also referred to as text data mining and roughly equivalent
to text analytics, is the process of deriving high-quality information from text.
Source of Text Data
Organizations today encounter textual data while running their day to day
business. The source of the data could be electronic text, call center logs,
social media, corporate documents, research papers, application forms,
service notes, emails, etc.

Unstructured Data
• “80% of business-relevant information originates in unstructured form, primarily
text.” (a quote in 2008)
• “Based on the industry’s current estimations, unstructured data will occupy 90%
of the data by volume in the entire digital space over the next decade.” (a quote in
2010)

Text Mining and Analytics
• Text analytics uses algorithms for turning free-form text (unstructured
data) into data that can be analyzed (structured data) by applying
statistical and machine learning methods, as well as Natural Language
Processing (NLP) techniques.
• Once structured data is obtained, the usual data mining and analytic
techniques can be applied.
• So the most significant part of Text Mining/Analytics is how to convert
texts into structured data.
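
One way to picture this conversion is a document-term table: one row per text, one column per word, each cell a count. The sketch below is only an illustration under simple assumptions (whitespace tokenization, made-up example sentences, and a hypothetical helper named `document_term_matrix`); it is not a method taken from the slides.

```python
from collections import Counter

def document_term_matrix(docs):
    """Turn free-form texts into a table: one row per document, one column per term."""
    # Naive tokenization: lowercase and split on whitespace (an assumption for brevity).
    tokenized = [doc.lower().split() for doc in docs]
    # The sorted vocabulary fixes the column order of the table.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Made-up example documents.
docs = ["cash is king", "cash is scarce", "digital payments rise"]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now an ordinary numeric record, so clustering or classification methods can be applied to it as they would be to any other structured data.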

Text Mining Paradigm

Text Mining Process Pipeline
• The process is essentially a linear pipeline.
• Feedback from the results of text mining might
lead to changes in earlier steps (preprocessing, parsing, or even data
collection).

Converting Text into Structured Data
• A huge amount of preprocessing is required to convert text.
– Cleaning up ‘dirty’ texts
• Remove mark-up tags from web documents, encoded symbols such as emoticons/emoji,
extraneous strings such as “AHHHHHHHHHHHHHHHHHHHHH”
• Correct misspelled words.
– Tokenization
• Remove punctuation, normalize upper/lower case, etc.
– Sentence splitting
– Identifying multi-word expressions (e.g. “as well as”, “radio wave”) and Named Entities
(e.g. “Allied Waste”, “Super Mario Bros.”)
• Adding other linguistic information
– Parts-of-speech (e.g. noun, verb, adjective, adverb, preposition)
• Filtering non-significant/irrelevant words – to reduce dimensions
– Filtering non-content words using a stop-list (e.g. “the”, “a”, “an”, “and”)
– Combining tokens by stemming/lemmatizing or using synonyms
• Other NLP features/techniques, e.g. n-grams, syntax trees
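
Several of the steps above (cleaning, tokenization, stop-list filtering, n-grams) can be sketched with the Python standard library alone. The stop-list, the sample string, and all function names below are assumptions made for illustration; spelling correction, sentence splitting, and stemming/lemmatizing are left out because they normally rely on a dedicated NLP library.

```python
import re

# Hypothetical stop-list; real stop-lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to"}

def clean(text):
    """Strip mark-up tags, non-ASCII symbols such as emoji, and long character runs."""
    text = re.sub(r"<[^>]+>", " ", text)             # remove HTML-style tags
    # Crude: this also drops legitimate non-ASCII letters, not only emoji.
    text = text.encode("ascii", "ignore").decode()
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # collapse runs of 3+ identical characters
    return text

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop non-content words listed in the stop-list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up 'dirty' input; \U0001F621 is an angry-face emoji.
raw = "<p>AHHHHHHH the ATM queue is long and the bank is closed \U0001F621</p>"
tokens = remove_stop_words(tokenize(clean(raw)))
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

Filtering the stop words before building bigrams is a design choice: it shortens the vocabulary, but it also creates pairs such as ("long", "bank") that were not adjacent in the original sentence.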

Text Mining Applications
• Text Clustering
• Trend Analysis
[Figure: Trend for the term “text mining” from Google Trends]
• Spam filtering

Text Mining – Sentiment Analysis
• The field of sentiment analysis deals with the categorization (or
classification) of opinions expressed in textual documents.
Sample Tweet:
14 days after #DeMonetisation, PM seeks opinion instead of
addressing the pain & anguish. This is called-Arrogate,
subjugate & dictate!
Two months after RBI Governor changes, #DeMonetisation
happens. Can you imagine what will happen after CJI Thakur
retires on 4 January 2017?
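
A very simple form of this categorization counts words from a positive and a negative word list. The sketch below assumes two tiny hypothetical lexicons and a helper named `sentiment`; it is only one possible approach, and real systems typically use much larger lexicons or trained classifiers.

```python
import re

# Hypothetical mini-lexicons; real ones contain thousands of scored words.
POSITIVE = {"good", "great", "support", "relief", "welcome"}
NEGATIVE = {"pain", "anguish", "chaos", "suffer", "dictate"}

def sentiment(text):
    """Classify a text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

# First sample tweet from the slide.
tweet = ("14 days after #DeMonetisation, PM seeks opinion instead of "
         "addressing the pain & anguish. This is called-Arrogate, "
         "subjugate & dictate!")
label, score = sentiment(tweet)
print(label, score)
```

Word counting of this kind ignores negation and sarcasm, so it should be read as a rough baseline only.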

Typical Text Pre-processing Methods
• Given a raw text (in a corpus), we typically pre-process the text by
applying one or more of the following methods:
1. Part-Of-Speech (POS) tagging – assign a POS to every word in a
sentence in the text
2. Named Entity Recognition (NER) – identify named entities (proper
nouns and some common nouns which are relevant in the domain of
the text)
3. Information Extraction (IE) – identify relations between phrases, and
extract the relevant/significant “information” described in the text

1. Part-Of-Speech (POS) Tagging
• POS tagging is a process of assigning a POS or lexical class marker to each
word in a sentence (and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
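
The input/output pair above can be reproduced with a toy lookup tagger. The word-to-tag dictionary and the fallback tag are assumptions made for this sketch; practical taggers are statistical models that use the surrounding words to choose a tag.

```python
# Hypothetical toy lexicon mapping each word to a single tag.
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}

def pos_tag(sentence, lexicon=LEXICON, default="N"):
    """Tag each whitespace-separated word by dictionary lookup."""
    # Unknown words fall back to the default tag (an arbitrary choice here).
    return [(word, lexicon.get(word.lower(), default)) for word in sentence.split()]

tagged = pos_tag("the lead paint is unsafe")
formatted = " ".join(f"{word}/{tag}" for word, tag in tagged)
print(formatted)  # the/Det lead/N paint/N is/V unsafe/Adj
```

A pure lookup cannot resolve ambiguity: “lead” would still be tagged N in “they lead the team”, which is why context-aware taggers are used in practice.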
2. Named Entity Recognition (NER)
• NER processes a text and identifies the named entities in each sentence,
e.g. “U.N. official Ekeus heads for Baghdad.”
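
As a minimal sketch, the entities in this sentence can be found by matching against a list of known names (a gazetteer). The gazetteer entries, the entity type labels, and the helper `find_entities` are all assumptions made for illustration; real NER systems also recognize names they have not seen before.

```python
# Hypothetical gazetteer: known names and their entity types.
GAZETTEER = {"U.N.": "ORGANIZATION", "Ekeus": "PERSON", "Baghdad": "LOCATION"}

def find_entities(sentence, gazetteer=GAZETTEER):
    """Return (name, type, start_offset) for every gazetteer name found in the sentence."""
    found = []
    for name, entity_type in gazetteer.items():
        start = sentence.find(name)
        while start != -1:
            found.append((name, entity_type, start))
            start = sentence.find(name, start + len(name))
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda item: item[2])

entities = find_entities("U.N. official Ekeus heads for Baghdad.")
for name, entity_type, start in entities:
    print(name, entity_type, start)
```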

3. Information Extraction (IE)
• Identify specific pieces of information (data) in an
unstructured or semi-structured text
• Transform unstructured information in a corpus of texts or
web pages into a structured database (or templates)
• Applied to various types of text, e.g.
– Newspaper articles
– Scientific articles
– Web pages
– etc.
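
A small pattern-based sketch shows the idea of filling a structured template from free text. The regular expression, the template fields, and the helper `extract_record` are hypothetical and were written only for the second sample tweet shown earlier; real IE systems rely on far more general patterns or trained models.

```python
import re

# Hypothetical pattern: "<Capitalized Name> retires on <day> <Month> <year>".
RETIREMENT = re.compile(
    r"(?P<person>[A-Z]\w*(?: [A-Z]\w*)*) retires on "
    r"(?P<date>\d{1,2} [A-Z][a-z]+ \d{4})"
)

def extract_record(text):
    """Fill a small template (hashtags plus an optional retirement event) from a text."""
    record = {"hashtags": re.findall(r"#(\w+)", text), "event": None}
    match = RETIREMENT.search(text)
    if match:
        record["event"] = {
            "type": "retirement",
            "person": match.group("person"),
            "date": match.group("date"),
        }
    return record

tweet = ("Two months after RBI Governor changes, #DeMonetisation "
         "happens. Can you imagine what will happen after CJI Thakur "
         "retires on 4 January 2017?")
record = extract_record(tweet)
print(record)
```

The resulting dictionary could be stored as one row of a database table, which is the "structured database (or templates)" target described above.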

Overview
• Tokenization
• Bag of words
• N-Grams
• TF*IDF
• Topic modeling LDA (Latent Dirichlet allocation)
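
To make the bag-of-words and TF*IDF items concrete, the sketch below weights each term by its frequency within a document (TF) times the log of the inverse fraction of documents containing it (IDF). This is one common variant among several; the example documents and the helper `tf_idf` are assumptions, and LDA is not covered because it needs a dedicated library.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # the bag of words for this document
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up, already tokenized documents.
docs = [
    "cash queue at bank".split(),
    "cash shortage at atm".split(),
    "digital payments rise".split(),
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In the first document, “queue” receives a higher weight than “cash” because “cash” also occurs in the second document and is therefore less distinctive.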
