This document summarizes a webinar on text mining presented by Dr. A. Muthusamy. The webinar covered an introduction to data types and text mining, the need for text mining in analyzing large amounts of unstructured text data, the objectives and process of text mining, and applications and benefits of text mining such as knowledge discovery, predictive modeling, and information extraction. Challenges of text mining include overcoming database storage limits and handling natural language processing ambiguities.
1. WEBINAR ON TEXT MINING
Presented by
Dr. A. MUTHUSAMY
Head, Associate Professor
Department of Computer Science with Cognitive Systems
Dr. N.G.P. Arts and Science College
Coimbatore-641 048
Tamil Nadu, India
Mobile: +91 8675390072, E-mail: muthusamy@drngpasc.ac.in
2. OUTLINE
2
Dr. NGPASC
COIMBATORE | INDIA
Introduction - Data & Its Types
Need of Text Mining
Objectives of Text Mining
Search Vs Discover
Relevant Disciplines
Process & Tools - Text Mining
Benefits & Applications of TM
Conclusion
4. ABOUT DATA
• Data - collection of raw facts & symbols
• Data is collected from multiple sources
• Data can be of the following forms
Text – Alphabets
Number – 0 to 9
Image – Scanner
Audio – Microphone
Video – Camera
Sensors
6. ABOUT DATA
• Every organization has its own specific data to perform certain operations within it
• Data gives the status of past activities and enables us to make decisions
• Information Processing – converting data into information
7. REAL TIME SCENARIOS
• Many individuals use some form of computing every day, whether they realize it or not
• Swiping a debit card, sending an email or using a mobile phone can all be considered forms of computing
8. DATA VARIETIES
Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Examples: text documents, PDFs, images & videos
Quasi-structured
• Textual data with erratic formats that can be formatted with effort and software tools
• Example: clickstream data
Semi-structured
• Textual data files with an apparent pattern, enabling analysis
• Examples: spreadsheets & XML files
Structured
• Data having a defined data model, format and structure
• Example: databases
11. NEED OF TEXT MINING
• Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
• Most of the information in the public and private sectors is stored electronically in the form of text databases
• Information-intensive business processes demand that we transcend from simple document retrieval to knowledge discovery
12. NEED OF TEXT MINING
Why do cats sit on mats?
13. NEED OF TEXT MINING
• It would be impossible to read all the millions of research articles on the topic yourself
• Text mining helps to filter large amounts of research and extract the relevant information
• It helps to identify that ‘cat’ is the noun, ‘sit’ is the verb and ‘on’ is the preposition
14. NEED OF TEXT MINING
• It is not just a search tool: it can also understand that the ‘cat’ is an animal, ‘sit’ is an action and a ‘mat’ is an object
• It identifies and maps patterns and trends across the millions of articles
Answer: Most of the cats who sit on mats come from cold climates.
16. TEXT MINING
• Knowledge Discovery in Text (KDT) – mining large collections of written resources to generate valuable information from unstructured formats
• Transforms unstructured text into structured data
• Involves algorithms from data mining, machine learning, statistics and NLP to extract high-quality, useful information
• Exploits the information contained in textual documents in various ways, including the discovery of patterns and trends in data, associations among entities, predictive rules etc.
20. SEARCH Vs DISCOVER
Data Retrieval – find records within a structured database
• Database Type : Structured
• Search Mode : Goal-driven
• Atomic entity : Data record
• Information Need : Find a Japanese restaurant in Bangalore that serves vegetarian food
• Query : SELECT * FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' AND has_veg = TRUE
21. SEARCH Vs DISCOVER
Information Retrieval – find relevant information in an unstructured information source (usually text)
• Database Type : Unstructured
• Search Mode : Goal-driven
• Atomic entity : Document
• Information Need : Find a Japanese restaurant in Bangalore
• Query : “Japanese restaurant Bangalore” or Bangalore -> Restaurants -> Japanese
22. SEARCH Vs DISCOVER
Data Mining – discover new knowledge through analysis of data
• Database Type : Structured
• Search Mode : Opportunistic
• Atomic entity : Numbers and dimensions
• Information Need : Show the trend over time in the number of visits to Japanese restaurants in Bangalore
• Query : SELECT date, SUM(visits) FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' GROUP BY date ORDER BY date
23. SEARCH Vs DISCOVER
Text Mining – discover new knowledge through analysis of text
• Database Type : Unstructured
• Search Mode : Opportunistic
• Atomic entity : Language feature or concept
• Information Need : Find the types of food poisoning most often associated with Japanese restaurants
• Query : Rank diseases found associated with “Japanese restaurants”
25. RELEVANT DISCIPLINES
Information Extraction (IE)
• The process of automatically obtaining structured data from unstructured data
Natural Language Processing (NLP)
• The study of human language – a primary component of Artificial Intelligence (AI)
• Computational techniques for the analysis and synthesis of natural language and speech
26. RELEVANT DISCIPLINES
Data Mining (DM)
• The extraction of useful data and hidden patterns from large data sets
• Data mining tools can predict behaviors & future trends, allowing businesses to make better data-driven decisions
Information Retrieval (IR)
• The process of obtaining, from a collection of information resources, those resources relevant to an information need
• IR is an extension of document extraction
• It helps to narrow down the set of records associated with a specific problem
30. PROCESS OF TEXT MINING
Extract Documents
• Documents extracted from different external sources, e.g. web pages
Corpus
• Document storage; handles multiple documents at a time
Text Transformation
• Tokenization, PoS tagging
• Stemming, lemmatization & stop-word removal
Extract Features
• Bag of words
• Vector space
Reduce Dimensions
• TF-IDF
• SVD & LDA
Data Mining
• Supervised
• Unsupervised
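The “Extract Features” stage above is typically a bag-of-words construction. A minimal sketch in Python (illustrative only; `bag_of_words` is a hypothetical helper name, and real pipelines would use a library such as scikit-learn):

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared vocabulary, then represent each document as a
    # vector of raw term counts (one dimension per vocabulary term).
    vocab = sorted({term for doc in docs for term in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = bag_of_words(["cat sits on mat", "cat eats fish"])
print(vocab)    # ['cat', 'eats', 'fish', 'mat', 'on', 'sits']
print(vectors)  # [[1, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

Each vector is one row of the document/term matrix that the later "Reduce Dimensions" stage operates on.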
31. EXAMPLES
Tokenization
• Tokenization is the process of splitting a sentence into individual words (tokens)
Tokenization of a sentence
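The idea can be sketched in a few lines of Python (a minimal regex-based illustration; libraries such as NLTK or spaCy handle punctuation, contractions and Unicode properly):

```python
import re

def tokenize(sentence):
    # Lowercase, then keep only runs of letters/digits as tokens;
    # punctuation is discarded in this simple sketch.
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())

print(tokenize("The cat sits on the mat."))
# ['the', 'cat', 'sits', 'on', 'the', 'mat']
```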
32. EXAMPLES
Stemming and Lemmatization
• Stemming and lemmatization are methods to normalize text documents
• Text normalization helps to improve the vocabulary (by reducing inflectional forms) and the accuracy of many language-modeling tasks
• Each converts a word to its base form
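A crude illustration of the difference in Python (the suffix list and lemma table below are toy assumptions, not a real algorithm; production systems use the Porter stemmer or a WordNet-backed lemmatizer, e.g. via NLTK):

```python
# Toy suffix list and lemma table - illustrative assumptions only.
SUFFIXES = ["ing", "ies", "es", "ed", "s"]
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}

def crude_stem(word):
    # Stemming chops a matching suffix without consulting a dictionary,
    # so the result is not always a valid word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def crude_lemmatize(word):
    # Lemmatization returns the dictionary (base) form, so irregular
    # words need a lookup table rather than suffix rules.
    return LEMMAS.get(word, crude_stem(word))

print(crude_stem("jumping"))   # 'jump'
print(crude_lemmatize("ran"))  # 'run'
```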
33. EXAMPLES
PoS Tagging
• Processing text and assigning a part of speech to each word
• Tags may vary across tagging sets: noun, verb, preposition, determiner etc.
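A toy dictionary-based tagger for the slide's example sentence (the lexicon and tag names are illustrative assumptions; real taggers such as NLTK's `pos_tag` use trained statistical models that also consider context):

```python
# Tiny hand-written lexicon - an assumption for illustration only.
TOY_LEXICON = {"the": "DET", "cat": "NOUN", "sits": "VERB",
               "on": "PREP", "mat": "NOUN"}

def toy_tag(tokens):
    # Look each token up in the lexicon; unknown words get 'UNK'.
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]

print(toy_tag(["the", "cat", "sits", "on", "the", "mat"]))
```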
36. EXAMPLES
Reduce Dimensions
• Each row in the document/term matrix represents a document as a high-dimensional vector (each dimension represents the occurrence of a term), which can be weighted by TF-IDF
• Reducing dimensions saves time & memory
• Transforming vectors from term space to a topic space allows documents on similar topics to associate with each other even when they use different terms
• Example: “pet” and “cat” map to the same topic based on their co-occurrence
37. EXAMPLES
Data Exploration
• By computing term frequencies, a word cloud and a histogram are presented to represent the significance of the terms
• To prove the correctness of the algorithm, include statistical tests within the model
Data Visualization
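The term-frequency counts behind the word cloud and histogram can be computed with Python's standard library (an illustrative sketch; the sample sentence is made up):

```python
from collections import Counter

# Made-up sample tokens; in practice these come from the corpus after
# the transformation steps shown earlier.
tokens = ("text mining discovers patterns in text "
          "mining of text needs preprocessing").split()

freq = Counter(tokens)
# The most frequent terms drive both the histogram bars and the
# relative word sizes in a word cloud.
print(freq.most_common(2))  # [('text', 3), ('mining', 2)]
```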
40. BENEFITS OF TEXT MINING
• Extract all kinds of information
• E-commerce – personalized marketing
• Society – identifying criminal activities
• Companies – better customer relationships
• Scalability & real-time analysis
• Predictive modeling
41. APPLICATIONS OF TEXT MINING
• Search engine technology
• Analysis of survey data
• Spam identification
• Call center routing
• Surveillance
• Public health early warning
42. CONCLUSION
• Overcomes database storage limits
• Extracts relevant information & relationships from unstructured documents
• Empirically proven to be more accurate & highly effective
CHALLENGES
• Domain knowledge integration
• Multilingual text refinement
• Varying concept granularity
• NLP ambiguity