WEBINAR ON TEXT MINING
Presented by
Dr. A. MUTHUSAMY
Head, Associate Professor
Department of Computer Science with Cognitive Systems
Dr. N.G.P. Arts and Science College
Coimbatore-641 048
Tamil Nadu, India
Mobile: +91 8675390072, E-mail: muthusamy@drngpasc.ac.in
TM
OUTLINE
Dr. NGPASC
COIMBATORE | INDIA
Introduction - Data & Its Types
Need of Text Mining
Objectives of Text Mining
Search Vs Discover
Relevant Disciplines
Process & Tools - Text Mining
Benefits & Applications of TM
Conclusion
INTRODUCTION
ABOUT DATA
• Data - collection of raw facts & symbols
• Data is collected from multiple sources
• Data can be of the following forms
  - Text – Alphabets
  - Number – 0 to 9
  - Image – Scanner
  - Audio – Microphone
  - Video – Camera
  - Sensors
ABOUT DATA
• Every organization has its own specific data - to perform certain
operations within it
• It gives the status of past activities and enables us to make decisions
• Information Processing – Converting data into information
REAL TIME SCENARIOS
• Many individuals use some form of computing every day
whether they realize it or not
• Swiping a debit card, sending an email, or using a mobile phone
can all be considered forms of computing
DATA VARIETIES
Unstructured
• Data that has no inherent structure & is usually stored as different types of files
• Examples: Text docs, PDFs, images & videos
Quasi-structured
• Textual data with erratic formats that can be formatted with effort & S/W tools
• Example: Clickstream data
Semi-structured
• Textual data files with an apparent pattern, enabling analysis
• Examples: Spreadsheets & XML files
Structured
• Data having a defined data model, format & structure
• Example: Database
DATA VARIETIES
[Figure: structured vs. unstructured data]
NEED OF TEXT MINING
• Approximately 90% of the world’s data is held in unstructured formats
(source: Oracle Corporation)
• Most of the information in the public and private sectors is stored
electronically in the form of text databases
• Information-intensive business processes demand that we move
beyond simple document retrieval to knowledge discovery
Text Databases
NEED OF TEXT MINING
Why do cats sit on mats?
NEED OF TEXT MINING
• It would be impossible to read all the millions of research articles on
the topic yourself.
• Text mining helps to filter large amounts of research and extract the
relevant information
• It helps to identify that ‘cat’ is the noun, ‘sit’ is the verb and ‘on’ is the
preposition
NEED OF TEXT MINING
• It is not just a search tool; it can also recognize that a ‘cat’ is an
animal, ‘sit’ is an action, and a ‘mat’ is an object.
• It identifies and maps patterns and trends across the millions of articles.
Answer: Most of the cats that sit on mats come from cold climates.
TEXT MINING
• Also called Knowledge Discovery in Text (KDT): analyzing large collections
of written resources to generate valuable information from unstructured formats
• Transforms unstructured text into structured data
• It uses algorithms from DM, ML, Statistics, and NLP to
extract high-quality, useful information
• It exploits the information contained in textual documents in various ways,
including discovery of patterns and trends in data, associations among
entities, predictive rules, etc.
TEXT MINING
Framework to Create Dictionary
SEARCH Vs DISCOVER
Data Retrieval
Find records within a structured database
• Database Type : Structured
• Search Mode : Goal-driven
• Atomic entity : Data Record
• Information Need : Find a Japanese restaurant in Bangalore that serves vegetarian food
• Query : SELECT * FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' AND has_veg = true
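A minimal sketch of this kind of goal-driven record lookup, using Python's built-in sqlite3 module. The table layout, restaurant names and rows are hypothetical, invented only for illustration; SQLite has no boolean type, so has_veg is assumed to be stored as 0 or 1.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Sakura House", "Bangalore", "Japanese", 1),
        ("Tokyo Grill", "Bangalore", "Japanese", 0),
        ("Spice Route", "Bangalore", "Indian", 1),
        ("Kyoto Corner", "Chennai", "Japanese", 1),
    ],
)

# Goal-driven search: every result is one data record that meets all conditions.
query = """
    SELECT name FROM restaurants
    WHERE city = ? AND type = ? AND has_veg = 1
"""
matches = [row[0] for row in conn.execute(query, ("Bangalore", "Japanese"))]
print(matches)
```

The values are passed as parameters (the ? placeholders) rather than pasted into the SQL string, which also takes care of quoting the text values.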
SEARCH Vs DISCOVER
Information Retrieval
Find relevant information in an unstructured information source (usually text)
• Database Type : Unstructured
• Search Mode : Goal-driven
• Atomic entity : Document
• Information Need : Find a Japanese restaurant in Bangalore
• Query : “Japanese restaurant Bangalore” or Bangalore -> Restaurants -> Japanese
SEARCH Vs DISCOVER
Data Mining
Discover new knowledge through analysis of data
• Database Type : Structured
• Search Mode : Opportunistic
• Atomic entity : Numbers and Dimensions
• Information Need : Show the trend over time in the number of visits to Japanese restaurants in Bangalore
• Query : SELECT date, SUM(visits) FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' GROUP BY date ORDER BY date
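A small sketch of the trend query, again with sqlite3. A trend over time needs one total per date, so this version groups the rows by date before summing; the table layout and visit counts are hypothetical, made up for illustration.

```python
import sqlite3

# Hypothetical monthly visit counts, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants "
    "(name TEXT, city TEXT, type TEXT, date TEXT, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?, ?)",
    [
        ("Sakura House", "Bangalore", "Japanese", "2020-01", 120),
        ("Tokyo Grill", "Bangalore", "Japanese", "2020-01", 80),
        ("Sakura House", "Bangalore", "Japanese", "2020-02", 150),
        ("Tokyo Grill", "Bangalore", "Japanese", "2020-02", 90),
        ("Spice Route", "Bangalore", "Indian", "2020-01", 300),
    ],
)

# One summed value per date, ordered chronologically.
query = """
    SELECT date, SUM(visits) FROM restaurants
    WHERE city = ? AND type = ?
    GROUP BY date
    ORDER BY date
"""
trend = conn.execute(query, ("Bangalore", "Japanese")).fetchall()
print(trend)
```

Without GROUP BY, SUM(visits) would collapse everything into a single total and no trend would be visible.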
SEARCH Vs DISCOVER
Text Mining
Discover new knowledge through analysis of text
• Database Type : Unstructured
• Search Mode : Opportunistic
• Atomic entity : Language feature or concept
• Information Need : Find the types of food poisoning most often associated with Japanese restaurants
• Query : Rank the diseases found to be associated with “Japanese restaurants”
RELEVANT DISCIPLINES
Information Extraction (IE)
• The process of automatically obtaining structured data from unstructured
data
Natural Language Processing (NLP)
• Study of human language; a core component of Artificial Intelligence (AI)
• Computational techniques for the analysis & synthesis of natural language &
speech
RELEVANT DISCIPLINES
Data Mining (DM)
• It refers to the extraction of useful information and hidden patterns from large
data sets.
• Data mining tools can predict behaviors & future trends, allowing
businesses to make better data-driven decisions
Information Retrieval (IR)
• It is the process of obtaining information system resources that are
relevant to an information need from a collection of those resources
• IR is an extension of document extraction
• It helps to narrow down the set of records that are associated with a
specific problem
RELEVANT DISCIPLINES
Example: Named Entity Recognition
PROCESS - TM
PROCESS OF TEXT MINING
Extract Docs.
• Extract documents from different external sources
• Web pages
Corpus
• Document storage
• Handle multiple documents at a time
Text Transformation
• Tokenization, PoS Tagging
• Stemming, Lemmatization & Stop words
Extract Feature
• Bag of Words
• Vector space
Reduce Dimension
• TF - IDF
• SVD & LDA
Data Mining
• Supervised
• Unsupervised
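To make the feature-extraction stage concrete, here is a minimal bag-of-words sketch in plain Python. It assumes a toy two-document corpus and a simple lowercase, letters-only tokenizer; the function name bag_of_words is hypothetical, and real pipelines would normally rely on an NLP library.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, document-term matrix of raw counts)."""
    # Lowercase and keep alphabetic tokens only (a deliberately simple tokenizer).
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Sorted vocabulary gives every term a fixed column position.
    vocab = sorted({t for toks in tokenized for t in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = ["The cat sat on the mat", "The dog sat on the log"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix is one document as a vector in term space, which is the input that the dimension-reduction and data-mining stages work on.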
EXAMPLES
Tokenization
• Tokenization is just the process of splitting a sentence into words
Tokenization of a sentence
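A minimal tokenization sketch using only Python's re module. The pattern is an assumption chosen for illustration: it keeps words (including a simple apostrophe form such as "don't") and treats each punctuation mark as its own token.

```python
import re

def tokenize(sentence):
    # A word with an optional apostrophe part, or one non-space punctuation mark.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)

tokens = tokenize("Why do cats sit on mats?")
print(tokens)
```

Splitting on whitespace alone would leave "mats?" as a single token, which is why punctuation is matched separately here.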
EXAMPLES
Stemming and Lemmatization
• Stemming and Lemmatization are methods to normalize text documents
• Text normalization shrinks the vocabulary (by reducing inflectional
forms) and improves the accuracy of many language modeling tasks
• Both convert a word to its base form: stemming strips suffixes, while
lemmatization maps the word to its dictionary form
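The difference between the two can be sketched with a toy example. The suffix list and the LEMMAS lookup table below are hypothetical and tiny; this is not the Porter algorithm or a real lemmatizer, only an illustration of suffix stripping versus dictionary lookup.

```python
def stem(word):
    """Crude suffix stripping (illustrative only, NOT the Porter stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the first matching suffix, keeping at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"sat": "sit", "sitting": "sit", "mice": "mouse", "better": "good"}

def lemmatize(word):
    """Dictionary lookup first; fall back to the crude stemmer."""
    word = word.lower()
    return LEMMAS.get(word, stem(word))

for w in ("cats", "sitting", "mice"):
    print(w, stem(w), lemmatize(w))
```

The stemmer may return a non-word (for example "sitt"), whereas the lemmatizer returns a dictionary form whenever the word is in its table.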
EXAMPLES
PoS Tagging
• Processing text and assigning a part of speech to each word
• Tags vary between tag sets, e.g., Noun, Verb, Preposition,
Determiner, etc.
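A toy lookup-based tagger illustrates the idea. The LEXICON below is hypothetical and covers only the cat-and-mat example; real taggers use trained statistical or neural models and much larger tag sets.

```python
# Hypothetical mini-lexicon, invented for illustration only.
LEXICON = {
    "the": "Determiner", "a": "Determiner",
    "cat": "Noun", "mat": "Noun",
    "sit": "Verb", "sits": "Verb",
    "on": "Preposition",
}

def pos_tag(tokens, default="Noun"):
    """Assign each token the tag from the lexicon, or a default tag if unknown."""
    return [(t, LEXICON.get(t.lower(), default)) for t in tokens]

tagged = pos_tag("The cat sits on the mat".split())
for word, tag in tagged:
    print(word, tag)
```

A pure lookup cannot handle words whose tag depends on context, which is the main reason practical taggers are trained on annotated text.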
EXAMPLES
Reduce Dimensions
• Each row in the document-term matrix represents a document as a
high-dimensional vector (each dimension represents the occurrence of a
term)
• Term occurrence can be weighted by tf-idf
• Reasons to reduce dimensions:
  - To reduce the time & memory required
  - To transform the vectors from term space to topic space, which allows
documents on similar topics to be associated with each other even when they use
different terms
• Example: "pet" and "cat" are mapped to the same topic based on their
co-occurrence
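A minimal tf-idf sketch in plain Python. It assumes one common unsmoothed variant, tf = count / document length and idf = log(N / document frequency); libraries often use smoothed or normalized versions, so exact numbers will differ. The corpus is a made-up toy example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the pet sat on the rug".split(),
    "the dog ran".split(),
]
w = tf_idf(docs)
for term, weight in sorted(w[0].items()):
    print(f"{term:<4}{weight:.3f}")
```

A term that occurs in every document (here "the") gets weight zero, while a term confined to one document (here "cat") gets the highest weight, so tf-idf highlights the terms that distinguish a document.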
EXAMPLES
Data Exploration
• Term frequencies are computed, and word clouds & histograms are used to
show the significance of the terms
• To check the correctness of the algorithm, include statistical tests within the
model
Data Visualization
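A small sketch of term-frequency exploration with a text histogram, standing in for the word cloud or chart a plotting library would draw. The stop-word list and the sample sentence are hypothetical and kept tiny on purpose.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "on", "a", "and", "of"}

def term_frequencies(text):
    """Count the terms that remain after lowercasing and stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)

def text_histogram(freqs, top_n=3):
    """Render the top_n terms as rows of '#' marks, one mark per occurrence."""
    lines = []
    for term, count in freqs.most_common(top_n):
        lines.append(f"{term:<6}{'#' * count}")
    return "\n".join(lines)

freqs = term_frequencies("the cat sat on the mat and the cat slept on the mat")
print(text_histogram(freqs))
```

The same counts could be handed to a plotting or word-cloud library; the bar length here plays the role of font size in a word cloud.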
TOOLS
TEXT ANALYTICS
BENEFITS OF TEXT MINING
• Extract all kinds of information
• E-commerce – personalized marketing
• Society – identifying criminal activities
• Companies – better customer relationships
• Scalability & real-time analysis
• Predictive modeling
APPLICATIONS OF TEXT MINING
Search engine technology
Analysis of survey data
Spam identification
Call center routing
Surveillance
Public health early warning
CONCLUSION
• Overcomes database storage limits
• Extracts relevant information & relationships from
unstructured docs
• Empirically proven as more accurate & highly effective
CHALLENGES
• Domain knowledge integration
• Multilingual text refinement
• Varying concept granularity
• NLP ambiguity
Queries