1. Copyright 2003-4, SPSS Inc. 1
An Introduction to Text
Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada
2. Copyright 2003-4, SPSS Inc. 2
Agenda
Introductions
An Overview of Document Warehousing
Understanding Unstructured Text
Concept Extraction
Text Mining
Data Mining
Demonstration
3. Copyright 2003-4, SPSS Inc. 3
Tim Daciuk
Background
Social research
Survey research
SPSS
25 years working with the product
12 years working with the company
5 years working with text analysis
Prior history
Consulting
Education
4. Copyright 2003-4, SPSS Inc. 4
Predictive analysis helps connect data to effective
action by drawing reliable conclusions about
current conditions and future events.
— Gareth Herschel, Research Director, Gartner Group
Predictive Analytics: Defined
5. Copyright 2003-4, SPSS Inc. 5
SPSS At A Glance
Leadership
Market leader in Predictive Analytics
Focus on online & offline customer data acquisition and analysis
Stability
Founded in 1968
30+ year heritage in analytic technologies
Proven track record
250,000+ customers worldwide
NASDAQ: SPSS
Analytics standard
80% of Fortune 500 are SPSS customers
80% plus market share in Survey & Market Research sector
Ranked #1 Data Mining solution by KDnuggets
7. Copyright 2003-4, SPSS Inc. 7
Unstructured Data Management
Text Mining is a subset of Unstructured Data
Management.
UDM can be broken down into:
Content and Document Management
Search and Retrieval
XML database and tools
Categorization, Classification, and Visualization
8. Copyright 2003-4, SPSS Inc. 8
80% of Data is Unstructured
Database notes:
Call center transcripts
Other CRM
Email
Open-ended survey responses
Web pages
NewsGroups
Documents themselves
Competitive information
9. Copyright 2003-4, SPSS Inc. 9
Applications for Text Analysis
Surveys
‘Reading’ email
Call centre data
Comment data
Abstracts
Document management
Corporate history
Thematic understanding of website
10. Copyright 2003-4, SPSS Inc. 10
Data Warehouse vs. Document Warehouse
Data warehouse
Who, what, when, where, how much
Internally focused
Operational information
Rarely include external information
Document warehouse
Why
May not be internally focused
May contain a range of information
Often integrate external information
11. Copyright 2003-4, SPSS Inc. 11
Document Warehouse Features
There is no single document structure or document
type
Documents are drawn from multiple sources
Essential features of documents are automatically
extracted and explicitly stored in the document
warehouse
Document warehouses are designed to integrate
semantically related documents
12. Copyright 2003-4, SPSS Inc. 12
Building the Document Warehouse
Identify Sources
Retrieve Document
Pre-process Document
Text Analysis
Compile Metadata
13. Copyright 2003-4, SPSS Inc. 13
Predict, Impact, Deploy
[Diagram: customer data (attitudes, actions, attributes) arrives through data collection (text, surveys, web, channel, operational systems). NLP turns text into concepts, which support concept maps, clustering, categorization, trending, information extraction, and prediction, delivered through a business UI and an expert UI. Business user outcomes: attract, grow, retain, fraud.]
14. Copyright 2003-4, SPSS Inc. 14
The Building Blocks of Language
Morphology
Syntax
Semantics
Phonology
Pragmatics
15. Copyright 2003-4, SPSS Inc. 15
Morphology
Understanding words
Stems
Affixes
Prefix
Suffix
Inflectional elements
Reducing complexity of
analysis
Reduces complexity of
representation
Supports text mining
Example: in- (prefix) + dispute (stem) + -able (suffix)
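The prefix, stem, and suffix decomposition on this slide can be mimicked with a toy affix stripper. This is a minimal sketch and not the method used by any SPSS product: the affix lists and the `split_affixes` function are assumptions made for illustration, and a real morphological analyzer relies on far larger lexicons and rules.

```python
# Toy affix stripper. PREFIXES and SUFFIXES are illustrative assumptions,
# not a real linguistic resource, and the function will over-strip some words.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed", "s")

def split_affixes(word, min_stem=3):
    """Return (prefix, stem, suffix); an empty string marks a missing affix."""
    word = word.lower()
    prefix = suffix = ""
    for p in PREFIXES:
        # Strip at most one prefix, and only if enough of the word remains.
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        # Strip at most one suffix under the same length constraint.
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix

for w in ("indisputable", "disputed", "disputing"):
    print(w, split_affixes(w))
```

All three words reduce to the same truncated stem (`disput`), which is the sense in which morphology reduces the complexity of representation: several surface forms collapse into one unit for counting and mining.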
16. Copyright 2003-4, SPSS Inc. 16
Syntax
The Bank of Canada will curb inflation with higher
interest rates
Sentence
  Noun phrase
    Noun: The Bank of Canada
  Verb phrase
    Aux: will
    Verb: curb
    Noun: inflation
    Prepositional phrase: with
      Noun phrase
        Adjective: higher
        Noun: interest rates
17. Copyright 2003-4, SPSS Inc. 17
Semantics
The meaning of it all
Approaches to meaning
Semantic networks
Deductive logic
Rule-based systems
Useful for classification
18. Copyright 2003-4, SPSS Inc. 18
Problems with NLP
Limitations of Natural Language Processing
Correctly identifying the role of noun phrases
Representing abstract concepts
Classifying synonyms
Representing the number of concepts
19. Copyright 2003-4, SPSS Inc. 19
Problems with NLP
Limitations of technology
Language specific designs are required
Classification speed
Classifying hybrid words and sentences
20. Copyright 2003-4, SPSS Inc. 20
Underlying Technology is Based on Linguistics
The Linguistic Approach:
Does not treat a document as a bag of words
Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language
dependent.
21. Copyright 2003-4, SPSS Inc. 21
From Text to Concepts
Inputs: Morphology, Syntax, Semantics, Statistics
Linguistic Terminology Extractor
Accurate
•Compound words
•Proper nouns
•Figures
•Named entities
•Domain specifics
Examples: Inserm; merck & co…; tnp-470; glut-4…; factor receptor; inhibitory effect; D. John Paganoni, ..; Positive/Negative opinion…; London, Paris…
Scalable
•Speed: 1GB/hour
•Multiple formats: PDF, MS Office, text…
•Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese
Customizable
•SPSS dictionaries
•User dictionaries
•Extraction rules
•Extraction patterns
Examples: Names, Orgs…; MeSH, genes...; Predicates; Synonyms, stop words..
Discovery-Oriented
•Known terms
•Unknown terms
•New terms
•Trends
22. Copyright 2003-4, SPSS Inc. 22
From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
LexiQuest Mine: Discover concepts, relationships and trends
LexiQuest Categorize: Understand documents and assign in pre-defined categories
Text Mining for Clementine: Add text fields to data mining for better prediction
23. Copyright 2003-4, SPSS Inc. 23
Concept Extraction Engine
The extractor turns unstructured text into concepts:
LexiQuest Extractor Engine (Linguistic Processor)
LexiQuest Mine: Visualization
Clementine: Data mining
LexiQuest Categorize: Probabilities
25. Copyright 2003-4, SPSS Inc. 25
How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using a tool like LexiQuest Mine is a great
V P N A N N V P A
idea for any organization that is interested in maintaining
N P A N P V V P V
information on competitive intelligence.
N P N N
26. Copyright 2003-4, SPSS Inc. 26
How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
N C D N N
(32 Known patterns for English)
27. Copyright 2003-4, SPSS Inc. 27
How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any
organization that is interested in maintaining information on
competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts are:
Noun based
Can be longer than one word
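The two steps on the preceding slides (tag each word with a part of speech, then match the tag sequence against known patterns) can be sketched in a few lines. This is only an illustration under stated assumptions: the `LEXICON` lookup table, the single pattern "optional adjectives followed by one or more nouns", and the function names are all hypothetical, and they are not the LexiQuest tagger or its 32 English patterns.

```python
# Hypothetical mini-lexicon for the example sentence: word -> coarse tag
# (N noun, A adjective, V verb, P function word). A real tagger does not
# work from a lookup table like this.
LEXICON = {
    "using": "V", "a": "P", "tool": "N", "like": "P", "lexiquest": "N",
    "mine": "N", "is": "V", "great": "A", "idea": "N", "for": "P",
    "any": "P", "organization": "N", "that": "P", "interested": "V",
    "in": "P", "maintaining": "V", "information": "N", "on": "P",
    "competitive": "A", "intelligence": "N",
}

SENTENCE = ("Using a tool like LexiQuest Mine is a great idea for any "
            "organization that is interested in maintaining information "
            "on competitive intelligence.")

def tag(sentence):
    """Step 1: attach a tag to every word (unknown words default to noun)."""
    words = sentence.rstrip(".").split()
    return [(w, LEXICON.get(w.lower(), "N")) for w in words]

def extract_concepts(tagged):
    """Step 2: collect maximal runs matching A* N+ (adjectives, then nouns)."""
    concepts, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "A":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "N":
            k += 1
        if k > j:  # the run contains at least one noun
            concepts.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:      # no noun here: skip past this word (or adjective run)
            i = max(j, i + 1)
    return concepts

concepts = extract_concepts(tag(SENTENCE))
print([c for c in concepts if " " in c])  # multi-word candidates only
```

The sketch returns every noun-based candidate, including "competitive intelligence"; a production extractor would go on to filter and rank candidates, which is outside the scope of this example.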
29. Copyright 2003-4, SPSS Inc. 29
The Issue of Language
NLP requires separate language understanding
Clementine text mining
French
English
English/French
German
Spanish
Dutch
Japanese
Italian
MeSH (Medical Subject Headings)
http://www.nlm.nih.gov/mesh/meshhome.html
30. “The process of discovering meaningful
new relationships, patterns and trends by
sifting through data using pattern
recognition technologies as well as
statistical and mathematical techniques.”
- The Gartner group.
Data Mining Defined
31. Copyright 2003-4, SPSS Inc. 31
Why data mining?
Data Mining software generally employs modeling
algorithms designed to handle non-linearities and
unusual patterns in data
As opposed to classical linear models (e.g., linear
regression) that aren’t as capable
A related issue is ‘noise’ in the data: where, for
example, 2 seemingly similar sets of inputs yield a
different output
32. Copyright 2003-4, SPSS Inc. 32
Use the cross industry standard process for data mining (CRISP-DM)
Based on real-world lessons:
Focus on business issues
User-centric & interactive
Full process
Results are used
A Data Mining Methodology
33. Copyright 2003-4, SPSS Inc. 33
Data Mining is not…
Keep in mind that data mining is not…
“Blind” application of analysis/modeling algorithms
Brute-force crunching of bulk data
Black box technology
Magic
35. Copyright 2003-4, SPSS Inc. 35
Understanding
Business Understanding
Determine objective
Assess situation
Determine data mining goals
Produce project plan
Data Understanding
Collect initial data
Describe data
Explore data
Verify data quality
36. Copyright 2003-4, SPSS Inc. 36
Data Preparation
Data
Data set
Data set description
Select data
Clean data
Construct data set / Integrate data
Format data
Text
Concept extraction
Concept combination
Concept assessment
37. Copyright 2003-4, SPSS Inc. 37
Modeling
Select modeling technique
Universe of techniques
Appropriate techniques
Data
Text
Requirements
Constraints
Selected tools
Generate test design
Run model(s)
Assess model(s)
38. Copyright 2003-4, SPSS Inc. 38
Evaluation
Results = Models + Findings
Evaluate results
Review process
Determine next steps
39. Copyright 2003-4, SPSS Inc. 39
Deployment
Plan deployment
Plan monitoring and maintenance
Final report
Project review
40. Copyright 2003-4, SPSS Inc. 40
Unsupervised methods:
Group patients by drugs and demographic information
and try to find unusual patients
Supervised methods:
Attempt to predict amount due and find sets of cases
where the amount due is very different from the
predicted amount
Data Mining Approaches
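The supervised approach on this slide (predict the amount due, then look for cases where the actual amount is far from the prediction) can be sketched with an ordinary least-squares line and a residual check. This is a minimal sketch: the data, the single input variable, the function names, and the two-standard-deviation threshold are all assumptions made for illustration, not part of the original material.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def flag_unusual(xs, ys, threshold=2.0):
    """Indices of cases whose residual exceeds threshold * residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # every case sits exactly on the line
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Made-up data: services rendered vs. amount due, with one inflated bill.
services = [1, 2, 3, 4, 5, 6, 7, 8, 9]
amount_due = [10, 20, 30, 40, 150, 60, 70, 80, 90]
print(flag_unusual(services, amount_due))
```

Only the case whose amount due departs sharply from the fitted line is flagged. In practice the model would use many inputs and a more robust estimator, since a single extreme value also distorts the fit it is measured against.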
41. Copyright 2003-4, SPSS Inc. 41
What Does Data Mining Do?
Data mining uses existing data to:
Predict
Category membership
Numeric value
e.g., credit risk
Group
Cluster (group) things together based on their characteristics
e.g., different types of TV viewers
Associate
Find events that occur together, or in a sequence
e.g., beer and diapers
Find outliers
Identify cases that don’t follow expected behavior
e.g., fraudulent behaviour
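The "associate" task can be made concrete with the two standard measures for an "if this, then that" rule: support (how often both items occur together) and confidence (how often the consequent appears given the antecedent). The baskets below are invented for illustration and `rule_stats` is a hypothetical helper name; this is a sketch of the arithmetic, not of any particular product's algorithm.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    has_antecedent = sum(1 for b in baskets if antecedent in b)
    support = both / len(baskets)
    confidence = both / has_antecedent if has_antecedent else 0.0
    return support, confidence

# Invented shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence = rule_stats(baskets, "diapers", "beer")
print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}")
```

Here two of the five baskets contain both items and two of the three diaper baskets also contain beer, so the rule has support 0.40 and confidence 0.67.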
42. Copyright 2003-4, SPSS Inc. 42
Benefits of Document Warehousing
Richer operational business intelligence
Knowing your customers
Macroenvironmental monitoring
Technology assessment
43. Copyright 2003-4, SPSS Inc. 43
Conclusions
Text mining is
More than word counts
Linguistically based
Concept extraction
Data mining is
Advanced analytics applied to datasets
A family of techniques
Supervised or unsupervised
44. Copyright 2003-4, SPSS Inc. 44
Conclusions
Text and data mining
Add dimensionality to the data
Allow for automation of the text analysis event
Create 360 degree view
Applications
Websites
Surveys
Email
Call centre
Documentation
46. Copyright 2003-4, SPSS Inc. 46
So How Do I Get Started?
Document Warehousing and Text Mining
Dan Sullivan, Wiley, 2001
Survey of Text Mining: Clustering, Classification
and Retrieval
Michael W. Berry (ed.), Springer, 2003
Natural Language Processing for Online
Applications: Text Retrieval, Extraction and
Categorization
P. Jackson and I. Moulinier, John Benjamins, 2002
According to Gartner Group: “Predictive Analysis helps you connect data to effective action by drawing reliable conclusions about current conditions and future events.”
Predictive analysis:
Leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data
It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities
It encourages actions that demonstrably change how people behave as your customers, employees, patients, students, and citizens
Bottom line: it turns data into effective actions that positively impact your bottom line
Here are some of the stats you may want to know about SPSS (read highlights from the slide).
SPSS has been a cornerstone of the software industry since 1968. We’ve also been at the forefront of blending both new and established technologies to help customers around the world solve business problems. We’ve continued to grow, deliberately and thoughtfully, over the years, acquiring companies and technology complementary to our existing business.
The bottom line for you: We’re here with innovative, proven solutions to help you solve your immediate business problems. And, we’ll be here in the future to support you and your organization…
The Clementine Server data mining workbench fits within SPSS’ overall business intelligence product strategy. Our entire business intelligence product line includes products for collecting data, preparing data, reporting and OLAP, as well as modeling. Because different users have different needs and levels of expertise, they are presented with the appropriate product interface. SPSS delivers the right product for every person, supported by our 30 years of experience in data analysis and data mining. The analytical solutions we deliver are scalable and can be deployed throughout your organization to help you transform your business with information.
It’s been estimated that 80% of the data in an organization is in an unstructured format, that is, in the form of documents, HTML pages, database notes, email, open-ended survey responses, etc. This means that decision-makers often rely on only 20% of the data available, plus the small share of the documents that they can read themselves.
Take open-ended surveys, for example: cross-tab reports of responses are common, but the open-ended responses, which hold valuable information that qualifies those responses and brings up new themes, often go unanalyzed. Organizations rarely have the tools or the time to truly process and disseminate this important information. In a similar fashion, database notes on customer contacts are effectively used to manage individual contacts, but this valuable source of customer information is never used to really understand the customer experience overall. What if you could use this information to keep and grow customers to increase customer lifetime value?
So, I think we can agree that there is a need for text analysis, but where can this technology be applied? [click] Well, surveys are the most obvious; we’ve just talked about that. [click] We could apply this to email, reading the email and making a handling decision based on the content of the email. [click] Call centre data is another candidate for text analysis. What are my customers’ complaints? Where are there problems with my product? What did the customers who left have to say about my service? [click] Reading comment data is an important potential application, one that is currently laborious or subjective. In the State of Georgia example, comments are triaged and only those indicating a definite problem or requirement for re-arrest are used. The majority are ignored, even though there is a real sense that there is something in that group. [click] The ability to read abstracts from online databases using a more intelligent engine than a simple word search is an application. [click] Document management and the ability to categorize gigabytes of documents (policies, procedures) is an application, and along with that [click] corporate history and the ability to manage corporate information resources is an application. Finally, [click] we have seen some use of this technology in the analysis of the messages in websites. What concepts is your website conveying, and are these concepts the appropriate ones, appropriately placed?
At a high level, you need linguistics, or Natural Language Processing, to extract concepts, which form the basis of business user interfaces like concept maps or feed data mining techniques to predict customer behavior.
Morphology is the study of the structure and form of words
Syntax is the study of how words and phrases form sentences
Semantics relates to the meaning of words and statements
Phonology is the study of sounds in language
Pragmatics is the study of how context shapes meaning, including idiomatic phrases that cannot be analyzed with strict semantic analysis
We tend to deal with the first three and ignore the last two when we are talking about natural language processing.
So how do we get from text to concepts?
Linguistics, the science of text, includes ideas such as 1) morphology, or how words change based on part of speech, 2) syntax, or how sentences are structured 3) semantics, or the meaning of words and 4) statistics, such as the frequency of terms and patterns. It takes linguistics to cut through the noise of text to find relevant concepts without leaving important concepts undiscovered. Other statistical or machine learning approaches fall short of linguistic extraction, because only a linguistic approach can deal with the ambiguity and complexity of text.
That is, linguistic extraction is [click] accurate, [click] scalable, [click] customizable and [click] discovery oriented. By accurate, I mean that [click] compound words, proper nouns, etc., [click] like these examples, are extracted. [click] In terms of scalability, we can process about 1 GB per hour, multiple formats and multiple languages [click]. By customizable, [click] I mean that you can use dictionaries, rules and patterns [click] to tailor your extraction. Vertical resources can be used, like MeSH, which is the official medical thesaurus. And finally, [click] by discovery-oriented, I mean that, depending on your analysis, you can focus on known terms, unknown terms, new terms and [click] trends.
For the next step, moving from concepts to Predictive Analytics, tools are available that address specific business needs, from delivering knowledge to adding prediction to operational systems.
[click] LexiQuest Mine enables users to quickly identify key concepts, and the relationships between them, within thousands of documents. Mine displays these concepts and the links between them in an easy to navigate, color-coded graphical map and trend analysis charts. Mine is designed for people who want to discover, structure and anticipate.
[click] LexiQuest Categorize automatically catalogues documents into a predefined taxonomy based on their content.
Able to “read” and understand content, Categorize is able to automatically and accurately place a document into its proper category. From there, it can be sent to the right audience based on their profile or simply reside there for easy retrieval from a portal, intranet or extranet site.
[click] Text Mining for Clementine, a new component of Clementine that we will see in a few minutes, has the ability to unlock knowledge contained in unstructured text data so that it can be combined with information from databases and other data sources to build better models using traditional data mining techniques.
The extraction process works in three parts:
First, a linguistic processor reads the text and comes up with a set of concepts.
These concepts are passed to one of three different applications, depending upon the objective. The first may be a standalone concept understanding application, such as our LexiQuest Mine, which represents the concepts and illustrates their relationships. Our Clementine application uses the concepts as data, as part of a larger data mining application. Finally, categorization uses the concept information as the basis of further analysis.
The final application layer can be used for visualization, data mining, or strict probabilistic assignment of information to known categories.
Another text mining example is categorization. The folders on the left represent categories of different types of incoming emails. Text mining can be used to learn which emails, depending on their content, should be placed in each category (and therefore routed appropriately and automatically). [click] In this case an email on a problem with an ActiveX control [click] can be routed to Dev support.
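The routing idea can be sketched as a term-overlap scorer. This is a deliberately simple stand-in: the category names other than "Dev support", the term sets, and the `route` function are assumptions made for illustration, and a real categorizer would learn its category profiles from example emails rather than use hand-written keyword lists.

```python
import re

# Hypothetical category profiles. Only "Dev support" comes from the talk;
# the other categories and all term sets are invented for this sketch.
CATEGORY_TERMS = {
    "Dev support": {"activex", "control", "error", "crash", "install"},
    "Billing": {"invoice", "payment", "refund", "charge"},
    "Sales": {"pricing", "quote", "license", "upgrade"},
}

def route(email_text, default="General"):
    """Pick the category sharing the most terms with the email text."""
    tokens = set(re.findall(r"[a-z0-9]+", email_text.lower()))
    scores = {cat: len(tokens & terms) for cat, terms in CATEGORY_TERMS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a default folder when no category term appears at all.
    return best if scores[best] > 0 else default

print(route("The ActiveX control throws an error when the page loads."))
print(route("Could you send me a refund for the duplicate invoice?"))
```

The first message lands in "Dev support" and the second in "Billing"; a message matching no profile goes to the default folder.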
MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.
MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Mental Disorders." At more narrow levels are found more specific headings such as "Ankle" and "Conduct Disorder." There are 21,973 descriptors in MeSH. There are also thousands of cross-references that assist in finding the most appropriate MeSH Heading, for example, Vitamin C see Ascorbic Acid. These entries include 23,512 printed see references and 102,346 other entry points.
Let’s define data mining
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
Data Mining Means:
finding patterns or relationships in your data that you can use to solve your organization’s problems
How does one mine data?
The Cross Industry Standard Process for Data Mining (CRISP-DM) provides a framework for all data mining efforts.
This process focuses on business issues, allows the user to work with and interact with the data, works on the data mining process from beginning to end, and USES the results.
Business Understanding where you might convert a business problem to a data mining problem
Data Understanding – where you get your first look at the data
Data Preparation – the hardest part where you clean the data
Modeling – the neatest part where you build prediction or MODELS
Evaluation – where you examine performance
Deployment – where you actually integrate the results of your data mining into your organization
Notice the arrows going around the chart and back and forth amongst the boxes?
These arrows show that data mining is an iterative process, where the miner may step from one box to another and back for an effective data mining activity.
Broadly speaking data mining can serve four basic purposes: prediction, segmentation, association and outlier detection.
Prediction takes a known result, and attempts to combine input fields in order to best replicate this result. An example may be deciding if someone is a good or bad credit risk, or, whether someone will churn or not.
Segmentation finds groups within cases. The number of groups is unknown, i.e., any number of groups is possible. An example may be the examination of customer segments, or the creation of groupings within financial data.
Association methods try to develop an “if this, then that” type of analysis. An example is the finding that people who watch news programming also watch the weather network.
Outlier detection is used to identify atypical cases. These are examples of unusual behaviour vis-à-vis the rest of the data. An example is fraud detection when examining claims data.
The ability to move to the why from the other who, what, when, where, and how much
Personalization through customer knowledge. This allows marketers to craft messages aimed more at individuals than at generalized groups
Understanding events inside and outside the organization and how they relate
The ability to assess competitors’ technology and to understand the technology positions of the market