Irmac presentation for website

Copyright 2003-4, SPSS Inc.
An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada

Agenda
 Introductions
 An Overview of Document Warehousing
 Understanding Unstructured Text
 Concept Extraction
 Text Mining
 Data Mining
 Demonstration

Tim Daciuk
 Background
 Social research
 Survey research
 SPSS
 25 years working with the product
 12 years working with the company
 5 years working with text analysis
 Prior history
 Consulting
 Education

Predictive Analytics: Defined
“Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.”
- Gareth Herschel, Research Director, Gartner Group

SPSS At A Glance
 Leadership
 Market leader in Predictive Analytics
 Focus on online & offline customer data acquisition and analysis
 Stability
 Founded in 1968
 30+ year heritage in analytic technologies
 Proven track record
 250,000+ customers worldwide
 NASDAQ: SPSS
 Analytics standard
 80% of Fortune 500 are SPSS customers
 80% plus market share in Survey & Market Research sector
 Ranked #1 Data Mining solution by KDnuggets

Some of Our Brands

Unstructured Data Management
Text Mining is a subset of Unstructured Data Management (UDM).
UDM can be broken down into:
 Content and Document Management
 Search and Retrieval
 XML database and tools
 Categorization, Classification, and Visualization

80% of Data is Unstructured
 Database notes:
 Call center transcripts
 Other CRM
 Email
 Open-ended survey responses
 Web pages
 Newsgroups
 Documents themselves
 Competitive information

Applications for Text Analysis
 Surveys
 ‘Reading’ email
 Call centre data
 Comment data
 Abstracts
 Document management
 Corporate history
 Thematic understanding of website

Data Warehouse vs. Document Warehouse
 Data warehouse
 Who, what, when, where, how much
 Internally focused
 Operational information
 Rarely includes external information
 Document warehouse
 Why
 May not be internally focused
 May contain a range of information
 Often integrates external information

Document Warehouse Features
 There is no single document structure or document type
 Documents are drawn from multiple sources
 Essential features of documents are automatically extracted and explicitly stored in the document warehouse
 Document warehouses are designed to integrate semantically related documents

Building the Document Warehouse
[Diagram: Identify Sources, Retrieve Document, Text Analysis, Pre-process Document, Compile Metadata]

Predict, Impact, Deploy
[Diagram labels: Customer Data (Attitudes, Actions, Attributes); Business User; Outcomes (Attract, Grow, Retain, Fraud); Data Collection (Text, Surveys, Web, Channel, Operational Systems); Text; Business UI; Expert UI; Concepts; Concept Maps; Clustering; Categorization; Trending; Information Extraction; Prediction; NLP]

The Building Blocks of Language
 Morphology
 Syntax
 Semantics
 Phonology
 Pragmatics

Morphology
 Understanding words
 Stems
 Affixes
 Prefix
 Suffix
 Inflectional elements
 Reduces complexity of analysis
 Reduces complexity of representation
 Supports text mining
[Diagram: in- (prefix) + dispute (noun stem) + -able (suffix)]
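To make the stem and affix idea concrete, here is a minimal sketch that splits a word such as "indisputable" into prefix, stem and suffix by plain string matching. The affix lists, the `MIN_STEM` threshold and the `split_affixes` helper are all assumptions made for illustration; they are not how any SPSS or LexiQuest product works, and a real morphological analyser relies on a full lexicon.

```python
# Illustrative affix lists only; a real morphological analyser uses a lexicon.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed", "s")
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def split_affixes(word):
    """Return (prefix, stem, suffix) using naive string matching."""
    word = word.lower()
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) - len(p) >= MIN_STEM),
        "",
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) - len(s) >= MIN_STEM),
        "",
    )
    stem = rest[: len(rest) - len(suffix)]
    # Drop a final "e" so "dispute" and "disput(able)" share one stem.
    if stem.endswith("e") and len(stem) > MIN_STEM:
        stem = stem[:-1]
    return prefix, stem, suffix


words = ["dispute", "disputes", "disputed", "indisputable"]
stems = {split_affixes(w)[1] for w in words}

print(split_affixes("indisputable"))
print(stems)
```

All four word forms collapse to one stem, which is the sense in which morphology reduces the complexity of the representation. The matching is deliberately crude: a word like "interest" would wrongly lose "in" as a prefix.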

Syntax
 The Bank of Canada will curb inflation with higher interest rates
[Parse tree]
Sentence: Noun phrase + Verb phrase
Noun phrase: The Bank of Canada (Noun)
Verb phrase: will (Aux), curb (Verb), Noun phrase: inflation (Noun)
Prepositional phrase: with, Noun phrase: higher (Adjective), interest rates (Noun)
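One way to hold a parse like this in a program is as a nested tuple, sketched below for the Bank of Canada sentence. The node labels follow the slide, but the tuple layout, the helper names, and the decision to attach the prepositional phrase under the verb phrase are assumptions made for illustration.

```python
# Each node is (label, child, child, ...); a leaf is (label, "text").
TREE = (
    "Sentence",
    ("NounPhrase", ("Noun", "The Bank of Canada")),
    (
        "VerbPhrase",
        ("Aux", "will"),
        ("Verb", "curb"),
        ("NounPhrase", ("Noun", "inflation")),
        (
            "PrepPhrase",  # attachment under the verb phrase is an assumption
            ("Prep", "with"),
            ("NounPhrase", ("Adjective", "higher"), ("Noun", "interest rates")),
        ),
    ),
)


def is_leaf(node):
    """A leaf pairs a label with a string of text."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Return the words under a node, left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    """Return the text of every non-leaf subtree carrying the given label."""
    if is_leaf(node):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found


sentence = " ".join(leaves(TREE))
noun_phrases = phrases(TREE, "NounPhrase")
print(sentence)
print(noun_phrases)
```

Walking the tree for noun phrases is what makes syntax useful to text mining: the candidate concepts are read off the structure instead of being guessed from word positions.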

Semantics
 The meaning of it all
 Approaches to meaning
 Semantic networks
 Deductive logic
 Rule-based systems
 Useful for classification

Problems with NLP
 Limitations of Natural Language Processing
 Correctly identifying the role of noun phrases
 Representing abstract concepts
 Classifying synonyms
 Representing the number of concepts

Problems with NLP
 Limitations of technology
 Language-specific designs are required
 Classification speed
 Classifying hybrid words and sentences

Underlying Technology is Based on Linguistics
The Linguistic Approach:
 Does not treat a document as a bag of words
 Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language dependent.
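A small sketch of why a bag of words is ambiguous: two sentences with opposite meanings can produce exactly the same word counts. The example sentences and the `bag_of_words` helper are invented for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count words and ignore their order, as a bag-of-words model does."""
    return Counter(text.lower().split())


first = "the agent called the customer"
second = "the customer called the agent"

same_bag = bag_of_words(first) == bag_of_words(second)
print(same_bag)
```

The two bags are identical although who called whom is reversed. A linguistic approach that keeps structure can tell the two sentences apart.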

From Text to Concepts
Linguistic Terminology Extractor: Morphology, Syntax, Semantics, Statistics
Accurate
• Compound words
• Proper nouns
• Figures
• Named entities
• Domain specifics
Scalable
• Speed: 1GB/hour
• Multiple formats: PDF, MS Office, text…
• Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese
Customizable
• SPSS dictionaries
• User dictionaries
• Extraction rules
• Extraction patterns
Discovery-Oriented
• Known terms
• Unknown terms
• New terms
Examples
• Inserm; merck & co…
• tnp-470; glut-4…
• factor receptor; inhibitory effect
• D. John Paganoni, ..
• Positive/Negative opinion…
• London, Paris…
• Names, Orgs…
• MeSH, genes...
• Predicates
• Synonyms, stop words..
• Trends

From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
 LexiQuest Mine: Discover concepts, relationships and trends
 LexiQuest Categorize: Understand documents and assign them to pre-defined categories
 Text Mining for Clementine: Add text fields to data mining for better prediction

Concept Extraction Engine
The extractor turns unstructured text into concepts:
[Diagram: LexiQuest Extractor Engine (Linguistic Processor, Visualization, Probabilities); LexiQuest Mine; Clementine; LexiQuest Categorize]

Part-of-Speech Tagging
a: adjective b: adverb c: preposition
d: determiner n: noun v: verb
o: coordination p: participle s: stop word

How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A
idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V
information/N on/P competitive/N intelligence/N.

How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
N C D N N
(32 Known patterns for English)

How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts:
 Are noun based
 Can be longer than one word
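The idea that concepts are noun based and can span several words can be sketched with the tags shown in Step 1. The code below keeps every run of two or more consecutive N tags as a candidate concept. This is a deliberate simplification and not the extractor's actual algorithm: the slides mention 32 known patterns for English, which are not reproduced here, and the `noun_runs` helper and `min_len` parameter are hypothetical.

```python
# Word/tag pairs copied from the Step 1 slide (tags exactly as shown there).
TAGGED = [
    ("Using", "V"), ("a", "P"), ("tool", "N"), ("like", "A"),
    ("LexiQuest", "N"), ("Mine", "N"), ("is", "V"), ("a", "P"), ("great", "A"),
    ("idea", "N"), ("for", "P"), ("any", "A"), ("organization", "N"),
    ("that", "P"), ("is", "V"), ("interested", "V"), ("in", "P"),
    ("maintaining", "V"), ("information", "N"), ("on", "P"),
    ("competitive", "N"), ("intelligence", "N"),
]


def noun_runs(tagged, min_len=2):
    """Collect runs of consecutive noun-tagged words of at least min_len words."""
    runs, current = [], []
    for word, tag in tagged:
        if tag == "N":
            current.append(word)
            continue
        if len(current) >= min_len:
            runs.append(" ".join(current))
        current = []
    if len(current) >= min_len:  # flush a run that ends the sentence
        runs.append(" ".join(current))
    return runs


candidates = noun_runs(TAGGED)
print(candidates)
```

With these tags the sketch returns two candidates, "LexiQuest Mine" and "competitive intelligence". The slide's extractor reports only the second, so a real system evidently applies further rules (for example, treating product names differently) that this sketch does not model.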

Example: Categorization

The Issue of Language
 NLP requires separate language understanding
 Clementine text mining
 French
 English
 English/French
 German
 Spanish
 Dutch
 Japanese
 Italian
 MeSH (Medical Subject Headings)
 http://www.nlm.nih.gov/mesh/meshhome.html

Data Mining Defined
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
- Gartner Group

Why data mining?
 Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
 As opposed to classical linear models (e.g., linear regression) that aren’t as capable
 A related issue is ‘noise’ in the data: where, for example, two seemingly similar sets of inputs yield a different output
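A worked example of the non-linearity point, using made-up data: when the output depends on the square of the input, a least-squares straight line explains none of the variation. The data, the `fit_line` and `r_squared` helpers, and the use of the known quadratic as a stand-in for a non-linear model are all assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # a purely non-linear (U-shaped) relationship

slope, intercept = fit_line(xs, ys)
linear_r2 = r_squared(ys, [slope * x + intercept for x in xs])
# Stand-in for a non-linear model: here simply the true quadratic.
quadratic_r2 = r_squared(ys, [x * x for x in xs])

print(slope, intercept, linear_r2, quadratic_r2)
```

The fitted line is flat (slope 0, intercept 4) and its R-squared is 0, while a model that can bend fits perfectly. Real data is noisier, but the example shows the kind of pattern a purely linear model cannot capture.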

A Data Mining Methodology
 Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
 Based on real-world lessons:
 Focus on business issues
 User-centric & interactive
 Full process
 Results are used

Data Mining is not…
 Keep in mind that data mining is not…
 “Blind” application of analysis/modeling algorithms
 Brute-force crunching of bulk data
 Black box technology
 Magic

Back to the Process
Text Mining

Understanding
 Business Understanding
 Determine objective
 Assess situation
 Determine data mining goals
 Produce project plan
 Data Understanding
 Collect initial data
 Describe data
 Explore data
 Verify data quality

Data Preparation
 Data
 Data set
 Data set description
 Select data
 Clean data
 Construct data set / Integrate data
 Format data
 Text
 Concept extraction
 Concept combination
 Concept assessment

Modeling
 Select modeling technique
 Universe of techniques
 Appropriate techniques
 Data
 Text
 Requirements
 Constraints
 Selected tools
 Generate test design
 Run model(s)
 Assess model(s)

Evaluation
 Results = Models + Findings
 Evaluate results
 Review process
 Determine next steps

Deployment
 Plan deployment
 Plan monitoring and maintenance
 Final report
 Project review

Data Mining Approaches
 Unsupervised methods:
 Group patients by drugs and demographic information and try to find unusual patients
 Supervised methods:
 Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
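The supervised approach can be sketched as follows: fit a model that predicts the amount due, then flag cases whose actual amount is far from the prediction. The data is fabricated for illustration, and the single `units` predictor, the straight-line model and the two-standard-deviation cut-off are all assumptions, not a recommended fraud rule.

```python
from statistics import pstdev

# Hypothetical claims: units of service billed and the amount due for each case.
units = [1, 2, 3, 4, 5, 6, 7, 8]
amount_due = [10, 20, 30, 40, 120, 60, 70, 80]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(units, amount_due)

# Residual = actual amount due minus predicted amount due.
residuals = [y - (slope * x + intercept) for x, y in zip(units, amount_due)]

# Flag cases whose residual is more than two standard deviations from zero.
threshold = 2 * pstdev(residuals)
flagged = [i for i, r in enumerate(residuals) if abs(r) > threshold]

print(flagged)
```

Only the fifth case (index 4), billed 120 for 5 units, is flagged. Note that the unusual case also pulls the fitted line toward itself, which is one reason practical work often uses more robust models than a plain least-squares line.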

What Does Data Mining Do?
 Data mining uses existing data to:
 Predict
 Category membership
 Numeric value
 e.g., credit risk
 Group
 Cluster (group) things together based on their characteristics
 e.g., different types of TV viewers
 Associate
 Find events that occur together, or in a sequence
 e.g., beer and diapers
 Find outliers
 Identify cases that don’t follow expected behavior
 e.g., fraudulent behaviour
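The "associate" task is usually quantified with support (how often items occur together) and confidence (how often the second item appears given the first). A minimal sketch with invented shopping baskets follows; the basket contents and helper names are assumptions for illustration only.

```python
# Hypothetical transactions; each basket is the set of items bought together.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


pair_support = support({"beer", "diapers"})
rule_confidence = confidence({"diapers"}, {"beer"})

print(pair_support, rule_confidence)
```

Here beer and diapers appear together in 2 of 5 baskets (support 0.4), and 2 of the 3 baskets with diapers also contain beer (confidence about 0.67).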

Benefits of Document Warehousing
 Richer operational business intelligence
 Knowing your customers
 Macroenvironmental monitoring
 Technology assessment

Conclusions
 Text mining is
 More than word counts
 Linguistically based
 Concept extraction
 Data mining is
 Advanced analytics applied to datasets
 A family of techniques
 Supervised or unsupervised

Conclusions
 Text and data mining
 Add dimensionality to the data
 Allow for automation of the text analysis event
 Create a 360-degree view
 Applications
 Websites
 Surveys
 Email
 Call centre
 Documentation

?

So How Do I Get Started?
 Document Warehousing and Text Mining
 Dan Sullivan, Wiley, 2001
 Survey of Text Mining: Clustering, Classification and Retrieval
 Michael W. Berry (ed.), Springer, 2003
 Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
 P. Jackson and I. Moulinier, John Benjamins, 2002

SPSS Canada
 Tim Daciuk
 Services Manager, Canada
 416-410-7921
 800-543-6607 ext. 5156
 tdaciuk@spss.com
 Hugh Rooney
 SPSS Sales Canada
 416-410-7921
 905-886-4322
 hrooney@spss.com
www.spss.com
