Copyright 2003-4, SPSS Inc. 1
An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada

Agenda
• Introductions
• An Overview of Document Warehousing
• Understanding Unstructured Text
• Concept Extraction
• Text Mining
• Data Mining
• Demonstration

Tim Daciuk
• Background
  - Social research
  - Survey research
• SPSS
  - 25 years working with the product
  - 12 years working with the company
  - 5 years working with text analysis
• Prior history
  - Consulting
  - Education

Predictive Analytics: Defined
"Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events."
- Gareth Herschel, Research Director, Gartner Group

SPSS At A Glance
• Leadership
  - Market leader in Predictive Analytics
  - Focus on online & offline customer data acquisition and analysis
• Stability
  - Founded in 1968
  - 30+ year heritage in analytic technologies
• Proven track record
  - 250,000+ customers worldwide
  - NASDAQ: SPSS
• Analytics standard
  - 80% of Fortune 500 are SPSS customers
  - 80% plus market share in Survey & Market Research sector
  - Ranked #1 Data Mining solution by KDnuggets

Some of Our Brands

Unstructured Data Management
Text Mining is a subset of Unstructured Data Management (UDM).
UDM can be broken down into:
• Content and Document Management
• Search and Retrieval
• XML database and tools
• Categorization, Classification, and Visualization

80% of Data is Unstructured
• Database notes:
  - Call center transcripts
  - Other CRM
• Email
• Open-ended survey responses
• Web pages
• Newsgroups
• Documents themselves
• Competitive information

Applications for Text Analysis
• Surveys
• ‘Reading’ email
• Call centre data
• Comment data
• Abstracts
• Document management
• Corporate history
• Thematic understanding of website

Data Warehouse vs. Document Warehouse
• Data warehouse
  - Who, what, when, where, how much
  - Internally focused
  - Operational information
  - Rarely includes external information
• Document warehouse
  - Why
  - May not be internally focused
  - May contain a range of information
  - Often integrates external information

Document Warehouse Features
• There is no single document structure or document type
• Documents are drawn from multiple sources
• Essential features of documents are automatically extracted and explicitly stored in the document warehouse
• Document warehouses are designed to integrate semantically related documents

Building the Document Warehouse
[Process diagram: Identify Sources -> Retrieve Document -> Pre-process Document -> Text Analysis -> Compile Metadata]

Predict, Impact, Deploy
[Architecture diagram. Recoverable labels:]
• Customer Data: Attitudes, Actions, Attributes
• Data Collection: Text, Surveys, Web, Channel, Operational Systems
• Text: NLP, Concepts, Concept Maps, Clustering, Categorization, Trending, Information Extraction, Prediction
• Interfaces: Business UI, Expert UI
• Business User
• Outcomes: Attract, Grow, Retain, Fraud

The Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Phonology
• Pragmatics

Morphology
• Understanding words
  - Stems
  - Affixes
    - Prefix
    - Suffix
  - Inflectional elements
• Reduces complexity of analysis
• Reduces complexity of representation
• Supports text mining
[Diagram: a word split into prefix, stem, and suffix: in- + dispute + -able]
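The prefix/stem/suffix split on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the affix lists and the `split_affixes` helper are hypothetical, and a real morphological analyser would rely on a full lexicon and language-specific rules.

```python
# Toy affix lists; illustrative assumptions only, not a real lexicon.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed", "s")


def split_affixes(word, stems):
    """Return (prefix, stem, suffix) if the word decomposes around a known stem."""
    word = word.lower()
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            core = rest[:len(rest) - len(suffix)] if suffix else rest
            # Allow a dropped final "e" (dispute + able -> disputable).
            for candidate in (core, core + "e"):
                if candidate in stems:
                    return prefix, candidate, suffix
    return None
```

With the single known stem `{"dispute"}`, `split_affixes("indisputable", {"dispute"})` yields `("in", "dispute", "able")`, so many surface forms can be stored and analysed under one stem.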

Syntax
• The Bank of Canada will curb inflation with higher interest rates
[Parse-tree diagram: Sentence = Noun phrase ("The Bank of Canada") + Verb phrase; Verb phrase = Aux ("will") + Verb ("curb") + Noun ("inflation") + Prepositional phrase ("with" + Noun phrase: Adjective "higher" + Noun "interest rates")]

Semantics
• The meaning of it all
• Approaches to meaning
  - Semantic networks
  - Deductive logic
  - Rule-based systems
• Useful for classification

Problems with NLP
• Limitations of Natural Language Processing
  - Correctly identifying the role of noun phrases
  - Representing abstract concepts
  - Classifying synonyms
  - Representing the number of concepts

Problems with NLP
• Limitations of technology
  - Language-specific designs are required
  - Classification speed
  - Classifying hybrid words and sentences

Underlying Technology is Based on Linguistics
The Linguistic Approach:
• Does not treat a document as a bag of words
• Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language dependent.
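To see why a bag-of-words view loses information, consider this small sketch. The `bag_of_words` helper and the two sentences are illustrative assumptions, not part of any SPSS product: two sentences with opposite meanings produce identical word counts.

```python
from collections import Counter


def bag_of_words(text):
    """Count words, ignoring order and case."""
    return Counter(text.lower().split())


a = bag_of_words("the customer praised the agent")
b = bag_of_words("the agent praised the customer")
same_bag = (a == b)  # word order, and so who did what, is lost
```

Here `same_bag` is `True`, which is the ambiguity a linguistic approach tries to remove by keeping structure.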

From Text to Concepts
[Diagram: the Linguistic Terminology Extractor draws on morphology, syntax, semantics, and statistics, and is described as accurate, scalable, customizable, and discovery-oriented.]
Accurate
• Compound words
• Proper nouns
• Figures
• Named entities
• Domain specifics
Scalable
• Speed: 1 GB/hour
• Multiple formats: PDF, MS Office, text…
• Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese
Customizable
• SPSS dictionaries
• User dictionaries
• Extraction rules
• Extraction patterns
Discovery-Oriented
• Known terms
• Unknown terms
• New terms
Examples shown on the slide
• Inserm; merck & co…
• tnp-470; glut-4…
• factor receptor; inhibitory effect
• D. John Paganoni, ..
• Positive/Negative opinion…
• London, Paris…
• Names, Orgs…
• MeSH, genes...
• Predicates
• Synonyms, stop words..
• Trends

From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
• LexiQuest Mine: discover concepts, relationships and trends
• LexiQuest Categorize: understand documents and assign them to pre-defined categories
• Text Mining for Clementine: add text fields to data mining for better prediction

Concept Extraction Engine
The extractor turns unstructured text into concepts:
[Diagram: the LexiQuest Extractor Engine (Linguistic Processor) feeds LexiQuest Mine, LexiQuest Categorize, and Clementine. Other labels: Visualization, Probabilities]

Part-of-Speech Tagging
a: adjective b: adverb c: preposition
d: determiner n: noun v: verb
o: coordination p: participle s: stop word

How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using a tool like LexiQuest Mine is a great
V P N A N N V P A
idea for any organization that is interested in maintaining
N P A N P V V P V
information on competitive intelligence.
N P N N

How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
N C D N N
(32 Known patterns for English)

How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts:
• Are noun based
• Can be longer than one word
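The two steps above (tag each word, then match the tag sequence against known patterns) can be sketched as follows. This is an assumption-laden toy, not the LexiQuest implementation: the mini-lexicon is hypothetical, the tag letters are the lowercase codes from the Part-of-Speech Tagging slide, and the single pattern "optional adjectives followed by one or more nouns" stands in for the 32 English patterns, which the slides do not list.

```python
import re

# Hypothetical mini-lexicon mapping words to the one-letter tags from the
# Part-of-Speech Tagging slide (a: adjective, n: noun, v: verb, ...).
TOY_LEXICON = {
    "competitive": "a", "great": "a",
    "intelligence": "n", "information": "n", "organization": "n",
    "idea": "n", "tool": "n",
    "is": "v", "maintaining": "v",
    "on": "c", "in": "c", "for": "c",
    "a": "d", "any": "d",
}

# Assumed pattern: optional adjectives followed by one or more nouns.
NOUN_PHRASE = re.compile(r"a*n+")


def extract_concepts(sentence):
    """Tag each word, then return the word spans whose tags match NOUN_PHRASE."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    # Unknown words are tagged "s" (stop word) so they never join a concept.
    tags = "".join(TOY_LEXICON.get(w, "s") for w in words)
    # One tag character per word, so match offsets index directly into words.
    return [" ".join(words[m.start():m.end()]) for m in NOUN_PHRASE.finditer(tags)]
```

On the fragment "maintaining information on competitive intelligence" this returns `["information", "competitive intelligence"]`. The slide keeps only "Competitive Intelligence", so a production extractor evidently applies further filtering that this sketch does not attempt.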

Example: Categorization

The Issue of Language
• NLP requires separate language understanding
• Clementine text mining
  - French
  - English
  - English/French
  - German
  - Spanish
  - Dutch
  - Japanese
  - Italian
• MeSH (Medical Subject Headings)
  - http://www.nlm.nih.gov/mesh/meshhome.html

Data Mining Defined
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
- The Gartner Group

Why data mining?
• Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
  - As opposed to classical linear models (e.g., linear regression), which are less capable of handling them
• A related issue is ‘noise’ in the data: for example, two seemingly similar sets of inputs may yield different outputs
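A small worked example of the non-linearity point, using only the standard library. The data are made up for illustration: y = x² on a symmetric range, where an ordinary least-squares line finds no relationship at all even though y is fully determined by x. The `fit_line` helper is a hypothetical name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]          # a purely non-linear (quadratic) relationship
intercept, slope = fit_line(xs, ys)
```

The fitted slope is 0 and the intercept is 2 (the mean of y), so the linear model predicts the same value everywhere. Techniques that can represent curvature avoid this blind spot.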

A Data Mining Methodology
• Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
• Based on real-world lessons:
  - Focus on business issues
  - User-centric & interactive
  - Full process
  - Results are used

Data Mining is not…
• Keep in mind that data mining is not…
  - “Blind” application of analysis/modeling algorithms
  - Brute-force crunching of bulk data
  - Black box technology
  - Magic

Back to the Process
Text Mining

Understanding
• Business Understanding
  - Determine objective
  - Assess situation
  - Determine data mining goals
  - Produce project plan
• Data Understanding
  - Collect initial data
  - Describe data
  - Explore data
  - Verify data quality

Data Preparation
• Data
  - Data set
  - Data set description
  - Select data
  - Clean data
  - Construct data set / Integrate data
  - Format data
• Text
  - Concept extraction
  - Concept combination
  - Concept assessment
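One plausible way to carry extracted concepts into the prepared data set is to turn them into 0/1 indicator fields per record. The sketch below assumes a lowercase concept vocabulary chosen by the analyst; the `concept_flags` helper and the `has_` field prefix are hypothetical names, not Clementine features.

```python
def concept_flags(extracted, vocabulary):
    """Map one record's extracted concepts to 0/1 indicator fields.

    `vocabulary` is assumed to hold lowercase concept names.
    """
    present = {c.lower() for c in extracted}
    return {f"has_{c.replace(' ', '_')}": int(c in present) for c in vocabulary}


row = concept_flags(["Billing Error"], ["billing error", "late delivery"])
```

`row` is `{"has_billing_error": 1, "has_late_delivery": 0}`, a pair of fields that can sit next to the structured columns and be passed to any modeling technique.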

Modeling
• Select modeling technique
  - Universe of techniques
  - Appropriate techniques
    - Data
    - Text
  - Requirements
  - Constraints
  - Selected tools
• Generate test design
• Run model(s)
• Assess model(s)

Evaluation
• Results = Models + Findings
• Evaluate results
• Review process
• Determine next steps

Deployment
• Plan deployment
• Plan monitoring and maintenance
• Final report
• Project review

Data Mining Approaches
• Unsupervised methods:
  - Group patients by drugs and demographic information and try to find unusual patients
• Supervised methods:
  - Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
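The supervised idea above (predict the amount due, then look for cases far from the prediction) can be sketched with a deliberately crude stand-in model: each case's predicted amount is simply the mean of its group. The group labels, amounts, and the `flag_unusual_amounts` helper are made-up illustrations. A real project would use a proper predictive model.

```python
from collections import defaultdict
from statistics import mean


def flag_unusual_amounts(cases, tolerance):
    """cases: list of (group, amount_due) pairs.

    Predict each amount as its group mean and return the indices of cases
    whose amount differs from that prediction by more than `tolerance`.
    """
    by_group = defaultdict(list)
    for group, amount in cases:
        by_group[group].append(amount)
    predicted = {g: mean(v) for g, v in by_group.items()}
    return [i for i, (g, amt) in enumerate(cases)
            if abs(amt - predicted[g]) > tolerance]


cases = [("A", 100), ("A", 110), ("A", 90), ("A", 500), ("B", 20), ("B", 22)]
unusual = flag_unusual_amounts(cases, tolerance=150)
```

Group A's mean is 200, so only the 500 case (index 3) deviates by more than 150 and is flagged for review.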

What Does Data Mining Do?
• Data mining uses existing data to:
  - Predict
    - Category membership
    - Numeric value
    - e.g., credit risk
  - Group
    - Cluster (group) things together based on their characteristics
    - e.g., different types of TV viewers
  - Associate
    - Find events that occur together, or in a sequence
    - e.g., beer and diapers
  - Find outliers
    - Identify cases that don’t follow expected behavior
    - e.g., fraudulent behaviour
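As a concrete illustration of the "Associate" task, the sketch below counts how often pairs of items occur together and reports each pair's support (the fraction of baskets containing both). The baskets are invented, and `pair_support` is a hypothetical helper. Full association-rule algorithms add further measures such as confidence, which are omitted here.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Fraction of baskets containing each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c / len(baskets) for pair, c in counts.items()}


baskets = [
    ["beer", "diapers"],
    ["beer", "diapers", "chips"],
    ["chips"],
    ["beer"],
]
support = pair_support(baskets)
```

Here `support[("beer", "diapers")]` is 0.5: the pair appears in two of the four baskets, more often than any other pair.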

Benefits of Document Warehousing
• Richer operational business intelligence
• Knowing your customers
• Macroenvironmental monitoring
• Technology assessment

Conclusions
• Text mining is
  - More than word counts
  - Linguistically based
  - Concept extraction
• Data mining is
  - Advanced analytics applied to datasets
  - A family of techniques
  - Supervised or unsupervised

Conclusions
• Text and data mining
  - Add dimensionality to the data
  - Allow for automation of the text analysis event
  - Create 360 degree view
• Applications
  - Websites
  - Surveys
  - Email
  - Call centre
  - Documentation
?

So How Do I Get Started?
• Document Warehousing and Text Mining
  - Dan Sullivan, Wiley, 2001
• Survey of Text Mining: Clustering, Classification and Retrieval
  - Michael W. Berry (ed.), Springer, 2003
• Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
  - P. Jackson and I. Moulinier, John Benjamins, 2002

SPSS Canada
• Tim Daciuk
  - Services Manager, Canada
  - 416-410-7921
  - 800-543-6607 ext. 5156
  - tdaciuk@spss.com
• Hugh Rooney
  - SPSS Sales Canada
  - 416-410-7921
  - 905-886-4322
  - hrooney@spss.com
www.spss.com

More Related Content

What's hot

Scalable keyword search on large rdf data
Scalable keyword search on large rdf dataScalable keyword search on large rdf data
Scalable keyword search on large rdf dataLeMeniz Infotech
 
Predictive Text Analytics
Predictive Text AnalyticsPredictive Text Analytics
Predictive Text AnalyticsSeth Grimes
 
Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Seth Grimes
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Kai Li
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalDustin Smith
 
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...Dr. Haxel Consult
 
An Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search TechniqueAn Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search Techniquepaperpublications3
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationSeth Grimes
 
Semantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchSemantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchOntotext
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantinimaxfalc
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics
 
Interleaving, Evaluation to Self-learning Search @904Labs
Interleaving, Evaluation to Self-learning Search @904LabsInterleaving, Evaluation to Self-learning Search @904Labs
Interleaving, Evaluation to Self-learning Search @904LabsJohn T. Kane
 
How Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondHow Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondNeo4j
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelTrey Grainger
 
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowText Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowTony Russell-Rose
 
Text Analytics for Non-Experts
Text Analytics for Non-ExpertsText Analytics for Non-Experts
Text Analytics for Non-ExpertsSynaptica, LLC
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...Kripa (कृपा) Rajshekhar
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Neo4j
 

What's hot (20)

Scalable keyword search on large rdf data
Scalable keyword search on large rdf dataScalable keyword search on large rdf data
Scalable keyword search on large rdf data
 
Predictive Text Analytics
Predictive Text AnalyticsPredictive Text Analytics
Predictive Text Analytics
 
Text Analytics for Dummies 2010
Text Analytics for Dummies 2010Text Analytics for Dummies 2010
Text Analytics for Dummies 2010
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
 
An Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search TechniqueAn Advanced IR System of Relational Keyword Search Technique
An Advanced IR System of Relational Keyword Search Technique
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentationAn Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
 
Semantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial ResearchSemantic Data Normalization For Efficient Clinical Trial Research
Semantic Data Normalization For Efficient Clinical Trial Research
 
Tovek Presentation by Livio Costantini
Tovek Presentation by Livio CostantiniTovek Presentation by Livio Costantini
Tovek Presentation by Livio Costantini
 
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
 
Text mining
Text miningText mining
Text mining
 
Interleaving, Evaluation to Self-learning Search @904Labs
Interleaving, Evaluation to Self-learning Search @904LabsInterleaving, Evaluation to Self-learning Search @904Labs
Interleaving, Evaluation to Self-learning Search @904Labs
 
How Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and BeyondHow Graph Algorithms Answer your Business Questions in Banking and Beyond
How Graph Algorithms Answer your Business Questions in Banking and Beyond
 
South Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis PanelSouth Big Data Hub: Text Data Analysis Panel
South Big Data Hub: Text Data Analysis Panel
 
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowText Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and Tomorrow
 
Text Analytics for Non-Experts
Text Analytics for Non-ExpertsText Analytics for Non-Experts
Text Analytics for Non-Experts
 
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 

Viewers also liked

Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...Rishu Mehra
 
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction SlidesOriginal definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction SlidesJaap Vink
 
Experimental design data analysis
Experimental design data analysisExperimental design data analysis
Experimental design data analysismetalkid132
 
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...Perficient, Inc.
 
Economic Analysis of Maruti Suzuki through various software tools
Economic Analysis of Maruti Suzuki through various software toolsEconomic Analysis of Maruti Suzuki through various software tools
Economic Analysis of Maruti Suzuki through various software toolsHarish Gowtham
 
Applied Statistical Methods - Question & Answer on SPSS
Applied Statistical Methods - Question & Answer on SPSSApplied Statistical Methods - Question & Answer on SPSS
Applied Statistical Methods - Question & Answer on SPSSGökhan Ayrancıoğlu
 
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014Daniel Westzaan
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSSRajesh Gunesh
 
SPSS an intro...
SPSS an intro...SPSS an intro...
SPSS an intro...Jithin Zcs
 
Wifi and Lifi Technology
Wifi and Lifi TechnologyWifi and Lifi Technology
Wifi and Lifi TechnologyUmmar Ahmed
 
Data Analysis With Spss - Reliability
Data Analysis With Spss -  ReliabilityData Analysis With Spss -  Reliability
Data Analysis With Spss - ReliabilityDr Ali Yusob Md Zain
 
Six Sigma Quality Using R: Tools and Training
Six Sigma Quality Using R: Tools and Training Six Sigma Quality Using R: Tools and Training
Six Sigma Quality Using R: Tools and Training Emilio L. Cano
 
Introduction to Mediation using SPSS
Introduction to Mediation using SPSSIntroduction to Mediation using SPSS
Introduction to Mediation using SPSSsmackinnon
 
SPSS statistics - how to use SPSS
SPSS statistics - how to use SPSSSPSS statistics - how to use SPSS
SPSS statistics - how to use SPSScsula its training
 

Viewers also liked (17)

Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
 
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction SlidesOriginal definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
 
Experimental design data analysis
Experimental design data analysisExperimental design data analysis
Experimental design data analysis
 
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
 
Uses of SPSS and Excel to analyze data
Uses of SPSS and Excel   to analyze dataUses of SPSS and Excel   to analyze data
Uses of SPSS and Excel to analyze data
 
Spss
SpssSpss
Spss
 
Economic Analysis of Maruti Suzuki through various software tools
Economic Analysis of Maruti Suzuki through various software toolsEconomic Analysis of Maruti Suzuki through various software tools
Economic Analysis of Maruti Suzuki through various software tools
 
Applied Statistical Methods - Question & Answer on SPSS
Applied Statistical Methods - Question & Answer on SPSSApplied Statistical Methods - Question & Answer on SPSS
Applied Statistical Methods - Question & Answer on SPSS
 
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSS
 
SPSS an intro...
SPSS an intro...SPSS an intro...
SPSS an intro...
 
Wifi and Lifi Technology
Wifi and Lifi TechnologyWifi and Lifi Technology
Wifi and Lifi Technology
 
Data Analysis With Spss - Reliability
Data Analysis With Spss -  ReliabilityData Analysis With Spss -  Reliability
Data Analysis With Spss - Reliability
 
Six Sigma Quality Using R: Tools and Training
Six Sigma Quality Using R: Tools and Training Six Sigma Quality Using R: Tools and Training
Six Sigma Quality Using R: Tools and Training
 
Introduction to Mediation using SPSS
Introduction to Mediation using SPSSIntroduction to Mediation using SPSS
Introduction to Mediation using SPSS
 
SPSS statistics - how to use SPSS
SPSS statistics - how to use SPSSSPSS statistics - how to use SPSS
SPSS statistics - how to use SPSS
 
Data analysis using spss
Data analysis using spssData analysis using spss
Data analysis using spss
 

Similar to Irmac presentation for website

A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...Michael Mortenson
 
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...Michael Mortenson
 
Technical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfTechnical Documentation 101 for Data Engineers.pdf
Technical Documentation 101 for Data Engineers.pdfShristi Shrestha
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...
Data Scientist Salary, Skills, Jobs And Resume | Data Scientist Career | Data...Simplilearn
 
The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration The Rensselaer IDEA: Data Exploration
The Rensselaer IDEA: Data Exploration James Hendler
 
Tips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data ScientistTips and Tricks to be an Effective Data Scientist
Tips and Tricks to be an Effective Data ScientistLisa Cohen
 
The web of data: how are we doing so far?
The web of data: how are we doing so far?The web of data: how are we doing so far?
The web of data: how are we doing so far?Elena Simperl
 
Lessons learned from over 25 Data Virtualization implementations
Lessons learned from over 25 Data Virtualization implementationsLessons learned from over 25 Data Virtualization implementations
Lessons learned from over 25 Data Virtualization implementationsDenodo
 
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...Cambridge Semantics
 
Tips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the EnterpriseTips for Effective Data Science in the Enterprise
Tips for Effective Data Science in the EnterpriseLisa Cohen
 
Open government data portals: from publishing to use and impact
Open government data portals: from publishing to use and impactOpen government data portals: from publishing to use and impact
Open government data portals: from publishing to use and impactElena Simperl
 
Sustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsSustainability Investment Research Using Cognitive Analytics
Sustainability Investment Research Using Cognitive AnalyticsCambridge Semantics
 
Fried data summit big data for lob content
Fried data summit big data for lob contentFried data summit big data for lob content
Fried data summit big data for lob contentJeff Fried
 
11 Towards a Research Agenda for Recommendation Systems in Requirements Engin...
11 Towards a Research Agenda for Recommendation Systems in Requirements Engin...11 Towards a Research Agenda for Recommendation Systems in Requirements Engin...
11 Towards a Research Agenda for Recommendation Systems in Requirements Engin...Walid Maalej
 

Irmac presentation for website

  • 1. An Introduction to Text Mining. Tim Daciuk, Services Manager, Canada, SPSS, Inc. Copyright 2003-4, SPSS Inc.
  • 2. Agenda: Introductions; An Overview of Document Warehousing; Understanding Unstructured Text; Concept Extraction; Text Mining; Data Mining; Demonstration.
  • 3. Tim Daciuk. Background: social research; survey research. SPSS: 25 years working with the product; 12 years working with the company; 5 years working with text analysis. Prior history: consulting; education.
  • 4. Predictive Analytics: Defined. “Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.” (Gareth Herschel, Research Director, Gartner Group)
  • 5. SPSS At A Glance. Leadership: market leader in Predictive Analytics; focus on online & offline customer data acquisition and analysis. Stability: founded in 1968; 30+ year heritage in analytic technologies. Proven track record: 250,000+ customers worldwide; NASDAQ: SPSS. Analytics standard: 80% of Fortune 500 are SPSS customers; 80% plus market share in Survey & Market Research sector; ranked #1 Data Mining solution by KD Nuggets.
  • 6. Some of Our Brands
  • 7. Unstructured Data Management. Text Mining is a subset of Unstructured Data Management (UDM). UDM can be broken down into: Content and Document Management; Search and Retrieval; XML database and tools; Categorization, Classification, and Visualization.
  • 8. 80% of Data is Unstructured. Database notes: call center transcripts; other CRM. Email. Open-ended survey responses. Web pages. NewsGroups. Documents themselves. Competitive information.
  • 9. Applications for Text Analysis: Surveys; ‘Reading’ email; Call centre data; Comment data; Abstracts; Document management; Corporate history; Thematic understanding of website.
  • 10. Data Warehouse vs. Document Warehouse. Data warehouse: who, what, when, where, how much; internally focused; operational information; rarely include external information. Document warehouse: why; may not be internally focused; may contain a range of information; often integrate external information.
  • 11. Document Warehouse Features: There is no single document structure or document type. Documents are drawn from multiple sources. Essential features of documents are automatically extracted and explicitly stored in the document warehouse. Document warehouses are designed to integrate semantically related documents.
  • 12. Building the Document Warehouse (diagram labels): Identify Sources; Retrieve Document; Text Analysis; Pre-process Document; Compile Metadata.
  • 13. Predict, Impact, Deploy (diagram labels): Customer Data; Attitudes; Actions; Attributes; Business User; Grow; Retain; Fraud; Outcomes; Attract; Data Collection; Text; Surveys; Web; Channel; Operational Systems; Text; Business UI; Expert UI; Concepts; Concept Maps; Clustering; Categorization; Trending; Information Extraction; Prediction; NLP.
  • 14. The Building Blocks of Language: Morphology; Syntax; Semantics; Phonology; Pragmatics.
  • 15. Morphology. Understanding words: stems; affixes (prefix, suffix); inflectional elements. Reducing complexity of analysis: reduces complexity of representation; supports text mining. Diagram: Noun = Prefix (in-) + Noun Stem (dispute) + Suffix (-able).
  • 16. Syntax. Example sentence: “The Bank of Canada will curb inflation with higher interest rates.” Parse-tree diagram labels: Sentence; Noun phrase (The Bank of Canada); Verb phrase; Aux (will); Verb (curb); Noun (inflation); Prepositional phrase (with); Noun phrase; Adjective (higher); Noun (interest rates).
  • 17. Semantics: The meaning of it all. Approaches to meaning: semantic networks; deductive logic; rule-based systems. Useful for classification.
  • 18. Problems with NLP. Limitations of Natural Language Processing: correctly identifying the role of noun phrases; representing abstract concepts; classifying synonyms; representing the number of concepts.
  • 19. Problems with NLP. Limitations of technology: language specific designs are required; classification speed; classifying hybrid words and sentences.
  • 20. Underlying Technology is Based on Linguistics. The Linguistic Approach: does not treat a document as a bag of words; removes ambiguity by extracting structured concepts. Concepts are the DNA of text. Text is unstructured, ambiguous, and language dependent.
  • 21. From Text to Concepts. Morphology, Syntax, Semantics, and Statistics feed the Linguistic Terminology Extractor, which is Accurate, Scalable, Customizable, and Discovery-Oriented. Accurate: compound words; proper nouns; figures; named entities; domain specifics (examples: Inserm; merck & co…; tnp-470; glut-4…; factor receptor; Inhibitory effect; D. John Paganoni, ..; Positive/Negative opinion…; London, Paris…). Scalable: speed (1GB/hour); multiple formats (PDF, MS Office, text…); multiple languages (English, French, German, Spanish, Italian, Dutch, Japanese). Customizable: SPSS dictionaries; user dictionaries; extraction rules; extraction patterns (examples: Names, Orgs…; MeSH, genes...; Predicates; Synonyms, stop words..). Discovery-Oriented: known terms; unknown terms; new terms; trends.
  • 22. From Concepts to Predictive Analytics Components. The Linguistic Terminology Extractor feeds three components. LexiQuest Mine: discover concepts, relationships and trends. LexiQuest Categorize: understand documents and assign them to pre-defined categories. Text Mining for Clementine: add text fields to data mining for better prediction.
  • 23. Concept Extraction Engine. The extractor turns unstructured text into concepts. Diagram labels: LexiQuest Extractor Engine; Linguistic Processor; Visualization; Probabilities; LexiQuest Mine; Clementine; LexiQuest Categorize.
  • 24. Part-of-Speech Tagging. a: adjective; b: adverb; c: preposition; d: determiner; n: noun; v: verb; o: coordination; p: participle; s: stop word.
  • 25. How is a Concept Extracted? Step 1: Part-of-Speech Tagging. Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V information/N on/P competitive/N intelligence/N.
  • 26. How is a Concept Extracted? Step 2: Matching to Known Patterns. This: V P N A N N V P A N P A N P V V P V N P N N. Looks most like: N C D N N (32 known patterns for English).
  • 27. How is the Concept Extracted? The extractor looks at this sentence: “Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.” And extracts the concept: Competitive Intelligence. Concepts are noun based and can be longer than one word.
  • 28. Example: Categorization
  • 29. The Issue of Language. NLP requires separate language understanding. Clementine text mining: French; English; English/French; German; Spanish; Dutch; Japanese; Italian. MeSH (Medical Subject Headings): http://www.nlm.nih.gov/mesh/meshhome.html
  • 30. Data Mining Defined. “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” (The Gartner Group)
  • 31. Why data mining? Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data, as opposed to classical linear models (e.g., linear regression) that aren’t as capable. A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output.
  • 32. A Data Mining Methodology. Use the Cross Industry Standard Process for Data Mining (CRISP-DM). Based on real-world lessons: focus on business issues; user-centric & interactive; full process; results are used.
  • 33. Data Mining is not… Keep in mind that data mining is not: “blind” application of analysis/modeling algorithms; brute-force crunching of bulk data; black box technology; magic.
  • 34. Back to the Process: Text Mining
  • 35. Understanding. Business Understanding: determine objective; assess situation; determine data mining goals; produce project plan. Data Understanding: collect initial data; describe data; explore data; verify data quality.
  • 36. Data Preparation. Data: data set; data set description; select data; clean data; construct data set / integrate data; format data. Text: concept extraction; concept combination; concept assessment.
  • 37. Modeling. Select modeling technique: universe of techniques; appropriate techniques; data; text; requirements; constraints; selected tools. Generate test design. Run model(s). Assess model(s).
  • 38. Evaluation. Results = Models + Findings. Evaluate results. Review process. Determine next steps.
  • 39. Deployment: Plan deployment; Plan monitoring and maintenance; Final report; Project review.
  • 40. Data Mining Approaches. Unsupervised methods: group patients by drugs and demographic information and try to find unusual patients. Supervised methods: attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount.
  • 41. What Does Data Mining Do? Data mining uses existing data to: Predict (category membership or numeric value, e.g. credit risk); Group (cluster things together based on their characteristics, e.g. different types of TV viewers); Associate (find events that occur together, or in a sequence, e.g. beer and diapers); Find outliers (identify cases that don’t follow expected behavior, e.g. fraudulent behaviour).
  • 42. Benefits of Document Warehousing: Richer operational business intelligence; Knowing your customers; Macroenvironmental monitoring; Technology assessment.
  • 43. Conclusions. Text mining is: more than word counts; linguistically based; concept extraction. Data mining is: advanced analytics applied to datasets; a family of techniques; supervised or unsupervised.
  • 44. Conclusions. Text and data mining: add dimensionality to the data; allow for automation of the text analysis event; create 360 degree view. Applications: websites; surveys; email; call centre; documentation.
  • 46. So How Do I Get Started? Document Warehousing and Text Mining, Dan Sullivan, Wiley, 2001. Survey of Text Mining: Clustering, Classification and Retrieval, Michael W. Berry (ed.), Springer, 2003. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, P. Jackson and I. Moulinier, John Benjamins, 2002.
  • 47. SPSS Canada. Tim Daciuk, Services Manager, Canada: 416-410-7921; 800-543-6607 ext. 5156; tdaciuk@spss.com. Hugh Rooney, SPSS Sales Canada: 416-410-7921; 905-886-4322; hrooney@spss.com. www.spss.com
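
Slides 24 to 27 describe concept extraction as two steps: tag each word with a part of speech, then match the tag sequence against known patterns. The following is a minimal sketch of that idea in Python, not LexiQuest's implementation. Only the tag letters come from slide 24; the lexicon, the tag assigned to each word, and the three patterns are assumptions made up for illustration (the slides mention 32 known patterns for English but do not list them).

```python
# Toy lexicon: word -> tag, using the tag letters from slide 24
# (a: adjective, c: preposition, d: determiner, n: noun, v: verb,
#  p: participle, s: stop word).
# ASSUMPTION: these entries are hand-written for this one sentence.
LEXICON = {
    "using": "v", "a": "d", "tool": "n", "like": "c", "lexiquest": "n",
    "mine": "n", "is": "v", "great": "a", "idea": "n", "for": "c",
    "any": "d", "organization": "n", "that": "s", "interested": "p",
    "in": "c", "maintaining": "v", "information": "n", "on": "c",
    "competitive": "a", "intelligence": "n",
}

# ASSUMPTION: three illustrative noun-based patterns, not the real pattern set.
# "an" = adjective+noun, "nn" = noun+noun, "ann" = adjective+noun+noun.
PATTERNS = ["an", "nn", "ann"]

SENTENCE = (
    "Using a tool like LexiQuest Mine is a great idea for any organization "
    "that is interested in maintaining information on competitive intelligence."
)


def tag(sentence):
    """Step 1: look each word up in the lexicon; unknown words become stop words."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [(w, LEXICON.get(w, "s")) for w in words if w]


def extract_concepts(sentence):
    """Step 2: scan the tag string and keep word runs that match a pattern."""
    tagged = tag(sentence)
    tags = "".join(t for _, t in tagged)
    concepts = []
    i = 0
    while i < len(tagged):
        # Prefer the longest pattern that matches at this position.
        match = max(
            (p for p in PATTERNS if tags.startswith(p, i)),
            key=len,
            default=None,
        )
        if match:
            concepts.append(" ".join(w for w, _ in tagged[i:i + len(match)]))
            i += len(match)
        else:
            i += 1
    return concepts


if __name__ == "__main__":
    print(extract_concepts(SENTENCE))
```

On the example sentence this toy version returns several noun-based candidates, among them "competitive intelligence", the concept shown on slide 27. A production extractor would presumably rank or filter the candidates further.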

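Slide 40 describes a supervised approach to finding unusual cases: predict the amount due, then look for cases where the actual amount is very different from the prediction. A minimal sketch of that residual check follows. It assumes a single numeric input and an ordinary least-squares line as the predictive model; the function names, the threshold of two standard deviations, and the sample figures are illustrative assumptions, not part of the presentation.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def flag_unusual(xs, ys, threshold=2.0):
    """Return indices of cases whose actual value is far from the prediction.

    A case is flagged when its residual (actual minus predicted) exceeds
    `threshold` standard deviations of all residuals. The threshold value
    is an assumption chosen for illustration.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # the line fits every case exactly, so nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


if __name__ == "__main__":
    # Made-up sample: units of service vs. amount due, with one odd case.
    units = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    amount_due = [10, 20, 30, 40, 150, 60, 70, 80, 90]
    print(flag_unusual(units, amount_due))
```

The same shape applies with any predictive model in place of the fitted line; only the way the predicted amount is produced changes.
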
Editor's Notes

  1. According to Gartner Group: “Predictive Analysis helps you connect data to effective action by drawing reliable conclusions about current conditions and future events.” Predictive analysis leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data. It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities. It encourages actions that demonstrably change how people behave as your customers, employees, patients, students, and citizens. Bottom line: it turns data into effective actions that positively impact your bottom line.
  2. Here are some of the stats you may want to know about SPSS (read highlights from the slide). SPSS has been a cornerstone of the software industry since 1968. We’ve also been on the forefront of blending both new and established technologies to help customers around the world solve business problems. We’ve continued to grow, deliberately and thoughtfully, over the years, acquiring companies and technology complementary to our existing business. The bottom line for you: We’re here with innovative, proven solutions to help you solve your immediate business problems. And, we’ll be here in the future to support you and your organization…
  3. The Clementine Server data mining workbench fits within SPSS’ overall business intelligence product strategy. Our entire business intelligence product line includes products for collecting data, preparing data, reporting and OLAP, as well as modeling. Because different users have different needs and levels of expertise, they are presented with the appropriate product interface. SPSS delivers the right product for every person, supported by our 30 years of experience in data analysis and data mining. The analytical solutions we deliver are scalable and can be deployed throughout your organization to help you transform your business with information.
  4. It’s been estimated that 80% of the data in an organization is in an unstructured format, that is, in the form of documents, HTML pages, database notes, email, open-ended survey responses, etc. This means that decision-makers often rely on only 20% of the data available plus the few documents that they have time to read. Take open-ended surveys, for example: cross-tab reports of responses are common, but the open-ended responses, which hold valuable information that qualifies the responses and brings up new themes, are rarely analyzed. Organizations rarely have the tools or the time to truly process and disseminate this important information. In a similar fashion, database notes on customer contacts are effectively used to manage individual contacts, but this valuable source of customer information is never used to really understand the customer experience overall. What if you could use this information to keep and grow customers to increase customer lifetime value?
  5. So, I think we can agree that there is a need for text analysis, but where can this technology be applied? [click] Well, surveys are the most obvious; we’ve just talked about that. [click] We could apply this to email, reading the email and making a handling decision based on the content of the email. [click] Call centre data is another candidate for text analysis. What are my customers’ complaints? Where are there problems with my product? What did the customers who left have to say about my service? [click] Reading comment data is an important potential application, one that is currently laborious or subjective. In the State of Georgia example, comments are triaged and only those indicating a definite problem or requirement for re-arrest are used. The majority are ignored, even though there is a real sense that there is something in that group. [click] The ability to read abstracts from online databases using a more intelligent engine than a simple word search is an application. [click] Document management, and the ability to categorize gigabytes of documents, policies, and procedures, is an application, and along with that [click] corporate history and the ability to manage corporate information resources is an application. Finally, [click] we have seen some use of this technology in the analysis of messages in websites. What concepts is your website conveying, and are these concepts the appropriate ones, appropriately placed?
  6. At a high level, you need linguistics, or Natural Language Processing, to extract concepts. These concepts form the basis of business user interfaces like concept maps, or feed data mining techniques that predict customer behavior.
  7. Morphology is the study of the structure and form of words. Syntax is the study of how words and phrases form sentences. Semantics relates to the meaning of words and statements. Phonology is the study of sounds in language. Pragmatics is the study of idiomatic phrases that cannot be analyzed with strict semantic analysis. We tend to deal with the first three and ignore the last two when we are talking about natural language processing.
  8. So how do we get from text to concepts? Linguistics, the science of text, includes ideas such as 1) morphology, or how words change based on part of speech, 2) syntax, or how sentences are structured, 3) semantics, or the meaning of words, and 4) statistics, such as the frequency of terms and patterns. It takes linguistics to cut through the noise of text to find relevant concepts without leaving important concepts undiscovered. Other statistical or machine learning approaches fall short of linguistic extraction, because only a linguistic approach can deal with the ambiguity and complexity of text. That is, linguistic extraction is [click] accurate, [click] scalable, [click] customizable and [click] discovery oriented. By accurate, I mean that [click] compound words, proper nouns, etc., [click] like these examples, are extracted. [click] In terms of scalability, we can process about 1 GB per hour, multiple formats and multiple languages [click]. By customizable, [click] I mean that you can use dictionaries, rules and patterns [click] to tailor your extraction. Vertical resources can be used, like MeSH, which is the official medical thesaurus. And finally, [click] by discovery-oriented, I mean that, depending on your analysis, you can focus on known terms, unknown terms, new terms and [click] trends.
  9. For the next step, to move from concepts to Predictive Analytics, tools are available which address specific business needs, from delivering knowledge to adding prediction to operational systems. [click] LexiQuest Mine enables users to quickly identify key concepts, and the relationships between them, within thousands of documents. Mine displays these concepts and the links between them in an easy to navigate, color-coded graphical map and in trend analysis charts. Mine is designed for people who want to discover, structure and anticipate. [click] LexiQuest Categorize automatically catalogues documents into a predefined taxonomy based on their content. Able to “read” and understand content, Categorize automatically and accurately places a document into its proper category. From there, it can be sent to the right audience based on their profile or simply reside there for easy retrieval from a portal, intranet or extranet site. [click] Text Mining for Clementine is a new component of Clementine, which we will see in a few minutes. It can unlock knowledge contained in unstructured text data so that it can be combined with information from databases and other data sources to build better models using traditional data mining techniques.
  10. The extraction process works in three basic parts. First, a linguistic processor reads the text and comes up with a set of categories. These categories are passed to one of three different applications, depending upon the objective. The application may be a stand-alone concept understanding application, such as our LexiQuest Mine, which represents the concepts and illustrates their relationships. Our Clementine application uses the concepts as data, as part of a larger data mining application. Finally, categorization uses the concept information as the basis of further analysis. The final application layer can be used for visualization, data mining, or strict probabilistic assignment of information to known categories.
  11. Another text mining example is categorization. The folders on the left represent categories of different types of incoming emails. Text mining can be used to learn which emails, depending on their content, should be placed in each category (and therefore routed appropriately and automatically). [click] In this case an email on a problem with an ActiveX control [click] can be routed to Dev support.
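The routing idea in this note can be sketched as follows. This is a simplified illustration that uses hand-written keyword sets rather than learned categories, so it shows the routing step only, not the learning step. The category names other than "Dev support", the keyword sets, and the function `route_email` are all hypothetical.

```python
import re

# Hypothetical categories and keywords; a real system would learn these from labeled emails.
CATEGORY_KEYWORDS = {
    "Dev support": {"activex", "control", "crash", "error"},
    "Sales": {"price", "quote", "license", "purchase"},
    "Training": {"course", "class", "tutorial"},
}

def route_email(body, default="Unsorted"):
    """Return the category whose keywords overlap most with the email body."""
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    # Score each category by the number of its keywords present in the email.
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to a default folder when no keyword matches at all.
    return best if scores[best] > 0 else default

print(route_email("I get an error when loading the ActiveX control"))
```

An email that matches no keywords falls into the default folder, which is where a human reviewer (or a retrained model) would pick it up.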
  12. MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Mental Disorders." At narrower levels are more specific headings such as "Ankle" and "Conduct Disorder." There are 21,973 descriptors in MeSH. There are also thousands of cross-references that assist in finding the most appropriate MeSH Heading, for example, Vitamin C see Ascorbic Acid. These entries include 23,512 printed see references and 102,346 other entry points.
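A thesaurus of this shape, with see references and broader headings, can be modeled with two small lookup tables. The sketch below is a toy under stated assumptions: it uses only the headings named in this note, it collapses the hierarchy to a single level (the real MeSH tree has intermediate levels that are omitted here), and the names `SEE_REFERENCES`, `BROADER`, `preferred_heading` and `broader_headings` are hypothetical.

```python
# See references: entry term -> preferred heading (example taken from the note above).
SEE_REFERENCES = {"Vitamin C": "Ascorbic Acid"}

# Simplified hierarchy: specific heading -> broader heading.
# Assumption: intermediate levels of the real thesaurus are left out.
BROADER = {
    "Ankle": "Anatomy",
    "Conduct Disorder": "Mental Disorders",
}

def preferred_heading(term):
    """Follow a see reference if one exists; otherwise return the term unchanged."""
    return SEE_REFERENCES.get(term, term)

def broader_headings(heading):
    """Return the chain of broader headings, from most specific to most general."""
    chain = []
    while heading in BROADER:
        heading = BROADER[heading]
        chain.append(heading)
    return chain

print(preferred_heading("Vitamin C"))
print(broader_headings("Ankle"))
```

Searching "at various levels of specificity" then amounts to matching a document either on a heading itself or on any heading in its broader chain.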
  13. Let’s define data mining: “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining means finding patterns or relationships in your data that you can use to solve your organization’s problems.
  14. How does one mine data? The CROSS INDUSTRY STANDARD PROCESS FOR DATA MINING provides a framework for all data mining efforts. This process focuses on business issues, allows the user to work with and interact with the data, works on the data mining process from beginning to end, and USES the results. The phases are: Business Understanding, where you might convert a business problem to a data mining problem; Data Understanding, where you get your first look at the data; Data Preparation, the hardest part, where you clean the data; Modeling, the neatest part, where you build predictions or MODELS; Evaluation, where you examine performance; and Deployment, where you actually integrate the results of your data mining into your organization. Notice the arrows going around the chart and back and forth amongst the boxes? These arrows show that data mining is an iterative process, where the miner may step from one box to another and back for an effective data mining activity.
  15. Broadly speaking, data mining can serve four basic purposes: prediction, segmentation, association and outlier detection. Prediction takes a known result, and attempts to combine input fields in order to best replicate this result. An example may be deciding if someone is a good or bad credit risk, or whether someone will churn or not. Segmentation finds groups within cases. The number of groups is unknown, i.e., any number of groups is possible. An example may be the examination of customer segments, or the creation of groupings within financial data. Association methods try to develop an “if this, then that” type of analysis. An example of this is that people who watch news programming also watch the weather network. Outlier detection is used to find atypical cases. These are examples of unusual behaviour vis-à-vis the rest of the data. An example of this may be fraud detection when examining claims data.
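The “if this, then that” idea behind association methods is usually measured with two numbers: support (how often both items occur together) and confidence (how often the consequent occurs given the antecedent). The sketch below computes both for the news and weather example; the viewing data is made up for illustration and `rule_metrics` is a hypothetical helper, not part of any SPSS product.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule: antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    # Support: share of all cases that contain both items.
    support = both / n if n else 0.0
    # Confidence: share of antecedent cases that also contain the consequent.
    confidence = both / ante if ante else 0.0
    return support, confidence

# Made-up viewing records, one set of channels per viewer.
viewers = [
    {"news", "weather"},
    {"news", "weather", "sports"},
    {"news"},
    {"sports"},
    {"weather"},
]

support, confidence = rule_metrics(viewers, "news", "weather")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data two of five viewers watch both channels, and two of the three news viewers also watch weather, so the rule holds for most but not all news viewers.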
  16. The ability to move to the why from the other questions: who, what, when, where, and how much. Personalization for customer knowledge, which allows marketers to craft messages aimed more at individuals than at generalized groups. Understanding events inside and outside the organization and how they relate. The ability to assess competitors' technology and to understand the technology positions of the market.
  17. Questions