SlideShare a Scribd company logo
1 of 32
From Data Mining
to
Knowledge Discovery:
An Introduction
Gregory Piatetsky-Shapiro
KDnuggets
2
Outline
Introduction
Data Mining Tasks
Application Examples
3
Trends leading to Data Flood
 More data is generated:
 Bank, telecom, other
business transactions ...
 Scientific Data: astronomy,
biology, etc
 Web, text, and e-commerce
 More data is captured:
 Storage technology faster
and cheaper
 DBMS capable of handling
bigger DB
4
Examples
 Europe's Very Long Baseline Interferometry
(VLBI) has 16 telescopes, each of which produces
1 Gigabit/second of astronomical data over a
25-day observation session
 storage and analysis a big problem
 Walmart reported to have 24 Tera-byte DB
 AT&T handles billions of calls per day
 data cannot be stored -- analysis is done on the fly
5
Growth Trends
 Moore’s law
 Computer Speed doubles every 18
months
 Storage law
 total storage doubles every 9
months
 Consequence
 very little data will ever be looked at
by a human
 Knowledge Discovery is NEEDED
to make sense and use of data.
6
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data
Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
7
Related Fields
Statistics
Machine
Learning
Databases
Visualization
Data Mining and
Knowledge Discovery
8
__
__
__
__
__
__
__
__
__
Transformed
Data
Patterns
and
Rules
Target
Data
Raw
Dat
a
Knowledge
Interpretation
& Evaluation
Integration
Understanding
Knowledge Discovery Process
DATA
Ware
house
Knowledge
9
Outline
Introduction
Data Mining Tasks
Application Examples
10
Data Mining Tasks: Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
11
Classification: Linear Regression
 Linear Regression
w0 + w1 x + w2 y >= 0
 Regression computes
wi from data to
minimize squared
error to ‘fit’ the data
 Not flexible enough
12
Classification: Decision Trees
X
Y
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
5
2
3
13
Classification: Neural Nets
 Can select more
complex regions
 Can be more accurate
 Also can overfit the
data – find patterns in
random noise
14
Data Mining Central Quest
Find true patterns
and avoid overfitting
(false patterns due
to randomness)
15
Data Mining Tasks: Clustering
Find “natural” grouping of
instances given un-labeled data
16
Major Data Mining Tasks
 Classification: predicting an item class
 Clustering: finding clusters in data
 Associations: e.g. A & B & C occur frequently
 Visualization: to facilitate human discovery
 Estimation: predicting a continuous value
 Deviation Detection: finding changes
 Link Analysis: finding relationships
 …
17
www.KDnuggets.com
Data Mining Software Guide
18
Outline
Introduction
Data Mining Tasks
Application Examples
19
Major Application Areas for
Data Mining Solutions
 Advertising
 Bioinformatics
 Customer Relationship Management (CRM)
 Database Marketing
 Fraud Detection
 eCommerce
 Health Care
 Investment/Securities
 Manufacturing, Process Control
 Sports and Entertainment
 Telecommunications
 Web
20
Case Study: Search Engines
 Early search engines used mainly keywords on a
page – were subject to manipulation
 Google success is due to its algorithm which uses
mainly links to the page
 Google founders Sergey Brin and Larry Page were
students in Stanford doing research in databases
and data mining in 1998 which led to Google
21
Case Study:
Direct Marketing and CRM
 Most major direct marketing companies are using
modeling and data mining
 Most financial companies are using customer
modeling
 Modeling is easier than changing customer
behaviour
 Some successes
 Verizon Wireless reduced churn rate from 2% to 1.5%
22
Biology: Molecular Diagnostics
 Leukemia: Acute Lymphoblastic (ALL) vs Acute
Myeloid (AML)
 72 samples, about 7,000 genes
ALL AML
Results: 33 correct (97% accuracy),
1 error (sample suspected mislabelled)
Outcome predictions?
23
AF1q: New Marker for
Medulloblastoma?
 AF1Q ALL1-fused gene from chromosome 1q
 transmembrane protein
 Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
24
Case Study:
Security and Fraud Detection
 Credit Card Fraud Detection
 Money laundering
 FAIS (US Treasury)
 Securities Fraud
 NASDAQ Sonar system
 Phone fraud
 AT&T, Bell Atlantic, British Telecom/MCI
 Bio-terrorism detection at Salt Lake
Olympics 2002
25
Data Mining and Terrorism:
Controversy in the News
 TIA: Terrorism (formerly Total) Information
Awareness Program –
 DARPA program closed by Congress
 some functions transferred to intelligence agencies
 CAPPS II – screen all airline passengers
 controversial
 …
 Invasion of Privacy or Defensive Shield?
26
Criticism of analytic approach to
Threat Detection:
Data Mining will
 invade privacy
 generate millions of false positives
But can it be effective?
27
Can Data Mining and Statistics be
Effective for Threat Detection?
 Criticism: Databases have 5% errors, so analyzing
100 million suspects will generate 5 million false
positives
 Reality: Analytical models correlate many items of
information to reduce false positives.
 Example: Identify one biased coin from 1,000.
 After one throw of each coin, we cannot
 After 30 throws, one biased coin will stand out with
high probability.
 Can identify 19 biased coins out of 100 million with
sufficient number of throws
28
Another Approach: Link Analysis
Can Find Unusual Patterns in the Network Structure
29
Analytic technology can be effective
 Combining multiple models and link analysis can
reduce false positives
 Today there are millions of false positives with
manual analysis
 Data Mining is just one additional tool to help
analysts
 Analytic Technology has the potential to reduce
the current high rate of false positives
30
Data Mining with Privacy
 Data Mining looks for patterns, not people!
 Technical solutions can limit privacy invasion
 Replacing sensitive personal data with anon. ID
 Give randomized outputs
 Multi-party computation – distributed data
 …
 Bayardo & Srikant, Technological Solutions for
Protecting Privacy, IEEE Computer, Sep 2003
31
1990
1998 2000 2002
Expectations
Performance
The Hype Curve for
Data Mining and Knowledge Discovery
Over-inflated
expectations
Disappointment
Growing acceptance
and mainstreaming
rising
expectations
32
Summary
www.KDnuggets.com – the website for
Data Mining and Knowledge Discovery
Contact: Gregory Piatetsky-Shapiro
gregory@kdnuggets.com

More Related Content

Similar to JanData-mining-to-knowledge-discovery.ppt

Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesT.S. Lim
 
20130618 presentation big data in financial services English
20130618 presentation big data in financial services English20130618 presentation big data in financial services English
20130618 presentation big data in financial services EnglishPascal Spelier
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptSangrangBargayary3
 
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Francesco D'Orazio
 
Chicago crime conference
Chicago crime conferenceChicago crime conference
Chicago crime conferenceMichael Jackson
 
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech oneHeavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech onePulsar
 
Big data for development
Big data for development Big data for development
Big data for development Junaid Qadir
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Big data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-webBig data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-webRick Bouter
 
introduction to machin learning
introduction to machin learningintroduction to machin learning
introduction to machin learningnilimapatel6
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementationSandip Tipayle Patil
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.pptadmsoyadm4
 

Similar to JanData-mining-to-knowledge-discovery.ppt (20)

Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
 
20130618 presentation big data in financial services English
20130618 presentation big data in financial services English20130618 presentation big data in financial services English
20130618 presentation big data in financial services English
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
Lec 01
Lec 01Lec 01
Lec 01
 
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
Heavy, messy, misleading. Why Big Data is a human problem, not a technology one.
 
Chicago crime conference
Chicago crime conferenceChicago crime conference
Chicago crime conference
 
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech oneHeavy, Messy, Misleading: why Big Data is a human problem, not a tech one
Heavy, Messy, Misleading: why Big Data is a human problem, not a tech one
 
isd314-01
isd314-01isd314-01
isd314-01
 
Big data for development
Big data for development Big data for development
Big data for development
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Big data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-webBig data 4 4 the art of the possible 4-en-web
Big data 4 4 the art of the possible 4-en-web
 
3.2
3.23.2
3.2
 
introduction to machin learning
introduction to machin learningintroduction to machin learning
introduction to machin learning
 
i2ml3e-chap1.pptx
i2ml3e-chap1.pptxi2ml3e-chap1.pptx
i2ml3e-chap1.pptx
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
data mining
data miningdata mining
data mining
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 

More from georgejustymirobi1 (20)

Network IP Security.pdf
Network IP Security.pdfNetwork IP Security.pdf
Network IP Security.pdf
 
How To Write A Scientific Paper
How To Write A Scientific PaperHow To Write A Scientific Paper
How To Write A Scientific Paper
 
writing_the_research_paper.ppt
writing_the_research_paper.pptwriting_the_research_paper.ppt
writing_the_research_paper.ppt
 
Bluetooth.ppt
Bluetooth.pptBluetooth.ppt
Bluetooth.ppt
 
ABCD15042603583.pdf
ABCD15042603583.pdfABCD15042603583.pdf
ABCD15042603583.pdf
 
ch18 ABCD.pdf
ch18 ABCD.pdfch18 ABCD.pdf
ch18 ABCD.pdf
 
ch13 ABCD.ppt
ch13 ABCD.pptch13 ABCD.ppt
ch13 ABCD.ppt
 
BluetoothSecurity.ppt
BluetoothSecurity.pptBluetoothSecurity.ppt
BluetoothSecurity.ppt
 
1682302951397_PGP.pdf
1682302951397_PGP.pdf1682302951397_PGP.pdf
1682302951397_PGP.pdf
 
CNN
CNNCNN
CNN
 
CNN Algorithm
CNN AlgorithmCNN Algorithm
CNN Algorithm
 
applicationlayer.pptx
applicationlayer.pptxapplicationlayer.pptx
applicationlayer.pptx
 
Fair Bluetooth.pdf
Fair Bluetooth.pdfFair Bluetooth.pdf
Fair Bluetooth.pdf
 
Bluetooth.pptx
Bluetooth.pptxBluetooth.pptx
Bluetooth.pptx
 
Research Score.pdf
Research Score.pdfResearch Score.pdf
Research Score.pdf
 
educational_technology_meena_arora.ppt
educational_technology_meena_arora.ppteducational_technology_meena_arora.ppt
educational_technology_meena_arora.ppt
 
Array.ppt
Array.pptArray.ppt
Array.ppt
 
PYTHON-PROGRAMMING-UNIT-II (1).pptx
PYTHON-PROGRAMMING-UNIT-II (1).pptxPYTHON-PROGRAMMING-UNIT-II (1).pptx
PYTHON-PROGRAMMING-UNIT-II (1).pptx
 
cprogrammingoperator.ppt
cprogrammingoperator.pptcprogrammingoperator.ppt
cprogrammingoperator.ppt
 
cprogrammingarrayaggregatetype.ppt
cprogrammingarrayaggregatetype.pptcprogrammingarrayaggregatetype.ppt
cprogrammingarrayaggregatetype.ppt
 

Recently uploaded

The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfadityarao40181
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 

Recently uploaded (20)

The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

JanData-mining-to-knowledge-discovery.ppt

  • 1. From Data Mining to Knowledge Discovery: An Introduction Gregory Piatetsky-Shapiro KDnuggets
  • 3. 3 Trends leading to Data Flood  More data is generated:  Bank, telecom, other business transactions ...  Scientific Data: astronomy, biology, etc  Web, text, and e-commerce  More data is captured:  Storage technology faster and cheaper  DBMS capable of handling bigger DB
  • 4. 4 Examples  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session  storage and analysis a big problem  Walmart reported to have 24 Tera-byte DB  AT&T handles billions of calls per day  data cannot be stored -- analysis is done on the fly
  • 5. 5 Growth Trends  Moore’s law  Computer Speed doubles every 18 months  Storage law  total storage doubles every 9 months  Consequence  very little data will ever be looked at by a human  Knowledge Discovery is NEEDED to make sense and use of data.
  • 6. 6 Knowledge Discovery Definition Knowledge Discovery in Data is the non-trivial process of identifying  valid  novel  potentially useful  and ultimately understandable patterns in data. from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
  • 10. 10 Data Mining Tasks: Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 11. 11 Classification: Linear Regression  Linear Regression w0 + w1 x + w2 y >= 0  Regression computes wi from data to minimize squared error to ‘fit’ the data  Not flexible enough
  • 12. 12 Classification: Decision Trees X Y if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue 5 2 3
  • 13. 13 Classification: Neural Nets  Can select more complex regions  Can be more accurate  Also can overfit the data – find patterns in random noise
  • 14. 14 Data Mining Central Quest Find true patterns and avoid overfitting (false patterns due to randomness)
  • 15. 15 Data Mining Tasks: Clustering Find “natural” grouping of instances given un-labeled data
  • 16. 16 Major Data Mining Tasks  Classification: predicting an item class  Clustering: finding clusters in data  Associations: e.g. A & B & C occur frequently  Visualization: to facilitate human discovery  Estimation: predicting a continuous value  Deviation Detection: finding changes  Link Analysis: finding relationships  …
  • 19. 19 Major Application Areas for Data Mining Solutions  Advertising  Bioinformatics  Customer Relationship Management (CRM)  Database Marketing  Fraud Detection  eCommerce  Health Care  Investment/Securities  Manufacturing, Process Control  Sports and Entertainment  Telecommunications  Web
  • 20. 20 Case Study: Search Engines  Early search engines used mainly keywords on a page – were subject to manipulation  Google success is due to its algorithm which uses mainly links to the page  Google founders Sergey Brin and Larry Page were students in Stanford doing research in databases and data mining in 1998 which led to Google
  • 21. 21 Case Study: Direct Marketing and CRM  Most major direct marketing companies are using modeling and data mining  Most financial companies are using customer modeling  Modeling is easier than changing customer behaviour  Some successes  Verizon Wireless reduced churn rate from 2% to 1.5%
  • 22. 22 Biology: Molecular Diagnostics  Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)  72 samples, about 7,000 genes ALL AML Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled) Outcome predictions?
  • 23. 23 AF1q: New Marker for Medulloblastoma?  AF1Q ALL1-fused gene from chromosome 1q  transmembrane protein  Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
  • 24. 24 Case Study: Security and Fraud Detection  Credit Card Fraud Detection  Money laundering  FAIS (US Treasury)  Securities Fraud  NASDAQ Sonar system  Phone fraud  AT&T, Bell Atlantic, British Telecom/MCI  Bio-terrorism detection at Salt Lake Olympics 2002
  • 25. 25 Data Mining and Terrorism: Controversy in the News  TIA: Terrorism (formerly Total) Information Awareness Program –  DARPA program closed by Congress  some functions transferred to intelligence agencies  CAPPS II – screen all airline passengers  controversial  …  Invasion of Privacy or Defensive Shield?
  • 26. 26 Criticism of analytic approach to Threat Detection: Data Mining will  invade privacy  generate millions of false positives But can it be effective?
  • 27. 27 Can Data Mining and Statistics be Effective for Threat Detection?  Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives  Reality: Analytical models correlate many items of information to reduce false positives.  Example: Identify one biased coin from 1,000.  After one throw of each coin, we cannot  After 30 throws, one biased coin will stand out with high probability.  Can identify 19 biased coins out of 100 million with sufficient number of throws
  • 28. 28 Another Approach: Link Analysis Can Find Unusual Patterns in the Network Structure
  • 29. 29 Analytic technology can be effective  Combining multiple models and link analysis can reduce false positives  Today there are millions of false positives with manual analysis  Data Mining is just one additional tool to help analysts  Analytic Technology has the potential to reduce the current high rate of false positives
  • 30. 30 Data Mining with Privacy  Data Mining looks for patterns, not people!  Technical solutions can limit privacy invasion  Replacing sensitive personal data with anon. ID  Give randomized outputs  Multi-party computation – distributed data  …  Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
  • 31. 31 1990 1998 2000 2002 Expectations Performance The Hype Curve for Data Mining and Knowledge Discovery Over-inflated expectations Disappointment Growing acceptance and mainstreaming rising expectations
  • 32. 32 Summary www.KDnuggets.com – the website for Data Mining and Knowledge Discovery Contact: Gregory Piatetsky-Shapiro gregory@kdnuggets.com