From Data Mining to Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro
KDnuggets
Outline
Introduction
Data Mining Tasks
Application Examples
Trends leading to Data Flood
 More data is generated:
   Bank, telecom, and other business transactions
   Scientific data: astronomy, biology, etc.
   Web, text, and e-commerce
 More data is captured:
   Storage technology is faster and cheaper
   DBMS capable of handling bigger databases
Examples
 Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
   storage and analysis are a big problem
 Walmart is reported to have a 24-terabyte database
 AT&T handles billions of calls per day
   the data cannot be stored, so analysis is done on the fly
Growth Trends
 Moore's law
   Computer speed doubles every 18 months
 Storage law
   Total storage doubles every 9 months
 Consequence
   Very little data will ever be looked at by a human
 Knowledge Discovery is NEEDED to make sense of the data and to use it.
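The gap implied by these two doubling periods can be checked with a little arithmetic. The sketch below is illustrative only: the function name `growth_factor` and the three-year horizon are assumptions, not part of the talk.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Growth multiple after `months` when a quantity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


horizon = 36  # months; an arbitrary horizon chosen for illustration
compute = growth_factor(horizon, 18)  # speed doubling every 18 months
storage = growth_factor(horizon, 9)   # storage doubling every 9 months

print(f"compute x{compute:.0f}, storage x{storage:.0f}, ratio x{storage / compute:.0f}")
```

Under these doubling rates, three years bring a 16-fold increase in storage against a 4-fold increase in speed, so the amount of stored data per unit of computing power keeps growing.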
Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable
patterns in data.
From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
[Figure: Data Mining and Knowledge Discovery shown together with the related fields Statistics, Machine Learning, Databases, and Visualization]
Knowledge Discovery Process
[Figure: flow diagram. Data stages: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge. Step labels: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation, Understanding.]
Outline
Introduction
Data Mining Tasks
Application Examples
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Linear Regression
 Linear Regression
   w0 + w1 x + w2 y >= 0
 Regression computes wi from data to minimize squared error to 'fit' the data
 Not flexible enough
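As a rough illustration of the rule w0 + w1 x + w2 y >= 0, the sketch below fits the weights by least squares to targets of +1 (blue) and -1 (green) and then classifies by the sign of the fitted expression. The data points, the +1/-1 target encoding, and the helper names `solve`, `fit_least_squares`, and `classify` are assumptions made for this example; the slide does not specify them.

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def fit_least_squares(points, targets):
    """Minimize squared error of w0 + w1*x + w2*y against the targets (normal equations)."""
    rows = [[1.0, x, y] for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)


def classify(w, x, y):
    """Apply the slide's decision rule: one class when w0 + w1*x + w2*y >= 0."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Hypothetical training data: green near the origin, blue further out.
points = [(0, 0), (1, 0), (0, 1), (3, 3), (2, 3), (3, 2)]
targets = [-1, -1, -1, 1, 1, 1]

w = fit_least_squares(points, targets)
print([round(v, 3) for v in w])
print(classify(w, 4, 4), classify(w, 0.5, 0.5))
```

The fitted boundary is a single straight line, which is the limitation the slide points to: classes that cannot be separated by one line are out of reach for this rule.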
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: points plotted on X and Y axes, with splits marked at X = 2, X = 5, and Y = 3]
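The rules on this slide translate directly into code. The sketch below is a literal transcription; only the function name `classify` and the sample points are assumptions.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's rules in order; the first matching rule wins."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Hypothetical sample points, one for each branch of the tree.
for point in [(6, 1), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Each test compares one attribute with a threshold, so the tree carves the plane into axis-aligned rectangles; here green is the region with 2 < X <= 5 and Y <= 3.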
Classification: Neural Nets
 Can select more complex regions
 Can be more accurate
 Can also overfit the data: find patterns in random noise
Data Mining Central Quest
Find true patterns and avoid overfitting (false patterns due to randomness)
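A small experiment can make "false patterns due to randomness" concrete. The sketch below labels random points by coin flip, so there is nothing real to learn, and then lets a memorizing classifier loose on them. A 1-nearest-neighbour classifier stands in for any overly flexible model here; that choice, the seed, and the sample sizes are all assumptions for illustration.

```python
import random


def nearest_label(train, x, y):
    """1-nearest-neighbour: return the label of the closest training point."""
    closest = min(train, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
    return closest[2]


def accuracy(train, data):
    """Share of points in `data` whose label matches the prediction."""
    hits = sum(nearest_label(train, x, y) == label for x, y, label in data)
    return hits / len(data)


def random_points(rng, n):
    # Labels are coin flips, so there is no true pattern to learn.
    return [(rng.random(), rng.random(), rng.choice(["blue", "green"])) for _ in range(n)]


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
train = random_points(rng, 200)
fresh = random_points(rng, 500)

train_acc = accuracy(train, train)
fresh_acc = accuracy(train, fresh)
print(f"training accuracy: {train_acc:.2f}")
print(f"fresh-data accuracy: {fresh_acc:.2f}")
```

The classifier reproduces its training labels perfectly because each point is its own nearest neighbour, yet on fresh data it can do no better than chance. Checking a model on data it has not seen is the usual guard against mistaking such memorized noise for a pattern.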
Data Mining Tasks: Clustering
Find a "natural" grouping of instances given unlabeled data
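The slide does not name a clustering algorithm. As one common example, the sketch below runs plain k-means on a handful of unlabeled points; the data, the hand-picked starting centers, and the function name `kmeans` are assumptions for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate nearest-center assignment and center update."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each point.
        labels = [
            min(range(len(centers)),
                key=lambda k: (p[0] - centers[k][0]) ** 2 + (p[1] - centers[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        centers = new_centers
    return labels, centers


# Hypothetical unlabeled data forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

No class labels are supplied; the grouping comes only from distances between points. The number of clusters and the starting centers are inputs that the analyst has to choose.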
Major Data Mining Tasks
 Classification: predicting an item class
 Clustering: finding clusters in data
 Associations: e.g. A & B & C occur frequently
 Visualization: to facilitate human discovery
 Estimation: predicting a continuous value
 Deviation Detection: finding changes
 Link Analysis: finding relationships
 …
www.KDnuggets.com
Data Mining Software Guide
Outline
Introduction
Data Mining Tasks
Application Examples
Major Application Areas for Data Mining Solutions
 Advertising
 Bioinformatics
 Customer Relationship Management (CRM)
 Database Marketing
 Fraud Detection
 eCommerce
 Health Care
 Investment/Securities
 Manufacturing, Process Control
 Sports and Entertainment
 Telecommunications
 Web
Case Study: Search Engines
 Early search engines relied mainly on the keywords on a page, which made them subject to manipulation
 Google's success is due to its algorithm, which relies mainly on links to the page
 Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google
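The idea of ranking a page by the links pointing to it can be sketched with a simplified PageRank-style power iteration. This is a toy version under stated assumptions, not Google's production algorithm: the four-page graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all chosen for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a link graph given as {page: [pages it links to]}.

    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        # Each page then passes a share of its current rank along its links.
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for illustration.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page C comes out on top because three pages link to it, regardless of what keywords any page contains. That is the property that made link-based ranking harder to manipulate than keyword matching.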
Case Study: Direct Marketing and CRM
 Most major direct marketing companies are using modeling and data mining
 Most financial companies are using customer modeling
 Modeling is easier than changing customer behaviour
 Some successes
   Verizon Wireless reduced churn rate from 2% to 1.5%
Biology: Molecular Diagnostics
 Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
 72 samples, about 7,000 genes
[Figure: panels labeled ALL and AML]
Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)
Outcome predictions?
AF1q: New Marker for Medulloblastoma?
 AF1Q: ALL1-fused gene from chromosome 1q
 Transmembrane protein
 Related to leukemia (3 PubMed entries) but not to medulloblastoma
Case Study: Security and Fraud Detection
 Credit card fraud detection
 Money laundering
   FAIS (US Treasury)
 Securities fraud
   NASDAQ Sonar system
 Phone fraud
   AT&T, Bell Atlantic, British Telecom/MCI
 Bio-terrorism detection at the Salt Lake Olympics, 2002
Data Mining and Terrorism: Controversy in the News
 TIA: Terrorism (formerly Total) Information Awareness Program
   DARPA program closed by Congress
   some functions transferred to intelligence agencies
 CAPPS II: screen all airline passengers
   controversial
 …
 Invasion of Privacy or Defensive Shield?
Criticism of the analytic approach to Threat Detection:
Data Mining will
 invade privacy
 generate millions of false positives
But can it be effective?
Can Data Mining and Statistics be Effective for Threat Detection?
 Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
 Reality: Analytical models correlate many items of information to reduce false positives.
 Example: Identify one biased coin from 1,000.
   After one throw of each coin, we cannot tell which coin is biased
   After 30 throws, the one biased coin will stand out with high probability.
   Can identify 19 biased coins out of 100 million with a sufficient number of throws
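The coin example can be checked with exact binomial probabilities. The slide does not say how biased the coin is or what counts as "standing out", so the sketch below assumes a coin that lands heads 90% of the time and a cutoff of 25 heads in 30 throws; both values and the function name `prob_at_least` are assumptions for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(at least k heads in n throws) for a coin with heads probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


throws = 30
threshold = 25    # flag a coin showing at least this many heads (assumed cutoff)
biased_p = 0.9    # assumed bias; the slide does not give one
fair_coins = 999  # the other coins among the 1,000

false_alarm = prob_at_least(threshold, throws, 0.5)
detection = prob_at_least(threshold, throws, biased_p)

print(f"expected fair coins flagged: {fair_coins * false_alarm:.2f}")
print(f"chance the biased coin is flagged: {detection:.2f}")
print(f"after one throw, share of fair coins showing heads: {prob_at_least(1, 1, 0.5):.2f}")
```

Under these assumptions, well under one fair coin is expected to cross the cutoff, while the biased coin crosses it more than 90% of the time. After a single throw, half of the fair coins look exactly like the biased one. Combining many observations per item is what brings the false positives down.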
Another Approach: Link Analysis
Can Find Unusual Patterns in the Network Structure
Analytic technology can be effective
 Combining multiple models and link analysis can reduce false positives
 Today there are millions of false positives with manual analysis
 Data Mining is just one additional tool to help analysts
 Analytic technology has the potential to reduce the current high rate of false positives
Data Mining with Privacy
 Data Mining looks for patterns, not people!
 Technical solutions can limit privacy invasion
   Replacing sensitive personal data with an anonymous ID
   Giving randomized outputs
   Multi-party computation over distributed data
   …
 Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
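The first technical solution listed, replacing sensitive personal data with an anonymous ID, might look like the sketch below. It uses a keyed hash (HMAC-SHA256) from Python's standard library as one possible way to generate stable IDs; the slides do not prescribe a method, and the key, the records, and the function name `pseudonymize` are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map a sensitive value to a stable anonymous ID using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an arbitrary choice


# Hypothetical records and key, invented for illustration.
key = b"replace-with-a-secret-key"
records = [("Alice Smith", 120.0), ("Bob Jones", 75.5), ("Alice Smith", 42.0)]
anonymized = [(pseudonymize(name, key), amount) for name, amount in records]
print(anonymized)
```

The same person always maps to the same ID, so patterns across records can still be mined, while the name itself no longer appears in the data. Replacing identifiers alone does not guarantee anonymity, since the remaining fields may still single out an individual.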
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: curves of Expectations and Performance over time, with the years 1990, 1998, 2000, and 2002 marked. Labeled phases: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming.]
Data mining and knowledge Discovery

  • 1. From Data Mining to Knowledge Discovery: An Introduction Gregory Piatetsky-Shapiro KDnuggets
  • 3. 33 Trends leading to Data Flood  More data is generated:  Bank, telecom, other business transactions ...  Scientific Data: astronomy, biology, etc  Web, text, and e-commerce  More data is captured:  Storage technology faster and cheaper  DBMS capable of handling bigger DB
  • 4. 44 Examples  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session  storage and analysis a big problem  Walmart reported to have 24 Tera-byte DB  AT&T handles billions of calls per day  data cannot be stored -- analysis is done on the fly
  • 5. 55 Growth Trends  Moore’s law  Computer Speed doubles every 18 months  Storage law  total storage doubles every 9 months  Consequence  very little data will ever be looked at by a human  Knowledge Discovery is NEEDED to make sense and use of data.
  • 6. Knowledge Discovery Definition
      Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
      From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
  • 10. Data Mining Tasks: Classification
      Learn a method for predicting the instance class from pre-labeled (classified) instances.
      Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 11. Classification: Linear Regression
      - Linear regression: w0 + w1 x + w2 y >= 0
      - Regression computes wi from data to minimize squared error to ‘fit’ the data
      - Not flexible enough
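As a rough sketch of the linear-regression slide, the code below fits w0, w1, w2 by least squares to targets of +1 and -1 and then classifies with the rule w0 + w1 x + w2 y >= 0. The toy data, the +1/-1 target encoding, and the function names are assumptions made for illustration; the slide gives only the decision rule and the squared-error criterion.

```python
def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_linear_classifier(points, targets):
    """Least-squares fit of w0 + w1*x + w2*y to targets of +1 / -1."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (R^T R) w = R^T t
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(gram, rhs)


def predict(w, x, y):
    """Class +1 when w0 + w1*x + w2*y >= 0, otherwise -1."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical toy data: class +1 sits up and to the right of class -1.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
targets = [-1, -1, -1, 1, 1, 1]
w = fit_linear_classifier(points, targets)
print([predict(w, x, y) for x, y in points])
```

Because the boundary is a single straight line, a fit like this cannot carve out nested or disjoint regions, which is one reading of the slide's "not flexible enough".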
  • 12. Classification: Decision Trees
      if X > 5 then blue
      else if Y > 3 then blue
      else if X > 2 then green
      else blue
      (Figure: X-Y plot with splits at X = 2, X = 5, and Y = 3.)
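The rules on the decision-tree slide translate directly into code. The function below is a literal transcription; only the name `classify` and the sample points are added for illustration.

```python
def classify(x: float, y: float) -> str:
    """Apply the slide's decision-tree rules in order."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


# Illustrative points, one per leaf of the tree.
for point in [(6, 1), (3, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

The green region is therefore the rectangle 2 < X <= 5, Y <= 3; everything else is blue.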
  • 13. Classification: Neural Nets
      - Can select more complex regions
      - Can be more accurate
      - Also can overfit the data: find patterns in random noise
  • 14. Data Mining Central Quest
      Find true patterns and avoid overfitting (false patterns due to randomness)
  • 15. Data Mining Tasks: Clustering
      Find “natural” grouping of instances given un-labeled data
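The clustering slide does not name an algorithm. As one possible illustration, the sketch below uses k-means, a common choice that is swapped in here as an assumption, on hypothetical 2-D points with hand-picked starting centers.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    groups = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


# Hypothetical unlabeled data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, groups = kmeans(points, centers=[(1, 1), (1, 2)])
print(groups)
```

No labels are supplied anywhere; the grouping comes only from distances between the points, which is what distinguishes this task from classification.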
  • 16. Major Data Mining Tasks
      - Classification: predicting an item class
      - Clustering: finding clusters in data
      - Associations: e.g. A & B & C occur frequently
      - Visualization: to facilitate human discovery
      - Estimation: predicting a continuous value
      - Deviation Detection: finding changes
      - Link Analysis: finding relationships
      - …
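For the "Associations" task in the list above, a minimal sketch is to count how often each combination of items occurs together and keep those above a threshold. The transactions, the threshold, and the function name `frequent_itemsets` are hypothetical; this brute-force count is an illustration, not an algorithm named in the slides.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, size, min_count):
    """Return itemsets of the given size that occur in at least min_count transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each itemset has one canonical tuple form.
        for combo in combinations(sorted(set(items)), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


# Hypothetical market-basket data.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "C"},
    {"A", "B", "C"},
    {"B", "D"},
]
print(frequent_itemsets(transactions, size=3, min_count=3))
```

Enumerating every combination grows quickly with basket size, so this approach only suits small examples.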
  • 19. Major Application Areas for Data Mining Solutions
      - Advertising
      - Bioinformatics
      - Customer Relationship Management (CRM)
      - Database Marketing
      - Fraud Detection
      - eCommerce
      - Health Care
      - Investment/Securities
      - Manufacturing, Process Control
      - Sports and Entertainment
      - Telecommunications
      - Web
  • 20. Case Study: Search Engines
      - Early search engines used mainly keywords on a page and were subject to manipulation
      - Google's success is due to its algorithm, which uses mainly links to the page
      - Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google
  • 21. Case Study: Direct Marketing and CRM
      - Most major direct marketing companies are using modeling and data mining
      - Most financial companies are using customer modeling
      - Modeling is easier than changing customer behaviour
      - Some successes
          - Verizon Wireless reduced churn rate from 2% to 1.5%
  • 22. Biology: Molecular Diagnostics
      - Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML)
      - 72 samples, about 7,000 genes
      (Figure labels: ALL, AML.)
      Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled)
      Outcome predictions?
  • 23. AF1q: New Marker for Medulloblastoma?
      - AF1Q ALL1-fused gene from chromosome 1q
      - transmembrane protein
      - Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
  • 24. Case Study: Security and Fraud Detection
      - Credit Card Fraud Detection
      - Money laundering
          - FAIS (US Treasury)
      - Securities Fraud
          - NASDAQ Sonar system
      - Phone fraud
          - AT&T, Bell Atlantic, British Telecom/MCI
      - Bio-terrorism detection at Salt Lake Olympics 2002
  • 25. Data Mining and Terrorism: Controversy in the News
      - TIA: Terrorism (formerly Total) Information Awareness Program
          - DARPA program closed by Congress
          - some functions transferred to intelligence agencies
      - CAPPS II: screen all airline passengers
          - controversial
      - …
      - Invasion of Privacy or Defensive Shield?
  • 26. Criticism of analytic approach to Threat Detection
      Data Mining will
      - invade privacy
      - generate millions of false positives
      But can it be effective?
  • 27. Can Data Mining and Statistics be Effective for Threat Detection?
      - Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
      - Reality: Analytical models correlate many items of information to reduce false positives.
      - Example: Identify one biased coin from 1,000.
          - After one throw of each coin, we cannot identify it
          - After 30 throws, one biased coin will stand out with high probability.
          - Can identify 19 biased coins out of 100 million with a sufficient number of throws
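The biased-coin example can be checked with exact binomial probabilities. The slide does not say how biased the coin is or what decision rule is used, so the values `BIASED_P = 0.9` and `THRESHOLD = 25` below are assumptions chosen only to illustrate the argument; the 1,000 coins and 30 throws come from the slide.

```python
from math import comb


def tail_probability(throws: int, min_heads: int, p_heads: float) -> float:
    """P(at least min_heads heads in `throws` independent throws)."""
    return sum(
        comb(throws, k) * p_heads ** k * (1 - p_heads) ** (throws - k)
        for k in range(min_heads, throws + 1)
    )


THROWS = 30       # from the slide
FAIR_COINS = 999  # 1,000 coins, one of them biased (from the slide)
BIASED_P = 0.9    # assumption: probability of heads for the biased coin
THRESHOLD = 25    # assumption: flag any coin with at least this many heads

p_fair_flagged = tail_probability(THROWS, THRESHOLD, 0.5)
p_biased_flagged = tail_probability(THROWS, THRESHOLD, BIASED_P)

print(f"expected fair coins flagged: {FAIR_COINS * p_fair_flagged:.4f}")
print(f"chance the biased coin is flagged: {p_biased_flagged:.3f}")
```

The point of the sketch is that combining many observations per coin drives the false-positive rate far below the per-observation error rate, which is the slide's answer to the "5% errors" criticism.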
  • 28. Another Approach: Link Analysis
      Can Find Unusual Patterns in the Network Structure
  • 29. Analytic technology can be effective
      - Combining multiple models and link analysis can reduce false positives
      - Today there are millions of false positives with manual analysis
      - Data Mining is just one additional tool to help analysts
      - Analytic Technology has the potential to reduce the current high rate of false positives
  • 30. Data Mining with Privacy
      - Data Mining looks for patterns, not people!
      - Technical solutions can limit privacy invasion
          - Replacing sensitive personal data with anon. ID
          - Give randomized outputs
          - Multi-party computation: distributed data
          - …
      - Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
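One way to picture "replacing sensitive personal data with anon. ID" is a keyed hash that maps each person to a stable identifier. This is only a sketch under assumptions: the slides do not say how the IDs are produced, and the function name `anonymous_id`, the sample key, and the sample records are all hypothetical.

```python
import hashlib
import hmac


def anonymous_id(personal_value: str, secret_key: bytes) -> str:
    """Map a personal identifier to a stable ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, personal_value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability


# Hypothetical key and records; a real key would be kept secret by the data owner.
key = b"example-key-not-for-real-use"
records = [("Alice Example", 120.50), ("Bob Example", 75.00), ("Alice Example", 19.99)]

# The same person always gets the same ID, so patterns across records survive
# while the name itself is no longer stored alongside the data.
anonymized = [(anonymous_id(name, key), amount) for name, amount in records]
print(anonymized)
```

Records belonging to the same person still link together after the replacement, so pattern mining remains possible without the analyst seeing names.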
  • 31. The Hype Curve for Data Mining and Knowledge Discovery
      (Figure: expectations and performance plotted over 1990, 1998, 2000, and 2002; labeled phases: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming.)