Submit Search
Upload
data-mining-tutorial.ppt
•
Download as PPT, PDF
•
0 likes
•
94 views
P
PerumalPitchandi
Follow
Data mining introduction
Read less
Read more
Engineering
Report
Share
Report
Share
1 of 89
Download now
Recommended
03. Data Exploration.pptx
03. Data Exploration.pptx
Sarojkumari55
Data science
Data science
Purna Chander
Introduction to XPath
Introduction to XPath
torp42
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
Simplilearn
Data science & data scientist
Data science & data scientist
VijayMohan Vasu
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
Visualization For Data Science
Visualization For Data Science
Angela Zoss
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
Precisely
Recommended
03. Data Exploration.pptx
03. Data Exploration.pptx
Sarojkumari55
Data science
Data science
Purna Chander
Introduction to XPath
Introduction to XPath
torp42
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
Simplilearn
Data science & data scientist
Data science & data scientist
VijayMohan Vasu
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
Visualization For Data Science
Visualization For Data Science
Angela Zoss
Data Profiling: The First Step to Big Data Quality
Data Profiling: The First Step to Big Data Quality
Precisely
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
What is Data Science
What is Data Science
Ioannis Kourouklides
Exploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Edureka!
Machine learning
Machine learning
Dr Geetha Mohan
Content based filtering
Content based filtering
Bendito Freitas Ribeiro
Introduction to data analytics
Introduction to data analytics
SSaudia
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
Data Science Thailand
Data Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
Ivo Andreev
Big data
Big data
Samira Riki
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
Bhaskar Mitra
Big data Presentation
Big data Presentation
Aswadmehar
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
Data mining presentation.ppt
Data mining presentation.ppt
neelamoberoi1030
Data science
Data science
Ranjit Nambisan
Data mining with big data implementation
Data mining with big data implementation
Sandip Tipayle Patil
End-to-End Machine Learning Project
End-to-End Machine Learning Project
Eng Teong Cheah
Data science workshop
Data science workshop
Hortonworks
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
Introduction to Data Mining
Introduction to Data Mining
DataminingTools Inc
Cssu dw dm
Cssu dw dm
sumit621
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Dr. Abdul Ahad Abro
More Related Content
What's hot
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
What is Data Science
What is Data Science
Ioannis Kourouklides
Exploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Edureka!
Machine learning
Machine learning
Dr Geetha Mohan
Content based filtering
Content based filtering
Bendito Freitas Ribeiro
Introduction to data analytics
Introduction to data analytics
SSaudia
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
Data Science Thailand
Data Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
Ivo Andreev
Big data
Big data
Samira Riki
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
Bhaskar Mitra
Big data Presentation
Big data Presentation
Aswadmehar
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Vaibhav Khanna
Data mining presentation.ppt
Data mining presentation.ppt
neelamoberoi1030
Data science
Data science
Ranjit Nambisan
Data mining with big data implementation
Data mining with big data implementation
Sandip Tipayle Patil
End-to-End Machine Learning Project
End-to-End Machine Learning Project
Eng Teong Cheah
Data science workshop
Data science workshop
Hortonworks
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
error007
Introduction to Data Mining
Introduction to Data Mining
DataminingTools Inc
What's hot
(20)
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What is Data Science
What is Data Science
Exploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Machine learning
Machine learning
Content based filtering
Content based filtering
Introduction to data analytics
Introduction to data analytics
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
Data Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
Big data
Big data
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
Big data Presentation
Big data Presentation
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
Data mining presentation.ppt
Data mining presentation.ppt
Data science
Data science
Data mining with big data implementation
Data mining with big data implementation
End-to-End Machine Learning Project
End-to-End Machine Learning Project
Data science workshop
Data science workshop
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Introduction to Data Mining
Introduction to Data Mining
Similar to data-mining-tutorial.ppt
Cssu dw dm
Cssu dw dm
sumit621
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Dr. Abdul Ahad Abro
Data mining intro-2009-v2
Data mining intro-2009-v2
Prithwis Mukerjee
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
Prasant Misra
Data Mining In Market Research
Data Mining In Market Research
jim
Data Mining in Market Research
Data Mining in Market Research
butest
Data Mining In Market Research
Data Mining In Market Research
kevinlan
Introduction to Data Mining
Introduction to Data Mining
Izwan Nizal Mohd Shaharanee
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
Izwan Nizal Mohd Shaharanee
Bt9001, data mining
Bt9001, data mining
smumbahelp
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
Anat Reiner-Benaim
EDRG12_Re.doc
EDRG12_Re.doc
butest
EDRG12_Re.doc
EDRG12_Re.doc
butest
Data mining introduction
Data mining introduction
Basma Gamal
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
Anubhav Jain
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
IOSRjournaljce
Data preparation and processing chapter 2
Data preparation and processing chapter 2
Mahmoud Alfarra
Anonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloud
eSAT Journals
My3prep
My3prep
asad199
DM Lecture 3
DM Lecture 3
asad199
Similar to data-mining-tutorial.ppt
(20)
Cssu dw dm
Cssu dw dm
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Data mining intro-2009-v2
Data mining intro-2009-v2
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
Data Mining In Market Research
Data Mining In Market Research
Data Mining in Market Research
Data Mining in Market Research
Data Mining In Market Research
Data Mining In Market Research
Introduction to Data Mining
Introduction to Data Mining
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
Bt9001, data mining
Bt9001, data mining
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
EDRG12_Re.doc
EDRG12_Re.doc
EDRG12_Re.doc
EDRG12_Re.doc
Data mining introduction
Data mining introduction
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
Data preparation and processing chapter 2
Data preparation and processing chapter 2
Anonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloud
My3prep
My3prep
DM Lecture 3
DM Lecture 3
More from PerumalPitchandi
Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System Introduction
PerumalPitchandi
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx
PerumalPitchandi
biv_mult.ppt
biv_mult.ppt
PerumalPitchandi
ppt_ids-data science.pdf
ppt_ids-data science.pdf
PerumalPitchandi
ANOVA Presentation.ppt
ANOVA Presentation.ppt
PerumalPitchandi
Data Science Intro.pptx
Data Science Intro.pptx
PerumalPitchandi
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.ppt
PerumalPitchandi
SW_Cost_Estimation.ppt
SW_Cost_Estimation.ppt
PerumalPitchandi
CostEstimation-1.ppt
CostEstimation-1.ppt
PerumalPitchandi
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt
PerumalPitchandi
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx
PerumalPitchandi
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptx
PerumalPitchandi
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
PerumalPitchandi
Introduction to Data Science.pptx
Introduction to Data Science.pptx
PerumalPitchandi
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.ppt
PerumalPitchandi
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
PerumalPitchandi
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.ppt
PerumalPitchandi
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptx
PerumalPitchandi
Data_exploration.ppt
Data_exploration.ppt
PerumalPitchandi
state-spaces29Sep06.ppt
state-spaces29Sep06.ppt
PerumalPitchandi
More from PerumalPitchandi
(20)
Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System Introduction
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx
biv_mult.ppt
biv_mult.ppt
ppt_ids-data science.pdf
ppt_ids-data science.pdf
ANOVA Presentation.ppt
ANOVA Presentation.ppt
Data Science Intro.pptx
Data Science Intro.pptx
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.ppt
SW_Cost_Estimation.ppt
SW_Cost_Estimation.ppt
CostEstimation-1.ppt
CostEstimation-1.ppt
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
Introduction to Data Science.pptx
Introduction to Data Science.pptx
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.ppt
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.ppt
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptx
Data_exploration.ppt
Data_exploration.ppt
state-spaces29Sep06.ppt
state-spaces29Sep06.ppt
Recently uploaded
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
me23b1001
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
k795866
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Anamika Sarkar
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
959SahilShah
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
RajaP95
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting .
Satyam Kumar
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
rehmti665
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
Soham Mondal
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
null - The Open Security Community
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
hassan khalil
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Dr.Costas Sachpazis
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
wendy cai
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
Tsuyoshi Horigome
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
Asst.prof M.Gokilavani
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
jennyeacort
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
9953056974 Low Rate Call Girls In Saket, Delhi NCR
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
RajaP95
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
vipinkmenon1
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
DeelipZope
Recently uploaded
(20)
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting .
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
data-mining-tutorial.ppt
1.
© 2006 KDnuggets Data
Mining Tutorial Gregory Piatetsky-Shapiro KDnuggets
2.
© 2006 KDnuggets
2 Outline Introduction Data Mining Tasks Classification & Evaluation Clustering Application Examples
3.
© 2006 KDnuggets
3 Trends leading to Data Flood More data is generated: Web, text, images … Business transactions, calls, ... Scientific data: astronomy, biology, etc More data is captured: Storage technology faster and cheaper DBMS can handle bigger DB
4.
© 2006 KDnuggets
4 Largest Databases in 2005 Winter Corp. 2005 Commercial Database Survey: 1. Max Planck Inst. for Meteorology , 222 TB 2. Yahoo ~ 100 TB (Largest Data Warehouse) 3. AT&T ~ 94 TB www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
5.
© 2006 KDnuggets
5 Data Growth In 2 years (2003 to 2005), the size of the largest database TRIPLED!
6.
© 2006 KDnuggets
6 Data Growth Rate Twice as much information was created in 2002 as in 1999 (~30% growth rate) Other growth rate estimates even higher Very little data will ever be looked at by a human Knowledge Discovery is NEEDED to make sense and use of data.
7.
© 2006 KDnuggets
7 Knowledge Discovery Definition Knowledge Discovery in Data is the non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data. from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
8.
© 2006 KDnuggets
8 Related Fields Statistics Machine Learning Databases Visualization Data Mining and Knowledge Discovery
9.
© 2006 KDnuggets
9 Statistics, Machine Learning and Data Mining Statistics: more theory-based more focused on testing hypotheses Machine learning more heuristic focused on improving performance of a learning agent also looks at real-time learning and robotics – areas not part of data mining Data Mining and Knowledge Discovery integrates theory and heuristics focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results Distinctions are fuzzy
10.
© 2006 KDnuggets
10 Knowledge Discovery Process flow, according to CRISP-DM Monitoring see www.crisp-dm.org for more information Continuous monitoring and improvement is an addition to CRISP
11.
© 2006 KDnuggets
11 Historical Note: Many Names of Data Mining Data Fishing, Data Dredging: 1960- used by statisticians (as bad name) Data Mining :1990 -- used in DB community, business Knowledge Discovery in Databases (1989-) used by AI, Machine Learning Community also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently: Data Mining and Knowledge Discovery are used interchangeably
12.
© 2006 KDnuggets Data
Mining Tasks
13.
© 2006 KDnuggets
13 Some Definitions Instance (also Item or Record): an example, described by a number of attributes, e.g. a day can be described by temperature, humidity and cloud status Attribute or Field measuring aspects of the Instance, e.g. temperature Class (Label) grouping of instances, e.g. days good for playing
14.
© 2006 KDnuggets
14 Major Data Mining Tasks Classification: predicting an item class Clustering: finding clusters in data Associations: e.g. A & B & C occur frequently Visualization: to facilitate human discovery Summarization: describing a group Deviation Detection: finding changes Estimation: predicting a continuous value Link Analysis: finding relationships …
15.
© 2006 KDnuggets
15 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
16.
© 2006 KDnuggets
16 Clustering Find “natural” grouping of instances given un-labeled data
17.
© 2006 KDnuggets
17 Association Rules & Frequent Itemsets TID Produce 1 MILK, BREAD, EGGS 2 BREAD, SUGAR 3 BREAD, CEREAL 4 MILK, BREAD, SUGAR 5 MILK, CEREAL 6 BREAD, CEREAL 7 MILK, CEREAL 8 MILK, BREAD, CEREAL, EGGS 9 MILK, BREAD, CEREAL Transactions Frequent Itemsets: Milk, Bread (4) Bread, Cereal (3) Milk, Bread, Cereal (2) … Rules: Milk => Bread (66%)
18.
© 2006 KDnuggets
18 Visualization & Data Mining Visualizing the data to facilitate human discovery Presenting the discovered results in a visually "nice" way
19.
© 2006 KDnuggets
19 Summarization Describe features of the selected group Use natural language and graphics Usually in Combination with Deviation detection or other methods Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
20.
© 2006 KDnuggets
20 Data Mining Central Quest Find true patterns and avoid overfitting (finding seemingly signifcant but really random patterns due to searching too many possibilites)
21.
© 2006 KDnuggets Classification
Methods
22.
© 2006 KDnuggets
22 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ... Given a set of points from classes what is the class of new point ?
23.
© 2006 KDnuggets
23 Classification: Linear Regression Linear Regression w0 + w1 x + w2 y >= 0 Regression computes wi from data to minimize squared error to ‘fit’ the data Not flexible enough
24.
© 2006 KDnuggets
24 Regression for Classification Any regression technique can be used for classification Training: perform a regression for each class, setting the output to 1 for training instances that belong to class, and 0 for those that don’t Prediction: predict class corresponding to model with largest output value (membership value) For linear regression this is known as multi-response linear regression
25.
© 2006 KDnuggets
25 Classification: Decision Trees X Y if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue 5 2 3
26.
© 2006 KDnuggets
26 DECISION TREE An internal node is a test on an attribute. A branch represents an outcome of the test, e.g., Color=red. A leaf node represents a class label or class label distribution. At each node, one attribute is chosen to split training examples into distinct classes as much as possible A new instance is classified by following a matching path to a leaf node.
27.
© 2006 KDnuggets
27 Weather Data: Play or not Play? Outlook Temperature Humidity Windy Play? sunny hot high false No sunny hot high true No overcast hot high false Yes rain mild high false Yes rain cool normal false Yes rain cool normal true No overcast cool normal true Yes sunny mild high false No sunny cool normal false Yes rain mild normal false Yes sunny mild normal true Yes overcast mild high true Yes overcast hot normal false Yes rain mild high true No Note: Outlook is the Forecast, no relation to Microsoft email program
28.
© 2006 KDnuggets
28 overcast high normal false true sunny rain No No Yes Yes Yes Example Tree for “Play?” Outlook Humidity Windy
29.
© 2006 KDnuggets
29 Classification: Neural Nets Can select more complex regions Can be more accurate Also can overfit the data – find patterns in random noise
30.
© 2006 KDnuggets
30 Classification: other approaches Naïve Bayes Rules Support Vector Machines Genetic Algorithms … See www.KDnuggets.com/software/
31.
© 2006 KDnuggets Evaluation
32.
© 2006 KDnuggets
32 Evaluating which method works the best for classification No model is uniformly the best Dimensions for Comparison speed of training speed of model application noise tolerance explanation ability Best Results: Hybrid, Integrated models
33.
© 2006 KDnuggets
33 Comparison of Major Classification Approaches Train time Run Time Noise Toler ance Can Use Prior Know- ledge Accuracy on Customer Modelling Under- standable Decision Trees fast fast poor no medium medium Rules med fast poor no medium good Neural Networks slow fast good no good poor Bayesian slow fast good yes good good A hybrid method will have higher accuracy
34.
© 2006 KDnuggets
34 Evaluation of Classification Models How predictive is the model we learned? Error on the training data is not a good indicator of performance on future data The new data will probably not be exactly the same as the training data! Overfitting – fitting the training data too precisely - usually leads to poor results on new data
35.
© 2006 KDnuggets
35 Evaluation issues Possible evaluation measures: Classification Accuracy Total cost/benefit – when different errors involve different costs Lift and ROC curves Error in numeric predictions How reliable are the predicted results ?
36.
© 2006 KDnuggets
36 Classifier error rate Natural performance measure for classification problems: error rate Success: instance’s class is predicted correctly Error: instance’s class is predicted incorrectly Error rate: proportion of errors made over the whole set of instances Training set error rate: is way too optimistic! you can find patterns even in random data
37.
© 2006 KDnuggets
37 Evaluation on “LARGE” data If many (>1000) examples are available, including >100 examples from each class A simple evaluation will give useful results Randomly split data into training and test sets (usually 2/3 for train, 1/3 for test) Build a classifier using the train set and evaluate it using the test set
38.
© 2006 KDnuggets
38 Classification Step 1: Split data into train and test sets Results Known + + - - + THE PAST Data Training set Testing set
39.
© 2006 KDnuggets
39 Classification Step 2: Build a model on a training set Training set Results Known + + - - + THE PAST Data Model Builder Testing set
40.
© 2006 KDnuggets
40 Classification Step 3: Evaluate on test set (Re-train?) Data Predictions Y N Results Known Training set Testing set + + - - + Model Builder Evaluate + - + -
41.
© 2006 KDnuggets
41 Unbalanced data Sometimes, classes have very unequal frequency Attrition prediction: 97% stay, 3% attrite (in a month) medical diagnosis: 90% healthy, 10% disease eCommerce: 99% don’t buy, 1% buy Security: >99.99% of Americans are not terrorists Similar situation with multiple classes Majority class classifier can be 97% correct, but useless
42.
© 2006 KDnuggets
42 Handling unbalanced data – how? If we have two classes that are very unbalanced, then how can we evaluate our classifier method?
43.
© 2006 KDnuggets
43 Balancing unbalanced data, 1 With two classes, a good approach is to build BALANCED train and test sets, and train model on a balanced set randomly select desired number of minority class instances add equal number of randomly selected majority class How do we generalize “balancing” to multiple classes?
44.
© 2006 KDnuggets
44 Balancing unbalanced data, 2 Generalize “balancing” to multiple classes Ensure that each class is represented with approximately equal proportions in train and test
45.
© 2006 KDnuggets
45 A note on parameter tuning It is important that the test data is not used in any way to create the classifier Some learning schemes operate in two stages: Stage 1: builds the basic structure Stage 2: optimizes parameter settings The test data can’t be used for parameter tuning! Proper procedure uses three sets: training data, validation data, and test data Validation data is used to optimize parameters
46.
© 2006 KDnuggets
46 Making the most of the data Once evaluation is complete, all the data can be used to build the final classifier Generally, the larger the training data the better the classifier (but returns diminish) The larger the test data the more accurate the error estimate
47.
© 2006 KDnuggets
47 Classification: Train, Validation, Test split Data Predictions Y N Results Known Training set Validation set + + - - + Model Builder Evaluate + - + - Final Model Final Test Set + - + - Final Evaluation Model Builder
48.
© 2006 KDnuggets
48 Cross-validation Cross-validation avoids overlapping test sets First step: data is split into k subsets of equal size Second step: each subset in turn is used for testing and the remainder for training This is called k-fold cross-validation Often the subsets are stratified before the cross- validation is performed The error estimates are averaged to yield an overall error estimate
49.
© 2006 KDnuggets
49 49 Cross-validation example: —Break up data into groups of the same size — — —Hold aside one group for testing and use the rest to build model — —Repeat Test
50.
© 2006 KDnuggets
50 More on cross-validation Standard method for evaluation: stratified ten- fold cross-validation Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate Stratification reduces the estimate’s variance Even better: repeated stratified cross-validation E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
51.
© 2006 KDnuggets
51 Direct Marketing Paradigm Find most likely prospects to contact Not everybody needs to be contacted Number of targets is usually much smaller than number of prospects Typical Applications retailers, catalogues, direct mail (and e-mail) customer acquisition, cross-sell, attrition prediction ...
52.
© 2006 KDnuggets
52 Direct Marketing Evaluation Accuracy on the entire dataset is not the right measure Approach develop a target model score all prospects and rank them by decreasing score select top P% of prospects for action How do we decide what is the best subset of prospects ?
53.
© 2006 KDnuggets
53 Model-Sorted List No Score Target CustID Age 1 0.97 Y 1746 … 2 0.95 N 1024 … 3 0.94 Y 2478 … 4 0.93 Y 3820 … 5 0.92 N 4897 … … … … … 99 0.11 N 2734 … 100 0.06 N 2422 Use a model to assign score to each customer Sort customers by decreasing score Expect more targets (hits) near the top of the list 3 hits in top 5% of the list If there 15 targets overall, then top 5 has 3/15=20% of targets
54.
© 2006 KDnuggets
54 CPH (Cumulative Pct Hits) 0 10 20 30 40 50 60 70 80 90 100 5 15 25 35 45 55 65 75 85 95 Random 5% of random list have 5% of targets Pct list Cumulative % Hits Definition: CPH(P,M) = % of all targets in the first P% of the list scored by model M CPH frequently called Gains
55.
© 2006 KDnuggets
55 CPH: Random List vs Model-ranked list 0 10 20 30 40 50 60 70 80 90 100 5 15 25 35 45 55 65 75 85 95 Random Model 5% of random list have 5% of targets, but 5% of model ranked list have 21% of targets CPH(5%,model)=21%. Pct list Cumulative % Hits
56.
© 2006 KDnuggets
56 Lift 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 15 25 35 45 55 65 75 85 95 Lift Lift(P,M) = CPH(P,M) / P P -- percent of the list Lift (at 5%) = 21% / 5% = 4.2 better than random Note: Some authors use “Lift” for what we call CPH.
57.
© 2006 KDnuggets
57 Lift – a measure of model quality Lift helps us decide which models are better If cost/benefit values are not available or changing, we can use Lift to select a better model. Model with the higher Lift curve will generally be better
58.
© 2006 KDnuggets Clustering
59.
© 2006 KDnuggets
59 Clustering Unsupervised learning: Finds “natural” grouping of instances given un-labeled data
60.
© 2006 KDnuggets
60 Clustering Methods Many different method and algorithms: For numeric and/or symbolic data Deterministic vs. probabilistic Exclusive vs. overlapping Hierarchical vs. flat Top-down vs. bottom-up
61.
© 2006 KDnuggets
61 Clustering Evaluation Manual inspection Benchmarking on existing labels Cluster quality measures distance measures high similarity within a cluster, low across clusters
62.
© 2006 KDnuggets
62 The distance function Simplest case: one numeric attribute A Distance(X,Y) = A(X) – A(Y) Several numeric attributes: Distance(X,Y) = Euclidean distance between X,Y Nominal attributes: distance is set to 1 if values are different, 0 if they are equal Are all attributes equally important? Weighting the attributes might be necessary
63.
© 2006 KDnuggets
63 Simple Clustering: K-means Works with numeric data only 1) Pick a number (K) of cluster centers (at random) 2) Assign every item to its nearest cluster center (e.g. using Euclidean distance) 3) Move each cluster center to the mean of its assigned items 4) Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold)
64.
© 2006 KDnuggets
64 K-means example, step 1 c1 c2 c3 X Y Pick 3 initial cluster centers (randomly) 1
65.
© 2006 KDnuggets
65 K-means example, step 2 c1 c2 c3 X Y Assign each point to the closest cluster center
66.
© 2006 KDnuggets
66 K-means example, step 3 X Y Move each cluster center to the mean of each cluster c1 c2 c2 c1 c3 c3
67.
© 2006 KDnuggets
67 K-means example, step 4a X Y Reassign points closest to a different new cluster center Q: Which points are reassigned? c1 c2 c3
68.
© 2006 KDnuggets
68 K-means example, step 4b X Y Reassign points closest to a different new cluster center Q: Which points are reassigned? c1 c2 c3
69.
© 2006 KDnuggets
69 K-means example, step 4c X Y A: three points with animation c1 c3 c2 1 3 2
70.
© 2006 KDnuggets
70 K-means example, step 4d X Y re-compute cluster means c1 c3 c2
71.
© 2006 KDnuggets
71 K-means example, step 5 X Y move cluster centers to cluster means c2 c1 c3
72.
© 2006 KDnuggets Data
Mining Applications
73.
© 2006 KDnuggets
73 Problems Suitable for Data-Mining require knowledge-based decisions have a changing environment have sub-optimal current methods have accessible, sufficient, and relevant data provides high payoff for the right decisions!
74.
© 2006 KDnuggets
74 Major Application Areas for Data Mining Solutions Advertising Bioinformatics Customer Relationship Management (CRM) Database Marketing Fraud Detection eCommerce Health Care Investment/Securities Manufacturing, Process Control Sports and Entertainment Telecommunications Web
75.
© 2006 KDnuggets
75 Application: Search Engines Before Google, web search engines used mainly keywords on a page – results were easily subject to manipulation Google's early success was partly due to its algorithm which uses mainly links to the page Google founders Sergey Brin and Larry Page were students at Stanford in 1990s Their research in databases and data mining led to Google
76.
© 2006 KDnuggets
76 Microarrays: Classifying Leukemia Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999 72 examples (38 train, 34 test), about 7,000 genes ALL AML Visually similar, but genetically very different Best Model: 97% accuracy, 1 error (sample suspected mislabelled)
77.
© 2006 KDnuggets
77 Microarray Potential Applications New and better molecular diagnostics Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip, based on Affymetrix technology New molecular targets for therapy few new drugs, large pipeline, … Improved treatment outcome Partially depends on genetic signature Fundamental Biological Discovery finding and refining biological pathways Personalized medicine ?!
78.
© 2006 KDnuggets
78 Application: Direct Marketing and CRM Most major direct marketing companies are using modeling and data mining Most financial companies are using customer modeling Modeling is easier than changing customer behaviour Example Verizon Wireless reduced customer attrition rate from 2% to 1.5%, saving many millions of $
79.
© 2006 KDnuggets
79 Application: e-Commerce Amazon.com recommendations if you bought (viewed) X, you are likely to buy Y Netflix If you liked "Monty Python and the Holy Grail", you get a recommendation for "This is Spinal Tap" Comparison shopping Froogle, mySimon, Yahoo Shopping, …
80.
© 2006 KDnuggets
80 Application: Security and Fraud Detection Credit Card Fraud Detection over 20 Million credit cards protected by Neural networks (Fair, Isaac) Securities Fraud Detection NASDAQ KDD system Phone fraud detection AT&T, Bell Atlantic, British Telecom/MCI
81.
© 2006 KDnuggets
81 Data Mining, Privacy, and Security TIA: Terrorism (formerly Total) Information Awareness Program – TIA program closed by Congress in 2003 because of privacy concerns However, in 2006 we learn that NSA is analyzing US domestic call info to find potential terrorists Invasion of Privacy or Needed Intelligence?
82.
© 2006 KDnuggets
82 Criticism of Analytic Approaches to Threat Detection: Data Mining will be ineffective - generate millions of false positives and invade privacy First, can data mining be effective?
83.
© 2006 KDnuggets
83 Can Data Mining and Statistics be Effective for Threat Detection? Criticism: Databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives Reality: Analytical models correlate many items of information to reduce false positives. Example: Identify one biased coin from 1,000. After one throw of each coin, we cannot After 30 throws, one biased coin will stand out with high probability. Can identify 19 biased coins out of 100 million with sufficient number of throws
84.
© 2006 KDnuggets
84 Another Approach: Link Analysis Can find unusual patterns in the network structure
85.
© 2006 KDnuggets
85 Analytic technology can be effective Data Mining is just one additional tool to help analysts Combining multiple models and link analysis can reduce false positives Today there are millions of false positives with manual analysis Analytic technology has the potential to reduce the current high rate of false positives
86.
© 2006 KDnuggets
86 Data Mining with Privacy Data Mining looks for patterns, not people! Technical solutions can limit privacy invasion Replacing sensitive personal data with anon. ID Give randomized outputs Multi-party computation – distributed data … Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
87.
© 2006 KDnuggets
87 1990 1998 2000 2002 Expectations Performance The Hype Curve for Data Mining and Knowledge Discovery Over-inflated expectations Disappointment Growing acceptance and mainstreaming rising expectations 2005
88.
© 2006 KDnuggets
88 Summary Data Mining and Knowledge Discovery are needed to deal with the flood of data Knowledge Discovery is a process ! Avoid overfitting (finding random patterns by searching too many possibilities)
89.
© 2006 KDnuggets
89 Additional Resources www.KDnuggets.com data mining software, jobs, courses, etc www.acm.org/sigkdd ACM SIGKDD – the professional society for data mining
Download now