© 2006 KDnuggets
Data Mining
Tutorial
Gregory Piatetsky-Shapiro
KDnuggets
© 2006 KDnuggets 2
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
© 2006 KDnuggets 3
Trends leading to Data Flood
 More data is generated:
 Web, text, images …
 Business transactions, calls,
...
 Scientific data: astronomy,
biology, etc
 More data is captured:
 Storage technology faster
and cheaper
 DBMS can handle bigger DB
© 2006 KDnuggets 4
Largest Databases in 2005
Winter Corp. 2005 Commercial
Database Survey:
1. Max Planck Inst. for
Meteorology, 222 TB
2. Yahoo ~ 100 TB (Largest Data
Warehouse)
3. AT&T ~ 94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
© 2006 KDnuggets 5
Data Growth
In 2 years (2003 to 2005),
the size of the largest database TRIPLED!
© 2006 KDnuggets 6
Data Growth Rate
 Twice as much information was created in 2002
as in 1999 (~30% growth rate)
 Other growth rate estimates even higher
 Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense
and use of data.
© 2006 KDnuggets 7
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
 valid
 novel
 potentially useful
 and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data
Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
© 2006 KDnuggets 8
Related Fields
Statistics
Machine
Learning
Databases
Visualization
Data Mining and
Knowledge Discovery
© 2006 KDnuggets 9
Statistics, Machine Learning and
Data Mining
 Statistics:
 more theory-based
 more focused on testing hypotheses
 Machine learning
 more heuristic
 focused on improving performance of a learning agent
 also looks at real-time learning and robotics – areas not part of data
mining
 Data Mining and Knowledge Discovery
 integrates theory and heuristics
 focus on the entire process of knowledge discovery, including data
cleaning, learning, and integration and visualization of results
 Distinctions are fuzzy
© 2006 KDnuggets 10
Knowledge Discovery Process
flow, according to CRISP-DM
[Diagram: CRISP-DM process flow; “Monitoring” marks the continuous monitoring and improvement step, an addition to CRISP]
See www.crisp-dm.org for more information.
© 2006 KDnuggets 11
Historical Note:
Many Names of Data Mining
 Data Fishing, Data Dredging: 1960-
 used by statisticians (as a bad name)
 Data Mining: 1990-
 used in DB community, business
 Knowledge Discovery in Databases (1989-)
 used by AI, Machine Learning Community
 also Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery
are used interchangeably
© 2006 KDnuggets
Data Mining Tasks
© 2006 KDnuggets 13
Some Definitions
 Instance (also Item or Record):
 an example, described by a number of attributes,
 e.g. a day can be described by temperature, humidity
and cloud status
 Attribute or Field
 measuring aspects of the Instance, e.g. temperature
 Class (Label)
 grouping of instances, e.g. days good for playing
© 2006 KDnuggets 14
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
 Deviation Detection: finding changes
 Estimation: predicting a continuous value
 Link Analysis: finding relationships
 …
© 2006 KDnuggets 15
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
© 2006 KDnuggets 16
Clustering
Find “natural” grouping of
instances given un-labeled data
© 2006 KDnuggets 17
Association Rules &
Frequent Itemsets
TID Produce
1 MILK, BREAD, EGGS
2 BREAD, SUGAR
3 BREAD, CEREAL
4 MILK, BREAD, SUGAR
5 MILK, CEREAL
6 BREAD, CEREAL
7 MILK, CEREAL
8 MILK, BREAD, CEREAL, EGGS
9 MILK, BREAD, CEREAL
Transactions
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (3)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
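The counts above are easy to verify; here is a minimal brute-force Python sketch (an illustration, not from the slides) that counts itemset support over these nine transactions and computes the confidence of Milk => Bread:

```python
from itertools import combinations
from collections import Counter

# The nine transactions from the table above.
transactions = [
    {"MILK", "BREAD", "EGGS"}, {"BREAD", "SUGAR"}, {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"}, {"MILK", "CEREAL"}, {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"}, {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

# Brute force: count the support of every itemset of size 1-3.
support = Counter()
for t in transactions:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(t), size):
            support[itemset] += 1

print(support[("BREAD", "MILK")])            # 4 -> "Milk, Bread (4)"
print(support[("BREAD", "CEREAL", "MILK")])  # 2 -> "Milk, Bread, Cereal (2)"

# Confidence of the rule Milk => Bread: support(Milk & Bread) / support(Milk).
print(support[("BREAD", "MILK")] / support[("MILK",)])  # 4/6, i.e. ~66%
```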
© 2006 KDnuggets 18
Visualization & Data Mining
 Visualizing the data to
facilitate human
discovery
 Presenting the
discovered results in a
visually "nice" way
© 2006 KDnuggets 19
Summarization
 Describe features of the
selected group
 Use natural language
and graphics
 Usually in Combination
with Deviation detection
or other methods
Average length of stay in this study area rose 45.7 percent,
from 4.3 days to 6.2 days, because ...
© 2006 KDnuggets 20
Data Mining Central Quest
Find true patterns
and avoid overfitting
(finding seemingly significant
but really random patterns due
to searching too many possibilities)
© 2006 KDnuggets
Classification Methods
© 2006 KDnuggets 22
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Regression,
Decision Trees,
Bayesian,
Neural Networks,
...
Given a set of points from known classes,
what is the class of a new point?
© 2006 KDnuggets 23
Classification: Linear Regression
 Linear Regression
w0 + w1 x + w2 y >= 0
 Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
 Not flexible enough
© 2006 KDnuggets 24
Regression for Classification
 Any regression technique can be used for classification
 Training: perform a regression for each class, setting the
output to 1 for training instances that belong to class, and 0
for those that don’t
 Prediction: predict class corresponding to model with largest
output value (membership value)
 For linear regression this is known as multi-response
linear regression
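A minimal numpy sketch of multi-response linear regression (the synthetic two-class data and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: two numeric attributes per instance.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)

# Training: one least-squares regression per class, with output set to 1
# for instances of that class and 0 for the rest.
Xb = np.hstack([np.ones((len(X), 1)), X])        # prepend an intercept column
W = np.column_stack([
    np.linalg.lstsq(Xb, (y == c).astype(float), rcond=None)[0]
    for c in (0, 1)
])

# Prediction: the class whose regression gives the largest output.
def predict(X_new):
    Xb_new = np.hstack([np.ones((len(X_new), 1)), X_new])
    return np.argmax(Xb_new @ W, axis=1)

print((predict(X) == y).mean())  # training accuracy of this linear classifier
```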
© 2006 KDnuggets 25
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: points of two classes in the X-Y plane, partitioned by the splits X=2, X=5, and Y=3]
© 2006 KDnuggets 26
DECISION TREE
 An internal node is a test on an attribute.
 A branch represents an outcome of the test, e.g.,
Color=red.
 A leaf node represents a class label or class label
distribution.
 At each node, one attribute is chosen to split
training examples into distinct classes as much as
possible
 A new instance is classified by following a
matching path to a leaf node.
© 2006 KDnuggets 27
Weather Data: Play or not Play?
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
Note:
Outlook is the
Forecast,
no relation to
Microsoft
email program
© 2006 KDnuggets 28
Example Tree for “Play?”
Outlook = sunny → Humidity: high → No, normal → Yes
Outlook = overcast → Yes
Outlook = rain → Windy: true → No, false → Yes
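For readers who want to reproduce such a tree, here is a sketch using scikit-learn (one tool among many; the slides do not prescribe any). It one-hot encodes the nominal attributes of the weather data; Temperature is omitted because the tree above does not use it, and the learned tree may differ in detail from the drawn one:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14 instances of the weather data from the previous slide.
data = pd.DataFrame({
    "Outlook":  ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
                 "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"],
    "Humidity": ["high", "high", "high", "high", "normal", "normal", "normal",
                 "high", "normal", "normal", "normal", "high", "normal", "high"],
    "Windy":    [False, True, False, False, False, True, True,
                 False, False, False, True, True, False, True],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the nominal attributes and fit a small tree.
X = pd.get_dummies(data[["Outlook", "Humidity", "Windy"]])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, data["Play"])
print(export_text(tree, feature_names=list(X.columns)))
```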
© 2006 KDnuggets 29
Classification: Neural Nets
 Can select more
complex regions
 Can be more accurate
 Also can overfit the
data – find patterns in
random noise
© 2006 KDnuggets 30
Classification: other approaches
 Naïve Bayes
 Rules
 Support Vector Machines
 Genetic Algorithms
 …
See www.KDnuggets.com/software/
© 2006 KDnuggets
Evaluation
© 2006 KDnuggets 32
Evaluating which method works the
best for classification
 No model is uniformly the best
 Dimensions for Comparison
 speed of training
 speed of model application
 noise tolerance
 explanation ability
 Best Results: Hybrid, Integrated models
© 2006 KDnuggets 33
Comparison of Major
Classification Approaches
Method            Train time   Run time   Noise tolerance   Can use prior knowledge   Accuracy on customer modelling   Understandable
Decision Trees    fast         fast       poor              no                        medium                           medium
Rules             med          fast       poor              no                        medium                           good
Neural Networks   slow         fast       good              no                        good                             poor
Bayesian          slow         fast       good              yes                       good                             good
A hybrid method will have higher accuracy
© 2006 KDnuggets 34
Evaluation of Classification Models
 How predictive is the model we learned?
 Error on the training data is not a good indicator
of performance on future data
 The new data will probably not be exactly the same as
the training data!
 Overfitting – fitting the training data too precisely
- usually leads to poor results on new data
© 2006 KDnuggets 35
Evaluation issues
 Possible evaluation measures:
 Classification Accuracy
 Total cost/benefit – when different errors involve
different costs
 Lift and ROC curves
 Error in numeric predictions
 How reliable are the predicted results?
© 2006 KDnuggets 36
Classifier error rate
 Natural performance measure for classification
problems: error rate
 Success: instance’s class is predicted correctly
 Error: instance’s class is predicted incorrectly
 Error rate: proportion of errors made over the whole set
of instances
 Training set error rate is way too optimistic!
 you can find patterns even in random data
© 2006 KDnuggets 37
Evaluation on “LARGE” data
If many (>1000) examples are available,
including >100 examples from each class
 A simple evaluation will give useful results
 Randomly split data into training and test sets (usually
2/3 for train, 1/3 for test)
 Build a classifier using the train set and evaluate
it using the test set
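A sketch of this simple evaluation using scikit-learn (assumed available), with synthetic data standing in for a large dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a "large" labeled dataset.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

# Random split: 2/3 train, 1/3 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # optimistic (often 1.0)
print("test accuracy: ", clf.score(X_test, y_test))    # realistic estimate
```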
© 2006 KDnuggets 38
Classification Step 1:
Split data into train and test sets
[Diagram: historical data (“THE PAST”) with known results (+/−) is split into a training set and a testing set]
© 2006 KDnuggets 39
Classification Step 2:
Build a model on a training set
[Diagram: the training set, with known results (+/−), feeds the model builder; the testing set is held aside]
© 2006 KDnuggets 40
Classification Step 3:
Evaluate on test set (Re-train?)
[Diagram: the model built on the training set makes predictions (Y/N) on the testing set; the predicted +/− labels are evaluated against the known results]
© 2006 KDnuggets 41
Unbalanced data
 Sometimes, classes have very unequal frequency
 Attrition prediction: 97% stay, 3% attrite (in a month)
 medical diagnosis: 90% healthy, 10% disease
 eCommerce: 99% don’t buy, 1% buy
 Security: >99.99% of Americans are not terrorists
 Similar situation with multiple classes
 Majority class classifier can be 97% correct, but
useless
© 2006 KDnuggets 42
Handling unbalanced data –
how?
If we have two classes that are very
unbalanced, then how can we evaluate our
classifier method?
© 2006 KDnuggets 43
Balancing unbalanced data, 1
 With two classes, a good approach is to build
BALANCED train and test sets, and train model
on a balanced set
 randomly select desired number of minority class
instances
 add an equal number of randomly selected majority class instances
 How do we generalize “balancing” to multiple
classes?
© 2006 KDnuggets 44
Balancing unbalanced data, 2
Generalize “balancing” to multiple classes
 Ensure that each class is represented with
approximately equal proportions in train and test
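A minimal numpy sketch of this balancing recipe (an illustration, not the slides' code): sample an equal number of instances from every class, i.e. downsample the larger classes to the size of the smallest:

```python
import numpy as np

def balance(X, y, seed=0):
    """Return a balanced dataset: an equal number of randomly chosen
    instances from every class (downsampling the larger classes)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()                     # size of the smallest class
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Applying balance() separately to the train and test portions yields the balanced sets described above.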
© 2006 KDnuggets 45
A note on parameter tuning
 It is important that the test data is not used in any way to
create the classifier
 Some learning schemes operate in two stages:
 Stage 1: builds the basic structure
 Stage 2: optimizes parameter settings
 The test data can’t be used for parameter tuning!
 Proper procedure uses three sets: training data,
validation data, and test data
 Validation data is used to optimize parameters
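A sketch of the three-set procedure with scikit-learn (assumed; the tuned parameter, tree depth, is just an example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=1)

# Three-way split, e.g. 60% train / 20% validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune the parameter on the VALIDATION set only ...
best_depth = max(range(1, 11), key=lambda d: DecisionTreeClassifier(
    max_depth=d, random_state=0).fit(X_train, y_train).score(X_val, y_val))

# ... then measure the chosen model exactly once on the TEST set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("chosen depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```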
© 2006 KDnuggets 46
Making the most of the data
 Once evaluation is complete, all the data can be
used to build the final classifier
 Generally, the larger the training data the better
the classifier (but returns diminish)
 The larger the test data the more accurate the
error estimate
© 2006 KDnuggets 47
Classification:
Train, Validation, Test split
Data
Predictions
Y N
Results Known
Training set
Validation set
+
+
-
-
+
Model Builder
Evaluate
+
-
+
-
Final Model
Final Test Set
+
-
+
-
Final Evaluation
Model
Builder
© 2006 KDnuggets 48
Cross-validation
 Cross-validation avoids overlapping test sets
 First step: data is split into k subsets of equal size
 Second step: each subset in turn is used for testing and
the remainder for training
 This is called k-fold cross-validation
 Often the subsets are stratified before the cross-
validation is performed
 The error estimates are averaged to yield an
overall error estimate
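A sketch of stratified k-fold cross-validation using scikit-learn (assumed available), on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified k-fold: each fold keeps the overall class proportions; every
# fold serves once as the test set while the rest trains the model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())   # averaged accuracy estimate and its spread
```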
© 2006 KDnuggets 49
Cross-validation example:
1) Break up data into groups of the same size
2) Hold aside one group for testing and use the rest to build the model
3) Repeat, holding out each group in turn
[Figure: the held-out group in each round is marked “Test”]
© 2006 KDnuggets 50
More on cross-validation
 Standard method for evaluation: stratified ten-
fold cross-validation
 Why ten? Extensive experiments have shown that
this is the best choice to get an accurate estimate
 Stratification reduces the estimate’s variance
 Even better: repeated stratified cross-validation
 E.g. ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance)
© 2006 KDnuggets 51
Direct Marketing Paradigm
 Find most likely prospects to contact
 Not everybody needs to be contacted
 Number of targets is usually much smaller than number of
prospects
 Typical Applications
 retailers, catalogues, direct mail (and e-mail)
 customer acquisition, cross-sell, attrition prediction
 ...
© 2006 KDnuggets 52
Direct Marketing Evaluation
 Accuracy on the entire dataset is not the
right measure
 Approach
 develop a target model
 score all prospects and rank them by decreasing score
 select top P% of prospects for action
 How do we decide what is the best subset of prospects?
© 2006 KDnuggets 53
Model-Sorted List
No Score Target CustID Age
1 0.97 Y 1746 …
2 0.95 N 1024 …
3 0.94 Y 2478 …
4 0.93 Y 3820 …
5 0.92 N 4897 …
… … … …
99 0.11 N 2734 …
100 0.06 N 2422
Use a model to assign score to each customer
Sort customers by decreasing score
Expect more targets (hits) near the top of the list
3 hits in the top 5% of the list.
If there are 15 targets overall,
then the top 5% has 3/15 = 20%
of the targets.
© 2006 KDnuggets 54
CPH (Cumulative Pct Hits)
Definition: CPH(P,M) = % of all targets in the first P% of the list scored by model M.
CPH is frequently called Gains.
[Chart: Cumulative % Hits vs. % of list; the random baseline is the diagonal: 5% of a random list has 5% of the targets]
© 2006 KDnuggets 55
CPH: Random List vs
Model-ranked list
5% of a random list has 5% of the targets,
but 5% of the model-ranked list has 21% of the targets: CPH(5%, model) = 21%.
[Chart: Cumulative % Hits vs. % of list, random baseline vs. model-ranked list]
© 2006 KDnuggets 56
Lift
Lift(P,M) = CPH(P,M) / P, where P is the percent of the list.
Lift (at 5%) = 21% / 5% = 4.2 times better than random.
[Chart: Lift vs. % of list]
Note: some authors use “Lift” for what we call CPH.
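Both measures are straightforward to compute from model scores; a minimal numpy sketch on illustrative data (the correlation between scores and targets is fabricated for the example):

```python
import numpy as np

def cph(scores, targets, p):
    """CPH(P, M): percent of all targets found in the top P% of the list
    ranked by decreasing model score."""
    order = np.argsort(scores)[::-1]                    # sort by decreasing score
    top = order[: max(1, round(len(scores) * p / 100))]
    return 100.0 * targets[top].sum() / targets.sum()

def lift(scores, targets, p):
    return cph(scores, targets, p) / p

# Toy list: 1000 prospects, ~5% targets, scores mildly correlated with targets.
rng = np.random.default_rng(0)
targets = (rng.random(1000) < 0.05).astype(int)
scores = 0.3 * targets + rng.random(1000)
print(cph(scores, targets, 5), lift(scores, targets, 5))
```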
© 2006 KDnuggets 57
Lift – a measure of model quality
 Lift helps us decide which models are better
 If cost/benefit values are not available or
changing, we can use Lift to select a better
model.
 Model with the higher Lift curve will generally be
better
© 2006 KDnuggets
Clustering
© 2006 KDnuggets 59
Clustering
Unsupervised learning:
Finds “natural” grouping of
instances given un-labeled data
© 2006 KDnuggets 60
Clustering Methods
 Many different methods and algorithms:
 For numeric and/or symbolic data
 Deterministic vs. probabilistic
 Exclusive vs. overlapping
 Hierarchical vs. flat
 Top-down vs. bottom-up
© 2006 KDnuggets 61
Clustering Evaluation
 Manual inspection
 Benchmarking on existing labels
 Cluster quality measures
 distance measures
 high similarity within a cluster, low across clusters
© 2006 KDnuggets 62
The distance function
 Simplest case: one numeric attribute A
 Distance(X,Y) = |A(X) – A(Y)|
 Several numeric attributes:
 Distance(X,Y) = Euclidean distance between X,Y
 Nominal attributes: distance is set to 1 if values
are different, 0 if they are equal
 Are all attributes equally important?
 Weighting the attributes might be necessary
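A sketch of such a mixed-attribute distance (an illustration; in practice numeric attributes would also be normalized so no single scale dominates):

```python
import numpy as np

def distance(x, y, numeric, weights=None):
    """Euclidean-style distance over mixed attributes: squared difference
    for numeric attributes, 0/1 mismatch for nominal ones, with optional
    per-attribute weights."""
    d = np.array([(a - b) ** 2 if is_num else float(a != b)
                  for a, b, is_num in zip(x, y, numeric)])
    if weights is not None:
        d = d * np.asarray(weights)
    return np.sqrt(d.sum())

# Two "days": one numeric attribute (temperature) and two nominal ones.
x = (75, "sunny", "high")
y = (68, "rain", "high")
print(distance(x, y, numeric=[True, False, False]))  # sqrt(49 + 1 + 0) ~ 7.07
```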
© 2006 KDnuggets 63
Simple Clustering: K-means
Works with numeric data only
1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster center
(e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2,3 until convergence (change in
cluster assignments less than a threshold)
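A self-contained numpy sketch of these four steps (random data-point initialization is one common choice; production code would normally use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1) pick K cluster centers at random (here: K random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    while True:
        # 2) assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 4) stop when cluster assignments no longer change
        if assign is not None and np.array_equal(new_assign, assign):
            return centers, assign
        assign = new_assign
        # 3) move each center to the mean of its assigned items
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)

# Three blobs in the X-Y plane, as in the example on the following slides.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in ((0, 0), (4, 0), (2, 4))])
centers, labels = kmeans(X, k=3)
print(centers)
```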
© 2006 KDnuggets 64
K-means example, step 1
Pick 3 initial cluster centers c1, c2, c3 at random.
[Figure: points in the X-Y plane with the three random centers marked]
© 2006 KDnuggets 65
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: points colored by their nearest center]
© 2006 KDnuggets 66
K-means example, step 3
Move each cluster center to the mean of its cluster.
[Figure: old and new positions of c1, c2, c3]
© 2006 KDnuggets 67
K-means example, step 4a
Reassign points now closest to a different new cluster center.
Q: Which points are reassigned?
[Figure: points near the moved centers]
© 2006 KDnuggets 68
K-means example, step 4b
Reassign points closest to a different new cluster center (animation continued).
[Figure: second frame of the same question]
© 2006 KDnuggets 69
K-means example, step 4c
A: three points are reassigned (shown animated, numbered 1–3).
[Figure: the three reassigned points]
© 2006 KDnuggets 70
K-means example, step 4d
Re-compute the cluster means.
[Figure: new means of the updated clusters]
© 2006 KDnuggets 71
K-means example, step 5
Move the cluster centers to the cluster means.
[Figure: final positions of c1, c2, c3]
© 2006 KDnuggets
Data Mining Applications
© 2006 KDnuggets 73
Problems Suitable for Data-Mining
 require knowledge-based decisions
 have a changing environment
 have sub-optimal current methods
 have accessible, sufficient, and relevant data
 provide high payoff for the right decisions!
© 2006 KDnuggets 74
Major Application Areas for
Data Mining Solutions
 Advertising
 Bioinformatics
 Customer Relationship Management (CRM)
 Database Marketing
 Fraud Detection
 eCommerce
 Health Care
 Investment/Securities
 Manufacturing, Process Control
 Sports and Entertainment
 Telecommunications
 Web
© 2006 KDnuggets 75
Application: Search Engines
 Before Google, web search engines used mainly
keywords on a page – results were easily subject
to manipulation
 Google's early success was partly due to its
algorithm which uses mainly links to the page
 Google founders Sergey Brin and Larry Page were
students at Stanford in the 1990s
 Their research in databases and data mining led
to Google
© 2006 KDnuggets 76
Microarrays: Classifying Leukemia
 Leukemia: Acute Lymphoblastic (ALL) vs Acute
Myeloid (AML), Golub et al, Science, v.286, 1999
 72 examples (38 train, 34 test), about 7,000 genes
[Microarray images of ALL and AML samples]
Visually similar, but genetically very different
Best Model: 97% accuracy,
1 error (sample suspected mislabelled)
© 2006 KDnuggets 77
Microarray Potential Applications
 New and better molecular diagnostics
 Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip, based on
Affymetrix technology
 New molecular targets for therapy
 few new drugs, large pipeline, …
 Improved treatment outcome
 Partially depends on genetic signature
 Fundamental Biological Discovery
 finding and refining biological pathways
 Personalized medicine ?!
© 2006 KDnuggets 78
Application:
Direct Marketing and CRM
 Most major direct marketing companies are using
modeling and data mining
 Most financial companies are using customer
modeling
 Modeling is easier than changing customer
behaviour
 Example
 Verizon Wireless reduced customer attrition rate from
2% to 1.5%, saving many millions of $
© 2006 KDnuggets 79
Application: e-Commerce
 Amazon.com recommendations
 if you bought (viewed) X, you are likely to buy Y
 Netflix
 If you liked "Monty Python and the Holy Grail",
you get a recommendation for "This is Spinal Tap"
 Comparison shopping
 Froogle, mySimon, Yahoo Shopping, …
© 2006 KDnuggets 80
Application:
Security and Fraud Detection
 Credit Card Fraud Detection
 over 20 Million credit cards protected by
Neural networks (Fair, Isaac)
 Securities Fraud Detection
 NASDAQ KDD system
 Phone fraud detection
 AT&T, Bell Atlantic, British Telecom/MCI
© 2006 KDnuggets 81
Data Mining, Privacy, and Security
 TIA: Terrorism (formerly Total) Information
Awareness Program –
 TIA program closed by Congress in 2003 because of
privacy concerns
 However, in 2006 we learn that NSA is analyzing
US domestic call info to find potential terrorists
 Invasion of Privacy or Needed Intelligence?
© 2006 KDnuggets 82
Criticism of Analytic Approaches
to Threat Detection:
Data Mining will
 be ineffective - generate millions of false positives
 and invade privacy
First, can data mining be effective?
© 2006 KDnuggets 83
Can Data Mining and Statistics
be Effective for Threat Detection?
 Criticism: Databases have 5% errors, so analyzing
100 million suspects will generate 5 million false
positives
 Reality: Analytical models correlate many items of
information to reduce false positives.
 Example: Identify one biased coin from 1,000.
 After one throw of each coin, we cannot tell.
 After 30 throws, the biased coin will stand out with
high probability.
 With a sufficient number of throws, we can identify
19 biased coins out of 100 million.
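The coin argument can be made concrete with exact binomial probabilities; in this sketch the bias (80% heads) and the decision threshold (at least 20 heads in 30 throws) are illustrative assumptions, not numbers from the slide:

```python
from math import comb

def p_at_least(k, n, p):
    """P(at least k heads in n throws of a coin with heads probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, threshold = 30, 20          # flag any coin showing >= 20 heads in 30 throws
print(p_at_least(threshold, n, 0.5))  # ~0.049: a fair coin is flagged ~5% of the time
print(p_at_least(threshold, n, 0.8))  # ~0.98: the biased coin is caught ~98% of the time
```

With 30 throws a fair coin still crosses the threshold about 5% of the time, far too often at the scale of 100 million coins, but that false-positive rate shrinks exponentially as more throws (more correlated items of evidence) are added, which is the slide's point.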
© 2006 KDnuggets 84
Another Approach: Link Analysis
Can find unusual patterns in the network structure
© 2006 KDnuggets 85
Analytic technology can be effective
 Data Mining is just one additional tool to help
analysts
 Combining multiple models and link analysis can
reduce false positives
 Today there are millions of false positives with
manual analysis
 Analytic technology has the potential to reduce
the current high rate of false positives
© 2006 KDnuggets 86
Data Mining with Privacy
 Data Mining looks for patterns, not people!
 Technical solutions can limit privacy invasion
 Replacing sensitive personal data with anon. ID
 Give randomized outputs
 Multi-party computation – distributed data
 …
 Bayardo & Srikant, Technological Solutions for
Protecting Privacy, IEEE Computer, Sep 2003
© 2006 KDnuggets 87
The Hype Curve for Data Mining and Knowledge Discovery
[Chart: expectations vs. performance, 1990–2005: rising expectations (1990s), over-inflated expectations (~1998), disappointment (~2000), growing acceptance and mainstreaming (2002–2005)]
© 2006 KDnuggets 88
Summary
 Data Mining and Knowledge Discovery are needed
to deal with the flood of data
 Knowledge Discovery is a process !
 Avoid overfitting (finding random patterns by
searching too many possibilities)
© 2006 KDnuggets 89
Additional Resources
www.KDnuggets.com
data mining software, jobs, courses, etc
www.acm.org/sigkdd
ACM SIGKDD – the professional society for
data mining

More Related Content

What's hot

What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...Edureka!
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analyticsSSaudia
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Big data Presentation
Big data PresentationBig data Presentation
Big data PresentationAswadmehar
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfacesVaibhav Khanna
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementationSandip Tipayle Patil
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning ProjectEng Teong Cheah
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 

What's hot (20)

What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
What is Data Science
What is Data ScienceWhat is Data Science
What is Data Science
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
Python Machine Learning Tutorial | Machine Learning Algorithms | Python Train...
 
Machine learning
Machine learningMachine learning
Machine learning
 
Content based filtering
Content based filteringContent based filtering
Content based filtering
 
Introduction to data analytics
Introduction to data analyticsIntroduction to data analytics
Introduction to data analytics
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Big data
Big dataBig data
Big data
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Big data Presentation
Big data PresentationBig data Presentation
Big data Presentation
 
Information retrieval 3 query search interfaces
Information retrieval 3 query search interfacesInformation retrieval 3 query search interfaces
Information retrieval 3 query search interfaces
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
Data science
Data scienceData science
Data science
 
Data mining with big data implementation
Data mining with big data implementationData mining with big data implementation
Data mining with big data implementation
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
Data science workshop
Data science workshopData science workshop
Data science workshop
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 

Similar to data-mining-tutorial.ppt

Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG DataPrasant Misra
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchjim
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Researchbutest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Researchkevinlan
 
Bt9001, data mining
Bt9001, data miningBt9001, data mining
Bt9001, data miningsmumbahelp
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.docbutest
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.docbutest
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introductionBasma Gamal
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...IOSRjournaljce
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Anonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloudAnonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloudeSAT Journals
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3asad199
 

Similar to data-mining-tutorial.ppt (20)

Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Data mining intro-2009-v2
Data mining intro-2009-v2Data mining intro-2009-v2
Data mining intro-2009-v2
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Bt9001, data mining
Bt9001, data miningBt9001, data mining
Bt9001, data mining
 
presentationIDC - 14MAY2015
presentationIDC - 14MAY2015presentationIDC - 14MAY2015
presentationIDC - 14MAY2015
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.doc
 
EDRG12_Re.doc
EDRG12_Re.docEDRG12_Re.doc
EDRG12_Re.doc
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Anonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloudAnonymization of data using mapreduce on cloud
Anonymization of data using mapreduce on cloud
 
My3prep
My3prepMy3prep
My3prep
 
DM Lecture 3
DM Lecture 3DM Lecture 3
DM Lecture 3
 

More from PerumalPitchandi

Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System IntroductionLecture Notes on Recommender System Introduction
Lecture Notes on Recommender System IntroductionPerumalPitchandi
 
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptxPerumalPitchandi
 
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.pptDescriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.pptPerumalPitchandi
 
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.pptPerumalPitchandi
 
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptxPerumalPitchandi
 
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptxCapability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptxPerumalPitchandi
 
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptxComparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptxPerumalPitchandi
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxPerumalPitchandi
 
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.pptFDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.pptPerumalPitchandi
 
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptFDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptPerumalPitchandi
 
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.pptAgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.pptPerumalPitchandi
 
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptxAgile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptxPerumalPitchandi
 

More from PerumalPitchandi (20)

Lecture Notes on Recommender System Introduction
Lecture Notes on Recommender System IntroductionLecture Notes on Recommender System Introduction
Lecture Notes on Recommender System Introduction
 
22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx22ADE002 – Business Analytics- Module 1.pptx
22ADE002 – Business Analytics- Module 1.pptx
 
biv_mult.ppt
biv_mult.pptbiv_mult.ppt
biv_mult.ppt
 
ppt_ids-data science.pdf
ppt_ids-data science.pdfppt_ids-data science.pdf
ppt_ids-data science.pdf
 
ANOVA Presentation.ppt
ANOVA Presentation.pptANOVA Presentation.ppt
ANOVA Presentation.ppt
 
Data Science Intro.pptx
Data Science Intro.pptxData Science Intro.pptx
Data Science Intro.pptx
 
Descriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.pptDescriptive_Statistics_PPT.ppt
Descriptive_Statistics_PPT.ppt
 
SW_Cost_Estimation.ppt
SW_Cost_Estimation.pptSW_Cost_Estimation.ppt
SW_Cost_Estimation.ppt
 
CostEstimation-1.ppt
CostEstimation-1.pptCostEstimation-1.ppt
CostEstimation-1.ppt
 
20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt20IT204-COA-Lecture 18.ppt
20IT204-COA-Lecture 18.ppt
 
20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx20IT204-COA- Lecture 17.pptx
20IT204-COA- Lecture 17.pptx
 
Capability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptxCapability Maturity Model (CMM).pptx
Capability Maturity Model (CMM).pptx
 
Comparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptxComparison_between_Waterfall_and_Agile_m (1).pptx
Comparison_between_Waterfall_and_Agile_m (1).pptx
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
FDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.pptFDS Module I 24.1.2022.ppt
FDS Module I 24.1.2022.ppt
 
FDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.pptFDS Module I 20.1.2022.ppt
FDS Module I 20.1.2022.ppt
 
AgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.pptAgileSoftwareDevelopment.ppt
AgileSoftwareDevelopment.ppt
 
Agile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptxAgile and its impact to Project Management 022218.pptx
Agile and its impact to Project Management 022218.pptx
 
Data_exploration.ppt
Data_exploration.pptData_exploration.ppt
Data_exploration.ppt
 
state-spaces29Sep06.ppt
state-spaces29Sep06.pptstate-spaces29Sep06.ppt
state-spaces29Sep06.ppt
 

Recently uploaded

Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 

Recently uploaded (20)

Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 

data-mining-tutorial.ppt

  • 1. © 2006 KDnuggets Data Mining Tutorial Gregory Piatetsky-Shapiro KDnuggets
  • 2. © 2006 KDnuggets 2 Outline Introduction Data Mining Tasks Classification & Evaluation Clustering Application Examples
  • 3. © 2006 KDnuggets 3 Trends leading to Data Flood  More data is generated:  Web, text, images …  Business transactions, calls, ...  Scientific data: astronomy, biology, etc  More data is captured:  Storage technology faster and cheaper  DBMS can handle bigger DB
  • 4. © 2006 KDnuggets 4 Largest Databases in 2005 Winter Corp. 2005 Commercial Database Survey: 1. Max Planck Inst. for Meteorology , 222 TB 2. Yahoo ~ 100 TB (Largest Data Warehouse) 3. AT&T ~ 94 TB www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
  • 5. © 2006 KDnuggets 5 Data Growth In 2 years (2003 to 2005), the size of the largest database TRIPLED!
  • 6. © 2006 KDnuggets 6 Data Growth Rate  Twice as much information was created in 2002 as in 1999 (~30% growth rate)  Other growth rate estimates even higher  Very little data will ever be looked at by a human Knowledge Discovery is NEEDED to make sense and use of data.
  • 7. © 2006 KDnuggets 7 Knowledge Discovery Definition Knowledge Discovery in Data is the non-trivial process of identifying  valid  novel  potentially useful  and ultimately understandable patterns in data. from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
  • 8. © 2006 KDnuggets 8 Related Fields Statistics Machine Learning Databases Visualization Data Mining and Knowledge Discovery
  • 9. © 2006 KDnuggets 9 Statistics, Machine Learning and Data Mining  Statistics:  more theory-based  more focused on testing hypotheses  Machine learning  more heuristic  focused on improving performance of a learning agent  also looks at real-time learning and robotics – areas not part of data mining  Data Mining and Knowledge Discovery  integrates theory and heuristics  focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results  Distinctions are fuzzy
  • 10. © 2006 KDnuggets 10 Knowledge Discovery Process flow, according to CRISP-DM Monitoring see www.crisp-dm.org for more information Continuous monitoring and improvement is an addition to CRISP
  • 11. © 2006 KDnuggets 11 Historical Note: Many Names of Data Mining  Data Fishing, Data Dredging: 1960-  used by statisticians (as bad name)  Data Mining :1990 --  used in DB community, business  Knowledge Discovery in Databases (1989-)  used by AI, Machine Learning Community  also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ... Currently: Data Mining and Knowledge Discovery are used interchangeably
  • 12. © 2006 KDnuggets Data Mining Tasks
  • 13. © 2006 KDnuggets 13 Some Definitions  Instance (also Item or Record):  an example, described by a number of attributes,  e.g. a day can be described by temperature, humidity and cloud status  Attribute or Field  measuring aspects of the Instance, e.g. temperature  Class (Label)  grouping of instances, e.g. days good for playing
  • 14. © 2006 KDnuggets 14 Major Data Mining Tasks Classification: predicting an item class Clustering: finding clusters in data Associations: e.g. A & B & C occur frequently Visualization: to facilitate human discovery Summarization: describing a group  Deviation Detection: finding changes  Estimation: predicting a continuous value  Link Analysis: finding relationships  …
  • 15. © 2006 KDnuggets 15 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
  • 16. © 2006 KDnuggets 16 Clustering Find “natural” grouping of instances given un-labeled data
  • 17. © 2006 KDnuggets 17 Association Rules & Frequent Itemsets TID Produce 1 MILK, BREAD, EGGS 2 BREAD, SUGAR 3 BREAD, CEREAL 4 MILK, BREAD, SUGAR 5 MILK, CEREAL 6 BREAD, CEREAL 7 MILK, CEREAL 8 MILK, BREAD, CEREAL, EGGS 9 MILK, BREAD, CEREAL Transactions Frequent Itemsets: Milk, Bread (4) Bread, Cereal (3) Milk, Bread, Cereal (2) … Rules: Milk => Bread (66%)
  • 18. © 2006 KDnuggets 18 Visualization & Data Mining  Visualizing the data to facilitate human discovery  Presenting the discovered results in a visually "nice" way
  • 19. © 2006 KDnuggets 19 Summarization  Describe features of the selected group  Use natural language and graphics  Usually in Combination with Deviation detection or other methods Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
  • 20. © 2006 KDnuggets 20 Data Mining Central Quest Find true patterns and avoid overfitting (finding seemingly signifcant but really random patterns due to searching too many possibilites)
  • 22. © 2006 KDnuggets 22 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ... Given a set of points from classes what is the class of new point ?
  • 23. © 2006 KDnuggets 23 Classification: Linear Regression  Linear Regression w0 + w1 x + w2 y >= 0  Regression computes wi from data to minimize squared error to ‘fit’ the data  Not flexible enough
  • 24. © 2006 KDnuggets 24 Regression for Classification  Any regression technique can be used for classification  Training: perform a regression for each class, setting the output to 1 for training instances that belong to class, and 0 for those that don’t  Prediction: predict class corresponding to model with largest output value (membership value)  For linear regression this is known as multi-response linear regression
  • 25. © 2006 KDnuggets 25 Classification: Decision Trees X Y if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue 5 2 3
  • 26. © 2006 KDnuggets 26 DECISION TREE  An internal node is a test on an attribute.  A branch represents an outcome of the test, e.g., Color=red.  A leaf node represents a class label or class label distribution.  At each node, one attribute is chosen to split training examples into distinct classes as much as possible  A new instance is classified by following a matching path to a leaf node.
  • 27. © 2006 KDnuggets 27 Weather Data: Play or not Play? Outlook Temperature Humidity Windy Play? sunny hot high false No sunny hot high true No overcast hot high false Yes rain mild high false Yes rain cool normal false Yes rain cool normal true No overcast cool normal true Yes sunny mild high false No sunny cool normal false Yes rain mild normal false Yes sunny mild normal true Yes overcast mild high true Yes overcast hot normal false Yes rain mild high true No Note: Outlook is the Forecast, no relation to Microsoft email program
  • 28. © 2006 KDnuggets 28 overcast high normal false true sunny rain No No Yes Yes Yes Example Tree for “Play?” Outlook Humidity Windy
  • 29. © 2006 KDnuggets 29 Classification: Neural Nets  Can select more complex regions  Can be more accurate  Also can overfit the data – find patterns in random noise
  • 30. © 2006 KDnuggets 30 Classification: other approaches  Naïve Bayes  Rules  Support Vector Machines  Genetic Algorithms  … See www.KDnuggets.com/software/
  • 32. © 2006 KDnuggets 32 Evaluating which method works the best for classification  No model is uniformly the best  Dimensions for Comparison  speed of training  speed of model application  noise tolerance  explanation ability  Best Results: Hybrid, Integrated models
  • 33. © 2006 KDnuggets 33 Comparison of Major Classification Approaches Train time Run Time Noise Toler ance Can Use Prior Know- ledge Accuracy on Customer Modelling Under- standable Decision Trees fast fast poor no medium medium Rules med fast poor no medium good Neural Networks slow fast good no good poor Bayesian slow fast good yes good good A hybrid method will have higher accuracy
  • 34. © 2006 KDnuggets 34 Evaluation of Classification Models  How predictive is the model we learned?  Error on the training data is not a good indicator of performance on future data  The new data will probably not be exactly the same as the training data!  Overfitting – fitting the training data too precisely - usually leads to poor results on new data
  • 35. © 2006 KDnuggets 35 Evaluation issues  Possible evaluation measures:  Classification Accuracy  Total cost/benefit – when different errors involve different costs  Lift and ROC curves  Error in numeric predictions  How reliable are the predicted results ?
  • 36. © 2006 KDnuggets 36 Classifier error rate  Natural performance measure for classification problems: error rate  Success: instance’s class is predicted correctly  Error: instance’s class is predicted incorrectly  Error rate: proportion of errors made over the whole set of instances  Training set error rate: is way too optimistic!  you can find patterns even in random data
  • 37. © 2006 KDnuggets 37 Evaluation on “LARGE” data If many (>1000) examples are available, including >100 examples from each class  A simple evaluation will give useful results  Randomly split data into training and test sets (usually 2/3 for train, 1/3 for test)  Build a classifier using the train set and evaluate it using the test set
  • 38. © 2006 KDnuggets 38 Classification Step 1: Split data into train and test sets Results Known + + - - + THE PAST Data Training set Testing set
  • 39. © 2006 KDnuggets 39 Classification Step 2: Build a model on a training set Training set Results Known + + - - + THE PAST Data Model Builder Testing set
  • 40. © 2006 KDnuggets 40 Classification Step 3: Evaluate on test set (Re-train?) Data Predictions Y N Results Known Training set Testing set + + - - + Model Builder Evaluate + - + -
  • 41. © 2006 KDnuggets 41 Unbalanced data  Sometimes, classes have very unequal frequency  Attrition prediction: 97% stay, 3% attrite (in a month)  medical diagnosis: 90% healthy, 10% disease  eCommerce: 99% don’t buy, 1% buy  Security: >99.99% of Americans are not terrorists  Similar situation with multiple classes  Majority class classifier can be 97% correct, but useless
© 2006 KDnuggets 42
Handling unbalanced data – how?
If we have two classes that are very unbalanced, how can we evaluate our classifier method?
© 2006 KDnuggets 43
Balancing unbalanced data, 1
 With two classes, a good approach is to build BALANCED train and test sets, and train the model on a balanced set:
 randomly select the desired number of minority-class instances
 add an equal number of randomly selected majority-class instances
 How do we generalize "balancing" to multiple classes?
© 2006 KDnuggets 44
Balancing unbalanced data, 2
Generalizing "balancing" to multiple classes:
 Ensure that each class is represented with approximately equal proportions in train and test
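One way to realize this is random under-sampling. A sketch (the helper name `balanced_indices` is illustrative); note it handles any number of classes, not just two:

```python
# Random under-sampling: keep every class at the size of the smallest one.
import numpy as np

def balanced_indices(y, seed=0):
    """Return row indices with equal counts of each class label in y."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()                        # size of the smallest class
    picks = [rng.choice(np.flatnonzero(y == c), size=n, replace=False)
             for c in classes]
    return np.concatenate(picks)

idx = balanced_indices(y)                   # y from the earlier sketch
X_bal, y_bal = X[idx], y[idx]
```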
© 2006 KDnuggets 45
A note on parameter tuning
 It is important that the test data is not used in any way to create the classifier
 Some learning schemes operate in two stages:
 Stage 1: build the basic structure
 Stage 2: optimize parameter settings
 The test data can't be used for parameter tuning!
 The proper procedure uses three sets: training data, validation data, and test data
 Validation data is used to optimize parameters
© 2006 KDnuggets 46
Making the most of the data
 Once evaluation is complete, all the data can be used to build the final classifier
 Generally, the larger the training data, the better the classifier (but returns diminish)
 The larger the test data, the more accurate the error estimate
© 2006 KDnuggets 47
Classification: Train, Validation, Test split
[Diagram: a Model Builder is evaluated and tuned on a validation set; the final model is then evaluated once on a held-out final test set]
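A sketch of the three-way split, reusing the X and y arrays from the earlier sketch; the split proportions and the tuned parameter (tree depth) are illustrative:

```python
# Three-way split: tune on validation data, touch the test set only once.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 60% train, 20% validation, 20% final test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Stage 2: choose the tree depth that does best on the validation set.
best_model = max(
    (DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
     for d in (2, 4, 8, None)),
    key=lambda m: m.score(X_val, y_val))

# Final evaluation, done exactly once on the test set.
print("final test accuracy:", best_model.score(X_test, y_test))
```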
© 2006 KDnuggets 48
Cross-validation
 Cross-validation avoids overlapping test sets
 First step: data is split into k subsets of equal size
 Second step: each subset in turn is used for testing and the remainder for training
 This is called k-fold cross-validation
 Often the subsets are stratified before the cross-validation is performed
 The error estimates are averaged to yield an overall error estimate
© 2006 KDnuggets 49
Cross-validation example
 Break up the data into groups of the same size
 Hold aside one group for testing and use the rest to build the model
 Repeat, testing on each group in turn
© 2006 KDnuggets 50
More on cross-validation
 Standard method for evaluation: stratified ten-fold cross-validation
 Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
 Stratification reduces the estimate's variance
 Even better: repeated stratified cross-validation
 e.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
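Stratified ten-fold cross-validation takes a few lines, again assuming scikit-learn and the X, y arrays from the earlier sketches:

```python
# Stratified ten-fold cross-validation: each fold preserves the class
# proportions; the ten fold accuracies are averaged.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in folds.split(X, y):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```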
© 2006 KDnuggets 51
Direct Marketing Paradigm
 Find the most likely prospects to contact
 Not everybody needs to be contacted
 The number of targets is usually much smaller than the number of prospects
 Typical applications:
 retailers, catalogues, direct mail (and e-mail)
 customer acquisition, cross-sell, attrition prediction
 ...
© 2006 KDnuggets 52
Direct Marketing Evaluation
 Accuracy on the entire dataset is not the right measure
 Approach:
 develop a target model
 score all prospects and rank them by decreasing score
 select the top P% of prospects for action
 How do we decide what is the best subset of prospects?
© 2006 KDnuggets 53
Model-Sorted List
Use a model to assign a score to each customer, sort customers by decreasing score, and expect more targets (hits) near the top of the list.

No   Score  Target  CustID  Age
1    0.97   Y       1746    …
2    0.95   N       1024    …
3    0.94   Y       2478    …
4    0.93   Y       3820    …
5    0.92   N       4897    …
…    …      …       …       …
99   0.11   N       2734    …
100  0.06   N       2422    …

3 hits in the top 5% of the list. If there are 15 targets overall, then the top 5 contain 3/15 = 20% of the targets.
© 2006 KDnuggets 54
CPH (Cumulative Pct Hits)
[Chart: Cumulative % Hits vs. Pct of list. Random baseline: 5% of a random list has 5% of the targets]
Definition: CPH(P,M) = % of all targets in the first P% of the list scored by model M
CPH is frequently called Gains
© 2006 KDnuggets 55
CPH: Random List vs Model-ranked List
[Chart: Cumulative % Hits vs. Pct of list for a random list and a model-ranked list]
5% of a random list has 5% of the targets, but 5% of the model-ranked list has 21% of the targets: CPH(5%, model) = 21%
© 2006 KDnuggets 56
Lift
[Chart: Lift vs. Pct of list]
Lift(P,M) = CPH(P,M) / P, where P is the percent of the list
Lift (at 5%) = 21% / 5% = 4.2 times better than random
Note: some authors use "Lift" for what we call CPH
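CPH and Lift are straightforward to compute from model scores. A sketch, where the `scores` and 0/1 `targets` arrays are hypothetical inputs:

```python
# CPH (gains) and Lift from model scores.
import numpy as np

def cph(scores, targets, pct):
    """Percent of all targets found in the top pct% of the model-sorted list."""
    order = np.argsort(scores)[::-1]                  # sort by decreasing score
    top = order[: max(1, round(len(order) * pct / 100))]
    return 100.0 * targets[top].sum() / targets.sum()

def lift(scores, targets, pct):
    return cph(scores, targets, pct) / pct

# With the slide's numbers: CPH(5%, model) = 21, so lift = 21 / 5 = 4.2.
```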
© 2006 KDnuggets 57
Lift – a measure of model quality
 Lift helps us decide which models are better
 If cost/benefit values are not available or are changing, we can use Lift to select the better model
 The model with the higher Lift curve will generally be better
© 2006 KDnuggets 59
Clustering
Unsupervised learning: finds a "natural" grouping of instances given un-labeled data
© 2006 KDnuggets 60
Clustering Methods
 Many different methods and algorithms:
 for numeric and/or symbolic data
 deterministic vs. probabilistic
 exclusive vs. overlapping
 hierarchical vs. flat
 top-down vs. bottom-up
© 2006 KDnuggets 61
Clustering Evaluation
 Manual inspection
 Benchmarking on existing labels
 Cluster quality measures:
 distance measures
 high similarity within a cluster, low across clusters
© 2006 KDnuggets 62
The distance function
 Simplest case: one numeric attribute A
 Distance(X,Y) = |A(X) – A(Y)|
 Several numeric attributes:
 Distance(X,Y) = Euclidean distance between X and Y
 Nominal attributes: distance is 1 if values are different, 0 if they are equal
 Are all attributes equally important?
 Weighting the attributes might be necessary
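A sketch of such a mixed-attribute distance in Python; the function, its arguments, and the attribute names are illustrative, not from the slides:

```python
# Distance over mixed attributes: squared difference for numeric
# attributes, 0/1 mismatch for nominal ones, with optional weights.
import math

def distance(x, y, numeric, weights=None):
    """x, y: dicts of attribute -> value; numeric: set of numeric attribute names."""
    total = 0.0
    for a in x:
        w = 1.0 if weights is None else weights[a]
        if a in numeric:
            total += w * (x[a] - y[a]) ** 2              # Euclidean contribution
        else:
            total += w * (0.0 if x[a] == y[a] else 1.0)  # nominal: 0/1 mismatch
    return math.sqrt(total)

d = distance({"temperature": 75, "outlook": "sunny"},
             {"temperature": 68, "outlook": "rain"},
             numeric={"temperature"})
```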
© 2006 KDnuggets 63
Simple Clustering: K-means
Works with numeric data only (a code sketch follows the worked example below)
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)
© 2006 KDnuggets 64
K-means example, step 1
[Scatter plot of points in X-Y space] Pick 3 initial cluster centers c1, c2, c3 (randomly)
© 2006 KDnuggets 65
K-means example, step 2
[Scatter plot] Assign each point to the closest cluster center
© 2006 KDnuggets 66
K-means example, step 3
[Scatter plot] Move each cluster center to the mean of its cluster
© 2006 KDnuggets 67-68
K-means example, steps 4a-4b
[Scatter plot] Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned?
© 2006 KDnuggets 69
K-means example, step 4c
[Scatter plot] A: three points (marked 1, 2, 3 in the animation) are reassigned
© 2006 KDnuggets 70
K-means example, step 4d
[Scatter plot] Re-compute the cluster means
© 2006 KDnuggets 71
K-means example, step 5
[Scatter plot] Move the cluster centers to the new cluster means
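The code sketch promised above: a minimal NumPy implementation of steps 1-4 from slide 63 (for real work a library implementation would be preferable):

```python
# Minimal K-means following steps 1-4 on slide 63.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # step 1
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (keep the old center if a cluster happens to go empty).
        new_centers = np.array(
            [X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
             for j in range(k)])
        if np.allclose(new_centers, centers):                  # step 4: converged
            break
        centers = new_centers
    return centers, assign

# usage: centers, labels = kmeans(np.random.rand(100, 2), k=3)
```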
© 2006 KDnuggets
Data Mining Applications
© 2006 KDnuggets 73
Problems Suitable for Data Mining
Good data mining problems:
 require knowledge-based decisions
 have a changing environment
 have sub-optimal current methods
 have accessible, sufficient, and relevant data
 provide a high payoff for the right decisions!
© 2006 KDnuggets 74
Major Application Areas for Data Mining Solutions
 Advertising
 Bioinformatics
 Customer Relationship Management (CRM)
 Database Marketing
 Fraud Detection
 eCommerce
 Health Care
 Investment/Securities
 Manufacturing, Process Control
 Sports and Entertainment
 Telecommunications
 Web
© 2006 KDnuggets 75
Application: Search Engines
 Before Google, web search engines relied mainly on the keywords on a page, so results were easily manipulated
 Google's early success was partly due to its algorithm, which relies mainly on the links to a page
 Google founders Sergey Brin and Larry Page were students at Stanford in the 1990s
 Their research in databases and data mining led to Google
© 2006 KDnuggets 76
Microarrays: Classifying Leukemia
 Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
 72 examples (38 train, 34 test), about 7,000 genes
 [Microarray images of ALL and AML samples] Visually similar, but genetically very different
 Best model: 97% accuracy, 1 error (the misclassified sample was suspected to be mislabelled)
© 2006 KDnuggets 77
Microarray Potential Applications
 New and better molecular diagnostics
 Jan 11, 2005: FDA approved the Roche Diagnostics AmpliChip, based on Affymetrix technology
 New molecular targets for therapy
 few new drugs, large pipeline, …
 Improved treatment outcome
 partially depends on genetic signature
 Fundamental biological discovery
 finding and refining biological pathways
 Personalized medicine?!
© 2006 KDnuggets 78
Application: Direct Marketing and CRM
 Most major direct marketing companies are using modeling and data mining
 Most financial companies are using customer modeling
 Modeling is easier than changing customer behaviour
 Example: Verizon Wireless reduced its customer attrition rate from 2% to 1.5%, saving many millions of dollars
© 2006 KDnuggets 79
Application: e-Commerce
 Amazon.com recommendations
 if you bought (or viewed) X, you are likely to buy Y
 Netflix
 if you liked "Monty Python and the Holy Grail", you get a recommendation for "This is Spinal Tap"
 Comparison shopping
 Froogle, mySimon, Yahoo Shopping, …
© 2006 KDnuggets 80
Application: Security and Fraud Detection
 Credit card fraud detection
 over 20 million credit cards protected by neural networks (Fair, Isaac)
 Securities fraud detection
 NASDAQ KDD system
 Phone fraud detection
 AT&T, Bell Atlantic, British Telecom/MCI
© 2006 KDnuggets 81
Data Mining, Privacy, and Security
 TIA: Terrorism (formerly Total) Information Awareness Program
 the TIA program was closed by Congress in 2003 because of privacy concerns
 However, in 2006 we learned that the NSA was analyzing US domestic call information to find potential terrorists
 Invasion of privacy or needed intelligence?
© 2006 KDnuggets 82
Criticism of Analytic Approaches to Threat Detection
Critics argue that data mining will:
 be ineffective, generating millions of false positives
 invade privacy
First, can data mining be effective?
© 2006 KDnuggets 83
Can Data Mining and Statistics be Effective for Threat Detection?
 Criticism: databases have 5% errors, so analyzing 100 million suspects will generate 5 million false positives
 Reality: analytical models correlate many items of information to reduce false positives
 Example: identify one biased coin out of 1,000
 after one throw of each coin, we cannot tell
 after 30 throws, the biased coin will stand out with high probability
 with a sufficient number of throws, even 19 biased coins out of 100 million can be identified
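A quick simulation of the coin example; the bias value of 0.9 is this editor's assumption, since the slide does not specify how biased the coin is:

```python
# Simulate 1,000 fair coins plus one biased coin (assumed heads
# probability 0.9) and compare after 1 throw and after 30 throws.
import numpy as np

rng = np.random.default_rng(0)
n_fair, bias = 1_000, 0.9

for throws in (1, 30):
    fair_heads = rng.binomial(throws, 0.5, size=n_fair)   # heads per fair coin
    biased_heads = rng.binomial(throws, bias)             # heads, biased coin
    print(f"{throws:>2} throws: biased coin has {biased_heads} heads; "
          f"best fair coin has {fair_heads.max()}")
# After 1 throw nothing stands out; after 30 throws the biased coin
# typically beats every one of the 1,000 fair coins.
```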
© 2006 KDnuggets 84
Another Approach: Link Analysis
Link analysis can find unusual patterns in the network structure
© 2006 KDnuggets 85
Analytic technology can be effective
 Data mining is just one additional tool to help analysts
 Combining multiple models and link analysis can reduce false positives
 Today there are millions of false positives with manual analysis
 Analytic technology has the potential to reduce the current high rate of false positives
© 2006 KDnuggets 86
Data Mining with Privacy
 Data mining looks for patterns, not people!
 Technical solutions can limit privacy invasion:
 replacing sensitive personal data with an anonymous ID
 giving randomized outputs
 multi-party computation over distributed data
 …
 Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
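A toy sketch of the first technique, replacing a sensitive identifier with an anonymous ID via a salted hash; the field names and salt handling are illustrative only, not from the slides or the cited paper:

```python
# Pseudonymization sketch: store a salted hash instead of the raw ID.
import hashlib
import os

SALT = os.urandom(16)           # secret salt, kept separate from the data

def anon_id(identifier: str) -> str:
    """Map a sensitive identifier to a stable anonymous ID."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:12]

record = {"customer": anon_id("123-45-6789"), "purchases": 7}  # no raw ID stored
```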
© 2006 KDnuggets 87
The Hype Curve for Data Mining and Knowledge Discovery
[Chart: expectations vs. performance over 1990-2005: rising expectations, over-inflated expectations, disappointment, then growing acceptance and mainstreaming]
© 2006 KDnuggets 88
Summary
 Data Mining and Knowledge Discovery are needed to deal with the flood of data
 Knowledge Discovery is a process!
 Avoid overfitting (finding random patterns by searching too many possibilities)
© 2006 KDnuggets 89
Additional Resources
 www.KDnuggets.com: data mining software, jobs, courses, etc.
 www.acm.org/sigkdd: ACM SIGKDD, the professional society for data mining