1. Data Mining: An Introduction
Mrs. Dipali Meher
Modern College of Arts, Science and Commerce,
Ganeshkhind, Pune 411016
2. From Then till Now
Bayes' Theorem (1763)
Regression (1805)
Neural Networks (1943)
Turing (1963)
Evolutionary Computation (1965)
Databases (1970)
Genetic Algorithms (1975)
KDD (1989)
Support Vector Machine (1992)
Data Science (2001)
Moneyball (2003)
Big Data
4. Data mining deals with the discovery of
hidden knowledge, unexpected patterns,
and new rules from large data sets
5. Examples of information extracted using a query language
List customers who used a credit card to purchase
more than Rs. 10000 worth of groceries
List patients who have had at least one heart attack
List students who have had at least one backlog
List employees who have taken home loans
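A request of this kind is a well-formed query that returns a subset of the stored rows. The following is a minimal sketch using Python's built-in sqlite3 module; the table name `purchases`, its columns, and the sample rows are hypothetical and serve only to illustrate the first example in the list.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, payment TEXT, category TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("Asha", "credit card", "groceries", 12500.0),
        ("Ravi", "cash", "groceries", 15000.0),
        ("Meena", "credit card", "groceries", 4000.0),
        ("Asha", "credit card", "electronics", 30000.0),
    ],
)

# Customers who used a credit card to buy more than Rs. 10000
# worth of groceries (totalled across their purchases).
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM purchases
    WHERE payment = 'credit card' AND category = 'groceries'
    GROUP BY customer
    HAVING SUM(amount) > 10000
    ORDER BY customer
    """
).fetchall()

print(rows)
```

The query states exactly what is wanted, and the answer is simply the matching rows; nothing hidden is discovered.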
6. Examples of what data mining is used for
Develop a general profile of credit card customers
Determine patients whose lifestyle makes them prone to a
heart attack in the near future
Differentiate poor credit risk customers from good
credit risk customers
Differentiate students who have had a backlog in their
academics from those who have not
Determine employees who have taken a loan for any
purpose
7. Data mining differs from usual query processing in many ways
Aspect | Query Processing | Data Mining
Query | Well formed, as Select… From… Where… | Not well formed; what is found is usually hidden
Data | Data from online transaction processing systems, generally in table format | Data integrated from various sources; huge amounts of data
Output | A subset of the database | Not only a subset, but also analyzed results expressed as patterns
8. Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or
knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.
10. Knowledge discovery in databases (KDD) is a multistep
process of finding useful information and patterns in
data, while data mining is one step in KDD: the
use of algorithms to extract patterns
Steps of KDD
1. Selection
Data extraction: obtaining data from heterogeneous data sources such as
databases, data warehouses, the World Wide Web, or other information
repositories
2. Preprocessing
Data cleaning: incomplete, noisy, or inconsistent data is cleaned.
Missing data may be ignored or predicted; erroneous data may be deleted
or corrected
11. Steps of KDD continued…
3. Transformation
Data integration: combines data from multiple sources
into a coherent store. Data can be encoded in common
formats, normalized, or reduced
4. Data mining
Apply algorithms to the transformed data and extract patterns
5. Pattern interpretation/evaluation
Pattern evaluation: evaluate the interestingness of the resulting
patterns, or apply interestingness measures to filter the
discovered patterns
Knowledge presentation: present the mined knowledge;
visualization techniques can be used
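The five KDD steps can be sketched as a chain of small functions. This is only an illustrative sketch under stated assumptions: the records, the mean-based fill for missing values, the min-max normalization, and the toy "mining" rule are all hypothetical stand-ins for real data sources and algorithms.

```python
from statistics import mean

# Hypothetical records from two sources; None marks a missing value.
source_a = [{"id": 1, "income": 40000.0}, {"id": 2, "income": None}]
source_b = [{"id": 3, "income": 60000.0}, {"id": 4, "income": 80000.0}]

def select(*sources):
    """Selection: gather records from several data sources."""
    return [dict(rec) for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: replace missing incomes with the mean of known ones."""
    known = [r["income"] for r in records if r["income"] is not None]
    fill = mean(known)
    return [{**r, "income": fill if r["income"] is None else r["income"]}
            for r in records]

def transform(records):
    """Transformation: min-max normalize income into the range [0, 1]."""
    lo = min(r["income"] for r in records)
    hi = max(r["income"] for r in records)
    return [{**r, "income_norm": (r["income"] - lo) / (hi - lo)} for r in records]

def mine(records, threshold=0.5):
    """Data mining (toy stand-in): ids of records above the threshold."""
    return [r["id"] for r in records if r["income_norm"] > threshold]

def evaluate(pattern, records):
    """Evaluation: fraction of all records that the pattern covers."""
    return len(pattern) / len(records)

records = transform(preprocess(select(source_a, source_b)))
high_income_ids = mine(records)
coverage = evaluate(high_income_ids, records)
print(high_income_ids, coverage)
```

Each function consumes the output of the previous step, which mirrors how data mining is only one stage of the larger KDD process.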
12. Knowledge Discovery Process
KDD is the nontrivial extraction of
implicit, previously unknown, and
potentially useful knowledge from
data
[Figure: Selection → Preprocessing → Transformation → Data Mining → Pattern Interpretation and Evaluation]
16. Why Data Mining?—Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket
analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
17. Data mining algorithms: all algorithms attempt to fit a model
closest to the data being examined.
The model is based on an analysis of the attributes of a training data
set
The model is then evaluated using a test data set
A data model can be predictive or descriptive
A predictive model makes predictions about data values
using the results found from available data. It thus makes use
of historical data to make predictions
A descriptive model identifies patterns or relationships in data. It
finds the properties of existing data and does not predict
new properties.
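The train-then-test idea can be illustrated with a very small predictive model. This is a minimal sketch, assuming hypothetical data (hours studied and a pass/fail label) and a deliberately simple model: a cut-off placed halfway between the two class means.

```python
from statistics import mean

# Hypothetical labeled data: (hours studied, passed?)
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]

# Hold out every fourth record as the test set; train on the rest.
test = data[::4]
train = [rec for i, rec in enumerate(data) if i % 4 != 0]

def fit_threshold(records):
    """Learn a cut-off halfway between the two class means."""
    pass_mean = mean(x for x, y in records if y)
    fail_mean = mean(x for x, y in records if not y)
    return (pass_mean + fail_mean) / 2

def predict(threshold, x):
    """Predict a pass when the value exceeds the learned cut-off."""
    return x > threshold

threshold = fit_threshold(train)
# Accuracy is measured only on records the model never saw in training.
accuracy = sum(predict(threshold, x) == y for x, y in test) / len(test)
print(threshold, accuracy)
```

Keeping the test records out of training is what makes the accuracy figure a fair estimate of how the model behaves on new data.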
19. Classification: maps data into predefined groups or classes
It uses supervised learning.
The algorithm uses a learning phase to build a classifier from a training
data set containing data attributes and their associated class labels
Example: a student's result, that is, the class in which the student's result will fall
Pattern recognition is a type of classification in which an input pattern is
assigned to one of several classes based on its similarity to predefined
classes.
Example: identifying terrorists among passengers. Passengers are described by
basic patterns such as the distance between the eyes, size, and shape. These
patterns are then compared with entries in a database to see whether
any match is found.
21. Grades by Useful Heat Value
Grade | Useful Heat Value (kcal/kg)
A | > 6200
B | 5601 - 6200
C | 4941 - 5600
D | 4201 - 4940
E | 3361 - 4200
F | 2401 - 3360
G | 1301 - 2400
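Mapping a value into one of these predefined grades can be written as a simple lookup. The sketch below is a fixed-rule lookup built directly from the grade table, not a learned classifier; it assumes whole-number heat values, and the function name `grade` is hypothetical.

```python
from bisect import bisect_left

# Upper bound (kcal/kg) of each grade band, lowest band first.
UPPER_BOUNDS = [2400, 3360, 4200, 4940, 5600, 6200]
GRADES = ["G", "F", "E", "D", "C", "B", "A"]

def grade(uhv):
    """Return the grade for a Useful Heat Value, or None below the G band."""
    if uhv < 1301:
        return None  # the table defines no grade for such values
    # bisect_left finds the first band whose upper bound is >= uhv.
    return GRADES[bisect_left(UPPER_BOUNDS, uhv)]

for value in (6500, 6200, 5600, 1301, 1000):
    print(value, grade(value))
```

A learned classifier would instead derive such boundaries from labeled training data rather than take them from a published table.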
22. Regression: maps data into a real-valued prediction variable.
The algorithm tries to find the best function (linear or non-linear) that fits the
training data. It assumes that the target data fits some
type of function.
Example: a college professor plans his retirement based on
current savings and income. Using a simple linear regression
formula, he can predict his future savings and, if he wants to save more,
adjust his plan accordingly.
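Simple linear regression fits a straight line y = slope · x + intercept by least squares. The sketch below is a minimal illustration; the yearly savings figures are hypothetical, and the helper name `fit_line` is an assumption rather than anything taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical savings (in lakh rupees) at the end of years 1 to 5.
years = [1, 2, 3, 4, 5]
savings = [2.0, 4.0, 6.0, 8.0, 10.0]

slope, intercept = fit_line(years, savings)
# Extrapolate the fitted line to year 10.
predicted_year_10 = slope * 10 + intercept
print(slope, intercept, predicted_year_10)
```

The prediction is only as good as the assumption that past behavior continues along the same line.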
23. Time series analysis: the value of an attribute is examined as it varies over
time
It can be used to determine similarities, classify behavior, or predict future
values
Example
Share market
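One basic way to examine how a value varies over time is to smooth it with a moving average. This is a minimal sketch, assuming hypothetical daily closing prices; a moving average is just one of many time series techniques and is chosen here for illustration.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily closing prices of a share.
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]
smoothed = moving_average(prices, window=3)
print(smoothed)
```

The smoothed series has fewer points than the input because the first full window only exists once `window` values are available.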
24. Prediction: predicts future values using regression, time series analysis, or other
approaches
Example
Flood prediction for a river based on water level, rainfall amount, time, and
humidity. Sensors placed at different locations along the river
monitor conditions so that floods can be predicted.
Weather analysis
Pollution analysis
25. Clustering: finding similarities between data according to the
characteristics found in the data, and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes
Interpretability and usability: results should be comprehensible
and usable; a domain expert is required
Example
Students are clustered on various attributes such as academic
performance, area in which they live, age, height, weight, body
mass index, and extracurricular activities.
Clusters do not have a specific size or shape.
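Grouping without predefined classes can be illustrated with k-means, one common clustering algorithm (the slides do not name a specific one). The sketch below clusters a single attribute; the marks and the starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one attribute, starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical student marks; no class labels are given.
marks = [35, 38, 40, 72, 75, 78]
centers, clusters = k_means_1d(marks, centers=[35.0, 78.0])
print(centers, clusters)
```

The groups emerge from the data alone; deciding what each group means is left to a domain expert.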
27. Summarization: maps data into subsets with simple descriptions. It extracts or
derives representative summary information
Example
A summary of student results, which gives the number of students who appeared for the
exam, passed, and failed, along with a breakdown by class
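A result summary of this kind can be sketched with a few counts. The student records and the class boundaries below are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical exam results: (student, marks out of 100).
results = [("A", 82), ("B", 64), ("C", 35), ("D", 55), ("E", 71), ("F", 28)]

def result_class(marks):
    """Hypothetical class boundaries, chosen only for illustration."""
    if marks >= 70:
        return "distinction"
    if marks >= 60:
        return "first class"
    if marks >= 40:
        return "pass"
    return "fail"

# Count how many students fall into each class.
by_class = Counter(result_class(m) for _, m in results)
summary = {
    "appeared": len(results),
    "passed": len(results) - by_class["fail"],
    "failed": by_class["fail"],
}
print(summary, dict(by_class))
```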
28. Association rules: discover relationships among data; used in
market basket analysis to find items frequently purchased together
Example: a person buying sugar in the mall also buys milk. Items
that people buy together are therefore placed together.
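The strength of a rule such as "sugar → milk" is usually measured by its support and confidence. This is a minimal sketch of those two standard measures, assuming a handful of hypothetical market baskets.

```python
# Hypothetical market baskets.
baskets = [
    {"sugar", "milk", "tea"},
    {"sugar", "milk"},
    {"sugar", "bread"},
    {"milk", "bread"},
    {"sugar", "milk", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"sugar", "milk"})
rule_confidence = confidence({"sugar"}, {"milk"})
print(rule_support, rule_confidence)
```

In this toy data, sugar and milk appear together in three of five baskets, and three of the four sugar baskets also contain milk.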
29. Sequence discovery: discovers sequential patterns in
data, such as the order in which items are purchased or data is
accessed
Example:
When a customer purchases a TV set, the sales
manager assumes that the customer will later also buy some CDs and
a music system.
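Unlike association rules, sequence discovery cares about order. The sketch below counts, over hypothetical purchase histories, how often one item is bought before another; real sequential pattern mining algorithms are more elaborate, so this is only an illustration of the idea.

```python
from collections import Counter

# Hypothetical purchase histories, each in the order items were bought.
histories = [
    ["tv", "cd", "music system"],
    ["tv", "music system"],
    ["cd", "tv"],
    ["tv", "speakers", "cd"],
]

def ordered_pairs(history):
    """All (earlier, later) item pairs in one history, counted once each."""
    return {
        (history[i], history[j])
        for i in range(len(history))
        for j in range(i + 1, len(history))
    }

# Number of histories in which each ordered pair occurs.
pair_counts = Counter(p for h in histories for p in ordered_pairs(h))
tv_then_cd = pair_counts[("tv", "cd")]
cd_then_tv = pair_counts[("cd", "tv")]
print(tv_then_cd, cd_then_tv)
```

The pair (tv, cd) and the pair (cd, tv) are counted separately, which is exactly what distinguishes a sequential pattern from a plain co-occurrence.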
30. Influence from many disciplines
Data mining draws on:
Artificial Intelligence
Information Technology
Database Technology
Machine Learning
Pattern Recognition
Statistics
Algorithms
Visualization
Mathematical Modeling
31. Depending on the data mining approach, techniques from
other disciplines may be applied, such as
•Information Retrieval
•Artificial Intelligence
•Neural networks
•Fuzzy set theory
•Knowledge representation
•Logic programming
•High performance computing
32. Data mining issues
Human interaction: interfaces with both domain and technical
experts are required to analyze the output and draw
correct inferences after the data mining step
Overfitting: occurs when the model fits the current
data exactly but does not fit future data, for example when the training
data set is not representative
Outliers: the model may be distorted by the
presence of outliers
Interpretation of results: experts are required because
results can be hard to interpret
Visualization of results: visualization helps to display
analyzed data, but for multi-dimensional data visualization
becomes problematic
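The distorting effect of outliers can be seen with the simplest possible "model", an average. This is a minimal sketch, assuming hypothetical income figures; the outlier rule (more than `k` median absolute deviations from the median, with `k = 5`) is one common heuristic and the threshold is an assumption.

```python
from statistics import mean, median

# Hypothetical monthly incomes (in thousand rupees); 900 is an outlier.
incomes = [30, 32, 35, 31, 34, 900]

def flag_outliers(values, k=5.0):
    """Flag values further than k median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) > k * mad]

outliers = flag_outliers(incomes)
cleaned = [v for v in incomes if v not in outliers]

# The mean is pulled far from the typical value; the median barely moves.
mean_with = mean(incomes)
mean_without = mean(cleaned)
median_with = median(incomes)
print(outliers, mean_with, mean_without, median_with)
```

Whether a flagged value should be removed or kept is a judgment call that usually needs a domain expert.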
33. Data mining issues continued…
Large datasets: scalability problems may arise, as algorithms do not
scale well to massive real-world datasets; sampling and
parallelization are effective tools for this problem
High dimensionality: a conventional database may contain
many different attributes, not all of which are relevant. Some
may increase complexity and reduce efficiency. This is known
as the curse of dimensionality; it can be addressed by
dimensionality reduction.
Multimedia data: found in GIS databases, it renders
conventional data mining algorithms ineffective
Missing data: it is not always possible to ignore missing data,
but during preprocessing data mining algorithms can be used to
replace missing data with estimates
34. Data mining issues continued…
Irrelevant data: data can be reduced by removing irrelevant data
Noisy data: invalid or incorrect data leads to poor-quality
data mining results
Changing data: data warehouses contain non-volatile data;
when dynamic data is loaded, algorithms are reapplied to
check that they still work correctly.
Integration: KDD requests are often treated as one-time needs; data mining
functions are now being integrated into traditional database systems
Applications: making effective use of the output of a mining algorithm is
a greater challenge than the complexity of the algorithm itself
35. Data mining metrics
How can the effectiveness of the data mining process be measured?
- The KDD process is expensive; the return on investment is the
savings that result from decisions made using its results
- This is difficult to measure and quantify
Social implications of data mining
There are two sides to the coin
Data mining can be used to improve customer service and
satisfaction
Data mining can also threaten one’s right to privacy
Omnipresent, invisible data mining affects everyone
36. Data mining should follow certain Guidelines
Purpose specification and use limitation
Openness
Security safeguards
Individual participation
Privacy Preserving data mining
- secure Multiparty computation
- data obscuration
37. Applications of data mining
Security: identifying terrorists using classification
techniques
Weather: predicting weather and pollution
Finance: share market
E-commerce: market basket analysis
Education: student result preparation
Banking: analysis of customers applying for loans
Research: data analysis
Fraud detection
Marketing: targeting customers
Molecular biology
Astronomy
Health: detecting disease in people
38. Books for Reference
Data Mining: Introductory and Advanced Topics by
Margaret H. Dunham and Sridhar
Pearson Education
ISBN 81-7758-785-4
Data Mining: Concepts and Techniques by Jiawei Han
and Micheline Kamber
Morgan Kaufmann Publishers
ISBN 81-312-0535-5