Data mining

DATA MINING
PRESENTED BY: KINZA RAZZAQ (BSIT-13-F072)

AGENDA
• A brief introduction to Data Mining
• What can Data Mining do
• Supervised vs. Unsupervised Learning

“There are things that we know that we know (known knowns)…
There are things that we know that we don’t know (known unknowns)…
There are things that we don’t know we don’t know (unknown unknowns)…
There are things that we don’t know we know (unknown knowns).”

Data mining is relevant to the fourth point (shown in red on the slide).
It is the art of digging out exactly what we don’t know but must know about our business.
The methodology is to first convert “unknown unknowns” into “known unknowns” and then finally into “known knowns”.

DATA WAREHOUSING VS. DATA MINING

• Data Warehousing provides the enterprise with a memory
• Data Mining provides the enterprise with intelligence
• Data Mining works with the Data Warehouse

What is Data Mining?
• Also known as Knowledge Discovery in Databases (KDD).
• Data mining digs out valuable, non-trivial information from large, multidimensional, apparently unrelated databases.
• It is the integration of business knowledge, people, information, algorithms, statistics and computing technology.
• It finds useful hidden patterns and relationships in data.

HUGE VOLUME: THERE IS WAY TOO MUCH DATA, AND IT IS GROWING!
• Bridging the gap
• Supply & Demand
• To minimize the volume

Example of growing DATA
• Data is collected much faster than it can be processed or managed. NASA's Earth Observation System (EOS) alone collected 15 petabytes by 2007 (15,000,000,000,000,000 bytes).
• Much of it won't be used, ever!
• Much of it won't be seen, ever!
• Why not?
• There is so much volume that the usefulness of some of it will never be discovered.

Solution to the Problem of Growing Data
Reduce the volume and/or raise the information content by structuring, querying, filtering, summarizing, aggregating, and mining the data.

Claude Shannon's information theory
More volume, less information

Levels of data, from the largest volume to the smallest:
• Raw Data: raw data has the maximum volume.
• Indexed Data: we have found shortcuts to reach desired points in the voluminous data sea, rather than conventional scanning.
• Information: the next level is the aggregated/summarized data.
• Knowledge: next is the level where the machine discovers and learns rules.
• Decision Support: next is the level where the machine supports the decision-making process by helping to select appropriate pre-defined rules.

The amount of digital data recorded and stored has exploded during the past decade,
BUT
the number of scientists, engineers, and analysts available to analyze the data has not grown correspondingly.

• Limitations of OLTP systems
• Massive data sets
• High dimensionality
• New data types
• Multiple heterogeneous data resources
Conventional systems couldn’t keep pace with the ever-changing and ever-increasing data sets.
• Data mining algorithms are built

How is Data Mining different?
• Data Warehouses (Data-driven exploration)
• Data Mining (Knowledge-driven exploration)
• Traditional Database (Transactions)
• Knowledge Discovery (KDD)

Data Mining vs. Statistics
• Formal statistical inference is assumption driven, i.e. a hypothesis is formed and validated against the data.
• Data mining is discovery driven, i.e. patterns and hypotheses are automatically extracted from the data.

Knowledge extraction using statistics
[Figure: "Inflation vs. stock index increase"; stock increase (%) from 0 to 40 plotted against inflation (%) from 1.6 to 6]
Q: What will be the stock increase when inflation is 6%?
A: Model the non-linear relationship using a line, y = mx + c. Hence the answer is 13%.
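
As a minimal sketch of the line-fitting step, the slope m and intercept c of y = mx + c can be estimated by ordinary least squares. The function names and the data points below are hypothetical and chosen for illustration only; they are not the values plotted on the slide.

```python
def fit_line(xs, ys):
    """Least-squares estimate of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x):
    """Evaluate the fitted line at x."""
    return m * x + c


# Hypothetical (inflation %, stock increase %) pairs, NOT the slide's data.
inflation = [1.0, 2.0, 3.0, 4.0]
stock_increase = [5.0, 8.0, 11.0, 14.0]

m, c = fit_line(inflation, stock_increase)
estimate = predict(m, c, 6.0)
print(f"m = {m:.2f}, c = {c:.2f}, estimate at 6% = {estimate:.2f}")
```

A straight line can only extrapolate the trend it was fitted to, so an estimate far from the observed points should be read with care when the true relationship is non-linear.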

Failure of regression models
[Figure: two plots, x-axis from 0 to 35, y-axis up to 70,000, with the fitted trend line below]
y = -0.0127x^6 + 1.5029x^5 - 63.627x^4 + 1190.3x^3 - 9725.3x^2 + 31897x - 29263

Data Mining is…
• Decision Trees (If … Then …)
• Rule Induction
• Clustering
• Genetic Algorithms
• Neural Networks

What can Data Mining Do
• Classification
• Estimation
• Prediction
• Market Basket Analysis
• Clustering
• Description

Market Basket Analysis
Example: 98% of people who purchased items A and B also purchased item C.
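
A figure such as "98% of people who purchased A and B also purchased C" is the confidence of the rule {A, B} -> {C}. The sketch below shows one way to compute it; the function name and the five baskets are hypothetical and used only for illustration.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # the rule never applies in this data
    hits = sum(1 for t in with_antecedent if consequent <= set(t))
    return hits / len(with_antecedent)


# Hypothetical baskets, not taken from the slides.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
]

conf_ab_c = rule_confidence(baskets, {"A", "B"}, {"C"})
print(f"confidence(A, B -> C) = {conf_ab_c:.2f}")
```

Here three baskets contain both A and B, and two of those also contain C, so the confidence is 2/3.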

Clustering
Segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters.

Description
It is beneficial to know what is happening in our databases: turn the cube to different angles to get to the information of interest.

Comparing Methods
Accuracy
Speed
Robustness
Scalability
Interpretability

Where does Data Mining fit in?
Data mining: the core of the knowledge discovery process.
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Data Structures in Data Mining
• Data matrix
– Table or database
– n records and m attributes
– n >> m

C1,1  C1,2  C1,3  …  C1,m
C2,1  C2,2  C2,3  …  C2,m
C3,1  C3,2  C3,3  …  C3,m
…
Cn,1  Cn,2  Cn,3  …  Cn,m

• Similarity matrix
– Symmetric square matrix
– n x n or m x m

1     S1,2  S1,3  …  S1,n
S2,1  1     S2,3  …  S2,n
S3,1  S3,2  1     …  S3,n
…
Sn,1  Sn,2  Sn,3  …  1
Main types of DATA MINING
Supervised (type and number of classes are known in advance)
• Bayesian Modeling
• Decision Trees
• Neural Networks
• Etc.
Unsupervised (type and number of classes are NOT known in advance)
• One-way Clustering
• Two-way Clustering

Clustering: Min-Max Distance
• Inter-cluster distances are maximized
• Intra-cluster distances are minimized
[Figure: scatter plot of Age and Salary (axis ticks 20, 40, 60) with one point marked as an outlier]

One-way clustering example
[Figure: INPUT and OUTPUT matrices]
• Black spots are noise
• White spots are missing data

Data Mining Agriculture data
[Figure: INPUT and clustered OUTPUT, with the clusters marked]
Created a similarity matrix using farm area, cotton variety and pesticide used.

Classification
Unseen Data → Classifier (model) → Which class?

How does Classification work?
[Figure labels: Inputs, Output, Confidence Level (accuracy)]

Classification: Model Construction
Relationship between shopping time and items bought
Training Data (observations, measurements, etc.):
NAME    Time  Items  Gender
Moin    10    2      M
Munir   16    3      M
Meher   15    1      F
Javed   5     1      M
Mahin   20    1      F
Akram   20    4      M
Training Data → Classification Algorithms → Classifier (Model):
IF time/items >= 6
THEN gender = ‘F’
Classification: Use in Prediction
Testing Data:
NAME    Time  Items  Gender
Tahir   20    1      M
Younas  11    2      M
Yasin   3     1      M
Unseen Data: (Addan, Time = 15, Items = 1) → Classifier → Gender?
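
A sketch of both uses of the classifier: measuring accuracy on the testing data, then predicting the unseen record. It reuses the rule IF time/items >= 6 THEN gender = 'F'; the ELSE branch ('M') and all variable names are assumptions.

```python
def classify(time, items):
    """IF time/items >= 6 THEN 'F'; the ELSE branch ('M') is assumed."""
    return "F" if time / items >= 6 else "M"


# Testing records from the slide: (name, time, items, gender).
testing = [
    ("Tahir", 20, 1, "M"),
    ("Younas", 11, 2, "M"),
    ("Yasin", 3, 1, "M"),
]

hits = sum(classify(time, items) == gender for _, time, items, gender in testing)
accuracy = hits / len(testing)
print(f"test accuracy = {hits}/{len(testing)}")

# Unseen record from the slide: Addan, Time = 15, Items = 1.
addan_gender = classify(15, 1)
print(f"Addan -> {addan_gender}")
```

Tahir's ratio is 20, so the rule labels him 'F' although the table says 'M'; the rule gets two of the three test records right. For Addan the ratio is 15, so the prediction is 'F'.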

Clustering vs. Cluster Detection
• In one-way clustering, reordering of rows (or columns) assembles clusters.
• If the clusters are NOT assembled, they are very difficult to detect.
First you cluster your data, and then you detect clusters in the clustered data.

The K-Means Clustering
k-means clustering aims to partition ‘n’ observations
into ‘k’ clusters in which each observation belongs to
the cluster with the nearest mean.

The k-means algorithm is implemented in 4 steps
[Steps 1 to 3 are shown as figures]
Step 4: Go back to Step 2; stop when there are no more new assignments
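
A minimal sketch of the loop, assuming the common formulation of k-means: start from k initial centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until no assignment changes. How this maps onto the slide's step numbers is an assumption, as are the function name, the explicit starting centroids, and the sample points.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """k-means on 2-D points; returns (assignment, centroids)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so stop
        assignment = new_assignment
        # Recompute each centroid as the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centroids


# Hypothetical points forming two visible groups, with k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, [(1, 1), (8, 8)])
print(labels)
print(centers)
```

In practice the starting centroids are often chosen at random, and different starts can lead to different final clusters.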

Example
[Figure: four panels labelled A, B, C and D, each with both axes running from 0 to 10]
