PUNJAB UNIVERSITY COLLEGE OF
INFORMATION TECHNOLOGY
Data Mining
Group Members
Abdul Basit Khan
Ali Rehman
Haseeb Ahmed
Saira Bano
Iftikhar Ahmed
G. Farid Razzaqi
Some Terms
•Data: raw facts and figures
•Information: a pattern or relationship among data
•Information can be converted into knowledge
Finding vs. Discovery
You find something that you know you have lost, or that you know
exists but is out of your reach, so you search for it and find it.
On the other hand, when you do not know what the result will be,
the term discovery is used rather than finding. Columbus discovered
America.
Before we go to Data Mining
Things around us occur as:
1. Known knowns
2. Known unknowns
3. Unknown unknowns
4. Unknown knowns
In DM we are concerned with
the ----- one!
A Quote
“Finding a needle in a haystack.”
It is a tough job to find a needle in a big box full of hay.
Data mining is like finding the needle (knowledge) in the haystack
(huge data) when you have no idea where the needle can be found.
Data Mining
•Non-trivial extraction of previously unknown and
potentially useful information from data in large
databases
•Data mining discovers hidden information in your
data, but it cannot tell you the value of the
information to your organization.
Non-trivial (not common)
It is not a straightforward computation of predefined
quantities, like computing the average value of a
set of numbers.
E.g. Suwaiyan on Eid.
Goal of data mining
Extract information from a data
set and transform it into an
understandable structure for
further use
History
•The term "Knowledge Discovery in Databases" (1989) became
popular in the AI and machine learning community.
•The term "Data Mining" appeared around 1990 in the database
community.
Nowadays
•Data Mining and Knowledge Discovery are used interchangeably.
Early methods of identifying patterns in data
Early methods include Bayes' theorem (1700s) and regression
analysis (1800s). Later came neural networks, cluster analysis,
and genetic algorithms (1950s), decision trees (1960s), and
support vector machines (1990s).
Spatial Data Mining
•Spatial data mining is the
application of data mining methods
to spatial data. The end objective
of spatial data mining is to find
patterns in data with respect to
geography.
Examples
•Offices requiring analysis or dissemination of
geo-referenced statistical data
•Public health services searching for
explanations of disease clustering
•Environmental agencies assessing the impact
of changing land-use patterns on climate
change
•Geo-marketing companies doing customer
segmentation based on spatial location
Claude Shannon’s information theory
•More volume of data means less
information
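One hedged way to read this slide is that extra volume does not by itself add information. The sketch below assumes Shannon entropy, measured in bits per record, as the measure of information. The function name `shannon_entropy` and the sample values are illustrative and do not come from the slides.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy of a sequence, in bits per value."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical data: a small set of records, and the same records repeated.
small = ["sports", "weather"]
large = small * 1000  # 1000 times the volume, but nothing new

print(shannon_entropy(small))
print(shannon_entropy(large))
```

Under these assumptions both calls give the same entropy, so the thousandfold increase in volume carries no additional information per record.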
Examples: What is (not) Data Mining?
•What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
•What is Data Mining?
– Certain names are more prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine
according to their context (e.g. Amazon rainforest, Amazon.com)
Data mining vs. statistics
(Both deal with data analysis)
Data mining
1. Data mining is knowledge driven.
2. Data mining is discovery driven, i.e. patterns and
hypotheses are automatically extracted from data.
3. Data mining builds many complex, predictive, nonlinear
models.
Statistics
1. Statistics is human driven.
2. Inference is assumption driven, i.e. a hypothesis is
formed and validated against the data.
3. Statistics focuses on small data sets.
Data Mining process
Techniques
•Prediction
•Description
•Classification
•Market basket analysis
•Estimation
•Clustering
•Decision tree
Prediction
Records are classified according to some
predicted future behavior.
It is not like the prediction that a palmist
makes.
Example:
Predicting how much a customer
will spend during the next six months.
For this, the customers, their interests,
their likes and dislikes, and their buying
patterns should be understood.
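The spending example can be sketched with a simple least-squares line. This is only an illustration: the slides do not name a prediction method, and the helper names `fit_line` and `predict_spend`, as well as all the figures, are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: spend in the past six months and in the six months after.
past_spend = [100.0, 200.0, 300.0, 400.0]
next_spend = [120.0, 220.0, 320.0, 420.0]

slope, intercept = fit_line(past_spend, next_spend)


def predict_spend(past):
    """Predict next-six-month spend from past-six-month spend."""
    return slope * past + intercept


print(predict_spend(500.0))
```

A real model would use many more customer attributes (interests, likes and dislikes, buying patterns) than the single input assumed here.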
Description
It describes what is going on in a complicated database to
increase our understanding.
Example:
Some standard tools are:
•Summary statistics: measures of central tendency (mean,
median, mode) or measures of scattering (range)
•Graphical representation (histograms, graphs, plots)
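The descriptive tools named on this slide can be sketched with Python's standard `statistics` module. The `purchases` values are made up for illustration, and the text histogram is just one simple stand-in for a graphical plot.

```python
import statistics

# Hypothetical purchase amounts for a handful of customers.
purchases = [12, 15, 15, 18, 20, 22, 45]

summary = {
    "mean": statistics.mean(purchases),      # central tendency
    "median": statistics.median(purchases),  # central tendency
    "mode": statistics.mode(purchases),      # central tendency
    "range": max(purchases) - min(purchases),  # scattering
}
print(summary)

# A crude text histogram with bins of width 10.
for low in range(10, 50, 10):
    count = sum(low <= p < low + 10 for p in purchases)
    print(f"{low}-{low + 9}: {'#' * count}")
```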
Classification
Based on the properties of existing data, we assign
a new data element to one of the groups.
Example:
On a news website, deciding where a specific
news item should be placed.
The following questions should be answered:
What should the category of the news item be?
What should its hierarchical position be?
Should it be placed in the sports or the weather section?
Answer
Doing this manually is time consuming and prone
to errors.
Classification techniques are used to scan
and process the documents,
e.g. the frequent occurrence of the keyword
“cricket” will help to place an item in a
specific category, namely the “SPORTS” section.
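The keyword idea can be sketched as a tiny keyword-count classifier. Apart from “cricket”, every keyword here is an assumption made for illustration, as are the names `SECTION_KEYWORDS` and `classify`. Real classifiers learn such associations from labeled documents instead of using a hand-written list.

```python
import re
from collections import Counter

# Hypothetical keyword lists per section; only "cricket" comes from the slide.
SECTION_KEYWORDS = {
    "sports": {"cricket", "match", "wicket", "team"},
    "weather": {"rain", "temperature", "forecast", "storm"},
}


def classify(text):
    """Assign a news item to the section whose keywords occur most often."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        section: sum(words[k] for k in keywords)
        for section, keywords in SECTION_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


print(classify("Cricket team wins the cricket match"))
print(classify("Heavy rain in the forecast"))
```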
CLASSIFICATION
Market basket analysis
Finding out which items are sold together.
Example: people often buy tea and
milk together from a shop.
We can run a sales-promotion campaign
by placing items that are bought
together near each other in the store.
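A minimal sketch of the idea counts how often each pair of items appears in the same basket. The baskets are invented, and `support` is used in its usual sense (the fraction of baskets containing the pair), a term the slides do not introduce.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shop transactions.
baskets = [
    {"tea", "milk", "sugar"},
    {"tea", "milk"},
    {"bread", "milk"},
    {"tea", "milk", "bread"},
    {"bread", "butter"},
]

# Count every unordered pair of items that shares a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1


def support(pair):
    """Fraction of baskets that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(baskets)


print(pair_counts.most_common(1))
print(support(("tea", "milk")))
```

In this toy data, tea and milk are the most frequent pair, which is the kind of result that would suggest shelving them close together.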
Estimation
Building a model and assigning a value
from 0 to 1 to each member of the set.
This value gives the probability of a record
belonging to a group.
The members are then classified into
categories based on a threshold value.
As the threshold changes, the assigned class
may change.
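The thresholding step can be sketched as follows. The scores stand in for the output of some model, which is not shown, and the customer names, labels and threshold values are all hypothetical.

```python
def classify_by_threshold(scores, threshold):
    """Label each member by comparing its 0-to-1 score with the threshold."""
    return {
        name: "likely buyer" if score >= threshold else "unlikely buyer"
        for name, score in scores.items()
    }


# Hypothetical model outputs: estimated probability of belonging to the group.
scores = {"customer_a": 0.82, "customer_b": 0.55, "customer_c": 0.30}

at_half = classify_by_threshold(scores, 0.5)
at_high = classify_by_threshold(scores, 0.7)

print(at_half)
print(at_high)
```

Here `customer_b` moves from one class to the other when the threshold rises from 0.5 to 0.7, which is the effect the slide describes.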
Decision Trees
•Decision trees are tree-shaped
structures that represent decision
sets.
• These decisions generate rules,
which then are used to classify data.
Data set for decision tree
height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B
Decision Trees (cont.)
Split on hair:
  dark:  short, blue = B; tall, blue = B; tall, brown = B
  red:   tall, blue = A
  blond: short, blue = A; tall, brown = B; tall, blue = A; short, brown = B
Completely classifies dark-haired and red-haired people.
Does not completely classify blond-haired people.
More work is required.
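The check described on this slide can be sketched by grouping the eight records by hair colour and testing whether each group contains a single class. The records are copied from the data set slide, while the variable names are illustrative.

```python
from collections import defaultdict

# (height, hair, eyes, class) records from the data set slide.
records = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]

# Collect the set of classes seen for each hair colour.
classes_by_hair = defaultdict(set)
for height, hair, eyes, label in records:
    classes_by_hair[hair].add(label)

# A branch is "pure" when every record in it has the same class.
pure = {hair: len(labels) == 1 for hair, labels in classes_by_hair.items()}
print(pure)
```

The dark and red branches are pure, while the blond branch mixes A and B, so it needs a further split.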
Decision Trees: Learned Predictive Rules
hair
  dark:  B
  red:   A
  blond: eyes
    blue:  A
    brown: B
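The learned rules can be written directly as nested conditions and checked against the eight training records. This is a hand-written sketch of the finished tree and not a tree-learning algorithm, and the function name `predict` is illustrative.

```python
def predict(hair, eyes):
    """Apply the learned rules: split on hair, then on eyes for blond hair."""
    if hair == "dark":
        return "B"
    if hair == "red":
        return "A"
    # Remaining case in this data set: blond hair, decided by eye colour.
    return "A" if eyes == "blue" else "B"


# (height, hair, eyes, class) records from the data set slide.
records = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]

correct = sum(predict(hair, eyes) == label for _, hair, eyes, label in records)
print(f"{correct} of {len(records)} records classified correctly")
```

Height never appears in the rules, because hair and eyes are enough to separate the two classes in this data set.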
Decision Tree
Clustering
It is one of the most important DM techniques.
It involves grouping data items without
taking any parametric input from a human.
Types
1. One-way clustering: only the data records
(rows) are used.
2. Two-way clustering: both the rows and the
columns are used for clustering.
Land use: Identification of areas
of similar land use in an earth
observation database.
City-planning: Identifying groups
of houses according to their
house type, value, and
geographical location.
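As a rough illustration of grouping houses by value, the sketch below runs a one-dimensional k-means. K-means is swapped in here as one common clustering technique, and the slides do not name an algorithm. It also needs the number of groups up front, which the slides do not discuss. The house values, the starting centroids and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest centroid."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical house values (in thousands).
house_values = [90, 100, 110, 300, 310, 320]

centroids, clusters = kmeans_1d(house_values, centroids=[90, 320])
print(centroids)
print(clusters)
```

A fuller version would cluster on several attributes at once, such as house type, value and geographical location.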
Clustering
Conclusion
Data mining is more than running some complex
queries on the data stored in your database.
You must work with your data, reformat it, or
restructure it, regardless of whether you are
using SQL, document-based databases, frameworks
such as Hadoop, or simple flat files. The
format of the information that you need depends
on the technique and the analysis that you
want to do. After you have the information in
the format you need, you can apply the different
techniques (individually or together) regardless
of the underlying data structure or data
set.