Data Mining
Zahra Pourbahman and Behnaz Sadat Motavali
Supervisor: Dr. Alireza Bagheri
Advanced Database Course
November 2016
Outline
1. Introduction
2. DM Methods
3. Complementary Information
4. Conclusion
1. Introduction
What is Data Mining and why is it important?
Data Mining
• Extracting or mining knowledge from large amounts of data
• Also known as knowledge discovery in databases (KDD)
Why Data Mining?
Process of Knowledge Discovery
Data Mining Methods
• Predictive
  ▸ Classification (e.g., SVM)
  ▸ Regression (e.g., Linear Regression)
• Descriptive
  ▸ Clustering (e.g., K-Means)
  ▸ Association Rules (e.g., Apriori)
2. DM Methods
What are Classification, Clustering, Association Rules, and Regression in Data Mining?
Classification
Classification Problem
• Given: a training set, i.e., a labeled set of 𝑁 input-output pairs 𝐷 = {(xi, yi)}, 1 ≤ i ≤ 𝑁, with 𝑦 ∈ {1, …, 𝐾}
• Goal: given an input 𝒙 as test data, assign it to one of the 𝐾 classes
• Examples:
▸ Spam filtering
▸ Shape recognition
Learning and Decision Boundary
• Assume that the training data are perfectly linearly separable
• Note that we seek w such that
  wᵀxₙ ≥ 0 when yₙ = +1
  wᵀxₙ < 0 when yₙ = −1
  i.e., wᵀxₙyₙ ≥ 0 for every correctly classified sample; the corresponding error to minimize is the perceptron criterion E(w) = −Σₙ∈ℳ wᵀxₙyₙ, summed over the set ℳ of misclassified samples
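To make the criterion concrete, here is a minimal numpy sketch of a perceptron-style learner for this setting (my own illustration; the toy data, learning rate, and function name are assumptions, not from the slides): whenever a sample violates wᵀxₙyₙ ≥ 0, w is nudged toward it.

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Learn w so that sign(w @ x) matches y in {+1, -1}.

    Assumes the data are (near-)linearly separable, as on this slide,
    and that a constant bias feature is already appended to X.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xn, yn in zip(X, y):
            if yn * (w @ xn) <= 0:        # violates w^T x_n y_n >= 0
                w += lr * yn * xn         # one step against E(w) = -sum w^T x_n y_n
                mistakes += 1
        if mistakes == 0:                 # every sample on the correct side
            break
    return w

# Toy usage: 2-D points with a bias feature as the last column.
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))                     # should reproduce y
```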
Margin
• Which line is better to select as the boundary to provide more generalization capability?
• A larger margin provides better generalization to unseen data
• We want a hyperplane that is farthest from all training samples
• The largest-margin hyperplane has equal distances to the nearest samples of both classes
Hard-Margin Support Vector Machine (SVM)
• When the training samples are not linearly separable, the hard-margin formulation has no solution.
Beyond Linear Separability
• Noise in the linearly separable classes
• Overlapping classes that can be approximately separated by a linear boundary
Beyond Linear Separability: Soft-Margin SVM
• Soft margin: maximizing the margin while trying to minimize the distance between misclassified points and their correct margin plane
• SVM with slack variables: allows samples to fall within the margin, but penalizes them
Soft-Margin SVM: Parameter 𝐶
𝐶 is a tradeoff parameter:
• small 𝐶 allows margin constraints to be easily ignored → large margin
• large 𝐶 makes constraints hard to ignore → narrow margin
• 𝐶 = ∞ enforces all constraints → hard margin
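To see the 𝐶 tradeoff in practice, here is a small sketch using scikit-learn's SVC (the library choice and the toy data are my assumptions; the slides do not prescribe a tool). Its C argument is exactly the tradeoff parameter above.

```python
import numpy as np
from sklearn.svm import SVC

# Two slightly overlapping classes (soft-margin territory).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C -> margin violations tolerated (wide margin, many support vectors);
    # large C -> violations punished hard (narrow margin, fewer support vectors).
    print(f"C={C:>6}: support vectors = {len(clf.support_vectors_)}")
```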
Support Vectors
• Hard-margin support vectors: SVs = {x(i) | 𝛼ᵢ > 0}
• The direction of the hyperplane can be found from the support vectors alone:
  w = Σᵢ 𝛼ᵢ y(i) x(i), where the sum runs over the support vectors (i.e., those with 𝛼ᵢ > 0)
Classifying New Samples Using Only SVs in SVM
Classification of a new sample 𝒙: ŷ = sign(wᵀ𝒙 + b) = sign(Σᵢ 𝛼ᵢ y(i) (x(i))ᵀ𝒙 + b), summing over the support vectors only
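A minimal numpy sketch of that decision rule (the support vectors, 𝛼 values, and bias below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical support vectors, their labels, and dual coefficients alpha_i > 0.
sv_x = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0]])
sv_y = np.array([+1, +1, -1])
alpha = np.array([0.4, 0.1, 0.5])
b = -0.2                                   # bias term (also assumed)

# w = sum_i alpha_i * y_i * x_i, over support vectors only.
w = (alpha * sv_y) @ sv_x

def classify(x_new):
    """Predicted class sign(w^T x + b) for a new sample."""
    return int(np.sign(w @ x_new + b))

print(classify(np.array([1.5, 1.0])))      # +1 for this toy setup
print(classify(np.array([-2.0, -0.5])))    # -1
```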
Clustering
Example: Clusty.com, a search engine that clusters its results
Clustering Problem
• We have a set of unlabeled data points {𝒙(i)}, 1 ≤ i ≤ N, and we intend to find groups of similar objects (based on the observed features)
K-means Clustering
• Given: the number of clusters 𝐾 and a set of unlabeled data 𝒳 = {𝒙1, …, 𝒙N}
• Goal: find groups of data points 𝒞 = {𝒞1, 𝒞2, …, 𝒞K}
• Hard partitioning:
  ∀j: 𝒞ⱼ ≠ ∅
  ∀i ≠ j: 𝒞ᵢ ∩ 𝒞ⱼ = ∅
• Intra-cluster distances are small (compared with inter-cluster distances)
Distortion Measure
• Our goal is to find 𝒞 = {𝒞1, 𝒞2, …, 𝒞K} and the centroids {𝝁1, …, 𝝁K} so as to minimize the distortion
  J = Σₙ Σₖ rₙₖ ‖𝒙ₙ − 𝝁ₖ‖², where rₙₖ = 1 if 𝒙ₙ is assigned to cluster k and 0 otherwise [Bishop]
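A small sketch, following the [Bishop] definition above, that evaluates the distortion for a given assignment (the toy data and function name are mine):

```python
import numpy as np

def distortion(X, assignments, centroids):
    """J = sum_n ||x_n - mu_{k(n)}||^2 for the cluster k(n) each point is assigned to."""
    diffs = X - centroids[assignments]          # pick each point's own centroid
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 5.0]])
assignments = np.array([0, 0, 1, 1])
centroids = np.array([[0.25, 0.0], [5.25, 5.0]])
print(distortion(X, assignments, centroids))    # 0.25 for this toy data
```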
K-means Algorithm
• Select 𝐾 random points 𝝁1, 𝝁2, …, 𝝁K as the clusters’ initial centroids
• Repeat until convergence (or another stopping criterion):
  ▸ for i = 1 to 𝑁: assign 𝒙(i) to the closest cluster (nearest centroid)
  ▸ for k = 1 to 𝐾: update centroid 𝝁k to the mean of the points currently assigned to cluster k
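A compact numpy sketch of this loop (an illustration under my own naming and toy data, not the slides' implementation):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]    # K random points as initial centroids
    for _ in range(iters):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (assumes no cluster goes empty, which holds for this toy data).
        new_centroids = np.array([X[assignments == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):                # converged
            break
        centroids = new_centroids
    return assignments, centroids

X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels, mus = kmeans(X, K=2)
print(labels, mus)
```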
K-means Algorithm Step by Step
[Figure: assigning data to clusters, then updating the means; from Bishop]
Summary of the First Part
• Data mining on labeled data → classification
  ▸ Linearly separable data: hard-margin SVM (maximizing the margin)
  ▸ Noisy data and overlapping classes: soft-margin SVM
• Data mining on unlabeled data → clustering
  ▸ K-means: alternately assigning data to clusters and updating the centroids
Association Rules
Association Rules:
• Frequent Patterns
▹ Frequent Itemsets
▹ Frequent Sequential Patterns
▹ …
• Relations Between Data
Example
{Milk, Bread} is an itemset with k = 2 items
Support & Confidence
{Milk} → {Bread} [Support = 50%, Confidence = 100%]
ID  Items                          k
1   {Milk, Bread, Meat}            3
2   {Sugar, Bread, Eggs}           3
3   {Milk, Sugar, Bread, Butter}   4
4   {Bread, Butter}                2
Support: 2 of the 4 transactions contain both Milk and Bread → 50%. Confidence: both of the 2 transactions containing Milk also contain Bread → 100%.
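A short sketch (my own code) that recomputes those two numbers from the four transactions above:

```python
transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk", "Bread"}))          # 0.5 -> Support = 50%
print(confidence({"Milk"}, {"Bread"}))     # 1.0 -> Confidence = 100%
```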
Association Rules
• A rule has the form X → Y, where X ∩ Y = ∅
• Rules must satisfy minimum support (minsup) and minimum confidence (minconf) thresholds
• Rules are directional: Y → X is not the same rule as X → Y
Example: {Milk, Sugar} → {Bread}
Frequent Patterns
• Example: is {Milk, Bread} a frequent pattern?
• Computing the support of a pattern with k items means considering its 2^k − 1 nonempty subsets
• Checking candidates against N transactions over A items takes on the order of (2^k − 1) · N · A comparisons
• So we need an algorithm that reduces this cost → Apriori
Apriori Algorithm
• Used to find frequent itemsets
• Uses candidate generation and prior knowledge
• Level-wise search driven by a minimum-support threshold
• Apriori property: all nonempty subsets of a frequent itemset must also be frequent
• At level k the frequent k-itemsets are found; these are then used to explore the (k+1)-itemsets
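A compact sketch of the level-wise search (my own simplification: the join step naively unions two frequent itemsets into a (k+1)-candidate rather than using an optimized prefix join, and the Apriori property prunes candidates with an infrequent subset):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (frozensets) with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup}
    frequent = set(level)
    k = 1
    while level:
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets,
        # then prune candidates with any infrequent (k-1)-subset (Apriori property).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Support counting against the transactions.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= minsup}
        frequent |= level
    return frequent

transactions = [{"Milk", "Bread", "Meat"}, {"Sugar", "Bread", "Eggs"},
                {"Milk", "Sugar", "Bread", "Butter"}, {"Bread", "Butter"}]
print(apriori(transactions, minsup=2))
```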
Apriori Algorithm Step by Step
Example
• A dataset with 11 items and 5 transactions
Example (continued)
• At level 2: without minsup pruning there would be 55 candidate 2-itemsets; with minsup, only 10 remain
Example (continued)
• Combine frequent k-itemsets to generate (k+1)-itemsets; a pair joins only if its union has exactly k+1 items:
  {I1,I4} + {I2,I4} → {I1,I2,I4} ✓
  {I1,I4} + {I2,I5} → {I1,I2,I4,I5} ✗
Without the Apriori algorithm:
C(11,1) + C(11,2) + C(11,3) = 11 + 55 + 165 = 231 candidate itemsets
Using the Apriori algorithm:
11 + 10 + 6 = 27 candidates
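A quick check of those counts with Python's standard library (my own verification, not part of the slides):

```python
from math import comb

# Brute force: all 1-, 2-, and 3-itemsets over 11 items.
print(comb(11, 1) + comb(11, 2) + comb(11, 3))   # 231
# Apriori: candidates actually generated level by level in this example.
print(11 + 10 + 6)                                # 27
```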
Association Rules
• Once the frequent itemsets are known, it is time to find the association rules
• Each frequent k-itemset yields 2^k − 2 candidate rules (every split into a nonempty antecedent and a nonempty consequent)
• Each candidate rule must still satisfy the minimum-confidence threshold
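A short sketch (mine, reusing the four-transaction example from earlier) that enumerates the 2^k − 2 candidate rules of one frequent itemset and keeps those meeting a minimum confidence:

```python
from itertools import combinations

transactions = [{"Milk", "Bread", "Meat"}, {"Sugar", "Bread", "Eggs"},
                {"Milk", "Sugar", "Bread", "Butter"}, {"Bread", "Butter"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_from(frequent_itemset, minconf):
    """Yield every rule X -> Y with X ∪ Y = frequent_itemset, X and Y nonempty, conf >= minconf."""
    items = frozenset(frequent_itemset)
    for r in range(1, len(items)):                       # antecedent sizes 1 .. k-1
        for antecedent in map(frozenset, combinations(items, r)):
            consequent = items - antecedent
            conf = support(items) / support(antecedent)
            if conf >= minconf:
                yield antecedent, consequent, conf

for x, y, c in rules_from({"Milk", "Bread"}, minconf=0.6):
    print(set(x), "->", set(y), f"confidence={c:.0%}")
```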
Regression
Regression
• Unlike the previous methods, which predict discrete class labels, regression predicts a numeric, continuous value
• Used for prediction
• Models the relation between independent and dependent variables
Linear Regression
• Equation: y = w0 + w1·x1 + … + wp·xp
• Simple equation (one predictor): y = w0 + w1·x
• Error: the sum of squared residuals, Σᵢ (yᵢ − (w0 + w1·xᵢ))²
• w0 and w1 (least-squares estimates): w1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)², and w0 = ȳ − w1·x̄
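A small numpy sketch of those closed-form estimates (the data points are made up purely for illustration):

```python
import numpy as np

# Hypothetical data: years of experience vs. salary (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

# Least-squares estimates for y = w0 + w1 * x.
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w0, w1)                    # intercept and slope
print(w0 + w1 * 6.0)             # predict y for a new x = 6
```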
Example
Regression, Continued
• The correlation coefficient can be used to check whether the class label (dependent variable) is actually related to an attribute
• Many nonlinear regressions can be converted to linear ones
• Logistic regression, a generalized linear model, models the class probability
• Decision trees extend to regression trees by predicting continuous values rather than class labels
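For the first point, numpy's corrcoef gives the Pearson correlation coefficient directly; a quick sketch with made-up values:

```python
import numpy as np

attribute = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
label     = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(attribute, label)[0, 1]
print(r)   # close to 1.0 -> the attribute is strongly (linearly) related to the label
```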
3. Complementary Information
Data Mining Tools, Usage, and Types
Tools
Usage
Types
DM Tools
• Business Software:
  ▸ IBM Intelligent Miner
  ▸ SAS Enterprise Miner
  ▸ Microsoft SQL Server 2005
  ▸ SPSS Clementine
  ▸ …
• Open Source Software:
  ▸ Rapid-I RapidMiner
  ▸ Weka
  ▸ …
DM Usage
• Banking
  ▸ Financial issues, granting loans, and other financial services
  ▸ High-quality data
  ▸ Reducing risks: money laundering and financial damage
• Marketing
  ▸ Massive, fast-growing data; e-commerce
  ▸ Shopping patterns, service quality, customer satisfaction
  ▸ Advertising and discounts
• Bioinformatics
  ▸ Laboratory information; protein structures (genes)
  ▸ Massive numbers of sequences
  ▸ Need for accurate computer algorithms to analyze them
DM Types
Text Mining
• No tables: books, articles, free text
• Semi-structured data
• Information retrieval and databases
• Keywords
• Massive volumes of data and text
Web Mining
• Massive unstructured, semi-structured, and multimedia data
• Links, advertisements
• Poor-quality, constantly changing data
• Web structure, web content, and web usage mining
• Search engines
Multimedia Mining
• Voice, image, video
• Nature of the data
• Keywords, or patterns and shapes
Graph Mining
• Electronic circuits, images, the web, and more
• Graph search algorithms
• Graph difference and indexing
• Social network analysis
Spatial Mining
• Medical images, VLSI layers
• Location-based
• Efficient techniques
4. Conclusion
Challenges and Conclusion
Challenges
• Individual or single-purpose systems
• Scalable and interactive systems
• Standardization of data mining languages
• Complex data
• Distributed and real-time data mining
Review
Introduction
• Data Mining
• Knowledge Discovery
• DM Methods
DM Methods
• Classification
• Clustering
• Association Rules
• Regression
Complementary Information
• DM Tools
• DM Usage
• DM Types
Conclusion
• Challenges
References
• C. M. Bishop; Pattern Recognition and Machine Learning; Springer, 2006.
• Jiawei Han, Micheline Kamber; Data Mining: Concepts and Techniques, Second Edition; Elsevier Inc., 2006.
• J. Fürnkranz et al.; Foundations of Rule Learning (Cognitive Technologies); Springer-Verlag Berlin Heidelberg, 2012.
• Abraham Silberschatz, Henry F. Korth, S. Sudarshan; Database System Concepts, Sixth Edition; McGraw-Hill, 2010.
• Mehdi Esmaeili; Data Mining Concepts and Techniques (in Persian); Niaz-e Danesh, Tir 1391 (June/July 2012).
Thanks!
Any questions?
You can find us at z.poorbahman1@yahoo.com & bs.motavali@yahoo.com
😉