Direct Mining of Discriminative Patterns for Classifying
Uncertain Data
Chuancong Gao, Jianyong Wang
Database Laboratory
Department of Computer Science and Technology
Tsinghua University, Beijing, China
C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 1 / 1
Classification on Certain Data
Example:
A toy example of a certain categorical dataset containing 4 classes.
Evaluation Price Looking Tech. Spec. Quality
Unacceptable + - / -
Acceptable / - / /
Good - + / /
Very Good / + + +
(+: Good, /: Medium, -: Bad)
A Lot of Methods:
Decision Tree - C4.5, etc.
Rule-based Classifier - Ripper, etc.
Associative Classification (Pattern-based Classification) - CBA,
RCBT, HARMONY, DDPMine, MbT, etc. (Better Performance)
etc.
Associative Classification
Two-Step Framework:
Mine a set of frequent patterns.
Select a subset of most discriminative patterns from the mined
patterns.
One-Step Framework:
Directly mine a set of most discriminative frequent patterns.
After Having Discriminative Patterns:
Convert each pattern to a binary feature: Whether the instance
contains the pattern.
Train a classifier using the feature data converted from training data.
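The conversion step described above can be sketched in a few lines. As an assumption for illustration only, instances and patterns are modeled as sets of (attribute, value) items:

```python
def to_binary_features(instances, patterns):
    # Each mined pattern becomes one binary feature: 1 iff the
    # instance contains (i.e., is a superset of) the pattern.
    return [[1 if pattern <= inst else 0 for pattern in patterns]
            for inst in instances]
```

The resulting 0/1 matrix is what gets handed to an off-the-shelf classifier such as an SVM.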
Associative Classification
Main Differences between Algorithms:
Different types of mined pattern - All, Closed, Generator, etc.
(Two-Step Framework)
Different discriminative measures - Confidence (CBA, HARMONY),
Information Gain (DDPMine) (Better Performance), Fisher Score, etc.
Different instance covering strategies - Sequential Covering
(DDPMine), Top-K (RCBT), Search Tree-based Partition (MbT), etc.
Different classification models - Rule-based (RCBT, HARMONY),
SVM (DDPMine, MbT) (Better Performance), Naïve Bayes, etc.
Different feature types - Binary, Numeric (New, NDPMine)
Classification on Uncertain Data
Example:
A toy example of an uncertain categorical dataset. The uncertainty is
usually caused by noise, limited measurement precision, etc.
Evaluation Price Looking Tech. Spec. Quality
Unacceptable + - / {-: 0.8, /: 0.1, +: 0.1}
Acceptable / - / {-: 0.1, /: 0.8, +: 0.1}
Good - + / {-: 0.1, /: 0.8, +: 0.1}
Very Good / + + {-: 0.1, /: 0.1, +: 0.8}
(+: Good, /: Medium, -: Bad)
Very Few Methods:
Uncertain Decision Tree - C4.5-based DTU.
Uncertain Rule-based Classifier - Ripper-based uRule.
Our Solution
A new associative classification algorithm working on uncertain data.
Difference from Certain Datasets:
Patterns involving uncertain attributes appear in instances only with some
probability.
Challenges:
How to represent frequency information? - Use the expected value of
support. (Easy to calculate: sum the pattern's appearance probabilities
over all instances.)
How to represent discriminative information?
How to cover instances?
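The expected-support computation just mentioned can be sketched as follows. The encoding is an assumption for illustration (certain cells as plain values, uncertain cells as {value: probability} maps), and attribute independence is assumed:

```python
def containment_prob(instance, pattern):
    # P(pattern is contained in instance): product over the pattern's
    # (attribute, value) pairs, assuming independent attributes.
    p = 1.0
    for attr, val in pattern.items():
        cell = instance[attr]
        if isinstance(cell, dict):          # uncertain attribute
            p *= cell.get(val, 0.0)
        else:                               # certain attribute
            p *= 1.0 if cell == val else 0.0
    return p

def expected_support(instances, pattern):
    # By linearity of expectation, the expected support is just the
    # sum of the per-instance containment probabilities.
    return sum(containment_prob(t, pattern) for t in instances)
```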
Discriminative Measures on Uncertain Data
Choose to use the expected value of confidence. Unlike expected support,
expected confidence is hard to calculate.
Definition of Expected Confidence:
Given a set of transactions T and the set of possible worlds W w.r.t. T,
the expected confidence of an itemset x on class c is

    E(conf_x^c) = \sum_{w_i \in W} conf_{x,w_i}^c \times P(w_i)
                = \sum_{w_i \in W} \frac{sup_{x,w_i}^c}{sup_{x,w_i}} \times P(w_i)

where P(w_i) is the probability of world w_i, conf_{x,w_i}^c is the
respective confidence of x on class c in world w_i, and sup_{x,w_i}
(sup_{x,w_i}^c) is the respective support of x (on class c) in world w_i.
There are O(\prod_{A_k \in A_u} |dom_{A_k}|^{|T|}) possible worlds.
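Taken literally, the definition enumerates possible worlds. For confidence only the pattern's presence or absence in each transaction matters, so the sketch below gets away with 2^|T| containment worlds; this is still exponential, which is why the later slides develop a dynamic program instead. The per-transaction containment probabilities are assumed independent:

```python
from itertools import product

def expected_confidence_bruteforce(probs, labels, c):
    # probs[n] = P(x contained in t_n); labels[n] = class of t_n.
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        p_w = 1.0                            # P(w_i)
        for present, p in zip(world, probs):
            p_w *= p if present else 1.0 - p
        sup = sum(world)
        sup_c = sum(w for w, lbl in zip(world, labels) if lbl == c)
        if sup > 0:                          # confidence undefined at sup = 0
            total += p_w * sup_c / sup
    return total
```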
Efficient Computation of Expected Confidence
Lemma:
Since 0 \le sup_x^c \le sup_x \le |T|, we have:

    E(conf_x^c) = \sum_{w_i \in W} conf_{x,w_i}^c \times P(w_i)
                = \sum_{i=0}^{|T|} \sum_{j=0}^{i} \frac{j}{i} \times P(sup_x = i \wedge sup_x^c = j)
                = \sum_{i=0}^{|T|} \frac{E_i(sup_x^c)}{i}
                = \sum_{i=0}^{|T|} E_i(conf_x^c)

where E_i(sup_x^c) and E_i(conf_x^c) denote the parts of the expected
support and confidence of itemset x on class c when sup_x = i.
Efficient Computation of Expected Confidence
Given 0 \le n \le |T|, define E_n(sup_x^c) = \sum_{i=0}^{|T|} E_{i,n}(sup_x^c)
as the expected support of x on class c on the first n transactions of T, and
E_{i,n}(sup_x^c) as the expected support of x on class c with support of i on
the first n transactions of T.
Denoting P(x \subseteq t_i) as p_i for each transaction t_i \in T, we have

    E_{i,n}(sup_x^c) = p_n \times E_{i-1,n-1}(sup_x^c) + (1 - p_n) \times E_{i,n-1}(sup_x^c)

when c_n \ne c, and

    E_{i,n}(sup_x^c) = p_n \times E_{i-1,n-1}(sup_x^c + 1) + (1 - p_n) \times E_{i,n-1}(sup_x^c)

when c_n = c, where 1 \le i \le n \le |T|. Also,

    E_{i,n}(sup_x^c) = 0

for all n where i = 0, or where n < i.
Efficient Computation of Expected Confidence
Defining P_{i,n} as the probability of x having support of i on the first n
transactions of T, we have

    E_{i,n}(sup_x^c) = p_n \times E_{i-1,n-1}(sup_x^c + 1) + (1 - p_n) \times E_{i,n-1}(sup_x^c)
                     = p_n \times (E_{i-1,n-1}(sup_x^c) + P_{i-1,n-1}) + (1 - p_n) \times E_{i,n-1}(sup_x^c)

when c_n = c, since we have:

    E_{i-1,n-1}(sup_x^c + 1) = E_{i-1,n-1}(sup_x^c) + P_{i-1,n-1}
Efficient Computation of Expected Confidence
Denoting P(x \subseteq t_i) as p_i for each transaction t_i \in T, we have

    P_{i,n} = p_n \times P_{i-1,n-1} + (1 - p_n) \times P_{i,n-1}

where 1 \le i \le n \le |T|. For i = 0:

    P_{i,n} = 1 for n = 0, and P_{i,n} = P_{i,n-1} \times (1 - p_n) for 1 \le n \le |T|.

Finally, P_{i,n} = 0 where n < i.
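The P_{i,n} recurrence is the classic dynamic program for a Poisson-binomial distribution; a sketch that keeps a single row and updates it in place, iterating i downward so that the old P_{i-1,n-1} value is still available:

```python
def support_distribution(probs):
    # P[i] ends up as P_{i,|T|}: the probability that the pattern's
    # support over all |T| transactions equals exactly i.
    N = len(probs)
    P = [0.0] * (N + 1)
    P[0] = 1.0                               # P_{0,0} = 1
    for n, p in enumerate(probs, start=1):
        for i in range(n, 0, -1):            # downward: keep old P[i-1]
            P[i] = p * P[i - 1] + (1.0 - p) * P[i]
        P[0] *= 1.0 - p                      # P_{0,n} = P_{0,n-1} * (1 - p_n)
    return P
```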
Efficient Computation of Expected Confidence
Since E(conf_x^c) = E_{|T|}(conf_x^c) = \sum_{i=0}^{|T|} E_{i,|T|}(conf_x^c),
the computation is divided into |T| + 1 steps, with
E_{i,|T|}(conf_x^c) = E_{i,|T|}(sup_x^c) / i (0 \le i \le |T|) computed in the
i-th step.
[Diagram: the dynamic-programming table over support i (rows, 0..|T|) and
transaction count n (columns, 0..|T|), showing the cells computed in one step
and the start of the next step.]
Time Complexity: O(|T|^2)
Space Complexity: O(|T|)
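Putting the two recurrences together gives the O(|T|^2)-time, O(|T|)-space computation. The sketch below keeps one row each for P_{i,n} and E_{i,n}(sup_x^c) and updates them in place; the exact bookkeeping in the paper may differ:

```python
def expected_confidence(probs, labels, c):
    # probs[n] = P(x contained in t_n); labels[n] = class of t_n.
    N = len(probs)
    P = [0.0] * (N + 1)   # P[i] = P_{i,n}
    E = [0.0] * (N + 1)   # E[i] = E_{i,n}(sup_x^c); E[0] stays 0
    P[0] = 1.0
    for n in range(N):
        p, match = probs[n], labels[n] == c
        for i in range(n + 1, 0, -1):        # downward: keep (n-1)-values
            if match:                        # c_n = c: the "+1" case
                E[i] = p * (E[i - 1] + P[i - 1]) + (1.0 - p) * E[i]
            else:                            # c_n != c
                E[i] = p * E[i - 1] + (1.0 - p) * E[i]
            P[i] = p * P[i - 1] + (1.0 - p) * P[i]
        P[0] *= 1.0 - p
    # E(conf_x^c) = sum over i of E_{i,|T|}(sup_x^c) / i
    return sum(E[i] / i for i in range(1, N + 1))
```

Only two length-(|T| + 1) arrays are live at any time, matching the stated O(|T|) space.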
Upper Bounds of Expected Confidence
For all i (1 \le i \le |T|), we have

    E(conf_x^c) = E_{|T|}(conf_x^c)
                = \sum_{k=0}^{i-1} \frac{E_{k,|T|}(sup_x^c)}{k} + \sum_{k=i}^{|T|} \frac{E_{k,|T|}(sup_x^c)}{k}
                \le \sum_{k=0}^{i-1} \frac{E_{k,|T|}(sup_x^c)}{k} + \sum_{k=i}^{|T|} \frac{E_{k,|T|}(sup_x^c)}{i}
                = \sum_{k=0}^{i-1} \frac{E_{k,|T|}(sup_x^c)}{k} + \sum_{k=0}^{|T|} \frac{E_{k,|T|}(sup_x^c)}{i} - \sum_{k=0}^{i-1} \frac{E_{k,|T|}(sup_x^c)}{i}
                = \sum_{k=0}^{i-1} E_{k,|T|}(sup_x^c) \times \left( \frac{1}{k} - \frac{1}{i} \right) + \frac{E(sup_x^c)}{i}
                = bound_i(conf_x^c)
Upper Bounds of Expected Confidence
For 1 \le i \le |T|, we have:

    E(sup_x^c) = bound_1(conf_x^c) \ge \cdots \ge bound_i(conf_x^c) \ge \cdots \ge bound_{|T|}(conf_x^c) = E(conf_x^c)

Since

    bound_i(conf_x^c) = bound_{i-1}(conf_x^c) - \left( \frac{1}{i-1} - \frac{1}{i} \right) \times \left( E(sup_x^c) - \sum_{k=0}^{i-1} E_{k,|T|}(sup_x^c) \right)

we can compute bound_i(conf_x^c) incrementally from bound_{i-1}(conf_x^c).
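Given the final array E[i] = E_{i,|T|}(sup_x^c) produced by the recurrences on the earlier slides, the whole chain of bounds falls out of the incremental recurrence above. Wiring the bounds to that array this way is an assumption about how the pieces fit together:

```python
def confidence_bounds(E):
    # E[i] = E_{i,|T|}(sup_x^c) for i = 0..|T| (E[0] is 0).
    S = sum(E)                               # E(sup_x^c)
    bounds = [S]                             # bound_1 = E(sup_x^c)
    prefix = E[0]                            # running sum of E[k] for k < i
    for i in range(2, len(E)):
        prefix += E[i - 1]                   # now sum of E[k] for k = 0..i-1
        bounds.append(bounds[-1]
                      - (1.0 / (i - 1) - 1.0 / i) * (S - prefix))
    return bounds                            # non-increasing; last = E(conf_x^c)
```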
Upper Bounds of Expected Confidence
[Diagram: the dynamic-programming table over support i and transaction count
n, marking the computations skipped once the stop condition holds.]
Stop Condition: bound_i(conf_x^c) \le conf_max^{cur\_db_c}
Upper Bounds of Expected Confidence
Running Example:
[Plot: bound_i(conf_x^c) vs. support i, decreasing toward conf_x^c; the
bounds past the stop point are skipped.]
Stop when bound_i(conf_x^c) \le conf_max^{cur\_db_c}.
Algorithm Framework
1. Calculate the (expected) confidence of the current prefix pattern.
2. If the confidence value is larger than the previous maximal value, update
the covered instances.
3. If at least one instance is covered, select the prefix.
4. Continue growing the current prefix, and go to Step 1.
All the uncertain attributes need to be sorted after the certain attributes,
which helps shrink the current projected database.
Instance Covering Strategy
Previous Strategy in HARMONY:
Find just the single most discriminative covering pattern (the one with the
highest confidence) for each instance. On uncertain data, the probability of
the instance actually being covered could be very low.
Our Method:
Apply a minimum cover probability threshold coverProb_min. Ensure that the
probability of each instance not being covered by any pattern is less than
1 - coverProb_min, by maintaining a list of the confidence values of the
covering patterns on class c in descending order.
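A minimal sketch of the covering test, assuming pattern occurrences in an instance are treated independently; the actual list-based bookkeeping in the paper is more involved:

```python
def is_covered(cover_probs, cover_prob_min):
    # cover_probs: P(pattern contained in instance) for the instance's
    # covering patterns, highest-confidence patterns first.
    uncovered = 1.0                          # P(no covering pattern appears)
    for p in cover_probs:
        uncovered *= 1.0 - p
        if uncovered < 1.0 - cover_prob_min:
            return True                      # covered with prob > coverProb_min
    return False
```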
Used Classifiers
SVM Classifier
Convert each pattern to a binary feature indicating whether it is contained
in the instance.
Rule-based Classifier (from HARMONY)
For each test instance, sum up the product of each pattern's confidence on
each class and the probability that the instance contains the pattern. The
class with the largest score is the predicted class of the instance.
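The rule-based prediction just described, sketched; the input encodings (a list of containment probabilities and per-pattern confidence maps) are assumptions for illustration:

```python
def predict(contain_probs, pattern_confs):
    # contain_probs[j]      = P(pattern_j contained in the test instance)
    # pattern_confs[j][cls] = (expected) confidence of pattern_j on class cls
    scores = {}
    for p, confs in zip(contain_probs, pattern_confs):
        for cls, conf in confs.items():
            scores[cls] = scores.get(cls, 0.0) + conf * p
    return max(scores, key=scores.get)       # class with the largest score
```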
Used Datasets
Dataset #Instance #Attribute #Class Area
australian 690 14 2 Financial
balance 635 4 3 Social
bands 539 38 2 Physical
breast 699 9 2 Life
bridges-v1 106 11 6 N/A
bridges-v2 106 10 6 N/A
car 1728 6 4 N/A
contraceptive 1473 9 3 Life
credit 690 15 2 Financial
echocardiogram 131 12 2 Life
flag 194 28 8 N/A
german 1000 19 2 Financial
heart 920 13 5 Life
hepatitis 155 19 2 Life
horse 368 27 2 Life
monks-1 556 6 2 N/A
monks-2 601 6 2 N/A
monks-3 554 6 2 N/A
mushroom 8124 22 2 Life
pima 768 8 2 Life
postoperative 90 8 3 Life
promoters 106 57 2 Life
spect 267 22 2 Life
survival 306 3 2 Life
ta eval 151 5 3 N/A
tic-tac-toe 958 9 2 Game
vehicle 846 18 4 N/A
voting 435 16 2 Social
wine 178 13 3 Physical
zoo 101 16 7 Life
30 public certain datasets from the UCI repository. Real-valued attributes
have been discretized.
Convert to Uncertain Datasets
Two parameters:
Uncertain Attribute Number:
The number of attributes selected to be converted to uncertain. Those with
the highest information gain values are selected.
Uncertain Degree (0 - 1):
The probability that an attribute takes values other than its original value.
A configuration is denoted Ux@y, where x is the uncertain degree and y is the
uncertain attribute number.
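One plausible reading of the conversion, consistent with the earlier toy table (where the original value keeps 0.8 and the two other values get 0.1 each): keep the original value with probability 1 - degree and spread the remainder evenly over the rest of the domain. This exact spreading scheme is an assumption:

```python
def make_uncertain(value, domain, degree):
    # Turn a certain categorical value into a {value: prob} distribution:
    # the original value keeps probability 1 - degree, and `degree` is
    # split evenly among the other domain values (assumed scheme).
    others = [v for v in domain if v != value]
    dist = {value: 1.0 - degree}
    for v in others:
        dist[v] = degree / len(others)
    return dist
```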
Accuracy Evaluation
Average accuracies on the 30 datasets. For per-dataset accuracies, refer to
our paper.
Using SVM Classifier:
Setting uHARMONY DTU uRule
U10@1 79.0138 74.8738 75.2111
U10@2 78.6970 73.1629 73.4107
U10@4 77.9657 72.2670 69.4649
U20@1 78.9537 74.6577 74.6287
U20@2 78.6073 72.5642 72.5460
U20@4 77.8352 69.9157 68.2066
Using Rule-based Classifier:
Setting uHARMONY (rule-based) DTU uRule
U10@4 73.2517 72.2670 69.4649
Sensitivity Test
Figure: Accuracy Evaluation of U10@1 w.r.t. Minimum Support
[Plots: Accuracy (in %) vs. Minimum Support (in %), panels (a) breast and
(b) wine.]
Sensitivity Test
Figure: Accuracy Evaluation of U10@1 w.r.t. Minimum Cover Prob.
[Plots: Accuracy (in %) vs. Minimum Cover Prob. (in %), panels (a) breast and
(b) wine.]
Runtime Efficiency
Figure: Classification Efficiency Evaluation of U10@1
[Plot: Running Time (in sec) of uHarmony, DTU, and uRule on breast, car,
contraceptive, heart, pima, and wine.]
Runtime Efficiency
Figure: Classification Efficiency Evaluation of U10@1
[Plot: Memory Use (in MB) of uHarmony, DTU, and uRule on breast, car,
contraceptive, heart, pima, and wine.]
Effectiveness of the Expected Confidence Upper Bound
Figure: Running Time Evaluation of U10@4
[Plots: Running Time (in sec) vs. Minimum Support (in %), panels (a) car and
(b) heart, with and without the expected confidence bound.]
Scalability Test
Figure: Scalability Evaluation (U10@1, Running Time)
[Plots: Running Time (in sec) vs. Dataset Duplication Ratio, panels
(a) car (sup_min = 0.01) and (b) heart (sup_min = 0.01), with and without the
expected confidence bound.]
Conclusions
Proposed the first associative classification algorithm for uncertain
data.
Proposed an efficient computation of the expected confidence,
together with the computation of its upper bounds.
Proposed a new instance covering strategy and verified its
effectiveness.
Conducted an extensive evaluation on 30 public real-world datasets
under varying uncertainty parameters, achieving significant accuracy
improvements compared with two other state-of-the-art algorithms.
Evaluated the runtime efficiency and demonstrated the effectiveness
of the upper bounds.
C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 29 / 1
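The efficient expected-confidence computation summarized above can be illustrated with a small dynamic program over transactions, which avoids enumerating all possible worlds by tracking the joint distribution of support and per-class support. This is a minimal sketch under assumed inputs (per-transaction appearance probabilities and class labels are hypothetical), not the paper's exact algorithm:

```python
from collections import defaultdict

def expected_confidence(probs, labels, target):
    """Expected confidence of a pattern for class `target`.

    probs[t]  : probability that the pattern appears in transaction t
    labels[t] : class label of transaction t
    Enumerating all 2^|T| possible worlds is infeasible, so we keep a
    DP table dist[(i, j)] = P(sup = i and sup_target = j) and extend
    it one transaction at a time.
    """
    dist = {(0, 0): 1.0}
    for p, lab in zip(probs, labels):
        nxt = defaultdict(float)
        for (i, j), w in dist.items():
            nxt[(i, j)] += w * (1.0 - p)                 # pattern absent
            nxt[(i + 1, j + (lab == target))] += w * p   # pattern present
        dist = dict(nxt)
    # E(conf) = sum over worlds of (sup_c / sup) * P(world); sup = 0 adds 0
    return sum(j / i * w for (i, j), w in dist.items() if i > 0)

# Toy data: 4 transactions with hypothetical probabilities and labels
probs = [0.8, 0.5, 0.9, 0.3]
labels = ["a", "a", "b", "b"]
print(expected_confidence(probs, labels, "a"))
```

The DP has O(|T|^2) states instead of O(2^|T|) worlds; the paper's formulation additionally derives upper bounds on this quantity for pruning.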
The End
Thank you for listening!
Questions or Comments?
C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 30 / 1


KDD 2010 - Direct mining of discriminative patterns for classifying uncertain data

  • 1. Direct Mining of Discriminative Patterns for Classifying Uncertain Data Chuancong Gao, Jianyong Wang Database Laboratory Department of Computer Science and Technology Tsinghua University, Beijing, China C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 1 / 1
  • 2. Classification on Certain Data Example: A toy example about certain categorical dataset containing 4 classes. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / - Acceptable / - / / Good - + / / Very Good / + + + (+: Good, /: Medium, -: Bad) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 2 / 1
  • 3. Classification on Certain Data Example: A toy example about certain categorical dataset containing 4 classes. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / - Acceptable / - / / Good - + / / Very Good / + + + (+: Good, /: Medium, -: Bad) A Lot of Methods: Decision Tree - C4.5, etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 2 / 1
  • 4. Classification on Certain Data Example: A toy example about certain categorical dataset containing 4 classes. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / - Acceptable / - / / Good - + / / Very Good / + + + (+: Good, /: Medium, -: Bad) A Lot of Methods: Decision Tree - C4.5, etc. Rule-based Classifier - Ripper, etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 2 / 1
  • 5. Classification on Certain Data Example: A toy example about certain categorical dataset containing 4 classes. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / - Acceptable / - / / Good - + / / Very Good / + + + (+: Good, /: Medium, -: Bad) A Lot of Methods: Decision Tree - C4.5, etc. Rule-based Classifier - Ripper, etc. Associative Classification (Pattern-based Classification) - CBA, RCBT, HARMONY, DDPMine, MbT, etc. (Better Performance) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 2 / 1
  • 6. Classification on Certain Data Example: A toy example about certain categorical dataset containing 4 classes. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / - Acceptable / - / / Good - + / / Very Good / + + + (+: Good, /: Medium, -: Bad) A Lot of Methods: Decision Tree - C4.5, etc. Rule-based Classifier - Ripper, etc. Associative Classification (Pattern-based Classification) - CBA, RCBT, HARMONY, DDPMine, MbT, etc. (Better Performance) etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 2 / 1
  • 7. Associative Classification Two-Step Framework: Mine a set of frequent patterns. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 3 / 1
  • 8. Associative Classification Two-Step Framework: Mine a set of frequent patterns. Select a subset of most discriminative patterns from the mined patterns. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 3 / 1
  • 9. Associative Classification Two-Step Framework: Mine a set of frequent patterns. Select a subset of most discriminative patterns from the mined patterns. One-Step Framework: Directly mine a set of most discriminative frequent patterns. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 3 / 1
  • 10. Associative Classification Two-Step Framework: Mine a set of frequent patterns. Select a subset of most discriminative patterns from the mined patterns. One-Step Framework: Directly mine a set of most discriminative frequent patterns. After Having Discriminative Patterns: Convert each pattern to a binary feature: Whether the instance contains the pattern. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 3 / 1
  • 11. Associative Classification Two-Step Framework: Mine a set of frequent patterns. Select a subset of most discriminative patterns from the mined patterns. One-Step Framework: Directly mine a set of most discriminative frequent patterns. After Having Discriminative Patterns: Convert each pattern to a binary feature: Whether the instance contains the pattern. Train a classifier using the feature data converted from training data. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 3 / 1
  • 12. Associative Classification Mainly Differences between Different Algorithms: Different types of mined pattern - All, Closed, Generator, etc. (Two-Step Framework) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 4 / 1
  • 13. Associative Classification Mainly Differences between Different Algorithms: Different types of mined pattern - All, Closed, Generator, etc. (Two-Step Framework) Different discriminative measures - Confidence (CBA, HARMONY), Information Gain (DDPMine) (Better Performance), Fisher Score, etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 4 / 1
  • 14. Associative Classification Mainly Differences between Different Algorithms: Different types of mined pattern - All, Closed, Generator, etc. (Two-Step Framework) Different discriminative measures - Confidence (CBA, HARMONY), Information Gain (DDPMine) (Better Performance), Fisher Score, etc. Different instance covering strategies - Sequential Covering (DDPMine), Top-K (RCBT), Search Tree-based Partition (MbT), etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 4 / 1
  • 15. Associative Classification Mainly Differences between Different Algorithms: Different types of mined pattern - All, Closed, Generator, etc. (Two-Step Framework) Different discriminative measures - Confidence (CBA, HARMONY), Information Gain (DDPMine) (Better Performance), Fisher Score, etc. Different instance covering strategies - Sequential Covering (DDPMine), Top-K (RCBT), Search Tree-based Partition (MbT), etc. Different classification models - Rule-based (RCBT, HARMONY), SVM (DDPMine, MbT) (Better Performance), Na¨ıve Bayes, etc. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 4 / 1
  • 16. Associative Classification Mainly Differences between Different Algorithms: Different types of mined pattern - All, Closed, Generator, etc. (Two-Step Framework) Different discriminative measures - Confidence (CBA, HARMONY), Information Gain (DDPMine) (Better Performance), Fisher Score, etc. Different instance covering strategies - Sequential Covering (DDPMine), Top-K (RCBT), Search Tree-based Partition (MbT), etc. Different classification models - Rule-based (RCBT, HARMONY), SVM (DDPMine, MbT) (Better Performance), Na¨ıve Bayes, etc. Different feature types - Binary, Numeric (New, NDPMine) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 4 / 1
  • 17. Classification on Uncertain Data Example: A toy example about uncertain categorical dataset. The uncertainty usually is caused by noise, measurement precisions, etc. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / {-: 0.8, /: 0.1, +: 0.1} Acceptable / - / {-: 0.1, /: 0.8, +: 0.1} Good - + / {-: 0.1, /: 0.8, +: 0.1} Very Good / + + {-: 0.1, /: 0.1, +: 0.8} (+: Good, /: Medium, -: Bad) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 5 / 1
  • 18. Classification on Uncertain Data Example: A toy example about uncertain categorical dataset. The uncertainty usually is caused by noise, measurement precisions, etc. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / {-: 0.8, /: 0.1, +: 0.1} Acceptable / - / {-: 0.1, /: 0.8, +: 0.1} Good - + / {-: 0.1, /: 0.8, +: 0.1} Very Good / + + {-: 0.1, /: 0.1, +: 0.8} (+: Good, /: Medium, -: Bad) Very Few Methods: Uncertain Decision Tree - C4.5-based DTU. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 5 / 1
  • 19. Classification on Uncertain Data Example: A toy example about uncertain categorical dataset. The uncertainty usually is caused by noise, measurement precisions, etc. Evaluation Price Looking Tech. Spec. Quality Unacceptable + - / {-: 0.8, /: 0.1, +: 0.1} Acceptable / - / {-: 0.1, /: 0.8, +: 0.1} Good - + / {-: 0.1, /: 0.8, +: 0.1} Very Good / + + {-: 0.1, /: 0.1, +: 0.8} (+: Good, /: Medium, -: Bad) Very Few Methods: Uncertain Decision Tree - C4.5-based DTU. Uncertain Rule-based Classifier - Ripper-based uRule C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 5 / 1
  • 20. Our Solution A new associative classification algorithm working on uncertain data. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 6 / 1
  • 21. Our Solution A new associative classification algorithm working on uncertain data. Difference to Certain Dataset: Patterns involving uncertain attributes have probabilities to appear in instances. C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 6 / 1
  • 22. Our Solution A new associative classification algorithm working on uncertain data. Difference to Certain Dataset: Patterns involving uncertain attributes have probabilities to appear in instances. Challenges: How to represent frequentness information? - Using expected value of support. (Easy to calculate - Sum all the probabilities appearing in different instances) C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 6 / 1
  • 23. Our Solution A new associative classification algorithm working on uncertain data. Difference to Certain Dataset: Patterns involving uncertain attributes have probabilities to appear in instances. Challenges: How to represent frequentness information? - Using expected value of support. (Easy to calculate - Sum all the probabilities appearing in different instances) How to represent discriminative information? C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 6 / 1
  • 24. Our Solution A new associative classification algorithm working on uncertain data. Difference to Certain Dataset: Patterns involving uncertain attributes have probabilities to appear in instances. Challenges: How to represent frequentness information? - Using expected value of support. (Easy to calculate - Sum all the probabilities appearing in different instances) How to represent discriminative information? How to cover instances? C. Gao, J. Wang (Tsinghua Univ.) Direct Mining of Discriminative Patterns for Classifying Uncertain Data 6 / 1
  • 27. Discriminative Measures on Uncertain Data We choose the expected value of confidence. Unlike expected support, expected confidence is hard to calculate. Definition of Expected Confidence: Given a set of transactions T and the set of possible worlds W w.r.t. T, the expected confidence of an itemset x on class c is E(conf_x^c) = Σ_{w_i ∈ W} conf_{x,w_i}^c × P(w_i) = Σ_{w_i ∈ W} (sup_{x,w_i}^c / sup_{x,w_i}) × P(w_i), where P(w_i) is the probability of world w_i, conf_{x,w_i}^c is the respective confidence of x on class c in world w_i, and sup_{x,w_i} (sup_{x,w_i}^c) is the respective support of x (on class c) in world w_i. There are O(∏_{A_k ∈ A_u} |dom_{A_k}|^{|T|}) possible worlds.
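To make the definition concrete, here is a brute-force sketch that computes the expectation by enumerating worlds. For a single fixed pattern, each world reduces to a choice of which transactions contain the pattern, treated here as independent Bernoulli events with probability p_i (an illustrative simplification of the attribute-level possible-worlds model; all names are mine). It is exponential in |T|, which is exactly why the efficient computation on the next slides is needed:

```python
from itertools import product

def expected_confidence_bruteforce(probs, labels, c):
    """E(conf_x^c) by enumerating all 2^|T| containment worlds.
    probs[i]  = P(pattern is contained in transaction i)
    labels[i] = class label of transaction i
    Exponential in |T|: for illustration only."""
    exp_conf = 0.0
    for world in product([0, 1], repeat=len(probs)):
        # Probability of this world under independent containment events.
        p_world = 1.0
        for bit, p in zip(world, probs):
            p_world *= p if bit else (1 - p)
        sup = sum(world)
        sup_c = sum(b for b, lab in zip(world, labels) if lab == c)
        if sup > 0:  # confidence contributes nothing on empty support
            exp_conf += (sup_c / sup) * p_world
    return exp_conf
```

For example, with two transactions of classes 'a' and 'b', each containing the pattern with probability 0.5, only the worlds {t1} (conf 1) and {t1, t2} (conf 1/2) contribute, giving 0.25 × 1 + 0.25 × 0.5 = 0.375.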
  • 28. Efficient Computation of Expected Confidence Lemma: Since 0 ≤ sup_x^c ≤ sup_x ≤ |T|, we have E(conf_x^c) = Σ_{w_i ∈ W} conf_{x,w_i}^c × P(w_i) = Σ_{i=0}^{|T|} Σ_{j=0}^{i} (j / i) × P(sup_x = i ∧ sup_x^c = j) = Σ_{i=0}^{|T|} E_i(sup_x^c) / i = Σ_{i=0}^{|T|} E_i(conf_x^c), where E_i(sup_x^c) and E_i(conf_x^c) denote the parts of the expected support and confidence of itemset x on class c contributed by the worlds with sup_x = i.
  • 32. Efficient Computation of Expected Confidence Given 0 ≤ n ≤ |T|, define E_n(sup_x^c) = Σ_{i=0}^{|T|} E_{i,n}(sup_x^c) as the expected support of x on class c over the first n transactions of T, and E_{i,n}(sup_x^c) as the expected support of x on class c over the first n transactions of T restricted to support exactly i. Denoting P(x ⊆ t_i) by p_i for each transaction t_i ∈ T, we have E_{i,n}(sup_x^c) = p_n × E_{i-1,n-1}(sup_x^c) + (1 − p_n) × E_{i,n-1}(sup_x^c) when c_n ≠ c, and E_{i,n}(sup_x^c) = p_n × E_{i-1,n-1}(sup_x^c + 1) + (1 − p_n) × E_{i,n-1}(sup_x^c) when c_n = c, where 1 ≤ i ≤ n ≤ |T|. E_{i,n}(sup_x^c) = 0 for all n where i = 0, and wherever n < i.
  • 34. Efficient Computation of Expected Confidence Defining P_{i,n} as the probability of x having support exactly i on the first n transactions of T, we have, when c_n = c, E_{i,n}(sup_x^c) = p_n × E_{i-1,n-1}(sup_x^c + 1) + (1 − p_n) × E_{i,n-1}(sup_x^c) = p_n × (E_{i-1,n-1}(sup_x^c) + P_{i-1,n-1}) + (1 − p_n) × E_{i,n-1}(sup_x^c), since E_{i-1,n-1}(sup_x^c + 1) = E_{i-1,n-1}(sup_x^c) + P_{i-1,n-1}.
  • 37. Efficient Computation of Expected Confidence Denoting P(x ⊆ t_i) by p_i for each transaction t_i ∈ T, we have P_{i,n} = p_n × P_{i-1,n-1} + (1 − p_n) × P_{i,n-1}, where 1 ≤ i ≤ n ≤ |T|. For i = 0: P_{0,0} = 1, and P_{0,n} = P_{0,n-1} × (1 − p_n) for 1 ≤ n ≤ |T|. P_{i,n} = 0 wherever n < i.
  • 41. Efficient Computation of Expected Confidence Since E(conf_x^c) = E_{|T|}(conf_x^c) = Σ_{i=0}^{|T|} E_{i,|T|}(conf_x^c), the computation is divided into |T| + 1 steps, with E_{i,|T|}(conf_x^c) = E_{i,|T|}(sup_x^c) / i (0 ≤ i ≤ |T|) computed in the i-th step. [Figure: the DP table over support i (rows) and transaction count n (columns), illustrating the computation within one step and the start of the next.] Time complexity: O(|T|²). Space complexity: O(|T|).
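A sketch of the O(|T|²)-time, O(|T|)-space dynamic program implied by the recurrences above, processing one transaction at a time and updating the two length-(|T|+1) arrays in place (variable names are mine; this is an illustration of the technique, not the paper's code):

```python
def expected_confidence_dp(probs, labels, c):
    """E(conf_x^c) via the recurrences for P_{i,n} and E_{i,n}(sup_x^c).
    probs[n-1]  = p_n = P(pattern contained in transaction n)
    labels[n-1] = c_n = class label of transaction n"""
    T = len(probs)
    P = [0.0] * (T + 1)  # P[i] = P_{i,n}: prob. of support exactly i so far
    E = [0.0] * (T + 1)  # E[i] = E_{i,n}(sup_x^c)
    P[0] = 1.0           # empty prefix has support 0 with probability 1
    for n in range(1, T + 1):
        p, same_class = probs[n - 1], (labels[n - 1] == c)
        # Descending i so that indices i-1 still hold row n-1 values.
        for i in range(n, 0, -1):
            E[i] = p * (E[i - 1] + (P[i - 1] if same_class else 0.0)) \
                   + (1 - p) * E[i]
            P[i] = p * P[i - 1] + (1 - p) * P[i]
        P[0] *= (1 - p)
    # E(conf_x^c) = sum over i of E_{i,|T|}(sup_x^c) / i
    return sum(E[i] / i for i in range(1, T + 1))
```

On the two-transaction example from the definition (probabilities 0.5 and 0.5, classes 'a' and 'b') this yields 0.375, matching the possible-worlds expansion.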
  • 42. Upper Bounds of Expected Confidence For all i (1 ≤ i ≤ |T|), we have E(conf_x^c) = E_{|T|}(conf_x^c) = Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c) / k + Σ_{k=i}^{|T|} E_{k,|T|}(sup_x^c) / k ≤ Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c) / k + Σ_{k=i}^{|T|} E_{k,|T|}(sup_x^c) / i = Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c) / k + Σ_{k=0}^{|T|} E_{k,|T|}(sup_x^c) / i − Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c) / i = Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c) × (1/k − 1/i) + E(sup_x^c) / i =: bound_i(conf_x^c).
  • 44. Upper Bounds of Expected Confidence For 1 ≤ i ≤ |T|, we have E(sup_x^c) = bound_1(conf_x^c) ≥ · · · ≥ bound_i(conf_x^c) ≥ · · · ≥ bound_{|T|}(conf_x^c) = E(conf_x^c). Since bound_i(conf_x^c) = bound_{i-1}(conf_x^c) − (1/(i−1) − 1/i) × (E(sup_x^c) − Σ_{k=0}^{i-1} E_{k,|T|}(sup_x^c)), each bound_i(conf_x^c) can be computed incrementally from bound_{i-1}(conf_x^c).
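Assuming the per-support values E_{k,|T|}(sup_x^c) are available (for example from the dynamic program), the whole decreasing bound sequence follows from the incremental identity above; a minimal sketch (names are mine):

```python
def confidence_bounds(E_sup_c):
    """bound_i(conf_x^c) for i = 1..|T|, where E_sup_c[k] = E_{k,|T|}(sup_x^c)
    (index 0..|T|, with E_sup_c[0] = 0 by the boundary condition).
    bound_1 = E(sup_x^c); each later bound is derived from the previous one."""
    T = len(E_sup_c) - 1
    total = sum(E_sup_c)            # E(sup_x^c)
    bounds = [total]                # bound_1
    partial = E_sup_c[0] + E_sup_c[1]   # sum_{k=0}^{i-1} E_k, for i = 2
    for i in range(2, T + 1):
        bounds.append(bounds[-1] - (1/(i - 1) - 1/i) * (total - partial))
        partial += E_sup_c[i]
    return bounds
```

On the running two-transaction example (E_{0,|T|} = 0, E_{1,|T|} = 0.25, E_{2,|T|} = 0.25), this gives bound_1 = 0.5 = E(sup_x^c) and bound_2 = 0.375 = E(conf_x^c), as the chain of inequalities predicts.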
  • 45. Upper Bounds of Expected Confidence [Figure: the DP table over support i and transaction count n, showing the computation skipped once the stop condition bound_i(conf_x^c) ≤ maxconf_{cur_db}^c holds.]
  • 46. Upper Bounds of Expected Confidence Running example: [Figure: bound_i(conf_x^c) plotted against support i; the computation stops as soon as bound_i(conf_x^c) ≤ maxconf_{cur_db}^c, and the remaining bound values are skipped.]
  • 51. Algorithm Framework 1. Calculate the (expected) confidence of the current prefix pattern. 2. If the confidence value is larger than the previous maximal value, update the covered instances. 3. If at least one instance is covered, select the prefix. 4. Continue growing the current prefix, and go to Step 1. All uncertain attributes are sorted after the certain attributes, which helps shrink the current projected database.
  • 53. Instance Covering Strategy Previous strategy in HARMONY: find only the single most discriminative covering pattern, the one with the highest confidence, for each instance. On uncertain data, the probability of the instance actually being covered could then be very low. Our method: apply a minimum cover probability threshold coverProb_min, and ensure that the probability of each instance not being covered by any pattern is less than 1 − coverProb_min, by maintaining a list of the covering patterns' confidence values on class c in descending order.
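A minimal sketch of the covering test implied by the threshold (assuming, purely for illustration, that the selected patterns' appearances in the instance are independent events; the actual algorithm works with the sorted confidence list described above, and the names are mine):

```python
def is_sufficiently_covered(pattern_probs, cover_prob_min):
    """pattern_probs[j] = P(selected pattern j is contained in the instance).
    The instance counts as covered once the probability that *no* selected
    pattern appears in it drops below 1 - coverProb_min."""
    p_uncovered = 1.0
    for p in pattern_probs:
        p_uncovered *= (1 - p)      # independence assumed in this sketch
    return p_uncovered < 1 - cover_prob_min
```

For example, a single covering pattern appearing with probability 0.3 is not enough under coverProb_min = 0.9 (the instance stays uncovered with probability 0.7), so more covering patterns would be accumulated.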
  • 55. Used Classifiers SVM Classifier: convert each pattern into a binary feature indicating whether the instance contains it. Rule-based Classifier (from HARMONY): for each test instance, sum up the products of each pattern's confidence on each class and the probability that the instance contains the pattern; the class with the largest score is the predicted class of the instance.
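The rule-based prediction rule can be sketched directly from the description above (data shapes and names are mine; they are one plausible encoding, not the paper's interface):

```python
def predict(instance_probs, rules):
    """Rule-based prediction: score(c) = sum over patterns of
    conf(pattern, c) * P(pattern contained in instance); return the argmax.
    instance_probs: dict pattern -> containment probability for this instance
    rules: list of (pattern, class_label, confidence) triples"""
    scores = {}
    for pattern, cls, conf in rules:
        scores[cls] = scores.get(cls, 0.0) \
                      + conf * instance_probs.get(pattern, 0.0)
    return max(scores, key=scores.get)

# A pattern surely present with confidence 0.9 for 'pos' outweighs a
# half-present pattern with confidence 0.8 for 'neg' (0.9 vs. 0.4).
```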
  • 57. Used Datasets 30 public UCI certain datasets (real-valued attributes have been discretized):
Dataset         #Instance  #Attribute  #Class  Area
australian      690        14          2       Financial
balance         635        4           3       Social
bands           539        38          2       Physical
breast          699        9           2       Life
bridges-v1      106        11          6       N/A
bridges-v2      106        10          6       N/A
car             1728       6           4       N/A
contraceptive   1473       9           3       Life
credit          690        15          2       Financial
echocardiogram  131        12          2       Life
flag            194        28          8       N/A
german          1000       19          2       Financial
heart           920        13          5       Life
hepatitis       155        19          2       Life
horse           368        27          2       Life
monks-1         556        6           2       N/A
monks-2         601        6           2       N/A
monks-3         554        6           2       N/A
mushroom        8124       22          2       Life
pima            768        8           2       Life
postoperative   90         8           3       Life
promoters       106        57          2       Life
spect           267        22          2       Life
survival        306        3           2       Life
ta eval         151        5           3       N/A
tic-tac-toe     958        9           2       Game
vehicle         846        18          4       N/A
voting          435        16          2       Social
wine            178        13          3       Physical
zoo             101        16          7       Life
  • 60. Convert to Uncertain Datasets Two parameters: Uncertain Attribute Number: the number of attributes selected to be converted to uncertain; those with the highest information gain values are selected. Uncertain Degree (0 - 1): the probability that the attribute takes a value other than its original one. A configuration is denoted Ux@y, where x is the uncertain degree and y is the number of uncertain attributes.
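One plausible reading of this conversion, shown as a sketch (the paper's exact redistribution scheme may differ; names are mine): the original value keeps probability 1 − degree, and the remaining mass is spread uniformly over the other values of the attribute's domain.

```python
def make_uncertain(value, domain, degree):
    """Convert a certain attribute value into a probability distribution.
    The original value keeps 1 - degree; the remaining probability mass
    is split uniformly among the other domain values (an assumption of
    this sketch)."""
    others = [v for v in domain if v != value]
    dist = {v: degree / len(others) for v in others}
    dist[value] = 1 - degree
    return dist

# e.g. value '+' over domain {'+', '/', '-'} with uncertain degree 0.1
# keeps '+' at 0.9 and gives '/' and '-' 0.05 each.
```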
  • 62. Accuracy Evaluation Average accuracies on the 30 datasets (for per-dataset accuracy, refer to our paper).
Using SVM Classifier:
Dataset  uHARMONY  DTU      uRule
U10@1    79.0138   74.8738  75.2111
U10@2    78.6970   73.1629  73.4107
U10@4    77.9657   72.2670  69.4649
U20@1    78.9537   74.6577  74.6287
U20@2    78.6073   72.5642  72.5460
U20@4    77.8352   69.9157  68.2066
Using Rule-based Classifier:
Dataset  uHARMONYrule  DTU      uRule
U10@4    73.2517       72.2670  69.4649
  • 63. Sensitivity Test [Figure: Accuracy evaluation of U10@1 w.r.t. minimum support, on (a) breast and (b) wine.]
  • 64. Sensitivity Test [Figure: Accuracy evaluation of U10@1 w.r.t. minimum cover probability, on (a) breast and (b) wine.]
  • 65. Runtime Efficiency [Figure: Classification efficiency evaluation of U10@1: running time (in sec) of uHARMONY, DTU, and uRule on breast, car, contraceptive, heart, pima, and wine.]
  • 66. Runtime Efficiency [Figure: Classification efficiency evaluation of U10@1: memory use (in MB) of uHARMONY, DTU, and uRule on breast, car, contraceptive, heart, pima, and wine.]
  • 67. Effectiveness of the Expected Confidence Upper Bound [Figure: Running time evaluation of U10@4 w.r.t. minimum support, with and without the expected confidence bound, on (a) car and (b) heart.]
  • 68. Scalability Test [Figure: Scalability evaluation (U10@1, running time) w.r.t. dataset duplication ratio, with and without the expected confidence bound, on (a) car (sup_min = 0.01) and (b) heart (sup_min = 0.01).]
  • 73. Conclusions Proposed the first associative classification algorithm for uncertain data. Proposed an efficient computation of the expected confidence value, together with the computation of its upper bounds. Proposed a new instance covering strategy and showed it to be effective. Conducted an extensive evaluation on 30 public real datasets under varying uncertainty parameters, showing significant accuracy improvements compared with two other state-of-the-art algorithms. Evaluated the runtime efficiency and demonstrated the effectiveness of using the upper bounds.
  • 74. The End Thank you for listening! Questions or comments?