Data Mining consists of extracting patterns from data, and it is the core step of a knowledge discovery process
[Figure: Data → pre-processing → data mining → post-processing → interesting patterns. Example data records: (22, M, 30K), (26, F, 55K), ... Example pattern: IF (salary = high) THEN (credit = good)]
4.
The Knowledge Discovery Process – a popular definition
“Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
(Fayyad et al. 1996)
Focus on the quality of discovered patterns
independent of the data mining algorithm
This definition is often quoted, but rarely taken seriously in practice
A lot of research on discovering valid, accurate patterns
Little research on discovering potentially useful patterns
5.
Criteria to Evaluate the “Interestingness” of Discovered Patterns
[Figure: four criteria (useful; novel, surprising; comprehensible; valid (accurate)) arranged along two scales, “Amount of Research” and “Difficulty of measurement”]
6.
On the difficulty of discovering surprising patterns in data mining
Focus on maximizing accuracy leads to very accurate but useless rules, e.g. (Brin et al. 1997) – census data:
IF (person is pregnant) THEN (gender is female)
IF (age 5) THEN (employed = no)
(Tsumoto 2000) extracted 29,050 rules from a medical dataset. Out of these, just 220 (less than 1%) were considered interesting or surprising to the user
7.
Motivation for Integrating Bayesian Networks and Simpson’s Paradox
[Figure: Bayesian network example with nodes A, B, C, D]
A Bayesian network represents potentially causal patterns, which tend to be more useful for intelligent decision making
However, algorithms for constructing Bayesian networks from data were not designed to discover surprising patterns
Simpson’s paradox is surprising by nature
Causality + Surprisingness tends to improve Usefulness
Not scalable to datasets with many variables (attributes)
Methods based on search guided by a scoring function
Iteratively create candidate solutions (Bayesian networks) and evaluate the quality of each created network using a scoring function, until a stopping criterion is satisfied
Sequential methods consider a single candidate solution at a time
Population-based methods consider many candidate solutions at a time
The B algorithm starts with an empty network and at each iteration adds, to the current candidate solution, the edge that maximizes the value of the scoring function
The K2 algorithm requires that the variables be ordered, and the user specifies a parameter: the maximum number of parents of each variable in the network to be constructed
Both are greedy methods (local search), which offer no guarantee of finding the optimal network
Population-based methods are global search methods, but are stochastic, so again no guarantees
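A minimal sketch of the greedy, edge-adding search described above, in the spirit of the B algorithm. The function names, the representation of a network as a set of (parent, child) pairs, the acyclicity check, and the stopping rule (stop when no edge improves the score) are assumptions made for illustration; `toy_score` is a hypothetical stand-in for a real scoring function computed from data.

```python
from itertools import permutations

def creates_cycle(edges, new_edge):
    """Return True if adding new_edge (parent, child) would close a directed cycle."""
    parent, child = new_edge
    # A cycle appears exactly when `parent` is already reachable from `child`.
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for a, b in edges if a == node)
    return False

def greedy_structure_search(variables, score):
    """Start from an empty network and keep adding the best-scoring single edge."""
    edges = set()
    current = score(edges)
    while True:
        best_edge, best_score = None, current
        for edge in permutations(variables, 2):
            if edge in edges or creates_cycle(edges, edge):
                continue
            candidate = score(edges | {edge})
            if candidate > best_score:
                best_edge, best_score = edge, candidate
        if best_edge is None:  # assumed stopping criterion: no edge improves the score
            return edges
        edges.add(best_edge)
        current = best_score

# Hypothetical scoring function: rewards edges of a made-up target structure.
target = {("A", "C"), ("B", "C"), ("C", "D")}

def toy_score(edges):
    return len(edges & target) - len(edges - target)

network = greedy_structure_search(["A", "B", "C", "D"], toy_score)
```

Because each step only looks one edge ahead, the search can stop at a local optimum, which is the lack of guarantee noted above.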
10.
Limitations of methods for constructing Bayesian networks from data (1)
Theoretical limitation (best possible algorithm & data)
Bayesian networks are Independence maps (I-maps) of the true probability distribution
Every independence between variables represented in the network is an actual independence in the true probability distribution
Dependences between variables represented in the network are not guaranteed to be actual dependences in the true probability distribution
11.
Limitations of methods for constructing Bayesian networks from data (2)
Practical limitations
The problem of constructing the optimal net is too complex in large datasets, so we have to use methods which do not guarantee the discovery of the optimal net
Sampling variation and/or noisy data may mislead the Bayesian network construction method, further contributing to the discovery of a sub-optimal net
Event C (“cause”) increases the probability of event E (“effect”) in a given population but, at the same time, decreases the probability of E in every subpopulation
There is no paradox in terms of probability theory; it only looks like a “paradox” under a causal interpretation
Gender is a confounder variable in the previous example
Although Simpson’s paradox is well known to statisticians, occurrences of the paradox are surprising to users
There are algorithms that systematically find instances of the paradox in data and rank them in decreasing order of surprisingness (Fabris & Freitas 2006)
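The reversal described above can be checked directly from counts. The sketch below is not the ranking algorithm of Fabris & Freitas; it only tests one (cause, effect, third variable) combination. The function names and the count layout are assumptions, and the numbers are made up purely to exhibit the reversal. Every group is assumed to be non-empty.

```python
from fractions import Fraction

def difference(with_c, without_c):
    """P(E | C) - P(E | not C), each given as an (event count, group size) pair."""
    (e_c, n_c), (e_nc, n_nc) = with_c, without_c
    return Fraction(e_c, n_c) - Fraction(e_nc, n_nc)

def is_simpson_reversal(strata):
    """strata maps each value of a third variable F to
    ((E count with C, size with C), (E count without C, size without C))."""
    pooled_c = [sum(s[0][i] for s in strata.values()) for i in (0, 1)]
    pooled_nc = [sum(s[1][i] for s in strata.values()) for i in (0, 1)]
    overall = difference(pooled_c, pooled_nc)
    if overall == 0:
        return False
    # Paradox: the direction of the association flips in every subpopulation.
    return all(difference(c, nc) * overall < 0 for c, nc in strata.values())

# Made-up counts: C raises P(E) overall (82/120 vs 38/120)
# but lowers it inside both subpopulations (0.80 vs 0.90, 0.10 vs 0.20).
example = {
    "f1": ((80, 100), (18, 20)),
    "f2": ((2, 20), (20, 100)),
}
reversal_found = is_simpson_reversal(example)
```

The reversal arises because C is concentrated in the subpopulation where E is common anyway, so the pooled comparison mixes the effect of C with the effect of the third variable.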
14.
The proposed method for integrating Bayesian networks and Simpson’s paradox
Basic Idea:
In a Bayesian network, the dependence denoted by edge C → E can be spurious, i.e., due to a confounding variable F
(for the previously discussed reasons)
Two approaches exploring this basic idea
15.
First Approach: paradox detection before network construction
First, run an algorithm that detects occurrences of Simpson’s paradox in data (Fabris & Freitas 2006)
Produces a paradox list PL
Modify Bayesian network construction algorithms to take into account this list, biasing the algorithms against including network edges involving the paradox
Consider a potential dependence represented by the edge C → E, where C is the apparent cause of effect E
If variables C, E are associated in an occurrence of Simpson’s paradox in PL, the algorithm is biased against including edge C → E in the network
16.
Consider a greedy algorithm that starts with an empty network and adds one edge to the network at a time, guided by a scoring function
FOR EACH candidate edge A → B
    compute the score of the network if A → B is added to the network
    penalize score if there is an occurrence of the paradox in list PL involving pair of variables A, B    (proposed extension)
SELECT edge with highest score and add it to the network
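One iteration of the extended greedy step might look like the sketch below. The slides do not say how the score is penalized, so subtracting a fixed `penalty` is an assumption, as are the function name, the representation of PL as (cause, effect, confounder) triples, and the hypothetical edge weights used as a score.

```python
def select_edge(current_edges, candidate_edges, score, paradox_list, penalty=1.0):
    """Pick the candidate edge with the best penalized score."""
    # Assumed PL format: (cause, effect, confounder) triples; direction is ignored.
    paradox_pairs = {frozenset((cause, effect)) for cause, effect, _ in paradox_list}
    best_edge, best_value = None, float("-inf")
    for edge in candidate_edges:
        value = score(current_edges | {edge})
        if frozenset(edge) in paradox_pairs:
            value -= penalty  # proposed extension: bias against paradox-related edges
        if value > best_value:
            best_edge, best_value = edge, value
    return best_edge

# Hypothetical per-edge weights standing in for a real scoring function.
weights = {("A", "B"): 5.0, ("A", "C"): 4.5}

def toy_score(edges):
    return sum(weights[e] for e in edges)

candidates = [("A", "B"), ("A", "C")]
plain_choice = select_edge(set(), candidates, toy_score, paradox_list=[])
biased_choice = select_edge(set(), candidates, toy_score, paradox_list=[("A", "B", "F")])
```

With an empty PL the edge A → B wins on raw score; once PL records a paradox involving A and B, the penalty tips the choice to A → C.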
17.
The same basic kind of extension can be applied to an Estimation of Distribution Algorithm
– EDA is a population-based evolutionary algorithm
– It evaluates a complete candidate solution (network) at once
FOR EACH candidate solution in the population
    compute the score of the network represented by the candidate solution
    penalize score in proportion to the number of paradox occurrences in list PL that are associated with direct dependences A → B in the network    (proposed extension)
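The population-level evaluation can be sketched as follows. Only the fitness evaluation is shown, not the sampling or model-update steps of an EDA. The linear penalty with a `weight` factor, the function names, and the (cause, effect, confounder) format of PL are assumptions for illustration.

```python
def penalized_fitness(edges, score, paradox_list, weight=0.5):
    """Score of one candidate network minus a penalty per paradox-related edge."""
    pairs_in_network = {frozenset(edge) for edge in edges}
    hits = sum(
        1
        for cause, effect, _ in paradox_list
        if frozenset((cause, effect)) in pairs_in_network
    )
    return score(edges) - weight * hits  # penalty proportional to the number of hits

def evaluate_population(population, score, paradox_list, weight=0.5):
    """Evaluate every complete candidate network in the population."""
    return [penalized_fitness(net, score, paradox_list, weight) for net in population]

# Hypothetical score (number of edges) and a made-up paradox list.
def toy_score(edges):
    return float(len(edges))

paradox_list = [("A", "B", "F"), ("C", "D", "G")]
population = [
    {("A", "B"), ("C", "D"), ("B", "C")},  # two paradox-related edges
    {("B", "C")},                          # none
]
fitness = evaluate_population(population, toy_score, paradox_list)
```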
18.
Second Approach: paradox detection after network construction
First, construct a Bayesian network from data
Use the network to “prune” the search space for the Simpson’s paradox detection algorithm
The algorithm will focus its search on the pairs of variables for which there is a direct dependence (i.e., an edge A → B) in the Bayesian network
For each pair of such variables, the algorithm will try to find a third variable that acts as a confounder between those two variables
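The pruning step can be sketched as a restriction of the candidate (cause, effect, confounder) triples to the edges of the network. The function names are assumptions, `is_paradox` is a hypothetical callback standing in for the actual detection test on data, and the example edges A → C, B → C, C → D are assumed for a four-variable network.

```python
def candidate_triples(variables, edges):
    """Yield (cause, effect, confounder) triples, restricted to network edges."""
    for cause, effect in edges:
        for confounder in variables:
            if confounder not in (cause, effect):
                yield cause, effect, confounder

def detect_on_network(variables, edges, is_paradox):
    """Run the (assumed) paradox test only on triples allowed by the network."""
    return [t for t in candidate_triples(variables, edges) if is_paradox(*t)]

variables = ["A", "B", "C", "D"]
edges = [("A", "C"), ("B", "C"), ("C", "D")]
pruned = list(candidate_triples(variables, edges))

# Hypothetical test result: pretend only B confounds the A -> C dependence.
found = detect_on_network(variables, edges, lambda c, e, f: (c, e, f) == ("A", "C", "B"))
```

In this example the network leaves 6 triples to test, compared with 24 if every ordered pair of the four variables were considered.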
19.
Variables considered by Simpson’s paradox detection algorithm, considering the Bayesian net
[Figure: Bayesian net with nodes A, B, C, D]
Cause    Effect    Is there a confounder?
A        C         ?
B        C         ?
C        D         ?
A paradox occurrence involving the above pairs of cause and effect variables would be even more surprising to the user, due to the structure of the network