Ohio Center of Excellence in Knowledge-Enabled Computing
Ph.D. Dissertation Defense:
Contrast Pattern Aided Regression and
Classification
February 19, 2016
Vahid Taslimitehrani
Kno.e.sis Center, CSE Dept., Wright State University, USA
Committee Members: Prof. Guozhu Dong (advisor, WSU), Prof. Amit Sheth (WSU),
Prof. T.K. Prasad (WSU), Dr. Keke Chen (WSU), and Prof. Jyotishman Pathak
(Cornell University)
Does Asthma decrease
the mortality risk from
Pneumonia?
Accuracy vs. Interpretability
[Figure: modeling techniques placed by accuracy and interpretability (axes labeled Low to High): Lasso, Linear/Logistic Regression, Naïve Bayes, Decision Trees, Splines, Nearest Neighbors, Bagging, Neural Nets, SVM, Boosting, Random Forest, Deep Learning, and CPXR/CPXC]
Source: Joshua Bloom and Henrik Brink of wise.io
*on real dataset
Modeling Techniques Lack Accuracy
and Interpretability
Heterogeneity &
Diversity of Given
Dataset
Predictors-Response
Interactions
Universal Model’s
Assumption
Predictors-Response Interactions
Interaction effect:
The effect of a variable on the prediction changes with the values of the other independent variable(s) that interact with it.
It is not the genes or the environment!
It is their interaction that’s important.
Universal Model’s Assumption &
Heterogeneity
What is the universal model’s
assumption?
What are heterogeneous and
diverse data points?
Solution
Our proposed methodology has three components:
1. A new type of regression and classification model, called Pattern Aided Regression and Classification (PXR and PXC).
2. New algorithms to build PXR and PXC models, called Contrast Pattern Aided Regression and Classification (CPXR and CPXC).
3. A new algorithm to handle imbalanced datasets, called Contrast Pattern Aided Classification on Imbalanced datasets (CPXCim).
Preliminaries: patterns
• A pattern (rule) is a set of conditions describing a set of objects.
• Example:
"Age ≥ 60" AND "History of Hypertension = YES"
is a pattern (rule) describing:
all patients who are at least 60 years old AND have a history of hypertension.
• An object matches a pattern if it satisfies every condition in the pattern.
Patient ID | Age | BMI | History of Hypertension | Diagnosed with Heart Failure
1 | 75 | 22 | YES | YES
2 | 67 | 27 | NO | NO
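The matching rule on the patterns slide can be sketched in a few lines of Python. This is only an illustration: the `(attribute, operator, value)` encoding of a condition and the `matches` helper are assumptions made here, not the representation used in the thesis.

```python
import operator

# Hypothetical encoding: a pattern is a list of (attribute, operator, value) conditions.
OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}

def matches(obj, pattern):
    """Return True if obj satisfies every condition in the pattern."""
    return all(OPS[op](obj[attr], value) for attr, op, value in pattern)

pattern = [("Age", ">=", 60), ("History of Hypertension", "=", "YES")]
patients = [
    {"Patient ID": 1, "Age": 75, "BMI": 22, "History of Hypertension": "YES"},
    {"Patient ID": 2, "Age": 67, "BMI": 27, "History of Hypertension": "NO"},
]

# Only patient 1 satisfies both conditions.
matched_ids = [p["Patient ID"] for p in patients if matches(p, pattern)]
print(matched_ids)
```

Patient 2 is old enough but has no history of hypertension, so the conjunction fails for that row.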
Preliminaries: matching dataset and contrast patterns
• The matching dataset of pattern $P$ in dataset $D$, denoted $\mathrm{mds}(P, D)$, is the set of all instances of $D$ matching pattern $P$.
• The support of pattern $P$ in $D$ is $\mathrm{supp}(P, D) = \frac{|\mathrm{mds}(P, D)|}{|D|}$.
• Contrast patterns: patterns that distinguish objects in different classes. A pattern is a contrast pattern if it matches many more objects in one class than in another class.
• An equivalence class (EC) is a set of patterns with the same matching dataset (having the same behavior).
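A minimal sketch of the matching dataset, support, and the support ratio that makes a pattern a contrast pattern. Patterns are modeled as Python predicates, and the toy ages and the `support_ratio` helper name are assumptions for illustration only.

```python
def mds(pattern, dataset):
    """Matching dataset: all instances of `dataset` that satisfy `pattern`."""
    return [row for row in dataset if pattern(row)]

def supp(pattern, dataset):
    """Support: fraction of `dataset` matched by `pattern`."""
    return len(mds(pattern, dataset)) / len(dataset)

def support_ratio(pattern, target, other):
    """How much more frequent the pattern is in `target` than in `other`."""
    s_other = supp(pattern, other)
    return float("inf") if s_other == 0 else supp(pattern, target) / s_other

# Toy data (hypothetical): ages of patients in two classes.
class_a = [{"Age": a} for a in (70, 65, 80, 50)]
class_b = [{"Age": a} for a in (30, 62, 45, 40)]
age_at_least_60 = lambda row: row["Age"] >= 60

overall_support = supp(age_at_least_60, class_a + class_b)   # 4 of 8 instances
ratio = support_ratio(age_at_least_60, class_a, class_b)     # 0.75 / 0.25
print(overall_support, ratio)
```

A ratio well above 1 means the pattern is much more common in the first class, which is the sense in which it contrasts the two classes.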
Introduction: CPXR/CPXC overview
$P$: pattern. A pattern logically characterizes a subgroup of data.
$f$: model. A local model represents predictor-response interactions among the data points of a subgroup of data.
Local model algorithms can be as simple as linear regression.
[Diagram: Regression/Classification model $f$ → CPXR/CPXC → pattern-model pairs $(P_1, f_1)$, $(P_2, f_2)$]
Diversity of predictor-response
relationships
• Different pattern-model pairs emphasize different sets of
variables.
• Different pattern-model pairs use highly different
regression/classification models.
• Diverse predictor-response relationships may be neutralized
at the global level.
Introduction: Thesis Statement
Study regression and classification techniques to produce accurate
and interpretable models capable of adequately representing
complex and diverse predictor-response interactions and revealing
high intra-dataset heterogeneity.
Contrast Pattern Aided Regression (CPXR)
Guozhu Dong, Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 1, 2015.
A pictorial illustration of a simple PXR model
A small dataset with 100 instances and 2 numerical predictor variables.
• Different patterns can involve different sets of variables [describing data regions in different subspaces]
• Matching datasets of different patterns can overlap
[Figure: scatter plot of the dataset; both axes range from 0 to 10]
PXR concepts
• Given a training dataset $D = \{(x_i, y_i) \mid 1 \le i \le n\}$, a regression model built on $D$ is called the baseline model and is denoted by $f_b$.
• Given the matching dataset $\mathrm{mds}(P, D)$ of a pattern $P$, a regression model built on $\mathrm{mds}(P, D)$ is called a local model and is denoted by $f_P$.
[Diagram: Regression/Classification → baseline model $f_b$ → CPXR/CPXC → pattern-model pairs $(P_1, f_{P_1})$, $(P_2, f_{P_2})$]
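To make the baseline/local distinction concrete, here is a small sketch with one predictor and an ordinary least-squares line as the model. The toy data, the pattern `x >= 5`, and the `fit_line` helper are assumptions chosen for illustration; the thesis does not prescribe this particular learner.

```python
def fit_line(points):
    """Least-squares line through (x, y) points; returns the fitted function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Toy data (hypothetical): y rises for x < 5 and falls for x >= 5.
D = [(x, x if x < 5 else 20 - 2 * x) for x in range(10)]
P = lambda x: x >= 5                      # a pattern on the single predictor
mds_P = [(x, y) for x, y in D if P(x)]    # matching dataset of P

f_b = fit_line(D)        # baseline model, built on all of D
f_P = fit_line(mds_P)    # local model, built on mds(P, D) only

# Total absolute residual of each model on the instances matching P.
baseline_err = sum(abs(y - f_b(x)) for x, y in mds_P)
local_err = sum(abs(y - f_P(x)) for x, y in mds_P)
print(baseline_err, local_err)
```

On this data the single global line cannot follow both regimes, while the local model fits the subgroup it was trained on almost exactly.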
Pattern Aided Regression (PXR)
• $PXR = ((P_1, f_1, w_1), (P_2, f_2, w_2), \ldots, (P_k, f_k, w_k), f_d)$
• The regression function of PXR is:
$$f_{PXR}(x) = \begin{cases} \dfrac{\sum_{P_i \in \pi_x} w_i f_i(x)}{\sum_{P_i \in \pi_x} w_i}, & \text{if } \pi_x \ne \emptyset \\ f_d(x), & \text{otherwise} \end{cases}$$
where $\pi_x = \{P_i \mid 1 \le i \le k,\ x \text{ matches } P_i\}$
[Figure labels: Pattern Items, Local Model, Match; Case 1, Case 2, Case 3]
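The PXR regression function can be sketched directly from its definition: average the local models of all matching patterns by weight, and fall back to the default model when no pattern matches. The function name `pxr_predict` and the toy patterns, models, and weights are hypothetical.

```python
def pxr_predict(x, triples, f_default):
    """triples: list of (pattern, local_model, weight); patterns are predicates."""
    matched = [(f, w) for p, f, w in triples if p(x)]   # pi_x
    if not matched:
        return f_default(x)                             # no pattern matches x
    return sum(w * f(x) for f, w in matched) / sum(w for _, w in matched)

# Hypothetical PXR model with two overlapping patterns.
triples = [
    (lambda x: x >= 5, lambda x: 20 - 2 * x, 2.0),
    (lambda x: x >= 8, lambda x: 3.0, 1.0),
]
f_default = lambda x: x

no_match = pxr_predict(2, triples, f_default)      # falls back to f_default
one_match = pxr_predict(6, triples, f_default)     # only the first pattern matches
two_matches = pxr_predict(9, triples, f_default)   # weighted average of both models
print(no_match, one_match, two_matches)
```

The three calls correspond to an instance matching no pattern, exactly one pattern, and several patterns.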
CPXR/CPXC: Quality Measures
• The average residual reduction (arr) of a pattern $P$ w.r.t. a baseline prediction model $f_b$ on a dataset $D$ is:
$$arr(P) = \frac{\sum_{x \in \mathrm{mds}(P, D)} r_x(f_b) - \sum_{x \in \mathrm{mds}(P, D)} r_x(f_P)}{|\mathrm{mds}(P, D)|}$$
• The total residual reduction (trr) of a PXR/PXC is:
$$trr(PXR/PXC) = \frac{\sum_{x \in \mathrm{mds}(PS, D)} r_x(f_b) - \sum_{x \in \mathrm{mds}(PS, D)} r_x(f_{PXR/PXC})}{\sum_{x \in D} r_x(f_b)}$$
where $PS = \{P_1, \ldots, P_k\}$ is the pattern set, $r_x(f)$ is $f$’s residual on an instance $x$, and $\mathrm{mds}(PS, D) = \bigcup_{i=1}^{k} \mathrm{mds}(P_i, D)$.
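A small sketch of the two quality measures. It assumes residuals are given as non-negative (absolute) errors, one per instance, and that a boolean list marks which instances fall in $\mathrm{mds}(PS, D)$; those input conventions and the toy numbers are assumptions, not part of the slides.

```python
def arr(res_baseline, res_local):
    """Average residual reduction of a pattern.

    Both lists hold residuals of the instances in mds(P, D), aligned by instance.
    """
    return (sum(res_baseline) - sum(res_local)) / len(res_baseline)

def trr(res_baseline_all, res_pxr_all, covered):
    """Total residual reduction of a PXR/PXC model over the whole dataset D.

    covered[i] is True when instance i lies in mds(PS, D).
    """
    reduction = sum(rb - rp
                    for rb, rp, c in zip(res_baseline_all, res_pxr_all, covered)
                    if c)
    return reduction / sum(res_baseline_all)

# Toy residuals (hypothetical) for a dataset of four instances.
res_b = [4.0, 3.0, 1.0, 2.0]          # baseline residuals on D
res_pxr = [1.0, 1.0, 1.0, 2.0]        # PXR residuals on D
covered = [True, True, False, False]  # first two instances match some pattern

pattern_arr = arr(res_b[:2], res_pxr[:2])
model_trr = trr(res_b, res_pxr, covered)
print(pattern_arr, model_trr)
```

Here the pattern removes 2.5 units of residual per matched instance, and the model as a whole removes half of the baseline's total residual.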
CPXR Algorithm
[Diagram: Dataset D → CPXR (Phase 1, Phase 2, Phase 3)]
Goal: A small set of cooperating patterns, where each pattern characterizes a subgroup of data points such that:
• A baseline model makes large residual errors on data points in the subgroup.
• A highly accurate model is found to correct those errors.
CPXR Algorithm
[Diagram: a baseline regression/classification model is built on the training dataset, and the data is split into LE and SE parts; pattern mining yields patterns $P_1, P_4, \ldots, P_k$; each pattern gets a local model and weight, $(f_1, w_1), (f_4, w_4), \ldots, (f_k, w_k)$; the result is $[(P_1, f_1, w_1), (P_4, f_4, w_4), \ldots, (P_k, f_k, w_k)]$]
CPXR Algorithm
• How to determine the splitting point $\kappa$?
Minimize $\left| \rho - \dfrac{\sum_{r_i > \kappa} r_i}{\sum_i r_i} \right|$
• How to select patterns from $CPS$?
Let $PS = \{P_0\}$, where $P_0$ is the pattern $P$ in $CPS$ with the highest $arr$
[Figure: plot with horizontal axis from 0 to 200 and vertical axis from 0 to 6, with regions marked SE and LE]
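One way to read the splitting rule is as a search over candidate cut points: pick the residual value $\kappa$ for which the instances with residual above $\kappa$ account for a share of the total residual closest to $\rho$. The sketch below assumes non-negative residuals and restricts candidates to the observed residual values; both are assumptions, as is the toy data.

```python
def split_point(residuals, rho):
    """Return the kappa whose above-kappa residual share is closest to rho."""
    total = sum(residuals)

    def gap(kappa):
        large_share = sum(r for r in residuals if r > kappa) / total
        return abs(rho - large_share)

    # Candidate cut points: the distinct observed residual values.
    return min(sorted(set(residuals)), key=gap)

# Toy baseline residuals (hypothetical).
residuals = [0.5, 0.5, 1.0, 1.0, 2.0, 5.0, 5.0, 5.0]
kappa = split_point(residuals, rho=0.45)

LE = [r for r in residuals if r > kappa]    # large-residual instances
SE = [r for r in residuals if r <= kappa]   # small-residual instances
print(kappa, LE, SE)
```

With these numbers no cut hits the target share exactly; the closest candidate puts the three largest residuals into LE.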
CPXR/CPXC: Filtering methods
• Contrast patterns of LE with support ratio less than 1.
• Patterns with tiny residual reduction ($arr$).
• Patterns with Jaccard similarity more than 0.9
$$J(P_1, P_2) = \frac{|\mathrm{mds}(P_1, D) \cap \mathrm{mds}(P_2, D)|}{|\mathrm{mds}(P_1, D) \cup \mathrm{mds}(P_2, D)|}$$
• Patterns with the size of matching datasets less than the number of predictor variables.
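The Jaccard filter can be sketched as a greedy pass that keeps a pattern only if its matching dataset is not too similar to that of any pattern already kept. Representing a matching dataset by instance ids, the greedy order, and the helper names are assumptions for illustration; only the 0.9 threshold comes from the slide.

```python
def jaccard(mds_a, mds_b):
    """Jaccard similarity of two matching datasets given as collections of instance ids."""
    a, b = set(mds_a), set(mds_b)
    return len(a & b) / len(a | b)

def drop_near_duplicates(patterns, threshold=0.9):
    """patterns: list of (name, instance ids), assumed ordered best first."""
    kept = []
    for name, ids in patterns:
        if all(jaccard(ids, kept_ids) <= threshold for _, kept_ids in kept):
            kept.append((name, ids))
    return kept

# Toy patterns (hypothetical): P2 covers almost exactly the same instances as P1.
patterns = [
    ("P1", range(0, 20)),
    ("P2", range(0, 19)),
    ("P3", range(10, 30)),
]
kept_names = [name for name, _ in drop_near_duplicates(patterns)]
print(kept_names)
```

P2 shares 19 of 20 instances with P1 and is dropped, while P3 overlaps P1 only partially and survives.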
CPXR: Prediction Accuracy Evaluation
• 50 real datasets and 23 synthetic datasets
• Different criteria to generate synthetic datasets
• Compare CPXR’s performance with 5 state-of-the-art regression methods
• Overfitting and noise sensitivity
• Analysis of parameters
$$\text{RMSE reduction} = \frac{RMSE(LR) - RMSE(X)}{RMSE(LR)}$$
where $X$ is the evaluated method and $LR$ is linear regression.
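The RMSE reduction measure is a relative improvement over the linear regression baseline. A minimal sketch follows, with made-up predictions purely to show the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_reduction(y_true, pred_lr, pred_x):
    """Relative RMSE reduction of method X over linear regression (LR)."""
    base = rmse(y_true, pred_lr)
    return (base - rmse(y_true, pred_x)) / base

# Toy predictions (hypothetical): LR is off by 2 everywhere, X is off by 1.
y = [1.0, 2.0, 3.0, 4.0]
pred_lr = [3.0, 4.0, 5.0, 6.0]
pred_x = [2.0, 3.0, 4.0, 5.0]

reduction = rmse_reduction(y, pred_lr, pred_x)
print(reduction)
```

A value of 0.5 means method X halves the RMSE of linear regression; a negative value would mean X is worse than the baseline.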
CPXR: Prediction Accuracy Evaluation
CPXR’s performance vs. other methods
Dataset | PLR | SVR | BART | GBM | CPXR
Tecator | 40.62 | 0.16 | 19.35 | -14.15 | 65.1
Tree | 17.68 | 7.92 | -7.23 | -10.82 | 61.73
Wage | 12.2 | 9.15 | 25.42 | 11.86 | 38.45
Average | 18.41 | 4.94 | 20.18 | 14.6 | 42.89
• CPXR has the highest accuracy in 41 out of 50 datasets.
• CPXR’s results are more accurate than LR in all 50 datasets.
• In 20% of datasets, CPXR achieved more than 60% RMSE reduction.
CPXR: Overfitting and Noise Sensitivity
[Figure: drop in accuracy compared to clean test data (%) vs. noise (%), for BART, CPXR, and Gradient Boosting]
[Figure: RMSE reduction on synthetic datasets, two panels, for NN, SVR, BART, and CPXR]
Train - Test
Method | Training | Test | Drop in accuracy
PLR | 37.11% | 18.76% | 49%
SVR | 7.65% | 4.8% | 37%
BART | 41.02% | 20.15% | 51%
CPXR(LL) | 51.4% | 39.88% | 22%
CPXR(LP) | 53.85% | 42.89% | 21%
CPXR: Analysis of Parameters
[Figures: RMSE improvement over LR vs. k (number of patterns), vs. minSup, and vs. r, on the Fat, Mussels, and Price datasets]
• 2% is the optimal minSup.
• 7 patterns on average over 50 datasets.
Contrast Pattern Aided Classification (CPXC)
Guozhu Dong, Vahid Taslimitehrani. Pattern Aided Classification. SIAM Data Mining Conference, 2016.
CPXC: PXC Concept
CPXC techniques are quite similar to those of CPXR, but CPXC has more challenges as well as more opportunities than CPXR.
[Diagram: CPXC surrounded by four design aspects: Confidence of Match, Objective Functions, Classification Algorithms, Loss Functions]
CPXC: Confidence of Match
• Given $PXC = ((P_1, h_{P_1}, w_1), (P_2, h_{P_2}, w_2), \ldots, (P_k, h_{P_k}, w_k), h_d)$, the class of an instance $x$ is determined by the weighted vote for each class $C_j$:
$$\text{weighted-vote}(PXC, C_j, x) = \begin{cases} \dfrac{\sum_{P_i \in \pi_x} w_i \times match(x, P_i) \times h_{P_i}(x, C_j)}{\sum_{P_i \in \pi_x} w_i \times match(x, P_i)}, & \text{if } \pi_x \ne \emptyset \\ h_d(x, C_j), & \text{otherwise} \end{cases}$$
where $\pi_x = \{P_i \mid 1 \le i \le k,\ match(x, P_i) > 0\}$ and
$$match(x, P_i) = \frac{|\{q \in MG(P_i) \mid x \text{ matches } q\}|}{|MG(P_i)|}$$
• $match(x, P_i)$ is the fraction of the patterns $q$ in $MG(P_i)$ such that $x$ matches $q$.
• $h_{P_i}(x, C_j)$ is the confidence score of local model $h_{P_i}$ on instance $x$ for class $C_j$.
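The weighted vote can be sketched as follows. Each pattern is represented here only by its set $MG(P_i)$ of predicates, and local classifiers are functions returning a confidence score for a class; this encoding, the helper names, and the toy thresholds and scores are all assumptions for illustration.

```python
def match_degree(x, generators):
    """Fraction of the patterns in MG(P_i) that x matches."""
    return sum(1 for q in generators if q(x)) / len(generators)

def weighted_vote(x, cls, triples, h_default):
    """triples: list of (MG(P_i) as predicates, local classifier h, weight w)."""
    active = []
    for generators, h, w in triples:
        m = match_degree(x, generators)
        if m > 0:                        # P_i belongs to pi_x
            active.append((w * m, h))
    if not active:
        return h_default(x, cls)         # no pattern matches, use default classifier
    return sum(wm * h(x, cls) for wm, h in active) / sum(wm for wm, _ in active)

# Hypothetical PXC model with two patterns.
triples = [
    ([lambda x: x["age"] >= 60, lambda x: x["bmi"] >= 30],
     lambda x, c: 0.8 if c == "pos" else 0.2, 2.0),
    ([lambda x: x["age"] >= 65],
     lambda x, c: 0.4 if c == "pos" else 0.6, 1.0),
]
h_default = lambda x, c: 0.5

score = weighted_vote({"age": 70, "bmi": 22}, "pos", triples, h_default)
fallback = weighted_vote({"age": 30, "bmi": 20}, "pos", triples, h_default)
print(score, fallback)
```

The first instance matches half of the first pattern's predicates and all of the second's, so both local classifiers contribute with equal effective weight; the second instance matches nothing and is scored by the default classifier.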
CPXC: Loss Functions
[Figure: AUC by classification error loss function (Binary, Probabilistic, Standardized) on the ILPD, Hillvalley, and Planning datasets]
The probabilistic error loss function returns the best results.
CPXC: Base/Local Algorithms & Objective Functions
Classification Algorithms
• Different methods for baseline and local classifiers:
– We used 6 classification algorithms for learning the baseline and local classifiers
Objective Functions
• Quality measures on pattern sets
– We used $trr$, AUC, and ACC (accuracy) to measure the quality of a pattern set
• Quality measures on patterns and weights on local classifiers
– We used $arr$, AUC, and ACC (accuracy) to measure the quality of a pattern: $arr$ is the winner!
Experimental results
• 19 public datasets
• 8 classification algorithms
• Noise sensitivity & overfitting
• Running time
• 7-fold cross validation
• minSup = 0.02
• rho = 0.45
CPXC: Performance
Dataset | Boosting | DT | NBC | Log | RF | SVM | Max | CPXC (NBC-DT)
Congress | 0.58 | 0.66 | 0.6 | 0.57 | 0.58 | 0.58 | 0.66 | 0.86
Poker | 0.6 | 0.6 | 0.5 | 0.5 | 0.76 | 0.5 | 0.76 | 0.85
HillValley | 0.5 | 0.63 | 0.65 | 0.66 | 0.6 | 0.67 | 0.67 | 0.89
Climate | 0.96 | 0.81 | 0.9 | 0.94 | 0.97 | 0.98 | 0.98 | 0.97
Mammography | 0.94 | 0.91 | 0.94 | 0.94 | 0.93 | 0.93 | 0.94 | 0.98
Steel | 0.96 | 0.88 | 0.91 | 0.95 | 0.95 | 0.94 | 0.95 | 0.99
• CPXC achieved average AUC of 0.886 on the 8 hard datasets.
• Average AUC of the best performing traditional classifier (RF) on hard datasets is 0.638.
• CPXC’s AUC is never lower than RF on the hard datasets.
• CPXC achieved average AUC of 0.983 on the easy datasets while the best performing traditional algorithms obtained average AUC of 0.968.
CPXC: Noise Sensitivity
Drop of AUC vs. noise levels
Method/Noise | 0% | 5% | 10% | 15% | 20% | Average
RF | 5.73 | 6.61 | 12.48 | 25.83 | 33.54 | 16.84
CPXC | 5.87 | 6.79 | 12.92 | 24.7 | 32.7 | 16.6
Boosting | 7.02 | 8.93 | 14.2 | 26.8 | 34.65 | 18.32
Log | 7.04 | 10.56 | 14.63 | 24.7 | 33.94 | 18.17
NBC | 7.06 | 10.58 | 15.26 | 27.89 | 35.1 | 19.18
SVM | 8.6 | 10.34 | 16.28 | 29.59 | 38.02 | 20.57
DT | 8.8 | 11.04 | 16.78 | 30.3 | 43.1 | 22.00
CPXC: Impact of Parameters
[Figures: AUC vs. k (number of patterns), vs. minSup, and vs. r, on the Blood, Congress, Hillvalley, and Planning datasets; AUC by objective function (TER, AUC, ACC) on the ILPD, Hillvalley, and Planning datasets]
Classification on Imbalanced Datasets
• What is an imbalanced classification problem?
• What are the real world applications?
• Why do traditional classification algorithms not perform well on imbalanced datasets?
• What is our proposed solution?
Classifying minority class instances correctly might be more important than classifying majority class instances.
New Weighting idea
[Diagram: a baseline classification model is built on the training dataset; its errors are weighted, and the data is split into LE and SE parts]
• $err^*(h_b, x) = \begin{cases} err(h_b, x) \times \delta, & \text{if } x \in \text{minority class instances} \\ err(h_b, x), & \text{if } x \in \text{majority class instances} \end{cases}$
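The weighting idea amounts to scaling the baseline classifier's error on minority class instances by a factor $\delta$ before the split into large-error and small-error parts. A minimal sketch, with a hypothetical $\delta$ and toy errors:

```python
def weighted_error(err, is_minority, delta):
    """err*(h_b, x): baseline error, scaled by delta for minority class instances."""
    return err * delta if is_minority else err

# Toy baseline errors (hypothetical); the second instance is in the minority class.
errors = [0.25, 0.5, 0.75]
minority_flags = [False, True, False]
delta = 4.0  # assumed value, chosen only for illustration

weighted = [weighted_error(e, m, delta) for e, m in zip(errors, minority_flags)]
print(weighted)
```

After weighting, the minority instance has the largest error of the three, so it is more likely to end up on the large-error side of the split.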
A Filtering Method to Remove Imbalanced Local Models
• $IR(\mathrm{mds}(P, D)) = \dfrac{\text{Number of instances in the majority class}}{\text{Number of instances in the minority class}}$
[Diagram: patterns and their local models]
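A sketch of filtering patterns by the imbalance ratio of their matching datasets. The slide gives only the ratio itself; the cut-off `max_ir`, the binary 0/1 labels, and the helper names below are assumptions made for illustration.

```python
def imbalance_ratio(labels, classes=(0, 1)):
    """IR of a matching dataset: majority class count over minority class count."""
    counts = [sum(1 for y in labels if y == c) for c in classes]
    return float("inf") if min(counts) == 0 else max(counts) / min(counts)

def keep_balanced(pattern_labels, max_ir):
    """Keep patterns whose matching dataset has IR at most max_ir (hypothetical cut-off)."""
    return [name for name, labels in pattern_labels
            if imbalance_ratio(labels) <= max_ir]

# Class labels of the instances matched by each pattern (toy data).
pattern_labels = [
    ("P1", [0] * 9 + [1]),      # IR = 9.0
    ("P2", [0, 0, 1, 1, 0]),    # IR = 1.5
    ("P3", [0] * 4),            # no minority instances at all
]
kept = keep_balanced(pattern_labels, max_ir=5.0)
print(kept)
```

Only the pattern whose matching dataset contains a reasonable share of both classes is kept for building a local classifier.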
Experimental results
• The average AUC of CPXCim is 14% and 15.2% more than the AUC of SMOTE and SMOTE-TL, respectively.
• The performance of CPXCim is always better than other imbalanced classifiers on these 10 datasets.
CPXCim’s performance
Dataset | # of instances | # of variables | Imbalance ratio | CPXCim | SMOTE | SMOTE-TL
Yeast | 1004 | 8 | 9.14 | 0.942 | 0.7728 | 0.772
Led7digit | 443 | 7 | 10.97 | 0.978 | 0.8919 | 0.897
flareF | 1066 | 11 | 23.79 | 0.883 | 0.7463 | 0.809
Wine Quality | 1599 | 11 | 29.17 | 0.76 | 0.6008 | 0.59
Average | - | - | - | 0.92 | 0.798 | 0.807
Applications of CPXR & CPXC
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper).
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
Application: Traumatic Brain Injury
What is Traumatic Brain Injury (TBI)?
It is an important public health problem and a leading cause of death and disability worldwide.
Problem definition: prediction of patient outcome within 6 months after the TBI event, using the admission data.
• Dataset: 2159 patients collected from a trial, with 15 predictor variables
• Two class variables: mortality and unfavorable outcome.
Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper Award).
Application: Traumatic Brain Injury
CPXR(Log)’s performance
Model (Unfavorable) | Basic | Basic+CT | Basic+CT+Lab
Specificity | 0.89 (0.85) | 0.87 (0.85) | 0.91 (0.84)
Sensitivity | 0.54 (0.52) | 0.65 (0.6) | 0.72 (0.61)
Accuracy | 0.75 (0.72) | 0.79 (0.75) | 0.87 (0.75)
F1 | 0.63 (0.59) | 0.7 (0.66) | 0.76 (0.66)
AUC | 0.82 (0.76) | 0.87 (0.8) | 0.93 (0.81)
Performance changes when we add more variables
Variable set change | Mortality: CPXR(Log) | Mortality: Log | Unfavorable: CPXR(Log) | Unfavorable: Log
Basic → Basic+CT | 10% | 7.7% | 6% | 5.2%
Basic+CT → Basic+CT+Lab | 4.5% | 2.5% | 6.8% | 1.25%
Basic → Basic+CT+Lab | 15% | 11.1% | 13.4% | 6.6%
[Figure: ROC curves (true positive rate vs. false positive rate) for CPXR(Log), SLogR, SVM, and RF: AUC_CPXR(Log) = 0.87, AUC_SLogR = 0.8, AUC_RF = 0.72, AUC_SVM = 0.7]
Application: Heart Failure Survival Risk Models
• Collaboration with Mayo Clinic
• Problem definition: Heart Failure survival prediction models.
• An EHR dataset on 119,749 patients admitted to Mayo Clinic.
• Predictor variables are grouped in the following categories:
– Demographic, Vitals, Labs, Medications and 24 major chronic conditions as co-morbidities.
• Three groups of CPXC models are developed to predict survival in 1, 2 and 5 years after heart failure event.
Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
Application: Heart Failure Survival Risk Models
Performance of different classifiers
Algorithm | 1 Year | 2 Year | 5 Year
Decision Tree | 0.66 | 0.5 | 0.5
Random Forest | 0.8 | 0.72 | 0.72
Ada Boost | 0.74 | 0.71 | 0.68
SVM | 0.59 | 0.52 | 0.52
Logistic Regression | 0.81 | 0.74 | 0.73
CPXC | 0.937 | 0.83 | 0.786
Odds ratios of PXC local models
Variable | Log | f1 | f2 | f3 | f4 | f5 | f6 | f7
Alzheimer | 1.75 | 1.74 | 0.80 | 1.88 | 1.59 | 1.29 | 1.58 | 0.75
Breast Cancer | 0.63 | 1.15 | 1.62 | 2.73 | 1.00 | 1.00 | 2.08 | 0.59
Application: Heart Failure Survival Risk Models
Performance changes when we add more variables
Variable sets | CPXC | Log | RF | SVM | DT | Boosting
(Demo&Vital) → (Demo&Vital)+Lab | 4.8% | 11.5% | 19% | 17.3% | 0% | 14.7%
(Demo&Vital) → (Demo&Vital)+Lab+Med | 8.9% | 13.4% | 21.2% | 21.7% | 0% | 5.7%
(Demo&Vital) → (Demo&Vital)+Lab+Med+Co-morbid | 27.8% | 9.6% | 19.1% | 19.5% | -10.4% | 7.6%
(Demo&Vital)+Lab → (Demo&Vital)+Lab+Med | 3.2% | 1.7% | 1.7% | 3.7% | 0% | -9.8%
(Demo&Vital)+Lab → (Demo&Vital)+Lab+Med+Co-morbid | 20.9% | -1.7% | 0% | 1.8% | -10.4% | -8.1%
(Demo&Vital)+Lab+Med → (Demo&Vital)+Lab+Med+Co-morbid | 15.9% | -3.3% | -1.7% | -1.7% | -10.4% | 1.8%
Adding co-morbidities:
• decreased the AUC of other classifiers by 5.3% on average.
• increased the AUC of CPXC by 21.5% on average.
Application: Saturated Hydraulic
Conductivity
• Collaboration with University of Texas at Austin and USDA-ARS
• Problem definition:
1. Prediction of the soil water retention curve (SWRC)
2. Prediction of Saturated Hydraulic Conductivity (SHC)
3. Investigating the effect of sample dimensions on
prediction accuracy.
• Number of predictor variables: 6-13
• Number of response variables: 10
• 32 CPXR models are developed.
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample
dimensions effect on prediction of soil water retention curve and saturated hydraulic
conductivity. Journal of Hydrology. 528 (2015): 127-137.
Application: Saturated Hydraulic Conductivity
[Figures: two scatter plots of predicted log(Ksat) [cm day-1] vs. measured log(Ksat) [cm day-1] for SHC2, one with RMSLE = 0.456 and one with RMSLE = 1.936]
Model | s | t | 10 | 30 | 50 | 100 | 300 | 500 | 1000 | 1500
Linear Regression, SWRC1 | 0.79 | 0.73 | 0.77 | 0.84 | 0.85 | 0.84 | 0.83 | 0.84 | 0.81 | 0.77
Linear Regression, SWRC2 | 0.79 | 0.72 | 0.77 | 0.85 | 0.84 | 0.84 | 0.84 | 0.83 | 0.80 | 0.78
CPXR, SWRC1 | 0.94 | 0.97 | 0.97 | 0.94 | 0.97 | 0.97 | 0.95 | 0.96 | 0.95 | 0.94
CPXR, SWRC2 | 0.95 | 0.96 | 0.94 | 0.95 | 0.97 | 0.96 | 0.95 | 0.98 | 0.97 | 0.94
Conclusion
• A new type of highly accurate and interpretable regression and classification models, PXR/PXC, is presented.
• New techniques to build PXR and PXC models are discussed.
• Each pattern-model pair represents a diverse predictor-response interaction.
• PXR and PXC models are more accurate, more interpretable, and less prone to overfitting than other regression and classification algorithms.
• A new method adapted from CPXC is presented to handle classification on imbalanced datasets.
• Several applications of CPXR and CPXC are discussed.
Related publications
• Guozhu Dong, Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 1, 2015.
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper).
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
• Guozhu Dong, Vahid Taslimitehrani. Pattern Aided Classification. SIAM Data Mining Conference, 2016.
Acknowledgement
50

More Related Content

What's hot

Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavAgile Testing Alliance
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
Download-manuals-surface water-software-48appliedstatistics
 Download-manuals-surface water-software-48appliedstatistics Download-manuals-surface water-software-48appliedstatistics
Download-manuals-surface water-software-48appliedstatisticshydrologyproject001
 
Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingGalit Shmueli
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?Galit Shmueli
 
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論Naoki Hayashi
 
G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsIstituto nazionale di statistica
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程台灣資料科學年會
 
Face Identification Project Abstract 2017
Face Identification Project Abstract 2017Face Identification Project Abstract 2017
Face Identification Project Abstract 2017ioshean
 
Nonnegative Garrote as a Variable Selection Method in Panel Data
Nonnegative Garrote as a Variable Selection Method in Panel DataNonnegative Garrote as a Variable Selection Method in Panel Data
Nonnegative Garrote as a Variable Selection Method in Panel DataIJCSIS Research Publications
 
Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2Dennis Sweitzer
 
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Naoki Hayashi
 
Reduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theoryReduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theorycsandit
 
How principal components analysis is different from factor
How principal components analysis is different from factorHow principal components analysis is different from factor
How principal components analysis is different from factorArup Guha
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love BucharestStefan Adam
 

What's hot (19)

Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
Download-manuals-surface water-software-48appliedstatistics
 Download-manuals-surface water-software-48appliedstatistics Download-manuals-surface water-software-48appliedstatistics
Download-manuals-surface water-software-48appliedstatistics
 
Statistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, DescribingStatistical Modeling in 3D: Explaining, Predicting, Describing
Statistical Modeling in 3D: Explaining, Predicting, Describing
 
One Graduate Paper
One Graduate PaperOne Graduate Paper
One Graduate Paper
 
To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?To Explain, To Predict, or To Describe?
To Explain, To Predict, or To Describe?
 
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
 
G. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statisticsG. Barcaroli, The use of machine learning in official statistics
G. Barcaroli, The use of machine learning in official statistics
 
[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程[系列活動] Machine Learning 機器學習課程
[系列活動] Machine Learning 機器學習課程
 
Data science
Data scienceData science
Data science
 
Face Identification Project Abstract 2017
Face Identification Project Abstract 2017Face Identification Project Abstract 2017
Face Identification Project Abstract 2017
 
Nonnegative Garrote as a Variable Selection Method in Panel Data
Nonnegative Garrote as a Variable Selection Method in Panel DataNonnegative Garrote as a Variable Selection Method in Panel Data
Nonnegative Garrote as a Variable Selection Method in Panel Data
 
Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2Sct2013 boston,randomizationmetricsposter,d6.2
Sct2013 boston,randomizationmetricsposter,d6.2
 
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
 
Reduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theoryReduct generation for the incremental data using rough set theory
Reduct generation for the incremental data using rough set theory
 
How principal components analysis is different from factor
How principal components analysis is different from factorHow principal components analysis is different from factor
How principal components analysis is different from factor
 
"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest"Naive Bayes Classifier" @ Papers We Love Bucharest
"Naive Bayes Classifier" @ Papers We Love Bucharest
 
L3. Decision Trees
L3. Decision TreesL3. Decision Trees
L3. Decision Trees
 

Viewers also liked

Semantic, Cognitive and Perceptual Computing -Perceptual computing from the f...
Semantic, Cognitive and Perceptual Computing -Perceptual computing from the f...Semantic, Cognitive and Perceptual Computing -Perceptual computing from the f...
Semantic, Cognitive and Perceptual Computing -Perceptual computing from the f...Artificial Intelligence Institute at UofSC
 
Vahid Taslimitehrani PhD Dissertation Defense: Contrast Pattern Aided Regression and Classification

  • 1. Ohio Center of Excellence in Knowledge-Enabled Computing. Ph.D. Dissertation Defense: Contrast Pattern Aided Regression and Classification. February 19, 2016. Vahid Taslimitehrani, Kno.e.sis Center, CSE Dept., Wright State University, USA. Committee Members: Prof. Guozhu Dong (advisor, WSU), Prof. Amit Sheth (WSU), Prof. T.K. Prasad (WSU), Dr. Keke Chen (WSU), and Prof. Jyotishman Pathak (Cornell University).
  • 3. Does asthma decrease the mortality risk from pneumonia?
  • 4. Accuracy vs. Interpretability. [Figure: modeling methods placed on axes labeled Accuracy and Interpretability (Low to High): Lasso, Linear/Logistic Regression, Naïve Bayes, Decision Trees, Splines, Nearest Neighbors, Bagging, Neural Nets, SVM, Boosting, Random Forest, Deep Learning, and CPXR/CPXC.] Source: Joshua Bloom and Henrik Brink of wise.io. *On real dataset.
  • 5. Modeling Techniques Lack Accuracy and Interpretability: Heterogeneity & Diversity of Given Dataset; Predictors-Response Interactions; Universal Model’s Assumption.
  • 6. Predictors-Response Interactions. Interactive effect: the effect of a variable on the prediction varies with the values of the other independent variable(s) it interacts with. It is not the genes or the environment! It is their interaction that’s important.
  • 7. Universal Model’s Assumption & Heterogeneity. What is the universal model’s assumption? What are heterogeneous and diverse data points?
  • 8. Solution. Our proposed methodology has three components:
      1. A new type of regression & classification models called Pattern Aided Regression and Classification (PXR and PXC).
      2. New algorithms to build PXR and PXC models, called Contrast Pattern Aided Regression and Classification (CPXR and CPXC).
      3. A new algorithm to handle imbalanced datasets, called Contrast Pattern Aided Classification on Imbalanced datasets (CPXCim).
  • 9. Preliminaries: patterns
      • A pattern (rule) is a set of conditions describing a set of objects.
      • Example: "$Age \ge 60$" AND "History of hypertension = YES" is a pattern (rule) describing all patients who are at least 60 years old and have a history of hypertension.
      • An object matches a pattern if it satisfies every condition in the pattern.

      Patient ID   Age   BMI   History of Hypertension   Diagnosed with Heart Failure
      1            75    22    YES                       YES
      2            67    27    NO                        NO
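As a minimal sketch of pattern matching, the example pattern can be checked against the two patients in the table. The representation of a pattern as a list of `(attribute, comparison, value)` conditions, the `matches` helper, and the dictionary keys are illustrative assumptions, not something the slides specify.

```python
import operator

# Assumed encoding: a pattern is a list of (attribute, comparison, value) conditions.
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


def matches(record, pattern):
    """Return True if the record satisfies every condition in the pattern."""
    return all(OPS[op](record[attr], value) for attr, op, value in pattern)


# The two patients from the table on this slide.
patients = [
    {"id": 1, "age": 75, "bmi": 22, "hypertension": "YES", "heart_failure": "YES"},
    {"id": 2, "age": 67, "bmi": 27, "hypertension": "NO", "heart_failure": "NO"},
]

# "Age >= 60" AND "History of hypertension = YES"
pattern = [("age", ">=", 60), ("hypertension", "==", "YES")]

matched_ids = [p["id"] for p in patients if matches(p, pattern)]
print(matched_ids)  # [1]
```

Patient 2 satisfies the age condition but not the hypertension condition, so only patient 1 matches.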
  • 10. Preliminaries: matching dataset and contrast patterns
      • The matching dataset of pattern $P$ in dataset $D$, written $mds(P, D)$, is the set of all instances in $D$ that match $P$.
      • The support of pattern $P$ in $D$ is $supp(P, D) = \dfrac{|mds(P, D)|}{|D|}$.
      • Contrast patterns: patterns that distinguish objects in different classes. A pattern is a contrast pattern if it matches many more objects in one class than in another class.
      • An equivalence class (EC) is a set of patterns with the same matching dataset (having the same behavior).
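The definitions of $mds$ and $supp$ can be sketched in a few lines of Python. The toy dataset, the predicate-style pattern, and the use of a support ratio (support in one class divided by support in the other) as the contrast measure are assumptions made for illustration.

```python
def mds(pattern, dataset):
    """Matching dataset: all instances of `dataset` that satisfy `pattern`."""
    return [row for row in dataset if pattern(row)]


def supp(pattern, dataset):
    """Support: fraction of instances in `dataset` that match `pattern`."""
    return len(mds(pattern, dataset)) / len(dataset)


# Hypothetical toy data: (age, class label).
data = [
    (72, "pos"), (65, "pos"), (80, "pos"), (45, "pos"),
    (70, "neg"), (30, "neg"), (41, "neg"), (55, "neg"),
]


def older(row):
    """Illustrative pattern: age >= 60."""
    return row[0] >= 60


pos = [row for row in data if row[1] == "pos"]
neg = [row for row in data if row[1] == "neg"]

overall_support = supp(older, data)                  # 4 of 8 instances
support_ratio = supp(older, pos) / supp(older, neg)  # 0.75 / 0.25

print(overall_support, support_ratio)  # 0.5 3.0
```

Here the pattern matches three times as large a share of the positive class as of the negative class, which is the sense in which it contrasts the two classes.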
  • 11. Introduction: CPXR/CPXC overview. $P$: pattern; $f$: model. A pattern logically characterizes a subgroup of data. A local model represents predictor-response interactions among the data points of a subgroup of data. Local model algorithms can be as simple as linear regression. [Diagram labels: Regression/Classification, $f$, CPXR/CPXC, $(P_1, f_1)$, $(P_2, f_2)$.]
  • 12. Diversity of predictor-response relationships
      • Different pattern-model pairs emphasize different sets of variables.
      • Different pattern-model pairs use highly different regression/classification models.
      • Diverse predictor-response relationships may be neutralized at the global level.
  • 13. Introduction: Thesis Statement. Study regression and classification techniques to produce accurate and interpretable models capable of adequately representing complex and diverse predictor-response interactions and revealing high intra-dataset heterogeneity.
  • 14. Contrast Pattern Aided Regression (CPXR). Guozhu Dong and Vahid Taslimitehrani, "Pattern-Aided Regression Modeling and Prediction Model Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 1, 2015.
  • 15. A pictorial illustration of a simple PXR model. A small dataset with 100 instances and 2 numerical predictor variables. [Figure: plot with both axes running from 0 to 10.]
      • Different patterns can involve different sets of variables (describing data regions in different subspaces).
      • Matching datasets of different patterns can overlap.
  • 16. PXR concepts
      • Given a training dataset $D = \{(x_i, y_i) \mid 1 \le i \le n\}$, a regression model built on $D$ is called the baseline model and is denoted $f_b$.
      • Given the matching dataset $mds(P, D)$ of a pattern $P$, a regression model built on $mds(P, D)$ is called a local model and is denoted $f_P$.
      [Diagram labels: Regression/Classification, $f_b$, CPXR/CPXC, $(P_1, f_{P_1})$, $(P_2, f_{P_2})$.]
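To make the baseline/local distinction concrete, here is a minimal sketch that fits one line to all of $D$ and another to $mds(P, D)$. It assumes ordinary least squares on a single predictor; the `fit_line` helper, the toy data, and the pattern `x >= 5` are hypothetical and chosen only so that the two fits differ visibly.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical heterogeneous data: y rises for x < 5 and falls for x >= 5.
D = [(x, 2 * x) for x in range(5)] + [(x, 20 - x) for x in range(5, 10)]


def pattern(point):
    """Illustrative pattern P: x >= 5."""
    return point[0] >= 5


matching = [p for p in D if pattern(p)]

baseline_slope, baseline_intercept = fit_line(D)     # f_b, built on all of D
local_slope, local_intercept = fit_line(matching)    # f_P, built on mds(P, D)

print(baseline_slope, local_slope, local_intercept)
```

On this toy data the baseline line has a positive slope, while the local model recovers the falling trend inside the subgroup, which the global fit averages away.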
  • 17. Pattern Aided Regression (PXR)
      • $PXR = ((P_1, f_1, w_1), (P_2, f_2, w_2), \dots, (P_k, f_k, w_k), f_d)$
      • The regression function of PXR is:
        $$f_{PXR}(x) = \begin{cases} \dfrac{\sum_{P_i \in \pi_x} w_i f_i(x)}{\sum_{P_i \in \pi_x} w_i}, & \text{if } \pi_x \neq \emptyset \\ f_d(x), & \text{otherwise} \end{cases}$$
        where $\pi_x = \{P_i \mid 1 \le i \le k,\ x \text{ matches } P_i\}$.
      [Diagram labels: Pattern, Items, Local Model, Match; Case 1, Case 2, Case 3.]
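The PXR regression function can be sketched directly from its definition. The function name `pxr_predict`, the predicate-style patterns, and the toy local models and weights below are assumptions for illustration; only the weighted-average-or-default structure comes from the slide.

```python
def pxr_predict(x, triples, f_default):
    """Weighted average of the local models whose patterns match x.

    triples: list of (pattern, local_model, weight); falls back to
    f_default when no pattern matches (pi_x is empty).
    """
    matched = [(f, w) for pattern, f, w in triples if pattern(x)]
    if not matched:
        return f_default(x)
    total_weight = sum(w for _, w in matched)
    return sum(w * f(x) for f, w in matched) / total_weight


# Hypothetical PXR model over instances x = (x0, x1).
triples = [
    (lambda x: x[0] >= 5, lambda x: 2 * x[0], 1.0),
    (lambda x: x[1] < 3, lambda x: x[0] + x[1], 3.0),
]


def f_default(x):
    """Illustrative default model f_d."""
    return 0.5 * x[0]


both = pxr_predict((6, 1), triples, f_default)   # (1*12 + 3*7) / 4
one = pxr_predict((6, 4), triples, f_default)    # only the first pattern matches
none = pxr_predict((2, 4), triples, f_default)   # no pattern matches: f_d

print(both, one, none)  # 8.25 12.0 1.0
```

The three calls cover an instance matched by several patterns, by exactly one pattern, and by none.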
  • 18. CPXR/CPXC: Quality Measures
      • The average residual reduction ($arr$) of a pattern $P$ w.r.t. a baseline prediction model $f_b$ on a dataset $D$ is:
        $$arr(P) = \frac{\sum_{x \in mds(P, D)} r_x(f_b) - \sum_{x \in mds(P, D)} r_x(f_P)}{|mds(P, D)|}$$
      • The total residual reduction ($trr$) of a PXR/PXC is:
        $$trr(PXR/PXC) = \frac{\sum_{x \in mds(PS, D)} r_x(f_b) - \sum_{x \in mds(PS, D)} r_x(f_{PXR/PXC})}{\sum_{x \in D} r_x(f_b)}$$
        where $PS = \{P_1, \dots, P_k\}$ is the pattern set, $r_x(f)$ is $f$’s residual on an instance $x$, and $mds(PS, D) = \bigcup_{i=1}^{k} mds(P_i, D)$.
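A small worked example of $arr$ and $trr$, as a sketch. It assumes that the residual $r_x(f)$ is the absolute error $|y - f(x)|$, which the slide does not state explicitly; the function names and the toy numbers are likewise illustrative.

```python
def arr(y, base_pred, local_pred, matched_idx):
    """Average residual reduction of one pattern over its matching dataset.

    Assumption: residuals are absolute errors.
    """
    base = sum(abs(y[i] - base_pred[i]) for i in matched_idx)
    local = sum(abs(y[i] - local_pred[i]) for i in matched_idx)
    return (base - local) / len(matched_idx)


def trr(y, base_pred, pxr_pred, covered_idx):
    """Total residual reduction of a pattern-aided model.

    covered_idx: indices of instances matched by at least one pattern.
    """
    base_covered = sum(abs(y[i] - base_pred[i]) for i in covered_idx)
    pxr_covered = sum(abs(y[i] - pxr_pred[i]) for i in covered_idx)
    base_total = sum(abs(y[i] - base_pred[i]) for i in range(len(y)))
    return (base_covered - pxr_covered) / base_total


# Hypothetical numbers: four instances, one pattern matching instances 2 and 3.
y = [10, 12, 20, 30]
base_pred = [14, 16, 14, 24]   # baseline residuals: 4, 4, 6, 6
local_pred = {2: 19, 3: 31}    # local model residuals on the matches: 1, 1
pxr_pred = [14, 16, 19, 31]    # baseline elsewhere, local model on the matches

pattern_arr = arr(y, base_pred, local_pred, [2, 3])   # (12 - 2) / 2
model_trr = trr(y, base_pred, pxr_pred, [2, 3])       # (12 - 2) / 20

print(pattern_arr, model_trr)  # 5.0 0.5
```

In this toy case the pattern removes 5 units of error per matched instance, and the resulting model removes half of the baseline's total residual.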
  • 19. CPXR Algorithm. [Diagram: Dataset $D$; CPXR with Phase 1, Phase 2, Phase 3.] Goal: a small set of cooperating patterns, where each pattern characterizes a subgroup of data points:
      • A baseline model makes large residual errors on data points in the subgroup.
      • A highly accurate model is found to correct those errors.
  • 20. CPXR Algorithm. [Diagram labels: Training Dataset; Baseline model (Regression/Classification); LE and SE; Pattern Mining; Patterns $P_1, P_4, \dots, P_k$; Local Models $(f_1, w_1), (f_4, w_4), \dots, (f_k, w_k)$; result $[(P_1, f_1, w_1), (P_4, f_4, w_4), \dots, (P_k, f_k, w_k)]$.]
  • 21. CPXR Algorithm
      • How to determine the splitting point $\kappa$? Minimize $\left| \rho - \dfrac{\sum_{r_i > \kappa} r_i}{\sum_{i} r_i} \right|$.
      • How to select patterns from $CPS$? Let $PS = \{P_0\}$, where $P_0$ is the pattern $P$ in $CPS$ with the highest $arr$.
      [Figure: plot with regions labeled SE and LE.]
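The choice of the splitting point $\kappa$ can be sketched as a search over candidate thresholds. This sketch assumes non-negative (absolute) residuals and, for simplicity, only tries the observed residual values as candidates; `choose_kappa` and the toy residuals are hypothetical. The value $\rho = 0.45$ is the setting reported later in the experimental setup.

```python
def choose_kappa(residuals, rho):
    """Pick the threshold kappa whose above-threshold residual share is closest to rho.

    Assumptions: residuals are non-negative, and candidate thresholds are
    restricted to the observed residual values.
    """
    total = sum(residuals)
    best_kappa, best_gap = None, float("inf")
    for kappa in sorted(set(residuals)):
        share = sum(r for r in residuals if r > kappa) / total
        gap = abs(rho - share)
        if gap < best_gap:
            best_kappa, best_gap = kappa, gap
    return best_kappa


# Hypothetical baseline residuals for six training instances.
residuals = [1, 1, 2, 2, 4, 10]

kappa = choose_kappa(residuals, rho=0.45)
LE = [r for r in residuals if r > kappa]    # instances with residual above kappa
SE = [r for r in residuals if r <= kappa]   # the remaining instances

print(kappa, LE, SE)  # 4 [10] [1, 1, 2, 2, 4]
```

With these numbers, the single instance with residual 10 carries half of the total residual, which is the closest achievable share to 0.45.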
  • 22. CPXR/CPXC: Filtering methods. The following patterns are filtered out:
      • Contrast patterns of LE with support ratio less than 1.
      • Patterns with tiny residual reduction ($arr$).
      • Patterns with Jaccard similarity more than 0.9, where
        $$J(P_1, P_2) = \frac{|mds(P_1, D) \cap mds(P_2, D)|}{|mds(P_1, D) \cup mds(P_2, D)|}$$
      • Patterns whose matching dataset is smaller than the number of predictor variables.
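The Jaccard-based filter can be sketched as follows. The slide gives the similarity measure and the 0.9 threshold; the greedy rule of keeping the earlier pattern of a near-duplicate pair, the function names, and the toy matching datasets are assumptions. The sketch also assumes every matching dataset is non-empty.

```python
def jaccard(ids_a, ids_b):
    """Jaccard similarity of two matching datasets given as collections of instance ids."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b)


def drop_near_duplicates(patterns, threshold=0.9):
    """Keep a pattern only if it is not too similar to an already kept one.

    patterns: list of (name, matched_ids), assumed sorted from most to
    least preferred, so the earlier pattern of a similar pair survives.
    """
    kept = []
    for name, ids in patterns:
        if all(jaccard(ids, kept_ids) <= threshold for _, kept_ids in kept):
            kept.append((name, ids))
    return kept


# Hypothetical matching datasets, given as instance ids.
patterns = [
    ("P1", range(0, 20)),
    ("P2", range(0, 19)),    # J(P1, P2) = 19/20 = 0.95, so P2 is dropped
    ("P3", range(10, 30)),   # J(P1, P3) = 10/30, so P3 is kept
]

kept_names = [name for name, _ in drop_near_duplicates(patterns)]
print(kept_names)  # ['P1', 'P3']
```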
  • 23. CPXR: Prediction Accuracy Evaluation
      • 50 real datasets and 23 synthetic datasets
      • Different criteria to generate synthetic datasets
      • Compare CPXR’s performance with 5 state-of-the-art regression methods
      • Overfitting and noise sensitivity
      • Analysis of parameters
      $$RMSE\ reduction = \frac{RMSE(LR) - RMSE(X)}{RMSE(LR)}$$
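The RMSE reduction measure is a relative improvement over linear regression (LR). A minimal sketch with made-up predictions follows; the helper names and the toy numbers are illustrative assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of predictions against true values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_reduction(rmse_lr, rmse_x):
    """Relative RMSE reduction of method X over linear regression (LR)."""
    return (rmse_lr - rmse_x) / rmse_lr


# Hypothetical predictions on four test instances.
y_true = [3, 5, 7, 9]
lr_pred = [5, 3, 9, 7]   # every LR error is 2, so RMSE(LR) = 2
x_pred = [4, 4, 8, 8]    # every X error is 1, so RMSE(X) = 1

reduction = rmse_reduction(rmse(y_true, lr_pred), rmse(y_true, x_pred))
print(reduction)  # 0.5
```

A value of 0.5 corresponds to a 50% RMSE reduction; a negative value means method X is worse than LR.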
  • 24. CPXR: Prediction Accuracy Evaluation

      CPXR’s performance vs. other methods (RMSE reduction, %)
      Dataset    PLR     SVR    BART    GBM      CPXR
      Tecator    40.62   0.16   19.35   -14.15   65.1
      Tree       17.68   7.92   -7.23   -10.82   61.73
      Wage       12.2    9.15   25.42   11.86    38.45
      Average    18.41   4.94   20.18   14.6     42.89

      • CPXR has the highest accuracy in 41 out of 50 datasets.
      • CPXR’s results are more accurate than LR in all 50 datasets.
      • In 20% of datasets, CPXR achieved more than 60% RMSE reduction.
  • 25. CPXR: Overfitting and Noise Sensitivity
      [Figure: drop in accuracy compared to clean test data (%) vs. noise (%), for BART, CPXR, Gradient Boosting, NN, and SVR.]
      [Figure: RMSE reduction on synthetic datasets, train and test, for NN, SVR, BART, and CPXR.]

      Method     Training   Test     Drop in accuracy
      PLR        37.11%     18.76%   49%
      SVR        7.65%      4.8%     37%
      BART       41.02%     20.15%   51%
      CPXR(LL)   51.4%      39.88%   22%
      CPXR(LP)   53.85%     42.89%   21%
  • 26. CPXR: Analysis of Parameters
      [Figure: RMSE improvement over LR vs. $k$ (number of patterns), vs. minSup, and vs. $r$, on datasets Fat, Mussels, and Price.]
      • 2% is the optimal minSup.
      • 7 patterns on average over the 50 datasets.
  • 27. Contrast Pattern Aided Classification (CPXC). Guozhu Dong and Vahid Taslimitehrani, "Pattern Aided Classification," SIAM Data Mining Conference, 2016.
  • 28. CPXC: PXC Concept. CPXC techniques are quite similar to those of CPXR, but CPXC has more challenges as well as more opportunities than CPXR. [Diagram: CPXC with Confidence of Match, Objective Functions, Classification Algorithms, and Loss Functions.]
  • 29. CPXC: Confidence of Match
      • Given $PXC = ((P_1, h_{P_1}, w_1), (P_2, h_{P_2}, w_2), \dots, (P_k, h_{P_k}, w_k), h_d)$, the class of an instance $x$ is determined by the weighted vote:
        $$weighted\text{-}vote(PXC, C_j, x) = \begin{cases} \dfrac{\sum_{P_i \in \pi_x} w_i \times match(x, P_i) \times h_{P_i}(x, C_j)}{\sum_{P_i \in \pi_x} w_i \times match(x, P_i)}, & \text{if } \pi_x \neq \emptyset \\ h_d(x, C_j), & \text{otherwise} \end{cases}$$
        where $\pi_x = \{P_i \mid 1 \le i \le k,\ match(x, P_i) > 0\}$ and $match(x, P_i) = \dfrac{|\{q \in MG(P_i) \mid x \text{ matches } q\}|}{|MG(P_i)|}$.
      • $match(x, P_i)$ is the fraction of the patterns $q$ in $MG(P_i)$ such that $x$ matches $q$.
      • $h_P(x, C_j)$ is the confidence score of local model $h_P$ on instance $x$ for class $C_j$.
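The weighted vote can be sketched in Python as below. The representation of $MG(P_i)$ as a list of predicate functions, the constant-confidence toy classifiers, and all names and numbers are assumptions for illustration; the sketch also assumes positive weights, so the denominator is zero exactly when no pattern matches.

```python
def match_degree(x, generators):
    """Fraction of the patterns in MG(P_i) that x matches."""
    return sum(1 for q in generators if q(x)) / len(generators)


def weighted_vote(x, label, triples, h_default):
    """Confidence-of-match weighted vote for `label` on instance x.

    triples: list of (generators, h, w), where generators stands in for
    MG(P_i) and h(x, label) returns the local confidence score.
    Assumes all weights w are positive.
    """
    numerator = denominator = 0.0
    for generators, h, w in triples:
        m = match_degree(x, generators)
        if m > 0:
            numerator += w * m * h(x, label)
            denominator += w * m
    return numerator / denominator if denominator > 0 else h_default(x, label)


# Hypothetical PXC model with two patterns.
triples = [
    ([lambda x: x["age"] >= 60, lambda x: x["bmi"] >= 30],
     lambda x, label: 0.75 if label == "pos" else 0.25, 1.0),
    ([lambda x: x["age"] >= 65, lambda x: x["bmi"] < 25],
     lambda x, label: 0.25 if label == "pos" else 0.75, 2.0),
]


def h_default(x, label):
    """Illustrative default classifier h_d."""
    return 0.3 if label == "pos" else 0.7


# match = 1.0 and 0.5: (1*1.0*0.75 + 2*0.5*0.25) / (1*1.0 + 2*0.5)
vote_matched = weighted_vote({"age": 70, "bmi": 31}, "pos", triples, h_default)
# No generator matches, so the default classifier decides.
vote_default = weighted_vote({"age": 40, "bmi": 27}, "pos", triples, h_default)

print(vote_matched, vote_default)  # 0.5 0.3
```

A partial match (here 0.5 for the second pattern) scales down that pattern's influence instead of excluding it outright.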
  • 30. CPXC: Loss Functions. [Figure: AUC by class-error loss function (Binary, Probabilistic, Standardized) on datasets ILPD, Hillvalley, and Planning.] The probabilistic error loss function returns the best results.
  • 31. CPXC: Base/Local Algorithms & Objective Functions
    Classification Algorithms
    • Different methods for baseline and local classifiers:
      – We used 6 classification algorithms for learning the baseline and local classifiers.
    Objective Functions
    • Quality measures on pattern sets:
      – We used trr, AUC, and ACC (accuracy) to measure the quality of a pattern set.
    • Quality measures on patterns and weights on local classifiers:
      – We used arr, AUC, and ACC (accuracy) to measure the quality of a pattern: arr is the winner!
  • 32. Experimental results
    • 19 public datasets
    • 8 classification algorithms
    • Noise sensitivity and overfitting
    • Running time
    • 7-fold cross-validation
    • minSup = 0.02, rho = 0.45
  • 33. CPXC: Performance
    Dataset       Boosting  DT    NBC   Log   RF    SVM   Max   CPXC (NBC-DT)
    Congress      0.58      0.66  0.6   0.57  0.58  0.58  0.66  0.86
    Poker         0.6       0.6   0.5   0.5   0.76  0.5   0.76  0.85
    HillValley    0.5       0.63  0.65  0.66  0.6   0.67  0.67  0.89
    Climate       0.96      0.81  0.9   0.94  0.97  0.98  0.98  0.97
    Mammography   0.94      0.91  0.94  0.94  0.93  0.93  0.94  0.98
    Steel         0.96      0.88  0.91  0.95  0.95  0.94  0.95  0.99
    • CPXC achieved an average AUC of 0.886 on the 8 hard datasets.
    • The average AUC of the best-performing traditional classifier (RF) on the hard datasets is 0.638.
    • CPXC's AUC is never lower than RF's on the hard datasets.
    • CPXC achieved an average AUC of 0.983 on the easy datasets, while the best-performing traditional algorithms obtained an average AUC of 0.968.
  • 34. CPXC: Noise Sensitivity
    Drop of AUC vs. noise levels
    Method/Noise  0%    5%     10%    15%    20%    Average
    RF            5.73  6.61   12.48  25.83  33.54  16.84
    CPXC          5.87  6.79   12.92  24.7   32.7   16.6
    Boosting      7.02  8.93   14.2   26.8   34.65  18.32
    Log           7.04  10.56  14.63  24.7   33.94  18.17
    NBC           7.06  10.58  15.26  27.89  35.1   19.18
    SVM           8.6   10.34  16.28  29.59  38.02  20.57
    DT            8.8   11.04  16.78  30.3   43.1   22.00
  • 35. CPXC: Impact of Parameters
    [Figures: AUC versus k (number of patterns), minSup, and r on the Blood, Congress, Hillvalley, and Planning datasets; AUC by objective function (TER, AUC, ACC) on the ILPD, Hillvalley, and Planning datasets.]
  • 36. Classification on Imbalanced Datasets
    • What is an imbalanced classification problem?
    • What are the real-world applications?
    • Why do traditional classification algorithms not perform well on imbalanced datasets?
    • What is our proposed solution?
    Classifying minority-class instances correctly might be more important than classifying majority-class instances.
  • 37. New Weighting Idea
    [Diagram: training dataset, baseline model, classification, split into LE and SE, weighting.]
    • $err^{*}(h_b, x) = \begin{cases} err(h_b, x) \times \delta, & \text{if } x \in \text{minority class instances} \\ err(h_b, x), & \text{if } x \in \text{majority class instances} \end{cases}$
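    The reweighting rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name `reweight_errors` and the sample values are hypothetical, and the slide does not state how δ is chosen, so it is left as a plain parameter.

```python
from typing import List, Sequence


def reweight_errors(errors: Sequence[float],
                    is_minority: Sequence[bool],
                    delta: float) -> List[float]:
    """Return err*(h_b, x): baseline errors scaled by delta for minority-class instances only."""
    if len(errors) != len(is_minority):
        raise ValueError("errors and is_minority must have the same length")
    return [e * delta if minority else e
            for e, minority in zip(errors, is_minority)]


# Hypothetical baseline errors for three instances; the first and third are minority-class.
example_errors = reweight_errors([0.2, 0.5, 0.1], [True, False, True], delta=3.0)
```

    With a δ greater than 1 (an assumption for this example), minority-class errors grow relative to majority-class errors, so any later step that ranks or splits instances by error gives the minority class more influence.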
  • 38. A Filtering Method to Remove Imbalanced Local Models
    • $IR(mds(P, D)) = \dfrac{\text{Number of instances in the majority class}}{\text{Number of instances in the minority class}}$
    [Diagram: patterns mapped to local models.]
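    A minimal sketch of such a filter is shown below. It assumes that the imbalance ratio is computed on the instances of D that match a pattern, and that a pattern is dropped when this ratio exceeds a cutoff. The cutoff `max_ir`, the function names, and the toy data are hypothetical; the slide does not give the actual filtering rule or threshold.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def imbalance_ratio(labels: Sequence[Hashable],
                    majority_label: Hashable,
                    minority_label: Hashable) -> float:
    """Majority-class count divided by minority-class count (inf if no minority instances)."""
    n_major = sum(1 for y in labels if y == majority_label)
    n_minor = sum(1 for y in labels if y == minority_label)
    if n_minor == 0:
        return float("inf")
    return n_major / n_minor


def filter_patterns(patterns: Sequence[Tuple[str, Callable[[object], bool]]],
                    data: Sequence[object],
                    labels: Sequence[Hashable],
                    majority_label: Hashable,
                    minority_label: Hashable,
                    max_ir: float) -> List[str]:
    """Keep the names of patterns whose matching data has an imbalance ratio of at most max_ir."""
    kept = []
    for name, matches in patterns:
        matched_labels = [y for x, y in zip(data, labels) if matches(x)]
        if matched_labels and imbalance_ratio(
                matched_labels, majority_label, minority_label) <= max_ir:
            kept.append(name)
    return kept


# Hypothetical toy data: eight one-dimensional instances, one of them minority-class.
toy_data = [1, 2, 3, 4, 5, 6, 7, 8]
toy_labels = ["maj", "maj", "maj", "min", "maj", "maj", "maj", "maj"]
toy_patterns = [("low", lambda x: x <= 4), ("high", lambda x: x > 4)]
kept_patterns = filter_patterns(toy_patterns, toy_data, toy_labels, "maj", "min", max_ir=5.0)
```

    In the toy data, the "high" pattern matches only majority-class instances, so its ratio is infinite and it is filtered out, while the "low" pattern is kept.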
  • 39. Experimental results
    • The average AUC of CPXCim is 14% and 15.2% more than the AUC of SMOTE and SMOTE-TL, respectively.
    • The performance of CPXCim is always better than that of the other imbalanced classifiers on these 10 datasets.
    CPXCim's performance
    Dataset        # of instances  # of variables  Imbalance ratio  CPXCim  SMOTE   SMOTE-TL
    Yeast          1004            8               9.14             0.942   0.7728  0.772
    Led7digit      443             7               10.97            0.978   0.8919  0.897
    flareF         1066            11              23.79            0.883   0.7463  0.809
    Wine Quality   1599            11              29.17            0.76    0.6008  0.59
    Average        -               -               -                0.92    0.798   0.807
  • 40. Applications of CPXR & CPXC
    • Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283–290 (Best Student Paper).
    • Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
    • Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
  • 41. Application: Traumatic Brain Injury
    What is Traumatic Brain Injury (TBI)? It is an important public health problem and a leading cause of death and disability worldwide.
    Problem definition: prediction of a patient's outcome within 6 months after the TBI event, using the admission data.
    • Dataset: 2159 patients collected from a trial, with 15 predictor variables.
    • Two class variables: mortality and unfavorable outcome.
    Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283–290 (Best Student Paper Award).
  • 42. Application: Traumatic Brain Injury
    CPXR(Log)'s performance
    Model          Basic       Basic+CT    Basic+CT+Lab
    Unfavorable
      Specificity  0.89(0.85)  0.87(0.85)  0.91(0.84)
      Sensitivity  0.54(0.52)  0.65(0.6)   0.72(0.61)
      Accuracy     0.75(0.72)  0.79(0.75)  0.87(0.75)
      F1           0.63(0.59)  0.7(0.66)   0.76(0.66)
      AUC          0.82(0.76)  0.87(0.8)   0.93(0.81)
    Performance changes when we add more variables
    Variable set change       Mortality CPXR(Log)  Mortality Log  Unfavorable CPXR(Log)  Unfavorable Log
    Basic → Basic+CT          10%                  7.7%           6%                     5.2%
    Basic+CT → Basic+CT+Lab   4.5%                 2.5%           6.8%                   1.25%
    Basic → Basic+CT+Lab      15%                  11.1%          13.4%                  6.6%
    [Figure: ROC curves (true positive rate versus false positive rate) for CPXR(Log), SLogR, SVM, and RF. AUC_CPXR(Log) = 0.87, AUC_SLogR = 0.8, AUC_RF = 0.72, AUC_SVM = 0.7.]
  • 43. Application: Heart Failure Survival Risk Models
    • Collaboration with Mayo Clinic.
    • Problem definition: heart failure survival prediction models.
    • An EHR dataset on 119,749 patients admitted to Mayo Clinic.
    • Predictor variables are grouped in the following categories:
      – Demographic, Vitals, Labs, Medications, and 24 major chronic conditions as co-morbidities.
    • Three groups of CPXC models are developed to predict survival 1, 2, and 5 years after the heart failure event.
    Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
  • 44. Application: Heart Failure Survival Risk Models
    Performance of different classifiers
    Algorithm             1 Year  2 Year  5 Year
    Decision Tree         0.66    0.5     0.5
    Random Forest         0.8     0.72    0.72
    Ada Boost             0.74    0.71    0.68
    SVM                   0.59    0.52    0.52
    Logistic Regression   0.81    0.74    0.73
    CPXC                  0.937   0.83    0.786
    Odds ratios of PXC local models
    Variable        Log   f1    f2    f3    f4    f5    f6    f7
    Alzheimer       1.75  1.74  0.80  1.88  1.59  1.29  1.58  0.75
    Breast Cancer   0.63  1.15  1.62  2.73  1.00  1.00  2.08  0.59
  • 45. Application: Heart Failure Survival Risk Models
    Performance changes when we add more variables
    Variable sets                                                    CPXC   Log    RF     SVM    DT      Boosting
    (Demo&Vital) → (Demo&Vital)+Lab                                  4.8%   11.5%  19%    17.3%  0%      14.7%
    (Demo&Vital) → (Demo&Vital)+Lab+Med                              8.9%   13.4%  21.2%  21.7%  0%      5.7%
    (Demo&Vital) → (Demo&Vital)+Lab+Med+Co-morbid                    27.8%  9.6%   19.1%  19.5%  -10.4%  7.6%
    (Demo&Vital)+Lab → (Demo&Vital)+Lab+Med                          3.2%   1.7%   1.7%   3.7%   0%      -9.8%
    (Demo&Vital)+Lab → (Demo&Vital)+Lab+Med+Co-morbid                20.9%  -1.7%  0%     1.8%   -10.4%  -8.1%
    (Demo&Vital)+Lab+Med → (Demo&Vital)+Lab+Med+Co-morbid            15.9%  -3.3%  -1.7%  -1.7%  -10.4%  1.8%
    Adding co-morbidities:
    • decreased the AUC of other classifiers by 5.3% on average.
    • increased the AUC of CPXC by 21.5% on average.
  • 46. Application: Saturated Hydraulic Conductivity
    • Collaboration with University of Texas at Austin and USDA-ARS.
    • Problem definition:
      1. Prediction of the soil water retention curve (SWRC).
      2. Prediction of saturated hydraulic conductivity (SHC).
      3. Investigating the effect of sample dimensions on prediction accuracy.
    • Number of predictor variables: 6-13.
    • Number of response variables: 10.
    • 32 CPXR models are developed.
    Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
  • 47. Application: Saturated Hydraulic Conductivity
    [Figures: two scatter plots of predicted log(Ksat) [cm day-1] versus measured log(Ksat) [cm day-1] for SHC2, one with RMSLE = 0.456 and one with RMSLE = 1.936.]
    Model               s     t     10    30    50    100   300   500   1000  1500
    Linear Regression
      SWRC1             0.79  0.73  0.77  0.84  0.85  0.84  0.83  0.84  0.81  0.77
      SWRC2             0.79  0.72  0.77  0.85  0.84  0.84  0.84  0.83  0.80  0.78
    CPXR
      SWRC1             0.94  0.97  0.97  0.94  0.97  0.97  0.95  0.96  0.95  0.94
      SWRC2             0.95  0.96  0.94  0.95  0.97  0.96  0.95  0.98  0.97  0.94
  • 48. Conclusion
    • A new type of highly accurate and interpretable regression and classification model, PXR/PXC, is presented.
    • New techniques to build PXR and PXC models are discussed.
    • Each pattern-model pair represents a distinct predictor-response interaction.
    • PXR and PXC models are more accurate, more interpretable, and less prone to overfitting than other regression and classification algorithms.
    • A new method adapted from CPXC is presented to handle classification of imbalanced datasets.
    • Several applications of CPXR and CPXC are discussed.
  • 49. Related publications
    • Guozhu Dong, Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 2015.
    • Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283–290 (Best Student Paper).
    • Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
    • Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
    • Guozhu Dong, Vahid Taslimitehrani. Pattern Aided Classification. SIAM Data Mining Conference, 2016.
  • 50. Acknowledgement

Editor's Notes

  1. Reference:
  2. HF example, old and young patient
  3. We propose a methodology that addresses those challenges.