Contrast Pattern Aided Regression and Classification

Ohio Center of Excellence in Knowledge-Enabled Computing
Ph.D. Dissertation Defense:
Contrast Pattern Aided Regression and
Classification
February 19, 2016
Vahid Taslimitehrani
Kno.e.sis Center, CSE Dept., Wright State University, USA
Committee Members: Prof. Guozhu Dong (advisor, WSU), Prof. Amit Sheth (WSU),
Prof. T.K. Prasad (WSU), Dr. Keke Chen (WSU), and Prof. Jyotishman Pathak
(Cornell University)
1

2

3
Does Asthma decrease
the mortality risk from
Pneumonia?

Accuracy vs. Interpretability
4
Accuracy
Interpretability
Low
High
High
Lasso
Linear/Logistic
Regression
Naïve Bayes
Decision Trees
Splines
Nearest
Neighbors
Bagging
Neural Nets
SVM
Boosting
Random Forest
Deep Learning
CPXR/CPXC
Source: Joshua Bloom and Henrik Brink of wise.io
*on real dataset

5
Modeling Techniques Lack Accuracy
and Interpretability
Heterogeneity &
Diversity of Given
Dataset
Predictors-Response
Interactions
Universal Model’s
Assumption

Predictors-Response Interactions
6
Interactive effect:
The effect of a variable on prediction
changes and varies with changes in the
values of other independent variable(s)
which are interacting with the variable.
It is not the genes or the environment!
It is their interaction that’s important.

Universal Model’s Assumption &
Heterogeneity
What is the universal model’s
assumption?
7
What are heterogeneous and
diverse data points?

Solution
1.New type of regression & classification models called Pattern
Aided Regression and Classification (PXR and PXC)
2.The new algorithms to build PXR and PXC models called Contrast
Pattern Aided Regression and Classification (CPXR and CPXC)
3.The new algorithm to handle imbalanced datasets called Contrast
Pattern Aided Classification on Imbalanced datasets (CPXCim)
8
Our proposed methodology has three components:

Preliminaries: patterns
• A pattern (rule) is a set of conditions describing set of objects.
• Example:
"𝑨𝒈𝒆 ≥ 60" AND “History of hypertension = YES”
is a pattern (rule) describing:
All patients more than 60 years old AND have a history of Hypertension.
• An object matches a pattern if it satisfies every condition in the pattern.
9
Patient ID Age BMI History of Hypertension Diagnosed with Heart Failure
1 75 22 YES YES
2 67 27 NO NO

Preliminaries: matching dataset and
contrast patterns
• The matching dataset of pattern 𝑃 in dataset 𝐷 or 𝑚𝑑𝑠(𝑃, 𝐷) is the set of all
instances matching pattern 𝑃.
• The support of pattern 𝑃 in 𝐷 is 𝑠𝑢𝑝𝑝 𝑃, 𝐷 =
𝑚𝑑𝑠(𝑃,𝐷)
𝐷
.
• Contrast patterns: patterns that distinguish objects in different classes. A
pattern is contrast pattern if it matches many objects in one class than in
another class.
• An equivalent class (EC) is a set of patterns with same matching datasets
(having same behavior).
10

Introduction: CPXR/CPXC overview
11
𝑷: pattern
𝒇: model
A pattern logically
characterizes a sub-
group of data.
A local model represents
predictor-response
interactions among the
data points of a sub-
group of data.
Regression
Classification
𝒇
CPXR/CPXC
(𝑷 𝟏, 𝒇 𝟏)
(𝑷 𝟐, 𝒇 𝟐)
Local model algorithms
can be simple as linear
regression.

Diversity of predictor-response
relationships
• Different pattern-model pairs emphasize different sets of
variables.
• Different pattern-model pairs use highly different
regression/classification models.
• Diverse predictor-response relationships may be neutralized
at the global level.
12

Introduction: Thesis Statement
Study regression and classification techniques to produce accurate
and interpretable models capable of adequately representing
complex and diverse predictor-response interactions and revealing
high intra-dataset heterogeneity.
13

Contrast Pattern Aided Regression
(CPXR)
14
Guozhu Dong, Vahid Taslimitehrani, Pattern-Aided Regression
Modeling and Prediction Model Analysis. in IEEE Transactions
on Knowledge and Data Engineering, vol.27, no.9, pp.2452-
2465, Sept. 1 2015

A pictorial illustration of a simple PXR
model
15
A small dataset with 100 instances and 2 numerical
predictor variables.
• Different patterns can involve different sets of variables
[describing data regions in different subspaces]
• Matching datasets of different patterns can overlap
0
2
4
6
8
10
0 2 4 6 8 10

PXR concepts
16
Regression
Classification
𝒇 𝒃
Given a training dataset 𝐷 =
(𝑥𝑖, 𝑦𝑖) 1 ≤ 𝑖 ≤ 𝑛 , a regression
model built on 𝐷 is called
baseline model and given as 𝑓𝑏.
(𝑷 𝟏, 𝒇 𝑷 𝟏
)
(𝑷 𝟐, 𝒇 𝑷 𝟐
)
CPXR/CPXC
Given the matching dataset
of pattern 𝑃, 𝑚𝑑𝑠(𝑃, 𝐷), a
regression built on
𝑚𝑑𝑠 𝑃, 𝐷 is called local
model and is shown by 𝑓𝑃.

Pattern Items Local Model Match
𝑃1 𝑓1
𝑃2 𝑓2
𝑃3 𝑓3
𝑃4 𝑓4
𝑃5 𝑓5
𝑃6 𝑓6
Pattern Aided Regression (PXR)
17
• 𝑃𝑋𝑅 = ( 𝑃1, 𝑓1, 𝑤1 , 𝑃2, 𝑓2, 𝑤2 , … , 𝑃𝑘, 𝑓𝑘, 𝑤 𝑘 , 𝑓𝑑)
• The regression function of PXR as:
𝑓𝑃𝑋𝑅 =
𝑃 𝑖∈𝜋 𝑥
𝑤𝑖 𝑓𝑖(𝑥)
𝑤𝑖
, 𝑖𝑓 𝜋 𝑥 ≠ ∅
𝑓𝑑, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
where 𝜋 𝑥 = 𝑃𝑖 1 ≤ 𝑖 ≤ 𝑘, 𝑥 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 𝑃𝑖
Case 3:
Case 2:
Case 1:

CPXR/CPXC: Quality Measures
• The average residual reduction (arr) of a pattern 𝑃 w.r.t to a prediction
model 𝑓 on a dataset 𝐷 is:
𝑎𝑟𝑟 𝑃 =
𝑥∈𝑚𝑑𝑠(𝑃,𝐷) 𝑟 𝑥(𝑓 𝑏) − 𝑥∈𝑚𝑑𝑠(𝑃,𝐷) 𝑟 𝑥(𝑓 𝑃)
𝑚𝑑𝑠(𝑃,𝐷)
• The total residual reduction (trr) of a PXR/PXC is:
𝑡𝑟𝑟 𝑃𝑋𝑅/𝑃𝑋𝐶 =
𝑥∈𝑚𝑑𝑠(𝑃𝑆,𝐷) 𝑟𝑥(𝑓𝑏) − 𝑥∈𝑚𝑑𝑠(𝑃𝑆,𝐷) 𝑟𝑥(𝑓𝑃𝑋𝑅/𝑃𝑋𝐶)
𝑥∈𝐷 𝑟𝑥(𝑓)
Where 𝑃𝑆 = 𝑃1, … , 𝑃𝑘 is the pattern set, 𝑟𝑥(𝑓) is the 𝑓’s residual on an
instance 𝑥 and 𝑚𝑑𝑠 𝑃𝑆, 𝐷 = 𝑖=1
𝑘
𝑚𝑑𝑠(𝑃𝑖, 𝐷).
18

CPXR Algorithm
19
Dataset D CPXR
Phase1
Phase2
Phase3
Goal: A small set of cooperating patterns, where each pattern
characterize a subgroup of data points.
• A baseline model makes large residual errors on data points in
the subgroup.
• A highly accurate model is found to correct those errors.

CPXR Algorithm
20
Baseline
model
Regression/
Classification
LE
SE
Training
Dataset
𝑃2
𝑃3
…
…
(𝑓2, 𝑤2)
(𝑓3, 𝑤3)
…
…
Patterns Local Models
Pattern
Mining
[(𝑃1, 𝑓1, 𝑤1) , (𝑃4, 𝑓4, 𝑤4) , … , (𝑃𝑘, 𝑓𝑘, 𝑤 𝑘)]
(𝑓1, 𝑤1)
(𝑓4, 𝑤4)
(𝑓𝑘, 𝑤 𝑘)
𝑃1
𝑃4
𝑃𝑘

• How to determine spliting point 𝜅?
Minimize 𝜌 −
𝑟 𝑖>𝜅 𝑟 𝑖
𝑟 𝑖
• How to select patterns from C𝑃𝑆?
Lets 𝑃𝑆 = 𝑃0 , where 𝑃0 is the pattern 𝑃 in C𝑃𝑆 with the highest 𝑎𝑟𝑟
21
0
1
2
3
4
5
6
0 50 100 150 200
SE LE
CPXR Algorithm

CPXR/CPXC: Filtering methods
• Contrast patterns of LE with support ratio less than 1.
• Patterns with tiny residual reduction (𝑎𝑟𝑟).
• Patterns with Jaccard similarity more than 0.9
𝐽 𝑃1, 𝑃2 =
𝑚𝑑𝑠(𝑃1, 𝐷) ∩ 𝑚𝑑𝑠(𝑃2, 𝐷)
𝑚𝑑𝑠(𝑃1, 𝐷) ∪ 𝑚𝑑𝑠(𝑃2, 𝐷)
• Patterns with the size of matching datasets less than the number of
predictor variables.
22

CPXR: Prediction Accuracy Evaluation
• 50 real datasets and 23 synthetic datasets
• Different criteria to generate synthetic datasets
• Compare CPXR’s performance with 5 state-of-the-art
regression methods
• Overfitting and noise sensitivity
• Analysis of parameters
23
𝑅𝑀𝑆𝐸 𝑟𝑒𝑑𝑢𝑐𝑡𝑖𝑜𝑛 =
𝑅𝑀𝑆𝐸 𝐿𝑅 − 𝐸𝑀𝑆𝐸(𝑋)
𝑅𝑀𝑆𝐸(𝐿𝑅)

CPXR: Prediction Accuracy Evaluation
24
Dataset PLR SVR BART GBM CPXR
Tecator 40.62 0.16 19.35 -14.15 65.1
Tree 17.68 7.92 -7.23 -10.82 61.73
Wage 12.2 9.15 25.42 11.86 38.45
Average 18.41 4.94 20.18 14.6 42.89
CPXR’s
performance
vs. other
methods
• CPXR has the highest accuracy in 41 out of 50 datasets.
• CPXR’s results are more accurate than LR in all 50 datasets.
• In 20% of datasets, CPXR achieved more than 60% RMSE
reduction.

CPXR: Overfitting and Noise Sensitivity
25
5 10 15 20
102030405060
Noise(%)
Dropinaccuracycomparingtocleantestdata(%)
●
●
●
●
●
Datasets
BART
CPXR
Gradient Boosting
NN SVR BART CPXR
0.00.20.40.6
NN SVR BART CPXR
−0.2−0.10.00.10.20.30.4
RMSE
reduction on
synthetic
datasets
Train - Test
Method Training Test
Drop in
accuracy
PLR 37.11% 18.76% 49%
SVR 7.65% 4.8% 37%
BART 41.02% 20.15% 51%
CPXR(LL) 51.4% 39.88% 22%
CPXR(LP) 53.85% 42.89% 21%

CPXR: Analysis of Parameters
26
5 10 15 20
0.350.400.450.500.550.600.65
k (Number of patterns)
RMSEimprovementoverLR
●
●
●
●
●
●
Datasets
Fat
Mussels
Price
0.02 0.04 0.06 0.08 0.10
0.250.300.350.400.450.500.550.60
minSup
● ●
●
●
●
Datasets
Fat
Mussels
Price
0.40 0.45 0.50 0.55 0.60 0.65 0.70
0.350.400.450.500.550.60
r
● ●
●
● ●
● ●
●
Datasets
Fat
Mussels
Price
2% is the optimal minSup.7 patterns as average on
50 datasets.

Contrast Pattern Aided Classification
(CPXC)
27
Guozhu Dong, Vahid Taslimitehrani, Pattern Aided
Classification, SIAM Data Mining Conference, 2016

CPXC: PXC Concept
CPXC techniques are quite
similar to those of CPXR
but CPXC has more
challenges as well as more
opportunities than CPXR
28
CPXC
Confidence
of Match
Objective
Functions
Classification
Algorithms
Loss
Functions

CPXC: Confidence of Match
• Given 𝑃𝑋𝐶 = ( 𝑃1, ℎ 𝑃1
, 𝑤1 , 𝑃2, ℎ 𝑃2
, 𝑤2 , … , 𝑃𝑘, ℎ 𝑃 𝑘
, 𝑤 𝑘 , ℎ 𝑑), the class variable
of an instance 𝑥 is defined as:
𝑤𝑒𝑖𝑔ℎ𝑡𝑑 − 𝑣𝑜𝑡𝑒 (𝑃𝑋𝐶, 𝐶𝑗, 𝑥)
=
𝑤𝑖 × 𝑚𝑎𝑡𝑐ℎ (𝑥, 𝑝𝑖) × ℎ 𝑝 𝑖
(𝑥, 𝐶𝑗)
𝑤𝑖 × 𝑚𝑎𝑡𝑐ℎ (𝑥, 𝑝𝑖)
, 𝑖𝑓 𝜋 𝑥 ≠ ∅
ℎ 𝑑, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
where 𝜋 𝑥 = 𝑃𝑖 1 ≤ 𝑖 ≤ 𝑘, 𝑚𝑎𝑡𝑐ℎ 𝑥, 𝑝𝑖 > 0
and
𝑚𝑎𝑡𝑐ℎ 𝑥, 𝑝𝑖 =
𝑞 𝑖 𝜖𝑀𝐺(𝑝 𝑖) 𝑡 𝑚𝑎𝑡𝑐ℎ𝑒𝑠 𝑝 𝑖
𝑀𝐺(𝑝 𝑖)
• 𝑚𝑎𝑡𝑐ℎ 𝑥, 𝑝𝑖 is the fraction of 𝑀𝐺 ‘s 𝑞 in 𝑀𝐺 𝑝𝑖 such that 𝑥 matches 𝑞.
• ℎ 𝑝(𝑥, 𝐶𝑗) is the confidence score of local model ℎ on instance 𝑥 for class 𝐶𝑗.
29
Confidence
of Match

CPXC: Loss Functions
30
0.600.650.700.750.800.850.90
ClassError
AUC
●
●
●
Binary Probabilistic Standardized
●
Datasets
ILPD
Hillvalley
Planning
Probabilistic error loss
function returns the
best results.
Loss
Functions

CPXC: Base/Local Algorithms & Objective
Functions
• Different methods for baseline and local classifiers:
– We used 6 classification algorithm for learning the
baseline and local classifiers
31
Classification
Algorithms
• Quality measures on pattern sets
– We used 𝑡𝑟𝑟, AUC, and ACC (accuracy) to measure the
quality of a pattern set
• Quality measures on patterns and weights on local classifiers
– We used 𝑎𝑟𝑟, AUC, and ACC (accuracy) to measure the
quality of a pattern: 𝑎𝑟𝑟 is the winner!
Objective
Functions

Experimental results
32
19
Public
Datasets
8
Classification
Algorithms
Noise
Sensitivity &
Overfitting
Running
Time
7
Fold Cross
Validation
minSup = 0.02
rho = 0.45

CPXC: Performance
Dataset Boosting DT NBC Log RF SVM Max CPXC (NBC-DT)
Congress 0.58 0.66 0.6 0.57 0.58 0.58 0.66 0.86
Poker 0.6 0.6 0.5 0.5 0.76 0.5 0.76 0.85
HillValley 0.5 0.63 0.65 0.66 0.6 0.67 0.67 0.89
Climate 0.96 0.81 0.9 0.94 0.97 0.98 0.98 0.97
Mammography 0.94 0.91 0.94 0.94 0.93 0.93 0.94 0.98
Steel 0.96 0.88 0.91 0.95 0.95 0.94 0.95 0.99
33
• CPXC achieved average AUC of 0.886 on the 8 hard datasets.
• Average AUC of the best performing traditional classifier (RF) on hard datasets is 0.638.
• CPXC’s AUC is never lower than RF on the hard datasets.
• CPXC achieved average AUC of 0.983 on the easy datasets while the best performing
traditional algorithms obtained average AUC of 0.968.

CPXC: Noise Sensitivity
34
Drop of AUC vs. noise levels
Method/Noise 0% 5% 10% 15% 20% Average
RF 5.73 6.61 12.48 25.83 33.54 16.84
CPXC 5.87 6.79 12.92 24.7 32.7 16.6
Boosting 7.02 8.93 14.2 26.8 34.65 18.32
Log 7.04 10.56 14.63 24.7 33.94 18.17
NBC 7.06 10.58 15.26 27.89 35.1 19.18
SVM 8.6 10.34 16.28 29.59 38.02 20.57
DT 8.8 11.04 16.78 30.3 43.1 22.00

CPXC: Impact of Parameters
35
4 6 8 10 12 14
0.750.800.850.90
k (Number of patterns)
AUC
●
●
●
●
● ●
●
Datasets
Blood
Congress
Hillvalley
Planning
0.02 0.04 0.06 0.08 0.10
0.700.750.800.850.90
minSup
AUC
●
●
●
●
●
Datasets
Blood
Congress
Hillvalley
Planning
0.840.850.860.870.880.890.90
Objective Function
AUC
●
●
●
TER AUC ACC
●
Datasets
ILPD
Hillvalley
Planning
0.3 0.4 0.5 0.6 0.7
0.780.800.820.840.860.880.90
r
AUC
●
●
●
● ●
●
●
●
●
●
Datasets
Blood
Congress
Hillvalley
Planning

36
Classification on Imbalanced Datasets
• What is an imbalanced classification problem?
• What are the real world applications?
• Why traditional classification algorithms do not perform well on
imbalanced datasets?
• What is our proposed solution?
Classifying minority instances might be more important that majority class.

LE
SE
37
Baseline
model
Classification
LE
SE
Training
Dataset
Weighting
• 𝑒𝑟𝑟∗ ℎ 𝑏, 𝑥 =
𝑒𝑟𝑟 ℎ 𝑏, 𝑥 × 𝛿, 𝑖𝑓𝑥 ∈ 𝑚𝑖𝑛𝑜𝑟𝑖𝑡𝑦 𝑐𝑙𝑎𝑠𝑠 𝑖𝑛𝑠𝑡𝑎𝑛𝑥𝑐𝑒𝑠
𝑒𝑟𝑟(ℎ 𝑏, 𝑥), 𝑖𝑓𝑥 ∈ 𝑚𝑎𝑗𝑜𝑟𝑖𝑡𝑦 𝑐𝑙𝑎𝑠𝑠 𝑖𝑛𝑠𝑡𝑎𝑛𝑥𝑐𝑒𝑠
New Weighting idea

A Filtering Method to Remove Imbalanced
Local Models
38
• 𝐼𝑅 𝑚𝑑𝑠 𝑃, 𝐷 =
Number of instances in the majority class
Number of instances in the minority class
𝑃1
𝑃2
𝑃3
𝑃4
…
…
𝑃𝑘
(𝑓1, 𝑤1)
(𝑓2, 𝑤2)
(𝑓3, 𝑤3)
(𝑓4, 𝑤4)
…
…
(𝑓𝑘, 𝑤 𝑘)
Patterns Local Models

Experimental results
39
• The average AUC of CPXCim is 14% and 15.2% more than the AUC of
SMOTE and SMOTE-TL, respectively.
• The performance of CPXCim is always better than other imbalanced
classifiers on these 10 datasets.
CPXCim’s performance
Dataset
# of
instances
# of
variables
Imbalance
ratio
CPXCim SMOTE SMOTE-TL
Yeast 1004 8 9.14 0.942 0.7728 0.772
Led7digit 443 7 10.97 0.978 0.8919 0.897
flareF 1066 11 23.79 0.883 0.7463 0.809
Wine Quality 1599 11 29.17 0.76 0.6008 0.59
Average - - - 0.92 0.798 0.807

Applications of CPXR & CPXC
40
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical
Prognostic Modeling Results Using the Method on Traumatic Brain Injury", IEEE International
Conference on Bioinformatics and Bioengineering (BIBE), 2014, On page(s): 283 – 290 (Best Student
Paper)
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions
effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of
Hydrology. 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak.
Develolping HER-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function.
Journal of Biomedical Informatics (2016).

Application: Traumatic Brain Injury
What is Traumatic Brain Injury (TBI)?
It is an important public health problem and a leading
cause of death and disability worldwide.
Problem definition: prediction of patients outcome
within 6 months after TBI event, using the admission data.
• Dataset: 2159 patients collected from a trial and 15 predictor variables
• Two class variables: mortality and unfavorable outcome.
41
Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression
Method and Clinical Prognostic Modeling Results Using the Method on
Traumatic Brain Injury", Bioinformatics and Bioengineering (BIBE), 2014
IEEE International Conference on, On page(s): 283 – 290 (Best Student
Paper Award)

Application: Traumatic Brain Injury
Model Basic Basic+CT Basic+CT+Lab
Unfavorable
Specificity 0.89(0.85) 0.87(0.85) 0.91(0.84)
Sensitivity 0.54(0.52) 0.65(0.6) 0.72(0.61)
Accuracy 0.75(0.72) 0.79(0.75) 0.87(0.75)
F1 0.63(0.59) 0.7(0.66) 0.76(0.66)
AUC 0.82(0.76) 0.87(0.8) 0.93(0.81)
42
Variable set change
Mortality Unfavorable
CPXR(Log) Log CPXR(Log) Log
Basic Basic+CT 10% 7.7% 6% 5.2%
Basic+CTBasic+CT+Lab 4.5% 2.5% 6.8% 1.25%
BasicBasic+CT+Lab 15% 11.1% 13.4% 6.6%
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
False positive rate
Truepositiverate
CPXR(Log)
SLogR
SVM
RF
AUC_CPXR(Log) = 0.87
AUC_SLogR = 0.8
AUC_RF = 0.72
AUC_SVM = 0.7
Performance changes when we add more variables
CPXR(Log)’s
performance

Application: Heart Failure Survival Risk
Models
• Collaboration with Mayo Clinic
• Problem definition: Heart Failure survival prediction models.
• An EHR dataset on 119,749 patients admitted to Mayo Clinic.
• Predictor variables are grouped in the following categories:
– Demographic, Vitals, Labs, Medications and 24 major chronic conditions as co-
morbidities.
• Three groups of CPXC models are developed to predict survival in 1, 2 and 5 years
after heart failure event.
43
Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak.
Develolping HER-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function.
Journal of Biomedical Informatics (2016).

Models
Algorithm 1 Year 2 Year 5 Year
Decision Tree 0.66 0.5 0.5
Random Forest 0.8 0.72 0.72
Ada Boost 0.74 0.71 0.68
SVM 0.59 0.52 0.52
Logistic Regression 0.81 0.74 0.73
CPXC 0.937 0.83 0.786
44
Variable Log f1 f2 f3 f4 f5 f6 f7
Alzheimer 1.75 1.74 0.80 1.88 1.59 1.29 1.58 0.75
Breast Cancer 0.63 1.15 1.62 2.73 1.00 1.00 2.08 0.59
Odds ratios of PXC local models
Performance of difference classifiers

Models
Variable sets CPXC Log RF SVM DT Boosting
(Demo&Vital)  (Demo&Vital) +Lab 4.8% 11.5% 19% 17.3% 0% 14.7%
(Demo&Vital)  (Demo&Vital) +Lab+Med 8.9% 13.4% 21.2% 21.7% 0% 5.7%
(Demo&Vital)  (Demo&Vital) +Lab+Med+Co-morbid 27.8% 9.6% 19.1% 19.5% -10.4% 7.6%
(Demo&Vital) +Lab (Demo&Vital) +Lab+Med 3.2% 1.7% 1.7% 3.7% 0% -9.8%
(Demo&Vital) +Lab (Demo&Vital) +Lab+Med+Co-morbid 20.9% -1.7% 0% 1.8% -10.4% -8.1%
(Demo&Vital) +Lab+Med (Demo&Vital) +Lab+Med+Co-morbid 15.9% -3.3% -1.7% -1.7% -10.4% 1.8%
45
Adding co-morbidities:
• decreased the AUC of other classifiers by 5.3% on average.
• increased the AUC of CPXC by 21.5% on average.
Performance changes when we add more variables

Application: Saturated Hydraulic
Conductivity
• Collaboration with University of Texas at Austin and USDA-ARS
• Problem definition:
1. Prediction of the soil water retention curve (SWRC)
2. Prediction of Saturated Hydraulic Conductivity (SHC)
3. Investigating the effect of sample dimensions on
prediction accuracy.
• Number of predictor variables: 6-13
• Number of response variables: 10
• 32 CPXR models are developed.
46
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample
dimensions effect on prediction of soil water retention curve and saturated hydraulic
conductivity. Journal of Hydrology. 528 (2015): 127-137.

Application: Saturated Hydraulic
Conductivity
47
-4
-2
0
2
4
6
8
10
-4 -2 0 2 4 6 8 10
Predictedlog(Ksat)[cmday-1]
Measured log(Ksat) [cm day-1]
SHC2
RMSLE = 0.456
-4
-2
0
2
4
6
8
10
-4 -2 0 2 4 6 8 10
Predictedlog(Ksat)[cmday-1]
Measured log(Ksat) [cm day-1]
SHC2
RMSLE = 1.936
Model
s t 10 30 50 100 300 500 1000 1500
Linear Regression
SWRC1 0.79 0.73 0.77 0.84 0.85 0.84 0.83 0.84 0.81 0.77
SWRC2 0.79 0.72 0.77 0.85 0.84 0.84 0.84 0.83 0.80 0.78
CPXR
SWRC1 0.94 0.97 0.97 0.94 0.97 0.97 0.95 0.96 0.95 0.94
SWRC2 0.95 0.96 0.94 0.95 0.97 0.96 0.95 0.98 0.97 0.94

Conclusion
• A new type of highly accurate and interpretable regression and classification
models, PXR/PXC are presented.
• New techniques to build PXR and PXC models are discussed.
• Each pair of pattern-model represents a diverse predictor-response interaction.
• PXR and PXC models are more accurate, interpretable and less overfitting than
other regression and classification algorithms.
• A new method adopted from CPXC presented to handle classifying imbalanced
datasets.
• Several applications of CPXR and CPXC are discussed.
48

Related publications
• Guozhu Dong, Vahid Taslimitehrani, Pattern-Aided Regression Modeling and Prediction
Model Analysis. in IEEE Transactions on Knowledge and Data Engineering, vol.27, no.9,
pp.2452-2465, Sept. 1 2015.
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method
and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain
Injury", IEEE International Conference on Bioinformatics and Bioengineering (BIBE),
2014, On page(s): 283 – 290 (Best Student Paper)
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample
dimensions effect on prediction of soil water retention curve and saturated hydraulic
conductivity. Journal of Hydrology. 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman
Pathak. Develolping HER-driven Heart Failure Models using CPXR(Log) with the
probabilistic loss function. Journal of Biomedical Informatics (2016).
• Guozhu Dong, Vahid Taslimitehrani, Pattern Aided Classification, SIAM Data Mining
Conference, 2016
49

Acknowledgement
50

Contrast Pattern Aided Regression and Classification

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (20)

Similar to Contrast Pattern Aided Regression and Classification

Similar to Contrast Pattern Aided Regression and Classification (20)

Recently uploaded

Recently uploaded (20)

Contrast Pattern Aided Regression and Classification

Editor's Notes