Vahid Taslimitehrani's Dissertation Defense: Friday, February 19, 2016.
Ph.D. Committee: Drs. Guozhu Dong, Advisor, T.K. Prasad, Amit Sheth, Keke Chen
and Jyotishman Pathak, Division of Health Informatics, Weill Cornell Medical College, Cornell University.
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC), respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of regression and classification techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
We have conducted a set of comprehensive performance studies to evaluate the performance of CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of modeling, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to handle classifying imbalanced datasets and introduced a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we applied a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. In all three applications, PXR and PXC models are significantly more accurate than models built by other learning algorithms.
Contrast Pattern Aided Regression and Classification
1. Ohio Center of Excellence in Knowledge-Enabled Computing
Ph.D. Dissertation Defense:
Contrast Pattern Aided Regression and
Classification
February 19, 2016
Vahid Taslimitehrani
Kno.e.sis Center, CSE Dept., Wright State University, USA
Committee Members: Prof. Guozhu Dong (advisor, WSU), Prof. Amit Sheth (WSU),
Prof. T.K. Prasad (WSU), Dr. Keke Chen (WSU), and Prof. Jyotishman Pathak
(Cornell University)
1
2. Ohio Center of Excellence in Knowledge-Enabled Computing
2
3. Ohio Center of Excellence in Knowledge-Enabled Computing
3
Does Asthma decrease the mortality risk from Pneumonia?
4. Ohio Center of Excellence in Knowledge-Enabled Computing
Accuracy vs. Interpretability
4
[Chart (source: Joshua Bloom and Henrik Brink of wise.io): methods plotted by interpretability (high to low) against accuracy (low to high): Linear/Logistic Regression, Lasso, Naïve Bayes, Decision Trees, Splines, Nearest Neighbors, Bagging, Neural Nets, SVM, Boosting, Random Forest, Deep Learning. CPXR/CPXC* targets both high accuracy and high interpretability. *on real datasets]
5. Ohio Center of Excellence in Knowledge-Enabled Computing
5
Modeling Techniques Lack Accuracy and Interpretability
• Heterogeneity & diversity of the given dataset
• Predictor-response interactions
• The universal model's assumption
6. Ohio Center of Excellence in Knowledge-Enabled Computing
Predictor-Response Interactions
6
Interactive effect: the effect of a variable on the prediction changes with the values of the other independent variable(s) interacting with it.
It is not the genes or the environment! It is their interaction that's important.
7. Ohio Center of Excellence in Knowledge-Enabled Computing
Universal Model’s Assumption &
Heterogeneity
What is the universal model’s
assumption?
7
What are heterogeneous and
diverse data points?
8. Ohio Center of Excellence in Knowledge-Enabled Computing
Solution
Our proposed methodology has three components:
1. A new type of regression & classification models, called Pattern Aided Regression and Classification (PXR and PXC)
2. New algorithms to build PXR and PXC models, called Contrast Pattern Aided Regression and Classification (CPXR and CPXC)
3. A new algorithm to handle imbalanced datasets, called Contrast Pattern Aided Classification on Imbalanced datasets (CPXCim)
8
9. Ohio Center of Excellence in Knowledge-Enabled Computing
Preliminaries: patterns
• A pattern (rule) is a set of conditions describing a set of objects.
• Example:
  "Age ≥ 60" AND "History of hypertension = YES"
  is a pattern (rule) describing:
  all patients who are at least 60 years old AND have a history of hypertension.
• An object matches a pattern if it satisfies every condition in the pattern.
9
Patient ID Age BMI History of Hypertension Diagnosed with Heart Failure
1 75 22 YES YES
2 67 27 NO NO
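The matching rule above can be sketched in Python, using the two patients in the table. This is an illustrative sketch with our own helper names, not code from the dissertation:

```python
# A pattern is a list of (attribute, predicate) conditions; an instance
# matches iff it satisfies every condition.
def matches(instance, pattern):
    return all(pred(instance[attr]) for attr, pred in pattern)

# "Age >= 60" AND "History of Hypertension = YES"
pattern = [("Age", lambda v: v >= 60),
           ("History of Hypertension", lambda v: v == "YES")]

patients = [
    {"Patient ID": 1, "Age": 75, "BMI": 22, "History of Hypertension": "YES"},
    {"Patient ID": 2, "Age": 67, "BMI": 27, "History of Hypertension": "NO"},
]

matching_ids = [p["Patient ID"] for p in patients if matches(p, pattern)]
print(matching_ids)  # → [1]
```

Only patient 1 matches: patient 2 satisfies the age condition but not the hypertension condition.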
10. Ohio Center of Excellence in Knowledge-Enabled Computing
Preliminaries: matching dataset and
contrast patterns
• The matching dataset of pattern P in dataset D, denoted mds(P, D), is the set of all instances in D matching pattern P.
• The support of pattern P in D is supp(P, D) = |mds(P, D)| / |D|.
• Contrast patterns: patterns that distinguish objects in different classes. A pattern is a contrast pattern if it matches many more objects in one class than in another class.
• An equivalence class (EC) is a set of patterns with the same matching dataset (having the same behavior).
10
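The definitions above can be sketched directly in Python. The helper names and the simple support-ratio test for contrast patterns are our own illustrative choices, not the dissertation's code:

```python
# Matching dataset: all instances satisfying every condition of the pattern.
def mds(pattern, dataset):
    return [x for x in dataset if all(pred(x[a]) for a, pred in pattern)]

# Support: |mds(P, D)| / |D|.
def supp(pattern, dataset):
    return len(mds(pattern, dataset)) / len(dataset)

# A pattern is a contrast pattern when it matches far more objects in one
# class than in the other; here we test that the support ratio is >= `ratio`.
def is_contrast(pattern, class_a, class_b, ratio=2.0):
    s_a, s_b = supp(pattern, class_a), supp(pattern, class_b)
    return s_a > 0 and (s_b == 0 or s_a / s_b >= ratio)
```

For example, the pattern "Age ≥ 60" with support 1.0 in one class and 0.5 in the other has support ratio 2 and would qualify as a contrast pattern under this test.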
11. Ohio Center of Excellence in Knowledge-Enabled Computing
Introduction: CPXR/CPXC overview
11
[Diagram: CPXR/CPXC maps a (regression or classification) model f to pattern/model pairs (P1, f1), (P2, f2); P: pattern, f: model.]
• A pattern logically characterizes a subgroup of data.
• A local model represents predictor-response interactions among the data points of a subgroup of data.
• Local model algorithms can be as simple as linear regression.
12. Ohio Center of Excellence in Knowledge-Enabled Computing
Diversity of predictor-response
relationships
• Different pattern-model pairs emphasize different sets of
variables.
• Different pattern-model pairs use highly different
regression/classification models.
• Diverse predictor-response relationships may be neutralized
at the global level.
12
13. Ohio Center of Excellence in Knowledge-Enabled Computing
Introduction: Thesis Statement
Study regression and classification techniques to produce accurate
and interpretable models capable of adequately representing
complex and diverse predictor-response interactions and revealing
high intra-dataset heterogeneity.
13
14. Ohio Center of Excellence in Knowledge-Enabled Computing
Contrast Pattern Aided Regression
(CPXR)
14
Guozhu Dong, Vahid Taslimitehrani, Pattern-Aided Regression
Modeling and Prediction Model Analysis. in IEEE Transactions
on Knowledge and Data Engineering, vol.27, no.9, pp.2452-
2465, Sept. 1 2015
15. Ohio Center of Excellence in Knowledge-Enabled Computing
A pictorial illustration of a simple PXR
model
15
A small dataset with 100 instances and 2 numerical
predictor variables.
• Different patterns can involve different sets of variables
[describing data regions in different subspaces]
• Matching datasets of different patterns can overlap
[Scatter plot of the dataset in the 2-D predictor space; both axes range from 0 to 10, with overlapping pattern regions highlighted.]
16. Ohio Center of Excellence in Knowledge-Enabled Computing
PXR concepts
16
[Diagram: CPXR/CPXC maps a baseline (regression or classification) model f_b to pattern/local-model pairs (P1, f_P1), (P2, f_P2).]
• Given a training dataset D = {(x_i, y_i) | 1 ≤ i ≤ n}, a regression model built on D is called the baseline model and is denoted f_b.
• Given the matching dataset of pattern P, mds(P, D), a regression model built on mds(P, D) is called a local model and is denoted f_P.
17. Ohio Center of Excellence in Knowledge-Enabled Computing
[Table: patterns P1–P6 with their items, local models f1–f6, and matched instances.]
Pattern Aided Regression (PXR)
17
• PXR = ((P1, f1, w1), (P2, f2, w2), …, (Pk, fk, wk), f_d)
• The regression function of a PXR:
  f_PXR(x) = ( Σ_{Pi ∈ π(x)} w_i f_i(x) ) / ( Σ_{Pi ∈ π(x)} w_i ),  if π(x) ≠ ∅
  f_PXR(x) = f_d(x),  otherwise
  where π(x) = { Pi | 1 ≤ i ≤ k, x matches Pi }
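The PXR prediction function above is a weighted average of the matching local models, with a fallback to the default model. A minimal sketch (our own implementation of the formula; names are ours):

```python
def f_pxr(x, pairs, f_default):
    """PXR prediction: weighted average of the local models whose patterns
    match x; fall back to the default model when no pattern matches.
    pairs: list of (pattern, local_model, weight); pattern is a predicate."""
    matched = [(f, w) for p, f, w in pairs if p(x)]
    if not matched:                       # pi(x) is empty
        return f_default(x)
    return (sum(w * f(x) for f, w in matched)
            / sum(w for _, w in matched))

# Two toy pattern/model pairs with overlapping matching regions:
pairs = [(lambda x: x < 5, lambda x: 2 * x, 1.0),
         (lambda x: x > 3, lambda x: x + 10, 3.0)]
pred = f_pxr(4, pairs, lambda x: 0.0)
print(pred)  # (1*8 + 3*14) / (1 + 3) = 12.5
```

At x = 4 both patterns match, so the prediction blends both local models according to their weights.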
18. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR/CPXC: Quality Measures
• The average residual reduction (arr) of a pattern P w.r.t. a prediction model f_b on a dataset D is:
  arr(P) = ( Σ_{x ∈ mds(P,D)} r_x(f_b) − Σ_{x ∈ mds(P,D)} r_x(f_P) ) / |mds(P, D)|
• The total residual reduction (trr) of a PXR/PXC is:
  trr(PXR/PXC) = ( Σ_{x ∈ mds(PS,D)} r_x(f_b) − Σ_{x ∈ mds(PS,D)} r_x(f_PXR/PXC) ) / Σ_{x ∈ D} r_x(f_b)
  where PS = {P1, …, Pk} is the pattern set, r_x(f) is f's residual on an instance x, and mds(PS, D) = ∪_{i=1..k} mds(Pi, D).
18
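The arr measure can be sketched as follows. This is our own code; for concreteness we take the residual to be the absolute error r_x(f) = |y − f(x)|, one common choice:

```python
def arr(pattern, f_baseline, f_local, data):
    """Average residual reduction of a pattern: mean drop in residual when
    the local model replaces the baseline on the pattern's matching dataset.
    data: list of (x, y) pairs; pattern is a predicate on x."""
    matched = [(x, y) for x, y in data if pattern(x)]
    if not matched:
        return 0.0
    total = sum(abs(y - f_baseline(x)) - abs(y - f_local(x))
                for x, y in matched)
    return total / len(matched)
```

A pattern with large positive arr marks a region where the baseline's errors are large and the local model corrects them; such patterns are the ones CPXR wants to keep.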
19. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR Algorithm
19
[Diagram: CPXR transforms dataset D into a PXR model in three phases.]
Goal: a small set of cooperating patterns, where each pattern characterizes a subgroup of data points such that:
• the baseline model makes large residual errors on data points in the subgroup, and
• a highly accurate local model is found to correct those errors.
20. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR Algorithm
20
[Diagram of the CPXR algorithm: the training dataset is fed to a baseline regression/classification model; its residuals split the data into LE (large error) and SE (small error); contrast pattern mining on LE vs. SE produces candidate patterns P1, P2, …, Pk; each selected pattern is paired with a local model and weight, yielding ((P1, f1, w1), (P4, f4, w4), …, (Pk, fk, wk)).]
21. Ohio Center of Excellence in Knowledge-Enabled Computing
• How to determine the splitting point κ?
  Minimize | ρ − ( Σ_{r_i > κ} r_i ) / ( Σ_i r_i ) |
• How to select patterns from CPS?
  Let PS = {P0}, where P0 is the pattern P in CPS with the highest arr.
21
[Plot: residuals sorted by magnitude; instances with residuals above the splitting point form LE, the rest SE.]
CPXR Algorithm
22. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR/CPXC: Filtering methods
• Contrast patterns of LE with support ratio less than 1.
• Patterns with tiny residual reduction (arr).
• Patterns with Jaccard similarity more than 0.9:
  J(P1, P2) = |mds(P1, D) ∩ mds(P2, D)| / |mds(P1, D) ∪ mds(P2, D)|
• Patterns whose matching datasets contain fewer instances than the number of predictor variables.
22
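The Jaccard-based filter above can be sketched as a greedy pass over the candidate patterns. This is our own illustrative code (the dissertation does not specify this exact procedure):

```python
def jaccard(mds1, mds2):
    """Jaccard similarity of two matching datasets (sets of instance ids)."""
    a, b = set(mds1), set(mds2)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def drop_redundant(candidates, threshold=0.9):
    """Greedy filter: scan candidates (assumed sorted by decreasing arr)
    and keep a pattern only if its matching dataset's Jaccard similarity
    with every already-kept pattern is at most `threshold`."""
    kept = []
    for pid, mds_ids in candidates:
        if all(jaccard(mds_ids, kmds) <= threshold for _, kmds in kept):
            kept.append((pid, mds_ids))
    return [pid for pid, _ in kept]
```

Pruning near-duplicate patterns this way keeps the final pattern set small and its members complementary rather than overlapping.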
23. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR: Prediction Accuracy Evaluation
• 50 real datasets and 23 synthetic datasets
• Different criteria to generate synthetic datasets
• Compare CPXR’s performance with 5 state-of-the-art
regression methods
• Overfitting and noise sensitivity
• Analysis of parameters
23
RMSE reduction = ( RMSE(LR) − RMSE(X) ) / RMSE(LR)
24. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR: Prediction Accuracy Evaluation
24
Dataset PLR SVR BART GBM CPXR
Tecator 40.62 0.16 19.35 -14.15 65.1
Tree 17.68 7.92 -7.23 -10.82 61.73
Wage 12.2 9.15 25.42 11.86 38.45
Average 18.41 4.94 20.18 14.6 42.89
CPXR’s
performance
vs. other
methods
• CPXR has the highest accuracy in 41 out of 50 datasets.
• CPXR’s results are more accurate than LR in all 50 datasets.
• In 20% of datasets, CPXR achieved more than 60% RMSE
reduction.
25. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR: Overfitting and Noise Sensitivity
25
[Plots: drop in accuracy vs. noise level (5–20%), compared to clean test data, for BART, CPXR, and Gradient Boosting; RMSE reduction of NN, SVR, BART, and CPXR on synthetic datasets.]
Train vs. test performance:
Method | Training | Test | Drop in accuracy
PLR | 37.11% | 18.76% | 49%
SVR | 7.65% | 4.8% | 37%
BART | 41.02% | 20.15% | 51%
CPXR(LL) | 51.4% | 39.88% | 22%
CPXR(LP) | 53.85% | 42.89% | 21%
26. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXR: Analysis of Parameters
26
[Plots: RMSE improvement over LR as a function of k (number of patterns, 5–20), minSup (0.02–0.10), and ρ (0.40–0.70), on the Fat, Mussels, and Price datasets.]
2% is the optimal minSup; CPXR uses 7 patterns on average over the 50 datasets.
27. Ohio Center of Excellence in Knowledge-Enabled Computing
Contrast Pattern Aided Classification
(CPXC)
27
Guozhu Dong, Vahid Taslimitehrani, Pattern Aided
Classification, SIAM Data Mining Conference, 2016
28. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: PXC Concept
CPXC techniques are quite similar to those of CPXR, but CPXC has more challenges as well as more opportunities than CPXR.
28
[Diagram: CPXC design dimensions: confidence of match, objective functions, classification algorithms, loss functions.]
29. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: Confidence of Match
• Given PXC = ((P1, h_P1, w1), (P2, h_P2, w2), …, (Pk, h_Pk, wk), h_d), the class score of an instance x for class Cj is defined as:
  weighted-vote(PXC, Cj, x) = ( Σ_{Pi ∈ π(x)} w_i × match(x, Pi) × h_Pi(x, Cj) ) / ( Σ_{Pi ∈ π(x)} w_i × match(x, Pi) ),  if π(x) ≠ ∅,
  and h_d(x, Cj) otherwise,
  where π(x) = { Pi | 1 ≤ i ≤ k, match(x, Pi) > 0 }
  and match(x, Pi) = |{ q ∈ MG(Pi) : x matches q }| / |MG(Pi)|.
• match(x, Pi) is the fraction of the patterns q in MG(Pi) such that x matches q.
• h_P(x, Cj) is the confidence score of local model h_P on instance x for class Cj.
29
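The weighted-vote rule can be sketched as below. This is our own code; the match degrees are assumed to be precomputed values in [0, 1], and the helper names are ours:

```python
def weighted_vote(pairs, h_default, x, c):
    """PXC class score for class c: average of local-classifier confidences,
    each weighted by w_i * match(x, P_i); falls back to the default
    classifier when no pattern matches at all.
    pairs: list of (match_fn, local_clf, weight); match_fn(x) in [0, 1],
    local_clf(x, c) returns a confidence score for class c."""
    active = [(m(x), h, w) for m, h, w in pairs if m(x) > 0]
    if not active:
        return h_default(x, c)
    num = sum(w * deg * h(x, c) for deg, h, w in active)
    den = sum(w * deg for deg, h, w in active)
    return num / den
```

The predicted class would then be the class with the highest weighted-vote score.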
30. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: Loss Functions
30
[Plot: AUC under three class-error loss functions (binary, probabilistic, standardized) on the ILPD, Hillvalley, and Planning datasets.]
The probabilistic error loss function returns the best results.
31. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: Base/Local Algorithms & Objective
Functions
• Different methods for baseline and local classifiers:
  – We used 6 classification algorithms for learning the baseline and local classifiers.
• Quality measures on pattern sets:
  – We used trr, AUC, and ACC (accuracy) to measure the quality of a pattern set.
• Quality measures on patterns and weights on local classifiers:
  – We used arr, AUC, and ACC (accuracy) to measure the quality of a pattern; arr is the winner!
31
32. Ohio Center of Excellence in Knowledge-Enabled Computing
Experimental results
32
• 19 public datasets
• 8 classification algorithms
• Noise sensitivity & overfitting
• Running time
• 7-fold cross validation
• minSup = 0.02, rho = 0.45
33. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: Performance
Dataset Boosting DT NBC Log RF SVM Max CPXC (NBC-DT)
Congress 0.58 0.66 0.6 0.57 0.58 0.58 0.66 0.86
Poker 0.6 0.6 0.5 0.5 0.76 0.5 0.76 0.85
HillValley 0.5 0.63 0.65 0.66 0.6 0.67 0.67 0.89
Climate 0.96 0.81 0.9 0.94 0.97 0.98 0.98 0.97
Mammography 0.94 0.91 0.94 0.94 0.93 0.93 0.94 0.98
Steel 0.96 0.88 0.91 0.95 0.95 0.94 0.95 0.99
33
• CPXC achieved average AUC of 0.886 on the 8 hard datasets.
• Average AUC of the best performing traditional classifier (RF) on hard datasets is 0.638.
• CPXC’s AUC is never lower than RF on the hard datasets.
• CPXC achieved average AUC of 0.983 on the easy datasets while the best performing
traditional algorithms obtained average AUC of 0.968.
35. Ohio Center of Excellence in Knowledge-Enabled Computing
CPXC: Impact of Parameters
35
[Plots: AUC as a function of k (number of patterns, 4–14), minSup (0.02–0.10), the objective function (TER, AUC, ACC), and ρ (0.3–0.7), on the Blood, Congress, Hillvalley, Planning, and ILPD datasets.]
36. Ohio Center of Excellence in Knowledge-Enabled Computing
36
Classification on Imbalanced Datasets
• What is an imbalanced classification problem?
• What are the real-world applications?
• Why do traditional classification algorithms not perform well on imbalanced datasets?
• What is our proposed solution?
Classifying minority instances correctly might be more important than classifying majority instances.
37. Ohio Center of Excellence in Knowledge-Enabled Computing
[Diagram: as in CPXR, the training dataset is split by the baseline classifier's errors into LE and SE, but a weighting step is applied first.]
New weighting idea:
  err*(h_b, x) = err(h_b, x) × δ, if x is a minority-class instance
  err*(h_b, x) = err(h_b, x), if x is a majority-class instance
37
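The weighting idea above can be sketched together with the LE/SE split it feeds into. This is our own illustrative code; the interface and the default δ are our choices:

```python
def split_le_se(instances, errors, is_minority, kappa, delta=2.0):
    """CPXCim-style split: boost each minority-class instance's baseline
    error by delta (> 1), then send instances whose weighted error exceeds
    the splitting point kappa to LE (large error), the rest to SE.
    Boosting makes minority instances more likely to land in LE, so the
    mined contrast patterns focus on them."""
    le, se = [], []
    for x, err, minority in zip(instances, errors, is_minority):
        weighted = err * delta if minority else err
        (le if weighted > kappa else se).append(x)
    return le, se
```

With δ = 2, a minority instance with baseline error 0.3 is treated like a majority instance with error 0.6, pushing it across a splitting point of 0.5 into the LE group.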
38. Ohio Center of Excellence in Knowledge-Enabled Computing
A Filtering Method to Remove Imbalanced
Local Models
38
• IR(mds(P, D)) = (number of instances in the majority class) / (number of instances in the minority class)
[Diagram: candidate patterns P1, P2, …, Pk paired with local models (f1, w1), …, (fk, wk); patterns whose matching datasets have a high imbalance ratio are removed.]
39. Ohio Center of Excellence in Knowledge-Enabled Computing
Experimental results
39
• The average AUC of CPXCim is 14% and 15.2% higher than the AUC of SMOTE and SMOTE-TL, respectively.
• CPXCim outperforms the other imbalanced classifiers on all 10 datasets.
CPXCim’s performance
Dataset
# of
instances
# of
variables
Imbalance
ratio
CPXCim SMOTE SMOTE-TL
Yeast 1004 8 9.14 0.942 0.7728 0.772
Led7digit 443 7 10.97 0.978 0.8919 0.897
flareF 1066 11 23.79 0.883 0.7463 0.809
Wine Quality 1599 11 29.17 0.76 0.6008 0.59
Average - - - 0.92 0.798 0.807
40. Ohio Center of Excellence in Knowledge-Enabled Computing
Applications of CPXR & CPXC
40
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical
Prognostic Modeling Results Using the Method on Traumatic Brain Injury", IEEE International
Conference on Bioinformatics and Bioengineering (BIBE), 2014, On page(s): 283 – 290 (Best Student
Paper)
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions
effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of
Hydrology. 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
41. Ohio Center of Excellence in Knowledge-Enabled Computing
Application: Traumatic Brain Injury
What is Traumatic Brain Injury (TBI)?
It is an important public health problem and a leading
cause of death and disability worldwide.
Problem definition: prediction of patient outcomes within 6 months after the TBI event, using admission data.
• Dataset: 2159 patients collected from a trial, with 15 predictor variables
• Two class variables: mortality and unfavorable outcome.
41
Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression
Method and Clinical Prognostic Modeling Results Using the Method on
Traumatic Brain Injury", Bioinformatics and Bioengineering (BIBE), 2014
IEEE International Conference on, On page(s): 283 – 290 (Best Student
Paper Award)
43. Ohio Center of Excellence in Knowledge-Enabled Computing
Application: Heart Failure Survival Risk
Models
• Collaboration with Mayo Clinic
• Problem definition: Heart Failure survival prediction models.
• An EHR dataset on 119,749 patients admitted to Mayo Clinic.
• Predictor variables are grouped in the following categories:
– Demographic, Vitals, Labs, Medications and 24 major chronic conditions as co-
morbidities.
• Three groups of CPXC models are developed to predict survival in 1, 2 and 5 years
after heart failure event.
43
Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
44. Ohio Center of Excellence in Knowledge-Enabled Computing
Application: Heart Failure Survival Risk
Models
Algorithm 1 Year 2 Year 5 Year
Decision Tree 0.66 0.5 0.5
Random Forest 0.8 0.72 0.72
Ada Boost 0.74 0.71 0.68
SVM 0.59 0.52 0.52
Logistic Regression 0.81 0.74 0.73
CPXC 0.937 0.83 0.786
44
Variable Log f1 f2 f3 f4 f5 f6 f7
Alzheimer 1.75 1.74 0.80 1.88 1.59 1.29 1.58 0.75
Breast Cancer 0.63 1.15 1.62 2.73 1.00 1.00 2.08 0.59
Odds ratios of PXC local models
Performance of different classifiers
45. Ohio Center of Excellence in Knowledge-Enabled Computing
Application: Heart Failure Survival Risk
Models
Variable sets (baseline → extended) | CPXC | Log | RF | SVM | DT | Boosting
(Demo&Vital) → (Demo&Vital)+Lab | 4.8% | 11.5% | 19% | 17.3% | 0% | 14.7%
(Demo&Vital) → (Demo&Vital)+Lab+Med | 8.9% | 13.4% | 21.2% | 21.7% | 0% | 5.7%
(Demo&Vital) → (Demo&Vital)+Lab+Med+Co-morbid | 27.8% | 9.6% | 19.1% | 19.5% | -10.4% | 7.6%
(Demo&Vital)+Lab → (Demo&Vital)+Lab+Med | 3.2% | 1.7% | 1.7% | 3.7% | 0% | -9.8%
(Demo&Vital)+Lab → (Demo&Vital)+Lab+Med+Co-morbid | 20.9% | -1.7% | 0% | 1.8% | -10.4% | -8.1%
(Demo&Vital)+Lab+Med → (Demo&Vital)+Lab+Med+Co-morbid | 15.9% | -3.3% | -1.7% | -1.7% | -10.4% | 1.8%
45
Adding co-morbidities:
• decreased the AUC of other classifiers by 5.3% on average.
• increased the AUC of CPXC by 21.5% on average.
Performance changes when we add more variables
46. Ohio Center of Excellence in Knowledge-Enabled Computing
Application: Saturated Hydraulic
Conductivity
• Collaboration with University of Texas at Austin and USDA-ARS
• Problem definition:
1. Prediction of the soil water retention curve (SWRC)
2. Prediction of Saturated Hydraulic Conductivity (SHC)
3. Investigating the effect of sample dimensions on
prediction accuracy.
• Number of predictor variables: 6-13
• Number of response variables: 10
• 32 CPXR models are developed.
46
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample
dimensions effect on prediction of soil water retention curve and saturated hydraulic
conductivity. Journal of Hydrology. 528 (2015): 127-137.
48. Ohio Center of Excellence in Knowledge-Enabled Computing
Conclusion
• A new type of highly accurate and interpretable regression and classification models, PXR/PXC, is presented.
• New techniques to build PXR and PXC models are discussed.
• Each pattern-model pair represents a distinct predictor-response interaction.
• PXR and PXC models are more accurate, more interpretable, and less prone to overfitting than other regression and classification algorithms.
• A new method adapted from CPXC is presented to handle classifying imbalanced datasets.
• Several applications of CPXR and CPXC are discussed.
48
49. Ohio Center of Excellence in Knowledge-Enabled Computing
Related publications
• Guozhu Dong, Vahid Taslimitehrani, Pattern-Aided Regression Modeling and Prediction
Model Analysis. in IEEE Transactions on Knowledge and Data Engineering, vol.27, no.9,
pp.2452-2465, Sept. 1 2015.
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method
and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain
Injury", IEEE International Conference on Bioinformatics and Bioengineering (BIBE),
2014, On page(s): 283 – 290 (Best Student Paper)
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample
dimensions effect on prediction of soil water retention curve and saturated hydraulic
conductivity. Journal of Hydrology. 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven Heart Failure Models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
• Guozhu Dong, Vahid Taslimitehrani, Pattern Aided Classification, SIAM Data Mining
Conference, 2016
49
50. Ohio Center of Excellence in Knowledge-Enabled Computing
Acknowledgement
50