Contrast Pattern Aided Regression and Classification


Vahid Taslimitehrani's Dissertation Defense: Friday, February 19, 2016.
Ph.D. Committee: Drs. Guozhu Dong (Advisor), T.K. Prasad, Amit Sheth, Keke Chen,
and Jyotishman Pathak (Division of Health Informatics, Weill Cornell Medical College, Cornell University).

ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.

In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of regression and classification techniques, called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC), to build accurate and interpretable PXR and PXC models.

We have conducted comprehensive performance studies to evaluate CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of models, PXR and PXC can also provide insights into data heterogeneity and diverse predictor-response relationships.

We have also adapted CPXC to handle the classification of imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). CPXCim applies a weighting method to boost minority instances, as well as a new filtering method to prune patterns with imbalanced matching datasets.

Finally, we applied our techniques in three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in all three applications.

Contrast Pattern Aided Regression and Classification

1. Ph.D. Dissertation Defense: Contrast Pattern Aided Regression and Classification. February 19, 2016. Vahid Taslimitehrani, Kno.e.sis Center, CSE Dept., Wright State University, USA. Committee Members: Prof. Guozhu Dong (advisor, WSU), Prof. Amit Sheth (WSU), Prof. T.K. Prasad (WSU), Dr. Keke Chen (WSU), and Prof. Jyotishman Pathak (Cornell University).
3. Does asthma decrease the mortality risk from pneumonia?
4. Accuracy vs. Interpretability. [Chart placing methods along low-to-high accuracy and interpretability axes: Lasso, Linear/Logistic Regression, Naïve Bayes, Decision Trees, Splines, Nearest Neighbors, Bagging, Neural Nets, SVM, Boosting, Random Forest, Deep Learning, and CPXR/CPXC (*on real datasets). Source: Joshua Bloom and Henrik Brink of wise.io.]
5. Modeling Techniques Lack Accuracy and Interpretability. [Diagram relating three causes: heterogeneity & diversity of the given dataset, predictor-response interactions, and the universal model's assumption.]
6. Predictor-Response Interactions. Interactive effect: the effect of a variable on the prediction changes with the values of the other independent variable(s) that interact with it. It is not the genes or the environment; it is their interaction that is important.
7. Universal Model's Assumption & Heterogeneity. What is the universal model's assumption? What are heterogeneous and diverse data points?
8. Solution. Our proposed methodology has three components: (1) a new type of regression & classification models, called Pattern Aided Regression and Classification (PXR and PXC); (2) new algorithms to build PXR and PXC models, called Contrast Pattern Aided Regression and Classification (CPXR and CPXC); (3) a new algorithm to handle imbalanced datasets, called Contrast Pattern Aided Classification on Imbalanced Datasets (CPXCim).
9. Preliminaries: patterns.
• A pattern (rule) is a set of conditions describing a set of objects.
• Example: "Age ≥ 60 AND History of hypertension = YES" is a pattern (rule) describing all patients who are more than 60 years old and have a history of hypertension.
• An object matches a pattern if it satisfies every condition in the pattern.

| Patient ID | Age | BMI | History of Hypertension | Diagnosed with Heart Failure |
|---|---|---|---|---|
| 1 | 75 | 22 | YES | YES |
| 2 | 67 | 27 | NO | NO |
10. Preliminaries: matching dataset and contrast patterns.
• The matching dataset of pattern P in dataset D, written mds(P, D), is the set of all instances matching pattern P.
• The support of pattern P in D is supp(P, D) = |mds(P, D)| / |D|.
• Contrast patterns are patterns that distinguish objects in different classes: a pattern is a contrast pattern if it matches many more objects in one class than in another.
• An equivalence class (EC) is a set of patterns with the same matching dataset (i.e., having the same behavior).
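To make these definitions concrete, here is a minimal Python sketch; the representation (a pattern as a list of (attribute, test) conditions) and all names are illustrative, not the dissertation's implementation:

```python
# Illustrative sketch: a condition is an (attribute, test) pair and a
# pattern is a list of conditions that must all hold.

def matches(instance, pattern):
    """An object matches a pattern if it satisfies every condition."""
    return all(test(instance[attr]) for attr, test in pattern)

def mds(pattern, dataset):
    """Matching dataset mds(P, D): all instances of D matching P."""
    return [x for x in dataset if matches(x, pattern)]

def supp(pattern, dataset):
    """Support supp(P, D) = |mds(P, D)| / |D|."""
    return len(mds(pattern, dataset)) / len(dataset)

# Example: "Age >= 60 AND History of hypertension = YES".
P = [("Age", lambda v: v >= 60), ("Hypertension", lambda v: v == "YES")]
D = [{"Age": 75, "BMI": 22, "Hypertension": "YES"},
     {"Age": 67, "BMI": 27, "Hypertension": "NO"}]
print(supp(P, D))  # 0.5: only the first patient matches
```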
11. Introduction: CPXR/CPXC overview. P: pattern; f: model. A pattern logically characterizes a sub-group of data; a local model represents the predictor-response interactions among the data points of that sub-group. CPXR/CPXC produce pattern-model pairs (P_1, f_1), (P_2, f_2), ... for regression and classification. Local model algorithms can be as simple as linear regression.
12. Diversity of predictor-response relationships.
• Different pattern-model pairs emphasize different sets of variables.
• Different pattern-model pairs use highly different regression/classification models.
• Diverse predictor-response relationships may be neutralized at the global level.
13. Introduction: Thesis Statement. Study regression and classification techniques to produce accurate and interpretable models capable of adequately representing complex and diverse predictor-response interactions and revealing high intra-dataset heterogeneity.
14. Contrast Pattern Aided Regression (CPXR). Guozhu Dong, Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 2015.
15. A pictorial illustration of a simple PXR model. A small dataset with 100 instances and 2 numerical predictor variables. [Scatter plot of the dataset with pattern regions omitted.]
• Different patterns can involve different sets of variables (describing data regions in different subspaces).
• Matching datasets of different patterns can overlap.
16. PXR concepts. Given a training dataset D = {(x_i, y_i) | 1 ≤ i ≤ n}, a regression model built on D is called the baseline model and is denoted f_b. Given the matching dataset mds(P, D) of a pattern P, a regression model built on mds(P, D) is called a local model and is denoted f_P. CPXR/CPXC combine the baseline model with pattern-model pairs (P_1, f_{P_1}), (P_2, f_{P_2}), ...
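As a sketch of the baseline/local-model distinction, using scikit-learn's LinearRegression purely for illustration (the dissertation does not prescribe a particular library, and the pattern here is a hard-coded example):

```python
# Sketch: the baseline model f_b is fit on all of D; a local model f_P is
# fit only on the matching dataset mds(P, D) of a pattern P.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))           # 100 instances, 2 predictors
y = 3 * X[:, 0] + rng.normal(size=100)

f_b = LinearRegression().fit(X, y)              # baseline model on D

mask = X[:, 0] >= 6                             # instances matching a pattern P
f_P = LinearRegression().fit(X[mask], y[mask])  # local model on mds(P, D)
```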
17. Pattern Aided Regression (PXR). [Diagram: patterns P_1, ..., P_6 with local models f_1, ..., f_6 and the items each matches.]
• PXR = ((P_1, f_1, w_1), (P_2, f_2, w_2), ..., (P_k, f_k, w_k), f_d).
• The regression function of PXR is
  f_{PXR}(x) = (Σ_{P_i ∈ π(x)} w_i f_i(x)) / (Σ_{P_i ∈ π(x)} w_i) if π(x) ≠ ∅, and f_d(x) otherwise,
  where π(x) = {P_i | 1 ≤ i ≤ k, x matches P_i}.
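A minimal sketch of this prediction rule, reusing the illustrative matches() helper from the earlier sketch (the data structures are assumptions):

```python
# Sketch of f_PXR: a weighted average of the local models whose patterns
# match x, falling back to the default model f_d when no pattern matches.

def f_pxr(x, pairs, f_d):
    """pairs: list of (pattern, local_model, weight) triples."""
    pi_x = [(f, w) for P, f, w in pairs if matches(x, P)]
    if not pi_x:                        # pi(x) is empty: no pattern matches
        return f_d(x)
    return (sum(w * f(x) for f, w in pi_x)
            / sum(w for _, w in pi_x))  # weighted average over pi(x)
```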
18. CPXR/CPXC: Quality Measures.
• The average residual reduction (arr) of a pattern P w.r.t. a prediction model f_b on a dataset D is
  arr(P) = (Σ_{x ∈ mds(P,D)} r_x(f_b) − Σ_{x ∈ mds(P,D)} r_x(f_P)) / |mds(P, D)|.
• The total residual reduction (trr) of a PXR/PXC is
  trr(PXR/PXC) = (Σ_{x ∈ mds(PS,D)} r_x(f_b) − Σ_{x ∈ mds(PS,D)} r_x(f_{PXR/PXC})) / Σ_{x ∈ D} r_x(f_b),
  where PS = {P_1, ..., P_k} is the pattern set, r_x(f) is f's residual on an instance x, and mds(PS, D) = ∪_{i=1}^{k} mds(P_i, D).
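These measures translate directly into code; a sketch under the same illustrative representation, taking the residual r_x(f) to be the absolute error (an assumption for concreteness):

```python
# Sketch of arr and trr, with residual r_x(f) = |y_x - f(x)|.

def residual(x, y, f):
    return abs(y - f(x))

def arr(mds_xy, f_b, f_P):
    """mds_xy: list of (x, y) pairs in mds(P, D)."""
    return sum(residual(x, y, f_b) - residual(x, y, f_P)
               for x, y in mds_xy) / len(mds_xy)

def trr(mds_ps_xy, all_xy, f_b, f_pxr_model):
    """mds_ps_xy: union of the matching datasets of the pattern set PS."""
    gain = sum(residual(x, y, f_b) - residual(x, y, f_pxr_model)
               for x, y in mds_ps_xy)
    return gain / sum(residual(x, y, f_b) for x, y in all_xy)
```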
19. CPXR Algorithm. [Diagram: dataset D flows through CPXR's three phases.] Goal: a small set of cooperating patterns, where each pattern characterizes a subgroup of data points such that:
• the baseline model makes large residual errors on the data points in the subgroup, and
• a highly accurate local model is found to correct those errors.
20. CPXR Algorithm. [Pipeline diagram: a baseline regression/classification model splits the training dataset into LE (large error) and SE (small error) groups; pattern mining on LE produces candidate patterns P_1, ..., P_k; a local model and weight (f_i, w_i) are learned for each pattern, yielding the output model ((P_1, f_1, w_1), ..., (P_k, f_k, w_k)).]
21. CPXR Algorithm.
• How to determine the splitting point κ? Choose κ to minimize |ρ − (Σ_{r_i > κ} r_i) / (Σ_i r_i)|.
• How to select patterns from CPS? Let PS = {P_0}, where P_0 is the pattern P in CPS with the highest arr.
[Histogram of residual values illustrating the SE/LE split.]
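One way to implement the κ search is a simple scan over candidate values; this is a sketch of that objective, not necessarily the dissertation's exact procedure:

```python
# Sketch: pick kappa so that instances with residual above kappa contribute
# a fraction of the total residual as close to rho as possible.

def split_point(residuals, rho=0.45):
    total = sum(residuals)
    best_kappa, best_gap = None, float("inf")
    for kappa in sorted(set(residuals)):
        frac = sum(r for r in residuals if r > kappa) / total
        gap = abs(rho - frac)
        if gap < best_gap:
            best_kappa, best_gap = kappa, gap
    return best_kappa  # residuals > kappa form LE; the rest form SE
```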
22. CPXR/CPXC: Filtering methods.
• Contrast patterns of LE with support ratio less than 1.
• Patterns with tiny residual reduction (arr).
• Patterns with Jaccard similarity more than 0.9, where
  J(P_1, P_2) = |mds(P_1, D) ∩ mds(P_2, D)| / |mds(P_1, D) ∪ mds(P_2, D)|.
• Patterns whose matching datasets are smaller than the number of predictor variables.
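For instance, the Jaccard-overlap filter could be implemented greedily; in this sketch, the pattern ordering (by arr, best first) and the pattern-ID-to-matching-instances mapping are assumptions:

```python
# Sketch: keep a pattern only if its matching dataset does not overlap an
# already-kept pattern by more than the Jaccard threshold (0.9 per the slide).

def jaccard(ids1, ids2):
    s1, s2 = set(ids1), set(ids2)
    return len(s1 & s2) / len(s1 | s2)

def filter_similar(pattern_ids, mds_ids, threshold=0.9):
    """pattern_ids: assumed sorted by arr, best first;
    mds_ids: maps a pattern ID to the IDs of its matching instances."""
    kept = []
    for p in pattern_ids:
        if all(jaccard(mds_ids[p], mds_ids[q]) <= threshold for q in kept):
            kept.append(p)
    return kept
```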
23. CPXR: Prediction Accuracy Evaluation.
• 50 real datasets and 23 synthetic datasets.
• Different criteria to generate synthetic datasets.
• Compare CPXR's performance with 5 state-of-the-art regression methods.
• Overfitting and noise sensitivity.
• Analysis of parameters.
RMSE reduction = (RMSE(LR) − RMSE(X)) / RMSE(LR).
24. CPXR: Prediction Accuracy Evaluation. CPXR's performance (RMSE reduction, %) vs. other methods:

| Dataset | PLR | SVR | BART | GBM | CPXR |
|---|---|---|---|---|---|
| Tecator | 40.62 | 0.16 | 19.35 | -14.15 | 65.1 |
| Tree | 17.68 | 7.92 | -7.23 | -10.82 | 61.73 |
| Wage | 12.2 | 9.15 | 25.42 | 11.86 | 38.45 |
| Average | 18.41 | 4.94 | 20.18 | 14.6 | 42.89 |

• CPXR has the highest accuracy in 41 out of 50 datasets.
• CPXR's results are more accurate than LR in all 50 datasets.
• In 20% of the datasets, CPXR achieved more than 60% RMSE reduction.
25. CPXR: Overfitting and Noise Sensitivity. [Plots: drop in accuracy vs. noise level (5-20%) for BART, CPXR, Gradient Boosting, NN, and SVR; box plots of RMSE reduction on synthetic datasets.] Train vs. test RMSE reduction:

| Method | Training | Test | Drop in accuracy |
|---|---|---|---|
| PLR | 37.11% | 18.76% | 49% |
| SVR | 7.65% | 4.8% | 37% |
| BART | 41.02% | 20.15% | 51% |
| CPXR(LL) | 51.4% | 39.88% | 22% |
| CPXR(LP) | 53.85% | 42.89% | 21% |
26. CPXR: Analysis of Parameters. [Plots: RMSE improvement over LR vs. k (number of patterns), minSup, and ρ, on the Fat, Mussels, and Price datasets.] 2% is the optimal minSup; 7 patterns are used on average over the 50 datasets.
27. Contrast Pattern Aided Classification (CPXC). Guozhu Dong, Vahid Taslimitehrani. Pattern Aided Classification. SIAM Data Mining Conference, 2016.
28. CPXC: PXC Concept. CPXC techniques are quite similar to those of CPXR, but CPXC has more challenges as well as more opportunities than CPXR: confidence of match, objective functions, classification algorithms, and loss functions.
29. CPXC: Confidence of Match.
• Given PXC = ((P_1, h_{P_1}, w_1), (P_2, h_{P_2}, w_2), ..., (P_k, h_{P_k}, w_k), h_d), the class variable of an instance x is defined by
  weighted-vote(PXC, C_j, x) = (Σ_{P_i ∈ π(x)} w_i · match(x, P_i) · h_{P_i}(x, C_j)) / (Σ_{P_i ∈ π(x)} w_i · match(x, P_i)) if π(x) ≠ ∅, and h_d otherwise,
  where π(x) = {P_i | 1 ≤ i ≤ k, match(x, P_i) > 0}.
• match(x, P_i) is the fraction of patterns q in MG(P_i) such that x matches q.
• h_P(x, C_j) is the confidence score of local model h_P on instance x for class C_j.
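A minimal sketch of this weighted vote, where the fuzzy match function and the local classifiers' confidence functions are supplied by the caller (all names are illustrative):

```python
# Sketch of the weighted vote: local classifiers whose patterns (fuzzily)
# match x vote with weight w_i * match(x, P_i); h(x, c) is h's confidence
# score for class c; h_d is the default classifier.

def weighted_vote(x, triples, h_d, classes, match):
    """triples: list of (pattern, local_classifier, weight)."""
    pi_x = [(h, w, match(x, P)) for P, h, w in triples if match(x, P) > 0]
    if not pi_x:
        return h_d(x)                      # no pattern matches: default model
    def score(c):
        return (sum(w * m * h(x, c) for h, w, m in pi_x)
                / sum(w * m for _, w, m in pi_x))
    return max(classes, key=score)         # class with the highest vote
```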
30. CPXC: Loss Functions. [Plot: AUC under the binary, probabilistic, and standardized class-error loss functions on the ILPD, Hillvalley, and Planning datasets.] The probabilistic error loss function returns the best results.
31. CPXC: Base/Local Algorithms & Objective Functions.
• Different methods for baseline and local classifiers: we used 6 classification algorithms for learning the baseline and local classifiers.
• Quality measures on pattern sets: we used trr, AUC, and ACC (accuracy) to measure the quality of a pattern set.
• Quality measures on patterns and weights on local classifiers: we used arr, AUC, and ACC (accuracy) to measure the quality of a pattern; arr is the winner!
32. Experimental results. 19 public datasets; 8 classification algorithms; noise sensitivity & overfitting; running time; 7-fold cross validation; minSup = 0.02; ρ = 0.45.
33. CPXC: Performance.

| Dataset | Boosting | DT | NBC | Log | RF | SVM | Max | CPXC (NBC-DT) |
|---|---|---|---|---|---|---|---|---|
| Congress | 0.58 | 0.66 | 0.6 | 0.57 | 0.58 | 0.58 | 0.66 | 0.86 |
| Poker | 0.6 | 0.6 | 0.5 | 0.5 | 0.76 | 0.5 | 0.76 | 0.85 |
| HillValley | 0.5 | 0.63 | 0.65 | 0.66 | 0.6 | 0.67 | 0.67 | 0.89 |
| Climate | 0.96 | 0.81 | 0.9 | 0.94 | 0.97 | 0.98 | 0.98 | 0.97 |
| Mammography | 0.94 | 0.91 | 0.94 | 0.94 | 0.93 | 0.93 | 0.94 | 0.98 |
| Steel | 0.96 | 0.88 | 0.91 | 0.95 | 0.95 | 0.94 | 0.95 | 0.99 |

• CPXC achieved an average AUC of 0.886 on the 8 hard datasets.
• The average AUC of the best performing traditional classifier (RF) on the hard datasets is 0.638.
• CPXC's AUC is never lower than RF's on the hard datasets.
• CPXC achieved an average AUC of 0.983 on the easy datasets, while the best performing traditional algorithms obtained an average AUC of 0.968.
34. CPXC: Noise Sensitivity. Drop of AUC vs. noise levels:

| Method / Noise | 0% | 5% | 10% | 15% | 20% | Average |
|---|---|---|---|---|---|---|
| RF | 5.73 | 6.61 | 12.48 | 25.83 | 33.54 | 16.84 |
| CPXC | 5.87 | 6.79 | 12.92 | 24.7 | 32.7 | 16.6 |
| Boosting | 7.02 | 8.93 | 14.2 | 26.8 | 34.65 | 18.32 |
| Log | 7.04 | 10.56 | 14.63 | 24.7 | 33.94 | 18.17 |
| NBC | 7.06 | 10.58 | 15.26 | 27.89 | 35.1 | 19.18 |
| SVM | 8.6 | 10.34 | 16.28 | 29.59 | 38.02 | 20.57 |
| DT | 8.8 | 11.04 | 16.78 | 30.3 | 43.1 | 22.00 |
35. CPXC: Impact of Parameters. [Plots: AUC vs. k (number of patterns), minSup, objective function (TER, AUC, ACC), and ρ, on the Blood, Congress, Hillvalley, Planning, and ILPD datasets.]
36. Classification on Imbalanced Datasets.
• What is an imbalanced classification problem?
• What are the real-world applications?
• Why do traditional classification algorithms not perform well on imbalanced datasets?
• What is our proposed solution?
Classifying minority instances might be more important than classifying the majority class.
37. CPXCim: New Weighting Idea. [Pipeline: a baseline classification model and the weighted errors split the training dataset into LE and SE.]
• err*(h_b, x) = err(h_b, x) × δ if x is a minority-class instance; err(h_b, x) if x is a majority-class instance.
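A one-function sketch of this weighting idea (the instance representation and field name are assumptions):

```python
# Sketch of the CPXCim weighting step: inflate the baseline classifier's
# error on minority-class instances by delta before the LE/SE split, so
# minority mistakes look "larger" and are more likely to land in LE.

def weighted_error(err, instance, minority_class, delta):
    """err: err(h_b, x); delta > 1 boosts minority-class instances."""
    return err * delta if instance["class"] == minority_class else err
```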
38. A Filtering Method to Remove Imbalanced Local Models.
• IR(mds(P, D)) = (number of instances in the majority class) / (number of instances in the minority class).
[Diagram: candidate patterns P_1, ..., P_k and their local models (f_1, w_1), ..., (f_k, w_k).]
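A sketch of this filter; the threshold parameter max_ir is an illustrative assumption, as the slide does not fix its value:

```python
# Sketch of the imbalance-ratio filter: prune a pattern whose matching
# dataset is itself too imbalanced to learn a useful local model from.
from collections import Counter

def imbalance_ratio(mds_instances):
    counts = Counter(x["class"] for x in mds_instances)
    # Assumes both classes are present in the matching dataset.
    return max(counts.values()) / min(counts.values())

def keep_pattern(mds_instances, max_ir):
    return imbalance_ratio(mds_instances) <= max_ir
```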
39. Experimental results. CPXCim's performance:

| Dataset | # of instances | # of variables | Imbalance ratio | CPXCim | SMOTE | SMOTE-TL |
|---|---|---|---|---|---|---|
| Yeast | 1004 | 8 | 9.14 | 0.942 | 0.7728 | 0.772 |
| Led7digit | 443 | 7 | 10.97 | 0.978 | 0.8919 | 0.897 |
| flareF | 1066 | 11 | 23.79 | 0.883 | 0.7463 | 0.809 |
| Wine Quality | 1599 | 11 | 29.17 | 0.76 | 0.6008 | 0.59 |
| Average | - | - | - | 0.92 | 0.798 | 0.807 |

• The average AUC of CPXCim is 14% and 15.2% higher than the AUC of SMOTE and SMOTE-TL, respectively.
• The performance of CPXCim is always better than the other imbalanced classifiers on these 10 datasets.
40. Applications of CPXR & CPXC.
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper).
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven heart failure models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
41. Application: Traumatic Brain Injury. What is Traumatic Brain Injury (TBI)? It is an important public health problem and a leading cause of death and disability worldwide. Problem definition: prediction of patients' outcomes within 6 months after the TBI event, using admission data.
• Dataset: 2159 patients collected from a trial, with 15 predictor variables.
• Two class variables: mortality and unfavorable outcome.
Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper Award).
42. Application: Traumatic Brain Injury. CPXR(Log)'s performance on the unfavorable-outcome task (comparison values in parentheses):

| Measure | Basic | Basic+CT | Basic+CT+Lab |
|---|---|---|---|
| Specificity | 0.89 (0.85) | 0.87 (0.85) | 0.91 (0.84) |
| Sensitivity | 0.54 (0.52) | 0.65 (0.6) | 0.72 (0.61) |
| Accuracy | 0.75 (0.72) | 0.79 (0.75) | 0.87 (0.75) |
| F1 | 0.63 (0.59) | 0.7 (0.66) | 0.76 (0.66) |
| AUC | 0.82 (0.76) | 0.87 (0.8) | 0.93 (0.81) |

Performance changes when more variables are added:

| Variable set change | Mortality: CPXR(Log) | Mortality: Log | Unfavorable: CPXR(Log) | Unfavorable: Log |
|---|---|---|---|---|
| Basic → Basic+CT | 10% | 7.7% | 6% | 5.2% |
| Basic+CT → Basic+CT+Lab | 4.5% | 2.5% | 6.8% | 1.25% |
| Basic → Basic+CT+Lab | 15% | 11.1% | 13.4% | 6.6% |

[ROC curves: AUC CPXR(Log) = 0.87, SLogR = 0.8, RF = 0.72, SVM = 0.7.]
43. Application: Heart Failure Survival Risk Models.
• Collaboration with Mayo Clinic.
• Problem definition: heart failure survival prediction models.
• An EHR dataset on 119,749 patients admitted to Mayo Clinic.
• Predictor variables are grouped into the following categories: demographics, vitals, labs, medications, and 24 major chronic conditions as co-morbidities.
• Three groups of CPXC models are developed to predict survival 1, 2, and 5 years after a heart failure event.
Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven heart failure models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
44. Application: Heart Failure Survival Risk Models. Performance of different classifiers:

| Algorithm | 1 Year | 2 Year | 5 Year |
|---|---|---|---|
| Decision Tree | 0.66 | 0.5 | 0.5 |
| Random Forest | 0.8 | 0.72 | 0.72 |
| Ada Boost | 0.74 | 0.71 | 0.68 |
| SVM | 0.59 | 0.52 | 0.52 |
| Logistic Regression | 0.81 | 0.74 | 0.73 |
| CPXC | 0.937 | 0.83 | 0.786 |

Odds ratios of PXC local models:

| Variable | Log | f1 | f2 | f3 | f4 | f5 | f6 | f7 |
|---|---|---|---|---|---|---|---|---|
| Alzheimer | 1.75 | 1.74 | 0.80 | 1.88 | 1.59 | 1.29 | 1.58 | 0.75 |
| Breast Cancer | 0.63 | 1.15 | 1.62 | 2.73 | 1.00 | 1.00 | 2.08 | 0.59 |
45. Application: Heart Failure Survival Risk Models. Performance changes when more variables are added:

| Variable set change | CPXC | Log | RF | SVM | DT | Boosting |
|---|---|---|---|---|---|---|
| (Demo&Vital) → (Demo&Vital)+Lab | 4.8% | 11.5% | 19% | 17.3% | 0% | 14.7% |
| (Demo&Vital) → (Demo&Vital)+Lab+Med | 8.9% | 13.4% | 21.2% | 21.7% | 0% | 5.7% |
| (Demo&Vital) → (Demo&Vital)+Lab+Med+Co-morbid | 27.8% | 9.6% | 19.1% | 19.5% | -10.4% | 7.6% |
| (Demo&Vital)+Lab → (Demo&Vital)+Lab+Med | 3.2% | 1.7% | 1.7% | 3.7% | 0% | -9.8% |
| (Demo&Vital)+Lab → (Demo&Vital)+Lab+Med+Co-morbid | 20.9% | -1.7% | 0% | 1.8% | -10.4% | -8.1% |
| (Demo&Vital)+Lab+Med → (Demo&Vital)+Lab+Med+Co-morbid | 15.9% | -3.3% | -1.7% | -1.7% | -10.4% | 1.8% |

Adding co-morbidities:
• decreased the AUC of the other classifiers by 5.3% on average;
• increased the AUC of CPXC by 21.5% on average.
46. Application: Saturated Hydraulic Conductivity.
• Collaboration with the University of Texas at Austin and USDA-ARS.
• Problem definition: (1) prediction of the soil water retention curve (SWRC); (2) prediction of saturated hydraulic conductivity (SHC); (3) investigating the effect of sample dimensions on prediction accuracy.
• Number of predictor variables: 6-13.
• Number of response variables: 10.
• 32 CPXR models are developed.
Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
47. Application: Saturated Hydraulic Conductivity. [Scatter plots of predicted vs. measured log(Ksat) [cm day-1] for the SHC2 model: RMSLE = 0.456 and RMSLE = 1.936.]

| Model | | s | t | 10 | 30 | 50 | 100 | 300 | 500 | 1000 | 1500 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | SWRC1 | 0.79 | 0.73 | 0.77 | 0.84 | 0.85 | 0.84 | 0.83 | 0.84 | 0.81 | 0.77 |
| | SWRC2 | 0.79 | 0.72 | 0.77 | 0.85 | 0.84 | 0.84 | 0.84 | 0.83 | 0.80 | 0.78 |
| CPXR | SWRC1 | 0.94 | 0.97 | 0.97 | 0.94 | 0.97 | 0.97 | 0.95 | 0.96 | 0.95 | 0.94 |
| | SWRC2 | 0.95 | 0.96 | 0.94 | 0.95 | 0.97 | 0.96 | 0.95 | 0.98 | 0.97 | 0.94 |
48. Conclusion.
• A new type of highly accurate and interpretable regression and classification models, PXR/PXC, is presented.
• New techniques to build PXR and PXC models are discussed.
• Each pattern-model pair represents a distinct predictor-response interaction.
• PXR and PXC models are more accurate, more interpretable, and less prone to overfitting than other regression and classification algorithms.
• A new method adapted from CPXC is presented to handle classification of imbalanced datasets.
• Several applications of CPXR and CPXC are discussed.
49. Related publications.
• Guozhu Dong, Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 9, pp. 2452-2465, Sept. 2015.
• Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. IEEE International Conference on Bioinformatics and Bioengineering (BIBE), 2014, pp. 283-290 (Best Student Paper).
• Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov Pachepsky. Sample dimensions effect on prediction of soil water retention curve and saturated hydraulic conductivity. Journal of Hydrology, 528 (2015): 127-137.
• Vahid Taslimitehrani, Guozhu Dong, Naveen Pereira, Maryam Panahiazar, Jyotishman Pathak. Developing EHR-driven heart failure models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics (2016).
• Guozhu Dong, Vahid Taslimitehrani. Pattern Aided Classification. SIAM Data Mining Conference, 2016.
50. Acknowledgement.
