Presented at the 15th International Conference on BioInformatics and BioEngineering (BIBE 2014).
Prognostic modeling is central to medicine, as it is often used to predict patients' outcomes and responses to treatments and to identify important medical risk factors. Logistic regression is one of the most used approaches for clinical prediction modeling. Traumatic brain injury (TBI) is an important public health issue and a leading cause of death and disability worldwide. In this study, we adapt CPXR (Contrast Pattern Aided Regression, a recently introduced regression method) to develop a new logistic regression method called CPXR(Log) for general binary outcome prediction (including prognostic modeling), and we use the method to carry out prognostic modeling for TBI using admission-time data. The models produced by CPXR(Log) achieved an AUC as high as 0.93 and a specificity as high as 0.97, much better than those reported by previous studies. Our method produced interpretable prediction models for diverse patient groups for TBI, which show that different kinds of patients should be evaluated differently for TBI outcome prediction and that the odds ratios of some predictor variables differ significantly from those given by previous studies; such results can be valuable to physicians.
Performance analysis of regularized linear regression models for oxazolines a... - ijcsity
Regularized regression techniques for linear regression have been developed over the last few decades to reduce the flaws of ordinary least squares regression with regard to prediction accuracy. In this paper, new methods for using regularized regression in model choice are introduced, and we distinguish the conditions in which regularized regression improves our ability to discriminate models. We applied all five methods that use penalty-based (regularization) shrinkage to an Oxazolines and Oxazoles derivatives descriptor dataset with far more predictors than observations. The lasso, ridge, elastic net, LARS, and relaxed lasso further possess the desirable property that they simultaneously select relevant predictive descriptors and optimally estimate their effects. Here, we comparatively evaluate the performance of these five regularized linear regression methods. Assessing the performance of each model by means of benchmark experiments is an established exercise: cross-validation and resampling methods are generally used to arrive at point estimates of the efficiencies, which are compared to identify methods with acceptable features. Predictive accuracy was evaluated using the root mean squared error (RMSE) and the square of the usual correlation between predicted and observed mean inhibitory concentration of antitubercular activity (R squared). We found that all five regularized regression models were able to produce feasible models that efficiently capture the linearity in the data. The elastic net and LARS had similar accuracies, as did the lasso and relaxed lasso, and they outperformed ridge regression in terms of the RMSE and R squared metrics.
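As a rough illustration of the kind of benchmark this abstract describes, the sketch below cross-validates four of the five penalized estimators on synthetic data with far more predictors than observations, using scikit-learn (an assumed toolchain: relaxed lasso has no stock scikit-learn estimator, and the actual descriptor dataset is not reproduced here).

# Hedged sketch: compare penalized regressions by cross-validated RMSE
# and R^2 on synthetic p >> n data, mirroring the setup described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lars, Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=40, n_features=200, noise=5.0, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "lars": Lars(n_nonzero_coefs=10),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error")
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:12s} RMSE={rmse.mean():7.2f}  R2={r2.mean():5.2f}")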
Identification of Outliers in Time Series Data via Simulation Study - iosrjce
IOSR Journal of Mathematics (IOSR-JM) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of mathematics and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in mathematics. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Artificial Bee Colony algorithm (ABC) is a population-based heuristic search technique used for optimization problems. ABC is a very effective optimization technique for continuous optimization problems. Crossover operators have a better exploration property, so crossover operators are added to ABC. This paper presents ABC with different types of real-coded crossover operators and its application to the Travelling Salesman Problem (TSP). Each crossover operator is applied to two randomly selected parents from the current swarm. Two offspring are generated from the crossover; the worst parent is replaced by the best offspring, while the other parent remains the same. ABC with a real-coded crossover operator is then applied to the travelling salesman problem. The experimental results show that our proposed algorithm performs better than ABC without crossover in terms of efficiency and accuracy.
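A minimal sketch of the crossover-and-replacement step this abstract describes, under stated assumptions: real-coded individuals, an arithmetic blend crossover, and a generic fitness to minimize (the paper's TSP decoding and the full ABC phases are omitted).

# Sketch of the step described above: pick two random parents, generate two
# offspring by real-coded crossover, replace the worse parent with the best
# offspring, and keep the other parent. Illustrative, not the paper's code.
import random

def arithmetic_crossover(p1, p2, alpha=0.5):
    """Real-coded crossover: blend two parent vectors into two offspring."""
    c1 = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
    c2 = [(1 - alpha) * a + alpha * b for a, b in zip(p1, p2)]
    return c1, c2

def crossover_step(swarm, fitness):
    """Apply crossover to two randomly selected parents (lower is better)."""
    i, j = random.sample(range(len(swarm)), 2)
    c1, c2 = arithmetic_crossover(swarm[i], swarm[j])
    best_offspring = min((c1, c2), key=fitness)
    worse = i if fitness(swarm[i]) > fitness(swarm[j]) else j
    swarm[worse] = best_offspring          # other parent remains the same

# Example: minimize the sphere function over a small swarm.
swarm = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
for _ in range(100):
    crossover_step(swarm, fitness=lambda v: sum(x * x for x in v))
print(min(sum(x * x for x in v) for v in swarm))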
Analysis of Parameter using Fuzzy Genetic Algorithm in E-learning System - Harshal Jain
The aim of this project is to analyze the parameters of the inputs in order to find better solutions to an optimization problem than the candidate solutions we have. This will help us to find a more accurate knowledge level of the user, using a Genetic Algorithm (GA). In this algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions.
The artificial bee colony (ABC) algorithm is a well-known and relatively recent swarm intelligence based technique. It is a population-based meta-heuristic algorithm used for numerical optimization, based on the intelligent behavior of honey bees. ABC is one of the most popular techniques used in optimization problems and has some major advantages over other heuristic methods. To exploit its good features, a number of researchers have combined the ABC algorithm with other methods, generating new hybrid methods. This paper provides a comparative analysis of the hybrid differential Artificial Bee Colony algorithm against hybrid ABC-SPSO, the Genetic algorithm, and an Independent rough set approach, based on parameters such as technique, dimension, and methodology.
In this research, we broaden the advantages of the nonnegative garrote as a feature selection method and empirically show that it provides results comparable to panel models. We compare the nonnegative garrote to other variable selection methods such as ridge, lasso, and adaptive lasso, and analyze their performance on a dataset we previously analyzed in earlier research. We show that the results from the nonnegative garrote are comparable to the robustness checks applied to the panel models to validate statistically significant variables, and we conclude that the nonnegative garrote is a robust variable selection method for panel orthonormal data, as it accounts for the fixed and random effects present in panel datasets.
A NOVEL APPROACH TO MINE FREQUENT PATTERNS FROM LARGE VOLUME OF DATASET USING... - IAEME Publication
In this paper, an MDL-based reduction of frequent patterns is presented. The ideal outcome of any pattern mining process is to explore the data for new insights; we also need to eliminate the non-interesting patterns that describe noise. The major problem in frequent pattern mining is identifying the interesting patterns. Instead of performing association rule mining on all the frequent item sets, it is feasible to select a subset of frequent item sets and perform the mining task on that subset. Selecting a small set of frequent item sets from a large number of interesting ones is a difficult task. In our approach, an MDL-based algorithm is used to reduce the number of frequent item sets used for association rule mining.
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO... - cscpconf
Optimization problems are dominantly being solved using computational intelligence. One of the issues that can be addressed in this context is the evaluation of attribute subset selection. This paper presents a computational intelligence technique for solving this optimization problem using a proposed model called Modified Genetic Search Algorithms (MGSA), which avoids bad local search spaces using merit and scaled fitness variables, detecting and deleting bad candidate chromosomes, thereby reducing the number of individual chromosomes in the search space and the subsequent iterations in later generations. This paper also aims to show that Rotation Forest ensembles are useful for feature selection. The base classifier is multinomial logistic regression integrated with Haar wavelets as the projection filter, reproducing the rank of each feature with 10-fold cross-validation. The paper discusses the main findings and concludes with promising results for the proposed model. It explores the combination of MGSA for optimization with Naïve Bayes classification. The results obtained using the proposed MGSA model are validated mathematically using Principal Component Analysis. The goal is to improve the accuracy and quality of breast cancer diagnosis with robust machine learning algorithms. Compared to other works in the literature survey, the experimental results achieved in this paper show better results with statistical inference.
Comparing Marketplace ACA Enrollees to Employer Sponsor Programs - Jasmine_Dixon
The size and scope of the Affordable Care Act (ACA) is making it difficult for researchers to get the data necessary to form any comprehensive studies of its effects. The political situation surrounding its inception does not make matters any simpler, as it adds partisan elements that shift from day to day. Fortunately, there is one metric doctors can use to approximate the health of enrollees: prescription drug use.
How to increase website conversions by applying the laws of great product design - Lindsay Bayuk
In this talk, Lindsay Bayuk from Pure Chat will cover the core concepts for building a great product and how to apply them to your WordPress site. Learn how to optimize your WordPress site and increase conversion with better targeting, positioning, messaging, personas and testing.
Wordcamp Salt Lake City 2015
https://www.purechat.com/
Insurance with the Unified Social Security Fund (EFKA), pursuant to the provisions of Law 4387/16, for members of companies or cooperatives and members of the board of directors of sociétés anonymes or agricultural cooperatives, with respect to the remuneration they receive, whether in the context of dependent employment or in their capacity as board members.
Vahid Taslimitehrani's Dissertation Defense: Friday, February 19, 2015.
Ph.D. Committee: Drs. Guozhu Dong (Advisor), T.K. Prasad, Amit Sheth, Keke Chen, and Jyotishman Pathak (Division of Health Informatics, Weill Cornell Medical College, Cornell University).
ABSTRACT:
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose a novel type of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions of the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local classifier is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques, called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC), to build accurate and interpretable PXR and PXC models.
We have conducted a set of comprehensive performance studies to evaluate CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of modeling, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to handle classification of imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we apply a weighting method to boost minority instances, as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques in three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in those three applications.
Data reduction: breaking down large sets of data into more-manageable groups or segments that provide better insight.
- Data sampling
- Data cleaning
- Data transformation
- Data segmentation
- Dimension reduction
Poster for Society for Clinical Trials annual meeting in Boston, MA
Abstract
Randomization methods generally are designed to be both unpredictable and balanced between treatment allocations overall and within strata. However, when planning studies, little consideration is given to measuring these characteristics, nor are they examined jointly, and published comparisons between methods often use incompatible metrics and simulation assumptions. Furthermore, for purposes of real-world planning, such simulations often make unrealistic assumptions (e.g., equal sized strata), and summary statistics give limited information.
Machine learning workshop, session 3.
- Data sets
- Machine Learning Algorithms
- Algorithms by Learning Style
- Algorithms by Similarity
- People to follow
Simplified Knowledge Prediction: Application of Machine Learning in Real Life - Peea Bal Chakraborty
Machine learning is the scientific study of algorithms and statistical models that machines use to perform a specific task, relying on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict whether a patient suspected of breast cancer has a malignant or benign cancer. In this paper, the classification of cancer type and the prediction of risk levels are done with various machine learning models and depicted pictorially with various visual analytics tools.
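As a small illustration of the malignant-vs-benign prediction task described here, the sketch below fits a single model to scikit-learn's bundled Wisconsin breast cancer data (an assumed stand-in; the paper's own models and visual-analytics tooling are not reproduced).

# Hedged sketch: predict malignant vs benign on a public dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # 0 = malignant, 1 = benign
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))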
An Improved Iterative Method for Solving General System of Equations via Gene... - Zac Darcy
Various algorithms are known for solving linear systems of equations. Iterative methods are recommended for solving large sparse linear systems, but for general n × m matrices the classic iterative algorithms are not applicable except in a few cases. The algorithm presented here is based on minimizing the residual of the solution and has some genetic characteristics that call for the use of Genetic Algorithms. As a result, this algorithm is well suited for the construction of parallel algorithms. In this paper, we describe a sequential version of the proposed algorithm and present its theoretical analysis. Moreover, we show some numerical results for the sequential algorithm, supply an improved algorithm, and compare the two algorithms.
MM-KBAC: Using mixed models to adjust for population structure in a rare-va... - Golden Helix Inc
Confounding from population structure, extended families and inbreeding can be a significant issue for burden and kernel association tests on rare variants from next generation DNA sequencing. An obvious solution is to combine the power of a mixed model regression analysis with the ability to assess the rare variant burden using methods such as KBAC or CMC. Recent approaches have adjusted burden and kernel tests using linear regression models; this method adjusts for the relatedness of samples and includes that directly into a logistic regression model.
This webcast will focus on the details of bringing Mixed Model Regression and KBAC together, including: deriving an optimal logistic mixed model algorithm for calculating the reduced model score, how the kinship or random effects matrix should be specified, and how it all comes together into one algorithm. Results from applying the method to variants from the 1000 Genomes project will also be presented and compared to famSKAT.
Introduction to machine learning: basics and an overview of machine learning; linear regression; logistic regression; cost functions; gradient descent; sensitivity and specificity; model selection.
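For the gradient-descent topic listed above, a minimal worked sketch (an assumed setup: plain batch gradient descent on the mean-squared-error cost of a linear model, not the workshop's own materials).

# Batch gradient descent on the MSE cost J(w) = mean((Xw - y)^2).
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=1000):
    X = np.c_[np.ones(len(X)), X]              # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE cost
        w -= lr * grad                         # fixed-step update
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])             # generated by y = 1 + 2x
print(gradient_descent(X, y))                  # converges to ~[1. 2.]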
Correcting bias and variation in small RNA sequencing for optimal (microRNA) ... - Christos Argyropoulos
Presentation given about the Generalized Additive Model Location, Scale and Shape (GAMLSS) methodology for the analysis of small RNA sequencing data and the potential of microRNAs as biomarkers for kidney and cardiometabolic diseases
Similar to: A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury
1. A new CPXR Based Logistic Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury
Vahid Taslimitehrani, Guozhu Dong
Kno.e.sis Center, Department of Computer Science and Engineering
Wright State University, Dayton, OH
2. Outline
• Motivation and background
• Preliminaries
– Contrast pattern mining
– Logistic regression
• CPXR(Log)
• TBI data
• Results of CPXR(Log) on TBI
• Conclusion
• References
3. Motivation and Background
• CPXR(Log): accurate and informative prognostic models
Prognostic models are central to medicine. [Steyerberg, 2009]
They facilitate physicians' decision making on patient treatment plans, screening, etc.
They help in understanding disease behavior, including identifying new biomarkers.
The number of articles listed in PubMed with "prediction model" in the title in 2012 is 7 times that in 2000. [PubMed]
4. Motivation and Background
• CPXR(Log): a powerful new generic logistic regression method
Logistic regression is one of the most popular approaches for building clinical prediction models. [Steyerberg, 2009]
Logistic regression models are desirable since
they are interpretable,
they are probability based, and
they are flexible in terms of predictor variables (categorical and numerical variables).
5. Motivation and Background
• Traumatic Brain Injury
One of the leading causes of death and disability worldwide.
Annually, 1.5 million deaths worldwide. [Perel, 2006]
$76.5 billion in direct and indirect costs in 2010 in the US. [www.cdc.gov]
Early and accurate prognostic models based on just admission-time data allow physicians to make time-critical clinical decisions.
6. Challenges in clinical modeling
• Accuracy of clinical prediction models
• Ease of interpreting clinical prediction models
– to explain medical decisions to the patient
– to identify important risk factors
• Avoiding overfitting, to make clinical prediction models more generalizable
• Early decision making
• Ability to capture heterogeneous patient group behavior
7. CPXR works well by using several (pattern, local model) pairs
There are different subpopulations that need different prediction models; using just one prediction function does not work well!
This is not an extreme case; it happens very often...
8. How is CPXR(Log) different from other classifiers?
• CPXR introduced the idea of
– using patterns to logically characterize different subpopulations of data,
– using local regression models to represent the predictor-response relationship of each subpopulation, and
– choosing a pattern only if its local model is very accurate. [Dong, 2014]
• CPXR(Log)
– can capture diversified/heterogeneous behavior,
– is more generalizable, and
– overfits less than other classifiers.
• CPXR(Log) is more accurate than other classifiers such as SVM and Random Forest.
9. Traditional classification vs CPXR
Traditional: training data goes to a classification engine, which outputs a single classifier (model).
CPXR: training data goes to a classification engine, which outputs a baseline model; the data is then split into large-error and small-error instances, and contrast patterns and local models are built and selected, yielding pairs (Pattern 1, Model 1), (Pattern 2, Model 2), ..., (Pattern k, Model k).
10. CPXR(Log) – PXR concept
• Definition: Let $D = \{(X_i, Y_i) \mid 1 \le i \le n\}$ be training data for regression. Let $f$ be a regression model built on $D$, which we will call the baseline model on $D$. A pattern aided regression (PXR) model is a tuple $PM = ((P_1, f_1, w_1), \ldots, (P_k, f_k, w_k), f_d)$, where $\{P_1, \ldots, P_k\}$ is the pattern set of $PM$, the $f_i$ are the local regression models of the $P_i$, and $f_d$ is the default regression model. We define the regression model of $PM$ as
$$f_{PM}(x) = \begin{cases} \dfrac{\sum_{P_i \in \pi(x)} w_i f_i(x)}{\sum_{P_i \in \pi(x)} w_i} & \text{if } \pi(x) \neq \emptyset \\ f_d(x) & \text{otherwise} \end{cases}$$
for each instance $x$, where $\pi(x) = \{P_i \mid 1 \le i \le k,\ x \text{ satisfies } P_i\}$.
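A small sketch of the prediction rule in this definition, assuming patterns are represented as boolean predicates over instances and the local/default models as callables (illustrative names, not the authors' implementation).

# Evaluate a PXR model on instance x: weighted average of the local models
# whose patterns match x; fall back to the default model if none match.
def pxr_predict(x, pairs, f_default):
    """pairs: list of (pattern, local_model, weight) where pattern(x) -> bool
    and local_model(x) -> prediction; f_default is the default model f_d."""
    matched = [(f_i, w_i) for (p_i, f_i, w_i) in pairs if p_i(x)]
    if not matched:                      # pi(x) is empty: use default model
        return f_default(x)
    return (sum(w * f(x) for f, w in matched) /
            sum(w for _, w in matched))  # weighted average of local models

# Toy usage: one pattern ("age < 30") with its local model and weight.
pairs = [(lambda x: x["age"] < 30, lambda x: 0.2 * x["age"], 1.5)]
print(pxr_predict({"age": 20}, pairs, f_default=lambda x: 0.1 * x["age"]))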
11. Preliminaries: Contrast Patterns
• A toy example:

TID  A1  A2  A3  A4  A5  Class
t1   b   d   e   g   i   C1
t2   b   c   e   g   i   C1
t3   a   c   e   g   j   C2
t4   a   c   e   h   j   C2
t5   b   d   f   g   i   C2

• $P_1 = (A_2 = c\ \&\ A_3 = e)$, $mt(P_1, D) = \{t_2, t_3, t_4\}$, $supp(P_1, D) = 3/5 = 60\%$
• $suppRatio_{C_1 \to C_2}(P_1) = 2/1 = 2$
• Given a threshold like 2, $P_1$ is a contrast pattern.
• Details: we only consider one minimal generator pattern for each "equivalence class" of contrast patterns.
12. Quality measures
• CPXR(Log) needs to efficiently extract a desirable pattern set from a huge search space of potential pattern sets.
• Definition: The average residual reduction (arr) of a pattern $P$ w.r.t. a model $f$ and a dataset $D$ is
$$arr(P) = \frac{\sum_{X \in mds(P)} r_X(f) - \sum_{X \in mds(P)} r_X(f_P)}{|mds(P)|}$$
• Definition: The total residual reduction (trr) of a pattern set $PS = \{P_1, \ldots, P_k\}$ w.r.t. a model $f$ and a dataset $D$ is
$$trr(PS) = \frac{\sum_{X \in mds(PS)} r_X(f) - \sum_{X \in mds(PS)} r_X(f_{PM})}{\sum_{X \in D} r_X(f)}$$
where $PM = ((P_1, f_{P_1}, w_1), \ldots, (P_k, f_{P_k}, w_k), f)$, $w_i = arr(P_i)$, and $mds(PS) = \bigcup_{P \in PS} mds(P)$.
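A sketch of these two measures in code, under assumed representations: mds(P) returns a pattern's matching instances, residual(model, x) returns $r_x$ of a model, and pm_residual stands in for $r_X(f_{PM})$ of the combined PXR model.

def arr(P, f, f_P, mds, residual):
    """Average residual reduction of local model f_P over P's matching data."""
    data = mds(P)
    return sum(residual(f, x) - residual(f_P, x) for x in data) / len(data)

def trr(PS, f, pm_residual, D, mds, residual):
    """Total residual reduction of pattern set PS, given pm_residual(x) =
    r_x(f_PM) for the PXR model built from PS with weights w_i = arr(P_i).
    Instances are assumed hashable so mds(PS) can be deduplicated."""
    matched = set().union(*(set(mds(P)) for P in PS))
    gain = sum(residual(f, x) - pm_residual(x) for x in matched)
    return gain / sum(residual(f, x) for x in D)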
13. CPXR(Log) algorithm – outline
• First step: split the training dataset $D$ into two classes, $LE$ and $SE$.
• $LE$: instances of $D$ where the baseline model $f$ makes large errors.
• $SE$: instances of $D$ where the baseline model $f$ makes small errors.
• Second step: extract all contrast patterns on $LE$ satisfying $minSup$.
• Third step: search for a small set of patterns that maximizes error reduction, and use that set to build a PXR model.
• Note:
Each pattern $P$ is associated with a local regression model $f_P$ built on $P$'s matching data.
Using a pattern $P$ and its associated local regression model $f_P$ is a flexible way to represent one predictor-response relationship.
Different $(P, f_P)$ pairs represent highly different predictor-response relationships.
14. CPXR(Log) – details (1)
• Inputs:
– training data $D = \{(x_i, y_i) \mid 1 \le i \le n\}$
– baseline model $f$
– $\rho$, used to partition $D$ into $LE$ and $SE$
– $minSup$ threshold on contrast patterns
• Output:
– a PXR model
Let $r_1, \ldots, r_n$ denote $f$'s errors on $x_1, \ldots, x_n$;
Determine $\kappa$ to minimize $\left| \rho - \frac{\sum_{r_i > \kappa} r_i}{\sum_i r_i} \right|$;
Let $LE = \{x_i \mid r_i > \kappa\}$ and $SE = D - LE$;
Discretize each numerical variable using entropy-based binning;
Extract all contrast patterns for $minSup$ in the $LE$ class ($CPS$);
15. CPXR(Log) – details (2)
For each $P \in CPS$, build the local regression model $f_P$ for the data in $mds(P)$;
Let $PS = \{P_0\}$, where $P_0$ is the pattern $P$ in $CPS$ with the highest $arr$;
Let $f_d$ be the regression model trained on $D - \bigcup_{P \in PS} mds(P)$;
Return $PM(PS, f_d)$;
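The first step of the algorithm (choosing $\kappa$ and splitting $D$) is concrete enough to sketch directly; a minimal version over an array of residuals, written as an assumed-faithful reading of the minimization above.

# Choose kappa minimizing |rho - sum_{r_i > kappa} r_i / sum_i r_i|, then
# return the index sets (LE, SE) of large-error and small-error instances.
import numpy as np

def split_large_small_error(residuals, rho):
    r = np.asarray(residuals, dtype=float)
    kappa = min(np.unique(r),
                key=lambda k: abs(rho - r[r > k].sum() / r.sum()))
    return np.where(r > kappa)[0], np.where(r <= kappa)[0]

# Example: with rho = 0.5, the instances carrying about half of the total
# residual mass land in LE.
LE, SE = split_large_small_error([0.9, 0.1, 0.8, 0.05, 0.2], rho=0.5)
print(LE, SE)   # -> [0] [1 2 3 4]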
16. TBI data
• The TBI dataset is a collection of several international and US Tirilazad trials.
• 2159 instances. [Steyerberg, 2008]
• 15 numerical and categorical predictor variables.
• Missing values were treated using multiple imputation.
• The outcome variable is the Glasgow Outcome Scale: GOS 1 (dead), ..., GOS 5 (good recovery).
• This study used two discretized versions of GOS: "Mortality" vs survival (GOS 1 vs GOS 2-5) and "Unfavorable" vs favorable (GOS 1-3 vs GOS 4-5).

Category                  Predictor variables
Basic                     Cause of injury, age, GCS motor score, pupil reactivity
Computed tomography (CT)  Hypoxia, hypotension, Marshall CT, tSAH, eDH, compressed cistern, midline shift more than 5 mm
Lab                       Glucose, pH, sodium, Hb
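The two dichotomizations are simple to express; a sketch assuming a pandas column gos coded 1 (dead) through 5 (good recovery).

# Derive the two binary outcomes described on this slide from GOS.
import pandas as pd

df = pd.DataFrame({"gos": [1, 2, 3, 4, 5]})
df["mortality"] = (df["gos"] == 1).astype(int)     # GOS 1 vs GOS 2-5
df["unfavorable"] = (df["gos"] <= 3).astype(int)   # GOS 1-3 vs GOS 4-5
print(df)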
17. Results – performance of SLogR and CPXR(Log) on Mortality models

Model          SLogR: Spec / Sens / F1 / AUC   CPXR(Log): Spec / Sens / F1 / AUC
Basic          0.95 / 0.18 / 0.27 / 0.77       0.96 / 0.18 / 0.28 / 0.80
Basic+CT       0.95 / 0.32 / 0.42 / 0.80       0.96 / 0.42 / 0.53 / 0.88
Basic+CT+Lab   0.94 / 0.36 / 0.46 / 0.80       0.97 / 0.46 / 0.58 / 0.92

Of course, more accurate than standard logistic regression.
18. Results – performance of SLogR and CPXR(Log) on Unfavorable models

Model          SLogR: Spec / Sens / F1 / AUC   CPXR(Log): Spec / Sens / F1 / AUC
Basic          0.85 / 0.52 / 0.59 / 0.76       0.89 / 0.54 / 0.63 / 0.82
Basic+CT       0.85 / 0.60 / 0.66 / 0.80       0.87 / 0.65 / 0.70 / 0.87
Basic+CT+Lab   0.84 / 0.61 / 0.66 / 0.81       0.91 / 0.72 / 0.76 / 0.93
19. Results – impact of adding more variables on AUC
AUC improvement when more variables are used, by CPXR(Log) and SLogR:

Variable set change      Mortality: CPXR(Log) / SLogR   Unfavorable: CPXR(Log) / SLogR
Basic to Basic+CT        10% / 7.7%                     6% / 5.2%
Basic to Basic+CT+Lab    15% / 11.1%                    13.4% / 6.6%

AUC improvement of CPXR(Log) over SLogR:

              Basic   Basic+CT   Basic+CT+Lab
Mortality     11.1%   12.8%      15%
Unfavorable   7.9%    8.8%       14.8%
20. Results – ROC curves of Basic models [figure]
21. Results – ROC curves of (Basic+CT) models [figure]
22. Results – ROC curves of (Basic+CT+Lab) models [figure]
23. Results – performance comparison [figure]
Comparing CPXR(Log) performance with: logistic regression, SVM, Random Forest.
24. Example: patterns used by CPXR(Log) for Mortality (Basic+CT+Lab)

Pattern                                                                    arr    Cov
(CT classification = III)                                                  15%    20%
(CT classification = V) AND (midline shift) AND (0.56 < glucose <= 10.4)   12%    15%
(No compressed cistern) AND (No midline shift) AND (7.22 < pH <= 7.45)     10%    40%
(10.77 < glucose <= 21.98) AND (134 < sodium <= 144)                       18%    18%
(No hypotension) AND (134 < sodium < 144) AND (10.55 < Hb <= 14.57) AND (with tSAH)   19%    20%
(No tSAH) AND (134 < sodium <= 144) AND (10.77 < glucose <= 21.98) AND (No hypotension) AND (No midline shift) AND (One reactive pupil)   19%    20%
(No tSAH) AND (One reactive pupil)                                         18%    40%
25. Odds ratios [figure]
Pattern: (CT classification = V) AND (midline shift) AND (0.56 < glucose <= 10.4)
26. Residual reduction and an example patient
• Age = 15 years
• Cause of injury = motorbike accident
• GCS motor score = 5 (no eye response)
• No reactive pupil
• No hypoxia
• No hypotension
• CT scan classification = V (mass lesion)
• No tSAH
• With eDH
• Has a midline shift of more than 5 mm
• Glucose = 9.06 mmol/l
• pH = 7.37
• Sodium = 141 mmol/l
• Hb = 14.4 g/dl
• The patient died.
Standard logistic regression predicted a 0.78 risk of survival! The patient matched "pattern II", and CPXR(Log) predicted a 0.38 risk of survival.
[Figures: error distribution of the TBI dataset under SLogR and under CPXR(Log)]
27. Results – box plot of RMSE reduction in CPXR [figure]
How much CPXR can reduce the RMSE (root mean squared error) on 50 datasets compared to:
• piecewise linear regression
• support vector regression
• Bayesian additive regression trees
• the gradient boosting method
28. Results – noise sensitivity and impact of the number of patterns [figures]
The number of patterns is determined by the method automatically.
How much do noisy datasets impact the performance of CPXR and other methods?
29. Conclusion
• We presented an effective new method, CPXR(Log), for logistic regression and for clinical predictive modeling.
• We showed CPXR(Log) is more accurate than standard logistic regression and some other classification algorithms.
• We also presented CPXR(Log) models, including patterns, local models, and new odds ratios of predictor variables.
30. References
• G. Dong and V. Taslimitehrani: Pattern-Aided Regression Modeling and Prediction Model Analysis. Tech Report, CSE, Wright State University, 2014.
• E. Steyerberg: Clinical Prediction Models. Springer, 2009.
• P. Perel, P. Edwards, R. Wentz, and I. Roberts: Systematic review of prognostic models in traumatic brain injury. BMC Medical Informatics and Decision Making, 6(1): 1-10, 2006.
• G. Dong and J. Li: Efficient mining of emerging patterns: Discovering trends and differences. In Proc. KDD, 43-52, 1999.
• E.W. Steyerberg, et al.: Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics. PLoS Medicine, 5(8): e165, 2008.
31. Preliminaries: Logistic Regression
• Regression modeling: predicting a response variable (output) based on predictor variables (input).
• Logistic regression: the response variable is binary, for example
– "having the disease" or "not"
– "mortal" or "not"
• Let $X = (x_1, x_2, \ldots, x_n)$ be a vector of predictor variables and $Y$ be the response variable.
• The goal of logistic regression is to learn a function $lp(X) = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$ satisfying
$$\log \frac{P(Y = 1)}{1 - P(Y = 1)} = lp(X)$$
• Chi-square ($\chi^2$) is one of the goodness-of-fit measures for logistic regression.
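To connect this to the odds ratios reported earlier in the deck: each fitted coefficient $\beta_i$ yields the odds ratio $e^{\beta_i}$, the multiplicative change in the odds $P(Y=1)/(1-P(Y=1))$ per unit increase in $x_i$. A sketch on synthetic data (scikit-learn assumed; its default L2 penalty shrinks the estimates slightly).

# Fit a logistic regression and read off odds ratios as exp(coefficients).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
lp = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]              # true lp(X); x3 unused
y = (rng.random(2000) < 1 / (1 + np.exp(-lp))).astype(int)

model = LogisticRegression().fit(X, y)
print(np.exp(model.coef_[0]))   # odds ratios, roughly [e^1, e^-2, e^0]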
33. Preliminaries: Contrast Patterns
• The matching data of pattern $P$ in dataset $D$, $mt(P, D)$, is the set of all instances matching pattern $P$.
• The support of pattern $P$ in $D$ is $supp(P, D) = \frac{|mt(P, D)|}{|D|}$.
• Given two classes $C_1$ and $C_2$, the support ratio of pattern $P$ from $C_1$ to $C_2$ is
$$suppRatio_{C_1 \to C_2}(P) = \frac{supp(P, C_2)}{supp(P, C_1)}$$
• Given a threshold $\gamma$, a contrast pattern (emerging pattern) of class $C_2$ is a pattern $P$ satisfying $suppRatio_{C_1 \to C_2}(P) \geq \gamma$. [Dong, 1999]
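These definitions are easy to check on the toy example from the earlier Contrast Patterns slide; a sketch with patterns as attribute-value dictionaries. Note that, by the ratio-of-supports definition on this slide, $suppRatio_{C_1 \to C_2}(P_1) = (2/3)/(1/2) = 4/3$, while the toy slide's $2/1 = 2$ compares raw match counts.

def supp(pattern, data):
    """Support of a pattern (dict attr -> value) in a list of dict rows."""
    matches = [t for t in data if all(t[a] == v for a, v in pattern.items())]
    return len(matches) / len(data)

D = [
    {"A1": "b", "A2": "d", "A3": "e", "A4": "g", "A5": "i", "cls": "C1"},  # t1
    {"A1": "b", "A2": "c", "A3": "e", "A4": "g", "A5": "i", "cls": "C1"},  # t2
    {"A1": "a", "A2": "c", "A3": "e", "A4": "g", "A5": "j", "cls": "C2"},  # t3
    {"A1": "a", "A2": "c", "A3": "e", "A4": "h", "A5": "j", "cls": "C2"},  # t4
    {"A1": "b", "A2": "d", "A3": "f", "A4": "g", "A5": "i", "cls": "C2"},  # t5
]
C1 = [t for t in D if t["cls"] == "C1"]
C2 = [t for t in D if t["cls"] == "C2"]
P1 = {"A2": "c", "A3": "e"}

print(supp(P1, D))                  # 0.6  (= 3/5, as on the toy slide)
print(supp(P1, C2) / supp(P1, C1))  # suppRatio C1 -> C2 = 4/3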