This document describes a project to predict 30-day hospital readmission for diabetic patients using machine learning models. It provides an overview of the dataset used and the data preprocessing steps, including cleaning, dimension reduction, and sampling. Several classification models are tested on the preprocessed data, including logistic regression, neural networks, decision trees, boosted trees, bootstrap forest, and naive Bayes. The neural network model is selected for having the lowest false negative rate while maintaining acceptable accuracy. Further adjustments are made to optimize the cutoff value to reduce false negatives. The developed model is expected to help healthcare providers identify high-risk patients and potentially prevent readmissions.
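The cutoff adjustment described above can be sketched in a few lines. The labels and predicted probabilities below are hypothetical stand-ins, not the project's data:

```python
# Sketch: lowering the probability cutoff trades extra false positives
# for fewer false negatives. Scores and labels here are made up.

def confusion_counts(y_true, scores, cutoff):
    """Return (TP, FP, TN, FN) at a given probability cutoff."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
    tn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 1)
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]            # 1 = readmitted within 30 days
scores = [0.9, 0.4, 0.6, 0.2, 0.55, 0.1, 0.3, 0.35]

for cutoff in (0.5, 0.3):
    tp, fp, tn, fn = confusion_counts(y_true, scores, cutoff)
    print(f"cutoff={cutoff}: FN={fn}, FP={fp}, FN rate={fn / (fn + tp):.2f}")
```

On this toy data the default 0.5 cutoff misses two readmissions; dropping the cutoff to 0.3 catches them at the price of one extra false alarm, which is the trade the project accepts when false negatives are costlier.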
Understanding the Analytical Method Validation in a Practical Perspective - Dr. Ishaq B Mohammed
The document discusses analytical method development, validation and transfer. It begins by introducing the importance of method development, validation and transfer in pharmaceutical analysis. It then discusses some key aspects of each including the objectives of method development, definition of validation, and the purpose of method transfer. The document provides examples of parameters to consider for method development including sample type, required data, analyte levels, and expected precision and accuracy. It also gives an overview of common validation parameters like accuracy, precision, specificity, range and linearity. The document aims to provide guidance on establishing reliable analytical methods for pharmaceutical applications.
SGS Biopharm Day 2016 - Modeling & simulation in Phase 1 - Ruben Faelens
Modeling and simulation can optimize clinical trials in the following ways:
1. PK/PD modeling can help translate animal data to humans, define initial doses for first-in-human studies, and determine dose escalation strategies.
2. Modeling can guide decisions throughout development by exploring objectives for future phases.
3. A case study example demonstrates how PK/PD modeling was used to iteratively refine dose selection and determine stopping rules for a first-in-human clinical trial of a monoclonal antibody.
2014-10-22 EUGM | ROYCHAUDHURI | Phase I Combination Trials - Cytel USA
This document summarizes a Bayesian statistical approach for determining the maximum tolerated dose (MTD) in phase I oncology clinical trials testing drug combinations. It describes the challenges of combination trials and outlines a methodology using a Bayesian model with parameters for individual drug effects and drug-drug interactions. The methodology is applied to a sample dual-drug combination trial, with results presented across multiple cohorts and doses. Based on the modeled toxicity probabilities and additional clinical data, 6 mg of drug 1 and 400 mg of drug 2 are identified as the MTD and recommended phase II dose.
2010 smg training_cardiff_day1_session1 (1 of 3)_mckenzie - rgveroniki
This document discusses different analytical methods for meta-analyzing continuous outcome data from randomized trials: final values, change scores, and analysis of covariance (ANCOVA). It presents an example comparing the properties of these estimators using observed and simulated trial data. Key findings include:
1) The three estimators can produce different intervention effect estimates depending on the correlation between baseline and follow-up scores; ANCOVA generally has the smallest standard error.
2) ANCOVA is preferred as it is unconditionally and conditionally unbiased, whereas final values and change scores can be conditionally biased.
3) In meta-analysis, when trials have adequate allocation concealment, pooled baseline imbalance is usually not problematic; however
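The comparison of the three estimators can be illustrated with a toy simulation. Everything below (effect size, baseline-follow-up correlation, sample size) is an illustrative assumption, and the ANCOVA adjustment is approximated using the known correlation rather than a fitted regression slope:

```python
# Toy two-arm trial: baseline ~ N(50, 10), follow-up correlated with
# baseline (rho), treatment shifts follow-up by true_effect.
import random
import statistics

random.seed(42)
n, true_effect, rho = 2000, -2.0, 0.7

def simulate_arm(n, effect):
    base = [random.gauss(50, 10) for _ in range(n)]
    follow = [50 + effect + rho * (b - 50) +
              random.gauss(0, 10 * (1 - rho ** 2) ** 0.5) for b in base]
    return base, follow

tb, tf = simulate_arm(n, true_effect)  # treatment arm
cb, cf = simulate_arm(n, 0.0)          # control arm

final_values = statistics.mean(tf) - statistics.mean(cf)
change_scores = ((statistics.mean(tf) - statistics.mean(tb)) -
                 (statistics.mean(cf) - statistics.mean(cb)))
# ANCOVA-style estimate: adjust the follow-up difference for any
# chance baseline imbalance between the arms
ancova = final_values - rho * (statistics.mean(tb) - statistics.mean(cb))

for name, est in [("final values", final_values),
                  ("change scores", change_scores),
                  ("ANCOVA", ancova)]:
    print(f"{name}: {est:.2f}")
```

All three estimates land near the true effect of -2, but because ANCOVA removes the baseline-correlated noise it has the smallest standard error across repeated simulations, matching finding 1 above.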
How to double success rate of pediatric trials? - JogaGobburu
This document discusses using quantitative clinical pharmacology and modeling to improve the success rate of pediatric drug trials. It proposes a "learn-apply" approach where prior knowledge from clinical trials is used to simulate trial designs and select doses for registration studies. This can help power pediatric studies and support approvals. The goal is to leverage past learnings to design better pediatric trials through simulation and modeling.
This document provides a summary of a project to analyze factors related to readmission of diabetes patients using a dataset from 130 US hospitals. The team cleaned the data by removing attributes with high percentages of missing values, irrelevant attributes, and instances of deceased patients. They applied the SMOTE technique to address data imbalance, oversampling the minority readmission class by 200%. Three classifiers - J48 decision tree, Naive Bayes, and Bayes Net - were selected for experiments to predict patient readmission.
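The SMOTE step can be sketched as follows. This is a minimal pure-Python illustration of the interpolation idea on made-up 2-D points; a real analysis would use a library implementation (e.g. Weka's SMOTE filter or imbalanced-learn):

```python
# Minimal SMOTE-style oversampling sketch (illustrative data only).
import random

def smote_200(minority, k=2, seed=0):
    """Generate 2 synthetic points per minority point (200% oversampling)
    by interpolating toward a random one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours by squared Euclidean distance, excluding x
        others = sorted((p for j, p in enumerate(minority) if j != i),
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        for _ in range(2):                 # 200% -> two synthetic samples
            nb = rng.choice(neighbours)
            gap = rng.random()             # interpolation fraction in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (5.0, 5.0)]
new_points = smote_200(minority)
print(len(new_points))   # 4 minority samples -> 8 synthetic samples
```

In SMOTE's terminology, oversampling "by 200%" means two synthetic samples per original minority sample, each placed on the line segment between a minority point and one of its nearest minority-class neighbours.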
Are you interested in learning how to prevent hospital readmissions for your diabetic population? It is a popular belief that measuring blood glucose is the most predictive variable in determining a hospital readmission for a diabetic. However, many providers of care simply do not perform the test on known diabetic patients. This study looks at an advanced analytic method that works within the current healthcare provider's workflow to identify the likelihood of a future 30-day unplanned readmission before hospital discharge.
Big Data Analytics for Healthcare - Chandan K. Reddy (.docx) - aulasnilda
Big Data Analytics for Healthcare

Chandan K. Reddy
Department of Computer Science, Wayne State University

Jimeng Sun
Healthcare Analytics Department, IBM TJ Watson Research Center
Jimeng Sun, Large-scale Healthcare Analytics
Healthcare Analytics using Electronic Health Records (EHR)

Old way: data are expensive and small
- Input data come from clinical trials, which are small and costly
- Modeling effort is small since the data are limited (a single model can still take months)

EHR era: data are cheap and large
- Broader patient population
- Noisy data
- Heterogeneous data
- Diverse scale
- Complex use cases
Heterogeneous Medical Data
- Diagnosis
- Medication
- Lab
- Clinical notes
- Images
- Genetic data
Challenges in Healthcare Analytics
- Collaboration across domains
- Analytic platform
- Intuitive results
- Scalable computation
PARALLEL MODEL BUILDING
Motivation: predictive modeling using EHR is growing

Need for scalable predictive modeling platforms/systems due to increased computational requirements from:
- Processing EHR data (due to volume, variability, and heterogeneity)
- Building accurate models
- Building clinically meaningful models
- Validating models for accuracy and generalizability

(Chart: explosion in interest in EHR-based predictive modeling.)
What does it take to develop a predictive model using EHR?

Marina (IBM analytics consultant): Within 3 months, we need to
1. understand the business case
2. obtain the data
3. prepare the data
4. develop predictive models
5. deliver the final model
David Gotz, Harry Starvropoulos, Jimeng Sun, Fei Wang. ICDA: A Platform for Intelligent Care Delivery Analytics, AMIA 2012.
A Generalized Predictive Modeling Pipeline
Cohort Construction: Find an appropriate set of patients with the specified
target condition and a corresponding set of control patients without the
condition.
Feature Construction: Compute a feature vector representation for each
patient based on the patient’s EHR data.
Cross Validation: Partition the data into complementary subsets for use in
model training and validation testing.
Feature Selection: Rank the input features and select a subset of relevant
features for use in the model.
Classification: Train and evaluate a model for a specific classifier.
Output: Clean up intermediate files and put results into their final locations.
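The pipeline stages above can be sketched as plain functions over toy records. The field names and data are illustrative assumptions, not the ICDA platform's actual interfaces:

```python
# Toy sketch of cohort construction, feature construction, and
# cross-validation splitting from the generalized pipeline.

def construct_cohort(patients, condition):
    """Split patients into cases (with the target condition) and controls."""
    cases = [p for p in patients if condition in p["diagnoses"]]
    controls = [p for p in patients if condition not in p["diagnoses"]]
    return cases, controls

def construct_features(patient, feature_names):
    """One feature vector per patient, drawn from the patient's EHR events."""
    return [patient["events"].get(name, 0) for name in feature_names]

def split_folds(items, n_folds=3):
    """Complementary subsets for model training and validation testing."""
    return [items[i::n_folds] for i in range(n_folds)]

patients = [
    {"id": 1, "diagnoses": {"heart failure"}, "events": {"bnp": 3, "echo": 1}},
    {"id": 2, "diagnoses": {"hypertension"}, "events": {"bp": 5}},
    {"id": 3, "diagnoses": {"heart failure"}, "events": {"bnp": 1}},
]

cases, controls = construct_cohort(patients, "heart failure")
vectors = [construct_features(p, ["bnp", "echo", "bp"]) for p in cases]
folds = split_folds(cases + controls, n_folds=3)
print(len(cases), vectors, len(folds))
```

Feature selection and classification would then operate on `vectors` within each fold; the sketch only shows how the earlier stages hand data to them.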
Cohort Construction

(Diagram: cases and controls are drawn from the pool of all patients.)

Disease  Target                  Samples
D1       Hypertension control    5000
D2       Heart failure onset     33K
D3       Hypertension diagnosis  300K
The study explores major factors that contribute to hospital readmissions via various analysis algorithms, including decision tree, neural network, and Bayesian network.
The document discusses predictive modeling for diabetes using medical claims data. It describes cleaning the data, calculating baseline statistics for diabetes, using likelihood ratios to predict diabetes risk in a validation set, and evaluating the model's sensitivity, specificity, and AUC. While the model achieves good accuracy for public policy, the conclusion cautions that individual risk predictions may not be accurate enough to guide patients due to overlaps between normal and disease risk distributions.
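The likelihood-ratio step works by converting a baseline probability to odds, multiplying by the test's likelihood ratio, and converting back. The prevalence and likelihood ratio below are made-up numbers, not the study's estimates:

```python
def post_test_probability(pretest_p, lr):
    """Update a pretest probability with a likelihood ratio via odds."""
    pretest_odds = pretest_p / (1 - pretest_p)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical: 9% baseline diabetes prevalence, a positive marker
# with a likelihood ratio of 6.
print(round(post_test_probability(0.09, 6.0), 3))  # -> 0.372
```

Note that even a fairly strong likelihood ratio moves a low baseline probability only to about 37% here, which illustrates the conclusion's caution: overlapping risk distributions limit how decisive individual predictions can be.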
This document discusses biases that can arise in randomized controlled trials and meta-analyses. It notes that biases can be introduced in the design, conduct, analysis, and reporting of trials. Various empirical studies are presented that demonstrate biases from lack of allocation concealment and blinding in trials. Risk of bias assessments are recommended over quality scores for evaluating biases in individual trials and meta-analyses.
The document describes the design and testing of a decision support tool for intelligently setting guardrail limits on smart infusion pumps. The tool allows users to set guardrails based on simple rules to reduce missed detections. Testing of a rule to set thresholds to eliminate missed detections for morphine doses in adults showed that it greatly increased the false alarm rate from 46% to 99.5%, while eliminating all missed detections. The conclusions discuss expanding the tool to more drugs and rules, and reducing remaining false alarms through manual threshold adjustments.
Business Analytics with R - Using Data Mining Techniques - Anvitha Ananth
Applied various data mining techniques (decision tree, random forest, naive Bayes, and boosting) to determine which was most accurate in predicting medical appointment no-shows. Used ROC curves and confusion matrices to compare the results of the techniques applied.
Sample Size: A couple more hints to handle it right using SAS and R - Dave Vanz
Andrii Artemchuk from Intego Group, a Ukrainian offshore staffing company, presented this PowerPoint on SAS and R at the PhUSE conference in Frankfurt, Germany, in 2018.
ICU Patient Deterioration Prediction: A Data-Mining Approach - csandit
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH - cscpconf
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
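A minimal version of ranking lab-test features might look like this. The data and the score (absolute class-mean difference) are toy assumptions, much simpler than the state-of-the-art selectors the paper applies to MIMIC-II:

```python
# Univariate feature ranking: score each lab test by how far apart its
# class means are, then sort tests from most to least informative.

def rank_features(X, y, names):
    scores = {}
    for j, name in enumerate(names):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(scores, key=scores.get, reverse=True)

names = ["lactate", "creatinine", "wbc"]
X = [[4.0, 1.0, 7.0], [3.8, 1.1, 8.0], [1.0, 1.0, 7.5], [1.2, 0.9, 7.2]]
y = [1, 1, 0, 0]          # 1 = deteriorated, 0 = stable
print(rank_features(X, y, names))  # lactate separates the classes most
```

Tests at the bottom of the ranking are candidates for omission, which is exactly the cost and observation-time saving the abstract argues for.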
Decision Support System to Evaluate Patient Readmission Risk - Avishek Choudhury
This document presents a decision support system for predicting patient readmission risk. It discusses the importance of reducing patient readmissions due to factors like healthcare costs and patient health. The author develops a predictive model using a dataset of 68,000 patient instances and 14 attributes. Feature selection identifies the top 5 important parameters. Various machine learning methods are tested, with support vector machines achieving the best accuracy of 97% on preprocessed data. A genetic algorithm and greedy ensemble further improve the prediction accuracy to 98.5% using gradient boosting. The author concludes the model can help identify at-risk patients to minimize preventable readmissions.
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... - IJDKP
This document discusses examining the effect of feature selection on improving patient deterioration prediction in intensive care units. The authors apply feature selection techniques to laboratory test data from the MIMIC-II database to identify the most important laboratory tests for predicting patient deterioration. They find that feature selection can help reduce redundant tests, potentially saving costs and allowing earlier treatment. The selected features provide insights into critical tests without domain expertise. In future work, the authors plan to evaluate additional feature selection methods and classification algorithms on this task.
AN IMPROVED MODEL FOR CLINICAL DECISION SUPPORT SYSTEM - ijaia
The document describes an improved model for a clinical decision support system that was developed to address issues with misdiagnosis and inconsistent healthcare records. The system incorporates both knowledge-based and non-knowledge based decision support methods using a hybrid approach. It was trained and validated using prostate cancer and diabetes datasets, achieving classification accuracies of 98% and 94% respectively. The system aims to enhance disease detection and prediction to support better healthcare delivery.
Clinical trials involve testing new drugs or treatments on human subjects to evaluate their efficacy, safety and appropriate dosages. They generally proceed through four phases, starting with animal and laboratory tests, followed by small safety and efficacy trials on humans, then larger randomized controlled trials, and finally post-marketing surveillance. Randomized controlled trials aim to reduce bias by randomly assigning similar subjects to either the test treatment or a control treatment, and collecting and analyzing outcome data while blinding investigators. Intention-to-treat analysis includes all subjects in their original assigned groups regardless of compliance or withdrawal to avoid bias from non-compliance. Multiple regression and logistic regression analyses can be used to compare outcomes between treatments while accounting for prognostic factors.
Survey on data mining techniques in heart disease prediction - Sivagowry Shathesh
This document describes a study on applying data mining techniques to analyze and predict heart disease. It discusses how data mining can extract valuable knowledge from healthcare data. The study uses several data mining techniques like decision trees, naive Bayes classification, clustering, and association rule mining on heart disease datasets from UC Irvine to predict heart disease. Experimental results show that multilayer neural networks and classification techniques like naive Bayes had higher prediction accuracy compared to other methods.
A webinar hosted by CHIME. It shared thoughts on one of my areas of interest – harnessing both business intelligence and health IT, for more effective measurement of healthcare performance.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
The document describes a study that uses principal component analysis (PCA) for feature extraction to reduce the number of clinical markers needed for disease classification. PCA was performed on prostate cancer and diabetes datasets to extract the most relevant features. For prostate cancer, PCA extracted 3 features from 4 original markers, and for diabetes it extracted 4 features from 5 original markers. When the reduced feature sets were used in a neural network, it yielded classification accuracies of 80% for prostate cancer and 75% for diabetes. The feature extraction approach aims to lower the cost of clinical decision support systems by reducing the number of tests required while maintaining accuracy.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classifying disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from the prostate cancer and diabetes datasets. The simulation and experiments were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The results showed that, when trained in a multilayer neural network, the extracted relevant features yielded better classification accuracy, at 80% for the prostate cancer dataset and 75% for the diabetes dataset.
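The PCA step can be sketched with NumPy on random data. The paper used MATLAB and real clinical markers; the dataset shape here is an arbitrary stand-in:

```python
# Minimal PCA feature extraction: project the data onto the top-k
# principal components of its covariance matrix.
import numpy as np

def pca_extract(X, k):
    """Project X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k directions
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))     # e.g. 20 patients, 5 clinical markers
Z = pca_extract(X, 4)            # reduce 5 markers to 4 extracted features
print(Z.shape)                   # (20, 4)
```

`eigh` is used because the covariance matrix is symmetric, and the columns of the result are ordered by descending eigenvalue, so the first extracted feature carries the most variance.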
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classifying disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from the prostate cancer and diabetes datasets. The simulation and experiments were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The results showed that, when trained in a multilayer neural network, the extracted relevant features yielded better classification accuracy, at 80% for the prostate cancer dataset and 75% for the diabetes dataset respectively.
David Madigan MedicReS World Congress 2014 - MedicReS
This document discusses issues with published observational studies and potential ways to improve them. It notes that even small choices in study design can lead to a wide range of results. Applying different analysis methods to large healthcare databases still found many false positives and negatives. However, performance improved by tailoring analyses to specific outcomes and restricting to larger sample sizes. Self-controlled designs like case-crossover consistently outperformed traditional cohort and case-control designs. Further strategies like evaluating results against known null distributions may help address biases from unmeasured confounding.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
1Big Data Analytics forHealthcareChandan K. ReddyD.docxaulasnilda
1
Big Data Analytics for
Healthcare
Chandan K. Reddy
Department of Computer Science
Wayne State University
Jimeng Sun
Healthcare Analytics Department
IBM TJ Watson Research Center
2Jimeng Sun, Large-scale Healthcare Analytics
Healthcare Analytics using Electronic Health Records (EHR)
Old way: Data are expensive and small
– Input data are from clinical trials, which is small
and costly
– Modeling effort is small since the data is limited
• A single model can still take months
EHR era: Data are cheap and large
– Broader patient population
– Noisy data
– Heterogeneous data
– Diverse scale
– Complex use cases
3Jimeng Sun, Large-scale Healthcare Analytics
Heterogeneous Medical Data
DiagnosisDiagnosis
MedicationMedication
LabLab
Clinical
notes
Clinical
notes
ImagesImages
Genetic
data
Genetic
data
4Jimeng Sun, Large-scale Healthcare Analytics
Challenges of Healthcare AnalyticsScalability ChallengesChallenges in Healthcare Analytics
Collaboration across domains
Analytic platform
Intuitive results
Scalable computation
5
PARALLEL MODEL BUILDING
6Jimeng Sun, Large-scale Healthcare Analytics
Motivation – Predictive modeling using EHR is growing
Need for scalable predictive modeling platforms/systems due to increased
computational requirements from:
– Processing EHR data (due to volume, variability, and heterogeneity)
– Building accurate models
– Building clinically meaningful models
– Validating models for accuracy and generalizability
Explosion in
interest
7Jimeng Sun, Large-scale Healthcare Analytics
What does it take to develop a predictive model using EHR?
Marina: IBM
Analytics Consultant
1
2
3
4
5
Within 3 months, we need to
1. understand business case
2. obtain the data
3. prepare the data
4. develop predictive models
5. deliver the final model
David Gotz, Harry Starvropoulos, Jimeng Sun, Fei Wang.
ICDA: A Platform for Intelligent Care Delivery Analytics, AMIA 2012
8Jimeng Sun, Large-scale Healthcare Analytics
A Generalized Predictive Modeling Pipeline
Cohort Construction: Find an appropriate set of patients with the specified
target condition and a corresponding set of control patients without the
condition.
Feature Construction: Compute a feature vector representation for each
patient based on the patient’s EHR data.
Cross Validation: Partition the data into complementary subsets for use in
model training and validation testing.
Feature Selection: Rank the input features and select a subset of relevant
features for use in the model.
Classification: The training and evaluation of a model for a specific classifier.
Output: Clean up intermediate files and to put results into their final locations.
Model specification
9Jimeng Sun, Large-scale Healthcare Analytics
Cohort Construction
A
ll
pa
tie
nt
s
D1
Disease Target samples
D1 Hypertension control 5000
D2 Heart failure onset 33K
D3 Hypertension diagnosis 300K
Cases
Controls
D3
D2
10Jimeng Sun, Large- ...
The study explores major factors that contribute to hospital readmissions via various analysis algorithms, including decision tree, neutral network and Bayesian network.
The document discusses predictive modeling for diabetes using medical claims data. It describes cleaning the data, calculating baseline statistics for diabetes, using likelihood ratios to predict diabetes risk in a validation set, and evaluating the model's sensitivity, specificity, and AUC. While the model achieves good accuracy for public policy, the conclusion cautions that individual risk predictions may not be accurate enough to guide patients due to overlaps between normal and disease risk distributions.
This document discusses biases that can arise in randomized controlled trials and meta-analyses. It notes that biases can be introduced in the design, conduct, analysis, and reporting of trials. Various empirical studies are presented that demonstrate biases from lack of allocation concealment and blinding in trials. Risk of bias assessments are recommended over quality scores for evaluating biases in individual trials and meta-analyses.
The document describes the design and testing of a decision support tool for intelligently setting guardrail limits on smart infusion pumps. The tool allows users to set guardrails based on simple rules to reduce missed detections. Testing of a rule to set thresholds to eliminate missed detections for morphine doses in adults showed that it greatly increased the false alarm rate from 46% to 99.5%, while eliminating all missed detections. The conclusions discuss expanding the tool to more drugs and rules, and reducing remaining false alarms through manual threshold adjustments.
Business Analytics with R - Using Data Mining TechniquesAnvitha Ananth
Applied various data mining techniques such as - Decision Tree, Random forest, Naive Bayes and Boosting to determine which was more accurate in predicting medical appointment no shows. Used the ROC and confusion matrix to compare the results of the various techniques applied.
Sample Size: A couple more hints to handle it right using SAS and RDave Vanz
Andrii Artemchuk from Intego Group, a Ukrainian offshore staffing company, presented this power point to the audience at a phUSE conference in Frankfurt Germany in 2018 on SAS and R
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
A huge amount of medical data is generated every day, which presents a challenge in analysing these data. The obvious solution to this challenge is to reduce the amount of data without information loss. Dimension reduction is considered the most popular approach for reducing data size and also to reduce noise and redundancies in data. In this paper, we investigate the effect of feature selection in improving the prediction of patient deterioration in ICUs. We consider lab tests as features. Thus, choosing a subset of features would mean choosing the most important lab tests to perform. If the number of tests can be reduced by identifying the most important tests, then we could also identify the redundant tests. By omitting the redundant tests, observation time could be reduced and early treatment could be provided to avoid the risk. Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art feature selection for predicting ICU patient deterioration using the medical lab results. We apply our technique on the publicly available MIMIC-II database and show the effectiveness of the feature selection. We also provide a detailed analysis of the best features identified by our approach.
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH - cscpconf
A huge amount of medical data is generated every day, which presents a challenge in analysing
these data. The obvious solution to this challenge is to reduce the amount of data without
information loss. Dimension reduction is considered the most popular approach for reducing
data size and also to reduce noise and redundancies in data. In this paper, we investigate the
effect of feature selection in improving the prediction of patient deterioration in ICUs. We
consider lab tests as features. Thus, choosing a subset of features would mean choosing the
most important lab tests to perform. If the number of tests can be reduced by identifying the
most important tests, then we could also identify the redundant tests. By omitting the redundant
tests, observation time could be reduced and early treatment could be provided to avoid the risk.
Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art
feature selection for predicting ICU patient deterioration using the medical lab results. We
apply our technique on the publicly available MIMIC-II database and show the effectiveness of
the feature selection. We also provide a detailed analysis of the best features identified by our
approach.
Decision Support System to Evaluate Patient Readmission Risk - Avishek Choudhury
This document presents a decision support system for predicting patient readmission risk. It discusses the importance of reducing patient readmissions due to factors like healthcare costs and patient health. The author develops a predictive model using a dataset of 68,000 patient instances and 14 attributes. Feature selection identifies the top 5 important parameters. Various machine learning methods are tested, with support vector machines achieving the best accuracy of 97% on preprocessed data. A genetic algorithm and greedy ensemble further improve the prediction accuracy to 98.5% using gradient boosting. The author concludes the model can help identify at-risk patients to minimize preventable readmissions.
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ... - IJDKP
This document discusses examining the effect of feature selection on improving patient deterioration prediction in intensive care units. The authors apply feature selection techniques to laboratory test data from the MIMIC-II database to identify the most important laboratory tests for predicting patient deterioration. They find that feature selection can help reduce redundant tests, potentially saving costs and allowing earlier treatment. The selected features provide insights into critical tests without domain expertise. In future work, the authors plan to evaluate additional feature selection methods and classification algorithms on this task.
AN IMPROVED MODEL FOR CLINICAL DECISION SUPPORT SYSTEM - ijaia
The document describes an improved model for a clinical decision support system that was developed to address issues with misdiagnosis and inconsistent healthcare records. The system incorporates both knowledge-based and non-knowledge based decision support methods using a hybrid approach. It was trained and validated using prostate cancer and diabetes datasets, achieving classification accuracies of 98% and 94% respectively. The system aims to enhance disease detection and prediction to support better healthcare delivery.
Clinical trials involve testing new drugs or treatments on human subjects to evaluate their efficacy, safety and appropriate dosages. They generally proceed through four phases, starting with animal and laboratory tests, followed by small safety and efficacy trials on humans, then larger randomized controlled trials, and finally post-marketing surveillance. Randomized controlled trials aim to reduce bias by randomly assigning similar subjects to either the test treatment or a control treatment, and collecting and analyzing outcome data while blinding investigators. Intention-to-treat analysis includes all subjects in their original assigned groups regardless of compliance or withdrawal to avoid bias from non-compliance. Multiple regression and logistic regression analyses can be used to compare outcomes between treatments while accounting for prognostic factors.
Survey on data mining techniques in heart disease prediction - Sivagowry Shathesh
This document describes a study on applying data mining techniques to analyze and predict heart disease. It discusses how data mining can extract valuable knowledge from healthcare data. The study uses several data mining techniques like decision trees, naive Bayes classification, clustering, and association rule mining on heart disease datasets from UC Irvine to predict heart disease. Experimental results show that multilayer neural networks and classification techniques like naive Bayes had higher prediction accuracy compared to other methods.
A webinar hosted by CHIME. It shared thoughts on one of my areas of interest – harnessing both business intelligence and health IT, for more effective measurement of healthcare performance.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
The document describes a study that uses principal component analysis (PCA) for feature extraction to reduce the number of clinical markers needed for disease classification. PCA was performed on prostate cancer and diabetes datasets to extract the most relevant features. For prostate cancer, PCA extracted 3 features from 4 original markers, and for diabetes it extracted 4 features from 5 original markers. When the reduced feature sets were used in a neural network, it yielded classification accuracies of 80% for prostate cancer and 75% for diabetes. The feature extraction approach aims to lower the cost of clinical decision support systems by reducing the number of tests required while maintaining accuracy.
THE APPLICATION OF EXTENSIVE FEATURE EXTRACTION AS A COST STRATEGY IN CLINICA... - IJDKP
Patients waste a great deal of resources in the course of identifying the pathogens that caused their ailments; this calls for concern, hence the need to develop a reliable tool for minimizing the cost involved in classification of disease pathogens without compromising accuracy. In this paper, we developed a feature extraction model which reduces the clinical markers for prostate cancer and diabetes. The feature extraction, in the form of principal component analysis (PCA), was used to extract relevant components from prostate cancer and diabetes datasets. The simulation and experiment of the system were done with MATLAB. The system was able to extract 3 relevant features out of 4 prostate cancer clinical markers and 4 relevant features out of 5 diabetes clinical markers. The result showed that when trained in a multilayer neural network it yielded better classification accuracy with the extracted relevant features: 80% and 75% in the prostate cancer and diabetes datasets respectively.
David Madigan MedicReS World Congress 2014 - MedicReS
This document discusses issues with published observational studies and potential ways to improve them. It notes that even small choices in study design can lead to a wide range of results. Applying different analysis methods to large healthcare databases still found many false positives and negatives. However, performance improved by tailoring analyses to specific outcomes and restricting to larger sample sizes. Self-controlled designs like case-crossover consistently outperformed traditional cohort and case-control designs. Further strategies like evaluating results against known null distributions may help address biases from unmeasured confounding.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Analysis insight about a Flyball dog competition team's performance - roli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
1. OPIM 5604 | Team 3
OPIM 5604 - PREDICTIVE MODELING
Predicting Hospital Readmission Rates within 30 days for
Diabetic Patients
TEAM 3
Yashi Sarbhai
Piyush Bishnoi
Manu Shankar
Muhammad Sanan Akbar
Mounika Paladugu
Contents
1.0 Executive Summary
2.0 Problem Statement
3.0 Methodology
3.1 Dataset Overview
3.1.1 Attributes and Target Variable Table
3.2 Data Exploration Techniques
3.2.1 Data Cleaning
3.2.2 Dimension Reduction
3.2.3 Missing Value Detection
3.2.4 Outlier Detection and Treatment
4.0 Modification
4.1 Recoding Categorical Values
4.2 Rare Event Sampling
5.0 Modeling
5.1 Nominal Logistic
5.2 Neural Networks
5.3 Decision Trees
5.4 Boosted Tree
5.5 Bootstrap Forest
5.6 Naïve Bayes
6.0 Assess
6.1 Model Comparison
6.2 Model Improvement
7.0 Results and Conclusion
7.1 Business Value of the Model
7.2 Conclusion
8.0 References
1.0 Executive Summary:
A patient is considered 're-admitted' when, after being discharged from the hospital, they need to be admitted again with the same problem within 30 days. The number of hospital readmissions indicates inefficiency in healthcare systems and additional treatment costs. Therefore, healthcare markets and government healthcare agencies use 30-day readmission as an index of the quality of treatment provided, a performance and quality-control measure, and a target for cost reduction. Identifying which patients are potential candidates for readmission will enable healthcare providers to improve their service, perform any additional investigations if needed, and preferably prevent readmission in the future.

The National Diabetic Statistics report states that 9.3% of the population in the United States has diabetes, of which 28% are still undiagnosed. According to a current US medical report, there are approximately 0.1 million diabetic patients, and readmission treatment for them costs around $250 million. The 30-day readmission rate for diabetes is found to be 13-25%, which is considerably higher than the overall rate for hospitalized patients (8-14%).
2.0 Problem Statement:
The Hospital Readmission Reduction Program, started under the Affordable Care Act, aims to improve the quality of medical treatment and reduce spending on readmissions. We are trying to predict the readmission of diabetic patients within 30 days from the given dataset. We cannot prevent readmission entirely, but the developed model and its predictions can be used to reduce readmissions if the necessary measures are taken and implemented. Real-world data for 100,061 patients was collected; it has 50 parameters covering all medical details related to patients, diagnoses, hospitals, lab tests, etc. The first major task is to identify the parameters that contribute directly to readmission and derive the trend. The collected data has a huge amount of missing values and redundant information. The developed model is expected to predict the 30-day readmission of diabetic patients with significant accuracy. The study performed describes data collection, data preparation, dimension reduction, the models deployed and their accuracy, and the interesting observations and patterns identified.
3.0 Methodology
3.1 DATASET OVERVIEW
The dataset has been extracted from the UCI machine learning repository and represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks, including numerous features representing patient and hospital outcomes.
The total number of instances in the dataset is 101,766 and the total number of column attributes is 50. The target variable in this data is the "Readmitted" column, which is classified by days to readmission (<30, >30, NO).
Link: https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008#
3.1.1 Attributes and Target Variable Table
List of features and their descriptions in the initial dataset.
Attribute Type Description and values % missing
Encounter ID Numeric Unique identifier of an encounter 0%
Patient number Numeric Unique identifier of a patient 0%
Race Nominal
Values: Caucasian, Asian, African American, Hispanic, and
other 2%
Gender Nominal Values: male, female, and unknown/invalid 0%
Age Nominal Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) 0%
Weight Numeric Weight in pounds. 97%
Admission type Nominal
Integer identifier corresponding to 9 distinct values, for
example, emergency, urgent, elective, newborn, and not
available 0%
Discharge
disposition Nominal
Integer identifier corresponding to 29 distinct values, for
example, discharged to home, expired, and not available 0%
Admission source Nominal
Integer identifier corresponding to 21 distinct values, for
example, physician referral, emergency room, and transfer
from a hospital 0%
Time in hospital Numeric Integer number of days between admission and discharge 0%
Payer code Nominal
Integer identifier corresponding to 23 distinct values, for
example, Blue Cross/Blue Shield, Medicare, and self-pay 52%
Medical specialty Nominal
Integer identifier of a specialty of the admitting physician,
corresponding to 84 distinct values, for example,
cardiology, internal medicine, family/general practice, and
surgeon 53%
Number of lab
procedures Numeric Number of lab tests performed during the encounter 0%
Number of
procedures Numeric
Number of procedures (other than lab tests) performed
during the encounter 0%
Number of
medications Numeric
Number of distinct generic names administered during the
encounter 0%
Number of
outpatient visits Numeric
Number of outpatient visits of the patient in the year
preceding the encounter 0%
Number of
emergency visits Numeric
Number of emergency visits of the patient in the year
preceding the encounter 0%
Number of
inpatient visits Numeric
Number of inpatient visits of the patient in the year
preceding the encounter 0%
Diagnosis 1 Nominal
The primary diagnosis (coded as first three digits of ICD9);
848 distinct values 0%
Diagnosis 2 Nominal
Secondary diagnosis (coded as first three digits of ICD9);
923 distinct values 0%
Diagnosis 3 Nominal
Additional secondary diagnosis (coded as first three digits
of ICD9); 954 distinct values 1%
Number of
diagnoses Numeric Number of diagnoses entered to the system 0%
Glucose serum test
result Nominal
Indicates the range of the result or if the test was not taken.
Values: “>200,” “>300,” “normal,” and “none” if not
measured 0%
A1c test result Nominal
Indicates the range of the result or if the test was not taken.
Values: “>8” if the result was greater than 8%, “>7” if the
result was greater than 7% but less than 8%, “normal” if the
result was less than 7%, and “none” if not measured. 0%
Change of
medications Nominal
Indicates if there was a change in diabetic medications
(either dosage or generic name). Values: “change” and “no
change” 0%
Diabetes
medications Nominal
Indicates if there was any diabetic medication prescribed.
Values: “yes” and “no” 0%
23 features for
medications Nominal
For the generic names: metformin, repaglinide, nateglinide,
chlorpropamide, glimepiride, acetohexamide, glipizide,
glyburide, tolbutamide, pioglitazone, rosiglitazone,
acarbose, miglitol, troglitazone, tolazamide, examide,
insulin, glyburide-metformin, glipizide-metformin,
glimepiride-pioglitazone, metformin-rosiglitazone, and
metformin-pioglitazone, the feature indicates whether the
drug was prescribed or there was a change in the dosage.
Values: “up” if the dosage was increased during the
encounter, “down” if the dosage was decreased, “steady” if
the dosage did not change, and “no” if the drug was not
prescribed 0%
Readmitted Nominal
Days to inpatient readmission. Values: “<30” if the patient
was readmitted in less than 30 days, “>30” if the patient
was readmitted in more than 30 days, and “No” for no
record of readmission. 0%
3.2 Data Exploration Techniques
The core objective of the data exploration step is to remove redundant data: rows and columns that are less significant in predicting the target variable. The below steps were followed in exploring and processing the data:
3.2.1 Data Cleaning
A new value "NA" was created and imputed for the insignificant values in the instances:
1. Admission_type_id -> Values 5, 6, 8, which represent Not Available, NULL and Not Mapped, are converted to NA, and 7, which represents Trauma Center, is recoded to 5.
2. Discharge_disposition_id -> 18, 25, 26 (NULL, Not Mapped, Invalid) converted to NA.
3. Admission_source_id -> 9, 15, 17, 20, 21 (Not Available, NULL, Not Mapped, Invalid) converted to NA.
3.2.2 Dimension Reduction
A dimension reduction technique was deployed in the exploration process to reduce the number of variables. The target was to identify the minimum number of relevant attributes with non-overlapping information. The below measures were taken:
3.2.2.1 Column Removal.
The following columns were removed from the dataset.
S. No. Attribute Reason for Removal
1. Weight 98,569 values were missing from a total of 101,766 rows, which accounts for 96.85%
2. Payer Code Payer code signifies the mode of payment for different patients and is not very significant to the problem statement
3. Medical specialty Medical specialty has around 50% missing values and thus needs to be removed
4. Diag1, Diag2, Diag3 These are nominal variables with around 1,000 distinct possible values each; they are codes used for medical purposes. These columns were removed to reduce the complexity of the model (a trade-off between complexity and accuracy)
3.2.2.2 Derived Column
medical_procedures: This column is a summation of num_lab_procedures, num_procedures and num_medications. It represents the individual's interaction with the hospital.
previous_number_of_visits: number_outpatient, number_emergency and number_inpatient were converted into a single attribute, which is the sum of the three columns.
diabetes_medications: The 23 medication columns, metformin through metformin-pioglitazone, were converted to a scale of 0 (no) and 1 (yes). These values were then summed to reduce the number of columns from 23 to 1.
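The three derived columns can be sketched as below. This is an illustrative pandas version using the UCI column names; the two example rows are invented, and only two of the 23 drug columns are shown as stand-ins.

```python
import pandas as pd

# Two invented encounters; column names follow the UCI dataset.
df = pd.DataFrame({
    "num_lab_procedures": [41, 59],
    "num_procedures": [0, 5],
    "num_medications": [1, 18],
    "number_outpatient": [0, 2],
    "number_emergency": [0, 1],
    "number_inpatient": [0, 1],
    "metformin": ["No", "Steady"],   # stand-ins for the 23 drug columns
    "insulin": ["Up", "No"],
})

# medical_procedures: total procedures/medications during the encounter
df["medical_procedures"] = (df["num_lab_procedures"]
                            + df["num_procedures"]
                            + df["num_medications"])

# previous_number_of_visits: all visits in the year preceding the encounter
df["previous_number_of_visits"] = (df["number_outpatient"]
                                   + df["number_emergency"]
                                   + df["number_inpatient"])

# diabetes_medications: each drug column becomes 0/1 ("No" -> 0, any of
# Steady/Up/Down -> 1), then the indicators are summed into one count
drug_cols = ["metformin", "insulin"]
df["diabetes_medications"] = (df[drug_cols] != "No").sum(axis=1)
```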
3.2.2.3 Principal Component Analysis.
Principal component analysis was computed for 5 attributes (time in hospital, medical_procedures, previous visits, number of diagnoses, diabetes_medications). 4 components out of five were chosen. This decision was made keeping in mind the complexity versus the accuracy of the model.
As a result of all these techniques, we were able to reduce the attributes from 50 to 18, which means that dimensionality was reduced by more than 50%.
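The PCA step (done in JMP in the project) can be sketched with plain NumPy via the SVD of the standardized data matrix; the data here is random stand-in data, not the dataset itself.

```python
import numpy as np

# Random stand-in data: 200 encounters x 5 attributes
# (time in hospital, medical_procedures, previous visits,
#  number of diagnoses, diabetes_medications).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before PCA

# PCA via SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = S**2 / (S**2).sum()            # variance explained per component

k = 4                                      # keep 4 of the 5 components
scores = X @ Vt[:k].T                      # reduced data, shape (200, 4)
```

In practice `k` is chosen by inspecting `explained`, the same complexity-versus-accuracy trade-off described above.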
3.2.3 Missing Value Detection
Missing value detection and treatment played an essential role in data processing. The below two approaches were followed to treat the missing values:
● All the missing values were first identified and imputed with a value called 'NA'
● Attributes that had a majority of 'NA' values were dropped from the dataset
3.2.4 Outlier Detection and Treatment.
As part of outlier detection, we identified the attributes containing outliers. From the business perspective, these were not outliers but actual values; hence, no values were removed. For example, the number of visits ranged from 0 to 80, but such values cannot be ruled out.
4.0 Modification
4.1 Recoding Categorical Values
Using the "Recode" option in JMP, categorical values of different attributes were reassigned to suit the business needs and the research conducted.
S. No. Attributes Recoded Values Applied
1 Age According to various studies, age groups were coded as below:
● 0-40 years → 1
● 40-70 years → 2
● Above 70 → 3
2 Max_glu_serum Max glucose serum signifies the sugar levels:
● None → 0
● Normal → 1
● Abnormal (>200 & >300) → 2
3 A1Cresult The A1C result is a blood test that reflects average blood glucose levels over the past 3 months. The results were coded as:
● None → 0
● Normal → 1
● Abnormal (>7 & >8) → 2
4 Medications All the diabetes medications (with values No, Steady, Up, Down) were coded as 0 or 1:
● Steady, Up, Down → 1
● No → 0
4.2 Rare Event Sampling
● Simple random sampling may produce too few of the rare class to yield useful information about what distinguishes it from the dominant class. In such cases, stratified sampling is often used to oversample cases from the rare class and improve the performance of classifiers.
● In our case, the proportion of the target variable equal to 'yes' was too rare to produce any accurate results. Hence, stratified sampling was applied to gain a balanced ratio of 'yes' in the data sample.
● As a result, the total number of instances was reduced to 38,525, with 11,357 rows having the target variable "Readmitted" as "yes".
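The rebalancing step can be sketched as below: positives are kept and the dominant class is subsampled until positives reach a target share. The toy frame and the exact 30% target are illustrative; 30% roughly mirrors the 11,357-of-38,525 split reported above.

```python
import pandas as pd

# Toy frame with a rare positive class (10% "yes"), mimicking the
# imbalance of the readmission target.
df = pd.DataFrame({"readmitted_30": ["yes"] * 100 + ["no"] * 900})

pos = df[df["readmitted_30"] == "yes"]
neg = df[df["readmitted_30"] == "no"]

# Undersample the dominant class so positives make up ~30% of the sample.
target_ratio = 0.3
n_neg = int(len(pos) * (1 - target_ratio) / target_ratio)
sample = pd.concat([pos, neg.sample(n=n_neg, random_state=0)])
```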
5.0 Modeling
This is supervised learning for classification. After the data was processed, various classification models were tested to identify the model with maximum accuracy and AUC and minimum misclassification rate. The below models were executed to study their performance.
5.1 Nominal Logistic
5.6 Naïve Bayes
6.0 Assess
6.1 Model Comparison:
Model                 False Negatives (readmitted but predicted as No)   Accuracy   AUC      Misclassification rate
Logistic Regression   932                                                63.7%      0.6612   0.36
Neural Networks       844                                                60.6%      0.6570   0.39
Decision Trees        877                                                58.1%      0.6299   0.41
Boosted Tree          952                                                61.4%      0.6365   0.38
Bootstrap Forest      895                                                60.9%      0.6515   0.39
Naïve Bayes           1098                                               65.5%      0.6412   0.35
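Each column of the comparison table is computed from a confusion matrix. A minimal sketch on invented labels and predictions (1 = readmitted within 30 days):

```python
import numpy as np

# Invented labels/predictions, purely to show the formulas.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0, 0, 1])

# False negatives: patients who were readmitted but predicted as "No"
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())

accuracy = float((y_true == y_pred).mean())
misclassification_rate = 1.0 - accuracy
```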
Out of all the models, we are prioritizing false negatives. The Neural Networks model has the lowest false negatives, although its total accuracy is lower than that of Naïve Bayes, Logistic Regression and the others. So, we choose this model. False negatives imply wrongly predicting a readmitted patient as 'No', which means we may fail to provide proper treatment to those patients.
6.2 Model Improvement
We altered the cutoff, trying different values and lowering it to reduce false negatives, trading off total accuracy against the reduction in false negatives. Finally, we brought false negatives down from 844 to 591 while maintaining our total accuracy.
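The cutoff adjustment works as sketched below: a patient is flagged when the predicted probability meets the cutoff, so lowering the cutoff flags more patients and converts some false negatives into (cheaper) false positives. The probabilities here are invented, not model output.

```python
import numpy as np

# Invented predicted probabilities of readmission for six patients.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_hat = np.array([0.70, 0.47, 0.30, 0.48, 0.20, 0.10])

def false_negatives(cutoff):
    """Count readmitted patients the model fails to flag at this cutoff."""
    y_pred = (p_hat >= cutoff).astype(int)
    return int(((y_true == 1) & (y_pred == 0)).sum())

fn_050 = false_negatives(0.50)   # the patient at p = 0.47 is missed
fn_045 = false_negatives(0.45)   # 0.47 is now flagged: one fewer miss
```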
7.0 Results and Conclusion:
Hospitals should use the Neural Network model to predict whether a patient is likely to be readmitted within 30 days. The cutoff for the Neural Network model should be kept at 0.45, as this achieves the target of minimizing false negatives. Because stratified sampling was used to address the rarity of the target variable, the reported accuracy of the model may be compromised.
7.1 Business Value of the Model:
Our project develops a predictive tool for health service providers to identify patients at risk of readmission within 30 days, with an accuracy of 60.6%. From the model profiler, we can see that the number of inpatient visits contributes most to the risk of readmission, so hospitals should provide more care to those patients.
Model cost with cutoff 0.45:
False negatives = 591, false positives = 9,471; cost = (591 × $1,000 + 9,471 × $200) = $2,485,200