Oscar Leclercq- Idan Gal-Shohet - Ethan Djanogly - Premal Gadhia
DE2 Big Data – Assignment 2 – Group 16
Predicting the Impact of Covid-19 in US Counties based on US Census Data.
Introduction and Context
The novel coronavirus crisis has exposed the lack of testing equipment and the unreliability of the number of
cases as a metric to predict the true impact of the virus in an area. This report outlines the creation of a
machine learning model to predict the impact of Covid-19 on counties in the United States without using
testing data as a metric of prediction. Instead, the 2017 US Census dataset was combined with a New York f
being updated as the project progressed. This combined dataset allowed us to find correlations between
factors in the census data and coronavirus death tolls, meaning we could make accurate predictions of the
virus’s future impact in counties which are yet to be affected, or the impact of future viruses over the entire
country. As outlined in the following report, the data was interpreted to predict whether a county would be
lethally affected or not. These results could be used by private companies or the government to help allocate
future health resources.
Preparing Data
Creating Data-frame
The first step was obtaining a workable dataset. This involved combining the Covid-19 US data, which had
information for every county, with the demographic information for those counties found in the 2017 US census.
Selecting one date:
As the model predicts the cumulative number of deaths rather than deaths over time, the latest date for which
data was available from all counties was used. The date used was 17 June 2020.
Choosing the threshold for boolean death attribute:
In order to use logistic regression curve fitting, we had to convert our “number of deaths” attribute to
a boolean 1 or 0 value. To do this, we had to classify the values as either above or below a certain value. The
values we tried were the mean number of deaths per county, the first or third quartile, and the median
number of deaths per county.
The median value was 1, and this started the idea of predicting whether a county was lethally affected at all
(that is, if it had 1 or more deaths we would count that as a 1, and if it had no deaths we would count that as a
0). However, just using the median would mean some of the 1-death counties would count as below the
median while some would count as above. We therefore decided to simply set our boolean deaths attribute
to 1 if there were 1 or more deaths in that county and 0 if there were none.
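The labelling rule above can be illustrated with a minimal sketch. The death counts below are made up for illustration and the function name `to_boolean_deaths` is hypothetical; the report's own code is in its appendices.

```python
from statistics import median

def to_boolean_deaths(deaths):
    """Label a county 1 if it recorded at least one death, otherwise 0."""
    return [1 if d >= 1 else 0 for d in deaths]

# Hypothetical cumulative death counts for seven counties (not real data).
deaths = [0, 0, 1, 1, 1, 4, 37]

# The median sits on a value shared by several counties, which is why a
# strict "above/below the median" split would be ambiguous for them.
median_deaths = median(deaths)

labels = to_boolean_deaths(deaths)
print(median_deaths, labels)
```

Splitting on "at least one death" avoids having to decide on which side of the median the tied counties fall.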
Balancing the dataset:
The boolean split discussed above produced an unbalanced attribute in which 62% of values were 1 while
38% were 0. Consequently, all models were made in two versions: with a manually balanced training set and
with the unbalanced set. The values for these methods are compared below.
The histograms focus on the less deadly counties, as the difference between the unbalanced and
balanced datasets was barely visible above 60 deaths. Note that, in the balanced set, the proportion of
counties with 0 deaths is decreased while those with 1 or more have increased.
Separating training, validation, test set:
To guard against overfitting to the Covid-19 dataset, the data was split into 80% training, 10% validation and
10% test. The model was developed using the training data and assessed using the validation data. The test
dataset was used as the final measure of how accurate the created model is.
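As a rough illustration of the 80/10/10 split, the sketch below shuffles the rows and slices them into three lists. The function name `split_dataset`, the fixed seed and the use of a simple random shuffle are assumptions; the report does not state how the rows were assigned to each set.

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the rows and split them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

# Stand-in for 100 county rows.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Keeping the test rows out of every model-selection decision is what allows them to serve as a final check.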
Dropping attributes:
After selecting the data for that day, attributes that could not be included in the model were removed from the
dataset. These were ‘date’, ‘State’, ‘County’, ‘County ID’ and ‘Corona Virus Cases’, as the model’s purpose is to
not rely on this data.
Standardising data:
As we used logistic regression methods, we standardised our data.
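Standardising usually means rescaling each attribute to zero mean and unit standard deviation. The sketch below shows this for one column; the helper names and the choice to compute the mean and standard deviation on the training column only are assumptions, not details given in the report.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    mu = mean(column)
    sigma = pstdev(column)
    return mu, (sigma or 1.0)  # fall back to 1.0 for a constant column

def standardise(column, mu, sigma):
    """Rescale values so the fitted column has mean 0 and standard deviation 1."""
    return [(x - mu) / sigma for x in column]

# Hypothetical 'Income' values for three training counties.
train_income = [40000.0, 50000.0, 60000.0]
mu, sigma = fit_scaler(train_income)
scaled = standardise(train_income, mu, sigma)
print(scaled)
```

The same `mu` and `sigma` would then be reused for the validation and test columns so that all three sets share one scale.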
Creating the Models
What to optimise for:
Because the model could potentially help save lives, it was important to decide which metric it should be
designed to optimise. This meant choosing whether the forward selection algorithm should optimise for accuracy,
precision, or recall. In this scenario, false negatives were the most consequential errors, as they meant
incorrectly predicting that a county would be unaffected. This could potentially lead to insufficient
healthcare provision, causing avoidable deaths. Therefore, we had to optimise for a high recall.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
This formula shows that optimising for recall favours true positives over false negatives, meaning we will
minimise the number of false negatives. However, our algorithm could theoretically predict every single
county as being affected, which would give it a 100% recall. Also, our model should be used to manage health
resources, which are finite. Therefore, we decided to balance this by optimising for the average of
accuracy and recall, later referred to as weighted accuracy.
Figures 1 and 2: Histograms of the unbalanced and balanced data, overlaid.
$$\text{Weighted Accuracy} = \frac{\text{Accuracy} + \text{Recall}}{2}$$
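The two formulas can be checked with a small sketch. The confusion-matrix counts below are hypothetical and chosen only to show why recall alone is not enough: a model that labels every county as affected reaches 100% recall, yet scores lower on weighted accuracy than a model that makes a few errors of each kind.

```python
def recall(tp, fn):
    """Share of truly affected counties that the model flags."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Share of all counties that the model labels correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def weighted_accuracy(tp, tn, fp, fn):
    """Average of accuracy and recall, as defined in the report."""
    return (accuracy(tp, tn, fp, fn) + recall(tp, fn)) / 2

# Hypothetical validation set: 10 affected and 10 unaffected counties.
all_positive = dict(tp=10, fn=0, fp=10, tn=0)  # predicts "affected" everywhere
mixed = dict(tp=8, fn=2, fp=3, tn=7)           # a few errors of each kind

print(recall(all_positive["tp"], all_positive["fn"]))  # perfect recall
print(weighted_accuracy(**all_positive))
print(weighted_accuracy(**mixed))
```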
Threshold value for adding variables:
In the forward selection, backward selection, and random selection algorithms, we had to define the threshold
of added accuracy at which a new variable would be added (or removed, for backward selection). We chose
forward selection as it is the basis for the other algorithms, and we tested the impact of different threshold
values on our weighted validation accuracy and on the difference between the validation and test accuracy
values, as larger values here would expose overfitting.
Figure 2: Performance of Model vs Threshold Value. All values can be found in the appendices.
Looking at these values, we concluded that a threshold value of 0.2% should be used in the creation of our
algorithms. At this value, the difference between validation and test accuracies increased slightly (+1.83%),
showing a tendency towards overfitting. However, the validation weighted accuracy increased by more than
that (+2.89%), meaning the algorithm still performed better overall. The difference between validation and
test at this threshold value was only 5.54%, which we considered not to be a serious case of overfitting. We
kept the 0.2% threshold value for all three of the following models.
How the algorithms work:
Because the Covid-19 dataset did not have any attributes that were grouped, for example gender, logistic
regression was used. The three logistic regression algorithms were Forward Select, Backward Select and
Random Select. All three algorithms are greedy.
Forward Select
The algorithm goes through all the attributes and tests each one individually to find the highest weighted
accuracy. The attribute that gives the greatest weighted accuracy is added to the selected variables list. This
process then repeats with the remaining variables.
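The steps above can be sketched as follows. This is an illustration rather than the report's own code (which is in its appendices): `score` stands in for "fit a logistic regression on the given attributes and return the validation weighted accuracy", the toy scoring table is invented, and stopping once the best gain falls below the 0.2% threshold is one reading of how the threshold is applied.

```python
def forward_select(features, score, threshold=0.002, start=None):
    """Greedy forward selection.

    features:  candidate attribute names
    score:     callable mapping a list of names to a weighted accuracy
    threshold: minimum gain required to add another attribute (assumed 0.2%)
    start:     optional attributes to begin from
    """
    selected = list(start or [])
    best = score(selected) if selected else 0.0
    remaining = [f for f in features if f not in selected]
    while remaining:
        # Try each remaining attribute on top of the current selection.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score - best < threshold:
            break  # no attribute improves the score enough
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best

# Toy stand-in for model fitting: each attribute adds a fixed, made-up gain.
GAINS = {"Income": 0.2, "Asian": 0.1, "Noise": 0.001}

def toy_score(names):
    return 0.5 + sum(GAINS[n] for n in names)

selected, best = forward_select(list(GAINS), toy_score)
print(selected, round(best, 3))
```

In this toy run the attribute with the tiny gain is left out because it does not clear the threshold.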
Random Select
Designed to mitigate the greediness of Forward Select, Random Select takes each of the attributes in turn and
uses it as the first item of the selected variables list. From this point the Forward Select algorithm runs. The
standard Forward Select algorithm may not give the highest weighted accuracy, as at each step it only chooses
the currently available attribute that provides the highest accuracy at that step. The Random Select algorithm
reduces this issue.
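A minimal sketch of this idea is shown below, under the same assumptions as any greedy selection sketch: `score` is a hypothetical stand-in for fitting the model and returning its weighted accuracy, and the toy scoring table is invented so that the plain greedy search gets stuck. A compact forward selection is included so the block runs on its own.

```python
def forward_select(features, score, threshold=0.002, start=None):
    """Compact greedy forward selection (see the description of Forward Select)."""
    selected = list(start or [])
    best = score(selected) if selected else 0.0
    remaining = [f for f in features if f not in selected]
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best < threshold:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best

def random_select(features, score, threshold=0.002):
    """Run forward selection once per possible first attribute; keep the best run."""
    runs = (forward_select(features, score, threshold, start=[f]) for f in features)
    return max(runs, key=lambda run: run[1])

# Invented scores: "A" is the best single attribute, but "B" and "C" together
# beat anything that starts from "A".
SCORES = {
    frozenset("A"): 0.70, frozenset("B"): 0.60, frozenset("C"): 0.55,
    frozenset("AB"): 0.70, frozenset("AC"): 0.70, frozenset("BC"): 0.90,
    frozenset("ABC"): 0.90,
}

def toy_score(names):
    return SCORES[frozenset(names)]

greedy = forward_select(list("ABC"), toy_score)
restarted = random_select(list("ABC"), toy_score)
print(greedy, restarted)
```

The cost is one full forward selection per attribute, so the run time grows with the number of attributes.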
Backward select
The algorithm starts with all the attributes in a list and tests removing each one individually to find the
removal that gives the highest weighted accuracy. This process is then repeated.
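Backward selection can be sketched in the same style. As before, `score` is a hypothetical stand-in for fitting the logistic regression and returning the validation weighted accuracy, the per-attribute effects are invented, and stopping when the best removal gains less than the 0.2% threshold is an assumption about how the threshold is applied.

```python
def backward_select(features, score, threshold=0.002):
    """Greedy backward elimination: drop attributes while the score improves."""
    selected = list(features)
    best = score(selected)
    while len(selected) > 1:
        # Find the attribute whose removal gives the highest score.
        candidate = max(
            selected,
            key=lambda f: score([g for g in selected if g != f]),
        )
        candidate_score = score([g for g in selected if g != candidate])
        if candidate_score - best < threshold:
            break  # removing anything else would not help enough
        selected.remove(candidate)
        best = candidate_score
    return selected, best

# Toy stand-in for model fitting: "Noise" slightly hurts the score.
EFFECTS = {"Income": 0.2, "Asian": 0.1, "Noise": -0.05}

def toy_score(names):
    return 0.5 + sum(EFFECTS[n] for n in names)

selected, best = backward_select(list(EFFECTS), toy_score)
print(selected, round(best, 3))
```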
Results
Outcomes:
In all cases of running the logistic regression algorithm, the validation weighted accuracy was higher than the
test accuracy by a few percent. This is because the models were optimised for the validation datasets, not the
test datasets.
Validation weighted accuracy of models | Unbalanced training set | Balanced training set
Forward Select | 85.17% | 77.81%
Backward Select | 84.07% | 74.56%
Random Select | 85.17% | 77.81%
Table 1: Weighted Accuracy results of 3 Algorithms using Balanced and Unbalanced training sets
Best Model
Out of the 6 different algorithm variations, those using unbalanced data consistently outperformed those using
balanced data, as training on a balanced training set made the model less capable of predicting the unbalanced
validation and test sets. In addition, the unbalanced dataset is a closer reflection of how the data will be
distributed in a real scenario, so the unbalanced set was used.
Of the final 3 algorithms, Random Select and Forward Select gave the highest weighted accuracy. Their results
were identical, as they returned the exact same features. However, Random Select was chosen as it is more
robust and less greedy than Forward Select: it applies the greedy method from every possible starting feature,
which increases the chances of finding the globally optimal combination.
Model Summary:
The highest validation weighted accuracy, 85.17%, was found using the Random Select and Forward Select
algorithms on the unbalanced dataset. The attributes of this model were ‘FamilyWork’, ‘Asian’, ‘Unemployment’,
‘Income’ and ‘MeanCommute’, in that order. These attributes are explained below, and the correlation between
each one and deaths individually is displayed on scatter plots in the appendix.
Attribute | Meaning
FamilyWork | Percentage of population in unpaid family work
Asian | Percentage of population that is Asian
Unemployment | Percentage unemployment rate
Income | Median household income (USD)
MeanCommute | Mean commute time to work (mins)
Table 2: Meanings of 5 attributes chosen for the model
The p-value shows the relative significance of each attribute in the final model: a p-value below 0.05
indicates that the attribute is statistically significant. The most important attributes were ‘Asian’,
‘MeanCommute’ and ‘Unemployment’. It is important to note that in a model consisting of only ‘FamilyWork’,
the p-value was below the threshold.
 | Predicted: deaths | Predicted: no deaths
Actual: deaths | TP: 166 | FN: 20
Actual: no deaths | FP: 77 | TN: 44
The confusion matrix illustrates that there were only 20 false negatives. This means that, of the counties
that had one or more deaths, the model correctly identifies 89%. This is an acceptable value because
wrongly warning a county that there will be deaths is better for that county than wrongly informing it that
there will be none. Of the counties without a single death, the model correctly predicts 36% (44 of 121).
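As a worked check, the two percentages follow directly from the confusion matrix counts. The label "true negative rate" is introduced here for convenience and is not a term used in the report.

$$\text{Recall} = \frac{166}{166 + 20} = \frac{166}{186} \approx 0.89$$

$$\text{True negative rate} = \frac{44}{44 + 77} = \frac{44}{121} \approx 0.36$$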
Sanity checks
- Checking importance with GetSummary for algorithms with only one attribute
- Printing statements as features are added to indicate the code is working properly
- Comparing the features returned by different algorithms (forward select, random select)
Conclusion
To conclude, using a Random Select algorithm, the ‘FamilyWork’, ‘Asian’, ‘Unemployment’, ‘Income’ and
‘MeanCommute’ attributes were chosen to predict which counties in the US will experience deaths, to an accuracy
of 79.62%. This method could be improved by using a larger dataset or repeating the analysis at different times
in the pandemic. It may also be useful to add more medical information about each county, such as hospital
beds or smoking rates.
As a final exploration, Lasso Logistic Regression and a Support Vector Machine (SVM) were run to assess the
performance of the Random Select model. The features returned by the Lasso were identical other than
income, which was replaced by a very similar metric, income per capita. This change in features increased the
validation and test accuracy by around 0.2%. Finally, the SVM was run and, after iterating to find the best
kernel and gamma, the maximum weighted accuracy was found. The SVM performed 4% worse in terms of
validation accuracy but 3% better on the test dataset, making the difference between the validation and test
accuracies only 2%. The two tests indicate that the Random Select model provides a high degree of accuracy;
however, some iterations could be made to choose better features and reduce overfitting.
Table 3: Model Summary of final algorithm on unbalanced dataset
Table 4: Confusion Matrix of Random Select Unbalanced Test
References
- New York Times, US Counties COVID 19 Dataset, dataset, available from: <https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset>
- US Census Bureau, 2017 US Census Demographic Data, dataset, available from: <https://www.kaggle.com/muonneutrino/us-census-demographic-data>
- Venkatarama, C., Analysis of US Demographic Data, article, available from: <https://rstudio-pubs-static.s3.amazonaws.com/352906_b6f719f938134f76bccb099ae1b89ed6.html>
- Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Handbook, handbook, available from: <https://bb.imperial.ac.uk/bbcswebdav/pid-1745733-dt-content-rid-6101422_1/courses/11036.201910/DE2-BD-2020_Handbook.pdf>
- Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Lectures, lectures, available from: <https://bb.imperial.ac.uk/webapps/blackboard/content/listContent.jsp?course_id=_16571_1&content_id=_1745735_1&mode=reset>
Appendix
Appendix 1: Scatter Plots of selected attributes and Coronavirus deaths
Attribute | RSS Value
FamilyWork | 0.00319149
MeanCommute | 0.0247023
Asian | 0.09227881
Unemployment | 0.00097703
Income | 0.05355706
Appendix 2: Data for Threshold Value selection graph
Appendix 1: Code for all algorithms
Appendix 2: Forward Selection Algorithm
Appendix 3: Backward Selection Algorithm
Appendix 4: Random Selection Algorithm
Appendix 5: Lasso and SVM Algorithms

More Related Content

Similar to Predicting deaths from COVID-19 using Machine Learning

Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling reviewJaideep Adusumelli
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification AnalysisYashIyengar
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic RegressionTaweh Beysolow II
 
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForestWisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForestSheing Jing Ng
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET Journal
 
Hepatic injury classification
Hepatic injury classificationHepatic injury classification
Hepatic injury classificationZheliang Jiang
 
Data analysis_PredictingActivity_SamsungSensorData
Data analysis_PredictingActivity_SamsungSensorDataData analysis_PredictingActivity_SamsungSensorData
Data analysis_PredictingActivity_SamsungSensorDataKaren Yang
 
Team 5 imputing_medical_missing_data_ga approach_preseatation
Team 5 imputing_medical_missing_data_ga approach_preseatationTeam 5 imputing_medical_missing_data_ga approach_preseatation
Team 5 imputing_medical_missing_data_ga approach_preseatationNafiz Ishtiaque Ahmed
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learningFord Sleeman
 
Predicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesPredicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesAdrián Vallés
 
Cancer detection using data mining
Cancer detection using data miningCancer detection using data mining
Cancer detection using data miningRishabhKumar283
 
Best crime predictor: Linear Regression
Best crime predictor: Linear RegressionBest crime predictor: Linear Regression
Best crime predictor: Linear RegressionJonathan Chauwa
 
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...ijdms
 
Statistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic RegressionStatistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic RegressionSindhujanDhayalan
 
McCarthy_TermPaperSpring
McCarthy_TermPaperSpringMcCarthy_TermPaperSpring
McCarthy_TermPaperSpringMarc McCarthy
 
Accurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification AlgorithmsAccurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification AlgorithmsJieming Wei
 

Similar to Predicting deaths from COVID-19 using Machine Learning (20)

Classification modelling review
Classification modelling reviewClassification modelling review
Classification modelling review
 
X18136931 statistics ca2_updated
X18136931 statistics ca2_updatedX18136931 statistics ca2_updated
X18136931 statistics ca2_updated
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification Analysis
 
Classification via Logistic Regression
Classification via Logistic RegressionClassification via Logistic Regression
Classification via Logistic Regression
 
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForestWisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
WisconsinBreastCancerDiagnosticClassificationusingKNNandRandomForest
 
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
IRJET- Breast Cancer Prediction using Supervised Machine Learning Algorithms
 
Hepatic injury classification
Hepatic injury classificationHepatic injury classification
Hepatic injury classification
 
Data analysis_PredictingActivity_SamsungSensorData
Data analysis_PredictingActivity_SamsungSensorDataData analysis_PredictingActivity_SamsungSensorData
Data analysis_PredictingActivity_SamsungSensorData
 
Team 5 imputing_medical_missing_data_ga approach_preseatation
Team 5 imputing_medical_missing_data_ga approach_preseatationTeam 5 imputing_medical_missing_data_ga approach_preseatation
Team 5 imputing_medical_missing_data_ga approach_preseatation
 
coad_machine_learning
coad_machine_learningcoad_machine_learning
coad_machine_learning
 
Predicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian VallesPredicting breast cancer: Adrian Valles
Predicting breast cancer: Adrian Valles
 
Cancer detection using data mining
Cancer detection using data miningCancer detection using data mining
Cancer detection using data mining
 
JEDM_RR_JF_Final
JEDM_RR_JF_FinalJEDM_RR_JF_Final
JEDM_RR_JF_Final
 
The Lachman Test
The Lachman TestThe Lachman Test
The Lachman Test
 
Best crime predictor: Linear Regression
Best crime predictor: Linear RegressionBest crime predictor: Linear Regression
Best crime predictor: Linear Regression
 
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...
INFLUENCE OF THE EVENT RATE ON DISCRIMINATION ABILITIES OF BANKRUPTCY PREDICT...
 
Statistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic RegressionStatistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic Regression
 
McCarthy_TermPaperSpring
McCarthy_TermPaperSpringMcCarthy_TermPaperSpring
McCarthy_TermPaperSpring
 
Accurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification AlgorithmsAccurate Campaign Targeting Using Classification Algorithms
Accurate Campaign Targeting Using Classification Algorithms
 
Quality of data
Quality of dataQuality of data
Quality of data
 

Recently uploaded

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 

Recently uploaded (20)

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 

Predicting deaths from COVID-19 using Machine Learning

  • 1. Oscar Leclercq - Idan Gal-Shohet - Ethan Djanogly - Premal Gadhia
    Page 1 of 13
    DE2 Big Data – Assignment 2 – Group 16
    Predicting the Impact of Covid-19 in US Counties based on US Census Data.
    Introduction and Context
    The novel coronavirus crisis has exposed the lack of testing equipment and the unreliability of the number of cases as a metric to predict the true impact of the virus in an area. This report outlines the creation of a machine learning model to predict the impact of Covid-19 on counties in the United States without using testing data as a metric of prediction. Instead, the 2017 US Census dataset was combined with a New York Times Covid-19 dataset, which was being updated as the project progressed. This combined dataset allowed us to find correlations between factors in the census data and coronavirus death tolls, meaning we could make predictions of the virus's future impact in counties which are yet to be affected, or of the impact of future viruses over the entire country. As outlined in the following report, the data was interpreted to predict whether a county would be lethally affected or not. These results could be used by private companies or the government to help allocate future health resources.
    Preparing Data
    Creating Data-frame
    The first step was obtaining a workable dataset. This involved combining the Covid-19 US data, which had information for every county, with the demographic information for those counties found in the 2017 US census.
    Selecting one date:
    As the model predicts the cumulative number of deaths rather than deaths over time, the latest date for which data was available from all counties was used. The date used was 17 June 2020.
    Choosing the threshold for boolean death attribute:
    To use logistic regression, we had to convert our "number of deaths" attribute to a boolean value of 1 or 0. To do this, we had to classify each value as either above or below a chosen threshold. The thresholds we tried were the mean number of deaths per county, the first or third quartile, and the median number of deaths per county. The median value was 1, and this started the idea of predicting whether a county was lethally affected at all (that is, a county with 1 or more deaths counts as a 1 and a county with no deaths counts as a 0). However, just using the median would mean that some of the counties with 1 death would count as below the median while others would count as above. We therefore decided to simply set our boolean deaths attribute to 1 if there were 1 or more deaths in that county and 0 if there were none.
    Balancing the dataset:
    The boolean split discussed above produced an unbalanced attribute in which 62% of values were 1 while 38% were 0. Consequently, all models were made in two versions: one with a manually balanced training set and one with the unbalanced set. The results for these two versions are compared below.
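The binarisation and balancing steps on this slide can be sketched as follows. This is a minimal illustration, not the report's own code: the death counts are made up, and because the report does not say how the training set was "manually balanced", random undersampling of the larger class is used here purely as an assumption.

```python
import random
from collections import Counter


def to_lethal_flag(deaths):
    """Map a cumulative death count to 1 (one or more deaths) or 0 (none)."""
    return 1 if deaths >= 1 else 0


def class_shares(labels):
    """Return the fraction of the dataset taken up by each label value."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def undersample(rows, labels, seed=0):
    """Balance the classes by randomly dropping rows from the larger class.

    Assumption: random undersampling stands in for the report's unspecified
    manual balancing method.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        balanced.extend((row, label) for row in rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical county death counts, for illustration only.
deaths = [0, 0, 1, 3, 0, 12, 1, 250, 0, 7]
labels = [to_lethal_flag(d) for d in deaths]
shares = class_shares(labels)
balanced = undersample(list(range(len(deaths))), labels)
```

Splitting on "1 or more deaths" rather than on the median avoids the ambiguity described above, where counties with exactly 1 death would sit on both sides of the threshold.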
  • 2. Figure 1 and 2: histogram of unbalanced and balanced data overlaid.
    The histograms are focused on the less deadly counties, as the difference between the unbalanced and balanced datasets was barely visible above 60 deaths. Note that, in the balanced set, the proportion of counties with 0 deaths is decreased while the proportion with 1 or more has increased.
    Separating training, validation, test set:
    To guard against overfitting, the data was split into 80% training, 10% validation and 10% test. The model was developed using the training data and assessed using the validation data. The test dataset is used as the final measure of how accurate the created model is.
    Dropping attributes:
    After selecting the data for that day, attributes that could not be included in the model were removed from the dataset. These were 'date', 'State', 'County', 'County ID' and 'Corona Virus Cases', the last because the model's purpose is not to rely on case data.
    Standardising data:
    As we used logistic regression methods, we standardised our data.
    Creating the Models
    What to optimise for:
    Because the model could potentially save lives, it was important to determine what it should be optimised for. This meant choosing whether the forward selection algorithm should optimise for accuracy, precision, or recall. In this scenario, false negatives were the most consequential errors to make, as they mean incorrectly predicting that a county would be unaffected. This could lead to insufficient healthcare provision and avoidable deaths. Therefore, we had to optimise for a high recall.
    $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
    This formula shows that recall rewards true positives relative to false negatives, so optimising for it minimises the number of false negatives. However, our algorithm could theoretically predict every single county as affected, which would give it a 100% recall. Also, our model should be used to manage health resources, which are finite.
  • 3. Therefore, we decided to balance this by optimising for the average of accuracy and recall, later referred to as weighted accuracy.
    $\text{Weighted Accuracy} = \frac{\text{Accuracy} + \text{Recall}}{2}$
    Threshold value for adding variables:
    In the forward selection, backward selection, and random selection algorithms, we had to define the threshold of added accuracy at which a new variable would be added (or removed, for backward selection). We chose forward selection for this test as it is the basis for the other algorithms, and we tested the impact of different threshold values on our validation weighted accuracy and on the difference between the validation and test accuracy values, as larger differences would expose overfitting.
    Figure 2: Performance of Model vs Threshold Value, all values can be found in appendices.
    Looking at these values, we concluded that a threshold value of 0.2% should be used in the creation of our algorithms. At this value, the difference between validation and test accuracies increased slightly (+1.83%), showing a tendency towards overfitting. However, the validation weighted accuracy increased by more than that (+2.89%), meaning the algorithm still performed better overall. The difference between validation and test at this threshold value was only 5.54%, which we considered not to be a serious case of overfitting. We kept the 0.2% threshold value for all three of the following models.
    How the algorithms work:
    Because the Covid-19 dataset did not have any attributes that were grouped (for example, gender), logistic regression was used. The three logistic regression algorithms were Forward Select, Backward Select and Random Select. All three algorithms are greedy.
    Forward Select
    The algorithm goes through all the attributes and tests each one individually for the highest weighted accuracy. The attribute that gives the greatest weighted accuracy is added to the selected variables list.
    This process then repeats with the remaining variables.
    Random Select
    Designed to mitigate the greediness of Forward Select, Random Select takes each attribute in turn as the first item of the selected variables list and runs the Forward Select algorithm from that starting point. Standard Forward Select may not give the highest weighted accuracy, because at each step it only picks the attribute that gives the highest accuracy at that moment. Random Select reduces this issue by trying every possible starting attribute.
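The Forward Select and Random Select procedures described above can be sketched as below. This is a hedged illustration rather than the report's code: `score` is a hypothetical callback that would fit a logistic regression on the given attributes and return the validation weighted accuracy, and `toy_score` is a made-up lookup table used only to show how the two searches can diverge.

```python
def forward_select(candidates, score, threshold=0.002, start=()):
    """Greedy forward selection with a minimum-gain stopping rule.

    candidates: list of attribute names.
    score: hypothetical callable mapping a tuple of attribute names to a
           validation metric in [0, 1] (e.g. weighted accuracy).
    threshold: minimum gain needed to add another attribute (0.2% here).
    start: attributes forced into the model before the greedy search.
    """
    selected = list(start)
    remaining = [f for f in candidates if f not in selected]
    best = score(tuple(selected)) if selected else 0.0
    while remaining:
        # Try adding each remaining attribute and keep the best one.
        trials = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(trials)
        if top_score - best < threshold:
            break  # no attribute improves the metric enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def random_select(candidates, score, threshold=0.002):
    """Run forward selection from every possible first attribute, keep the best."""
    runs = [forward_select(candidates, score, threshold, start=(f,))
            for f in candidates]
    return max(runs, key=lambda run: run[1])


# Made-up scores for illustration: "a" is the best single attribute,
# but the pair ("b", "c") beats every set containing "a".
_TOY = {
    frozenset("a"): 0.70, frozenset("b"): 0.60, frozenset("c"): 0.55,
    frozenset("ab"): 0.72, frozenset("ac"): 0.71, frozenset("bc"): 0.80,
    frozenset("abc"): 0.72,
}


def toy_score(features):
    return _TOY[frozenset(features)]


greedy = forward_select(["a", "b", "c"], toy_score)
restarted = random_select(["a", "b", "c"], toy_score)
```

In this toy case the plain greedy search commits to "a" first and stops at 0.72, while restarting from each attribute finds the pair scoring 0.80. The cost is one full forward selection per attribute.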
  • 4. Backward Select
    The algorithm goes through all the attributes in a list and tests removing each one individually, keeping the removal that gives the highest weighted accuracy. This process is then repeated.
    Results
    Outcomes:
    In all cases of running the logistic regression algorithms, the validation weighted accuracy was higher than the test accuracy by a few percent. This is because the models were optimised for the validation datasets, not the test datasets.
    Table 1: Weighted Accuracy results of 3 Algorithms using Balanced and Unbalanced training sets
    Validation weighted accuracy of models | Unbalanced training set | Balanced training set
    Forward Select | 85.17% | 77.81%
    Backward Select | 84.07% | 74.56%
    Random Select | 85.17% | 77.81%
    Best Model
    Across the 6 algorithm variations, models trained on unbalanced data consistently outperformed those trained on balanced data, because a model trained on a balanced training set was less capable of predicting the unbalanced validation and test sets. In addition, the unbalanced dataset is a closer reflection of how the data will be distributed in a real scenario, so the unbalanced version was used. Of the 3 remaining algorithms, Random Select and Forward Select gave the highest weighted accuracy. Their results were identical because they returned exactly the same features. However, Random Select was chosen as it is more robust and less greedy than Forward Select: it applies the greedy method from every possible starting feature, which increases the chances of finding the globally optimal combination.
    Model Summary:
    The highest validation weighted accuracy was found using the Random Select and Forward Select algorithms on an unbalanced dataset. The validation weighted accuracy was 85.17%. The attributes of this model were 'FamilyWork', 'Asian', 'Unemployment', 'Income' and 'MeanCommute', in that order. These attributes are explained below, and the correlation between each one individually and deaths is displayed on scatter plots in the appendix.
    Table 2: Meanings of 5 attributes chosen for the model
    Attribute | Meaning
    FamilyWork | Percentage of population in unpaid family work
    Asian | Percentage of population that is Asian
    Unemployment | Percentage unemployment rate
    Income | Median household income (USD)
    MeanCommute | Mean commute time to work (mins)
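The Backward Select procedure described at the top of this slide can be sketched in the same style. The report does not spell out its stopping rule, so this sketch assumes that an attribute is removed only while doing so raises the metric by at least the 0.2% threshold; `score` is again a hypothetical callback, and `toy_score` uses made-up attribute effects.

```python
def backward_select(features, score, threshold=0.002):
    """Greedy backward elimination.

    Starts from all attributes and repeatedly drops the one whose removal
    gives the highest score, stopping when the gain falls below `threshold`
    (an assumed stopping rule) or only one attribute is left.
    """
    selected = list(features)
    best = score(tuple(selected))
    while len(selected) > 1:
        # Score the model with each attribute left out in turn.
        trials = [
            (score(tuple(f for f in selected if f != drop)), drop)
            for drop in selected
        ]
        top_score, top_drop = max(trials)
        if top_score - best < threshold:
            break  # no removal improves the metric enough
        selected.remove(top_drop)
        best = top_score
    return selected, best


# Made-up additive effects: one useful attribute and two harmful ones.
_EFFECT = {"signal": 0.30, "noise1": -0.02, "noise2": -0.01}


def toy_score(features):
    return 0.5 + sum(_EFFECT[f] for f in features)


kept, kept_score = backward_select(["signal", "noise1", "noise2"], toy_score)
```

Because backward elimination starts from the full attribute set, its first steps each require fitting models on nearly all attributes, which is typically slower than the early steps of forward selection.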
  • 5. Table 3: Model Summary of final algorithm on unbalanced dataset
    The p-value shows the relative significance of each attribute in the final model. A p-value below 0.05 indicates that the attribute is statistically significant. The most important attributes were 'Asian', 'MeanCommute' and 'Unemployment'. It is important to note that in a model consisting of only 'FamilyWork', the p-value was below the threshold.
    Table 4: Confusion Matrix of Random Select Unbalanced Test
    TP: 166 | FN: 20
    FP: 77 | TN: 44
    The confusion matrix illustrates that there were only 20 false negatives. This means that, out of the counties that would have one or more deaths, the model correctly identifies 89%. This is an acceptable value because warning a county that there will be deaths when there are none is better for that county than informing it that there will be no deaths when there are some. Out of the counties that will not have a single death, the model correctly predicts about 36% (44 of 121).
    Sanity checks
    - Checking importance with GetSummary for algorithms with only one attribute
    - Printing statements as features are added to indicate the code is working properly
    - Comparing the features returned by different algorithms (Forward Select, Random Select)
    Conclusion
    To conclude, using a Random Select algorithm, 'FamilyWork', 'Asian', 'Unemployment', 'Income' and 'MeanCommute' were chosen to predict which counties in the US will experience deaths, to an accuracy of 79.62%. This method could be improved by using a larger dataset or by repeating the analysis at different times in the pandemic. It may also be useful to add more medical information about each county, such as hospital beds or smoking rates.
    As a final exploration, Lasso Logistic Regression and a Support Vector Machine (SVM) were run to assess the performance of the Random Select model. The features returned by the Lasso were identical other than income, which was replaced by a very similar metric, income per capita. This change in features increased the validation and test accuracy by around 0.2%. Finally, the SVM was run and, after iterating to find the best kernel and gamma, the maximum weighted accuracy was found. The SVM performed 4% worse in terms of validation accuracy but 3% better on the test dataset, making the difference between the validation and test accuracies only 2%. The two tests indicate that Random Select provides a high degree of accuracy; however, further iterations could be made to choose better features and reduce overfitting.
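As a worked check of the metrics used in this report, the recall, accuracy and weighted accuracy formulas can be applied directly to the confusion matrix counts quoted on this slide (TP 166, FN 20, FP 77, TN 44). The function names below are illustrative and are not taken from the report's code.

```python
def recall(tp, fn):
    """Share of truly affected counties that the model flags."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all counties classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def weighted_accuracy(tp, tn, fp, fn):
    """Average of accuracy and recall, as defined in the report."""
    return (accuracy(tp, tn, fp, fn) + recall(tp, fn)) / 2


def true_negative_rate(tn, fp):
    """Share of unaffected counties that the model correctly clears."""
    return tn / (tn + fp)


# Counts from the confusion matrix on this slide.
TP, FN, FP, TN = 166, 20, 77, 44

print(f"recall             = {recall(TP, FN):.3f}")
print(f"accuracy           = {accuracy(TP, TN, FP, FN):.3f}")
print(f"weighted accuracy  = {weighted_accuracy(TP, TN, FP, FN):.3f}")
print(f"true negative rate = {true_negative_rate(TN, FP):.3f}")
```

The recall of 166 / 186 is the 89% quoted above, and the true negative rate of 44 / 121 shows the price paid for it: most unaffected counties are flagged as affected.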
  • 6. References
    - New York Times, US Counties COVID 19 Dataset, dataset, available from: <https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset>
    - US Census Bureau, 2017 US Census Demographic Data, dataset, available from: <https://www.kaggle.com/muonneutrino/us-census-demographic-data>
    - Venkatarama, C., Analysis of US Demographic Data, article, available from: <https://rstudio-pubs-static.s3.amazonaws.com/352906_b6f719f938134f76bccb099ae1b89ed6.html>
    - Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Handbook, handbook, available from: <https://bb.imperial.ac.uk/bbcswebdav/pid-1745733-dt-content-rid-6101422_1/courses/11036.201910/DE2-BD-2020_Handbook.pdf>
    - Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Lectures, lectures, available from: <https://bb.imperial.ac.uk/webapps/blackboard/content/listContent.jsp?course_id=_16571_1&content_id=_1745735_1&mode=reset>
  • 7. Appendix
    Appendix 1: Scatter Plots of selected attributes and Coronavirus deaths
    Attribute | RSS Value
    FamilyWork | 0.00319149
    MeanCommute | 0.0247023
    Asian | 0.09227881
    Unemployment | 0.00097703
    Income | 0.05355706
  • 8. Appendix 2: Data for Threshold Value selection graph
  • 9. Appendix 1: Code for all algorithms
  • 10. Appendix 2: Forward Selection Algorithm
  • 11. Appendix 3: Backward Selection Algorithm
  • 12. Appendix 4: Random Selection Algorithm
  • 13. Appendix 5: Lasso and SVM Algorithms