Predicting deaths from COVID-19 using Machine Learning
1. Oscar Leclercq- Idan Gal-Shohet - Ethan Djanogly - Premal Gadhia
Page 1 of 13
DE2 Big Data – Assignment 2 – Group 16
Predicting the Impact of Covid-19 in US Counties based on US Census Data.
Introduction and Context
The novel coronavirus crisis has exposed the lack of testing equipment and the unreliability of case counts as a metric for predicting the true impact of the virus in an area. This report outlines the creation of a machine learning model to predict the impact of Covid-19 on counties in the United States without using testing data as a predictor. Instead, the 2017 US Census dataset was combined with a New York Times Covid-19 dataset that was being updated as the project progressed. This combined dataset allowed us to find correlations between factors in the census data and coronavirus death tolls, meaning we could make accurate predictions of the virus's future impact in counties which are yet to be affected, or of the impact of future viruses across the entire country. As outlined in the following report, the data was interpreted to predict whether a county would be lethally affected or not. These results could be used by private companies or the government to help allocate future health resources.
Preparing Data
Creating Data-frame
The first step was obtaining a workable dataset. This involved combining the Covid-19 US data, which had information for every county, with the demographic information for those counties found in the 2017 US census.
Selecting one date:
As the model predicts the cumulative number of deaths rather than a time series, the latest date for which data was available from all counties was used. The date used was 17 June 2020.
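The merge described above can be sketched in pandas. The column names here (`fips`, `date`, `deaths`, `Income`, `Unemployment`) are illustrative stand-ins; the real Kaggle files use similar but not necessarily identical fields.

```python
import pandas as pd

# Toy stand-ins for the New York Times county data and the 2017 census data.
covid = pd.DataFrame({
    "fips": [1001, 1003], "date": ["2020-06-17", "2020-06-17"], "deaths": [5, 0],
})
census = pd.DataFrame({
    "fips": [1001, 1003], "Income": [51000, 57000], "Unemployment": [5.2, 4.1],
})

# Keep a single date, then join on the county FIPS code shared by both datasets.
latest = covid[covid["date"] == "2020-06-17"]
combined = latest.merge(census, on="fips", how="inner")
print(combined.shape)  # (2, 5): one row per county, both death and census columns
```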
Choosing the threshold for boolean death attribute:
In order to use logistic regression curve fitting, we had to convert our "number of deaths" attribute into a boolean 1 or 0 value. To do this, we had to classify each value as either above or below a certain threshold. The candidate thresholds were the mean number of deaths per county, the first or third quartile, and the median number of deaths per county.
The median value was 1, and this motivated the idea of predicting whether a county was lethally affected at all (that is, if it had 1 or more deaths we would count it as a 1, and if it had no deaths we would count it as a 0). However, using the median directly would mean some of the one-death counties would count as below the median while others would count as above it. We therefore decided to simply split our boolean deaths attribute as 1 if there were 1 or more deaths in that county and 0 if there were none.
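The split described above is a one-line thresholding operation; a minimal sketch with made-up death counts:

```python
import pandas as pd

deaths = pd.Series([0, 1, 1, 4, 0, 12])  # illustrative death counts per county

# 1 if the county recorded any deaths at all, 0 otherwise.
lethally_affected = (deaths >= 1).astype(int)
print(lethally_affected.tolist())  # [0, 1, 1, 1, 0, 1]
```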
Balancing the dataset:
The boolean split discussed above produced an unbalanced attribute in which 62% of values were 1 while 38% were 0. Consequently, all models were made in two versions: one with a manually balanced training set and one with the unbalanced set. The values for these methods are compared below.
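One common way to balance such a training set, sketched here on a toy label column that mimics the 62%/38% imbalance, is to downsample the majority class; the report does not state which balancing method was used, so this is an assumption for illustration.

```python
import pandas as pd

# Toy label column mimicking the report's 62% / 38% class imbalance.
df = pd.DataFrame({"label": [1] * 62 + [0] * 38})

# Downsample the majority class (1) to the size of the minority class (0).
n_minority = (df["label"] == 0).sum()
majority_down = df[df["label"] == 1].sample(n=n_minority, random_state=0)
balanced = pd.concat([majority_down, df[df["label"] == 0]])

print(balanced["label"].value_counts().to_dict())  # {1: 38, 0: 38}
```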
The histograms are focused on the less deadly counties, as the difference between the unbalanced and balanced datasets was barely visible above 60 deaths. Note that, in the balanced set, the proportion of counties with 0 deaths is decreased while the proportion with 1 or more is increased.
Separating training, validation, and test sets:
To prevent overfitting to the Covid-19 dataset, the data was split into 80% training, 10% validation and 10% test. The model was developed using the training data and assessed using the validation data. The test dataset was used as the final measure of how accurate the created model is.
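The 80/10/10 split can be sketched by shuffling row indices and cutting them at the 80% and 90% marks; the row count here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000                       # illustrative number of counties
idx = rng.permutation(n)       # shuffle row indices before splitting

# Cut at 80% and 90% to obtain an 80/10/10 train/validation/test split.
train, val, test = np.split(idx, [int(0.8 * n), int(0.9 * n)])
print(len(train), len(val), len(test))  # 800 100 100
```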
Dropping attributes:
After selecting the data for that day, attributes that could not be included in the model were removed from the dataset: 'date', 'State', 'County', 'County ID' and 'Corona Virus Cases'. The model is designed not to rely on this data.
Standardising data:
As we used logistic regression methods, we standardised our data.
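The two preparation steps above (dropping identifier columns, then z-scoring each remaining attribute) can be sketched as follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Toy frame with an identifier column, two census attributes, and the target.
df = pd.DataFrame({
    "State": ["AL", "AK", "AZ"],
    "Income": [50000, 60000, 70000],
    "Unemployment": [4.0, 5.0, 6.0],
    "deaths_bool": [0, 1, 1],
})

# Drop identifiers the model must not rely on; keep the target aside.
X = df.drop(columns=["State", "deaths_bool"])

# Z-score standardisation: zero mean and unit standard deviation per column.
X_std = (X - X.mean()) / X.std()
print(X_std["Income"].round(4).tolist())  # [-1.0, 0.0, 1.0]
```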
Creating the Models
What to optimise for:
As the model could potentially save lives, it was important to determine how to design it to do so optimally. This meant choosing whether the selection algorithms should optimise for accuracy, precision, or recall. In this scenario, false negatives were the most consequential errors, as they meant incorrectly predicting that a county would be unaffected. This could lead to insufficient healthcare provision and avoidable deaths. Therefore, we had to optimise for a high recall.
Recall = True Positives / (True Positives + False Negatives)
This formula shows that optimising for recall maximises true positives relative to false negatives, minimising the number of false negatives. However, our algorithm could theoretically predict every single county as affected, which would give it 100% recall. Our model should also be usable for managing health resources, which are finite. Therefore, we decided to balance these concerns by optimising for the average of accuracy and recall, later referred to as weighted accuracy.

Weighted Accuracy = (Accuracy + Recall) / 2

Figures 1 and 2: histograms of the unbalanced and balanced data, overlaid.
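The recall and weighted-accuracy metrics can be sketched as two small helpers; the accuracy value passed in the demo call is purely illustrative.

```python
def recall(tp, fn):
    # Share of truly affected counties that the model flags.
    return tp / (tp + fn)

def weighted_accuracy(accuracy, recall_value):
    # Simple average of overall accuracy and recall.
    return (accuracy + recall_value) / 2

print(round(recall(166, 20), 4))                 # 0.8925
print(round(weighted_accuracy(0.8, 0.9), 2))     # 0.85 (illustrative inputs)
```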
Threshold value for adding variables:
In the forward selection, backward selection, and random selection algorithms, we had to define the threshold of added accuracy at which a new variable would be added (or removed, for backward selection). We tuned this threshold using forward selection, as it is the basis for the other algorithms, and tested the impact of different threshold values on our weighted validation accuracy and on the difference between the validation and test accuracy values, as larger differences here would expose overfitting.
Figure 3: Performance of model vs threshold value; all values can be found in the appendices.
Looking at these values, we concluded that a threshold value of 0.2% should be used in the creation of our algorithms. At this value, the difference between validation and test accuracies increased slightly (+1.83%), showing a tendency towards overfitting. However, the validation weighted accuracy increased by more than that (+2.89%), meaning the algorithm still performed better overall. The difference between validation and test at this threshold value was only 5.54%, which we considered not to be a serious case of overfitting. We kept the 0.2% threshold value for all three of the following models.
How the algorithms work:
As the Covid-19 dataset did not contain any grouped attributes (for example, gender), logistic regression was used. The three logistic regression algorithms were Forward Select, Backward Select and Random Select. All three algorithms are greedy.
Forward Select
The algorithm goes through all the attributes and tests each one individually for the highest weighted accuracy. The attribute that gives the greatest weighted accuracy is added to the selected variables list. This process then repeats with the remaining variables.
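The greedy forward pass can be sketched as below. The scoring function here is a hypothetical stand-in for refitting a logistic regression and measuring weighted validation accuracy, which keeps the example self-contained; the 0.002 default is the 0.2% threshold chosen earlier.

```python
def forward_select(score, features, threshold=0.002):
    """Greedy forward selection: add the best attribute each round until the
    gain in weighted accuracy falls below the threshold."""
    selected, best = [], 0.0
    while True:
        candidates = {f: score(selected + [f]) for f in features if f not in selected}
        if not candidates:
            break
        f = max(candidates, key=candidates.get)
        if candidates[f] - best < threshold:   # 0.2% threshold from the report
            break
        selected.append(f)
        best = candidates[f]
    return selected, best

# Hypothetical scorer: feature "a" is strong, "b" adds a little, "c" adds
# less than the threshold and so should be rejected.
def toy_score(cols):
    gains = {"a": 0.30, "b": 0.05, "c": 0.001}
    return 0.5 + sum(gains[c] for c in cols)

sel, best = forward_select(toy_score, ["a", "b", "c"])
print(sel, round(best, 3))  # ['a', 'b'] 0.85
```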
Random Select
Designed to mitigate the greediness of forward selection, random select runs each attribute in turn as the first item of the selected variables list. From that point, the Forward Select algorithm runs as normal. The standard Forward Select algorithm may not give the highest weighted accuracy, as at each step it only chooses the available attribute that provides the highest immediate accuracy. The random select algorithm mitigates this issue.
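Random select can be sketched as the greedy pass repeated once per starting attribute, keeping the best run. The scorer below is a hypothetical stand-in for the weighted validation accuracy, built so that two features only work well together, which is exactly the situation a single greedy pass can miss.

```python
def greedy_forward(score, features, start, threshold=0.002):
    # One greedy forward pass, seeded with a fixed first attribute.
    selected = [start]
    best = score(selected)
    while True:
        candidates = {f: score(selected + [f]) for f in features if f not in selected}
        if not candidates:
            break
        f = max(candidates, key=candidates.get)
        if candidates[f] - best < threshold:
            break
        selected.append(f)
        best = candidates[f]
    return selected, best

def random_select(score, features, threshold=0.002):
    # Repeat the greedy pass from every possible starting attribute; keep the best.
    runs = [greedy_forward(score, features, s, threshold) for s in features]
    return max(runs, key=lambda run: run[1])

# Hypothetical scorer where features 0 and 2 are only strong as a pair.
def pair_score(cols):
    return {(0,): 0.60, (2,): 0.55, (0, 2): 0.90}.get(tuple(sorted(cols)), 0.50)

sel, best = random_select(pair_score, [0, 1, 2, 3])
print(sorted(sel), best)  # [0, 2] 0.9
```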
Backward select
The algorithm starts with all the attributes in a list and tests removing each one individually, keeping the removal that achieves the highest weighted accuracy. This process is then repeated.
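Backward selection mirrors the forward pass, stripping attributes instead of adding them. As before, the scorer here is a hypothetical stand-in for the weighted validation accuracy, built so that one feature actively hurts the model.

```python
def backward_select(score, features, threshold=0.002):
    selected = list(features)
    best = score(selected)
    while len(selected) > 1:
        # Try removing each remaining attribute; keep the best reduced set.
        trials = {f: score([g for g in selected if g != f]) for f in selected}
        f = max(trials, key=trials.get)
        # Stop when a removal no longer improves weighted accuracy enough.
        if trials[f] - best < threshold:
            break
        selected.remove(f)
        best = trials[f]
    return selected, best

# Hypothetical scorer: feature 1 hurts the model, the others each help.
def toy_score(cols):
    return 0.6 + 0.1 * len([c for c in cols if c != 1]) - (0.15 if 1 in cols else 0)

sel, best = backward_select(toy_score, [0, 1, 2])
print(sel, round(best, 2))  # [0, 2] 0.8
```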
Results
Outcomes:
In all cases of running the logistic regression algorithm, the validation weighted accuracy was higher than the test accuracy by a few percent. This is because the models were optimised for the validation datasets, not the test datasets.
Validation weighted accuracy   Unbalanced training set   Balanced training set
Forward Select                 85.17%                    77.81%
Backward Select                84.07%                    74.56%
Random Select                  85.17%                    77.81%
Best Model
Across the six algorithm variations, models trained on unbalanced data consistently outperformed those trained on balanced data, because training on a balanced set made a model less capable of predicting the unbalanced validation and test sets. Moreover, the unbalanced dataset is a closer reflection of how the data would be distributed in a real scenario, so the unbalanced set was used.
Of the final three algorithms, random and forward selection gave the highest weighted accuracy. Their results were identical, as they returned the exact same features. However, Random Select was chosen as it is more robust and less greedy than forward selection, because it applies the greedy method to every possible starting feature. This increases the chance of finding the globally optimal combination, making it the chosen method.
Model Summary:
The highest validation weighted accuracy was found using the Random Selection and Forward Selection algorithms on an unbalanced dataset. The validation weighted accuracy was 85.17%. The attributes of this model were 'FamilyWork', 'Asian', 'Unemployment', 'Income' and 'MeanCommute', in that order. These attributes are explained below, and the correlation between each one and deaths is displayed individually on scatter plots in the appendix.
Attribute Meaning
FamilyWork Percentage of population in unpaid family work
Asian Percentage of population that is Asian
Unemployment Percentage unemployment rate
Income Median household income (USD)
MeanCommute Mean commute time to work (Mins)
Table 2: Meanings of 5 attributes chosen for the model
Table 1: Weighted Accuracy results of 3 Algorithms using Balanced and Unbalanced training sets
The p-value shows the relative significance of each attribute in the final model; a p-value below 0.05 indicates a significant attribute. The most important attributes were 'Asian', 'MeanCommute' and 'Unemployment'. It is important to note that in a model consisting of only 'FamilyWork', the p-value was below this threshold.
                    Predicted affected   Predicted unaffected
Actual affected     TP: 166              FN: 20
Actual unaffected   FP: 77               TN: 44
The confusion matrix illustrates that there were only 20 false negatives. This means that, of the counties that had at least one death, the model correctly identified 89%. This is an acceptable value because warning a county of likely deaths when none occur is better than assuring a county there will be no deaths when there is one. Of the counties without a single death, the model correctly predicted 37%.
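The quoted rates can be recomputed directly from the confusion matrix counts; note that the true-negative rate rounds to roughly 36-37% depending on rounding.

```python
# Metrics recomputed from the confusion matrix counts above.
tp, fn, fp, tn = 166, 20, 77, 44

recall = tp / (tp + fn)                      # affected counties correctly flagged
specificity = tn / (tn + fp)                 # unaffected counties correctly cleared
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(recall * 100))       # 89
print(round(specificity * 100))  # 36
print(round(accuracy * 100))     # 68
```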
Sanity checks
- Checking attribute importance with GetSummary for models with only one attribute
- Printing statements as features are added, to indicate the code is working properly
- Checking that the features returned by different algorithms (forward select, random select) agree
Conclusion
To conclude, using a Random Select algorithm, the attributes 'FamilyWork', 'Asian', 'Unemployment', 'Income' and 'MeanCommute' were chosen to predict which counties in the US will experience deaths, to an accuracy of 79.62%. This method could be improved by using a larger dataset or repeating the analysis at different times in the pandemic. It may also be useful to add more medical information about each county, such as hospital beds or smoking rates.
As a final exploration, Lasso Logistic Regression and a Support Vector Machine (SVM) were run to assess the performance of the Random Select model. The features returned by the LASSO were identical apart from income, which was replaced by a very similar metric, income per capita. This change in features increased the validation and test accuracy by around 0.2%. Finally, the SVM was run and, after iterating to find the best kernel and gamma, the maximum weighted accuracy was found. The SVM performed 4% worse in terms of validation accuracy but 3% better on the test dataset, making the difference between the validation and test accuracies only 2%. The two tests indicate that the Random Select model provides a high degree of accuracy, although further iterations could be made to choose better features and reduce overfitting.
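The two comparison models can be sketched with scikit-learn on synthetic data; the data, penalty strength and kernel settings here are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data: only features 0 and 3 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 3] + 0.3 * rng.normal(size=200) > 0).astype(int)

# L1-penalised (lasso) logistic regression drives weak coefficients to zero,
# performing feature selection implicitly.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])
print(kept)  # indices of the surviving features

# RBF-kernel SVM; in the project the kernel and gamma were tuned by iteration.
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(round(svm.score(X, y), 2))  # training accuracy on the toy data
```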
Table 3: Model summary of the final algorithm on the unbalanced dataset
Table 4: Confusion matrix of Random Select on the unbalanced test set
References
- New York Times, US Counties COVID-19 Dataset, dataset, available from: <https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset>
- US Census Bureau, 2017 US Census Demographic Data, dataset, available from: <https://www.kaggle.com/muonneutrino/us-census-demographic-data>
- Venkatarama, C., Analysis of US Demographic Data, article, available from: <https://rstudio-pubs-static.s3.amazonaws.com/352906_b6f719f938134f76bccb099ae1b89ed6.html>
- Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Handbook, handbook, available from: <https://bb.imperial.ac.uk/bbcswebdav/pid-1745733-dt-content-rid-6101422_1/courses/11036.201910/DE2-BD-2020_Handbook.pdf>
- Dr de Montjoye, Y.-A., Dr Cardin, M.-A., Dr Picinali, L., Big Data Module Lectures, lectures, available from: <https://bb.imperial.ac.uk/webapps/blackboard/content/listContent.jsp?course_id=_16571_1&content_id=_1745735_1&mode=reset>
Appendix
Appendix 1: Scatter Plots of selected attributes and Coronavirus deaths
Attribute RSS Value
FamilyWork 0.00319149
MeanCommute 0.0247023
Asian 0.09227881
Unemployment 0.00097703
Income 0.05355706
Appendix 2: Data for threshold value selection graph

Appendix 3: Code for all algorithms

Appendix 4: Forward Selection Algorithm

Appendix 5: Backward Selection Algorithm

Appendix 6: Random Selection Algorithm

Appendix 7: Lasso and SVM Algorithms