coronavirus-case-tracking
May 31, 2020
1 The Data Science Pipeline: COVID-19 Case Tracking
Philip Tian, Jonathan Lin, David Ahmed
1.1 Introduction
We will introduce the data science pipeline by analyzing data from the (currently ongoing) COVID-
19 pandemic in the United States. COVID-19, commonly called the coronavirus, is the disease caused
by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); it spread widely across the
continental United States in February and March 2020. A comprehensive list of known
and documented symptoms can be found here. As of May 15, 2020, around 4.5 million
individuals worldwide have been infected with the virus, and this global spread is why many
experts in the media and academia are calling this crisis a pandemic.
The COVID-19 crisis has caused many temporary economic and social changes. For example, most
states in the United States have issued stay-at-home orders, with guidelines on when and why one
can leave home. As a result, normal work activities have stopped, and countries have been struggling
with issues like sudden drops in economic productivity (documented in news stories like this one or
this one). These economic issues, caused directly by the virus and by governmental policies like
stay-at-home orders and the closure of “non-essential businesses”, have a big impact on the
well-being of people all across the world as unemployment grows in several countries.
Given this balance between maintaining local or national economic health and preventing the
spread of a dangerous disease, policymakers are faced with certain questions about the current
state of affairs.

- How long should we maintain a stay-at-home order nationally/locally?
- Are other states/countries handling the situation “better”? What are they doing differently?
- What does “better” even mean? Can we quantify these things when we are making policy decisions?
- How will cases grow from today? Can our current infrastructure (hospitals) handle this growth?
Faced with a variety of policy decisions that might literally cost lives, policymakers are walking a
tightrope, and in this age of information it is especially important that the decision finally made is
close to optimal. But how does one know whether or not a decision is optimal? Such a question can
be answered using the techniques of data science.
1.1.1 The Importance of Data Science During the COVID-19 Crisis
As the total number of COVID-19 cases increases in the USA and worldwide, it seems almost too
easy for the mainstream media to capitalize on all the new cases to churn out news stories. An
increase in cases (perhaps coupled with some official or expert statement) is a news story waiting
to happen. For example, here, here, and here are all news stories obtained directly by searching
the internet for the phrase “increase in cases”. With this influx of information, there are certain
questions we should ask.

- Can we trust that the data being reported is accurate and correct?
- What exactly does 1000 (or 2000, or 500) cases daily mean in the context of a certain state/country/county?
- To what extent can we predict how the crisis (especially in terms of the number of cases) evolves over time?
- Is it possible to make policy or economic decisions based on such analysis, and should we make these decisions?
It is precisely the field of data science, with its techniques and methods, that allows us to conduct an
accurate analysis of the data in order to produce better answers to the questions above. By
extracting, manipulating, and analyzing sets of data, we ensure that policymakers are far better
informed about the status of the crisis locally and nationally than they would be from day-by-day
information alone, playing it by ear, so to speak. Gathering such information is critical, especially
in a time when reliable information seems hard to find.
In order to even start considering questions like those posed above, we must consider many of
the aspects of the data science pipeline, which we have listed below.

1. Data Collection: data which is relevant to the project is found and collected.
2. Data Manipulation and Cleaning: data is manipulated into a format which is suitable for analysis.
3. Initial Data Visualization (or exploratory data analysis): numerical data can be graphed to spot trends.
4. Modeling and Prediction: one can use various methods to make predictions about future data or the general population.
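The four steps above can be sketched end to end in a few lines of pandas. The CSV contents below are made up purely for illustration; the real project works on the datasets described in the next section.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a downloaded CSV (step 1: collection)
raw = io.StringIO("date,positive\n20200305,0\n20200306,3\n20200307,5\n")
cases = pd.read_csv(raw)

# Step 2: manipulation and cleaning -- parse the date column into real datetimes
cases['date'] = pd.to_datetime(cases['date'].astype(str), format='%Y%m%d')

# Step 3: exploratory analysis -- e.g. derive the day-to-day change for plotting
cases['daily_increase'] = cases['positive'].diff()

# Step 4: modeling and prediction would fit, e.g., a regression to this table
print(cases)
```

Each later section of this project fills in one of these steps on the real data.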
In our project, we will attempt to illustrate the data science pipeline with respect to the COVID-19
crisis. We will illustrate (with working code) the basic aspects of this pipeline as described above,
starting from the beginning. First we will discuss the data and where we got it from, as well as
the background and assessment of accuracy. Then, we will manipulate and modify the data for our
purposes. Finally, we will use the data we obtained to make future predictions.
1.2 Data Collection
For this project, we would like to collect data about total cases in the United States, broken down
to the local level of states and counties. We’ve obtained data from the following sources, described
below.
1.2.1 The COVID Tracking Project
The COVID Tracking Project is a collective volunteer effort to document data about the crisis day
by day. Their data collection claims to be comprehensive, with data from every state and most US
territories. Their data, as one of the only comprehensive sources of data on cases in the US, is used
by many news outlets and academic experts.
We will be using their data on the day-by-day number of cases in the states and in the US. In
the code below, this data is contained in the csv files Data/CTPStates-historical.csv,
Data/CTPTestingMaryland.csv, and Data/CTPUS-historical.csv, the first containing the
state-level data, the second the number of tests given out in Maryland, and the last containing
overall US data. These spreadsheets give us the data we need for cases over time. The data is
updated daily and can be found here in this spreadsheet.
1.2.2 COVID-19 Data from JHU
Johns Hopkins University has a nice GUI which displays the current case data from around the
world. It can be found here. It is a very visually striking application and readers are encouraged to
go and play around with it. But where does this application pull data from? It pulls the data from a
github repository which is maintained by JHU for the purpose of keeping the application up to date.
The data itself is more specific than that of the COVID Tracking Project, in that it tracks cases
and deaths by county rather than state-wide. For example, in Maryland, one might notice that the
majority of cases arise in Montgomery County and Prince George’s County, with fewer overall cases
in neighboring counties (though Baltimore County does not trail too far behind). The data we’ve
obtained from JHU is packaged into the files Data/CasesDeathsCounty.csv and
Data/CountyConfirmedCases.csv.
1.2.3 Date of Stay-at-Home Information
We can find the date that each state established a stay-at-home order through this CNN Article,
which gives detailed information on when all the US States established this order.
Limitations of the data gathered Though, as highlighted above, there is plenty of data on the
total cases of COVID-19 over time, both nationally and locally, one must at least be aware of the
limitations that such convenient data carries.
The primary concern when working with this data is accuracy. If a policymaker were to use such
data and its trends to make important decisions relevant to the crisis, then the first questions they
ask should be: Is this data reliable? Can I trust that the data values are accurately measured?
This concern about accuracy is especially important when considering stay-at-home orders,
shutdowns, and other impactful decisions.
The main limitation with large datasets that measure data from many locations, such as the COVID
Tracking Project and JHU datasets, is that you need many sources in order to assess the accuracy
of the data. For the COVID Tracking Project, that means at least one source for each state and
territory. For JHU, it is at least one source for each county, which is seemingly even worse, as a lot
more can go wrong.
Concerns about the COVID Tracking Project Data The COVID Tracking Project dataset
pulls data from the relevant state and territory government health services. This means that the
reliability of this data starts and ends with the accuracy of state government reporting. It is a
difficult job to figure out how each state measures its data and whether or not it is accurate.
Concerns about the JHU Dataset As of May 16, the github page for the JHU dataset has
over 1300 reported issues. Some of these are complete non-issues, but some report supposedly
serious problems with the data; for example, this one claims that the applet is not reporting case
numbers for Nepal correctly. A lot of the issues with the data are encoding related, having to do
with state or county codes. Hopefully these will not be too much of an issue for this project.
For the purposes of our project, we need not really concern ourselves with the reliability of the data
collection. It is not our job, and besides, we don’t have the power to confirm it ourselves. However,
if we were in a position of more power, with the influence to affect policy decisions, then weighing
these issues is something that we absolutely have to do. Hopefully we have made the point that
in many cases the hardest part of the data science pipeline is verifying that the data
you analyze is accurate to the fullest extent possible. But for the purposes of continuing
our project we will assume that it is.
2 Setup
Here we set up all the libraries we will be using. We list them below:

- pandas will be used for data manipulation. We will use it to format our raw data into tables.
- plotnine is a Python library similar to ggplot2 in R. We will use it to graph our data.
- numpy is a scientific computing library.
- statsmodels is a statistical library that we will use for our data analysis.
[29]: import pandas as pd
from plotnine import *
import numpy as np
import statsmodels.formula.api as sm
import warnings
warnings.filterwarnings("ignore")
2.1 Data Manipulation
The code below manipulates the data into a format amenable to analysis. First we extract the
information from the .csv files into pandas dataframes using the read_csv function. Then, using
some filters, we narrow the relevant information down into two data tables, maryland_deaths and
maryland_confirmed.
[30]: counties = set(["Allegany", "Anne Arundel", "Baltimore", "Calvert",
                      "Caroline", "Carroll", "Cecil", "Charles", "Dorchester",
                      "Frederick", "Garrett", "Harford", "Howard", "Kent",
                      "Montgomery", "Prince George's", "Queen Anne's",
                      "St. Mary's", "Somerset", "Talbot", "Washington",
                      "Wicomico", "Worcester", "Baltimore City"])
      county_deaths = pd.read_csv("Data/CasesDeathsCounty.csv")
      county_confirmed = pd.read_csv("Data/CountyConfirmedCases.csv")
      testing = pd.read_csv("Data/CTPTestingMaryland.csv")
      testing['date'] = pd.to_datetime(testing['date'], format='%Y-%m-%d')
      # Keep Maryland rows for the listed counties, restricted to 04/01-05/12 of 2020
      maryland_deaths = (county_deaths
                         .loc[county_deaths["state"] == "Maryland"]
                         .loc[county_deaths["location_name"].isin(counties)]
                         .loc[county_deaths["date"].between("04/1/2020", "05/12/2020")]
                         .reset_index(drop=True))
      maryland_confirmed = (county_confirmed
                            .loc[county_confirmed["county_name"].isin(counties)]
                            .query('state == "Maryland"')
                            .reset_index(drop=True))
[31]: maryland_confirmed.head(10)
[31]: last_update state county_name county_name_long
0 2020-05-12 21:32:28 Maryland Allegany Allegany, Maryland, US
1 2020-05-12 21:32:28 Maryland Anne Arundel Anne Arundel, Maryland, US
2 2020-05-12 21:32:28 Maryland Baltimore Baltimore, Maryland, US
3 2020-05-12 21:32:28 Maryland Calvert Calvert, Maryland, US
4 2020-05-12 21:32:28 Maryland Caroline Caroline, Maryland, US
5 2020-05-12 21:32:28 Maryland Carroll Carroll, Maryland, US
6 2020-05-12 21:32:28 Maryland Cecil Cecil, Maryland, US
7 2020-05-12 21:32:28 Maryland Charles Charles, Maryland, US
8 2020-05-12 21:32:28 Maryland Dorchester Dorchester, Maryland, US
9 2020-05-12 21:32:28 Maryland Frederick Frederick, Maryland, US
fips_code lat lon NCHS_urbanization total_population
0 24001.0 39.623576 -78.692805 Small metro 71977.0
1 24003.0 39.006702 -76.603293 Large fringe metro 567696.0
2 24005.0 39.457847 -76.629120 Large fringe metro 827625.0
3 24009.0 38.539616 -76.568206 Large fringe metro 91082.0
4 24011.0 38.871723 -75.829042 Non-core 32875.0
5 24013.0 39.564536 -77.023737 Large fringe metro 167522.0
6 24015.0 39.566477 -75.946274 Large fringe metro 102517.0
7 24017.0 38.510923 -76.985807 Large fringe metro 157671.0
8 24019.0 38.454135 -76.027524 Micropolitan 32261.0
9 24021.0 39.472966 -77.399994 Large fringe metro 248472.0
confirmed confirmed_per_100000 deaths deaths_per_100000
0 148 205.62 13 18.06
1 2520 443.90 127 22.37
2 4051 489.47 204 24.65
3 211 231.66 13 14.27
4 174 529.28 0 0.00
5 589 351.60 60 35.82
6 270 263.37 15 14.63
7 761 482.65 55 34.88
8 102 316.17 2 6.20
9 1282 515.95 77 30.99
[32]: maryland_overall = pd.read_csv("Data/CTPStates-historical.csv")
      maryland_overall = maryland_overall.query('state == "MD"').reset_index()
      maryland_overall['date'] = pd.to_datetime(maryland_overall['date'], format='%Y%m%d')
      maryland_overall = maryland_overall.sort_values('date')
      # Join in the testing counts so we can compute a positivity ratio
      maryland_overall = maryland_overall.merge(testing, on='date', how='inner')
      maryland_overall['positiveRatio'] = (maryland_overall['positive'] /
                                           maryland_overall['cumulative_total_people_tested'])
      maryland_overall = maryland_overall[['date', 'positive', 'positiveIncrease',
                                           'positiveRatio',
                                           'cumulative_total_people_tested']]
      maryland_overall.head(10)
[32]: date positive positiveIncrease positiveRatio
0 2020-03-05 0.0 NaN 0.000000
1 2020-03-06 3.0 3.0 0.103448
2 2020-03-07 3.0 0.0 0.068182
3 2020-03-08 3.0 0.0 0.054545
4 2020-03-09 5.0 2.0 0.064103
5 2020-03-10 6.0 1.0 0.063158
6 2020-03-11 9.0 3.0 0.087379
7 2020-03-12 12.0 3.0 0.113208
8 2020-03-13 17.0 5.0 0.153153
9 2020-03-14 26.0 9.0 0.216667
cumulative_total_people_tested
0 17
1 29
2 44
3 55
4 78
5 95
6 103
7 106
8 111
9 120
2.2 Exploratory Data Analysis
The simplest way we can view the data is by looking at the number of positive tested cases over
time. This will give us a basic understanding about the spread of COVID-19.
[33]: (ggplot(maryland_overall,aes(y='positive',x='date'))
+ geom_point() + ggtitle("Number of Confirmed Cases in MD")
+ xlab("Date") + ylab("Cases"))
[33]: <ggplot: (146476666316)>
We can see that the number of cases is rapidly increasing, though the growth seems to be stabilizing
into a more linear shape as time goes on.
Another interesting thing we can look at is the positive increase per day, essentially a graph of the
derivative at each day. This will give us an idea of how the rate of transmission is changing.
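The data already carries a positiveIncrease column, but if only the cumulative counts were available, this per-day derivative could be recovered with pandas’ diff. A toy sketch (the numbers are made up):

```python
import pandas as pd

# Cumulative positives over four days; diff() gives the day-to-day change,
# i.e. the discrete derivative of the cumulative curve
positive = pd.Series([5.0, 9.0, 17.0, 26.0])
daily_increase = positive.diff()
print(daily_increase.tolist())  # first entry is NaN (no previous day)
```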
[34]: (ggplot(maryland_overall,aes(y='positiveIncrease',x='date'))
+ geom_point() + geom_smooth(method='lm')
+ ggtitle("Increase in confirmed cases in MD"))
[34]: <ggplot: (146483016147)>
We see that the rate of transmission still seems to be on the rise, despite quarantine and stay-at-home
orders.
We can also view the spread of COVID-19 through the ratio of tests that return positive for the virus
against the total number of tests given. It should be noted that this data is not truly indicative
of the true rate of transmission of the virus, as we imagine only people who are exhibiting
symptoms will go out to be tested. Nevertheless, this graph will give us some interesting
results.
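The positiveRatio column plotted below is simply cumulative positives divided by cumulative people tested, as computed in the data-manipulation cell above. A toy recomputation with values echoing the first rows of the Maryland table:

```python
import pandas as pd

# Cumulative positives and cumulative people tested on three early dates
df = pd.DataFrame({'positive': [3.0, 5.0, 9.0],
                   'tested':   [29, 78, 103]})
# Positivity ratio: share of all tests so far that came back positive
df['positiveRatio'] = df['positive'] / df['tested']
print(df['positiveRatio'].round(3).tolist())  # [0.103, 0.064, 0.087]
```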
[35]: (ggplot(maryland_overall,aes(y='positiveRatio',x='date'))
+ geom_point() + geom_smooth(method='lm')
+ ggtitle("Ratio of positive cases in MD"))
[35]: <ggplot: (146482781353)>
Here, the ratio of cases has a sudden discontinuity between the dates of 3/26 and 3/27. The reason
is that, within the data, only positive tests were tracked prior to 3/27. This means the ratio is
extremely skewed towards the positive number of cases, creating an improbable curve in the graph.
To see the change in the overall ratio, we will limit the graph to dates after 3/27. The COVID
Tracking Project corroborates this discontinuity here, where under the MD row in “States” they
record that Maryland did not report negative cases between 3/12 and 3/28 (a one-day discrepancy
between this plot and that record). Below is a plot where the values before 3/28 are thrown out
and a linear regression model is fit.
[36]: temp = maryland_overall.query('date > "2020-03-27"')
(ggplot(temp,aes(y='positiveRatio',x='date'))
+ geom_point() + geom_smooth(method='lm')
+ ggtitle("Ratio of positive cases in MD"))
[36]: <ggplot: (146473328399)>
Here is a facet plot of the cumulative cases for all counties in Maryland. From the chart it is clear
that some counties have far larger case growth rates than others (which is to be expected, as only
a few counties are very populous).
[37]: (ggplot(maryland_deaths, aes(x="date", y="cumulative_cases"))
+ geom_point()
+ facet_wrap('~location_name', ncol=4)
+ theme(axis_text_x = element_text(angle=90), figure_size=(30,30)))
[37]: <ggplot: (146482755697)>
This is a similar plot, but restricted to the 8 most populous counties (Montgomery, Prince George’s,
Baltimore, Baltimore City, Howard, Anne Arundel, Frederick, Harford) as listed here. As population
likely affects transmission rate, it may be useful to isolate these.
[38]: big_counties = ["Montgomery", "Prince George's", "Baltimore",
                      "Baltimore City", "Frederick", "Howard",
                      "Anne Arundel", "Harford"]
      # query() would choke on the apostrophe in "Prince George's",
      # so filter with isin instead
      temp = maryland_deaths[maryland_deaths['location_name'].isin(big_counties)]
      (ggplot(temp, aes(x="date", y="cumulative_cases"))
       + geom_point()
       + facet_wrap('~location_name', ncol=4)
       + theme(axis_text_x = element_text(angle=90), figure_size=(70,35)))

[38]: <ggplot: (-9223371890378682827)>
2.3 Hypothesis Testing and Machine Learning
Now we will perform some hypothesis testing to see if quarantine has actually changed the
transmission rate of COVID-19. To do so, we will create linear regressions for the rate of increase
in positive cases before and during the quarantine. Our null hypothesis is that there is no difference
between the rates of increase, and our alternative hypothesis is that there is some difference.
[39]: # Label each day 0/1 for before/during quarantine, with jDate counting
      # days from the start of each period (the offsets are Julian day numbers)
      JDate = list()
      before_df = maryland_overall.query('date < "2020-04-14"')
      for i, row in before_df.iterrows():
          JDate.append(row['date'].to_julian_date() - 2458912.5)
      before_df['jDate'] = JDate
      before_df['status'] = "0"

      JDate = list()
      after_df = maryland_overall.query('date >= "2020-04-14"')
      for i, row in after_df.iterrows():
          JDate.append(row['date'].to_julian_date() - 2458952.5)
      after_df['jDate'] = JDate
      after_df['status'] = "1"
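A model of this form — one slope and one intercept per status group — can be fit by concatenating the two frames and using an interaction term in a statsmodels formula, which is what produces coefficient names like status[T.1] and jDate:status[T.1] in the summary below. The sketch here uses synthetic data (every number is made up, with slopes chosen only to roughly echo the fitted coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for pd.concat([before_df, after_df]): 40 days of
# "before" data and 30 days of "during" data, each with its own jDate clock
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'jDate': np.r_[np.arange(40.0), np.arange(30.0)],
    'status': ['0'] * 40 + ['1'] * 30,
})
# Cumulative counts growing ~190/day before and ~900/day during, plus noise
trend = np.where(df['status'] == '0',
                 190 * df['jDate'],
                 9400 + 900 * df['jDate'])
df['positive'] = trend + rng.normal(0, 200, len(df))

# OLS with an interaction: jDate:status[T.1] estimates the change in slope
model = smf.ols('positive ~ jDate * status', data=df).fit()
print(model.params)
```

Because status is categorical, patsy expands the formula into a baseline intercept and slope (the "0" group) plus offsets for the "1" group, exactly the four rows of the coefficient table.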
      Df Residuals:        64      BIC:    1164.
      Df Model:             3
      Covariance Type:      nonrobust
      =====================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
      -------------------------------------------------------------------------------------
      Intercept          -2068.0115    368.746     -5.608      0.000   -2804.667   -1331.356
      status[T.1]         9435.4242    577.425     16.341      0.000    8281.886    1.06e+04
      jDate                191.6871     15.674     12.230      0.000     160.376     222.999
      jDate:status[T.1]    711.5184     31.022     22.936      0.000     649.546     773.491
      =====================================================================================
      Omnibus:          6.598    Durbin-Watson:       0.111
      Prob(Omnibus):    0.037    Jarque-Bera (JB):    6.084
      Skew:             0.722    Prob(JB):            0.0478
      Kurtosis:         3.243    Cond. No.            99.8
      =====================================================================================

      Warnings:
      [1] Standard Errors assume that the covariance matrix of the errors is correctly
      specified.
      """
The above plot and table show the information for a linear regression. The table tells us that the
rate of transmission prior to the quarantine was ~191 new confirmed cases per day (the jDate
coefficient), whereas after quarantine began and the incubation period passed, the slope increased
by ~711 new cases per day (the jDate:status interaction term). The p-value for each of these
coefficients is lower than our designated α of 0.05, so we can say these slopes are statistically
significant. We also see that the p-value for the difference between the two slopes is effectively 0,
meaning we reject our null hypothesis that the rate stayed the same throughout quarantine.
Something we should note is that our rate actually increased drastically after quarantine (and the
incubation period), which is not at all what we expected. It does make sense, though, for viruses,
especially ones as contagious as COVID, to spread proportionally to the number of positive cases,
as more cases means a higher rate of transmission. Additionally, there are many other factors, such
as differences between states, urbanization, and availability of testing, as well as other reasons
likely outside the scope of our understanding.
2.4 Other Resources
As the COVID-19 pandemic is ongoing, you should strive to stay informed about the virus in
general. Every major news source, and most minor sources as well, will have ongoing updates
concerning the virus. As our analysis was centered solely on the case of Maryland, you should
check your local and state government webpages for their stance and laws concerning the situation.
For more in-depth news and articles concerning COVID, you should follow the World Health
Organization website for scholarly articles and global news. To view data about the pandemic, we
recommend the data provided by the World Health Organization, as well as the COVID Tracking
Project and the data gathered from Johns Hopkins University, which were used in our analysis.
COVID-19 is a serious threat across the globe, so the more informed you are and the more involved
you are with the data, the better your decision making will be on how to stay safe and healthy.