Approximating a More Accurate Count of Deaths Caused by the
Russian Flu in the United States
Elijah Fiore, Robert Legge, and Jonathan Walton
December 2, 2016
The Russian Flu was a pandemic influenza outbreak that originated around Russia and Central Asia in 1889
and moved west to Europe and the Americas, killing roughly one million people worldwide. However, due
to poor reporting in certain areas and possible misidentification of similar causes of death, the death toll
is being reexamined by Professor Tom Ewing at Virginia Tech to find a more accurate approximation of
the number of people who died as a direct result of the Russian Flu. Using data from the Report on Vital
and Social Statistics from the 1880, 1890, and 1900 US Censuses, we used “excess analysis” as a means
of statistically estimating the number of misidentified causes of death. With the group-wise methodology,
we determined that there were 22,409 victims of the Russian Flu. When we repeated the analysis for each state
and added a tolerance to adjust for natural fluctuations in death rates, we obtained a slightly higher estimate
of 25,477 deaths. While this is not a definitive estimate, due to the lack of available data and time
constraints, it can be used as a stepping stone for future research and analysis.
1 Problem Statement
For several years, Tom Ewing, a professor at Virginia Tech in the College of Liberal Arts and Human
Sciences, has been researching the Russian Flu and its significance in a historical context. Throughout his
time researching, he has sought the help of multiple groups of students, including ourselves, to aid in his
efforts. While Professor Ewing’s research covers multiple aspects of the disease outbreak, our main objective
is to accurately estimate the number of deaths that occurred from the Russian Flu between December 1889
and January 1890 within the United States. Professor Ewing believes that the reported count of those infected
and killed by the Russian Flu is extremely low, especially compared to the number of deaths reported during the
Spanish Flu outbreak 30 years later. While many researchers have studied the Spanish Flu because of its
notoriety, little is currently known about the Russian Flu. Professor Ewing noted that much of the data
available on the causes of death through 1889 and 1890 was inconsistent due to faults in reporting and other
confounding variables. Our work on this problem aims to marry historical research with data science to
correct for the statistical shortcomings in Professor Ewing’s data. This problem is important for
understanding other historical examples of disease, and perhaps future occurrences, along with media coverage
and reporting of these events. Our results could also prove beneficial to other historians and scholars whose
studies focus on disease outbreaks. We hope to build upon Professor Ewing’s findings in order
to provide accurate numerical context to this historical issue.
2 Ethical Considerations
Although the intentions of our project are purely academic, by the very nature of our project, we are claiming
that many causes of death for Americans of the 1890s were inaccurately reported and that the true cause
of death was the Russian Flu. It could be argued that trying to change one’s cause of death more than a
century later, when the matter has long been put to rest, is unethical. Furthermore, our model is based on an
assumption by Professor Ewing. While his assumption is well informed and his research so far supports his
claim that the current death count is inaccurate, our model depends almost entirely on Professor Ewing’s
inference being correct. Our model works by comparing several years of US Census data and assumes that
every death from a handful of causes that falls outside of the norm was due to the Russian Flu. Reporting
any “new Russian Flu death count” as fact rather than a very broad estimate would be a very inaccurate
representation of the model. The model answers a historical question and should be left in a historical
context.
3 Literature Review
Professor Ewing’s theory is well cited in his exploratory article on the Vital Statistics data. He illustrates a
lack of reported “La Grippe” cases paired with an abundance of other reported diseases such as pneumonia.
Outside of Professor Ewing’s research, only a handful of articles exist on the Russian Flu. We have found
in our research a French study examining the age distribution of the epidemic based on census data from 15
countries. The authors found that the clinical attack rate in these European countries was high and relatively
constant within the age range of 1-60 years, but lower at the extremes outside this range. This paper had
similar aspirations to our project, but its data were not useful for our purposes [3].
Additionally, a paper by J.F. Brundage in the American Journal of Preventive Medicine investigates case rates
and death rates from different strains of influenza. While his goal is similar to ours in that he
wanted to create a clear picture of the dynamics of an influenza epidemic, he focused on the Spanish and
Asiatic strains and looked at case rates and attack rates based on age. The Russian Flu is significantly less
well known than the far deadlier Spanish Flu, which plagued the world less than 30 years later [1].
Lastly, a paper published in the Proceedings of the National Academy of Sciences explored the transmissibility
and geographic spread of the Russian Flu using military archives in Europe. The authors utilized a
technique we hope to emulate in our project that used disease dynamics determined by this geographic data
in order to estimate a SEIR model [2].
4 Design Criteria
Before beginning any work on the final product, we created a set of design criteria to ensure that our final
product met the needs of our client; however, as we progressed, we began to realize what was possible with
this project and what was not. As a result, changes were made to our design methodology and considerations
several times through the course of the semester in an attempt to take these obstacles into account while
still providing the best solution to the problem at hand. Primarily, through some thought about what our
final product would be, we decided to split our efforts into sub-problems that would be completed on a
step-by-step basis and each had their own design criteria.
1. Statistical Analysis - This is less of a sub-problem than a preliminary step needed to
continue with the following sub-problems. As such, this section will not have design criteria/methods
to weigh. This part of our solution will include comparing data from the 1890 Census to the 1880 and
1900 Censuses.
2. Modeling - This is the first sub-problem with design considerations. The modeling section refers to the
numerical modeling of the disease that is the backbone of our solution. Initially, we hoped to formulate
a differential-equation or an Integrated Nested Laplace Approximation (INLA, spatio-temporal) SEIR model.
The criteria for the modeling portion of our solution are as follows:
• Historical Validity. This refers not only to how accurate our model is in detailing the fatalities,
but also to how it stands against much of the research Professor Ewing has already done on the
subject.
• Usability. Professor Ewing and his team are mostly historians who do not possess the technical
skills to heavily adjust our code or math in the future. It is our goal to create a model that does
not require a technical background to use and would be rated a 1 or 2 on an ease-of-use scale of
1 (easy) to 5 (hard) by Professor Ewing and his team.
• Model Adaptability. The scope of our project is only within the United States, but it is essential
that our model be mindful of census parameters used in other countries. It would be ideal to
create a model that can easily be manipulated to be used at various geographic levels, such as
country, region, state, and city level. Since reporting between US Censuses is not consistent,
it is unlikely that consistency across most geographic levels exists, so we can accommodate this by
making our process fairly simple and easily adaptable.
• Timeliness. We did not want to take too much time creating our model because
we were scheduled to give a presentation in early October for Professor Ewing’s colleagues and
anyone else who was interested in attending. This presentation was pushed back to the end of
November due to issues from both our side and the client’s. This criterion was added in order to separate
the time-consuming modeling techniques from ones that are reasonable.
3. Visualizations - This sub-problem was originally a criterion of our solution rather than a section by
itself, but we found that it was important to separate the modeling from the data visualizations, as
they involve very different methods that should be graded apart from one another. Our technique
options for visualizations include heat maps, time series, clustering (or factor-based visualizations),
and map charts. The design criteria for the visualizations are new, but fairly self-explanatory, and will
be touched on in the evaluation of the visual methods.
• Usefulness. It is important that the information portrayed can be used for future analysis and
shows some level of substance.
• Ease of Understanding. We want to minimize unnecessary content in order to avoid any confusion
as to what is being shown by our visualizations. For our project, simple and effective is better
than complex and bulky.
• Run-Time. This is a simple consideration as we do not want our model to take too much time
to run, especially if we are handing it off to Professor Ewing and his team. We need it to run
numerous operations with multiple states concurrently in order for it to be convenient to our
client and audience.
• Aesthetic Appeal. In addition to functionality and correctness, it is important that our visual-
izations be usable for presentations and papers. We want to ensure that, while these plots and
charts are useful, they also look nice and are not overly simplistic or cluttered.
4.1 Quantitative Ranking of Design Criteria
With these criteria in mind we can weigh each according to their value to the project. Using a pairwise
comparison chart, we were able to prioritize our design criteria to narrow down our options and allow us to
select the best approach.
For our mathematical models, this is:
Criterion              Weight
Historical Validity    .40
Timeliness             .30
Usability              .20
Adaptability           .10
“Historical Validity” has the highest weight in our project, as the other criteria have no merit if our model
does not have an acceptable level of accuracy. “Timeliness” has the next highest weight. Professor Ewing
had set a date in early October for a presentation, which was eventually moved back to the end of November,
so we knew we could not spend too much time developing a model; otherwise our work would be of limited use
to him. “Usability” falls below timeliness, as it is more important to our sponsor to have accurate, timely
results than to be able to revisit and revise our results. Our lowest weighted criterion is “Adaptability”.
Our model cannot be so specific that it will only work in the United States. Instead, we need to keep in mind
data that would be equally available in other countries, although those analyses are outside the scope of our
project. Adaptability is something we can focus on after the main goals of the project have been satisfied.
In comparison, our scoring for our visualizations is as follows:
Criterion               Weight
Usefulness              .40
Ease of Understanding   .30
Run Time                .20
Aesthetic Appeal        .10
We see that “Usefulness” is weighted the highest because if the model does not help to solve the main
problem at hand and no further insight can be gained from it, then there is little point to the model and
the model ultimately fails. Following this is “Ease of Understanding”, as it is imperative that our solution
can easily be utilized and understood by Professor Ewing and the rest
of his research team once the semester is over. “Run Time” is less of an issue because we do not
foresee our visualizations taking an inordinate amount of time to render, but we are aware that
any solution created should be completed within a reasonable amount of time. Lastly, we have “Aesthetic
Appeal”. We are aware that any visualization we create may be used in future presentations or simply for
research purposes, and so they should look decent, but this factor is also not nearly as important as the
previously stated factors in terms of gaining further insight from the data through our visualizations.
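The report does not reproduce the pairwise comparison chart itself, so the following is only a minimal sketch of how such a chart can yield weights. It assumes a common convention (one point per pairwise win, plus one point for every criterion so that none is weighted zero); that convention, the function name pairwise_weights, and the preference function are illustrative assumptions, although with four strictly ranked criteria the convention does reproduce the .40/.30/.20/.10 weights listed above.

```python
from itertools import combinations


def pairwise_weights(criteria, prefer):
    """Turn pairwise preferences into normalized weights.

    criteria: list of criterion names.
    prefer:   function (a, b) -> the criterion judged more important.
    Assumption: each win scores 1 point and every criterion gets 1 extra
    point so the lowest-ranked criterion keeps a nonzero weight.
    """
    wins = {c: 0 for c in criteria}
    for a, b in combinations(criteria, 2):
        wins[prefer(a, b)] += 1
    scores = {c: w + 1 for c, w in wins.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Ranking taken from the modeling criteria table; the strict ordering is
# used here as a stand-in for the individual pairwise judgments.
ranking = ["Historical Validity", "Timeliness", "Usability", "Adaptability"]
weights = pairwise_weights(
    ranking,
    lambda a, b: a if ranking.index(a) < ranking.index(b) else b,
)
print(weights)
```

The same function applies unchanged to the visualization criteria by passing their ranking instead.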
5 Summary of Techniques Used to Validate Model
Due to the nature of our data and analysis technique, there were no intuitive ways to validate our results.
When SEIR modeling was an option, the validation technique we proposed was to compare the simulated disease
numbers to the numbers from a state determined to be “well-reported”; however, with the assumption that flu
numbers may be entirely inaccurate and the departure from a SEIR modeling solution due to inadequate data, this
method was unusable. Our solution was to run the same analysis on an unrelated cause of death (or one
historically less likely to be misdiagnosed) and to confirm that our calculations did not attribute excess
deaths from that cause to the flu. This validation is discussed further in the results portion of the report.
6 Selected Design Solutions
A natural part of data analysis problems is figuring out the data one wants or needs and reconciling this with
what is actually available; this was indeed an important portion of our endeavor, and this issue inspired
countless changes to the solution methods proposed earlier in the report. Though epidemic modeling was a
main aim of the project through much of the semester, there were some challenges with the data that made
this sub-problem (or modeling solution) ineffective, and perhaps not possible, within our scope.
For traditional SEIR models, changes in population compartments are estimated and simulated over time using
differential equations with parameters that directly affect the disease’s dynamics. Our original methodology
was to determine these parameters (like contact rate, incubation period, etc.) through statistical analysis
of Vital Statistics from the 1880, 1890, and 1900 censuses. For SEIR to be practical, there must be accurate
parameter estimates to abstract the real world in an accurate manner. However, through exploration of
available data, there was simply not enough reliable data on the number of influenza cases or the attributes
of the disease itself that was usable for this type of analysis. The most comprehensive set available was
the collection of death counts by cause of death found in the very census we assume is inconsistent for our
problem. In addition, SEIR modeling is generally used to track populations and disease outbreak, not total
deaths, which is our goal in this project.
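For reference, the textbook form of the SEIR system mentioned above is sketched below. The symbols are the conventional ones and are not taken from the report: N = S + E + I + R is the total population, β is the transmission rate, 1/σ is the mean incubation period, and 1/γ is the mean infectious period. This basic form has no births or deaths, which illustrates why it tracks compartment sizes rather than death totals.

```latex
\begin{align}
\frac{dS}{dt} &= -\frac{\beta S I}{N} \\
\frac{dE}{dt} &= \frac{\beta S I}{N} - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dR}{dt} &= \gamma I
\end{align}
```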
As we thought more in depth about the problem and data at hand, it became clearer that SEIR was not an
appropriate method for estimating deaths in the US from the Russian Flu at this stage in the investigation.
Due to this realization, our methodology was updated to a reduced two-sectioned approach: statistical
analysis (calculating falsely attributed upper respiratory deaths through examining trends in death rates)
and visualizations (map charts, bar graphs, and plotting software for Professor Ewing).
6.1 Statistical Analysis: Excess Analysis of Marker Diseases
Through our initial talks with Professor Ewing and general research on the flu outbreak in 1890, we found
discrepancies in La Grippe death counts that were the basis of the investigation. These were primarily seen
in the disconnect between the most populous states around the Northeast US and the lack of
reported La Grippe deaths in these areas, highlighted by figure 1. Many of the Northeastern states (and eastern
states in general) that had high populations and greater density had lower counts of La Grippe deaths, which
goes against normal disease dynamics, since large cities would be expected to translate to a greater spread of
influenza. The discrepancy is most pronounced for the state of New York, which had the highest
population in 1890 (around 6.1 million) with a relatively low death count (reported as 298). By comparison,
Virginia, a smaller state with a population of about 1.7 million at this time,
reported 575 deaths in the 11th Census.

Figure 1: Population in 1890 and La Grippe Death Count in 1890 (two map panels: 1890 Population; 1890 La Grippe Deaths)
With guidance from Professor Ewing, we established an assumption for the cause of this “under-reporting”
in states like New York using the historical context of the Russian Flu. In the early days of the epidemic,
doctors in the states affected first were not aware of, or not looking for, influenza in their diagnoses and
classified deaths under diseases with similar symptoms, mostly upper respiratory diseases. The notion of
under-reporting inspired our statistical method for estimating flu deaths by looking specifically at the trends
in these marker diseases, which we called “excess analysis”. Excess analysis refers to examining the natural
movement of death rates in similar diseases and comparing the actual rate in 1890 with the 1890 rate predicted
from the trends to isolate the excess, or additional deaths outside the norm. This is similar to the
method the CDC uses to track influenza, which combines pneumonia and influenza deaths into
one category. We assume that these excess deaths can be attributed to the misreporting of the Russian
Flu. Figure 2 shows a simple visualization of this identification. In our primary analysis, which looked at
geographical groups of states, we used formulas similar to the following sample formulas to calculate the
additional deaths from the historically relevant diseases Bronchitis, Consumption, and Pneumonia:
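\begin{align}
\text{Trend rate} &= \frac{\text{Bronchitis death rate}_{1900} - \text{Bronchitis death rate}_{1880}}{20} \tag{1} \\
\text{Predicted death rate}_{1890} &= (\text{Trend rate} \times 10) + \text{Bronchitis death rate}_{1880} \tag{2} \\
\text{Excess} &= \text{Actual death rate}_{1890} - \text{Predicted death rate}_{1890} \tag{3} \\
\text{Estimated death rate} &= \text{La Grippe rate} + \text{Marker excess} \tag{4} \\
\text{Estimated deaths} &= \frac{\text{Estimated death rate} \times \text{Total 1890 population}}{100{,}000} \tag{5}
\end{align}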
Later in our analysis when we performed excess calculation for every state, we also added a tolerance to the
excess death rates to make a more conservative estimate of what death rates in 1890 were actually out of
the norm. We chose the tolerance to be one standard deviation away from the predicted death rate of each
disease. The tolerance acted as a filter for adding only significant departures from death trends to the total.
All calculations in this analysis were performed in R.
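Equations (1) through (5) can be sketched in a few lines of Python. This is an illustration rather than the authors' R code: the function names are hypothetical, and the input rates and population are made-up round numbers chosen so the arithmetic is easy to follow, not values from the census tables.

```python
def predicted_rate_1890(rate_1880, rate_1900):
    """Equations (1)-(2): linear trend from 1880 to 1900, evaluated at 1890."""
    trend_rate = (rate_1900 - rate_1880) / 20
    return rate_1880 + trend_rate * 10


def excess_rate(rate_1880, rate_1890, rate_1900):
    """Equation (3): actual 1890 rate minus the rate predicted by the trend."""
    return rate_1890 - predicted_rate_1890(rate_1880, rate_1900)


def estimated_deaths(la_grippe_rate, marker_excesses, population_1890):
    """Equations (4)-(5): add marker excesses to the La Grippe rate, then
    convert the rate per 100,000 into a death count."""
    estimated_rate = la_grippe_rate + sum(marker_excesses)
    return estimated_rate * population_1890 / 100_000


# Hypothetical rates per 100,000 (not census values).
bronchitis_excess = excess_rate(rate_1880=40.0, rate_1890=70.0, rate_1900=50.0)
deaths = estimated_deaths(
    la_grippe_rate=12.5,
    marker_excesses=[bronchitis_excess],
    population_1890=1_000_000,
)
print(bronchitis_excess, deaths)
```

With these inputs the trend predicts 45 per 100,000 for 1890, so the excess is 25 per 100,000, and the estimated rate of 37.5 per 100,000 corresponds to 375 deaths in a population of one million.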
Figure 2: Identifying Excess in Bronchitis Death Rates
6.2 Visualization Approaches
In addition to finding a viable solution through the application of statistical analysis, visualizations help
communicate these findings. Several types we considered include heat maps that show the disease’s impact
at the city, state, and national level, time series plots to show the spread of the disease, and clustering
in an attempt to find correlations between specific fields such as age, gender, race, and geographic location.
Map charts, bar charts, and time series plots were the techniques selected due to their viability for the data
and results we obtained.
For our project, map charts were selected to visualize death rates, counts, and populations of the respective
states in our set. This is the simplest way to show the geographical component of influenza and death
reporting. These map charts were created in RStudio with the ggplot2 and maps packages. In addition, bar
charts were utilized as an exploratory tool to identify marker diseases that had spikes in death rates in
1890; this is seen in figure 2. The bar graphs were also created using ggplot2 and a graphical R function called
multiplot. Finally, as a deliverable to Professor Ewing, we created an interactive tool coded in Python in
Enthought Canopy for on-the-spot visualization of disease deaths from the Census data. The main package
used to create this tool was matplotlib, specifically its pyplot and widgets modules.
As seen in figure 3, the sliders on the bottom of the window allow the user to customize what data is
plotted. The user can choose which state and which disease to plot, or set both the State and Cause of
Death sliders to the left at zero to show all states and all diseases, respectively. The plotted points show the
normalized number of deaths for each year, and clicking on a point returns the actual death count.
This is a useful tool because it can be easily adapted to a similarly structured data frame. This will aid
Professor Ewing if he chooses to drill down to the city level, shift his focus to another country, or observe
other causes of death around the time period.
Figure 3: Demonstration of Plotting Interactive Tool (two panels titled “Normalized Death Counts by Disease”, both filtered to New York; x-axis: Year, 1880 to 1900; y-axis: normalized value, 0 to 1; the first panel plots all causes of death, the second only Bronchitis (Croup))
7 Obstacles
The greatest obstacle we had to overcome was finding data in a usable format, or finding a way to convert the
data from scanned PDF Census documents to digitized tables. We attempted to look into other sources, but
found nothing usable for our specific time period. The next step was to try OCR (Optical Character
Recognition) software to convert the images to either a CSV or text file. With the census data
in PDF format and the OCR software consistently returning unusable interpretations of our data due to the
poor quality of the scans, the only remaining option to progress with the project was manual data entry.
Although the manual input was not an academically challenging task, it certainly was tedious. Professor
Ewing named eleven causes listed in the 1890 Census that he believed could have been named the cause
of death instead of the Russian Flu. Those causes are mainly respiratory and are as follows: Bronchitis,
Consumption, Diphtheria, Enteric Fever, Malarial Fever, Scarlet Fever, Meningitis, Pneumonia, Respiratory
System Disease, Scrofula, and “Unknown”. We had the option of performing our statistical analysis at the
city level rather than the state level, although there were numerous issues with this method. First and foremost,
this would require manually entering roughly 18,000 entries across the three different Censuses, which is
impractical considering our time constraints and that only a handful can be entered per minute. Another
issue with using city-level data is that the cities vary greatly across the Censuses and are frequently placed
into a “group” category which also varies between the Censuses. For example, Group 1 in 1880 for any given
state could be vastly different than Group 1 for the same state in 1890.
8 Results
The results of our project are split into two sub-sections: group-wise and state-wise analysis. The separation
was a product of discussions with Professor Ewing about the geographic spread of the epidemic,
as well as the assumption that certain regions (specifically the Northeast) were inherently under-reported.
These sections highlight the development of our solution and how it built on
previous work as we went forward.
8.1 Group-wise Excess Analysis
Our investigation began with splitting the states into (roughly) equally sized groups that mimicked the spread
of an epidemic across the US, using the La Grippe death count map as a reference for determining group 1
(the under-reported group of populous states including New York, Massachusetts, etc.); the groupings are
visualized in figure 4. To validate that there is a distinct disparity between population totals in these groups
and reported La Grippe deaths, these totals were plotted in bar graphs (seen in figure 5).

Figure 4: Group Separation Over Plot of La Grippe Death Count
It is clear that, while group 1 has comparable population, there is a significant lack of reported La Grippe
deaths. With this established, we move forward in the excess analysis specifically on group 1 with the
assumption that groups 2 through 4 had well-reported deaths because of the geographical separation that
allowed health professionals to identify the Russian Flu later in the epidemic when it appeared in these
locations. Through the methods discussed earlier in the report, excess death rates from Pneumonia and
Bronchitis were calculated and added to the La Grippe death rates reported by the 1890 census for group 1.
In group 1, the death rate from La Grippe per 100,000 persons was determined to be 12.5, while Bronchitis
and Pneumonia had excess death rates of 25.77 and 23.5 respectively. These rates were added together and
used to determine the group 1 flu deaths in table 1. The death count estimate for group 1 was added to the
total from the rest of the groups to get a total estimate. From table 1, we get our preliminary answer to
our guiding question: according to this basic investigation, there were 22,409 victims of the Russian Flu, 72
percent higher than the reported value.
The question that had to be answered at this point was whether 22,409 was the final answer to this problem.
Looking at the primary methodology, there are many limitations to our initial estimations. First, this type of
analysis assumes that these groups are actually significant and that the disease spread in this way. Secondly,
this method glosses over the possibility of variation within each group. For example, Pennsylvania is included
in group 1 because of its high population, geographic location, and low reported La Grippe deaths in its
largest cities; however, from figure 5, one can see it still has (comparatively) more reported deaths than New
York or New Jersey. The inclusion of Pennsylvania in this group could distort the estimates. Additionally,
because of the “well-reported” assumption for the other groups, excess deaths from these groups are not
estimated or counted towards the new total. Lastly, this is not a particularly conservative method for estimating
misreported deaths: natural fluctuation in the observed trend is not accounted for, so the method could
assign more deaths to the excess than is accurate. To address these concerns, excess
analysis was performed for each state, with a tolerance of one standard deviation of the predicted 1890
death rates used for each disease of interest.

Figure 5: Population and La Grippe Death Count in 1890 Group-Wise (bar-chart panels: 1890 Population per Group; 1890 La Grippe Deaths Per Group)

           Reported La Grippe Deaths   Estimated Excess Deaths   Total La Grippe Deaths Plus Excess Deaths
Group 1    2,177                       9,360                     11,537
Total US   13,049                      9,360                     22,409
Table 1: Table of Reported La Grippe Deaths and Estimated Excess Deaths in 1890
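The totals in table 1 and the 72 percent figure quoted in the text can be checked with a short calculation. The sketch below uses only the counts printed in the table; the variable names are illustrative.

```python
# Counts copied from Table 1.
reported_group1 = 2_177   # reported La Grippe deaths, group 1
excess_group1 = 9_360     # estimated excess deaths, group 1
reported_us = 13_049      # reported La Grippe deaths, all groups

# Group 1 estimate: reported deaths plus estimated excess.
total_group1 = reported_group1 + excess_group1

# National estimate: only group 1 contributes excess deaths, because the
# other groups are assumed to be well reported.
total_us = reported_us + excess_group1

# Relative increase over the reported national count, in percent.
pct_increase = 100 * (total_us - reported_us) / reported_us

print(total_group1, total_us, round(pct_increase))
```

The increase works out to about 71.7 percent, which rounds to the 72 percent stated above.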
8.2 State-wise Excess Analysis
After selecting the marker diseases (Bronchitis, Pneumonia, and Consumption), which were historically identified
as increasing in death counts (on average) during the epidemic, an R script was run that filtered through
all states to identify all excess death rates that were both positive (as a negative difference between
predicted and actual death rates in 1890 would signify a decrease in disease death rates and, consequently,
a well-reported state) and above the discussed tolerance to account for fluctuations in death rate trends. A
sample of the code used is directly below.
# cen is the census data frame built from the manually entered tables (loading not shown)

# Removing undocumented states and territories
dak <- c("ND", "SD", "DA", "OK")
cen2 <- subset(cen, !(Abbr %in% dak))
n <- nrow(cen2)  # 46 states in the original run

# Adding death rate per 100,000 for La Grippe
cen2$Grippe.Per <- 100000 * cen2$La.Grippe.1890 / cen2$Total.Population.1890
cen2$Grippe.Per

# Vectors for storing values from the for loops
pred <- vector(mode = "numeric", length = n)      # predicted 1890 death rates
death4sd <- vector(mode = "numeric", length = n)  # predicted 1890 death counts
deaths <- vector(mode = "numeric", length = n)    # excess 1890 death counts

# For loop to get bronchitis death rates (per 100,000) for each state
for (i in 1:n) {
  bron80 <- 100000 * cen2$Bronchitis..Croup..1880[i] / cen2$Total.Population.1880[i]
  bron90 <- 100000 * cen2$Bronchitis.1890[i] / cen2$Total.Population.1890[i]
  bron00 <- 100000 * cen2$Bronchitis.1900[i] / cen2$Total.Population.1900[i]
  diff <- (bron00 - bron80) / 20                  # slope of death trend line
  death <- bron90 - (bron80 + (diff * 10))        # excess death rate
  pred[i] <- bron80 + (diff * 10)                 # predicted death rate 1890
  death4sd[i] <- pred[i] * cen2$Total.Population.1890[i] / 100000  # predicted deaths
  deaths[i] <- death * cen2$Total.Population.1890[i] / 100000      # excess deaths
}

tol <- sd(pred)  # setting tolerance at 1 sd of the predicted rates

# For loop that zeroes out any state whose excess is not above the tolerance.
# As written, deaths[i] is a death count while pred[i] + tol is a rate per 100,000.
for (i in 1:n) {
  if (!(deaths[i] > (pred[i] + tol))) {
    deaths[i] <- 0
  }
}

# Vector with excess death counts (0 for states that were insignificant)
deaths

# Adding the La Grippe plus excess death rate per 100,000 to the data frame
cen2$death1 <- 100000 * (deaths + cen2$La.Grippe.1890) / cen2$Total.Population.1890

Figure 6: Comparison of Death Rates Between Reported La Grippe and With Added Excess from Bronchitis and Pneumonia (two map panels: 1890 La Grippe Deaths Per 100,000; 1890 La Grippe Death Rate Plus Excess Bronchitis and Pneumonia Death Rates)
Both Dakotas were removed from these calculations because of lack of data for each time point for comparison;
nevertheless, their respective La Grippe death totals were added in the final estimation untouched. The
process used, like before, added excess death rates to La Grippe death rates and used each state’s 1890
population to calculate an estimated flu death rate for each state. The map chart visualizations of the
estimated death rates after adding excess disease deaths are shown in figure 6. The map charts show that
by adding these excess death rates from Bronchitis and Pneumonia, we get a more accurate estimation of
total influenza death rates per 100,000. It is important to note that New York now has the highest death
rate in the country, which is realistic for the state that includes the most people and largest cities (both vital
factors for disease spread).
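The state-wise filter can also be sketched in Python. This sketch applies the tolerance on the rate scale, as the prose describes (an excess rate counts only if it is positive and larger than one standard deviation of the predicted 1890 rates across states). That is an interpretation and not a line-by-line port of the R sample, and the dictionary keys and the three example states are hypothetical values, not census data.

```python
from statistics import stdev


def statewise_excess_deaths(states):
    """Return excess deaths per state, zeroing out insignificant states.

    states: list of dicts with hypothetical keys rate_1880, rate_1890,
            rate_1900 (deaths per 100,000) and pop_1890.
    """
    # Predicted 1890 rate from the 1880-1900 linear trend.
    predicted = [
        s["rate_1880"] + (s["rate_1900"] - s["rate_1880"]) / 20 * 10
        for s in states
    ]
    # Tolerance: one sample standard deviation of the predicted rates,
    # matching the behavior of R's sd().
    tol = stdev(predicted)

    excess_deaths = []
    for s, pred in zip(states, predicted):
        excess = s["rate_1890"] - pred
        # Keep only excess rates that are positive and above the tolerance.
        if excess > 0 and excess > tol:
            excess_deaths.append(excess * s["pop_1890"] / 100_000)
        else:
            excess_deaths.append(0.0)
    return excess_deaths


# Three made-up states: a clear spike, a small rise, and a decline.
example = [
    {"rate_1880": 40.0, "rate_1890": 70.0, "rate_1900": 50.0, "pop_1890": 1_000_000},
    {"rate_1880": 60.0, "rate_1890": 52.0, "rate_1900": 40.0, "pop_1890": 500_000},
    {"rate_1880": 50.0, "rate_1890": 50.0, "rate_1900": 60.0, "pop_1890": 2_000_000},
]
result = statewise_excess_deaths(example)
print(result)
```

Here the predicted rates are 45, 50, and 55, so the tolerance is 5 per 100,000. Only the first state's excess of 25 clears it, contributing 250 deaths; the second state's excess of 2 is within the tolerance and the third state's rate fell below its prediction.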
Here we obtained, in our opinion, a more accurate estimation of 1890 US Russian Flu deaths that took into
account state-to-state variation and possible fluctuations in death trends. An answer to our problem, so to
speak, is that there were about 25,477 victims of the Russian Flu, compared to the 22,409 victims estimated
through the group-wise analysis.

               Reported Deaths   Estimated Excess Deaths
Bronchitis     21,420            4,702
Pneumonia      76,578            3,127
Consumption    102,727           4,599
Totals         200,725           12,428
Table 2: Table of Reported Disease Deaths and Estimated Excess Deaths in 1890

                              1890 Influenza Deaths
Total Reported                13,049
Total Estimated with Excess   25,477
Table 3: Table of 1890 La Grippe Deaths Reported and with Estimated Excess Deaths

In order to validate our methodology of including the excess deaths in
the flu death category, we examined how a seemingly unrelated and well-diagnosed disease performed in our
analysis. If our methods are accurate, a disease that was not inflated by misreported influenza deaths, such
as Scarlet Fever, should not report many excess deaths if any at all.
Looking at the group-wise plots for Scarlet Fever in figure 7, we see a steep average drop-off in each group
with no obvious excess in any case. Even as a preliminary visualization, this suggests that Scarlet Fever most
likely will not contribute much excess to any flu death totals, as we should expect from a disease with very
specific symptoms (a red rash) that is less likely to be misdiagnosed. Using the same code seen previously
to calculate excess Scarlet Fever deaths over the prediction tolerance, we get the excess deaths from Scarlet
Fever for each state. From the output, shown below, there were no excess deaths attributed to the disease
save for about 136 in Indiana (possibly an outlier case). This is a good sign that
for diseases that did not spike during 1890 (whether caused by influenza or by another outbreak outside
of our scope), our model does not falsely attribute deaths to the Russian Flu. In this way, our estimation
methods pass this initial validation technique.
> deaths5
[1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[11] 0.0000 0.0000 135.5855 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[21] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[31] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[41] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
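As a quick check on the output above, the nonzero entries can be located directly. The snippet below rebuilds a vector with the same shape as `deaths5` (46 entries, with the printed value 135.5855 in position 13) rather than rerunning the analysis, so it is only a sketch of how one might inspect the result.

```r
# Rebuild the printed vector: 46 states, all zero except entry 13.
deaths5 <- numeric(46)
deaths5[13] <- 135.5855

# Positions of states with any excess Scarlet Fever deaths.
flagged <- which(deaths5 > 0)
flagged

# Total excess that the method would attribute to Scarlet Fever across all states.
sum(deaths5)
```

A single flagged state out of 46 is what the validation argument relies on: a well-diagnosed disease contributes almost nothing to the excess total.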
8.3 Evaluating Our Results
Examining what we accomplished in our results versus what we aspired to do through our design criteria is
extremely important when looking back at the success of our solutions. One issue to note is that our design
criteria were made with our earlier proposed solutions in mind, including the specific quantification system;
however, it is still possible to look at our results using the criteria as a rough guide for success despite
differences in solution methods. We aimed to evaluate our modeling solution by its historical validity,
timeliness, adaptability, and usability. Even after changing our work from SEIR modeling to excess analysis,
our results fit these criteria just as well. While our model may not be as adaptable as we would have liked
(the methodology is only useful for the United States because of our assumptions), it has a certain degree
of validity as discussed previously, is simple to understand in its basic estimation, and is not a technique that takes
considerable time to run for each state. From these criteria alone, our results are successful. Looking at
the same criteria for our visualizations, we come to the same conclusion. Not only are our graphics easy to
understand and appealing, they are, at their core, useful to our client in his continued work with the topic
(especially the Python graphing script).
[Figure 7: 1890 Scarlet Fever Death Rate by Group. Four panels (Groups 1 to 4), each plotting Scarlet Fever deaths per 100,000 (y-axis, 0 to 60) against year (x-axis: 1880, 1890, 1900).]
Despite our perceived success, it is important to establish the limitations of our final solutions and methods
in this project. Primarily, with a data set assumed to be inherently flawed, accuracy is a chief concern.
While we have shown it is possible to estimate Russian Flu deaths in the US, the real question is how
accurate that estimate truly is. One cannot simply go back in time and change how the public recorded deaths,
so this may be a limitation beyond repair. Secondly, the data our team transcribed was state-level. This
low spatial resolution restricts the analysis that is possible. There could be city-to-city variation
(much like that between states) that accounts for different death rates, and there could be discrepancies
between populous, dense cities and small towns that would be very useful to account for in our model. Lastly,
there were many assumptions made that must hold true for any of our results to be meaningful. Chief among
them, the trends in death rates must be naturally occurring and not determined by outside events like separate
outbreaks of our diseases of interest; in other words, these excess deaths must be caused by La Grippe and
must be significant. All these limitations show there may be issues in the estimation methods themselves;
nevertheless, for a topic as sparsely studied as the Russian Flu, our results may serve as a useful basis for
future work.
9 Future Work
We have by no means come up with a definitive answer to the question, “How many people died from the
Russian Flu in the United States from December 1889 to January 1890?” We have statistically supported
the possibility that more people died from the Russian Flu than was previously recorded, but this is not
enough to give an accurate estimate without further analysis. If we had more time, we would want
to apply further statistical analysis to identify outliers that may be skewing the data, in an attempt to get
a slightly more accurate approximation.
Moving forward, Professor Ewing noted that he wanted to look deeper at the city level, but also to
expand from the state level to geographic regions and then to other countries, bringing all of that information
together to see how they compare in a historical and geographical context. He also wanted to look for trends
based on categories such as gender, age, ethnicity, height, weight, and other recorded quantitative categories
to see if anything stood out, such as one group being more susceptible than another. This would have been
an interesting topic for analysis, one that could have been explored through k-means clustering and
multidimensional scaling (MDS) as a means of finding commonalities and trends among observed cases.
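To make the clustering idea concrete, here is a minimal sketch of how k-means and classical MDS might be combined in R. No case-level data was available to us, so the matrix `cases`, its three attributes, and the choice of two clusters are all hypothetical; the values are random numbers generated only to make the example run.

```r
set.seed(1)  # reproducible synthetic data

# ASSUMPTION: 20 hypothetical cases described by three standardized numeric attributes.
cases <- matrix(rnorm(20 * 3), nrow = 20,
                dimnames = list(NULL, c("age", "height", "weight")))

# k-means with an arbitrary choice of 2 clusters and several random starts.
km <- kmeans(cases, centers = 2, nstart = 10)

# Classical MDS: project pairwise Euclidean distances into 2 dimensions.
coords <- cmdscale(dist(cases), k = 2)

# Plot cases in MDS space, coloured by cluster membership.
plot(coords, col = km$cluster, pch = 19, xlab = "MDS 1", ylab = "MDS 2")
```

With real records, categorical attributes such as gender or ethnicity would need a different distance measure than the Euclidean distance used here.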
Unfortunately, practically any future work on our topic, including our past ideas for mathematical models
(differential and spatio-temporal SEIR modeling) as well as the previously stated clustering and
MDS visualizations, relies largely on the existence of usable data, and that is the greatest issue. One thought
that we had considered was scraping death certificates from ancestry websites and other databases, which
could be worth looking into if all other resources have been exhausted.
References
[1] J. F. Brundage, Cases and deaths during influenza pandemics in the United States, American Journal
of Preventive Medicine, 31 (2006), pp. 252–256.
[2] A.-J. Valleron, A. Cori, S. Valtat, S. Meurisse, F. Carrat, and P.-Y. Boelle, Transmissibility
and geographic spread of the 1889 influenza pandemic, Proceedings of the National Academy of Sciences,
107 (2010), pp. 8778–8781.
[3] S. Valtat, A. Cori, F. Carrat, and A.-J. Valleron, Age distribution of cases and deaths during
the 1889 influenza pandemic, Vaccine, 29 (2011).
  • 1. Approximating a More Accurate Count of Deaths Caused by the Russian Flu in the United States Elijah Fiore, Robert Legge, and Jonathan Walton December 2, 2016 The Russian Flu was a pandemic influenza outbreak that originated around Russia and Central Asia in 1889 and moved West to Europe and the Americas, killing roughly one million people worldwide; however due to poor reporting in certain areas and possible false identification of similar causes of death, the death toll is being reexamined by Professor Tom Ewing at Virginia Tech to find a more accurate approximation of the number of people who died as a direct result of the Russian Flu. Using data from the Report on Vital and Social Statistics from the 1880, 1890, and 1900 US Censuses, “Excess Analysis” was used as a means of statistically finding the potential number of misidentified causes of death. With group-wise methodology, we determined that there were 22,409 victims of the Russian Flu while doing the analysis for each state and adding a tolerance to adjust for natural fluctuations in death rates, we obtained a slightly higher answer at 25,477 deaths. While this did not result in a definitive estimation due to the lack of available data and time constraints, it can be used as a stepping stone for future research and analysis. 1
  • 2. 1 Problem Statement For several years, Tom Ewing, a professor at Virginia Tech in the College of Liberal Arts and Human Sciences, has been researching the Russian Flu and its significance in a historical context. Throughout his time researching, he has sought the help of multiple groups of students, including ourselves, to aid in his efforts. While Professor Ewing’s research covers multiple aspects of the disease outbreak, our main objective is to accurately estimate the number of deaths that occurred from the Russian Flu between December 1889 and January 1890 within the United States. Professor Ewing believes that the count of those infected and killed by the Russian Flu is extremely low especially compared to the number of deaths reported during the Spanish Flu outbreak 30 years later. While many researchers have studied the Spanish Flu because of its notoriety, little is currently known about the Russian Flu. Professor Ewing noted that much of the data available on the causes of death though 1889 and 1890 was inconsistent due to faults in reporting and other confounding variables. Our work on this problem aims to marry historical research with data science to correct for the statistical shortcomings in Professor Ewing’s data. This problem is important in regards to observing other historical examples of disease and perhaps future occurrences along with media coverage and reporting of these events. Our results could prove beneficial to other historians and scholars alike whose studies focus on disease outbreak as well. We hope to extrapolate upon Professor Ewing’s findings in order to provide accurate numerical context to this historical issue. 2 Ethical Considerations Although the intentions of our project are purely academic, by the very nature of our project, we are claiming that many causes of death for Americans of the 1890’s were inaccurately reported and that the true cause of death is the Russian Flu. 
It could be argued that trying to change one’s cause of death more than a century later when the matter has long been put to rest is unethical. Furthermore, our model is based on an assumption by Professor Ewing. While his assumption is very educated and his research so far supports his claim that the current death count is inaccurate, our model is almost entirely based on Professor Ewing’s inference being correct. Our model works by comparing several years of US Census data and assumes that every death for a handful of reasons that falls outside of the norm was due to the Russian Flu. Reporting any “new Russian Flu death count” as fact rather than a very broad estimate would be a very inaccurate representation of the model. The model answers a historical question and should be left in a historical context. 3 Literature Review Professor Ewing’s theory is well cited in his exploratory article on the Vital Statistics data. He illustrates a lack of reported “La Grippe” cases paired with an abundance of other reported diseases such as pneumonia. Outside of Professor Ewing’s research, only a handful of articles exist on the Russian Flu. We have found in our research a French study examining age distribution of the epidemic based on census data from 15 countries. The authors found that the clinical attack rate in these European countries were high and relatively constant within the age range of 1-60 years, but less outside this range in the extremes. This paper had similar aspirations to our project but with different data not useful for our purposes [3]. Additionally, a paper by J.F. Brundage investigates case rates and death rates from different strains of influenza in the American Journal of Preventative Medicine. While his goal is similar to ours in that he wanted to create a clear picture of the dynamics of an influenza epidemic, he focused on the Spanish and Asiatic strains and looked at case rates and attack rates based on age. 
The disease is significantly lesser known in light of the significantly deadlier Spanish Flu which plagued the world less than 30 years later [1]. Lastly, a paper published by the Proceedings of the National Academy of Sciences explored the transmissi- 2
  • 3. bility and geographic spread of the Russian Flu using military archives in Europe. The authors utilized a technique we hope to emulate in our project that used disease dynamics determined by this geographic data in order to estimate a SEIR model [2]. 4 Design Criteria Before beginning any work on the final product, we created a set of design criteria to ensure that our final product met the needs of our client; however, as we progressed, we began to realize what was possible with this project and what was not. As a result, changes were made to our design methodology and considerations several times through the course of the semester in an attempt to take these obstacles into account while still providing the best solution to the problem at hand. Primarily, through some thought about what our final product would be, we decided to split our efforts into sub-problems that would be completed on a step-by-step basis and each had their own design criteria. 1. Statistical Analysis - This is less of a sub-problem, as it is more of a preliminary step in order to continue with the following sub-problems. As such, this section will not have design criteria/methods to weigh. This part of our solution will include comparing data from the 1890 Census to the 1880 and 1890 Censuses. 2. Modeling - This is the first sub-problem with design considerations. The modeling section refers to the numerical modeling of the disease that is the backbone to our success. Initially, we hoped to formulate a Differential or an Integrated Nested Laplace Approximation (INLA/Spatio-temporal) SEIR model. The criteria for the modeling portion of our solution are as follows: • Historical Validity. This refers not only to how accurate our model is in detailing the fatalities, but also to how it stands against much of the research Professor Ewing has already done on the subject. • Usability. 
Professor Ewing and his team are mostly historians who do not posses the technical skills to heavily adjust our code or math in the future. It is our goal to create a model that does not require a technical background to use and would be rated a 1 or 2 on an ease-of-use scale of 1 (easy) to 5 (hard) by Professor Ewing and his team. • Model Adaptability. The scope of our project is only within the United States, but it is essential that our model be mindful of census parameters used in other countries. It would be ideal to create a model that can easily be manipulated to be used at various geographic levels, such as country, region, state, and city level. Since reporting between US Censuses are not consistent, it is unlikely that consistency across most geographic levels exists, so we can accommodate by making our process fairly simple and easily adaptable. • Timeliness. We did not want to take too much time creating our model due to the fact that we were scheduled to give a presentation in early October for Professor Ewing’s colleagues and anyone else who was interested in attending. This presentation was pushed back to the end of November due to issues from both our side and the client’s. This was added in order to separate the time-consuming modeling techniques from ones that are reasonable. 3. Visualizations - This sub-problem was originally a criterion of our solution rather than a section by itself, but we found that it was important to separate the modeling form the data visualizations as they involve very different methods that should be graded apart from one another. Our technique options for visualizations include: Heat maps, time series, clustering (or factor based visualizations), and map charts. The design criteria for the visualizations are new, but fairly self-explanatory and will be touched on in the evaluation of the visual methods. 3
  • 4. • Usefulness. It is important that the information portrayed can be used for future analysis and shows some level of substance. • Ease of Understanding. We want to minimize unnecessary content in order to avoid any confusion as to what is being shown by our visualizations. For our project, simple and effective is better than complex and bulky. • Run-Time. This is a simple consideration as we do not want our model to take too much time to run, especially if we are handing it off to Professor Ewing and his team. We need it to run numerous operations with multiple states concurrently in order for it to be convenient to our client and audience. • Aesthetic Appeal. In addition to functionality and correctness, it is important that our visual- izations be usable for presentations and papers. We want to ensure that, while these plots and charts are useful, they also look nice and are not overly simplistic or cluttered. 4.1 Quantitative Ranking of Design Criteria With these criteria in mind we can weigh each according to their value to the project. Using a pairwise comparison chart, we were able to prioritize our design criteria to narrow down our options and allow us to select the best approach. For our mathematical models, this is: Criterion Weight Historical Validity .40 Timeliness .30 Usability .20 Adaptability .10 “Historical Validity” has the highest weight to our project, as the other criteria have no merit if our model does not have an acceptable level of accuracy. “Timeliness” has the next highest weight. Professor Ewing, had set a date in early October for a presentation, which was eventually moved back to the end of November, so we knew we couldn’t spend too much time developing a model, otherwise our work would be of limited use to him. “Usability” falls below timeliness, as it is more important to our sponsor to have accurate, timely results than to be able to revisit and revise our results. Our lowest weighted criterion is “Adaptability”. 
Our model cannot be so specific, as it will only work in the United States. Instead we need to keep in mind data that would be equally available in other countries although those analyses are outside the scope of our project. Adaptability is something we can focus on after the main goals of the project have been satisfied. In comparison, our scoring for our visualizations is as follows: Criterion Weight Usefulness .40 Ease of Understanding .30 Run Time .20 Aesthetic Appeal .10 We see that “Usefulness” is weighted the highest because if the model does not help to solve the main problem at hand and no further insight can be gained from it, then there is little point to the model and the model ultimately fails. Following this is “Ease of Understanding” as it is imperative that our solution be something that can be one that can easily be utilized and understood by Professor Ewing and the rest of his research time once the semester is over. “Run Time” is less of an issue due to the fact that we do not foresee any issues with our visualizations taking inordinate amount of time to render, but we are aware that 4
  • 5. any solution created should be completed within a reasonable amount of time. Lastly, we have “Aesthetic Appeal”. We are aware that any visualization we create may be used in future presentations or simply for research purposes, and so they should look decent, but this factor is also not nearly as important as the previously stated factors in terms of gaining further insight from the data through our visualizations. 5 Summary of Techniques Used to Validate Model Due to the nature of our data and analysis technique, there were no intuitive ways to validate our results. When SEIR modeling was an option, the validation technique postulated was to compare the disease numbers simulated to a determined “well-reported” state’s numbers; however, with the assumption that flu numbers may be entirely inaccurate and the departure from a SEIR modeling solution due to inadequate data, this method was unusable. Our solution to this was to use an unrelated cause of death (or one with historically less probability of being misdiagnosed) and to ensure our calculations did not detect differences in flu death rate due to the illness. This validation of results will be further discussed in the results portion of the report. 6 Selected Design Solutions A natural part of data analysis problems is figuring out the data one wants or need and rectifying this with what is actually available; this was indeed an important portion of our endeavor and this issue inspired countless changes to the solution methods proposed earlier in the report. Though epidemic modeling was a main aim of the project through much of the semester, there were some challenges with the data that made this sub-problem (or modeling solution) ineffective, and perhaps not possible, within our scope. For traditional SEIR models, changes in population compartments are estimated and simulated over time us- ing differential equations with parameters that directly affect the disease’s dynamics. 
Our original methodol- ogy was to determine these parameters (like contact rate, incubation period, etc.) through statistical analysis of Vital Statistics from the 1880, 1890, and 1900 censuses. For SEIR to be practical, there must be accurate parameter estimates to abstract the real world in an accurate manner. However, through exploration of available data, there was simply not enough reliable data on the number of influenza cases or the attributes of the disease itself that was usable for this type of analysis. The most comprehensive set available was the collection of death counts by cause of death found in the very census we assume is inconsistent for our problem. In addition, SEIR modeling is generally used to track populations and disease outbreak, not total deaths, which is our goal in this project. As we thought more in depth about the problem and data at hand, it became clearer that SEIR was not an appropriate method for estimating deaths in the US from the Russian Flu at this stage in the investigation. Due to this realization, our methodology was updated to a reduced two-sectioned approach: statistical analysis (calculating falsely attributed upper respiratory deaths through examining trends in death rates) and visualizations (map charts, bar graphs, and plotting software for Professor Ewing). 6.1 Statistical Analysis: Excess Analysis of Marker Diseases Through our initial talks with Professor Ewing and general research on the flu outbreak in 1890, we found discrepancies in La Grippe death counts that were the basis to the investigation. This was primarily seen through the inherent disconnect between the most populous states around the Northeast US and the lack of reported La Grippe deaths in these areas, highlighted by figure 1. 
Many of the Northeastern states (and eastern states in general) that had high populations and greater density reported fewer La Grippe deaths. This goes against normal disease dynamics, since large cities would be expected to produce a greater spread of influenza. The discrepancy is most pronounced in the state of New York, which had the highest
population in 1890 (around 6.1 million) with a relatively low death count (reported as 298). By comparison, Virginia, a smaller state with a population of about 1.7 million at the time, reported 575 deaths in the 11th census.

Figure 1: Population in 1890 and La Grippe Death Count in 1890 (two map panels: “1890 Population” and “1890 La Grippe Deaths”)

With guidance from Professor Ewing, we established an assumption for the cause of this “under-reporting” in states like New York using the historical context of the Russian Flu. In the early days of the epidemic, doctors in the states affected first were not aware of, or not looking for, influenza in their diagnoses and classified deaths under diseases with similar symptoms, mostly upper respiratory diseases. The notion of under-reporting inspired our statistical method for estimating flu deaths by looking specifically at the trends in these marker diseases through what we called “excess analysis”. Excess analysis refers to examining the natural movement of death rates in similar diseases and comparing the actual rate in 1890 with the 1890 rate predicted from the trends to isolate the excess, or additional deaths outside the norm. This is similar to the method the CDC uses to track influenza, which combines pneumonia and influenza deaths into one category. We assume that these excess deaths can be attributed to the misreporting of the Russian Flu. Figure 2 shows a simple visualization of this identification.
In our primary analysis that looked at geographical groups of states, we used formulas similar to the following sample formulas to calculate the additional deaths from the historically relevant diseases Bronchitis, Consumption, and Pneumonia:

\[ \text{Trend rate} = \frac{\text{Bronchitis death rate}_{1900} - \text{Bronchitis death rate}_{1880}}{20} \tag{1} \]

\[ \text{Predicted death rate}_{1890} = (\text{Trend rate} \times 10) + \text{Bronchitis death rate}_{1880} \tag{2} \]

\[ \text{Excess} = \text{Actual death rate}_{1890} - \text{Predicted death rate}_{1890} \tag{3} \]

\[ \text{Estimated death rate} = \text{La Grippe rate} + \text{Marker excess} \tag{4} \]

\[ \text{Estimated deaths} = \frac{\text{Estimated death rate} \times \text{Total 1890 population}}{100{,}000} \tag{5} \]

Later in our analysis when we performed excess calculation for every state, we also added a tolerance to the excess death rates to make a more conservative estimate of what death rates in 1890 were actually out of the norm. We chose the tolerance to be one standard deviation away from the predicted death rate of each disease. The tolerance acted as a filter for adding only significant departures from death trends to the total. All calculations in this analysis were performed in R.
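As an illustration, equations (1) through (5) can be sketched in a few lines of Python. The calculations in this report were performed in R; the function names and the numbers below are hypothetical and were chosen only to make the arithmetic easy to follow, so they are not census values.

```python
def predicted_rate_1890(rate_1880, rate_1900):
    """Equations (1)-(2): linear trend between the 1880 and 1900 rates."""
    trend_rate = (rate_1900 - rate_1880) / 20  # change in rate per year
    return rate_1880 + trend_rate * 10


def excess_rate(rate_1880, rate_1890, rate_1900):
    """Equation (3): actual 1890 rate minus the rate predicted by the trend."""
    return rate_1890 - predicted_rate_1890(rate_1880, rate_1900)


def estimated_deaths(grippe_rate, marker_excess_rates, population_1890):
    """Equations (4)-(5): add marker excess to the La Grippe rate, then scale."""
    estimated_rate = grippe_rate + sum(marker_excess_rates)
    return estimated_rate * population_1890 / 100_000


# Hypothetical death rates per 100,000 (illustrative only)
bronchitis_excess = excess_rate(rate_1880=40.0, rate_1890=70.0, rate_1900=50.0)
print(bronchitis_excess)                                       # 25.0
print(estimated_deaths(10.0, [bronchitis_excess], 1_000_000))  # 350.0
```

Here the trend predicts a rate of 45 per 100,000 for 1890, so an observed rate of 70 leaves an excess of 25 per 100,000, which is then added to the reported La Grippe rate before converting back to a death count.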
Figure 2: Identifying Excess in Bronchitis Death Rates

6.2 Visualization Approaches

In addition to finding a viable solution through the application of statistical analysis, visualizations help communicate these findings. The types we considered include heat maps that show the disease’s impact at the city, state, and national level, time series plots that show the spread of the disease, and clustering to look for correlations between specific fields such as age, gender, race, and geographic location. Map charts, bar charts, and time series plots were the techniques selected due to their viability for the data and results we obtained.

For our project, map charts were selected to visualize death rates, counts, and populations of the respective states in our set. This is the simplest way to show the geographical component of influenza and death reporting. These map charts were created in RStudio with the ggplot and maps packages. In addition, bar charts were utilized as an exploratory tool to identify marker diseases that had spikes in death rates in 1890; this is seen in figure 2. The bar graphs were also created using ggplot and a graphical R function called multiplot.

Finally, as a deliverable to Professor Ewing, we created an interactive tool coded in Python in Enthought Canopy for on-the-spot visualization of disease deaths from the Census data. The main package used to create this tool was matplotlib, specifically the pyplot and widgets modules. As seen in figure 3, the sliders at the bottom of the window allow the user to customize what data is plotted. The user can choose which state and which disease to plot, or set both the State and Cause of Death sliders to the left at zero to show all states and all diseases, respectively. The plotted points show the normalized number of deaths for each year, and clicking on a point returns the actual death count.
This is a useful tool because it can be easily adapted to any similarly structured data frame. This will aid Professor Ewing if he chooses to drill down to the city level, shift his focus to another country, or observe other causes of death around the time period.
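The slider behaviour described above can be sketched without any plotting code. In this minimal Python sketch, slider position 0 selects everything and position k selects the k-th entry. The helper names, the example data, and the choice to normalize each series by its own maximum are all assumptions made for illustration, since the report does not state how the tool normalizes counts.

```python
def select(options, slider_position):
    """Position 0 means 'all'; position k (1-based) picks a single option."""
    if slider_position == 0:
        return list(options)
    return [options[slider_position - 1]]


def normalize(counts):
    """Scale a series to [0, 1] by its maximum (assumed normalization)."""
    peak = max(counts)
    return [c / peak if peak else 0.0 for c in counts]


causes = ["Bronchitis", "Pneumonia", "La Grippe"]   # hypothetical subset
deaths_by_year = {1880: 200, 1890: 400, 1900: 300}  # hypothetical counts

print(select(causes, 0))                         # all three causes
print(select(causes, 2))                         # ['Pneumonia']
print(normalize(list(deaths_by_year.values())))  # [0.5, 1.0, 0.75]
```

In a matplotlib-based tool, a slider callback could pass its integer position to a helper like `select` and redraw the normalized series for the chosen state and cause.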
Figure 3: Demonstration of Plotting Interactive Tool (two panels titled “Normalized Death Counts by Disease”, plotting normalized counts from 0.0 to 1.0 against Year from 1880 to 1900 with the state filter set to New York: one with all causes of death selected, one with only Bronchitis (Croup) selected)

7 Obstacles

The greatest obstacle we had to overcome was finding data in a usable format, or finding a way to convert the data from scanned PDF Census documents to digitized tables. We looked into other sources, but found nothing usable for our specific time period. The next step was to try OCR (Optical Character Recognition) software to convert the images to either a CSV or text file. With the census data in PDF format and the OCR software consistently returning unusable interpretations of our data due to the poor quality of the scans, the only remaining option for progressing with the project was manual data entry. Although the manual input was not an academically challenging task, it certainly was tedious.

Professor Ewing named eleven causes listed in the 1890 Census that he believed could have been recorded as the cause of death instead of the Russian Flu. These causes are mainly respiratory and are as follows: Bronchitis, Consumption, Diphtheria, Enteric Fever, Malarial Fever, Scarlet Fever, Meningitis, Pneumonia, Respiratory System Disease, Scrofula, and “Unknown”. We had the option of performing our statistical analysis at the city level rather than the state level, although there were numerous issues with this method.
First and foremost, this would require manually entering roughly 18,000 entries across the three Censuses, which is impractical considering our time constraints and the fact that only a handful of entries can be made per minute. Another issue with using city-level data is that the cities vary greatly across the Censuses and are frequently placed into a “group” category that also varies between the Censuses. For example, Group 1 in 1880 for any given state could be vastly different from Group 1 for the same state in 1890.

8 Results

The results of our project are split into two sub-sections: group-wise and state-wise analysis. The separation was a product of discussions with Professor Ewing about the geographical spread of the epidemic, as well as the assumption that certain regions (specifically the Northeast) were inherently under-reported. These sections highlight the development of our solution and how it built on previous work as we went forward.

8.1 Group-wise Excess Analysis

Our investigation began with splitting the states into (roughly) equally sized groups that mimicked the spread of an epidemic across the US, using the La Grippe death count map as a reference for determining group 1
(the under-reported group of populous states including New York, Massachusetts, etc.); the groupings are visualized in figure 4.

Figure 4: Group Separation Over Plot of La Grippe Death Count

To validate that there is a distinct disparity between population totals in these groups and reported La Grippe deaths, these totals were plotted in bar graphs (seen in figure 5). It is clear that, while group 1 has a comparable population, it has a significant lack of reported La Grippe deaths. With this established, we moved forward with the excess analysis on group 1 specifically, under the assumption that groups 2 through 4 had well-reported deaths because their geographical separation allowed health professionals to identify the Russian Flu by the time it appeared in these locations later in the epidemic.

Through the methods discussed earlier in the report, excess death rates from Pneumonia and Bronchitis were calculated and added to the La Grippe death rate reported by the 1890 census for group 1. In group 1, the La Grippe death rate per 100,000 persons was determined to be 12.5, while Bronchitis and Pneumonia had excess death rates of 25.77 and 23.5 respectively. These rates were added together and used to determine the group 1 flu deaths in table 1. The death count estimate for group 1 was added to the total from the rest of the groups to get a total estimate. From table 1, we get our preliminary answer to our guiding question: according to this basic investigation, there were 22,409 victims of the Russian Flu, 72 percent higher than the reported value.

The question that had to be answered at this point was whether 22,409 was the final answer to this problem. Looking at the primary methodology, there are many limitations to our initial estimations. First, this type of analysis assumes that these groups are actually significant and that the disease spread in this way. Secondly, this method glosses over the possibility of variation within each group.
For example, Pennsylvania is included in group 1 because of its high population, geographic location, and low reported La Grippe deaths in its largest cities; however, from figure 5, one can see that it still has (comparatively) more reported deaths than New York or New Jersey. The inclusion of Pennsylvania in this group could skew the estimates. Additionally, because of the “well-reported” assumption for the other groups, excess deaths from these groups are not estimated or counted towards the new total. Lastly, this is not a particularly conservative method for estimating misreported deaths: natural fluctuation in the observed trend is not accounted for, so the method could assign more deaths to the excess than is accurate. To address these concerns, excess analysis was performed for each state where a tolerance of one standard deviation of the predicted 1890
death rates was used for each disease of interest.

Figure 5: Population and La Grippe Death Count in 1890 Group-Wise (two bar charts by group, 1 through 4: “1890 Population per Group” and “1890 La Grippe Deaths Per Group”)

              Reported La Grippe Deaths    Estimated Excess Deaths    Total La Grippe Deaths Plus Excess Deaths
  Group 1     2,177                        9,360                      11,537
  Total US    13,049                       9,360                      22,409

Table 1: Table of Reported La Grippe Deaths and Estimated Excess Deaths in 1890

8.2 State-wise Excess Analysis

After selecting the marker diseases (Bronchitis, Pneumonia, and Consumption), which were historically identified as increasing in death counts (on average) during the epidemic, an R script was run that filtered through all states to identify all excess death rates that were both positive (as a negative difference between predicted and actual death rates in 1890 would signify a decrease in disease death rates and, consequently, a well-reported state) and above the discussed tolerance to account for fluctuations in death rate trends. A sample of the code used is directly below.

    # Removing undocumented states
    dak <- c("ND", "SD", "DA", "OK")
    cen2 = subset(cen, !(Abbr %in% dak))

    # Adding death rate (per 100,000) for La Grippe
    cen2$Grippe.Per = 100000*cen2$La.Grippe.1890/cen2$Total.Population.1890
    cen2$Grippe.Per

    # Vectors for storing death values from the for loops
    pred = vector(mode="numeric", length=46)
    death4sd <- vector(mode="numeric", length=46)
    deaths <- vector(mode="numeric", length=46)

    # For loop to get bronchitis death rates for each state
    for (i in 1:46){
      bron80 = 100000*cen2$Bronchitis..Croup..1880[i]/cen2$Total.Population.1880[i]
      bron90 = 100000*cen2$Bronchitis.1890[i]/cen2$Total.Population.1890[i]
      bron00 = 100000*cen2$Bronchitis.1900[i]/cen2$Total.Population.1900[i]
      diff = (bron00-bron80)/20                                   # slope of death trend line
      death = bron90-(bron80+(diff*(10)))                         # excess death rate
      pred[i] = (bron80+(diff*(10)))                              # predicted death rate 1890
      death4sd[i] = pred[i]*cen2$Total.Population.1890[i]/100000  # predicted deaths
      deaths[i] = death*cen2$Total.Population.1890[i]/100000      # excess deaths
    }
    tol = sd(pred)  # setting tolerance at 1 sd

    # For loop that checks whether the excess is over tolerance and positive
    # (note: deaths[i] is a count, while pred[i] and tol are rates per 100,000)
    for (i in 1:46){
      if (!(deaths[i] > (pred[i]+tol))){
        deaths[i] = 0
      }
    }

    # Vector with excess death counts (with 0 for states that were insignificant)
    deaths

    # Adding vector of death rates to the data frame
    cen2$death1 = 100000*(deaths+cen2$La.Grippe.1890)/cen2$Total.Population.1890

Figure 6: Comparison of Death Rates Between Reported La Grippe and With Added Excess from Bronchitis and Pneumonia (two map panels: “1890 La Grippe Deaths Per 100,000” and “1890 La Grippe Death Rate Plus Excess Bronchitis and Pneumonia Death Rates”)

Both Dakotas were removed from these calculations because of a lack of data at each time point for comparison; nevertheless, their respective La Grippe death totals were added to the final estimation untouched. The process, as before, added excess death rates to La Grippe death rates and used each state’s 1890 population to calculate an estimated flu death rate for each state. The map chart visualizations of the estimated death rates after adding excess disease deaths are shown in figure 6. The map charts show that by adding these excess death rates from Bronchitis and Pneumonia, we get a more accurate estimation of total influenza death rates per 100,000.
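The tolerance filter described in the text can also be sketched in Python. This is one reading of the prose (keep a state's excess only when its actual 1890 rate exceeds the predicted rate by more than one standard deviation of the predicted rates), not a transcription of the R script above. The function name and the three sets of rates are hypothetical.

```python
from statistics import stdev


def significant_excess(rates):
    """rates: {state: (rate_1880, rate_1890, rate_1900)}, all per 100,000.

    Returns {state: excess_rate}, with 0.0 where the excess is not above
    the tolerance (assumed here to be compared on the rate scale).
    """
    # Predicted 1890 rate from the linear 1880-1900 trend
    predicted = {
        state: r80 + (r00 - r80) / 20 * 10
        for state, (r80, _, r00) in rates.items()
    }
    tol = stdev(predicted.values())  # sample standard deviation, like R's sd()

    result = {}
    for state, (_, r90, _) in rates.items():
        excess = r90 - predicted[state]
        result[state] = excess if excess > tol else 0.0
    return result


# Hypothetical rates: predicted 1890 values are 45, 35, and 55 (sd = 10)
rates = {
    "A": (40.0, 70.0, 50.0),  # excess 25, above tolerance
    "B": (30.0, 37.0, 40.0),  # excess 2, within tolerance
    "C": (60.0, 50.0, 50.0),  # excess -5, negative
}
print(significant_excess(rates))  # {'A': 25.0, 'B': 0.0, 'C': 0.0}
```

Only state A contributes excess here; B's small positive departure and C's negative departure are both filtered out, which is the conservative behaviour the tolerance is meant to provide.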
It is important to note that New York now has the highest death rate in the country, which is realistic for the state with the most people and the largest cities (both vital factors for disease spread). Here we obtained what is, in our opinion, a more accurate estimation of 1890 US Russian Flu deaths, one that takes into account state-to-state variation and possible fluctuations in death trends. An answer to our problem, so to speak, is that there were about 25,477 victims of the Russian Flu compared to the 22,409 estimated victims
through the Group-wise analysis.

                 Reported Deaths    Estimated Excess Deaths
  Bronchitis     21,420             4,702
  Pneumonia      76,578             3,127
  Consumption    102,727            4,599
  Totals         200,725            12,428

Table 2: Table of Reported Disease Deaths and Estimated Excess Deaths in 1890

                                 1890 Influenza Deaths
  Total Reported                 13,049
  Total Estimated with Excess    25,477

Table 3: Table of 1890 La Grippe Deaths Reported and with Estimated Excess Deaths

In order to validate our methodology of including the excess deaths in the flu death category, we examined how a seemingly unrelated and well-diagnosed disease performed in our analysis. If our methods are accurate, a disease that was not inflated by misreported influenza deaths, such as Scarlet Fever, should report few excess deaths, if any at all. The Group-wise plots for Scarlet Fever in figure 7 show a steep average drop-off in each group with no obvious excess in any case. Even from this preliminary visualization, one can see that Scarlet Fever most likely will not contribute much excess to any flu death totals, as we should expect from a disease with very specific symptoms (red rash) that is less likely to be misdiagnosed. Using the same code seen previously to calculate excess Scarlet Fever deaths over the prediction tolerance, we get the excess deaths from Scarlet Fever for each state. From the output, there were no excess deaths attributed to the disease save for 136 in Indiana (possibly an outlier case); this output can be seen below. The resulting output is a good sign that for diseases that did not spike during 1890 (whether caused by influenza or by another outbreak outside of our scope), our model does not falsely attribute deaths to the Russian Flu. In this way, our estimation methods pass this initial validation technique.
    > deaths5
     [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
    [11] 0.0000 0.0000 135.5855 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
    [21] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
    [31] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
    [41] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

8.3 Evaluating Our Results

Comparing what we accomplished in our results with what we aspired to do through our design criteria is extremely important when looking back at the success of our solutions. One issue to note is that our design criteria, including the specific quantification system, were made with past solutions in mind; however, it is still possible to evaluate our results using the criteria as a rough guide for success despite the differences in solution methods. We aimed to evaluate our modeling solution by its historical validity, timeliness, adaptability, and usability. Even with the change from SEIR modeling to excess analysis, our results fit these criteria just as well. While our model may not be as adaptable as we would have liked (the methodology is only useful for the United States because of our assumptions), it has a certain degree of validity as discussed previously, is simple to understand in its basic estimation, and does not take considerable time to run for each state. By these criteria alone, our results are successful. Looking at
the same criteria for our visualizations, we come to the same conclusion. Not only are our graphics easy to understand and appealing, they are, at their core, useful to our client in his continued work with the topic (especially the Python graphing script).

Figure 7: 1890 Scarlet Fever Death Rate by Group (four panels, “Scarlet.Fever Deaths per 100,000” in Groups 1 through 4, plotted against Year for 1880, 1890, and 1900)

Despite our perceived success, it is important to establish the limitations of our final solutions and methods in this project. Primarily, with a data set assumed to be inherently flawed, accuracy is a chief concern. While we have shown that it is possible to estimate Russian Flu deaths in the US, the real question is how accurate the estimate truly is. One cannot simply go back in time and change how the public recorded deaths, so this may be a limitation beyond repair. Secondly, the data our team transcribed was state-level. This created the issue of low spatial resolution, which restricts the analysis possible. There could be city-to-city variation (much like that between states) that accounts for different death rates, and there could be discrepancies between populous, dense cities and small towns that would be very useful to account for in our model. Lastly, many assumptions were made that must hold true for any of our results to be meaningful. Chief among them, the trends in death rates must be naturally occurring and not determined by outside events such as separate outbreaks of our diseases of interest; in other words, the excess deaths must be caused by La Grippe and must be significant.
All these limitations show that there may be issues in the estimation methods themselves; nevertheless, for a topic as sparsely documented as the Russian Flu, our results provide an important basis for future work.

9 Future Work

We have by no means come up with a definitive answer to the question, “How many people died from the Russian Flu in the United States from December 1889 to January 1890?” We have statistically supported the possibility that more people died from the Russian Flu than was previously recorded, but it is not
enough to be able to give an accurate estimate without further analysis. If we had more time, we would want to carry out further statistical analysis to identify outliers that may be skewing the data, in an attempt to get a slightly more accurate approximation.

Moving forward, Professor Ewing noted that he wanted to look deeper into the city level, but also to expand from the state level to geographic regions and then to other countries, and to bring all of that information together to see how they compare in a historical and geographical context. He also wanted to look for trends based on categories such as gender, age, ethnicity, height, weight, and other recorded quantitative categories to see if anything stood out, such as one group being more susceptible than another. This would have been an interesting topic for analysis that could have been explored through k-means clustering and multidimensional scaling (MDS) as a means of finding commonalities and trends among observed cases.

Unfortunately, practically any future work on our topic and our past ideas for mathematical models, including Differential and Spatio-temporal SEIR modeling as well as the previously stated clustering and MDS visualizations, relies largely on the existence of usable data, and that is the greatest issue. One idea we considered was scraping death certificates from ancestry websites and other databases, which could be worth looking into if all other resources have been exhausted.