STAT 656 – Spring 2011 Shootout
Group #1 (Team SATURN)
Storm Impacts on Infectious Disease Propagation
Prepared By: Charles Gordon, Abigail Green, Uday Hejmadi, Prabu Krishnamurthy, Bhargava Lakkaraju, Deepthi Uppalapati, Li Yun Zhang
May 9, 2011
Executive Summary:
Even in the 21st century, weather plays a large role in people's individual lives as well as in the communities in which they live. Weather events can affect the propagation of infectious diseases by causing people to spend more time indoors in close contact with each other. Given data on the number of health facility admissions by Diagnostic Related Group (DRG) infectious diseases each week in various area codes for several years, along with storm and weather data for the preceding weeks, and factoring in the minimum and maximum incubation periods for all specified DRGs, a model was built to find relationships between the predictor variables and the number of admits. The analysis shows that some storms (Flood, Cold and Wind) can cause a statistically significant increase in the number of admissions, while other storms (Winter and Thunderstorms) do not play a decisive role. However, the storm and weather data have meaningful interactions with Age Groups, which correspond to varying phases of biological development, as well as with the area's population and the different DRGs themselves. Using the selected model, healthcare management can forecast variation in healthcare usage and plan accordingly.

Appendix A details the necessary steps to import the attached SAS Enterprise Miner .xml diagram and repeat the analysis.

Introduction / Problem Statement:
Many factors impact the spread of disease. In this project, we analyze the impact of storms on the incidence of infectious disease. Weather is believed to have direct impacts such as injuries, drowning, freezing, exhaustion and dehydration, as well as indirect impacts when people's behaviors are changed. For example, weather patterns affect how people congregate, and as a result storms affect the rate of propagation of an infection through a community.
The presumption is that the more time people spend indoors near each other, the more likely a disease is to spread.

First, we will determine the types of storms (Wind, Thunderstorm, Flood, Winter storm, Cold) that have a statistically significant impact on certain infectious diseases. Second, we will propose a model that predicts the incidence of certain infectious diseases based on Diagnostic Related Group (DRG), age group, area code and week. The model could then be used to help healthcare providers prepare for fluctuations in patient needs throughout the different weeks of the year based on predicted patient numbers and diseases.

Data Preparation:
Several datasets were provided. An admits dataset provided the number of people admitted for a specific infectious disease by area code, age group, and week. The dataset was assumed to be complete. The admits table (after changing the column title from "DRG_code" to "DRG24" for compatibility) was joined to the DRG table, which provided the minimum and maximum weeks required for the incubation of a given disease. In the description of several
DRGs there was a specified age range, so only populations in the described age range could be diagnosed with the DRG. The tables were joined on DRG code. Two new columns were then created, the minimum and maximum allowable storm weeks, so that storm data could be joined to the admits data. The minimum allowable storm week was calculated by subtracting the maximum incubation period from the week in which people were admitted, and the maximum allowable storm week was calculated by subtracting the minimum incubation period.

These datasets were then joined to the storm data, which provided the number of storms of a given type (Wind, Thunderstorm, Flood, Winter, and Cold) that occurred in a given area code and week. Storms were joined to admits by counting each type of storm that occurred during the DRG-specific incubation period prior to the observed week.

The weather table was joined to the storm table to try to determine whether any storms might be missing from the storm dataset. Several of the area codes had multiple observations in one week; these observations were very similar, so the mean of all the observations was used to condense them into one per week. The storm and weather datasets were joined on area code and week. There were very few storm observations with a count of zero, so great care would need to be taken when modeling additional storms. The weather data was filtered to remove values of week > 418, as no response data was available for the number of admits in weeks 419 and beyond, and missing values of the predictor variables were imputed with the Tree method for interval variables. No indicator variables were created.
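A minimal pandas sketch of this incubation-window join may help clarify the mechanics (the actual work was done in SAS; all table contents and column names here are illustrative stand-ins, not the project's exact ones):

```python
import pandas as pd

# Toy stand-ins for the admits, DRG and storm tables (illustrative values only).
admits = pd.DataFrame({"area_code": ["A1", "A1", "A1"],
                       "DRG24": [89, 89, 89],
                       "week": [10, 11, 12],
                       "admits": [2, 0, 1]})
drg = pd.DataFrame({"DRG24": [89], "min_incubation": [1], "max_incubation": [3]})
storms = pd.DataFrame({"area_code": ["A1", "A1"], "week": [8, 9],
                       "FloodStormCount": [1, 0], "ColdStormCount": [0, 2]})

# Join the incubation periods, then derive the allowable storm-week window:
# earliest storm week = admit week - max incubation,
# latest storm week   = admit week - min incubation.
merged = admits.merge(drg, on="DRG24")
merged["min_storm_week"] = merged["week"] - merged["max_incubation"]
merged["max_storm_week"] = merged["week"] - merged["min_incubation"]

# Count storms of each type that fall inside each record's window.
def storms_in_window(row):
    inside = storms[(storms["area_code"] == row["area_code"]) &
                    storms["week"].between(row["min_storm_week"],
                                           row["max_storm_week"])]
    return inside[["FloodStormCount", "ColdStormCount"]].sum()

counts = merged.apply(storms_in_window, axis=1)
merged["FloodStormCount"] = counts["FloodStormCount"]
merged["ColdStormCount"] = counts["ColdStormCount"]
```

With an incubation range of 1–3 weeks, an admit in week 10 picks up storms from weeks 7 through 9, which is exactly the windowing described above.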
The number of missing values ranged from 17 to 42 observations.

The two datasets "sas2011_population" and "sas2011_population_2005" were appended to create the entire population data set, and the "age" column was changed to "age_group" with more clearly defined values that matched the descriptions in the admits table. The population data was merged with the admits data (which already contains the DRG and storm data) by area code, age group and week.

The calendar dataset was filtered to remove values of week > 418 and then merged with the admits data set, which was filtered to remove missing responses. It was merged by week, area code and age group. The calendar data provides the number of workdays and schooldays in a given calendar week. The number of workdays and schooldays may affect the ability of a storm to influence the spread of disease, because it is assumed that there is more human contact at school and work to encourage the spread of disease. The calendar data at the time of the storm affects the potential for the spread of an infectious disease, and its impact may vary by age group.

Data Exploration:
This problem includes data on 23 separate diagnostic related group (DRG) codes, although some of them are related. For example, there are three separate DRG codes for pneumonia
with the differences being with or without "CC" and ages 0-17 versus 17+. The minimum incubation period for all DRGs is 1 week, with a maximum incubation period of 2-4 weeks.

The fictitious admits data has a separate record for each week number by area code, DRG code and age_group, along with the response value of number of admits. Over 73% of the records had 0 admits during the week, with an average of 0.289. Only 0.4% of the records had more than 2 admits during the week, with a maximum value of 14 that occurred on two separate occasions. A quick glance at the data sorted by number of admits shows that the same predictors of Area_Code = URQ80YY, DRG24 = 89, and age_group = 65+ are heavily concentrated at the high range of number of admits.

The population data is heavily right-skewed: 90% of the age_group-specific populations in a given area code are less than 5,000, although some groups do contain values greater than 20,000. The median is just over 500 while the mean is over 1,500.

The storm dataset not only breaks down the types of storms in a given week for each area code, but also indicates how many storms occurred that week. Having multiple severe storms, or even a larger storm system that contained more than one type of storm (e.g., Cold and Flood in the same week), could play a large role in determining the number of admits with an associated DRG. After viewing the weather data and seeing the range of Snow in one week, it is clear that some storms are major and others are minor relative to one another.

The weather data set gives more data, including high and low average temperatures for the week in addition to the lowest and highest temperature of the entire week. Precipitation and Snow are contained in separate columns, and Snow must not be a subset of Precipitation, as its values can be greater than Precipitation. The hours of daylight are also given.
An initial hypothesis would be that greater hours of sunlight lead to people venturing outdoors more, contributing less to the spread of an infectious disease. Given the large range of temperatures, the data not only spans entire years but also encompasses a large variety of locations. The weather data is highly correlated because each area_code has a separate record for each week; adjacent weeks are expected to be highly correlated, along with a seasonal factor throughout the year. Some data was missing, but it was very minimal.

The calendar data set captures the date of the Sunday of each week, and the number of workdays and schooldays indicates how much interaction people in the community may be experiencing that week, depending on their age. The summer weeks are easily spotted by having 0 schooldays.

The area code will not be used as a predictor variable. We may want to generalize over all area codes and not just the ones given in this data set. Additionally, the information contained in the area code's location and climate zone should be captured by the storm and weather data sets.
The score data set contains information on area_code, DRG24, age_group and week number. The number of admits is blank and must be modeled. The weather and storm historical data should be carried through to help make inferences on the score data set, and it will be merged by area_code and week.

Data Mining:
Once all of the data sets were properly added to the SAS Enterprise Miner diagram, with the necessary coding to append and merge where required along with some filtering and imputation, a 20% sample of the data was taken against which to build the model. This still included hundreds of thousands of observations but allowed for quicker analytical processing time. Additionally, the predictors did not have any categorical variables that occurred very infrequently, so not too much information was lost. The sample was stratified using proportional criteria to ensure that a representative sample was chosen. This was important because some of the response variable levels with a high number of admits occurred infrequently. A minimum strata size of five was applied. The model was built, validated and tested against this sample, but the rules of the ultimately chosen model are applied in whole to the scoring data set. Figure 1 shows the Enterprise Miner Process Flow Diagram. The data mining begins after the final data merge.

Figure 1: Enterprise Miner Process Flow Diagram
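The proportional stratified sampling step can be sketched as follows. This is a hypothetical Python stand-in for the Enterprise Miner Sample node, not the project's code; the function name, toy data and exact rounding rule are illustrative assumptions:

```python
import pandas as pd

# Toy merged table: the stratification variable is the response (number of admits).
df = pd.DataFrame({"admits": [0] * 80 + [1] * 15 + [2] * 5,
                   "x": range(100)})

def stratified_sample(data, strata_col, frac=0.20, min_strata=5, seed=12345):
    """Take a proportional sample within each stratum, keeping at least
    `min_strata` rows per stratum (or the whole stratum if it is smaller)."""
    parts = []
    for _, grp in data.groupby(strata_col):
        n = max(min_strata, int(round(len(grp) * frac)))
        parts.append(grp.sample(n=min(n, len(grp)), random_state=seed))
    return pd.concat(parts)

sample = stratified_sample(df, "admits")
# 80 zeros -> 16 sampled; 15 ones -> 3 bumped up to 5; 5 twos -> whole stratum kept.
```

The minimum stratum size matters for exactly the reason noted above: rare high-admit levels would otherwise be nearly absent from a plain 20% sample.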
Using the sample of the merged data sets, the data was partitioned using a Data Partition node into 60% Training, 20% Validation and 20% Test. Using the default of the node, the partitions were stratified on the response variable of number of admits. While each data set will have different records, this stratification will keep them as similar as randomly possible. The default random seed number 12345 was used in this node, as well as in all preceding and following nodes.

The subsequent models will be built on the Training data set per the rules specified in those models. The Validation data set will be used to find the optimal number of steps or iterations in the model based on the chosen criteria. This keeps the model from overfitting the data. If only the Training data set were used, each additional rule of a model could increase its apparent fit to the data, but such a model and all of its rules may not be applicable to future data sets. Applying this model is a large part of the expected outcome of this exercise. The Test data set is used separately, as it is not involved in either the building of the model or the selection of the best model. The Test data set gives additional independent records for an unbiased measurement of the results of the model.

The output from the Data Partition node was fed into model nodes: a Decision Tree analysis using Misclassification criteria, a Decision Tree analysis using Average Square Error, and a Gradient Boosting model.

The first Decision Tree selected its final model based on the Misclassification rate of the Validation data set. A decision tree determines its model not through a mathematical equation but through a set of splitting rules that determines the most likely outcome of a record given all of the available predictors. First, all of the data is grouped together and a rule is developed that splits the group into two sub-groups such that the difference between the groups is maximized.
The maximization occurs by ultimately choosing certain values of one variable and putting them on one side of the tree while placing the remaining observations on the other side, after the calculations are performed on all variables and split locations. This process is then repeated for each sub-tree on the Training data set until one of the stopping rules is reached. Our response variable was number of admits, and it was treated as a continuous variable. Therefore, the exact number of admits had to be modeled for the record to be properly identified, although only integers greater than or equal to 0 were valid choices. Because the response variable was interval, ProbF was used as the interval splitting rule criterion. The decision tree is able to handle missing values without eliminating the record by placing them all in the left side of the branch and modifying its splitting rules appropriately. The maximum branch size was 2 and the maximum branch depth was 6; these are both defaults of the Decision Tree node.

The second Decision Tree selected its final model based on the Average Square Error of the Validation data set. This assessment criterion seems more appropriate, as the larger problem is interested in determining which storms have the greatest impact on the number of admits, with the ability of the model to make predictions. Choosing the exact number of admits for a given week is less important than making the most accurate predictions available. The final model will be selected using Average Square Error as the assessment criterion, but a Misclassification
tree was also included to see how it compares. Because the response variable was interval, ProbF was used as the interval splitting rule criterion. The decision tree is able to handle missing values without eliminating the record by placing them all in the left side of the branch and modifying its splitting rules appropriately. The maximum branch size was 2 and the maximum branch depth was 6; these are both defaults of the Decision Tree node.

The Gradient Boosting node uses tree boosting to create a series of decision trees that together form a single predictive model. A tree in the series is fit to the residual of the prediction from the earlier trees in the series, where the residual is defined in terms of the derivative of a loss function. Boosting is a classification technique whereby the estimated probabilities are adjusted by weight estimates, and the weight estimates are increased when the previous model misclassified the response. The Gradient Boosting model in this diagram uses 50 iterations, a Shrinkage value of 0.10 to reduce the contribution of each tree, and a Training proportion of 60%, where a different training sample is taken for each iteration. The other defaults of the Gradient Boosting node were kept, with the assessment measure being Average Square Error on the Validation data set.

Results:
The Misclassification decision tree had nearly 500,000 degrees of freedom. The inputs included 18 interval variables and 2 nominal variables. The final selected model had only four terminal leaves, meaning there were only three splits. The first split was on Flood Storm count and the second split was on Cold Storm count. The third and final split was on Highest Temperature of the Week, with temperatures above 52.5 associated with higher admit rates (and the group size of the leaf equal to the minimum value of 5).
Further iterations improved the misclassification rate of the training data set, but the misclassification rate of the validation data set reached its lowest value at four iterations and remained at the same rate with further iterations. Therefore, the simplest model (i.e., the one with the fewest iterations) with the best misclassification rate was chosen as the final model. The cumulative lift chart for this model is shown in Figure 2 below. A cumulative lift of 10 within the top 10% of score-ordered cases indicates that those cases contain, on average, 10 times the overall rate of admits.

Figure 2: Cumulative Lift Chart for Misclassification Decision Tree
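Cumulative lift at a given depth can be computed as sketched below. The scores and responses are toy values for illustration, not the report's actual model output:

```python
import pandas as pd

# Toy scored records: a model score and the actual admits (illustrative values).
scored = pd.DataFrame({
    "predicted": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "admits":    [3,   2,   0,   0,   0,   0,   0,   0,   0,   0],
})

def cumulative_lift(data, depth=0.10):
    """Mean response in the top `depth` fraction of score-ordered cases,
    divided by the overall mean response."""
    ranked = data.sort_values("predicted", ascending=False)
    top = ranked.head(max(1, int(len(ranked) * depth)))
    return top["admits"].mean() / data["admits"].mean()

lift = cumulative_lift(scored)
```

Here the top 10% of cases average 3 admits against an overall mean of 0.5, giving a cumulative lift of 6; a lift of 10 at the 10th percentile, as in Figure 2, reflects an even stronger concentration of admits among the highest-scored cases.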
The Average Square Error decision tree had an identical set-up to the Misclassification decision tree, except for its model-selection criterion. Its average square error was slightly lower than, and therefore preferable to, the Misclassification decision tree's, although the two trees had the same misclassification rate on the validation data set. The final selected model had nine terminal leaves. Figure 3 shows that the best model was chosen after 9 iterations, and the decision would have been different if using misclassification rate. Figure 4 shows the Lift Chart; the entire lift benefit is achieved by the 10th percentile on the Training data set. The cumulative lift chart is nearly identical to that shown for the Misclassification decision tree, with a cumulative lift value of 10 at the 10th percentile.

The first two splits in the Average Square Error decision tree matched those for the Misclassification decision tree: Flood Storm count and then Cold Storm count. For the areas with a higher flood storm count, populations above 3048.5 were much more likely to have a larger number of admits. Highest temperature of the week was again used as a splitting variable before Week Number was split on twice. Including the week number in the model may not contribute much to future predictions, but it could indicate trends as to whether a certain area is becoming more or less prone to the transfer of infectious diseases. Further down the tree, additional splitting rules were again made on the Flood Storm and Cold Storm data. These particular storms seem especially correlated with the number of admits in subsequent weeks.
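For readers outside SAS, a rough scikit-learn analogue of such a tree can be sketched on synthetic data. Note that scikit-learn offers squared-error splitting rather than the ProbF criterion used here, so this is only an approximation, and the variable names and effect sizes are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12345)
n = 4000
# Synthetic predictors standing in for the report's variables (names illustrative):
# column 0 = flood storm count, 1 = cold storm count, 2 = highest weekly temperature.
flood = rng.integers(0, 3, n)
cold = rng.integers(0, 3, n)
temp = rng.uniform(0.0, 100.0, n)
X = np.column_stack([flood, cold, temp]).astype(float)

# Admits rise with flood and cold storm counts, echoing the fitted trees;
# temperature is pure noise in this synthetic setup.
y = rng.poisson(0.1 + 0.5 * flood + 0.3 * cold).astype(float)

# Binary splits to depth 6 mirror the Decision Tree node defaults.
tree = DecisionTreeRegressor(max_depth=6, random_state=12345).fit(X, y)
importance = tree.feature_importances_  # storm counts should dominate
```

On this synthetic data the storm counts carry the signal, so their importance dwarfs that of the noise temperature column, loosely mirroring how Flood Storm and Cold Storm dominated the early splits above.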
Figure 3: Model Iteration Plots for Average Square Error Decision Tree

Figure 4: Lift Chart for Average Square Error Decision Tree

The final model selected from the Gradient Boosting node was chosen based on the average square error of the Validation data set, as opposed to the default profit criterion. This model performed very similarly to the two decision tree models, with a nearly identical cumulative lift chart, but the predictor variables were considerably different. The first and third most important variables were the DRG code and the Age Group, respectively. This indicates that there is much more involved than just the type of storm. There are strong interaction effects
that determine the number of admits. Additionally, the only storm count that incurred any splitting was Wind Storm.

Figure 5: Variable Importance for Gradient Boosting model

Variable Name     | Variable Label                                | Splitting Rules | Importance | Validation Importance | Ratio of Validation to Training Importance | Interaction Importance
DRG24             | DRG24                                         | 317 | 1          | 1          | 1           | 0.05705945
WindStormCount    |                                               | 620 | 0.90436482 | 0.68292926 | 0.755147979 | 0.05235151
age_group         |                                               | 150 | 0.54379129 | 0.31783064 | 0.584471741 | 0.03751741
_NODEID_          |                                               | 39  | 0.23469854 | 0.24760786 | 1.055003845 | 0.00276443
week              | week                                          | 374 | 0.43223949 | 0.20742442 | 0.479883087 | 0.00319923
Schooldays        | Schooldays                                    | 72  | 0.30115671 | 0.18227595 | 0.605252845 | 0.01177374
IMP_MinLowT       | Imputed: Lowest temperature of the week       | 47  | 0.34987899 | 0.16313048 | 0.466248272 | 0.00187398
IMP_AvgLowT       | Imputed: Average low temperature of the week  | 13  | 0.10240737 | 0.05075145 | 0.495583996 | NaN
IMP_MaxHighT      | Imputed: Highest temperature of the week      | 72  | 0.15980206 | 0.04996011 | 0.312637462 | NaN
IMP_week          | Imputed: week                                 | 79  | 0.18864154 | 0.01674544 | 0.088768586 | NaN
IMP_Daylight      | Imputed: Daylight of the week                 | 0   | 0          | 0          | NaN         | NaN
Workdays          | Workdays                                      | 0   | 0          | 0          | NaN         | NaN
coldStormCount    |                                               | 0   | 0          | 0          | NaN         | NaN
ThunderStormCount |                                               | 0   | 0          | 0          | NaN         | NaN
IMP_AvgHighT      | Imputed: Average high temperature of the week | 0   | 0          | 0          | NaN         | NaN
WinterStormCount  |                                               | 0   | 0          | 0          | NaN         | NaN
IMP_snow          | Imputed: Inches of snow                       | 0   | 0          | 0          | NaN         | NaN
FloodStormCount   |                                               | 0   | 0          | 0          | NaN         | NaN
population        |                                               | 0   | 0          | 0          | NaN         | NaN
IMP_prcp          | Imputed: Precipitation                        | 0   | 0          | 0          | NaN         | NaN

The model comparison node selected the Gradient Boosting model, as it had the lowest validation average square error value of 0.0246. The Average Square Error Decision Tree was second with a validation average square error value of 0.0249, and the Misclassification Decision Tree was not far behind at 0.0250. Timeout errors were received when trying to fit regression and neural network models; the root cause is unknown.
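The champion-selection step (lowest validation average square error) can be sketched with scikit-learn stand-ins for the Enterprise Miner nodes. The data is synthetic and the models only approximate the SAS implementations, but the boosting settings quoted above (50 iterations, shrinkage 0.10, 60% subsample per iteration) map directly onto scikit-learn parameters:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12345)
n = 3000
X = rng.uniform(0.0, 1.0, (n, 5))
y = rng.poisson(0.2 + X[:, 0]).astype(float)  # synthetic admits-like counts

# 60/20/20 split standing in for the Data Partition node.
order = rng.permutation(n)
train, valid = order[: int(0.6 * n)], order[int(0.6 * n): int(0.8 * n)]

# Gradient boosting with the node settings quoted in the text.
gb = GradientBoostingRegressor(n_estimators=50, learning_rate=0.10,
                               subsample=0.6, random_state=12345)
dt = DecisionTreeRegressor(max_depth=6, random_state=12345)

def validation_ase(model):
    """Average square error on the validation partition."""
    model.fit(X[train], y[train])
    return float(np.mean((model.predict(X[valid]) - y[valid]) ** 2))

scores = {"gradient_boosting": validation_ase(gb),
          "decision_tree": validation_ase(dt)}
champion = min(scores, key=scores.get)  # lowest validation ASE wins
```

This mirrors the model comparison logic: each candidate is fit on Training, scored on Validation, and the model with the smallest validation ASE is carried forward to scoring.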
Conclusions:
After meticulous merging of the given data set with all available weather, storm, population, calendar and DRG parameters, the developed models achieved a low average square error of less than 0.0250 using the number of admits as the response variable. The misclassification rate of all of the models was driven primarily by the prevalence of weeks with admits equal to 0; however, the low average square error of the selected model lends some confidence to making predictions.

The Gradient Boosting model was chosen over the two Decision Tree models, although they all had similar response characteristics. The Wind, Flood and Cold storms were involved in the models, but Thunderstorm and Winter storms were not. Storms that affect people's behavior and keep them indoors in close proximity to others are more likely to contribute to the spread of infectious diseases. In that light, wind storms can knock out electricity and cause severe damage, impairing travel and normal school and business functions. Floods and cold storms can likewise be major disruptions. On the other hand, people are accustomed to thunderstorms and winter storms and have methods of dealing with them without greatly altering their normal lifestyles.

The DRG code as well as the Age Group played significant roles in the chosen model. The interactions between these variables and the weather data are evident; certain events will affect some groups and not others. Additionally, some of the DRGs only apply to certain Age Groups, so the relationship is not surprising. For cold storms, the respiratory illnesses including bronchitis, pneumonia and respiratory infections were most positively correlated with an increase in cold storms, while viral illnesses and fevers actually had a negative correlation. For floods, viral illness and fever, as well as otitis media and URI, had a positive correlation with increased storm incidence.
These relationships are built into the model, so the model has predictive capabilities as well as general modeling capabilities if planners want to see what their admission needs might be under certain scenarios.
Appendix A: SAS Enterprise Miner 6.2 .XML Diagram Instructions
In order to run the attached "Shootout_Team_SATURN.xml" file, open Enterprise Miner and create a new project. In the Project Start Code, add a libname titled "shootout" with a path to a directory containing the SAS2011_STORM dataset. Add the following 7 data sources to the project: CALENDAR_DATA (created from Calendar.xls), DRG_LIST_DATA (created from DRG_list.xls), SAS2011_ADMITS, SAS2011_POPULATION, SAS2011_POPULATION_2005, SAS2011_WEATHER, SCORE_DATA. All of the datasets were provided by the SAS Shootout 2011 project. Import the Shootout_Team_SATURN.xml diagram and run all nodes.