A short presentation on the use of tweets to measure and predict Norovirus outbreaks. This talk was given at the 2015 mini-conference on data science as one of four competition finalists, and the project eventually won.
Staff members were asked on the FSA’s Yammer network (a closed social media network) to share their ideas of words that might be used when discussing norovirus. The results were used in a keyword search to build the dataset of tweets.
Alongside the keywords we want to include, there are keywords that often accompany them and indicate that the tweet is unrelated to Norovirus. We crowd-sourced these exclusions too, and also looked out for common terms that indicated a “red herring”.
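A minimal sketch of how inclusion and exclusion keywords could be combined into a filter. The term lists here are hypothetical placeholders (only “sickness bug” and “norovirus” appear in the talk); the crowd-sourced lists themselves are not reproduced, and the whole-word matching rule is an assumption.

```python
import re

# Hypothetical term lists for illustration; not the crowd-sourced lists from the project.
INCLUDE_TERMS = ["norovirus", "sickness bug", "winter vomiting"]
EXCLUDE_TERMS = ["software", "laptop", "computer"]


def _compile(terms):
    # Match any of the terms as a whole word or phrase, ignoring case.
    pattern = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)


INCLUDE_RE = _compile(INCLUDE_TERMS)
EXCLUDE_RE = _compile(EXCLUDE_TERMS)


def is_relevant(tweet_text):
    """Keep a tweet that has an inclusion term and no exclusion term."""
    has_include = INCLUDE_RE.search(tweet_text) is not None
    has_exclude = EXCLUDE_RE.search(tweet_text) is not None
    return has_include and not has_exclude


def filter_tweets(tweets):
    """Return only the tweets that pass the keyword filter."""
    return [text for text in tweets if is_relevant(text)]
```

For example, `filter_tweets(["Off work with the sickness bug", "Found a sickness bug in my software"])` would keep only the first tweet, because the second contains an exclusion term.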
Assuming that lab reports are themselves a good indicator of Norovirus cases in the community, this chart demonstrates that tweets including “sickness bug” and related terms are a similarly good indicator, reproducing not just the seasonality but also features such as the double peak in the winter of 2012/13 and the relative heights of the peaks. For this graph, both lab reports and tweet volumes are smoothed by averaging over seven-week periods.
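The smoothing step might look something like the sketch below. It assumes weekly counts and a centred seven-week window that shrinks at the ends of the series; the talk does not say whether the window was centred or trailing, so that choice is an assumption.

```python
def moving_average(values, window=7):
    """Centred moving average; near the edges, average the points available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Applying the same function to both the weekly lab reports and the weekly tweet counts keeps the two curves comparable.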
For the tweets to serve as a predictive tool, we need to identify the characteristics of the tweet curve at a time before the peak in cases. We therefore look at a lagged data set in which tweets are compared with “future” cases.
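One way to explore the lag is to correlate tweets in week t with cases in week t + lag and see which lag fits best. This is a sketch under assumptions: the talk does not state how the lag was chosen, and the function names and the use of Pearson correlation are illustrative only. Both series are assumed to be weekly, equal in length and not constant.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(tweets, cases, lag):
    """Correlate tweets in week t with cases in week t + lag."""
    return pearson(tweets[: len(tweets) - lag], cases[lag:])


def best_lag(tweets, cases, max_lag):
    """Lag (in weeks) at which tweets correlate most strongly with later cases."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(tweets, cases, lag))
```

A lag with a high correlation suggests how many weeks of warning the tweet curve might give, although correlation alone does not show that the lead time is stable from year to year.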
The predictive model, whether it be logistic regression or naïve Bayes, will give us a value that we need to convert into a one or a zero. We need to choose a cutoff value that suits our needs for interventions. This could be done by minimising false positives, maximising true positives, or some other method. The outcome we want is a method that gives an early warning. We are willing to accept false positives in spring and summer on the basis that these can be eliminated by inspection.
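As an illustration of one possible rule, the sketch below picks the highest cutoff that still catches a chosen fraction of true outbreak weeks, accepting whatever false positives result. The function names and the `min_tpr` parameter are hypothetical; the talk does not say which rule was finally used.

```python
def classify(scores, cutoff):
    """Convert model scores into ones and zeros."""
    return [1 if score >= cutoff else 0 for score in scores]


def rates(scores, labels, cutoff):
    """Return (true positive rate, false positive rate) at a given cutoff."""
    preds = classify(scores, cutoff)
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def choose_cutoff(scores, labels, min_tpr=0.9):
    """Highest cutoff whose true positive rate is at least min_tpr, or None."""
    for cutoff in sorted(set(scores), reverse=True):
        tpr, _ = rates(scores, labels, cutoff)
        if tpr >= min_tpr:
            return cutoff
    return None
```

Favouring the true positive rate in this way reflects the stated preference for an early warning; the false positives it lets through would then be reviewed by inspection.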
NHS Choices have created infographics to influence people’s behaviour when they or their children catch Norovirus. These will be released when the model predicts an outbreak (these images are not yet finalised, and may still be subject to change).
Lopman BA, Reacher MH, Vipond IB, Hill D, Perry C, Halladay T, Brown DW, Edmunds WJ, Sarangi J. Epidemiology and cost of nosocomial gastroenteritis, Avon, England, 2002-2003. http://www.ncbi.nlm.nih.gov/pubmed/15504271
Clarence C Tam, Laura C Rodrigues, Laura Viviani, Julie P Dodds, Meirion R Evans, Paul R Hunter, Jim J Gray, Louise H Letley, Greta Rait, David S Tompkins, Sarah J O'Brien, on behalf of the IID2 Study Executive Committee. Longitudinal study of infectious intestinal disease in the UK (IID2 study): incidence in the community and presenting to general practice. http://gut.bmj.com/content/early/2011/06/26/gut.2011.238386.short?q=w_gut_ahead_tab
Some tweets are geotagged, and most carry some location information such as the address of the user. This could potentially be used to map Norovirus outbreaks as they occur, and target interventions even more effectively.