Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

The majority of the designs, analyses and evaluations of early detection (or biosurveillance) systems have been geared towards specific data sources and detection algorithms. Much less effort has been focused on how these systems will "interact" with humans. For example, consider multiple domain experts working at different levels across different organizations in an environment where numerous biosurveillance algorithms may provide contradictory interpretations of ongoing events. We present a framework that consists of a collection of autonomous, machine learning-enabled analytic processes, services and tools that, for the first time, will seamlessly integrate surveillance and response systems with human experts.


  • Transcript

    • 1. MACHINE LEARNING AND DISEASE SURVEILLANCE
      • Taha Kass-Hout, MD, MS
      • Nicolás di Tada
      • October 2008
    • 2. Image sources:
      • http://www.birds.cornell.edu/crows/images/deadcrow.jpg
      • http://farm3.static.flickr.com/2029/2239605500_6ef2fd2295.jpg?v=0
    • 3. LATE DETECTION – RESPONSE [Figure: cases by day, with the opportunity for control marked]
    • 4. EARLY DETECTION AND RESPONSE [Figure: cases by day, with the opportunity for control marked]
    • 5. INFORMATION SOURCES
      • Event-based – ad-hoc unstructured reports issued by formal or informal sources
      • Indicator-based – (number of cases, rates, proportion of strains…)
    • 6. PUBLIC HEALTH MEASURES
      • Representativeness
      • Completeness
      • Predictive Value
      • Timeliness
    • 7. PUBLIC HEALTH MEASURES
      • [Pyramid figure: 1000 malaria infections (100%); 50 malaria notifications (5%); labels: Specificity / Reliability, Sensitivity / Timeliness]
      • Main attributes
        • Representativeness
        • Completeness
        • Predictive value positive
      • Get as close to the bottom of the pyramid as possible
      • Urge frequent reporting: weekly → daily → immediately
    • 8. PUBLIC HEALTH MEASURES
      • [Figure over time; labels: Health care hotline, Automated analysis/thresholds, Analyze and interpret]
      • Main attributes
        • Timeliness
      • Signal as early as possible
    • 9. THE PROBLEM SPACE
      • Current system design, analysis and evaluation have been geared towards specific data sources and detection algorithms, not humans
      • We have systems in place for those threats we have been faced with before
    • 10. PUBLIC HEALTH – TWO PERSPECTIVES
      • Case management
        • Individual cases of notifiable diseases
        • Relationship networks (contact tracing)
      • Population surveillance
        • Larger risk patterns
    • 11. CASE MANAGEMENT
      • Questions/problems:
        • Is a case due to recent transmission?
        • If so, does the case share any feature with other, recent cases?
      • Ways it's being done:
        • Investigations/interviews
        • Meeting with other investigators
    • 12. POPULATION SURVEILLANCE
      • Questions/problems:
        • Are more cases happening than expected?
        • Does an excess suggest ongoing transmission in a specific region?
      • Way it's being done:
        • Semi-automated routine temporal and space-time statistical analysis
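The question "are more cases happening than expected?" can be sketched as a simple threshold test on daily counts. This is a minimal illustration and not the statistical method used in the talk: the 7-day baseline window, the 2-standard-deviation threshold, the floor on the standard deviation, and the case counts are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def flag_excess(counts, baseline_days=7, z_threshold=2.0):
    """Return indices of days whose count exceeds baseline mean + z * SD.

    The baseline for each day is the preceding `baseline_days` days.
    """
    alerts = []
    for day in range(baseline_days, len(counts)):
        baseline = counts[day - baseline_days:day]
        expected = mean(baseline)
        spread = stdev(baseline)
        # Floor the SD at 1 so a nearly flat baseline does not trigger
        # alerts on tiny fluctuations (an assumption, not a standard rule).
        threshold = expected + z_threshold * max(spread, 1.0)
        if counts[day] > threshold:
            alerts.append(day)
    return alerts

# Hypothetical daily case counts; the last two days show a jump.
daily_cases = [4, 5, 3, 6, 4, 5, 4, 5, 4, 12, 15]
alerts = flag_excess(daily_cases)
```

Here `alerts` holds the indices of the days flagged as excess. Lowering `z_threshold` trades fewer missed events for more false alarms.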
    • 13. WHY LOCATION MATTERS – CASE MANAGEMENT
      • If you are studying a case of a certain disease that was just declared,
      • it is harder to picture the situation by looking at something like this...
    • 14. WHY LOCATION MATTERS – CASE MANAGEMENT
    • 15. WHY LOCATION MATTERS – CASE MANAGEMENT
      • than by looking at this...
    • 16. WHY LOCATION MATTERS – CASE MANAGEMENT
    • 17. WHY LOCATION MATTERS – POP SURVEILLANCE
      • If you are studying the spatial distribution of a set of disease clusters,
      • this would seem more difficult...
    • 18. WHY LOCATION MATTERS – POP SURVEILLANCE
    • 19. WHY LOCATION MATTERS – POP SURVEILLANCE
      • than this...
    • 20. WHY LOCATION MATTERS – POP SURVEILLANCE
    • 21. MODERN DISEASE SURVEILLANCE
      • In the past two decades, much disease surveillance research has focused on developing analytical methods for automatically detecting anomalous patterns in data
      • Modern methods can achieve timely detection of anomalies by incorporating temporal, spatial, and multivariate information
    • 22. MODERN DISEASE SURVEILLANCE
      • Huge mass of data
        • 9/20, 15213, cough/cold, …
        • 9/21, 15207, antifever, …
        • 9/22, 15213, CC = cough, …
        • 1,000,000 more records…
      • Detection algorithm
      • "What are we supposed to do with this?"
      • Too many alerts
    • 23. MODERN DISEASE SURVEILLANCE
      • Huge mass of data
        • 9/20, 15213, cough/cold, …
        • 9/21, 15207, antifever, …
        • 9/22, 15213, CC = cough, …
        • 1,000,000 more records…
      • Feedback loop
    • 24. ADVANTAGES OF MACHINE LEARNING
      • P(malaria) = 22%
      • P(influenza) = 13%
      • P(other ILI) = 33%
    • 25. MACHINE LEARNING TECHNIQUES
      • Classifiers
      • Clustering
      • Bayesian Statistics
      • Neural Networks
      • Genetic Algorithms
    • 26. HOW TO REPRESENT A DOCUMENT?
      • "This morning I woke up with fever, I might have a flu."
      • "I had a flu last month. […] I had a flu early this week."
      • [Figure: documents placed on two axes, "flu" and "fever"]
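One way to read this slide is as a term-count (bag-of-words) representation, where each document becomes a point on the "flu" and "fever" axes. The sketch below assumes that reading. The tokenizer and the `to_vector` helper are hypothetical, and the elided middle of the second quote is simply left out.

```python
import re
from collections import Counter

VOCABULARY = ["flu", "fever"]  # the two axes shown on the slide

def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc_a = "This morning I woke up with fever, I might have a flu."
doc_b = "I had a flu last month. I had a flu early this week."

vec_a = to_vector(doc_a)  # one "flu", one "fever"
vec_b = to_vector(doc_b)  # two "flu", no "fever"
```

Raw counts ignore word order and negation, so "I do not have a fever" would score the same as "I have a fever".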
    • 27. CLASSIFIERS – PROBLEM DEFINITION
      • Map items to vectors (Feature extraction)
      • Normalize those vectors
      • Train the classifier
      • Measure the results with new information
      • Feed the results back into the classifier
      • Separate classes in feature space
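The steps above can be sketched end to end with a very small linear classifier. The example uses a perceptron in place of the SVM discussed on the following slides, purely to stay dependency-free. The feature names, training vectors, and labels are invented for illustration.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (left unchanged if all zeros)."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)

def train_perceptron(samples, labels, epochs=20, rate=1.0):
    """Learn weights and a bias that separate labels +1 and -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the separator
                weights = [w + rate * y * v for w, v in zip(weights, x)]
                bias += rate * y
    return weights, bias

def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the separator x falls."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score > 0 else -1

# Hypothetical feature vectors: [count of "fever", count of "football"].
raw = [[3, 0], [2, 1], [0, 4], [1, 3]]
labels = [1, 1, -1, -1]  # +1 = health-related, -1 = not
train = [normalize(v) for v in raw]
weights, bias = train_perceptron(train, labels)

# Measure the results on items the classifier has not seen.
held_out = [normalize([4, 1]), normalize([0, 2])]
predictions = [predict(weights, bias, x) for x in held_out]
```

Feeding results back would mean appending newly labeled items to `raw` and `labels` and retraining.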
    • 28. CLASSIFIERS - SVM
    • 29. SVM – MARGIN MAXIMIZATION
      • Support vectors define the separator
    • 30. SVM – NON LINEAR?
      • Φ: x → φ(x)
      • Map to a higher-dimensional space
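A small sketch of why mapping to a higher-dimensional space helps, under assumed data: the points below cannot be split by a single threshold on x, but after the hypothetical map φ(x) = (x, x²) a straight line in the new space separates them. The data and the cutoff are invented. Kernel SVMs usually avoid computing φ explicitly, which this toy example does not attempt to show.

```python
def phi(x):
    """Map a 1-D point to 2-D: (x, x^2)."""
    return (x, x * x)

# Hypothetical 1-D data: the positive class sits between the negatives,
# so no single threshold on x separates them.
negatives = [-3.0, -2.0, 2.0, 3.0]
positives = [-1.0, 0.0, 1.0]

def classify(x, cutoff=2.5):
    """Linear rule in the mapped space: positive if the x^2 axis < cutoff."""
    _, squared = phi(x)
    return 1 if squared < cutoff else -1

results = [classify(x) for x in negatives + positives]
```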
    • 31. SVM – FILTERING OR CLASSIFYING
      • [Diagram labels: Document 1, Document 2, Document 3; Positives; Negatives; Training Document (×2); Classifier]
    • 32. CLUSTERING – PROBLEM DEFINITION
      • Map items to vectors (Feature extraction)
      • Normalization
      • Agglomerative and Partitional
    • 33. CLUSTERING - AGGLOMERATIVE
    • 34. CLUSTERING - PARTITIONAL
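As an illustration of the partitional approach, here is a minimal k-means sketch. K-means is one common partitional method; the slides do not say which algorithm the authors had in mind. The points and starting centroids are made up.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Partitional clustering: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(point, centroids[c])))
            for point in points
        ]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment, centroids

# Hypothetical 2-D feature vectors forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = kmeans(points, [points[0], points[3]])
```

An agglomerative method would instead start with every point in its own cluster and repeatedly merge the two closest clusters.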
    • 35. BAYESIAN STATISTICS
      • P(A|B) = P(B|A) · P(A) / P(B)
      • P(A|B): probability of disease A (flu) once symptoms B (fever) are observed
      • P(B|A): probability of fever once flu is confirmed
      • P(A): probability of flu (prior or marginal)
      • P(B): probability of fever (prior or marginal)
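The four quantities on this slide are the terms of Bayes' theorem. A short worked example follows. The probabilities are hypothetical and were picked only to make the arithmetic easy to follow.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_fever_given_flu = 0.9   # P(B|A): fever once flu is confirmed
p_flu = 0.05              # P(A): prior probability of flu
p_fever = 0.15            # P(B): overall probability of fever

# 0.9 * 0.05 / 0.15, i.e. roughly 0.3
p_flu_given_fever = posterior(p_fever_given_flu, p_flu, p_fever)
```

With these inputs, observing a fever raises the probability of flu from 5% to about 30%.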
    • 36. NEURAL NETWORKS
      • Given a set of stimuli, train a system to produce a given output
    • 37. NEURAL NETWORKS - STRUCTURE
      • Input layer: {I_0, I_1, …, I_n}
      • Hidden layer […]
      • Output layer: {O_0, O_1, …, O_n}
      • Weight
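The layered structure can be sketched as a forward pass through one hidden layer. This shows only how inputs flow through weighted connections to outputs; training is not shown. The sigmoid activation, the layer sizes, and the hand-picked weights are assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [sigmoid(sum(w * i for w, i in zip(neuron, inputs)) + b)
            for neuron, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical, hand-picked weights: 3 inputs, 2 hidden neurons, 1 output.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

outputs = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
```

In an "Event?" application like the one on the next slide, the single output could be read as a score between 0 and 1 for whether the inputs describe an event.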
    • 38. NEURAL NETWORK - APPLICATION Event?
    • 39. GENETIC ALGORITHM - BASICS
      • Define the model that you want to optimize
      • Create the fitness function
      • Evolve the gene pool testing against the fitness function.
      • Select the best individual
    • 40. GENETIC ALGORITHM – MODEL
      • Model the transmission process using a set of parameters:
        • Onset time between an infection and illness
        • Latency period
        • Incubation period
        • Symptomatic period
        • Infectious period
      • Example: (Onset, Latency, Incubation, Symptomatic, Infectious) = (2 days, 3 days, 1 day, 4 days, 3 days)
    • 41. GENETIC ALGORITHM – MODEL FITNESS
      • Fitness = 1/Area
    • 42. GENETIC ALGORITHM – PROCESS
      • Create an initial population of candidates
      • Use operators to generate new candidates (mating and mutation)
      • Discard worst individuals or select best individuals in generation
      • Repeat from the second step until you find a candidate that satisfies the search criteria
    • 43. GENETIC ALGORITHM - PROCESS
      • (4,5,6,3,5) (4,3,6,2,5) (5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2) (2,3,4,6,5) (3,4,5,2,6) (3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6) (4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)
      • (5,3,2,6,5) (3,4,4,6,2)
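The process on slides 39 to 43 can be sketched as a small genetic algorithm over five-parameter tuples. Several pieces are assumptions and not the authors' model. `simulate` is a toy stand-in for a transmission model. "Area" is taken to mean the summed absolute gap between simulated and observed daily cases. The population size, mutation rate, and parameter range are arbitrary. The last two tuples on slide 43 appear consistent with single-point crossover of (5,3,4,6,2) and (3,4,2,6,5) after the second gene, which is what `crossover` does.

```python
import random

GENES = 5          # (onset, latency, incubation, symptomatic, infectious)
LOW, HIGH = 1, 6   # assumed range of each parameter, in days

def simulate(params, days=30):
    """Toy stand-in for a transmission model: a triangular case curve."""
    onset, latency, incubation, symptomatic, infectious = params
    start = onset + latency + incubation
    peak = symptomatic * infectious
    curve = []
    for day in range(days):
        elapsed = day - start
        if elapsed < 0:
            curve.append(0)
        else:
            rise = (elapsed + 1) * infectious
            fall = peak - (elapsed - symptomatic + 1) * infectious
            curve.append(max(0, min(rise, fall)))
    return curve

def area_between(params, observed):
    """Sum of absolute daily differences between model and observation."""
    return sum(abs(s - o) for s, o in zip(simulate(params), observed))

def fitness(params, observed):
    """Fitness = 1/Area, as on slide 41; a perfect fit scores infinity."""
    area = area_between(params, observed)
    return 1.0 / area if area else float("inf")

def crossover(parent_a, parent_b, point):
    """Single-point mating: head of one parent, tail of the other."""
    return parent_a[:point] + parent_b[point:]

def mutate(individual, rng, rate=0.2):
    """Replace each gene with a random value with probability `rate`."""
    return tuple(rng.randint(LOW, HIGH) if rng.random() < rate else gene
                 for gene in individual)

def evolve(observed, pop_size=30, generations=40, seed=0):
    """Run the create / mate and mutate / select loop and return the best."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(LOW, HIGH) for _ in range(GENES))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, observed), reverse=True)
        if fitness(population[0], observed) == float("inf"):
            break  # found a candidate that reproduces the observations
        survivors = population[:pop_size // 2]  # discard the worst half
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors),
                             rng.randint(1, GENES - 1)), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, observed))

TARGET = (2, 3, 1, 4, 3)      # the example tuple from slide 40
observed = simulate(TARGET)   # pretend these are the reported cases
best = evolve(observed)
```

Because the surviving half is carried over unchanged, the best fitness never decreases from one generation to the next. In this toy model several parameter tuples can produce the same curve, so `best` need not equal `TARGET` even when it fits perfectly.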
    • 44. RESULTS – IMPROVED SURVEILLANCE
    • 45. Q&A
    • 46. THANK YOU!
      • Taha Kass-Hout, MD, MS
      • http://www.instedd.org
      • [email_address]
      • http://taha.instedd.org
      • Nicolás di Tada
      • http://www.manas.com.ar
      • [email_address]
      • http://weblogs.manas.com.ar/ndt/
    • 47. BACKUP SLIDES
    • 48. REFERENCES
      • Izadi, M. and Buckeridge, D., Decision Theoretic Analysis of Improving Epidemic Detection, AMIA 2007, Symposium Proceedings 2007
      • EpiNorth-Based material ( http://www.epinorth.org ):
        • Mereckiene, J., Outbreak Investigation Operational Aspects. Jurmala, Latvia, 2006
        • Bagdonaite, J., and Mereckiene, J., Outbreak Investigation Methodological aspects. Jurmala, Latvia, 2006
        • Epidemic Intelligence: Signals from surveillance systems, Anne Mazick, Statens Serum Institut, Denmark, EpiTrain III, Jurmala, August 2006
      • Daniel Neill, Incorporating Learning into Disease Surveillance Systems
    • 49. REFERENCES
      • Algorithms
        • Complex Event Processing Over Uncertain Data in Wasserkrug (2008)
        • Outbreak detection through automated surveillance A review of the determinants of detection in Buckeridge (2007)
        • Approaches to the evaluation of outbreak detection methods in Watkins (2006)
        • Algorithms for rapid outbreak detection a research synthesis Buckeridge (2004)
        • Data mining in bioinformatics using Weka in Frank (2004)
    • 50. REFERENCES
      • Automating Laboratory Reporting
        • Automatic Electronic Laboratory-Based Reporting in Panackal (2002)
        • Benefits and Barriers to Electronic Laboratory Results Reporting for Notifiable Diseases in Nguyen (2007)
      • Using EMR Data for Disease Surveillance
        • Using Electronic Medical Records to Enhance Detection and Reporting of Vaccine Adverse Events in Hinrichsen (2007)
        • Electronic Medical Record Support for PH in Klompas (2007)
        • A knowledgebase to support notifiable disease surveillance in Doyle (2005)
        • Automated Detection of Tuberculosis Using Electronic Medical Record Data in Calderwood (2007)
      • Misc Readings
        • Breakthrough in modeling emerging disease hotspots in Jones (2008)
        • Use of data mining techniques to investigate disease risk classification as a proxy for compromised biosecurity of cattle herds in Wales in Ortiz-Pelaez (2008)
    • 51. RELATED PROJECTS
      • InSTEDD RNA (or Event Evolution): Collaborative Analytics and Environment for Linking Early Health-Related Event Detection to an Effective Response ( http://taha.instedd.org/2008/09/collaborative-analytics-and-environment.html )
      • ALPACA "ALPACA Light Parsing And Classifying Application (ALPACA) is a classifying tool designed for use in community-oriented software as well as in Academia. The application consists of two parts: a parsing tool for transforming raw documents into readable data, and a classifying tool for categorizing documents into user-provided classes. The application provides a user-friendly interface and a Plug-in functionality to provide a simple way to add more parsers/classifiers to the application." http://2008.hfoss.org/ALPACA
      • Surveillance Project An Open Source R-package disease surveillance framework for "...the development and the evaluation of outbreak detection algorithms in univariate and multivariate routine collected public health surveillance data." http://surveillance.r-forge.r-project.org/
      • Weka An open source "...collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." http://www.cs.waikato.ac.nz/~ml/weka/