 

Biosurveillance: Machine Learning And Disease Surveillance by Kass-Hout Di Tada

The majority of the designs, analyses and evaluations of early detection (or biosurveillance) systems have been geared towards specific data sources and detection algorithms. Much less effort has been focused on how these systems will "interact" with humans. For example, consider multiple domain experts working at different levels across different organizations in an environment where numerous biosurveillance algorithms may provide contradictory interpretations of ongoing events. We present a framework that consists of a collection of autonomous, machine learning-enabled analytic processes, services and tools that, for the first time, will seamlessly integrate surveillance and response systems with human experts.

    1. MACHINE LEARNING AND DISEASE SURVEILLANCE
        • Taha Kass-Hout, MD, MS
        • Nicolás di Tada
        • October 2008
    2. Image sources
        • http://www.birds.cornell.edu/crows/images/deadcrow.jpg
        • http://farm3.static.flickr.com/2029/2239605500_6ef2fd2295.jpg?v=0
    3. LATE DETECTION – RESPONSE
        • Chart labels: DAY, CASES, Opportunity for control
    4. EARLY DETECTION AND RESPONSE
        • Chart labels: DAY, CASES, Opportunity for control
    5. INFORMATION SOURCES
        • Event-based: ad hoc, unstructured reports issued by formal or informal sources
        • Indicator-based: number of cases, rates, proportion of strains, …
    6. PUBLIC HEALTH MEASURES
        • Representativeness
        • Completeness
        • Predictive value
        • Timeliness
    7. PUBLIC HEALTH MEASURES
        • Pyramid labels: 1000 malaria infections (100%); 50 malaria notifications (5%)
        • Axis labels: Specificity / Reliability; Sensitivity / Timeliness
        • Main attributes
            • Representativeness
            • Completeness
            • Predictive value positive
        • Get as close to the bottom of the pyramid as possible
        • Urge frequent reporting: weekly → daily → immediately
    8. PUBLIC HEALTH MEASURES
        • Diagram labels: Analyze and interpret; Automated analysis/thresholds; Time; Health care hotline
        • Main attributes
            • Timeliness
        • Signal as early as possible
    9. THE PROBLEM SPACE
        • The design, analysis and evaluation of current systems have been geared towards specific data sources and detection algorithms, not humans
        • We have systems in place for the threats we have faced before
    10. PUBLIC HEALTH – TWO PERSPECTIVES
        • Case management
            • Individual cases of notifiable diseases
            • Relationship networks (contact tracing)
        • Population surveillance
            • Larger risk patterns
    11. CASE MANAGEMENT
        • Questions/problems:
            • Is a case due to recent transmission?
            • If so, does the case share any feature with other recent cases?
        • Ways it is being done:
            • Investigations/interviews
            • Meeting with other investigators
    12. POPULATION SURVEILLANCE
        • Questions/problems:
            • Are more cases happening than expected?
            • Does an excess suggest ongoing transmission in a specific region?
        • Way it is being done:
            • Semi-automated routine temporal and space-time statistical analysis
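
Slide 12 asks whether more cases are happening than expected. One common way to make "expected" concrete is a threshold derived from a baseline of past counts. The sketch below is an illustration only, not the authors' method: the function names, the baseline counts and the choice of mean plus two standard deviations are all assumptions.

```python
from statistics import mean, stdev


def expected_threshold(baseline_counts, k=2.0):
    """Upper bound on the 'expected' count: baseline mean plus k standard deviations."""
    return mean(baseline_counts) + k * stdev(baseline_counts)


def more_than_expected(baseline_counts, current_count, k=2.0):
    """Return True when the current count exceeds the baseline threshold."""
    return current_count > expected_threshold(baseline_counts, k)


# Hypothetical daily case counts for one region (not data from the slides).
baseline = [10, 12, 11, 9, 10, 13, 11]

print(more_than_expected(baseline, 25))  # well above the baseline
print(more_than_expected(baseline, 12))  # within normal variation
```

A rule this simple ignores seasonality and spatial structure, which is part of why the deck goes on to discuss temporal, spatial and multivariate methods.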
    13. WHY LOCATION MATTERS – CASE MANAGEMENT
        • If you are studying a case of a certain disease that was just declared
        • It is harder to picture the situation by looking at something like this...
    14. WHY LOCATION MATTERS – CASE MANAGEMENT
    15. WHY LOCATION MATTERS – CASE MANAGEMENT
        • Than by looking at this...
    16. WHY LOCATION MATTERS – CASE MANAGEMENT
    17. WHY LOCATION MATTERS – POP SURVEILLANCE
        • If you are studying the spatial distribution of a set of disease clusters
        • This would seem more difficult...
    18. WHY LOCATION MATTERS – POP SURVEILLANCE
    19. WHY LOCATION MATTERS – POP SURVEILLANCE
        • Than this...
    20. WHY LOCATION MATTERS – POP SURVEILLANCE
    21. MODERN DISEASE SURVEILLANCE
        • In the past two decades, much disease surveillance research has focused on developing analytical methods for automatically detecting anomalous patterns in data
        • Modern methods can achieve timely detection of anomalies by incorporating temporal, spatial, and multivariate information
    22. MODERN DISEASE SURVEILLANCE
        • Sample records: 9/20, 15213, cough/cold, …; 9/21, 15207, antifever, …; 9/22, 15213, CC = cough, …; 1,000,000 more records…
        • Huge mass of data
        • Detection algorithm
        • “What are we supposed to do with this?”
        • Too many alerts
    23. MODERN DISEASE SURVEILLANCE
        • Sample records: 9/20, 15213, cough/cold, …; 9/21, 15207, antifever, …; 9/22, 15213, CC = cough, …; 1,000,000 more records…
        • Huge mass of data
        • Feedback loop
    24. ADVANTAGES OF MACHINE LEARNING
        • P(malaria) = 22%
        • P(influenza) = 13%
        • P(other ILI) = 33%
    25. MACHINE LEARNING TECHNIQUES
        • Classifiers
        • Clustering
        • Bayesian Statistics
        • Neural Networks
        • Genetic Algorithms
    26. HOW TO REPRESENT A DOCUMENT?
        • “This morning I woke up with fever, I might have a flu.”
        • “I had a flu last month. […] I had a flu early this week.”
        • Terms: flu, fever
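
Slide 26 pairs two example sentences with the terms "flu" and "fever". A plausible reading is a bag-of-words representation, where each document becomes a vector of term counts. The sketch below assumes that reading; the two-term vocabulary comes from the slide, while the tokenizer and the function name `to_vector` are illustrative choices.

```python
import re
from collections import Counter

# Vocabulary taken from the slide; the order fixes the vector axes.
VOCABULARY = ("flu", "fever")


def to_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary term occurs in the text (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(counts[term] for term in vocabulary)


doc1 = "This morning I woke up with fever, I might have a flu."
doc2 = "I had a flu last month. […] I had a flu early this week."

print(to_vector(doc1))  # (1, 1)
print(to_vector(doc2))  # (2, 0)
```

Each document is now a point in a two-dimensional (flu, fever) space, which is the input the classifier and clustering slides work with.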
    27. CLASSIFIERS – PROBLEM DEFINITION
        • Map items to vectors (feature extraction)
        • Normalize those vectors
        • Train the classifier
        • Measure the results with new information
        • Feed the results back to the classifier
        • Separate classes in feature space
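
The steps on slide 27 can be sketched end to end on toy data. The deck's classifier of choice is an SVM; to stay self-contained, this sketch swaps in a plain perceptron, which also learns a linear separator but does not maximize the margin. The four-term vocabulary, the count vectors and all function names are hypothetical.

```python
import math

# Hypothetical term counts over the vocabulary (flu, fever, goal, match).
# Label +1 = health-related report, -1 = unrelated text.
TRAINING = [
    ((2, 1, 0, 0), 1),
    ((0, 0, 2, 1), -1),
    ((1, 1, 0, 0), 1),
    ((0, 1, 1, 2), -1),
]


def normalize(vector):
    """Scale a vector to unit length so document length does not dominate."""
    norm = math.sqrt(sum(v * v for v in vector))
    return tuple(v / norm for v in vector) if norm else tuple(vector)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(samples, epochs=20):
    """Classic perceptron rule: nudge the separator towards each misclassified sample."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            if y * score(w, b, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:  # every training sample is on the correct side
            break
    return w, b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else -1


# Steps 1-3: vectors, normalization, training.
w, b = train_perceptron([(normalize(x), y) for x, y in TRAINING])

# Step 4: measure on new, unseen vectors.
held_out = [((3, 1, 0, 0), 1), ((0, 0, 1, 3), -1)]
correct = sum(predict(w, b, normalize(x)) == y for x, y in held_out)
print(f"accuracy on new data: {correct}/{len(held_out)}")
```

The feedback step would add misclassified new items, once labelled by an expert, back into the training set and retrain.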
    28. CLASSIFIERS - SVM
    29. SVM – MARGIN MAXIMIZATION
        • Support vectors define the separator
    30. SVM – NON LINEAR?
        • Φ: x → φ(x)
        • Map to higher-dimension space
    31. SVM – FILTERING OR CLASSIFYING
        • Diagram labels: Document 1; Document 2; Document 3; Positives; Negatives; Training Document; Training Document; Classifier
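
Slide 30's mapping x → φ(x) can be shown on a small one-dimensional example. The data, the map φ(x) = (x, x²) and the hand-picked separator below are all assumptions made for illustration; an actual SVM would find the maximum-margin separator itself, usually through a kernel rather than an explicit map.

```python
def phi(x):
    """Map a 1-D point into 2-D: (x, x squared)."""
    return (x, x * x)


def separable_by_threshold(negatives, positives):
    """In 1-D, a linear separator is a single threshold between the two classes."""
    return max(negatives) < min(positives) or max(positives) < min(negatives)


def linear_decision(point, w, b):
    """Classify with the linear rule sign(w . point + b)."""
    s = sum(wi * pi for wi, pi in zip(w, point)) + b
    return 1 if s > 0 else -1


inner = [-1.0, 0.0, 1.0]   # class -1, surrounded by the other class
outer = [-3.0, -2.5, 3.0]  # class +1

print(separable_by_threshold(inner, outer))  # no single threshold works in 1-D

# In the mapped space the line x2 = 4 separates the classes.
w, b = (0.0, 1.0), -4.0
print([linear_decision(phi(x), w, b) for x in inner])
print([linear_decision(phi(x), w, b) for x in outer])
```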
    32. CLUSTERING – PROBLEM DEFINITION
        • Map items to vectors (feature extraction)
        • Normalization
        • Agglomerative and partitional
    33. CLUSTERING - AGGLOMERATIVE
    34. CLUSTERING - PARTITIONAL
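
The slides name partitional clustering without giving an algorithm. k-means is the most familiar partitional method, so the sketch below uses it as an example; the choice of k-means, the sample points and the caller-supplied initial centroids are assumptions, not something the deck specifies.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

An agglomerative method would instead start with every point as its own cluster and repeatedly merge the two closest clusters.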
    35. BAYESIAN STATISTICS
        • P(A|B): probability of disease A (flu) once symptoms B (fever) are observed
        • P(B|A): probability of fever once flu is confirmed
        • P(A): probability of flu (prior or marginal)
        • P(B): probability of fever (prior or marginal)
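
The four quantities on slide 35 are the terms of Bayes' theorem. The slide's formula image did not survive extraction, so the standard statement is given here, with A = flu and B = fever as on the slide. The numbers in the second line are hypothetical and serve only to show the arithmetic.

```latex
% Bayes' theorem with A = flu, B = fever
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]

% Hypothetical values: P(B|A) = 0.9, P(A) = 0.05, P(B) = 0.15
\[
P(\text{flu} \mid \text{fever}) = \frac{0.9 \times 0.05}{0.15} = 0.3
\]
```

With these made-up values, observing fever raises the probability of flu from the 5% prior to 30%.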
    36. NEURAL NETWORKS
        • Given a set of stimuli, train a system to produce a given output
    37. NEURAL NETWORKS - STRUCTURE
        • Diagram labels: Input Layer; Hidden Layer; Output Layer; Weight
        • Inputs: {I_0, I_1, …, I_n}
        • Outputs: {O_0, O_1, …, O_n}
    38. NEURAL NETWORK - APPLICATION
        • Event?
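
The input, hidden and output layers of slide 37 can be sketched as a forward pass through a small feed-forward network. The layer sizes, the sigmoid activation and every weight below are hypothetical; in practice the weights would be learned from labelled examples, which this sketch does not cover.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to unit j."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical network: 3 inputs, 2 hidden units, 1 output ("Event?").
inputs = [0.8, 0.1, 0.5]
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [-0.1]

event_score = forward(inputs, hidden_w, hidden_b, output_w, output_b)[0]
print(f"event score: {event_score:.3f}")
```

The single output can be read as a score between 0 and 1 for the question on slide 38, whether the inputs describe an event.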
    39. GENETIC ALGORITHM - BASICS
        • Define the model that you want to optimize
        • Create the fitness function
        • Evolve the gene pool, testing against the fitness function
        • Select the best individual
    40. GENETIC ALGORITHM – MODEL
        • Model the transmission process using a set of parameters:
            • Onset time between an infection and illness
            • Latency period
            • Incubation period
            • Symptomatic period
            • Infectious period
        • Example: (Onset, Latency, Incubation, Symptomatic, Infectious) = (2 days, 3 days, 1 day, 4 days, 3 days)
    41. GENETIC ALGORITHM – MODEL FITNESS
        • Fitness = 1/Area
    42. GENETIC ALGORITHM – PROCESS
        1. Create an initial population of candidates
        2. Use operators to generate new candidates (mating and mutation)
        3. Discard the worst individuals or select the best individuals in the generation
        4. Repeat from step 2 until you find a candidate that satisfies the search criteria
    43. GENETIC ALGORITHM - PROCESS
        • (4,5,6,3,5) (4,3,6,2,5) (5,3,4,6,2) (2,4,6,3,5) (4,3,6,5,2) (2,3,4,6,5) (3,4,5,2,6) (3,5,4,6,2) (4,5,3,6,2) (5,4,2,3,6) (4,6,3,2,5) (3,4,2,6,5) (3,6,5,1,4)
        • (5,3,2,6,5) (3,4,4,6,2)
        • (5,3,2,6,5) (3,4,4,6,2)
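
The process on slides 39 to 43 can be sketched as a small genetic algorithm over five-parameter candidates. The slides do not say how "Area" is computed, so this sketch makes a loud assumption: `area` is a stand-in that sums the absolute differences from a hypothetical target tuple, in place of whatever area the authors measured between model and observation. The population size, mutation rate and value range 1 to 6 are also illustrative.

```python
import random

# Hypothetical target (onset, latency, incubation, symptomatic, infectious) in days.
TARGET = (2, 3, 1, 4, 3)


def area(candidate):
    """ASSUMPTION: stand-in for the slide's 'Area'; 0 means a perfect match."""
    return sum(abs(c - t) for c, t in zip(candidate, TARGET))


def fitness(candidate):
    """Fitness = 1/Area, with a tiny offset to avoid division by zero."""
    return 1.0 / (area(candidate) + 1e-9)


def crossover(a, b, rng):
    """Mating: take the head of one parent and the tail of the other."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(candidate, rng, rate=0.2):
    """Mutation: occasionally replace a gene with a random value from 1 to 6."""
    return tuple(rng.randint(1, 6) if rng.random() < rate else g for g in candidate)


def evolve(pop_size=20, generations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initial population of random candidates.
    population = [tuple(rng.randint(1, 6) for _ in TARGET) for _ in range(pop_size)]
    history = []  # best area seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(area(population[0]))
        if history[-1] == 0:  # a candidate satisfies the search criteria
            break
        # Step 3: keep the best half; step 2: breed replacements from them.
        survivors = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=fitness), history


best, history = evolve()
print(best, area(best))
```

Because the best half always survives unchanged, the best area can never get worse from one generation to the next.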
    44. RESULTS – IMPROVED SURVEILLANCE
    45. Q&A
    46. THANK YOU!
        • Taha Kass-Hout, MD, MS
            • http://www.instedd.org
            • [email_address]
            • http://taha.instedd.org
        • Nicolás di Tada
            • http://www.manas.com.ar
            • [email_address]
            • http://weblogs.manas.com.ar/ndt/
    47. BACKUP SLIDES
    48. REFERENCES
        • Izadi, M. and Buckeridge, D., Decision Theoretic Analysis of Improving Epidemic Detection, AMIA 2007 Symposium Proceedings, 2007
        • EpiNorth-based material (http://www.epinorth.org):
            • Mereckiene, J., Outbreak Investigation: Operational Aspects. Jurmala, Latvia, 2006
            • Bagdonaite, J. and Mereckiene, J., Outbreak Investigation: Methodological Aspects. Jurmala, Latvia, 2006
            • Mazick, A., Epidemic Intelligence: Signals from Surveillance Systems. Statens Serum Institut, Denmark. EpiTrain III, Jurmala, August 2006
        • Neill, D., Incorporating Learning into Disease Surveillance Systems
    49. REFERENCES
        • Algorithms
            • Complex Event Processing Over Uncertain Data, Wasserkrug (2008)
            • Outbreak detection through automated surveillance: a review of the determinants of detection, Buckeridge (2007)
            • Approaches to the evaluation of outbreak detection methods, Watkins (2006)
            • Algorithms for rapid outbreak detection: a research synthesis, Buckeridge (2004)
            • Data mining in bioinformatics using Weka, Frank (2004)
    50. REFERENCES
        • Automating Laboratory Reporting
            • Automatic Electronic Laboratory-Based Reporting, Panackal (2002)
            • Benefits and Barriers to Electronic Laboratory Results Reporting for Notifiable Diseases, Nguyen (2007)
        • Using EMR Data for Disease Surveillance
            • Using Electronic Medical Records to Enhance Detection and Reporting of Vaccine Adverse Events, Hinrichsen (2007)
            • Electronic Medical Record Support for PH, Klompas (2007)
            • A knowledgebase to support notifiable disease surveillance, Doyle (2005)
            • Automated Detection of Tuberculosis Using Electronic Medical Record Data, Calderwood (2007)
        • Misc Readings
            • Breakthrough in modeling emerging disease hotspots, Jones (2008)
            • Use of data mining techniques to investigate disease risk classification as a proxy for compromised biosecurity of cattle herds in Wales, Ortiz-Pelaez (2008)
    51. RELATED PROJECTS
        • InSTEDD RNA (or Event Evolution): Collaborative Analytics and Environment for Linking Early Health-Related Event Detection to an Effective Response (http://taha.instedd.org/2008/09/collaborative-analytics-and-environment.html)
        • ALPACA: "ALPACA Light Parsing And Classifying Application (ALPACA) is a classifying tool designed for use in community-oriented software as well as in Academia. The application consists of two parts: a parsing tool for transforming raw documents into readable data, and a classifying tool for categorizing documents into user-provided classes. The application provides a user-friendly interface and a Plug-in functionality to provide a simple way to add more parsers/classifiers to the application." http://2008.hfoss.org/ALPACA
        • Surveillance Project: an open source R-package disease surveillance framework for "...the development and the evaluation of outbreak detection algorithms in univariate and multivariate routine collected public health surveillance data." http://surveillance.r-forge.r-project.org/
        • Weka: an open source "...collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes." http://www.cs.waikato.ac.nz/~ml/weka/