Text mining, machine learning, NLP and all that (in 10 minutes)

Byron C Wallace, from #CochraneTech Symposium, Québec 2013

Transcript

  • 1. text mining, machine learning, NLP and all that (in 10 minutes) Byron C Wallace Brown Center for Evidence Based Medicine #CochraneTech
  • 2. why do we need this stuff? [Bastian et al, PLoS Medicine 2010]
  • 3. why do we need this stuff? [Bastian et al, PLoS Medicine 2010]
  • 4. PubMed growth [http://altmetrics.org/wp-content/uploads/2010/10/medline-articles-by-year-lg.png]
  • 5. what can we automate? Pipeline: (1) formulate question, protocol & query; (2) search database (PubMed); (3) screen retrieved citations; (4) extract data (treatment/outcome table with cells a, b, c, d); (5) synthesize extracted data. [Figure: forest plot of included studies, odds ratio on log scale; overall I^2=19%, P=0.147]
  • 6. what can we automate? Pipeline: (1) formulate question, protocol & query; (2) search database (PubMed); (3) screen retrieved citations; (4) extract data (treatment/outcome table with cells a, b, c, d); (5) synthesize extracted data. [Figure: forest plot of included studies, odds ratio on log scale; overall I^2=19%, P=0.147]
  • 7. what can we automate?
  • 8. how does this work? [Diagram: abstracts from PubMed search (unlabeled data U); doctor conducting review (expert); manually screened abstracts (labeled data L); learner (SVM); predictive model]
  • 9. SVMs [Figure: scatter of two classes of points (o and x); labels: support vectors, margin]
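As a rough illustration of the SVM slide above, here is a minimal sketch of a linear soft-margin SVM trained by full-batch subgradient descent on the regularized hinge loss. The toy 2-D points, the labels (+1 for "x", -1 for "o"), and the hyperparameters `lam`, `lr`, and `epochs` are all assumptions made for illustration; this is not the implementation used in the talk.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=5000):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating line."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical toy data: "o" points (label -1) and "x" points (label +1).
X = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
print([predict(w, b, xi) for xi in X])
```

The points whose margin stays close to 1 after training are the ones that determine the boundary, which is the role the slide assigns to support vectors.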
  • 10. bag of words. Example text: "I am a Nigerian prince writing to you about an inheritance..." Vocabulary: ... dinner about prince call ... work nigerian yesterday office inheritance ... Vector: ... 0 1 1 0 ... 0 1 0 0 1 ... Figure 1.4: The (binary) Bag-of-Words (BoW) representation.
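The binary bag-of-words encoding on the slide above can be sketched as follows. The vocabulary here contains only the nine words visible on the slide (the elided entries are left out), and the lowercase letters-only tokenizer is an assumption; real pipelines usually build the vocabulary from the whole corpus.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def binary_bow(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


# Only the vocabulary words that are visible on the slide.
vocabulary = ["dinner", "about", "prince", "call", "work",
              "nigerian", "yesterday", "office", "inheritance"]
email = "I am a Nigerian prince writing to you about an inheritance..."

vector = binary_bow(email, vocabulary)
print(vector)  # [0, 1, 1, 0, 0, 1, 0, 0, 1]
```

Word order and word counts are discarded; each document becomes a fixed-length vector that a classifier such as an SVM can consume.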
  • 11. special considerations for the case of systematic reviews
      • class imbalance
          – far fewer relevant than irrelevant abstracts
          – asymmetric costs: sensitivity more important than specificity
      • reviewer time is scarce and expensive
          – better models, fewer labels: active learning and dual supervision
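Two of the points above lend themselves to a small sketch: why accuracy misleads under class imbalance, and how active learning might choose the next abstract to label. The labels, the classifier scores, and the uncertainty-sampling rule (pick the abstract closest to the decision boundary) are illustrative assumptions, not the specific strategy used in the talk.

```python
def sensitivity_specificity(y_true, y_pred):
    """Labels: 1 = relevant, 0 = irrelevant. Assumes both classes occur."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def most_uncertain(scores):
    """Index of the unlabeled item whose score is closest to the boundary (0)."""
    return min(range(len(scores)), key=lambda i: abs(scores[i]))


# Hypothetical imbalanced review: 2 relevant abstracts out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10  # a lazy model that rejects everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(accuracy, sens, spec)  # 0.8 0.0 1.0

# Hypothetical signed classifier scores for four unlabeled abstracts.
scores = [2.3, -0.1, -1.7, 0.8]
print(most_uncertain(scores))  # 1
```

The lazy model scores 80% accuracy while missing every relevant abstract, which is why sensitivity is the metric to watch here. Asking the reviewer to label the most uncertain abstract first is one common way to spend scarce labeling time.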
  • 12. how do we do? “Towards Modernizing the Systematic Review Pipeline: Efficient Updating via Data Mining” Genetics in Medicine 2012
  • 13. beyond citation screening. Pipeline: (1) formulate question, protocol & query; (2) search database (PubMed); (3) screen retrieved citations; (4) extract data (treatment/outcome table with cells a, b, c, d); (5) synthesize extracted data. [Figure: forest plot of included studies, odds ratio on log scale; overall I^2=19%, P=0.147]
  • 14. beyond citation screening. Pipeline: (1) formulate question, protocol & query; (2) search database (PubMed); (3) screen retrieved citations; (4) extract data (treatment/outcome table with cells a, b, c, d); (5) synthesize extracted data. [Figure: forest plot of included studies, odds ratio on log scale; overall I^2=19%, P=0.147]
  • 15. Questions? byron_wallace@brown.edu http://www.cebm.brown.edu/software www.cebm.brown.edu/byron