Crowdsourcing to structure biological
           knowledge
                    Andrew Su, Ph.D.
      Department of Molecular and Experimental Medicine
               The Scripps Research Institute

                         ISI, USC

                    August 16, 2012
2
Human genetics underlies human health
Molecular understanding of (“gene annotation”):
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …

~3 billion bases, ~23,000 genes

Molecular diagnostics & therapeutics
3
Structured gene annotations enable computation



         Structured annotations
4
Few genes are well annotated


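[Figure: per-gene counts of PubMed references and Gene Ontology (GO) annotations, with genes sorted by decreasing counts. Labeled genes at the head of the distribution: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE. Plot annotations: 59% and 38% of the 23,278 protein-coding genes.]

Data: NCBI gene2pubmed, August 2010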
5
Biocuration is a key annotation bottleneck


[Figure: number of PubMed-indexed articles over time, 1979 to 2009; y-axis from 0 to 1,000,000]
6




311,696 articles (1.5% of PubMed)
have been cited by GO annotations
7




    Sooner or later, the
 research community will
need to be involved in the
annotation effort to scale
   up to the rate of data
        generation.
8
The Long Tail is a prolific source of content


[Figure: content produced per contributor, with contributors sorted; regions labeled “Short Head” and “Long Tail”]

                     Short Head          Long Tail
News:                Newspapers          Blogs
Video:               TV/Hollywood        YouTube
Product reviews:     Consumer Reports    Amazon reviews
Food reviews:        Food critics        Yelp
Talent judging:      Olympics            American Idol
9




  We can harness the
Long Tail of scientists
to directly participate in
  the gene annotation
        process.
10
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
11
10,000 gene “stubs” within Wikipedia

[Figure labels: Utility, Users, Contributors]

Labeled elements of a Gene Wiki stub:
• Gene summary
• Protein structure
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases

Huss, PLoS Biol, 2008
12
 Gene Wiki has a critical mass of readers
                                         Total: 4.0 million views / month




Huss, PLoS Biol, 2008; Good, NAR, 2011
13
 Gene Wiki has a critical mass of editors



[Figure: two panels, Editors (editor count) and Edits (edit count)]

Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
14
A review article for every gene is powerful




      Reelin: 68 editors, 543 edits since July 2002
      Heparin: 175 editors, 320 edits since June 2003
      AMPK: 44 editors, 84 edits since March 2004
      RNAi: 232 editors, 708 edits since October 2002
                                          References to the literature
         Hyperlinks to related concepts
15
Filtering, extracting, and summarizing PubMed



Documents




 Concepts
16
Document- and concept-centric text mining
                          Predicate

                Subject               Object
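
The subject-predicate-object pattern named on this slide can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the `Triple` type, the `index_by_subject` helper, and the predicate name `participates_in` are all hypothetical, and the identifiers are borrowed from the text mining slide only as sample values.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One assertion: subject --predicate--> object (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


def index_by_subject(triples):
    """Group (predicate, object) pairs under each subject for concept-centric lookup."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.predicate, t.obj))
    return dict(index)


# Illustrative assertion; the predicate name is an assumption.
example = [Triple("EntrezGene:334", "participates_in", "GO:0006897")]
by_subject = index_by_subject(example)
```

Indexing by subject gives a concept-centric view; grouping the same triples by their source document would give the document-centric view.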
17
Simple text mining for gene annotations

                                          NCBI Entrez Gene: 334



                           Gene Wiki
                           mapping


          Wikilink                           Candidate
                                             assertion

                                          GO:0006897



                           GO exact
                            match
            6319 novel Gene Ontology annotations
            2147 novel Disease Ontology annotations
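
The flow on this slide (Gene Wiki page mapped to a gene, wikilinks matched exactly against GO term names, candidate assertions emitted) might look roughly like the sketch below. It is an assumption-laden toy: the function names, the case-insensitive comparison, the sample wikilinks, and the one-entry label table are illustrative and are not taken from the actual pipeline.

```python
def candidate_assertions(gene_id, wikilinks, go_labels):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term label.

    Matching is case-insensitive after trimming whitespace (an assumption).
    """
    by_label = {label.lower(): go_id for go_id, label in go_labels.items()}
    seen = set()
    results = []
    for link in wikilinks:
        go_id = by_label.get(link.strip().lower())
        if go_id is not None and go_id not in seen:
            seen.add(go_id)
            results.append((gene_id, go_id))
    return results


def novel_only(candidates, existing):
    """Keep only candidates that are not already in the existing annotation set."""
    return [c for c in candidates if c not in existing]


# Toy inputs: a single GO label and a hypothetical list of wikilinks.
go_labels = {"GO:0006897": "endocytosis"}
wikilinks = ["Endocytosis", "Protein", "endocytosis"]
candidates = candidate_assertions("EntrezGene:334", wikilinks, go_labels)
```

Filtering the candidates through `novel_only` against existing annotations is one plausible way to arrive at counts of "novel" annotations like those reported above.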
18
Gene Wiki+ for integrative queries


                      mwsync




                http://genewikiplus.org
19
Dynamic queries across genes, diseases, SNPs
20
21




TOP 100
GENES
22
Gene Wiki+ for integrative queries


                     mwsync

                                OMIM
                              PharmGKB



                   {{#ask:
                   [[Category:Human_proteins]]
                   [[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]
                   [[HasSNP::
                   …
                   <q>[[is_associated_with::
                http://genewikiplus.org
23
Gene Wiki+ for integrative queries


                      mwsync

                                   OMIM
                                 PharmGKB




                http://genewikiplus.org
24
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
25
Not just the biomedical literature…
26
BioGPS aggregates gene-centric information




                  http://biogps.org
27
The plugin interface is simple and universal


PubMed
   http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}


STRING
   http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}


 KEGG
   http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}




           URL template
                                        Rendered URL
              Gene entity
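
Combining a URL template with a gene entity to get a rendered URL can be sketched as simple placeholder substitution. The `render_url` helper below is hypothetical and is not BioGPS code; the template is the KEGG one shown above, and the TP53 record is sample data chosen for illustration.

```python
import re

# Matches placeholders of the form {{FieldName}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(template, gene):
    """Replace each {{Field}} in the template with the matching gene attribute."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene entity has no field {key!r}")
        return str(gene[key])

    return PLACEHOLDER.sub(substitute, template)


kegg_template = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"
gene = {"Symbol": "TP53", "EntrezGene": 7157}  # illustrative gene entity
rendered = render_url(kegg_template, gene)
```

Because a plugin is only a template string, registering a new resource in this scheme needs no code on the resource's side, which is what makes the interface simple and universal.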
28
The plugin interface is simple and universal
29
The plugin interface is simple and universal
30
The plugin interface is simple and universal
31
The plugin interface is simple and universal
32
The plugin interface is simple and universal




              Total of 389 gene-centric online
          databases registered as BioGPS plugins
33
BioGPS has a critical mass of users
           Daily pageviews




  • > 4100 registered users              Top 10 organizations
  • 4000 unique visitors per week   1.     Harvard     6. Cambridge
                                    2.     NIH         7. U Penn
  • 40,000 page views per week
                                    3.     UCSD        8. Stanford
                                    4.     Scripps     9. Wash U
                                    5.     MIT         10. UNC
34
All resources should provide RDF…
35
Mining structured content from HTML
36
Defining a data extraction template
        TP53   TNF   APOE   IL6   VEGF EGFR TGFB1   …
  …
37
The BioGPS Semantic Annotator




              http://50.112.124.237
38
All resources should provide flat files…
39
From crowdsourcing to structured data



                   The Gene Wiki




                Biological Games
40



Seven million human hours




                            http://www.flickr.com/photos/archana3k1/4124330493/
41



Twenty million human hours




                             http://www.flickr.com/photos/ableman/2171326385/
42
    150 billion human hours
              per year




                              http://www.flickr.com/photos/rvp-cw/6243289302/
43
Using games to fold proteins



      Fold.it players have successfully:
      • Outperformed state-of-the-art protein
        folding algorithms (Cooper, Nature, 2010)
      • Solved a previously intractable crystal
        structure (Khatib, Nat Struct Mol Biol, 2011)
      • Designed an improved protein folding
        algorithm (Khatib, PNAS, 2011)
      • Improved the enzyme activity of a de novo
        designed enzyme (Eiben, Nat Biotechnol, 2011)
44
Using games to fold RNAs




              http://eterna.cmu.edu/
45
Using games to align sequences




              http://phylo.cs.mcgill.ca
46
Using games to annotate gene-disease links

                    hurry!

                                        then on to the next question

       If it’s ‘right’, you get points




                      Click the related disease




                             http://genegames.org
47
Dizeez players seem pretty smart…

  In total:
  • 207 unique gamers
  • 1045 games played
  • 8525 guesses

# Occurrences   Gene    Disease              PubMed   OMIM   PharmGKB   Gene Wiki

      7         GAST    gastrinoma
      7         RBP3    retinoblastoma
      7         SSX1    synovial sarcoma
      6         TG      Graves' disease
      6         CRYGC   cataract
      6         SOX8    mental retardation
      6         WRN     Werner syndrome
      6         ABL1    leukemia
      6         MLL3    leukemia
      6         SNAI2   breast carcinoma
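
An occurrence table like the one above could be produced by tallying repeated gene-disease guesses across games. The sketch below assumes a hypothetical guess log and a hypothetical `tally_guesses` helper with a made-up `min_count` threshold; the sample guesses and their counts are toy data, not the Dizeez results.

```python
from collections import Counter


def tally_guesses(guesses, min_count=2):
    """Count identical (gene, disease) guesses and keep those seen at least min_count times."""
    counts = Counter(guesses)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


# Toy guess log (not real game data).
guesses = [
    ("GAST", "gastrinoma"),
    ("WRN", "Werner syndrome"),
    ("GAST", "gastrinoma"),
    ("GAST", "gastrinoma"),
    ("WRN", "Werner syndrome"),
    ("TG", "Graves' disease"),
]
ranked = tally_guesses(guesses)
```

Requiring agreement between several independent players is one simple way to filter out lucky or careless clicks before treating a pair as a candidate annotation.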
48
Dizeez players seem pretty smart…

  In total:
  • 207 unique gamers
  • 1045 games played
  • 8525 guesses

# Occurrences   Gene    Disease                  PubMed   OMIM   PharmGKB   Gene Wiki

      5         MECOM   sarcoma
      4         ATF7    cancer
      3         ABCB5   acute myeloid leukemia
      3         SART1   glioblastoma
      3         NCK1    leukemia
      3         NEK1    cancer
49
GenESP: Two-player annotation games
50
COMBO: Genomic predictors for disease


                          make predictions on
  cancer   normal           new samples


                     find patterns
                                         cancer

                                         normal
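
The "find patterns, then make predictions on new samples" loop can be illustrated with a very small classifier. The sketch below uses a nearest-centroid rule as a stand-in; nothing in the slides says COMBO uses this technique, and the two-feature "expression" values are invented toy numbers.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(samples, labels):
    """Find patterns: compute one centroid per class label."""
    grouped = {}
    for row, label in zip(samples, labels):
        grouped.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in grouped.items()}


def predict(model, sample):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(sample, center))

    return min(model, key=lambda label: sq_dist(model[label]))


# Toy training data: two made-up features per sample.
train = [[5.0, 1.0], [6.0, 0.5], [1.0, 4.0], [0.5, 5.0]]
labels = ["cancer", "cancer", "normal", "normal"]
model = fit(train, labels)
prediction = predict(model, [5.5, 1.2])
```

In a game setting, the interesting question is which features to feed such a predictor, which is the kind of choice the slide title suggests players help make.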
51
COMBO: Genomic predictors for disease
52
COMBO: Genomic predictors for disease
53
COMBO: Genomic predictors for disease
54
COMBO: Genomic predictors for disease
55
COMBO: Genomic predictors for disease
56
COMBO: Genomic predictors for disease
57




  We can harness the
Long Tail of scientists
to directly participate in
  the gene annotation
        process.
58
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Erik Clarke
Ian Macleod
Ben Good
Chunlei Wu
Salvatore Loguercio

Summer internships for students!

Recruiting graduate students in quantitative biology! See http://education.scripps.edu/

Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)


Editor's Notes

  • #5 We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% of genes have 5 or fewer references; 38% have one or no references.
  • #7 Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles have relevance to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  • #11 For each resource: briefly describe the unstructured resource, then describe the structuring approach.
  • #16 Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  • #25 For each resource: briefly describe the unstructured resource, then describe the structuring approach.
  • #29 MODs and portals
  • #30 Genetics resources
  • #31 Literature resources
  • #32 Protein resources
  • #33 Pathway and expression databases
  • #40 For each resource: briefly describe the unstructured resource, then describe the structuring approach.
  • #41 Empire State Building