Translating unstructured, crowdsourced content
              into structured data


                  Andrew Su, Ph.D.
              The Scripps Research Institute

                   NCBO Webinar

                 February 20, 2013
2
Human genetics underlies human health
                                     Molecular understanding of:
                                     • Biological function
                                     • Genetic variation
                                     • Mutation
                                     • Deletion
                                     • Amplification
                                     • …
Structured gene annotations

~3 billion bases      ~20,000 genes

Molecular diagnostics & therapeutics
3
Structured gene annotations enable computation



        Structured gene annotations
4
Few genes are well annotated

[Figure: GO annotation counts per gene; genes sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Plot annotations: 65%, 41%, 20,473 protein-coding genes.]
Data: NCBI, February 2013
5
Few genes are well annotated
[Figure: GO annotation counts per gene, + electronic annotation (IEA); genes sorted by decreasing counts.]
Data: NCBI, February 2013
6
Few genes are well annotated
[Figure: GO annotation counts per gene, + electronic annotation (IEA), Biological Process only; genes sorted by decreasing counts.]
Data: NCBI, February 2013
7




311,696 articles (1.5% of PubMed)
have been cited by GO annotations
8




    Sooner or later, the
 research community will
need to be involved in the
annotation effort to scale
   up to the rate of data
        generation.
9




    Crowdsourcing
  empowers the entire
 scientific community to
directly participate in the
gene annotation process.
10
From crowdsourcing to structured data



                   The Gene Wiki




                GeneGames.org
11
 10,000 gene “stubs” within Wikipedia



• Gene summary
• Protein structure
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases



Huss, PLoS Biol, 2008
12
 Gene Wiki has a critical mass of readers
                                         Total: 4.0 million views / month




Huss, PLoS Biol, 2008; Good, NAR, 2011
13
 Gene Wiki has a critical mass of editors



[Figure: editor count (Editors) and edit count (Edits)]




                  Increase of ~10,000 words / month from >1,000 edits
                               Currently 1.42 million words
                      Approximately equal to 230 full-length articles
Good, NAR, 2011
14
A review article for every gene is powerful




      Reelin: 68 editors, 543 edits since July 2002
      Heparin: 175 editors, 320 edits since June 2003
      AMPK: 44 editors, 84 edits since March 2004
      RNAi: 232 editors, 708 edits since October 2002
                                          References to the literature
         Hyperlinks to related concepts
15
Filtering, extracting, and summarizing PubMed



Documents




 Concepts
16
Document- and concept-centric text mining
Subject  --(Predicate)-->  Object
17
 Simple text mining for gene annotations

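NCBI Entrez Gene: 334  --(Gene Wiki mapping)-->  Gene Wiki article
Gene Wiki article      --(Wikilink)-->           linked Wikipedia article
Linked article         --(GO exact match)-->     GO:0006897
Candidate assertion:   NCBI Entrez Gene: 334  <->  GO:0006897

6319 novel Gene Ontology annotations
2147 novel Disease Ontology annotations
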

Good, BMC Genomics, 2011.
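
The mapping on this slide can be sketched roughly as follows. This is a minimal illustration, not the published pipeline: the wikitext, the one-entry GO label table, and the function name `candidate_assertions` are hypothetical, and "exact match" is assumed here to mean a case-insensitive comparison of the wikilink target with a GO term label. Only the Entrez Gene ID 334 and GO:0006897 come from the slide.

```python
import re

# Hypothetical toy inputs. A real run would use Gene Wiki articles and the
# full Gene Ontology; GO:0006897 (endocytosis) is the ID shown on the slide.
GO_LABELS = {"endocytosis": "GO:0006897"}
GENE_WIKI_TEXT = {
    334: "This protein takes part in [[endocytosis]] and binds [[Copper|copper ions]]."
}

# Captures the link target of [[Target]], [[Target|label]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def candidate_assertions(entrez_id, wikitext, go_labels):
    """Return (entrez_id, go_id) pairs for wikilinks whose target matches a GO label."""
    found = []
    for target in WIKILINK.findall(wikitext):
        go_id = go_labels.get(target.strip().lower())
        if go_id is not None and (entrez_id, go_id) not in found:
            found.append((entrez_id, go_id))
    return found

assertions = candidate_assertions(334, GENE_WIKI_TEXT[334], GO_LABELS)
print(assertions)
```

Each returned pair is only a candidate: the wikilink shows that the article mentions the concept, not that the gene is annotated to it.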
18
 Gene Wiki content improves enrichment analysis



[Figure: y-axis p-value (PubMed + GW), x-axis p-value (PubMed only); regions marked "More significant PubMed only" and "More significant PubMed + GW"; labeled point: Muscle contraction]
Good, BMC Genomics, 2011.
19
 Gene Wiki+ for integrative queries


                                        mwsync




Good, J Biomed Semantics, 2012.
                                  http://genewikiplus.org
20
 Dynamic queries across genes, diseases, SNPs




Good, J Biomed Semantics, 2012.
21
 Gene Wiki+ for integrative queries


                                       mwsync

                                                  OMIM
                                                PharmGKB



                                     {{#ask:
                                     [[Category:Human_proteins]]
                                     [[is_associated_with::<q>[[Category:Breast_cancer]]</q>]]
                                     [[HasSNP::
                                     …
                                     <q>[[is_associated_with::
                                  http://genewikiplus.org
Good, J Biomed Semantics, 2012.
22
 Gene Wiki+ for integrative queries


                                        mwsync

                                                     OMIM
                                                   PharmGKB




Good, J Biomed Semantics, 2012.
                                  http://genewikiplus.org
23
Wikidata


           Provide a database of the
            world's knowledge that
               anyone can edit
                       - Denny Vrandečić
24
Wikidata
Reelin (Q414043)

Property        Label            Value                        Item
Property:P31    is a             Protein                      Q8054
                                 Glycoprotein                 Q187126
Property:P128   regulates        Neural development           Q1345738
Property:P129   interacts with   VLDL receptor                Q1979313
                                 Amyloid precursor protein    Q423510

                http://www.wikidata.org/wiki/Q414043
25
Wikidata
Q414043

Property:P31    Q8054, Q187126
Property:P128   Q1345738
Property:P129   Q1979313, Q423510

        http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
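
A request like the one above can be built and its result read along these lines. This is a sketch under stated assumptions: `format=json` is added to the slide's URL (it is a standard MediaWiki API parameter), the helper names are hypothetical, and the response below is a hand-written, heavily abridged stand-in modeled on the JSON layout of `wbgetentities`, filled with the item IDs from the slide rather than fetched from the live service.

```python
import json
from urllib.parse import urlencode

def wbgetentities_url(entity_id, language="en", base="http://wikidata.org/w/api.php"):
    """Build a wbgetentities request URL like the one on the slide, asking for JSON."""
    query = urlencode({"action": "wbgetentities", "ids": entity_id,
                       "languages": language, "format": "json"})
    return f"{base}?{query}"

def claim_targets(entity, prop):
    """List the item IDs that an entity's claims for one property point to."""
    targets = []
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets

# Hand-written, abridged sample (an assumption, not a recorded API response).
SAMPLE_RESPONSE = json.loads("""
{"entities": {"Q414043": {
  "labels": {"en": {"language": "en", "value": "Reelin"}},
  "claims": {
    "P31": [
      {"mainsnak": {"datavalue": {"value": {"id": "Q8054"}}}},
      {"mainsnak": {"datavalue": {"value": {"id": "Q187126"}}}}
    ]
  }
}}}
""")

url = wbgetentities_url("Q414043")
reelin = SAMPLE_RESPONSE["entities"]["Q414043"]
print(url)
print(reelin["labels"]["en"]["value"], claim_targets(reelin, "P31"))
```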
26
Wikidata




  http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
27
Wikidata




  http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
28
From crowdsourcing to structured data



                   The Gene Wiki




                GeneGames.org
29
Not just the biomedical literature…
30
 BioGPS aggregates gene-centric information




                                           http://biogps.org
Wu, NAR, 2013; Wu, Genome Biology, 2009.
31
 The plugin interface is simple and universal


 PubMed   http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}}
 STRING   http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}}
 KEGG     http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}

 URL template + Gene entity → Rendered URL
                                   Gene entity
Wu, NAR, 2013; Wu, Genome Biology, 2009.
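
The template substitution described on this slide can be sketched as below. This is an illustration only, not BioGPS code: the function name `render_plugin_url` and the gene record are hypothetical, and only the KEGG template is used because it is the one shown in full on the slide.

```python
import re
from urllib.parse import quote

# Matches placeholders such as {{Symbol}} or {{EntrezGene}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render_plugin_url(template, gene):
    """Fill each {{Field}} placeholder in a URL template from a gene record."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError(f"gene record has no field {field!r}")
        return quote(str(gene[field]), safe="")
    return PLACEHOLDER.sub(substitute, template)

KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"

# Illustrative gene entity; the field names follow the placeholders on the slide.
gene = {"Symbol": "CDK2", "EntrezGene": 1017, "EnsemblGene": "ENSG00000123374"}

rendered = render_plugin_url(KEGG_TEMPLATE, gene)
print(rendered)
```

Because a plugin is just a URL with placeholders, registering a new resource needs no code on the resource's side, which is what makes the interface universal.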
36
The plugin interface is simple and universal




              Total of 389 gene-centric online
          databases registered as BioGPS plugins
37
 BioGPS has a critical mass of users
[Figure: daily pageviews]

          • > 4100 registered users
          • 4000 unique visitors per week
          • 40,000 page views per week

          Top 10 organizations
           1. Harvard
           2. NIH
           3. UCSD
           4. Scripps
           5. MIT
           6. Cambridge
           7. U Penn
           8. Stanford
           9. Wash U
          10. UNC
Wu, NAR, 2013; Wu, Genome Biology, 2009.
38
All resources should provide RDF…
39
Mining structured content from HTML
40
Defining a data extraction template
        TP53   TNF   APOE   IL6   VEGF EGFR TGFB1   …
  …
41
The BioGPS Semantic Annotator




           http://54.244.135.254:8000/
42
From crowdsourcing to structured data



                   The Gene Wiki




                GeneGames.org
43



Seven million human hours




                            http://www.flickr.com/photos/archana3k1/4124330493/
44



Twenty million human hours




                             http://www.flickr.com/photos/ableman/2171326385/
45
    150 billion human hours
              per year




                              http://www.flickr.com/photos/rvp-cw/6243289302/
46
Using games to fold proteins



      Fold.it players have successfully:
      • Outperformed state-of-the-art protein
        folding algorithms (Cooper, Nature, 2010)
      • Solved a previously intractable crystal
        structure (Khatib, Nat Struct Mol Biol, 2011)
      • Designed an improved protein folding
        algorithm (Khatib, PNAS, 2011)
      • Improved enzyme activity of de novo
        designed enzyme (Eiben, Nat Biotechnol, 2011)
47
Using games to fold RNAs




              http://eterna.cmu.edu/
48
 Using games to align sequences




                            http://phylo.cs.mcgill.ca
Kawrykow, PLOS ONE, 2012.
49
Using games to annotate genes?




              http://genegames.org
50
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
51
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Lipoprotein glomerulopathy
            Sea-blue histiocyte disease
            Hyperlipoproteinemia, type III
            Macular degeneration, age-related
            Myocardial infarction susceptibility
52
No good gene-disease annotation database
              Query: Apolipoprotein E




           ? Alzheimer's disease (AD)
           ? Lipoprotein glomerulopathy
           ? Sea-blue histiocyte disease
             Hyperlipoproteinemia, type III
           ? Macular degeneration, age-related
           ? Myocardial infarction susceptibility
             HIV
             Psoriasis
             Vascular Diseases
53
No good gene-disease annotation database
             Query: Apolipoprotein E




            Alzheimer's disease (AD)
            Neuropsychological Tests
            Cognition Disorders
            Dementia
            Cognition
            Disease Progression
            Cardiovascular Diseases
            Coronary Disease
            Diabetes Mellitus, Type 2
            Memory Disorders
            Memory
            Coronary Artery Disease
            Hypertension
            Mental Status Schedule
            Psychiatric Status Rating Scales
            Hyperlipidemias
            Atrophy
            Dementia, Vascular
            Parkinson Disease
            Brain Injuries
            Myocardial Infarction
            …

            477 diseases!
54
Play Dizeez to annotate gene-disease links
            1. Read the clue (gene)
            2. Click the related disease (only one is “right”)
            3. If it's “right”, you get points
            4. Then on to the next question…
            5. Hurry!
            6. Play to win!
55
Dizeez players seem pretty smart…

 In total (since Dec 2011):
 • 230 unique gamers
 • 1045 games played
 • 8525 guesses

# Occurrences   Gene     Disease              Gene Wiki   OMIM   PharmGKB   PubMed
11              NBPF3    neuroblastoma
11              SOX8     mental retardation
 9              ABL1     leukemia
 9              SSX1     synovial sarcoma
 8              APC      colorectal cancer
 8              FES      sarcoma
 8              RBP3     retinoblastoma
 8              GAST     gastrinoma
 8              DCC      colorectal cancer
 8              MAP3K5   cancer
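
A tally like the one in this table could be produced along the following lines. This is a hedged sketch: the guess log, the single `known_pairs` set (standing in for the Gene Wiki, OMIM, PharmGKB and PubMed columns), the `min_count` threshold, and the function name are all hypothetical, and the counts are toy values rather than the game's data.

```python
from collections import Counter

def summarize_guesses(guesses, known_pairs, min_count=2):
    """Count repeated (gene, disease) guesses and flag pairs already in a reference set."""
    counts = Counter(guesses)
    rows = [
        (gene, disease, n, (gene, disease) in known_pairs)
        for (gene, disease), n in counts.items()
        if n >= min_count
    ]
    # Most frequent pairs first; ties broken alphabetically for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))

# Toy guess log (invented counts).
guesses = (
    [("ABL1", "leukemia")] * 3
    + [("APC", "colorectal cancer")] * 2
    + [("FES", "sarcoma")]
)
known_pairs = {("ABL1", "leukemia")}

rows = summarize_guesses(guesses, known_pairs)
for gene, disease, n, known in rows:
    print(n, gene, disease, "known" if known else "candidate")
```

Pairs that many players agree on but that no reference source lists are the interesting output: they are candidate new annotations.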
56
Using games to predict phenotype from genotype?




               http://genegames.org
57
Classification problems in genome biology

   Training data: 100s samples (cancer / normal) × 100,000s features
   → find patterns (SVM, Neural networks, Naïve Bayes, KNN, …)
   → Classify new samples: cancer / normal
58
Random forests
   Training data: 100s samples (cancer / normal) × 100,000s features
   → Sample subset of cases and features
   → Train decision tree
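
The loop on this slide (sample a subset of cases and features, train a tree, repeat) can be sketched in plain Python. This is a deliberately simplified illustration under stated assumptions: each "tree" is a one-level decision stump rather than a full decision tree, the data are a tiny invented matrix with 4 features instead of 100,000s, and all names and parameters are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature_ids):
    """Find the single-feature threshold split with the fewest training errors."""
    majority = Counter(y).most_common(1)[0][0]
    # Fallback when no split is possible: always predict the majority label.
    best = (len(y) + 1, feature_ids[0], float("inf"), majority, majority)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
            if errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def train_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap samples of cases and random subsets of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]         # sample cases with replacement
        feats = rng.sample(range(len(X[0])), n_features)  # sample a subset of features
        forest.append(train_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Classify a new sample by majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Invented toy expression matrix: rows are samples, columns are features.
X = [[5.1, 4.8, 6.0, 5.5], [4.9, 5.2, 5.8, 6.1], [5.3, 5.0, 6.2, 5.9], [5.0, 5.1, 5.9, 6.0],
     [1.0, 1.2, 0.9, 1.1], [1.1, 0.8, 1.0, 0.9], [0.9, 1.0, 1.2, 1.0], [1.2, 1.1, 0.8, 1.2]]
y = ["cancer"] * 4 + ["normal"] * 4

forest = train_forest(X, y)
print(predict(forest, [5.0, 5.0, 6.0, 6.0]), predict(forest, [1.0, 1.0, 1.0, 1.0]))
```

The feature-sampling line is the hook that later slides vary: the random draw is replaced by a draw guided by prior knowledge.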
59
Random forests


   Training data: 100s samples (cancer / normal) × 100,000s features
60
Random forests

   Training data: 100s samples (cancer / normal) × 100,000s features
   → Classify new samples: cancer / normal

   How to interject biological knowledge?
61
Network-guided forests




                         Dutkowski & Ideker (2011). PLoS Computational Biology
62
Network-guided forests
   Training data: 100s samples (cancer / normal) × 100,000s features
   → Sample features by PPI network
   → Train decision tree
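
One way to read "sample features by PPI network" is to draw a small connected set of genes instead of an independent random subset. The sketch below is a loose interpretation and not the algorithm of Dutkowski & Ideker: the growth rule, the function name, and the toy network (with placeholder gene names and invented edges) are all assumptions.

```python
import random

def sample_connected_features(network, k, rng):
    """Grow a set of up to k genes outward from a random seed gene along network edges."""
    seed_gene = rng.choice(sorted(network))
    chosen = [seed_gene]
    frontier = set(network[seed_gene])
    while len(chosen) < k and frontier:
        next_gene = rng.choice(sorted(frontier))
        chosen.append(next_gene)
        frontier |= set(network.get(next_gene, ()))
        frontier -= set(chosen)
    return chosen

# Hypothetical undirected toy network as an adjacency list.
TOY_PPI = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneA"],
    "geneC": ["geneA", "geneD"],
    "geneD": ["geneC"],
    "geneE": [],
}

rng = random.Random(1)
features = sample_connected_features(TOY_PPI, 3, rng)
print(features)
```

Each tree would then be trained only on the features returned by one such draw, so that genes used together in a tree are neighbors in the network.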
63
Human-guided forests
   Training data: 100s samples (cancer / normal) × 100,000s features
   → Sample features by human intelligence
   → Train decision tree
64
65
The Cure: Genomic predictors for disease
71
Human-guided forests

                       Classify new samples: cancer / normal
72
“Critical Assessment”-style challenge
73
Results


• 214 registered players
   – 50% declared knowledge of cancer biology
   – 40% self-identified as having a Ph.D.
• Prediction results
   – 70% correct on survival concordance
     index
   – Best scoring model was 76%
   – Player registrations still increasing!
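
For readers unfamiliar with the metric, a concordance index for survival data can be computed as sketched below. This follows Harrell's common definition (the fraction of comparable patient pairs in which the patient with the earlier observed event has the higher predicted risk); the slide does not say which variant the challenge used, so treat that choice, the function name, and the toy data as assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index: 1.0 is perfect ranking, 0.5 is random."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: survival times, event indicators (1 = death observed, 0 = censored),
# and predicted risk scores (higher means shorter expected survival).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risk_scores)
print(c_index)
```

In this toy case five pairs are comparable and four are ranked correctly, giving 0.8.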
74




    Crowdsourcing
  empowers the entire
 scientific community to
directly participate in the
gene annotation process.
75
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project

Group members
Katie Fisch
Max Nanis
Ben Good
Chunlei Wu
Salvatore Loguercio

Key group alumni
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco




                                               Contact
                                           http://sulab.org
                                          asu@scripps.edu
                                            @andrewsu
                                            +Andrew Su

                                        Funding and Support



                                   (BioGPS: GM83924, Gene Wiki: GM089820)

More Related Content

What's hot

Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
Sage Base
 
Friend NIEHS 2013-03-01
Friend NIEHS 2013-03-01Friend NIEHS 2013-03-01
Friend NIEHS 2013-03-01
Sage Base
 
Infographic CV MB Long
Infographic CV MB LongInfographic CV MB Long
Infographic CV MB Long
Metin Bilgin
 
Stephen Friend Cytoscape Retreat 2011-05-20
Stephen Friend Cytoscape Retreat 2011-05-20Stephen Friend Cytoscape Retreat 2011-05-20
Stephen Friend Cytoscape Retreat 2011-05-20
Sage Base
 
Church nhgri 2012
Church nhgri 2012Church nhgri 2012
Church nhgri 2012
Deanna Church
 
Stephen Friend Haas School of Business 2012-03-05
Stephen Friend Haas School of Business 2012-03-05Stephen Friend Haas School of Business 2012-03-05
Stephen Friend Haas School of Business 2012-03-05
Sage Base
 
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
PerkinElmer, Inc.
 
Clinical grade ex vivo expanded human natural killer (NK) cells
Clinical grade ex vivo expanded human natural killer (NK) cellsClinical grade ex vivo expanded human natural killer (NK) cells
Clinical grade ex vivo expanded human natural killer (NK) cells
lifextechnologies
 
Games for improving human phenotype prediction
Games for improving human phenotype predictionGames for improving human phenotype prediction
Games for improving human phenotype prediction
Benjamin Good
 
useR2011 - Huber
useR2011 - HuberuseR2011 - Huber
useR2011 - Huber
rusersla
 

What's hot (10)

Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23
 
Friend NIEHS 2013-03-01
Friend NIEHS 2013-03-01Friend NIEHS 2013-03-01
Friend NIEHS 2013-03-01
 
Infographic CV MB Long
Infographic CV MB LongInfographic CV MB Long
Infographic CV MB Long
 
Stephen Friend Cytoscape Retreat 2011-05-20
Stephen Friend Cytoscape Retreat 2011-05-20Stephen Friend Cytoscape Retreat 2011-05-20
Stephen Friend Cytoscape Retreat 2011-05-20
 
Church nhgri 2012
Church nhgri 2012Church nhgri 2012
Church nhgri 2012
 
Stephen Friend Haas School of Business 2012-03-05
Stephen Friend Haas School of Business 2012-03-05Stephen Friend Haas School of Business 2012-03-05
Stephen Friend Haas School of Business 2012-03-05
 
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
SBS 2011: Sensitive Cell-based and Biochemical Assays Using Epic(R) Label-fre...
 
Clinical grade ex vivo expanded human natural killer (NK) cells
Clinical grade ex vivo expanded human natural killer (NK) cellsClinical grade ex vivo expanded human natural killer (NK) cells
Clinical grade ex vivo expanded human natural killer (NK) cells
 
Games for improving human phenotype prediction
Games for improving human phenotype predictionGames for improving human phenotype prediction
Games for improving human phenotype prediction
 
useR2011 - Huber
useR2011 - HuberuseR2011 - Huber
useR2011 - Huber
 

Viewers also liked

Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...
Andrew Su
 
Using Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledgeUsing Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledge
Andrew Su
 
Centralized Model Organism Database (Biocuration 2014 poster)
Centralized Model Organism Database (Biocuration 2014 poster)Centralized Model Organism Database (Biocuration 2014 poster)
Centralized Model Organism Database (Biocuration 2014 poster)
Andrew Su
 
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
Andrew Su
 
Heart BD2K, Biocuration, and Citizen Science
Heart BD2K, Biocuration, and Citizen ScienceHeart BD2K, Biocuration, and Citizen Science
Heart BD2K, Biocuration, and Citizen Science
Andrew Su
 
Open biomedical knowledge using crowdsourcing and citizen science
Open biomedical knowledge using crowdsourcing and citizen scienceOpen biomedical knowledge using crowdsourcing and citizen science
Open biomedical knowledge using crowdsourcing and citizen science
Andrew Su
 
UCSD / DBMI seminar 2015-02-6
UCSD / DBMI seminar 2015-02-6UCSD / DBMI seminar 2015-02-6
UCSD / DBMI seminar 2015-02-6
Andrew Su
 
Citizen Science and Rare Disease Research
Citizen Science and Rare Disease ResearchCitizen Science and Rare Disease Research
Citizen Science and Rare Disease Research
Andrew Su
 

Viewers also liked (8)

Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...Wikipedia as an engine for scientific communication and collaboration at mass...
Wikipedia as an engine for scientific communication and collaboration at mass...
 
Using Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledgeUsing Citizen Science to organize biomedical knowledge
Using Citizen Science to organize biomedical knowledge
 
Centralized Model Organism Database (Biocuration 2014 poster)
Centralized Model Organism Database (Biocuration 2014 poster)Centralized Model Organism Database (Biocuration 2014 poster)
Centralized Model Organism Database (Biocuration 2014 poster)
 
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G...
 
Heart BD2K, Biocuration, and Citizen Science
Heart BD2K, Biocuration, and Citizen ScienceHeart BD2K, Biocuration, and Citizen Science
Heart BD2K, Biocuration, and Citizen Science
 
Open biomedical knowledge using crowdsourcing and citizen science
Open biomedical knowledge using crowdsourcing and citizen scienceOpen biomedical knowledge using crowdsourcing and citizen science
Open biomedical knowledge using crowdsourcing and citizen science
 
UCSD / DBMI seminar 2015-02-6
UCSD / DBMI seminar 2015-02-6UCSD / DBMI seminar 2015-02-6
UCSD / DBMI seminar 2015-02-6
 
Citizen Science and Rare Disease Research
Citizen Science and Rare Disease ResearchCitizen Science and Rare Disease Research
Citizen Science and Rare Disease Research
 

Similar to NCBO Webinar: Translating unstructured, crowdsourced content into structured data

Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org (Sanger)
Andrew Su
 
ISMB2012: The Gene Wiki: Crowdsourcing human gene annotation
ISMB2012: The Gene Wiki: Crowdsourcing human gene annotationISMB2012: The Gene Wiki: Crowdsourcing human gene annotation
ISMB2012: The Gene Wiki: Crowdsourcing human gene annotation
Andrew Su
 
ISB2012: The Gene Wiki: Crowdsourcing human gene annotation
ISB2012: The Gene Wiki: Crowdsourcing human gene annotationISB2012: The Gene Wiki: Crowdsourcing human gene annotation
ISB2012: The Gene Wiki: Crowdsourcing human gene annotation
Andrew Su
 
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.orgCrowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su
 
bioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics databioinformatics enabling knowledge generation from agricultural omics data
bioinformatics enabling knowledge generation from agricultural omics data
International Institute of Tropical Agriculture
 
Bio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challengesBio-ontologies in bioinformatics: Growing up challenges
Bio-ontologies in bioinformatics: Growing up challenges
Janna Hastings
 
Friend EORTC 2012-11-08
Friend EORTC 2012-11-08Friend EORTC 2012-11-08
Friend EORTC 2012-11-08
Sage Base
 
Text mining tools for semantically enriching scientific literature
Text mining tools for semantically enriching scientific literatureText mining tools for semantically enriching scientific literature
Text mining tools for semantically enriching scientific literature
Duncan Hull
 
Friend WIN Symposium 2012-06-28
Friend WIN Symposium 2012-06-28Friend WIN Symposium 2012-06-28
NCBO Webinar: Translating unstructured, crowdsourced content into structured data

  • 1. Translating unstructured, crowdsourced content into structured data. Andrew Su, Ph.D., The Scripps Research Institute. NCBO Webinar, February 20, 2013.
  • 2. Human genetics underlies human health. Molecular understanding of: biological function; genetic variation; mutation; deletion; amplification; … Diagram labels: ~3 billion bases; ~20,000 genes; structured gene annotations; molecular diagnostics & therapeutics.
  • 3. Structured gene annotations enable computation.
  • 4. Few genes are well annotated. [Plot: GO annotation counts per gene, genes sorted by decreasing counts, across 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Marked percentages: 65%, 41%.] Data: NCBI, February 2013.
  • 5. Few genes are well annotated. [Plot: GO annotation counts, plus electronic annotation (IEA); genes sorted by decreasing counts.] Data: NCBI, February 2013.
  • 6. Few genes are well annotated. [Plot: GO annotation counts, plus electronic annotation (IEA), Biological Process only; genes sorted by decreasing counts.] Data: NCBI, February 2013.
  • 7. 311,696 articles (1.5% of PubMed) have been cited by GO annotations.
  • 8. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
  • 9. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
  • 10. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org.
  • 11. 10,000 gene “stubs” within Wikipedia. Each stub shows: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases. Huss, PLoS Biol, 2008.
  • 12. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. Huss, PLoS Biol, 2008; Good, NAR, 2011.
  • 13. Gene Wiki has a critical mass of editors. [Plot: editor count and edit count.] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. Good, NAR, 2011.
  • 14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002. Heparin: 175 editors, 320 edits since June 2003. AMPK: 44 editors, 84 edits since March 2004. RNAi: 232 editors, 708 edits since October 2002. References to the literature; hyperlinks to related concepts.
  • 15. Filtering, extracting, and summarizing PubMed. Diagram labels: Documents; Concepts.
  • 16. Document- and concept-centric text mining. Diagram labels: Subject; Predicate; Object.
  • 17. Simple text mining for gene annotations. Pipeline labels: NCBI Entrez Gene: 334; Gene Wiki mapping; Wikilink; GO exact match; GO:0006897; Candidate assertion. Results: 6319 novel Gene Ontology annotations; 2147 novel Disease Ontology annotations. Good, BMC Genomics, 2011.
  • 18. Gene Wiki content improves enrichment analysis. [Scatter plot: p-value (PubMed + GW) versus p-value (PubMed only), with regions marked “more significant PubMed + GW” and “more significant PubMed only”; highlighted term: Muscle contraction.] Good, BMC Genomics, 2011.
  • 19. Gene Wiki+ for integrative queries. mwsync. http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 20. Dynamic queries across genes, diseases, SNPs. Good, J Biomed Semantics, 2012.
  • 21. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. Query shown: {{#ask: [[Category:Human_proteins]] [[is_associated_with:: <q>[[Category:Breast_cancer]]</q>]] [[HasSNP:: … <q>[[is_associated_with:: http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 22. Gene Wiki+ for integrative queries. mwsync; OMIM; PharmGKB. http://genewikiplus.org Good, J Biomed Semantics, 2012.
  • 23. Wikidata: “Provide a database of the world's knowledge that anyone can edit” - Denny Vrandečić
  • 24. Wikidata. Q414043 (Reelin): Property:P31 “is a”: Q8054 (Protein), Q187126 (Glycoprotein); Property:P128 “regulates”: Q1345738 (Neural development); Property:P129 “interacts with”: Q1979313 (VLDL receptor), Q423510 (Amyloid precursor protein). http://www.wikidata.org/wiki/Q414043
  • 25. Wikidata. Q414043: Property:P31: Q8054, Q187126; Property:P128: Q1345738; Property:P129: Q1979313, Q423510. http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
  • 28. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org.
  • 29. Not just the biomedical literature…
  • 30. BioGPS aggregates gene-centric information. http://biogps.org Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 31. The plugin interface is simple and universal. A URL template plus a gene entity gives a rendered URL. Pubmed: http://www.ncbi.nlm.nih.gov/sites/entrez?...&Term={{Symbol}} STRING: http://string-db.org/newstring_cgi?...&identifier={{EnsemblGene}} KEGG: http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}} Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 32. The plugin interface is simple and universal.
  • 33. The plugin interface is simple and universal.
  • 34. The plugin interface is simple and universal.
  • 35. The plugin interface is simple and universal.
  • 36. The plugin interface is simple and universal. Total of 389 gene-centric online databases registered as BioGPS plugins.
  • 37. BioGPS has a critical mass of users. > 4100 registered users; 4000 unique visitors per week; 40,000 page views per week. [Plot: daily pageviews.] Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC. Wu, NAR, 2013; Wu, Genome Biology, 2009.
  • 38. All resources should provide RDF…
  • 40. Defining a data extraction template. Genes shown: TP53, TNF, APOE, IL6, VEGF, EGFR, TGFB1, …
  • 41. The BioGPS Semantic Annotator. http://54.244.135.254:8000/
  • 42. From crowdsourcing to structured data: The Gene Wiki; GeneGames.org.
  • 43. Seven million human hours. http://www.flickr.com/photos/archana3k1/4124330493/
  • 44. Twenty million human hours. http://www.flickr.com/photos/ableman/2171326385/
  • 45. - 150 billion human hours per year. http://www.flickr.com/photos/rvp-cw/6243289302/
  • 46. Using games to fold proteins. Fold.it players have successfully: outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010); solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011); designed an improved protein folding algorithm (Khatib, PNAS, 2011); improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011).
  • 47. Using games to fold RNAs. http://eterna.cmu.edu/
  • 48. Using games to align sequences. http://phylo.cs.mcgill.ca Kawrykow, PLOS ONE, 2012.
  • 49. Using games to annotate genes? http://genegames.org
  • 50. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease.
  • 51. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Lipoprotein glomerulopathy; Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; Macular degeneration, age-related; Myocardial infarction susceptibility.
  • 52. No good gene-disease annotation database. Query: Apolipoprotein E. Results: ? Alzheimer's disease (AD); ? Lipoprotein glomerulopathy; ? Sea-blue histiocyte disease; Hyperlipoproteinemia, type III; ? Macular degeneration, age-related; ? Myocardial infarction susceptibility; HIV; Psoriasis; Vascular Diseases.
  • 53. No good gene-disease annotation database. Query: Apolipoprotein E. Results: Alzheimer's disease (AD); Memory; Coronary Artery Disease; Neuropsychological Tests; Hypertension; Cognition Disorders; Mental Status Schedule; Psychiatric Status Rating Scales; Dementia; Cognition; Hyperlipidemias; Atrophy; Disease Progression; Dementia, Vascular; Cardiovascular Diseases; Parkinson Disease; Brain Injuries; Coronary Disease; Myocardial Infarction; Diabetes Mellitus, Type 2; Memory Disorders; … 477 diseases!
  • 54. Play Dizeez to annotate gene-disease links. 1. Read the clue (gene). 2. Click the related disease (only one is “right”). 3. If it's “right”, you get points. 4. Then on to the next question… 5. Hurry! 6. Play to win!
  • 55. Dizeez players seem pretty smart… In total (since Dec 2011): 230 unique gamers; 1045 games played; 8525 guesses. Table columns: # occurrences, gene, disease, and presence in Gene Wiki, OMIM, PharmGKB, PubMed. Rows (# occurrences, gene, disease): 11, NBPF3, neuroblastoma; 11, SOX8, mental retardation; 9, ABL1, leukemia; 9, SSX1, synovial sarcoma; 8, APC, colorectal cancer; 8, FES, sarcoma; 8, RBP3, retinoblastoma; 8, GAST, gastrinoma; 8, DCC, colorectal cancer; 8, MAP3K5, cancer.
  • 56. Using games to predict phenotype from genotype? http://genegames.org
  • 57. Classification problems in genome biology. Training data: 100s of samples (cancer or normal) by 100,000s of features. Find patterns (SVM, neural networks, Naïve Bayes, KNN, …), then classify new samples as cancer or normal.
  • 58. Random forests. Sample a subset of cases and features, then train a decision tree. Data: 100s of samples (cancer or normal) by 100,000s of features.
  • 59. Random forests. Data: 100s of samples (cancer or normal) by 100,000s of features.
  • 60. Random forests. Classify new samples as cancer or normal. Data: 100s of samples by 100,000s of features. How to interject biological knowledge?
  • 61. Network-guided forests. Dutkowski & Ideker (2011). PLoS Computational Biology.
  • 62. Network-guided forests. Sample features by PPI network, then train a decision tree. Data: 100s of samples (cancer or normal) by 100,000s of features.
  • 63. Human-guided forests. Sample features by human intelligence, then train a decision tree. Data: 100s of samples (cancer or normal) by 100,000s of features.
  • 65. The Cure: Genomic predictors for disease.
  • 66. The Cure: Genomic predictors for disease.
  • 67. The Cure: Genomic predictors for disease.
  • 68. The Cure: Genomic predictors for disease.
  • 69. The Cure: Genomic predictors for disease.
  • 70. The Cure: Genomic predictors for disease.
  • 71. Human-guided forests. Classify new samples as cancer or normal.
  • 73. Results. 214 registered players: 50% declared knowledge of cancer biology; 40% self-identified as having a Ph.D. Prediction results: 70% correct on survival concordance index; best scoring model was 76%. Player registrations still increasing!
  • 74. Crowdsourcing empowers the entire scientific community to directly participate in the gene annotation process.
  • 75. Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch; Max Nanis; Ben Good; Chunlei Wu; Salvatore Loguercio. Key group alumni: Erik Clarke; Jon Huss; Marc Leglise; Maximilian Ludvigsson; Ian MacLeod; Camilo Orozco. Contact: http://sulab.org; asu@scripps.edu; @andrewsu; +Andrew Su. Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820.
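
The plugin interface on slide 31 (a URL template with `{{Field}}` placeholders, filled in from a gene entity) can be sketched in a few lines of Python. This is only an illustration of the idea, not BioGPS's actual implementation: the function name `render_plugin_url` and the shape of the `GENE` record are assumptions made for this example, and only the KEGG template is used because it appears untruncated on the slide.

```python
import re
from urllib.parse import quote

# Hypothetical gene entity. The keys mirror the placeholder names on slide 31;
# the record layout itself is an assumption for this sketch.
GENE = {"Symbol": "TP53", "EntrezGene": "7157"}

# Matches placeholders of the form {{FieldName}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# KEGG template exactly as shown on the slide.
KEGG_TEMPLATE = "http://www.genome.jp/dbget-bin/www_bget?hsa:{{EntrezGene}}"


def render_plugin_url(template, gene):
    """Replace each {{Field}} in the template with the URL-quoted gene field."""
    def substitute(match):
        field = match.group(1)
        if field not in gene:
            raise KeyError(f"gene entity has no field {field!r}")
        return quote(str(gene[field]), safe="")
    return PLACEHOLDER.sub(substitute, template)


print(render_plugin_url(KEGG_TEMPLATE, GENE))
```

Raising on an unknown field, rather than leaving the placeholder in the URL, is a design choice of this sketch: a plugin that needs an identifier the gene record lacks fails loudly instead of producing a broken link.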

Editor's Notes

  1. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  2. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  3. We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
  4. Much knowledge is locked up in biomedical literature; our goal is to make it computable. If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
  5. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  6. Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
  7. We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  8. Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
  9. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  10. MODs and portals
  11. Genetics resources
  12. Literature resources
  13. Protein resources
  14. Pathway and expression databases
  15. For each resource: briefly describe the unstructured resource; describe the structuring approach.
  16. Empire state building
  17. FES = “Feline sarcoma oncogene”; RBP3 = “Retinol binding protein 3, interstitial”
  18. Question: how to interject biological knowledge in the feature selection process?
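
One possible reading of the "human-guided forests" slides, as an answer to the question in note 18, is that each tree still gets a bootstrap sample of cases, but its candidate features are drawn in proportion to how often people picked them. The sketch below shows only that sampling step in plain Python. The vote counts, the function names, and the choice of weighted sampling without replacement are all assumptions for illustration; the slides say only "sample features by human intelligence" and do not specify a weighting scheme.

```python
import random


def sample_features(weights, k, rng):
    """Draw up to k distinct features, favoring those with higher weight.

    Weighted sampling without replacement; zero-weight features are never drawn.
    """
    pool = {name: w for name, w in weights.items() if w > 0}
    chosen = []
    while pool and len(chosen) < k:
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


def draw_tree_inputs(n_samples, weights, k, rng):
    """Return (bootstrap case indices, sampled features) for one tree."""
    cases = [rng.randrange(n_samples) for _ in range(n_samples)]
    return cases, sample_features(weights, k, rng)


# Hypothetical vote counts: how many players selected each gene as a predictor.
votes = {"TP53": 5, "EGFR": 3, "GENE_X": 0}

rng = random.Random(0)
cases, features = draw_tree_inputs(4, votes, 2, rng)
print(cases, features)
```

A standard random forest corresponds to giving every feature the same weight; a network-guided forest would swap the vote counts for weights derived from a protein interaction network.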