A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes
Keynote talk given at GMOD 2014

Video of talk at: https://www.youtube.com/watch?v=RVijs5ry05E
Video of QA at: https://www.youtube.com/watch?v=dGHXo-iNsyU
Blog post: http://sulab.org/2013/06/creating-a-centralized-model-organism-database-cmod/

Published in: Technology

  • At least three functions – gene and genome annotation, software development, system administration
  • GMOD reduces redundancy in software development
  • 96 species shown
  • ~3000 species shown
  • Currently over 3000 sequenced genomes; will hit 10,000 in 2015, 100k in 2022, 1M in 2028. The number of sequenced genomes doubles every ~2 years.
  • Every group still individually hosts database and web servers
  • CMOD reduces redundancy in system administration, leaving MOD communities to focus on what they do best – gene and genome annotations.
  • We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA.
  • Numbers updated 7/15/2011
  • Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
  • Structured annotations enable pathway analysis, statistical analyses, cross-species comparisons
  • Tried on 773 GO categories, significant in 356 cases (46%)
  • We extended this analysis to all 773 GO terms used in human gene annotations and found a consistent improvement in the enrichment scores
  • Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
  • Hamburger-to-cow algorithm, or “wishful thinking”, requires Jurassic Park technology
  • Combines open editing of a wiki, with the robust community of editors at Wikipedia, with the structured data model of a database
  • Wikipedia: > 4M articles, averages over 2500 views per second
  • CMOD reduces redundancy in system administration, leaving MOD communities to focus on what they do best – gene and genome annotations.
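The doubling arithmetic in the note on sequenced genomes can be sketched as follows. This is a minimal illustration rather than anything from the talk: the function name `year_reaching` is hypothetical, and it assumes a clean two-year doubling time anchored at the note's figure of 10,000 genomes in 2015.

```python
import math

def year_reaching(target, base_count=10_000, base_year=2015, doubling_years=2.0):
    """Year at which an exponentially growing count reaches `target`.

    Assumes (for illustration only) steady doubling every `doubling_years`
    years, starting from `base_count` genomes in `base_year`.
    """
    doublings_needed = math.log2(target / base_count)
    return base_year + doubling_years * doublings_needed

# Project the two later milestones mentioned in the note.
for target in (100_000, 1_000_000):
    print(f"{target:>9,} genomes: ~{year_reaching(target):.1f}")
```

Under these assumptions the projections land close to the note's 2022 and 2028 milestones; a different anchor year or doubling time would shift them.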
  • Transcript

    • 1. A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced Genomes. Andrew Su, Ph.D. @andrewsu asu@scripps.edu http://sulab.org January 16, 2014, GMOD 2014
    • 2. Why am I giving this keynote?
    • 3. Harnessing the crowd… http://www.flickr.com/photos/portland_mike/6140660504/
    • 4. … to organize information http://www.flickr.com/photos/45697441@N00/6629580443
    • 5. My simplified history of MODs
    • 6. My simplified history of MODs
    • 7. GMOD is widely used: 199 (!) organizations listed as GMOD users
    • 8. Does the current model scale?
    • 9. Does the current model scale?
    • 10. Does the current model scale?
    • 11. The Long Tail of genomic data is being lost: “Identified 517 operons and 103 small regulatory RNAs...”
    • 12. The Long Tail of genomic data is being lost: “Identified 517 operons and 103 small regulatory RNAs...”
    • 13. At least you can download structured data…
    • 14. Centralized Model Organism Database concept: CMOD
    • 15. GMOD as a Service (GaaS) http://www.flickr.com/photos/aigle_dore/5626312363/
    • 16. http://www.flickr.com/photos/shannonmary/187131727/
    • 17. Few genes are well annotated… [Figure: GO annotation counts for 20,473 protein-coding genes, sorted by decreasing counts; labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF; marked percentages: 65%, 41%] Data: NCBI, February 2013
    • 18. … because the literature is sparsely curated? [Figure: number of PubMed-indexed articles by year, 1979 to 2009; y-axis 0 to 1,000,000]
    • 19. … because the literature is sparsely curated? [Figure: number of articles read by typical scientist and average capacity of human scientist, 1979 to 2009; y-axis 0 to 20]
    • 20. 311,696 articles (1.5% of PubMed) have been cited by GO annotations
    • 21. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
    • 22. The Long Tail is a prolific source of content. [Figure: content produced vs. contributors (sorted), from Short Head to Long Tail] Short Head vs. Long Tail examples: News: newspapers vs. blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer Reports vs. Amazon reviews; Food reviews: food critics vs. Yelp; Talent judging: Olympics vs. American Idol
    • 23. Wikipedia is reasonably accurate
    • 24. Wikipedia has breadth and depth. [Figure: articles and words (millions), Wikipedia vs. Britannica Online] http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
    • 25. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
    • 26. Filtering, extracting, and summarizing PubMed Documents Concepts Review article
    • 27. Filtering, extracting, and summarizing PubMed Documents Concepts
    • 28. Wiki success depends on a positive feedback loop. [Diagram: Gene Wiki page utility, number of contributors, number of users]
    • 29. 10,000 gene “stubs” within Wikipedia. Stub contents: protein structure, gene summary, symbols and identifiers, Gene Ontology annotations, protein interactions, linked references, tissue expression pattern, links to structured databases. [Feedback loop: Utility, Users, Contributors] Huss, PLoS Biol, 2008
    • 30. Gene Wiki has a critical mass of readers. Total: 4.0 million views / month. [Feedback loop: Utility, Users, Contributors] Huss, PLoS Biol, 2008; Good, NAR, 2011
    • 31. Gene Wiki has a critical mass of editors. [Figure: editor count and edit count] Increase of ~10,000 words / month from >1,000 edits. Currently 1.42 million words, approximately equal to 230 full-length articles. [Feedback loop: Utility, Users, Contributors] Good, NAR, 2011
    • 32. A review article for every gene is powerful. Reelin: 98 editors, 703 edits since July 2002. Heparin: 358 editors, 654 edits since June 2003. AMPK: 109 editors, 203 edits since March 2004. RNAi: 394 editors, 994 edits since October 2002. Callouts: hyperlinks to related concepts; references to the literature.
    • 33. Making the Gene Wiki more computable: from free text to structured annotations
    • 34. Filling the gaps in gene annotation. [Diagram: NCBI Entrez Gene: 334; Gene Wiki mapping; wikilink; candidate assertion; GO:0006897; GO exact match] 6319 novel GO annotations; 2147 novel DO annotations
    • 35. Gene Wiki content improves enrichment analysis: axon guidance (GO:0007411). [Diagram: enrichment analysis; GO term; 811 articles; 264 genes; PubMed abstracts; gene list; concept recognition] 2×2 table of GO:0007411 against linked genes through PubMed, as laid out on the slide: Yes: 13, 2; No: 251, 12033. P = 1.55 E-20
    • 36. Gene Wiki content improves enrichment analysis: muscle contraction (GO:0006936). [Diagram: enrichment analysis; GO term; 251 articles; 87 genes; gene list; PubMed abstracts; concept recognition + Gene Wiki; 87 articles] GO:0006936 vs. linked genes through PubMed: P = 1.0. GO:0006936 vs. linked genes through PubMed + Gene Wiki: P = 1.22 E-09
    • 37. Gene Wiki content improves enrichment analysis. [Figure: p-value (PubMed + GW) plotted against p-value (PubMed only); regions labeled “More significant PubMed only” and “More significant PubMed + GW”; highlighted point: muscle contraction]
    • 38. The Long Tail of scientists is a valuable source of information on gene function
    • 39. Can we skip text mining? http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
    • 40. Wikidata: “Provide a database of the world’s knowledge that anyone can edit” - Denny Vrandečić
    • 41. Wikidata understands scale
    • 42. Wikidata understands scale: 14 million Wikidata items… …13 million total genes in Entrez Gene
    • 43. Wikidata understands scale: 27 million Wikidata statements… …150k total GO annotations
    • 44. Wikidata for biology. [Diagram: Reelin (Q414043) linked to other items by properties. Item IDs shown: Q8054, Q187126, Q1345738, Q1979313, Q423510. Item labels shown: Protein, Glycoprotein, Neural development, VLDL receptor, Amyloid precursor protein. Property IDs shown: P31, P128, P129. Property labels shown: is a, regulates, interacts with.] http://www.wikidata.org/wiki/Q414043
    • 45. Wikidata for biology. [Diagram: the same graph with identifiers only: Q414043, Q8054, Q187126, Q1345738, Q1979313, Q423510; Property:P31, Property:P128, Property:P129] http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
    • 46. Increasing biological data in Wikidata http://www.wikidata.org/wiki/Wikidata:Molecular_Biology_task_force
    • 47. Loading genomic data into Wikidata: Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq
    • 48. Wikidata gene model. Added ~1000 human genes so far…
    • 49. Wikidata as CMOD?
    • 50. Wikidata as CMOD? CMOD powered by Wikidata
    • 51. The Long Tail of bioinformaticians can collaboratively build a Centralized Model Organism Database (CMOD).
    • 52. Gene Wiki Collaborators: Doug Howe, ZFIN; John Hogenesch, U Penn; Jon Huss, GNF; Luca de Alfaro, UCSC; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim; many Wikipedia editors; WP:MCB Project. Group members: Katie Fisch, Ben Good, Salvatore Loguercio, Tobias Meissner, Max Nanis, Chunlei Wu. Key group alumni: Adriel Carolino, Erik Clarke, Jon Huss, Marc Leglise, Maximilian Ludvigsson, Ian MacLeod, Camilo Orozco. Contact: http://sulab.org, asu@scripps.edu, @andrewsu, +Andrew Su. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
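
The P value on the axon guidance slide (GO:0007411, counts 13, 2, 251, 12033) is the kind of figure a one-sided Fisher's exact test on a 2×2 table produces. The sketch below is an assumption about how such a value could be computed, not the authors' confirmed method; the function name `fisher_right_tail` is hypothetical, and only Python's standard library is used.

```python
from math import comb

def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose top-left cell is at least `a`.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    numer = sum(comb(col1, k) * comb(total - col1, row1 - k)
                for k in range(a, upper + 1))
    return numer / denom

# Counts as they appear on the slide. Which axis is rows and which is
# columns does not matter here: transposing the table gives the same result.
p = fisher_right_tail(13, 2, 251, 12033)
print(f"P = {p:.3g}")
```

Exact integer arithmetic via `math.comb` avoids the underflow that a naive floating-point product of factorials would hit at these table sizes.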