This document discusses using crowdsourcing to improve biological knowledge by structuring gene annotations. It notes that few genes are well annotated currently, with most attention given to a small number of prominent genes. Biocuration is a bottleneck that limits greater annotation. The document proposes harnessing the "Long Tail" of scientists through a Gene Wiki and biological games to directly involve more researchers in annotation. This could scale up annotation to the rate of new data generation and literature. Structured data from crowdsourcing could enable new integrative queries and predictive models across genes, diseases, and genotypes.
Crowdsourcing to structure biological knowledge (USC/ISI)
1. Crowdsourcing to structure biological
knowledge
Andrew Su, Ph.D.
Department of Molecular and Experimental Medicine
The Scripps Research Institute
ISI, USC
August 16, 2012
2. 2
Human genetics underlies human health
Molecular understanding of:
• Biological function
• Genetic variation
• Mutation
• Deletion
• Amplification
• …
“Gene annotation”
~3 billion bases
~23,000 genes
Molecular diagnostics & therapeutics
7. 7
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
8. 8
The Long Tail is a prolific source of content
Figure: content produced plotted against contributors (sorted), showing a Short Head followed by a Long Tail.
Examples (Short Head vs. Long Tail):
News: Newspapers vs. Blogs
Video: TV/Hollywood vs. YouTube
Product reviews: Consumer Reports vs. Amazon reviews
Food reviews: Food critics vs. Yelp
Talent judging: Olympics vs. American Idol
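The Short Head / Long Tail split on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name `head_and_tail_share`, the `head_size` cutoff, and the edit counts are all hypothetical and do not come from the talk. It sorts contributors by output and reports how much of the total comes from the few top contributors versus everyone else.

```python
def head_and_tail_share(contributions, head_size):
    """Return (head_share, tail_share) of total content.

    contributions: content produced per contributor (any order).
    head_size: how many top contributors count as the Short Head
               (an arbitrary cutoff chosen for illustration).
    """
    ranked = sorted(contributions, reverse=True)  # "Contributors (sorted)"
    total = sum(ranked)
    if total == 0:
        raise ValueError("no contributions to split")
    head = sum(ranked[:head_size])
    return head / total, (total - head) / total


# Hypothetical data: 2 prolific contributors, 50 occasional, 200 one-off.
edit_counts = [500, 300] + [10] * 50 + [1] * 200
head_share, tail_share = head_and_tail_share(edit_counts, head_size=2)
print(f"Short Head: {head_share:.1%} of content, Long Tail: {tail_share:.1%}")
```

With these made-up numbers the many small contributors together account for nearly half of the content, which is the point the slide makes about blogs, YouTube, and Yelp.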
9. 9
We can harness the
Long Tail of scientists
to directly participate in
the gene annotation
process.
11. 11
10,000 gene “stubs” within Wikipedia
Diagram labels: Utility, Users, Contributors
Labeled parts of a gene stub:
• Protein structure
• Gene summary
• Symbols and identifiers
• Gene Ontology annotations
• Protein interactions
• Tissue expression pattern
• Linked references
• Links to structured databases
Huss, PLoS Biol, 2008
12. 12
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
13. 13
Gene Wiki has a critical mass of editors
Chart series: editor count (Editors) and edit count (Edits)
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
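As a rough sanity check on the figures on this slide, the arithmetic can be worked through directly. The sketch below uses only the numbers quoted above. It assumes that ">1,000 edits" refers to the same monthly window as the word increase, which the slide does not state outright.

```python
# Figures quoted on the slide (Good, NAR, 2011).
total_words = 1_420_000      # "Currently 1.42 million words"
article_equivalents = 230    # "approximately equal to 230 full-length articles"
words_per_month = 10_000     # "~10,000 words / month"
edits_per_month = 1_000      # ">1,000 edits"; a lower bound, assumed to be per month

# Implied length of one "full-length article".
words_per_article = total_words / article_equivalents

# Because the edit count is a lower bound, this is an upper bound on net words per edit.
net_words_per_edit = words_per_month / edits_per_month

print(f"Implied full-length article: ~{words_per_article:,.0f} words")
print(f"Net words added per edit: at most ~{net_words_per_edit:.0f}")
```

Under that assumption, a full-length article works out to roughly 6,200 words, and the typical edit adds only about ten words net, which fits a picture of many small contributions.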
14. 14
A review article for every gene is powerful
Reelin: 68 editors, 543 edits since July 2002
Heparin: 175 editors, 320 edits since June 2003
AMPK: 44 editors, 84 edits since March 2004
RNAi: 232 editors, 708 edits since October 2002
References to the literature
Hyperlinks to related concepts
32. 32
The plugin interface is simple and universal
Total of 389 gene-centric online databases registered as BioGPS plugins
33. 33
BioGPS has a critical mass of users
Daily pageviews
• > 4100 registered users
• 4000 unique visitors per week
• 40,000 page views per week
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
42. 42
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
43. 43
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
46. 46
Using games to annotate gene-disease links
Click the related disease
If it's ‘right’, you get points
Then on to the next question
Hurry!
http://genegames.org
57. 57
We can harness the
Long Tail of scientists
to directly participate in
the gene annotation
process.
58. 58
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Erik Clarke
Ian Macleod
Ben Good
Chunlei Wu
Salvatore Loguercio
Summer internships for students!
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Editor's Notes
We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references, and 38% have one or no references.
Much knowledge is locked up in the biomedical literature; our goal is to make it computable. If you believe that more than 1.5% of articles are relevant to gene function, then this points to a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource, then describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases