BIOBASE, the leader in data annotation and curation for genomics, took part in the Genome Informatics Alliance 2012: Logistics meeting in Oregon, and had an opportunity to present on trends in annotation of genomic data.
Callback to last year in Verona, where Shakespeare's Romeo and Juliet is set: "An infinite number of monkeys with typewriters (or one monkey with infinite time) would in principle be able to write all the works of Shakespeare." This is the idea of getting things right by throwing impractically large resources at them (the monkeys would take much longer than the age of the universe). How can we deal with millions of genomes, and how can we annotate them, when we face the same limitation in resources? The idea goes back to Aristotle with regard to permutations, and more recently to Émile Borel's 1913 essay "Mécanique Statistique et Irréversibilité".
Data is not bad; it is only bad if we do not know what to do with it.
"Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" – T.S. Eliot
Analytic validity: how accurately and reliably the test measures the genotype of interest.
Clinical validity: how accurately the test detects or predicts the outcomes of interest.
Clinical utility: how likely the test is to significantly improve patient outcomes.
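Analytic and clinical validity are commonly quantified with sensitivity, specificity and predictive values computed from a 2x2 table of test results against the true state. The sketch below is illustrative only: the function name `validity_metrics` and its argument names are assumptions made for this example and are not taken from the talk.

```python
def validity_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical argument names).
    """
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("each margin of the 2x2 table must be non-zero")
    return {
        # Fraction of true carriers (or true cases) the test detects.
        "sensitivity": tp / (tp + fn),
        # Fraction of non-carriers (or non-cases) the test clears.
        "specificity": tn / (tn + fp),
        # Probability that a positive result is correct.
        "ppv": tp / (tp + fp),
        # Probability that a negative result is correct.
        "npv": tn / (tn + fn),
    }
```

For analytic validity the "true state" is the genotype; for clinical validity it is the outcome of interest, so the same arithmetic applies to both with different reference data.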
Facebook for genomes? eBay for genomes?
Bad fit for whole genome/exome. Quality consistency issues.
Wikipedia is crowdsourcing's poster child for distributed curation: everybody can contribute, and its quality killed Britannica. Linus's Law: "Given enough eyeballs, all bugs are shallow" (Eric S. Raymond, named in honor of L. Torvalds). India: the incentive of getting published.
Crowdsourcing has not worked out that well in practice for biology and science, because researchers would rather spend their time doing research and getting published, and because of difficulties with maintaining standards.
Suggested solutions:
• Force 'em? Journals mandate submission of data into databases; authors provide a machine-readable XML summary; journals require gene symbols, accessions for genes and isoforms, and descriptions of species, cell types and genotypes.
• Tie career advancement to annotation? Author IDs, microattribution.
• Fund community annotation (crowdfunding)?
These are valuable suggestions that would enable many tools. They address the lowest level of data curation. Changing career evaluation would mean changing the entire credit system for research, which rests on peer-reviewed, authored papers.
Use experts who are paid to collect information manually into databases.
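A machine-readable XML summary of the kind suggested in these notes could be as small as the following sketch. The root element `annotation_summary`, the child element names and the example values are all hypothetical, chosen only to mirror the fields named above; no journal or database schema is implied.

```python
import xml.etree.ElementTree as ET


def build_summary(gene_symbol, accession, species, cell_type, genotype):
    """Serialize one paper's key entities as XML (hypothetical element names)."""
    root = ET.Element("annotation_summary")
    fields = {
        "gene_symbol": gene_symbol,
        "accession": accession,
        "species": species,
        "cell_type": cell_type,
        "genotype": genotype,
    }
    # One child element per field, in the order listed above.
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only.
    print(build_summary("BRCA1", "NM_007294", "Homo sapiens", "HeLa", "wild type"))
```

Even a flat record like this would let a database ingest the entities of a paper without a curator having to re-read it, which is the point of the suggestion.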
We train curators for up to half a year before they do live curation
Trends in Annotation of Genomic Data
One million monkeys with typewriters
Annotations of the Genomic Data Deluge
Genome Informatics Alliance
Portland, 28/29 March 2012
Dr. Frank Schacherer, CTO, BIOBASE GmbH
frank.firstname.lastname@example.org
Disclaimer: no actual monkeys involved
In 2003 the Arts Council for England paid £2,000 for a real-life test of the theorem involving six Sulawesi crested macaques, but the trial was abandoned after a month. The monkeys produced five pages of text, mainly composed of the letter S, but failed to type anything close to a word of English, broke the computer and used the keyboard as a lavatory.
http://www.telegraph.co.uk/technology/news/8789894/Monkeys-at-typewriters-close-to-reproducing-Shakespeare.html
Agenda
• What annotation do we need?
• How can we get it?
A deluge of data
• deluge (plural deluges)
– A great flood or rain. The deluge continued for hours, drenching the land and slowing traffic to a halt.
– An overwhelming amount of something. The rock concert was a deluge of sound.
Media perception
• Science 2011
• Health Affairs 2009
• "The Power Of Digitizing Human Beings", 17 Feb 2012
• "Soon, $1,000 Will Map Your Genes", 10 Jan 2012
• "Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances", 7 March 2012
• "Personalized Medicine Hits a Bump", March 2012
Life cycle of data annotation
(Diagram labels: Understand, Derive, Map, Analyze, Annotate, Publish, Rank, Curate)
How to predict mutation effects
• Overlap with other data
– dbSNP, 1000 genomes
– Relatives and controls
• Algorithmically
– Frameshift, nonsense, stop gain/loss, non-synonymous changes (SIFT, PolyPhen, ...)
• Based on annotation
– Known functional regions (active sites, binding sites, ...)
• Directly known effects
– HGMD
Bioinformatics, Vol. 26 no. 16 2010, pages 2069; 10.1093/bioinformatics/btq330
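The "algorithmic" tier in the list above can be illustrated with a minimal sketch that classifies a coding change purely from the standard genetic code. This is not how SIFT or PolyPhen work (they score the likely impact of amino-acid substitutions); it only shows the first, mechanical step of naming the consequence. The function names `classify_snv` and `classify_indel` are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon, alt_codon):
    """Name the consequence of a single-codon substitution."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gain (nonsense)"
    if ref_aa == "*":
        return "stop loss"
    return "non-synonymous (missense)"


def classify_indel(length_change):
    """Classify a coding insertion or deletion by its net length in bases."""
    return "in-frame indel" if length_change % 3 == 0 else "frameshift"
```

A call such as `classify_snv("TGG", "TGA")` reports a stop gain, because TGG encodes tryptophan and TGA is a stop codon. The other tiers in the list (overlap with known variants, functional-region annotation, curated effects) need external data rather than computation alone.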
Associating Genotype with Phenotype
http://www.gen2phen.org/
What data do we need for clinical application?
ACCE takes its name from the four main criteria for evaluating a genetic test: analytic validity, clinical validity, clinical utility and associated ethical, legal and social implications.
Centers for Disease Control and Prevention, Office of Public Health Genomics (OPHG)
Ideal Annotation for clinical use?
• Variants
– Pathogenic, Uncertain, Benign
– Severities, if known
– Ethnicities/Frequencies
– Number of cases
– Symptoms
– In conjunction with other mutations
• Evidences
– Not weighted equally
– Risks of incorrect classification not equal between genes
N=12
4 Testing (Clinical Validity, Who/When, Methods, Interpretation, Cost)
4 Management, Clinical Significance, Implications
3 Actionability, Clinical Utility
3 Clinical manifestations (Pathophysiology, Phenotype, Prognosis, Severity, Penetrance, Pleiotropy)
2 Frequency (especially indicate most common variants)
2 Inheritance and de novo mutation rate
2 Evidence-based
1 Clinical Decision Support in EHR
Data from: Howard P. Levy, MD, PhD, Johns Hopkins University
Data from: Elaine Lyon, Ph.D., FACMG, University of Utah & ARUP Laboratories
Who provides annotation?
Payor, Test Lab, Curator, Researcher, Patient, MD/Geneticist, Anybody, Computer
Surveys & Patient Self-annotation
Knaus, William A. BUILDING A GENOME ENABLED ELECTRONIC MEDICAL RECORD
nature biotechnology VOLUME 29 NUMBER 5 MAY 2011
Patients with serious diseases may experiment with drugs that have not received regulatory approval. Online patient communities structured around quantitative outcome data have the potential to provide an observational environment to monitor such drug usage and its consequences. Here we describe an analysis of data reported on the website PatientsLikeMe by patients with amyotrophic lateral sclerosis (ALS) who experimented with lithium carbonate
DNA Variant Databases
Data, except for HGMD and DMuDB, courtesy of P. Willems, Mutabase
Testing Lab data
A safe and secure route for sharing variant data
The Diagnostic Mutation Database (DMuDB) is a unique repository of high-quality variant data collected from accredited clinical genetic testing laboratories in the UK National Health Service (NHS).
It provides a safe and secure way for variant data to be shared within and between laboratories in order to support safer, more consistent diagnoses. The database was established in order to address the lack of data-sharing or publication in the genetic testing community.
DMuDB is used regularly by genetic scientists:
• to check a new variant against existing reported variants from other laboratories
• to check for co-reported variants
• as a part of regular re-assessment of unclassified variants
• via the Universal Browser as part of complex searches covering multiple databases
www.ngrl.org.uk/Manchester
LSDBs (Locus Specific Databases) http://www.hgvs.org/dblist/glsdb.html
Crowdsourcing reality
"…biological databases can be curated by a diffuse network of volunteers? This is certainly not the case and at the core of every successful wiki database are a group of dedicated experts who do the bulk of the data curation."
"The future of biocuration: To thrive, the field that links biologists and their data urgently needs structure, recognition and support." NATURE|Vol 455|2008
Data Annotation Professionals
• Clear incentives
• Background in life sciences (MSc/PhD)
• Curation is sole focus of work
• Knowledge of standards, databases, formats, specialized tools
Huge volumes of primary data are currently archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent than today. The lasting archiving, accurate curation, efficient analysis and precise interpretation of all of these data are a challenge. Collectively, database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data.
Conclusions on annotation
• Clinical-grade annotation may be the most important task ahead
• NGS itself contributes to generating evidence
• Many different sources and ways of annotation exist
• Human, specialist annotation remains essential (monkeys notwithstanding)
Thank you!
• BIOBASE Employees all around the world
• David Cooper, University of Cardiff
• Andrew Deveraux, NGRL
• Patrick Willems, MutaBase
• Johan den Dunnen, HVP & Leiden University Medical Center
• Anthony J. Brooks, GEN2PHEN & University of Leicester
• Samir K. Brahmachari, OSDD
Gene Regulation Analysis Human Mutation & Functional Analysis Variant Analysis
email@example.com
www.biobase-international.com