Ewan Birney Biocuration 2013


Ewan Birney's slides from Biocuration 2013

Published in: Health & Medicine, Technology

  1. Curation
     Ewan Birney (tweetable)
  2. Who am I?
     • Associate Director at the European Bioinformatics Institute (EBI)
     • Involved in genomics since I was 19 (> 20 years!)
     • Trained as a biochemist – most people think I am CS
     • Analysed – sometimes led – human/mouse/rat/platypus etc. genomes, ENCODE, others
     • EBI is in Hinxton, South Cambridgeshire
     • EBI is part of EMBL, ~like CERN for molecular biology
  3. Molecular Biology
     • The study of how life works – at a molecular level
     • Key molecules:
       • DNA – Information store (Disk)
       • RNA – Key information transformer, also does stuff (RAM)
       • Proteins – The business end of life (Chip, robotic arms)
       • Metabolites – Fuel and signalling molecules (electricity)
     • Theories of how these interact – no theories to predict what they are
     • Instead we determine attributes of molecules and store them in globally accessible, open databases
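  4. Theory ↔ Observation
     [Figure: a spectrum between Theory ("Can accurately predict from models") and Observation ("Must directly observe"), with these fields placed along it: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics.]
  5. This ratio is not well correlated with data size
     [Figure: plot of Data Size (axis marked ~5PB and ~60PB) against Ratio of model predictability, with points for High Energy Physics, Molecular Biology, Astronomy and Climate Models.]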
  6. “Knowing stuff” is critical to biology…
     • The bases of the human genome
       • … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow…
     • The functions of proteins
       • Enzymes, Transcription Factors, Signalling…
     • The types of cells, their lineages and organ composition
       • … and all the molecular components in each cell
     • Small molecules
       • … and their conversions, binding partners
     • Structures of molecules, complexes and cells
       • … at atomic and higher resolution
  7. Two fundamental types of information
     • Experimental data
       • The result of a specific experiment
       • Often an experiment-specific, data-heavy part plus a “meta-data” part
       • Might be contradictory
       • “Primary paper”
     • Consensus Knowledge
       • Integration of different strands of information on a topic
       • Realised as a computationally accessible scheme
       • “Review article”
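The distinction on slide 7 can be sketched as two record types. This is an illustrative sketch only: the class names, the field names and the majority-vote `integrate` function are assumptions made for the example, not anything shown in the talk, and real curation involves expert judgement rather than a vote.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    source: str     # where the result came from, e.g. a primary paper
    metadata: dict  # the "meta-data" part: sample, machine, analysis
    value: str      # stand-in for the data-heavy part of the experiment


@dataclass
class ConsensusEntry:
    topic: str
    value: str
    supporting: list = field(default_factory=list)
    contradicting: list = field(default_factory=list)


def integrate(topic, records):
    """Fold experiment records into one consensus entry (hypothetical majority vote)."""
    if not records:
        raise ValueError("need at least one record")
    counts = Counter(r.value for r in records)
    winner, _ = counts.most_common(1)[0]
    return ConsensusEntry(
        topic=topic,
        value=winner,
        supporting=[r.source for r in records if r.value == winner],
        # Contradictory experiments are kept and flagged, not discarded.
        contradicting=[r.source for r in records if r.value != winner],
    )


# Made-up records purely for illustration.
records = [
    ExperimentalRecord("paper-A", {"sample": "s1"}, "binds"),
    ExperimentalRecord("paper-B", {"sample": "s2"}, "binds"),
    ExperimentalRecord("paper-C", {"sample": "s3"}, "does not bind"),
]
entry = integrate("protein X : protein Y", records)
print(entry.value, entry.supporting, entry.contradicting)
```

The point of the sketch is the shape of the data: experimental records may disagree with each other, while the consensus entry carries one statement plus pointers back to the evidence on both sides.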
  8. Five types of curation
  9. Experimental Data Entry
     • IntAct – Protein:Protein interactions
     • GWAS Catalog – extraction of summary statistics
  10. Experimental Meta data capture
     • Sample, CDS lines in ENA
     • Sample in MetaboLights, PRIDE etc.
     • Machine and analysis specification in PDB, PRIDE, ENA
  11. Consensus integration of information
     • GENCODE gene models in human
     • Summaries and GO assignment in UniProt
     • Pathway information in Reactome
     • GO assignment and summaries in MODs (e.g. PomBase, WormBase, PhytoPathDB etc.)
  12. Knowledge frameworks
     • The EC classification
     • Cell type ontologies
     • Cell lineages – Worms!
     • SNOMED, HPO etc.
     • GO ontologies
  13. Knowledge management
     • Creation of rules representing ENA standards compliance
     • Cross-ontology coordination (e.g. EFO) or tying (GO ↔ ChEBI)
     • RuleBase / UniRule curation processes
  14. Data Entry vs Programming
     [Figure: a spectrum from Direct Data Entry to Programmatic Data Entry, labelled with: “Messy” Scripting; Improved Data entry tools; RuleBase; Computational Accessible Standards.]
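The idea on slides 13 and 14 of expressing standards compliance as rules, rather than checking records by hand, can be sketched as follows. This is a minimal sketch under stated assumptions: the rule names, the record fields (`sample_id`, `organism`, `collection_date`) and the `validate` function are hypothetical and are not ENA's or RuleBase's actual rules or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


def is_iso_date(text):
    """Return True if text parses as an ISO calendar date such as 2013-04-08."""
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record complies


# Hypothetical compliance rules; a real standard would define its own.
RULES = [
    Rule("has_sample_id", lambda rec: bool(rec.get("sample_id"))),
    Rule("has_organism", lambda rec: bool(rec.get("organism"))),
    Rule("date_is_iso", lambda rec: is_iso_date(rec.get("collection_date", ""))),
]


def validate(record, rules=RULES):
    """Return the names of the rules that the record fails."""
    return [rule.name for rule in rules if not rule.check(record)]


good = {"sample_id": "S1", "organism": "Homo sapiens", "collection_date": "2013-04-08"}
bad = {"sample_id": "S2", "collection_date": "08/04/2013"}
print(validate(good))
print(validate(bad))
```

Keeping the rules as data means a curator can add or retire a rule without touching the checking loop, which is the move from direct data entry towards the programmatic end of the spectrum on slide 14.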
  15. Thank You!
  16. Curation Dilemma
     • If you do your job well…
       • Everyone assumes it’s easy
       • People forget about the complexity
       • You are ignored
     • If you do your job badly…
       • Everyone assumes it’s easy
       • People forget about the complexity
       • People complain
  17. Why we need an infrastructure…
  18. Infrastructures are critical…
  19. But we only notice them when they go wrong
  20. Biology already needs an information infrastructure
     • For the human genome
       • (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) – Ensembl
     • For the function of genes and proteins
       • For all genes, in text and computational – UniProt and GO
     • For all 3D structures
       • To understand how proteins work – PDBe
     • For where things are expressed
       • The differences and functionality of cells – Atlas
  21. …But this keeps on going…
     • We have to scale across all of (interesting) life
       • There are a lot of species out there!
     • We have to handle new areas, in particular medicine
       • A set of European haplotypes for good imputation
       • A set of actionable variants in germline and cancers
     • We have to improve our chemical understanding
       • Of biological chemicals
       • Of chemicals which interfere with Biology
  22. ELIXIR’s mission
     To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:
     • medicine
     • environment
     • bioindustries
     • society
  23. How?
     • Fully Centralised
       • Pros: Stability, reuse, learning ease
       • Cons: Hard to concentrate expertise across life science; geographic, language placement; bottlenecks and lack of diversity
     • Fully Distributed
       • Pros: Responsive; geographic, language responsive
       • Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability
  24. Research vs Healthcare
     • Research: International; EBI / ELIXIR; English; Low legalities
     • Healthcare: National; Healthcare; National Language; Complex legalities
  25. Other infrastructures needed for biology
     • EuroBioImaging
       • Cellular and whole organism Imaging
     • BioBanks (BBMRI)
       • We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease
     • Mouse models and phenotypes (Infrafrontier)
       • A baseline set of knockouts and phenotypes in our most tractable mammalian model
       • (it’s hard to prove something in human)
     • Robust molecular assays in a clinical setting (EATRIS)
       • The ability to reliably use state of the art molecular techniques in a clinical research setting
  26. (You can follow me on Twitter @ewanbirney.) I blog and update this on Google Plus publicly.
