Curation
Ewan Birney (tweetable)
Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (> 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed – sometimes led – human/mouse/rat/platypus etc genomes, ENCODE, others.

EBI is in Hinxton, South Cambridgeshire
EBI is part of EMBL, ~like CERN for molecular biology
Molecular Biology
• The study of how life works – at a molecular level

• Key molecules:
  • DNA – Information store (Disk)
  • RNA – Key information transformer, also does stuff (RAM)
  • Proteins – The business end of life (Chip, robotic arms)
  • Metabolites – Fuel and signalling molecules (electricity)
• We have theories of how these interact – but no theories to predict what
  they are
• Instead we determine attributes of molecules and store them in
  globally accessible, open, databases
Theory vs Observation
[Figure: fields arranged along a spectrum from "Must directly observe" to "Can accurately predict from models". In order along the axis: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics]
This ratio is not well correlated with data size


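[Figure: Data Size (vertical axis, ~5PB to ~60PB) plotted against Ratio of model predictability (horizontal axis). High Energy Physics sits near ~60PB, Molecular Biology and Astronomy in between, Climate Models near ~5PB]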
“Knowing stuff” is critical to biology…

• The bases of the human genome
  • … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow….
• The functions of proteins
  • Enzymes, Transcription Factors, Signalling….
• The types of cells, their lineages and organ composition
  • …and all the molecular components in each cell
• Small molecules
  • … and their conversions, binding partners
• Structures of molecules, complexes and cells
  • … at atomic and higher resolution
Two fundamental types of information

Experimental data
• The result of a specific experiment
• Often an experiment specific, data heavy part plus a “meta-data” part
• Might be contradictory
• “Primary paper”

Consensus Knowledge
• Integration of different strands of information on a topic
• Realised as a computationally accessible scheme
• “Review article”
Five types of curation
Experimental Data Entry

• IntAct – Protein:Protein
  interactions


• GWAS Catalog –
  extraction of summary
  statistics
Experimental Meta data capture

• Sample, CDS lines in
  ENA
• Sample in MetaboLights,
  PRIDE etc
• Machine and analysis
  specification in PDB,
  PRIDE, ENA
Consensus integration of information

• GENCODE gene models in
  human
• Summaries and GO
  assignment in UniProt
• Pathway information in
  Reactome
• GO assignment and
  summaries in MODs (eg,
  PomBase, WormBase,
  PhytoPathDB etc)
Knowledge frameworks

•   The EC classification
•   Cell type ontologies
•   Cell lineages – Worms!
•   SNOMED, HPO etc
•   GO ontologies
Knowledge management

• Creation of rules
  representing ENA
  standards compliance
• Cross-ontology
  coordination (eg, EFO) or
  tying (GO and ChEBI)
• RuleBase / UniRule
  curation processes
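
One way to picture "rules representing standards compliance" is as small, named checks that run over each submitted metadata record. The sketch below is illustrative only: the field names (organism, collection_date), the Rule type and the violations function are all hypothetical, and nothing here is taken from the actual ENA, RuleBase or UniRule systems.

```python
from typing import Callable, Dict, List, NamedTuple

# A metadata record is modelled as a flat mapping of field name to value.
Record = Dict[str, str]


class Rule(NamedTuple):
    """A named compliance check; `check` returns True when the record passes."""
    name: str
    check: Callable[[Record], bool]


# Hypothetical rules: each requires a non-empty value for one field.
RULES: List[Rule] = [
    Rule("has_organism", lambda r: bool(r.get("organism", "").strip())),
    Rule("has_collection_date", lambda r: bool(r.get("collection_date", "").strip())),
]


def violations(record: Record, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of every rule the record fails, in rule order."""
    return [rule.name for rule in rules if not rule.check(record)]


if __name__ == "__main__":
    sample = {"organism": "Homo sapiens"}
    print(violations(sample))
```

Keeping rules as data rather than scattered scripts means curators can review, add or retire a rule without touching the code that applies it.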
Data Entry vs Programming

[Figure: a spectrum running from Direct Data Entry to Programmatic Data Entry. Placed between the two ends: Improved data entry tools; “Messy” scripting; RuleBase, Computationally Accessible Standards]
Thank You!
Curation Dilemma

If you do your job well…
• Everyone assumes it’s easy
• People forget about the complexity
• You are ignored

If you do your job badly…
• Everyone assumes it’s easy
• People forget about the complexity
• People complain
Why we need an infrastructure…
Infrastructures are critical…
But we only notice them when they go wrong
Biology already needs an information
infrastructure

• For the human genome
  • (…and the mouse, and the rat, and… x 150 now, 1000 in the
    future!) - Ensembl
• For the function of genes and proteins
  • For all genes, in text and computational – UniProt and GO
• For all 3D structures
  • To understand how proteins work – PDBe
• For where things are expressed
  • The differences and functionality of cells - Atlas
..But this keeps on going…

• We have to scale across all of (interesting) life
  • There are a lot of species out there!
• We have to handle new areas, in particular medicine
  • A set of European haplotypes for good imputation
  • A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
  • Of biological chemicals
  • Of chemicals which interfere with Biology
ELIXIR’s mission
To build a sustainable European infrastructure for biological
information, supporting life science research and its translation to:
• medicine
• environment
• bioindustries
• society
How?

Fully Centralised
Pros: Stability, reuse, learning ease
Cons:
• Hard to concentrate expertise across all of life science
• Geographic, language placement
• Bottlenecks and lack of diversity

Fully Distributed
Pros: Responsive; geographic and language responsive
Cons:
• Internal communication overhead
• Harder for end users to learn
• Harder to provide multi-decade stability
Research            Healthcare
International       National
EBI / ELIXIR        Healthcare
English             National Language
Low legalities      Complex legalities
Other infrastructures needed for biology
• EuroBioImaging
  • Cellular and whole organism Imaging
• BioBanks (BBMRI)
  • We need numbers – European populations – in particular for rare
    diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
  • A baseline set of knockouts and phenotypes in our most tractable
    mammalian model
  • (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
  • The ability to reliably use state of the art molecular techniques in a
    clinical research setting
(you can follow me on twitter @ewanbirney)
I blog and update this on Google Plus publicly

Ewan Birney Biocuration 2013
