Ewan Birney Biocuration 2013

Curation
Ewan Birney (tweetable)

Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (> 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed, and sometimes led, the human/mouse/rat/platypus etc. genomes, ENCODE and others

EBI is in Hinxton, South Cambridgeshire
EBI is part of EMBL, ~like CERN for molecular biology

Molecular Biology
• The study of how life works – at a molecular level

• Key molecules:
• DNA – Information store (Disk)
• RNA – Key information transformer, also does stuff (RAM)
• Proteins – The business end of life (Chip, robotic arms)
• Metabolites – Fuel and signalling molecules (electricity)
• We have theories of how these interact, but no theories to predict what they are
• Instead we determine attributes of molecules and store them in
globally accessible, open, databases

Theory vs Observation

[Figure: fields placed on a spectrum between "Must directly observe" and "Can accurately predict from models". Fields, in the order shown: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics]

This ratio is not well correlated with data size

[Figure: Data Size (y-axis) against Ratio of model predictability (x-axis). Labels: High Energy Physics (~60PB), Molecular Biology, Astronomy, Climate Models (~5PB)]

“Knowing stuff” is critical to biology…

• The bases of the human genome
• … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow….
• The functions of proteins
• Enzymes, Transcription Factors, Signalling….
• The types of cells, their lineages and organ composition
• …and all the molecular components in each cell
• Small molecules
• … and their conversions, binding partners
• Structures of molecules, complexes and cells
• … at atomic and higher resolution

Two fundamental types of information

Experimental data
• The result of a specific experiment
• Often an experiment specific, data heavy part plus a “meta-data” part
• Might be contradictory
• “Primary paper”

Consensus Knowledge
• Integration of different strands of information on a topic
• Realised as a computationally accessible scheme
• “Review article”
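
The split between experimental data and consensus knowledge can be pictured in code. This is a minimal sketch only: the class names, field names and the majority-vote rule are all hypothetical and are not taken from any EBI database or curation process. Real consensus is built by expert judgement, not by counting.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    """One experiment: a data-heavy part plus a "meta-data" part."""
    data: dict       # experiment-specific results (hypothetical keys)
    metadata: dict   # e.g. sample, machine, analysis settings (hypothetical keys)


@dataclass
class ConsensusEntry:
    """An integrated statement on one topic, built from many records."""
    topic: str
    value: object
    supporting: list = field(default_factory=list)
    conflicting: list = field(default_factory=list)


def integrate(topic, key, records):
    """Naive integration: the most frequently reported value wins.

    Records may contradict each other, so the disagreeing ones are kept
    alongside the consensus rather than discarded. Values must be hashable.
    """
    reported = [r for r in records if key in r.data]
    if not reported:
        raise ValueError(f"no record reports {key!r}")
    value, _ = Counter(r.data[key] for r in reported).most_common(1)[0]
    return ConsensusEntry(
        topic=topic,
        value=value,
        supporting=[r for r in reported if r.data[key] == value],
        conflicting=[r for r in reported if r.data[key] != value],
    )
```

Keeping the conflicting records attached to the consensus entry mirrors the point that primary data "might be contradictory" while the integrated view still has to commit to one statement.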

Experimental Data Entry

• IntAct – Protein:Protein interactions

• GWAS Catalog – extraction of summary statistics

Experimental Meta data capture

• Sample, CDS lines in ENA
• Sample in MetaboLights, PRIDE etc
• Machine and analysis specification in PDB, PRIDE, ENA

Consensus integration of information

• GENCODE gene models in human
• Summaries and GO assignment in UniProt
• Pathway information in Reactome
• GO assignment and summaries in MODs (eg, PomBase, WormBase, PhytoPathDB etc)

Knowledge frameworks

• The EC classification
• Cell type ontologies
• Cell lineages – Worms!
• SNOMED, HPO etc
• GO ontologies

Knowledge management

• Creation of rules representing ENA standards compliance
• Cross-ontology coordination (eg, EFO) or tying (GO to ChEBI)
• RuleBase / UniRule curation processes
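
Expressing standards compliance as rules can be sketched as below. This is a toy illustration under stated assumptions: the record fields and the three rules are invented for the example, and they are not ENA's actual checklists or the RuleBase / UniRule formats.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    check: Callable[[dict], bool]   # returns True when the record complies


# Hypothetical rules over a hypothetical sample record.
RULES = [
    Rule("has_organism", lambda rec: bool(rec.get("organism"))),
    Rule("has_collection_date", lambda rec: bool(rec.get("collection_date"))),
    Rule(
        "taxon_id_is_positive_int",
        lambda rec: isinstance(rec.get("taxon_id"), int) and rec["taxon_id"] > 0,
    ),
]


def violations(record: dict, rules=RULES) -> list:
    """Return the names of the rules that the record fails, in rule order."""
    return [rule.name for rule in rules if not rule.check(record)]
```

Holding the rules as data rather than as scattered `if` statements means a curator can add or retire a rule without touching the checking code, which is the shift from direct data entry towards programming that the next slide describes.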

Data Entry vs Programming

[Figure: a spectrum from "Direct Data Entry" to "Programmatic Data Entry". Labels along it: “Messy” Scripting; Improved Data entry tools; RuleBase, Computational Accessible Standards]

Curation Dilemma

If you do your job well…
• Everyone assumes it’s easy
• People forget about the complexity
• You are ignored

If you do your job badly…
• Everyone assumes it’s easy
• People forget about the complexity
• People complain

Why we need an infrastructure…

Infrastructures are critical…

But we only notice them when they go wrong

Biology already needs an information infrastructure

• For the human genome
• (…and the mouse, and the rat, and… x 150 now, 1000 in the
future!) - Ensembl
• For the function of genes and proteins
• For all genes, in text and computational – UniProt and GO
• For all 3D structures
• To understand how proteins work – PDBe
• For where things are expressed
• The differences and functionality of cells - Atlas

..But this keeps on going…

• We have to scale across all of (interesting) life
• There are a lot of species out there!
• We have to handle new areas, in particular medicine
• A set of European haplotypes for good imputation
• A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
• Of biological chemicals
• Of chemicals which interfere with Biology

ELIXIR’s mission
To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:
• medicine
• environment
• bioindustries
• society

How?

Fully Centralised
• Pros: Stability, reuse, learning ease
• Cons: Hard to concentrate expertise across all of life science; geographic and language placement; bottlenecks and lack of diversity

Fully Distributed
• Pros: Responsive; geographic and language responsive
• Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability

Research
• International
• EBI / Elixir
• English
• Low legalities

Healthcare
• National
• Healthcare
• National Language
• Complex legalities

Other infrastructures needed for biology
• EuroBioImaging
• Cellular and whole organism Imaging
• BioBanks (BBMRI)
• We need numbers – European populations – in particular for rare
diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
• A baseline set of knockouts and phenotypes in our most tractable
mammalian model
• (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
• The ability to reliably use state of the art molecular techniques in a
clinical research setting

(you can follow me on twitter @ewanbirney)
I blog and update this on Google Plus publicly
