2. Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (> 20 years!)
• Trained as a biochemist - most people think I am CS
• Analysed - sometimes led - human/mouse/rat/platypus etc. genomes, ENCODE, others
EBI is in Hinxton, South Cambridgeshire
EBI is part of EMBL, ~like CERN for molecular biology
3. Molecular Biology
• The study of how life works - at a molecular level
• Key molecules:
  • DNA - Information store (Disk)
  • RNA - Key information transformer, also does stuff (RAM)
  • Proteins - The business end of life (Chip, robotic arms)
  • Metabolites - Fuel and signalling molecules (electricity)
• Theories of how these interact - no theories to predict what they are
• Instead we determine attributes of molecules and store them in globally accessible, open databases
4. Theory ↔ Observation
[Figure: axis with ends labelled "Can accurately predict from models" and "Must directly observe"; fields placed along it: Molecular Biology; Geology, Astronomy; Climate modelling; High Energy Physics]
5. This ratio is not well correlated with data size
[Figure: Data Size (ticks at ~5PB and ~60PB) plotted against Ratio of model predictability; points: High Energy Physics, Molecular Biology, Astronomy, Climate Models]
6. “Knowing stuff” is critical to biology…
• The bases of the human genome
  • … and the Mouse, Rat, Wheat, E. coli, Plasmodium, Cow…
• The functions of proteins
  • Enzymes, Transcription Factors, Signalling…
• The types of cells, their lineages and organ composition
  • … and all the molecular components in each cell
• Small molecules
  • … and their conversions, binding partners
• Structures of molecules, complexes and cells
  • … at atomic and higher resolution
7. Two fundamental types of information
• Experimental data
  • The result of a specific experiment
  • Often an experiment-specific, data-heavy part plus a “meta-data” part
  • Might be contradictory
  • “Primary paper”
• Consensus Knowledge
  • Integration of different strands of information on a topic
  • Realised as a computationally accessible scheme
  • “Review article”
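The two kinds of information on this slide could be sketched as a pair of record types. This is a minimal illustration only: the names `ExperimentalRecord`, `ConsensusEntry` and `integrate`, the majority-vote rule, and the sample values are all assumptions made up for this example, not how any EBI resource is implemented. Real curation involves expert judgement rather than counting.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExperimentalRecord:
    """One experiment: a data-heavy part plus a meta-data part."""
    topic: str
    observation: str  # stands in for the data-heavy part
    metadata: dict = field(default_factory=dict)


@dataclass
class ConsensusEntry:
    """Integrated view of a topic, keeping links back to the evidence."""
    topic: str
    statement: str
    evidence: list


def integrate(topic, records):
    """Hypothetical integration rule: keep the most common observation."""
    relevant = [r for r in records if r.topic == topic]
    if not relevant:
        raise ValueError(f"no experimental records for {topic!r}")
    counts = Counter(r.observation for r in relevant)
    statement, _ = counts.most_common(1)[0]
    return ConsensusEntry(topic, statement, relevant)


# Made-up records; note that they contradict each other.
records = [
    ExperimentalRecord("geneX", "kinase", {"sample": "s1"}),
    ExperimentalRecord("geneX", "kinase", {"sample": "s2"}),
    ExperimentalRecord("geneX", "phosphatase", {"sample": "s3"}),
]
entry = integrate("geneX", records)
print(entry.statement, len(entry.evidence))
```

The consensus entry keeps every experimental record as evidence, including the contradictory one, so the "review article" view never discards the "primary papers" it was built from.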
9. Experimental Data Entry
• IntAct - Protein:Protein interactions
• GWAS Catalog - extraction of summary statistics
10. Experimental Meta data capture
• Sample, CDS lines in ENA
• Sample in MetaboLights, PRIDE etc.
• Machine and analysis specification in PDB, PRIDE, ENA
11. Consensus integration of information
• GENCODE gene models in human
• Summaries and GO assignment in UniProt
• Pathway information in Reactome
• GO assignment and summaries in MODs (e.g. PomBase, WormBase, PhytoPathDB etc.)
12. Knowledge frameworks
• The EC classification
• Cell type ontologies
• Cell lineages - Worms!
• SNOMED, HPO etc.
• GO ontologies
13. Knowledge management
• Creation of rules representing ENA standards compliance
• Cross-ontology coordination (e.g. EFO) or tying (GO ↔ ChEBI)
• RuleBase / UniRule curation processes
14. Data Entry vs Programming
[Figure: spectrum from Direct Data Entry to Programmatic Data Entry; labels shown: “Messy” Scripting; Improved Data entry tools; RuleBase, Computational Accessible Standards]
16. Curation Dilemma
• If you do your job well…
  • Everyone assumes it's easy
  • People forget about the complexity
  • You are ignored
• If you do your job badly…
  • Everyone assumes it's easy
  • People forget about the complexity
  • People complain
20. Biology already needs an information infrastructure
• For the human genome
  • (… and the mouse, and the rat, and… x 150 now, 1000 in the future!) - Ensembl
• For the function of genes and proteins
  • For all genes, in text and computational - UniProt and GO
• For all 3D structures
  • To understand how proteins work - PDBe
• For where things are expressed
  • The differences and functionality of cells - Atlas
21. …But this keeps on going…
• We have to scale across all of (interesting) life
  • There are a lot of species out there!
• We have to handle new areas, in particular medicine
  • A set of European haplotypes for good imputation
  • A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
  • Of biological chemicals
  • Of chemicals which interfere with Biology
22. ELIXIR's mission
To build a sustainable European infrastructure for biological information, supporting life science research and its translation to: medicine, environment, bioindustries, society
23. How?
Fully Centralised
Pros: Stability, reuse, learning ease
Cons: Hard to concentrate expertise across all of life science; geographic, language placement; bottlenecks and lack of diversity
Fully Distributed
Pros: Responsive; geographic, language responsive
Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability
24. Research / Healthcare
• Research: International; EBI / ELIXIR; English; Low legalities
• Healthcare: National; Healthcare; National Language; Complex legalities
25. Other infrastructures needed for biology
• Euro-BioImaging
  • Cellular and whole organism Imaging
• BioBanks (BBMRI)
  • We need numbers - European populations - in particular for rare diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
  • A baseline set of knockouts and phenotypes in our most tractable mammalian model
  • (it's hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
  • The ability to reliably use state of the art molecular techniques in a clinical research setting
26. (you can follow me on Twitter @ewanbirney)
I blog and update this on Google Plus publicly