Standardisation in BMS European infrastructures

European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
Standardisation in BMS
European infrastructures
Managing Big Data Workshop
Setting the standards for analyzing and integrating big data
ELIXIR Hub technical coordinator
July 9-10 2014, Berlin Germany

TOC
• ELIXIR
• Standards
• BioMedBridges workshops update
• Standards
• Data deluge

ELIXIR
• European life sciences research
infrastructure for biological
information to facilitate research
• Safeguard data and build
sustainable data services
• Brings together major bioinformatics
service providers and is supported by
17 EU member states
• Creating a robust infrastructure for
biological information is a bigger
task than any individual
organisation or nation can take on
alone

Figure 2: Together, the biomedical science research infrastructures address societal challenges
By establishing interoperability between data and services in the biological,
medical, translational and clinical domains, BioMedBridges links basic
BioMedBridges
Biomedical sciences research infrastructures
stronger through common links
• FP7-funded cluster project
• 21 partners in 9 countries
• Computational ‘data and
service’ bridges between the
BMS RIs
• Interoperability between
data and services in the
biological, medical,
translational and clinical
domains

Rafael C Jimenez
ELIXIR Hub technical coordinator
Standards

Data submission/access
[Figure: two panels, "Ideally" and "Reality". Legend: A = annotator, DB = database, QI = query interface, user.]

Data resources in life science
• Many
• Diverse
• Dispersed
NAR online Molecular Biology Database Collection 2014
~1800 molecular biology
data resources

Utility of databases
[Figure (Tim Hubbard), labels: scientific impact; "Too little information"; "Many, diverse & dispersed databases and interfaces".]

Data integration
Combining data residing in different sources
… providing users with a unified view of these data.
[Figure: three panels, "Ideally", "Compromise" and "Reality". Legend: DB = database, I = interface, user.]

Utility of bioinformatics
[Figure labels: scientific impact; "Too little bioinformatics"; "Integration of many, diverse & dispersed databases and interfaces".]

Data integration issues
Many data sources
• Maintain and update
• New appearing
• Many vanishing*
Different query interfaces
data integration?
Variable results
• Syntax
• Semantics
• Minimum information
* Merali Z. et al. Databases in peril. Nature 2005.
Where to find them?
Redundant data?
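
The syntax, semantics and redundancy issues listed above can be illustrated with a minimal Python sketch. Both "sources", their field names and the accessions are hypothetical; the sketch only shows how two representations might be mapped onto one schema before redundant entries are merged into a unified view.

```python
# Hypothetical records from two sources describing the same kind of entity.
# Source A uses bare accessions and NCBI taxonomy IDs.
SOURCE_A = [{"acc": "X00001", "organism": "9606"}]
# Source B uses prefixed identifiers and species names.
SOURCE_B = [
    {"protein_id": "exampledb:X00001", "taxon": "Homo sapiens"},
    {"protein_id": "exampledb:X00002", "taxon": "Mus musculus"},
]

# Semantic mapping from species names to NCBI taxonomy IDs.
TAXON_NAME_TO_ID = {"Homo sapiens": "9606", "Mus musculus": "10090"}


def from_source_a(rec):
    """Map a source A record onto the common schema."""
    return {"accession": rec["acc"], "taxon_id": rec["organism"]}


def from_source_b(rec):
    """Map a source B record onto the common schema."""
    accession = rec["protein_id"].split(":", 1)[1]  # drop the database prefix
    return {"accession": accession, "taxon_id": TAXON_NAME_TO_ID[rec["taxon"]]}


def unified_view(source_a, source_b):
    """Return one record per accession, keeping the first record seen."""
    merged = {}
    normalised = [from_source_a(r) for r in source_a]
    normalised += [from_source_b(r) for r in source_b]
    for rec in normalised:
        merged.setdefault(rec["accession"], rec)
    return sorted(merged.values(), key=lambda r: r["accession"])


if __name__ == "__main__":
    for row in unified_view(SOURCE_A, SOURCE_B):
        print(row)
```

Without an agreed standard, every pair of sources needs its own mapping functions like the two above.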

Standards
• Community agreed specification for how data types
should be represented and described.
• Standards facilitate:
  • Interoperability
  • Integration
  • Exchange
  • Portability
  • Comparison
  • Representation
  • Sharing
  • Replication
  • Consistency
  • Verification
  • Compliance
  • Reusability
  • Access
  • Submission
  • Analysis
  • Editing
  • Visualization
  • Conversion
  • Validation
  • Annotation
  • Search

Data integration
• Heterogeneous integration
• Homogeneous integration
[Figure: data sources A, B and C; integration steps 1 and 2.]

Improving Links Between distributed European
resources
ELIXIR pilot: Interoperability of protein expression resources
The Human Protein Atlas portal is a publicly available database
with millions of high-resolution images showing the spatial
distribution of proteins in 46 different normal human tissues and
20 different cancer types, as well as 47 different human cell lines.

Standards
• Not just a format …
[Diagram: data surrounded by schema, interfaces, guidelines, ontologies, format and identifiers, grouped under definition, representation and access.]

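Molecular interactions
• Schema: PSI-MI
• Interfaces: PSICQUIC
• Guidelines: MIMIx/IMEX
• Ontologies: PSI-MI CV
• Format: XML/TAB
• Identifiers: IMEX/UniProt
[Diagram: the components above surround the data, grouped under definition, representation and access.]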

Standards in data sharing
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552

Different formats for the same data
Molecular interaction (MI) data in:
• PSI-XML
• PSI-MITAB
• BioPAX
• RDF
• Cytoscape
• DAS
Format characteristics:
• Comprehensive
• Simple
• Generic
• Domain specific
• Structured
http://biosharing.org
Standards (formats, guidelines, ontologies) and databases

Registry - Minimum information guidelines

Registry - Controlled vocabularies
• Ontology browser: http://www.ebi.ac.uk/ontology-lookup
Ontology Lookup Service

Communities organized per domain
• Produce technical standards intended to address the needs of
a community of users.
• Develop, coordinate, promulgate, revise, amend, reissue and interpret standards

ELIXIR role
• Support communities developing standards
• Encourage communication among communities
• Links amongst standards
• Promote the adoption of standards
• Help to find the gaps among standards
• Recommend standards best practices in data sharing

Data deluge & standards
BMB workshops update

Knowledge Exchange Workshop: WP3 Standards
24-25 June 2014, VUMC, Amsterdam, The Netherlands
• Best practice for identifiers
• Development of the BMB standards registry

Identifiers Best Practice - purpose
• Recommendations for identifiers best practice
  • Designing (format, re-use)
  • Managing (creation, versioning, provenance, deprecation etc.)
  • Using (resolving, mapping etc.)
• Publish a paper
  • Introduction to identifier concepts
  • Case studies illustrating identifier usage in real-world scenarios
  • Recommendations on best practice
  • Show not tell
  • Descriptive not normative
  • For non-experts/newcomers
• Gap analysis
  • List the biological entities and identifier types used by BMB partners

Identifiers Best Practice – topics 1/2
• Identifier formats
  • syntax of database IDs, URI patterns
• Identifier management
  • creation, versioning, provenance, deprecation
• Identifier resolution
  • how to use an ID to get useful information about the entity
  • services for this, e.g. Identifiers.org
  • what information should be given?
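
Identifier formats and resolution can be sketched in a few lines of Python. The sketch assumes the `prefix:accession` compact-identifier pattern resolved by Identifiers.org. The regular expression is a simplification written for this example, and no network request is made.

```python
import re

# Assumed resolver base URL for this sketch.
RESOLVER_BASE = "https://identifiers.org/"

# Simplified pattern: a prefix, a colon, then an accession without whitespace.
_COMPACT_ID = re.compile(r"^(?P<prefix>[A-Za-z0-9._]+):(?P<accession>\S+)$")


def split_compact_id(compact_id):
    """Return (prefix, accession) or raise ValueError if the syntax is wrong."""
    match = _COMPACT_ID.match(compact_id)
    if match is None:
        raise ValueError(f"not a prefix:accession identifier: {compact_id!r}")
    return match.group("prefix"), match.group("accession")


def resolver_url(compact_id):
    """Build the URL a resolver would be asked to redirect."""
    prefix, accession = split_compact_id(compact_id)
    return f"{RESOLVER_BASE}{prefix}:{accession}"


if __name__ == "__main__":
    print(resolver_url("taxonomy:9606"))
```

Keeping the database-specific accession separate from the resolver URL means links can survive a change of provider address.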

Identifiers Best Practice – topics 2/2
• Identifier mapping / aggregation
  • how to map IDs on entries in one resource to those in another, to
  assign equivalence / make useful links
  • e.g. IDs for equivalent protein sequences in different organisms
  • e.g. probe->gene->pathway->function (GO)
• Identifier cataloguing
  • compilations of types of identifiers (e.g. EDAM ontology) or specific
  identifiers (e.g. Cell Line Ontology)
  • what information should be given?
• Case studies
  • use of IDs in a particular domain
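
The probe->gene->pathway->function chain mentioned above can be sketched as a sequence of lookups. All identifiers and mapping tables below are hypothetical placeholders; real mappings would come from the resources concerned.

```python
# Hypothetical one-to-many mapping tables between identifier types.
PROBE_TO_GENE = {"probe_001": ["GENE_A"], "probe_002": ["GENE_A", "GENE_B"]}
GENE_TO_PATHWAY = {"GENE_A": ["PATHWAY_1"], "GENE_B": ["PATHWAY_1", "PATHWAY_2"]}
PATHWAY_TO_FUNCTION = {"PATHWAY_1": ["FUNC_X"], "PATHWAY_2": ["FUNC_Y"]}


def follow(ids, mapping):
    """Map a set of IDs through one table; unmapped IDs are dropped."""
    result = set()
    for identifier in ids:
        result.update(mapping.get(identifier, []))
    return result


def map_chain(start_id, mappings):
    """Follow start_id through each mapping table in order."""
    current = {start_id}
    for mapping in mappings:
        current = follow(current, mapping)
    return sorted(current)


if __name__ == "__main__":
    chain = [PROBE_TO_GENE, GENE_TO_PATHWAY, PATHWAY_TO_FUNCTION]
    print(map_chain("probe_002", chain))
```

Each hop can lose or multiply entries, which is why provenance and versioning of the mapping tables matter.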

Standards format registry - purpose
• Discovery, Agreement, Benchmarking, …
• Facilitate syntactic interoperability across research infrastructures
so samples and data can be integrated and analysed across
ESFRI BMS domains.

Standards format registry - topics
• Catalogue of standards
• Interoperability among registries
  • identifiers.org, BioSharing, EDAM, ELIXIR service registry
• Adapt to user needs
• Community of users vs. community of producers
• Microstandards
• Standards mapping
• Access to expert knowledge
• Assess fit for purpose
• Rating/metrics

BioMedBridges Knowledge Exchange Workshop
Tuesday 24 - Wednesday 25 June 2014, VUMC, Amsterdam, The Netherlands
Workshop organised by WP3 (Standards Description and
Harmonisation) to bring together BMB partners, biomedical
standards experts and representatives of external projects.
Best practice for identifiers
• Gap analysis of current identifiers
Development of the BMB standards registry
• Gap analysis for usage of the registry
• Integration of the registry with other tools

BioMedBridges workshops
• Knowledge Exchange Workshop: WP3 Standards
  24-25 June 2014, VUMC, Amsterdam, The Netherlands
• E-Infrastructure support for the life sciences: Preparing for the data deluge
  15 May 2014, Genome Campus, Hinxton, UK

BioMedBridges workshop
E-Infrastructure support for the life sciences:
Preparing for the data deluge
15 May 2014, Genome Campus, Hinxton, UK

Knowledge exchange workshop
• Discussion of big data challenges in life sciences
• Focus on few representative domains
• Looking 5 years ahead
• Jointly identify potential solutions to our problems
[Diagram labels: data; ICT e-infrastructures (physical facilities: transfer, computation, storage); LS life sciences (scientific information).]
How does it affect data sharing
in life sciences?

Large-scale data sharing in the life sciences

How does big data affect data sharing?
[Figure labels: compute, storage, transfer; what, how, where.]

Growing data
Guy Cochrane, EMBL-EBI

Data generation vs. data transfer
Data generated in 24 hours and network file transfer time:
Technology         Data in 24 hours  1 Gb/s    100 Mb/s  10 Mb/s
DNA sequencing     ~100 GB           ~30 min   ~5 hours  ~2 days
Mass spectrometry  ~4 TB             ~9 hours  ~4 days   ~5 weeks
Microscopy         ~4 TB             ~9 hours  ~4 days   ~5 weeks
Bottlenecks in Life Sciences?
• Data production grows faster than storage capacity
• The cost of data production technologies declines faster than
the cost of storage
• It takes longer to transfer data than to produce it

Data growth
how to reduce the IT budget shortfall?
http://www.eweek.com/

Optimization
Using technology more effectively
Selecting relevant data

Potential solutions
• Storage
  • Data compression
  • Select what we store
  • Evaluate data reproducibility & value of data
• Network
  • Faster protocols
  • Partitioning
  • Network upgrade
• Computation
  • Clouds
  • Data close to computation

Data compression
• Efficient representation
• Capacity for controlled data reduction
• Efficient transformations
• Tool chain
[Figure: CRAM, precision vs. compression.]
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. GigaScience 2012, 1:2
http://www.ebi.ac.uk/ena/about/cram_toolkit
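
The idea behind reference-based compression can be shown with a toy Python sketch. This is not the CRAM format or its toolkit: it only stores, for a read already aligned to a known reference, the position, the length and the bases that differ. The sequences are made up, and indels and quality scores are ignored.

```python
def encode_read(reference, read, position):
    """Encode an aligned read as (position, length, mismatches).

    mismatches is a list of (offset, base) pairs where the read
    differs from the reference.
    """
    segment = reference[position:position + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(segment, read))
        if ref_base != read_base
    ]
    return (position, len(read), mismatches)


def decode_read(reference, record):
    """Rebuild the read from the reference and the stored differences."""
    position, length, mismatches = record
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"   # made-up reference sequence
    read = "ACGAAC"              # aligned at position 4, one mismatch
    record = encode_read(reference, read, 4)
    print(record, decode_read(reference, record) == read)
```

A read that matches the reference exactly costs only a position and a length, which is where most of the saving comes from.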

Data transfer optimization
• e.g. Getting more from available bandwidth

Data partitioning
• Organisation of data around biological concepts
• Indexing system around these concepts
• Support for requests for partitions along this index
Reference-oriented indexing
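
The three points above can be sketched as a small index keyed by reference sequence and position, so that a client requests only the partition it needs instead of a whole file. The class name, record layout and coordinates are hypothetical; production systems would use on-disk indexes rather than in-memory lists.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ReferenceIndex:
    """Index records by (reference name, position) for range requests."""

    def __init__(self, records):
        # records: iterable of (reference, position, payload) tuples.
        grouped = defaultdict(list)
        for reference, position, payload in records:
            grouped[reference].append((position, payload))
        self._entries = {}
        for reference, entries in grouped.items():
            entries.sort(key=lambda entry: entry[0])
            positions = [position for position, _ in entries]
            payloads = [payload for _, payload in entries]
            self._entries[reference] = (positions, payloads)

    def partition(self, reference, start, end):
        """Return payloads with start <= position <= end on one reference."""
        positions, payloads = self._entries.get(reference, ([], []))
        low = bisect_left(positions, start)
        high = bisect_right(positions, end)
        return payloads[low:high]


if __name__ == "__main__":
    index = ReferenceIndex([
        ("chr1", 100, "r1"), ("chr1", 5000, "r2"),
        ("chr2", 150, "r3"), ("chr1", 250, "r4"),
    ])
    print(index.partition("chr1", 100, 300))
```

Only the requested slice needs to cross the network, which links partitioning to the transfer bottleneck discussed earlier.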

Life sciences diversity
Genomes
Nucleotides
Transcripts
Proteins
Complexes
Pathways
Small molecules
Structures
Domains
Cells
Biobanks
Tissues and organs
Human populations
Therapies
Disease prevention
Early diagnosis
Human individuals

Life sciences diversity
• Different communities
• Some similar requirements
• Not always the same solutions
Proteomics, Metabolomics, Clinical data, Genomics, Imaging

Some conclusions
• Opportunity for e-infrastructures to better understand BMS RI problems.
• Identification of bottlenecks
• Discussion of some potential solutions
• Data growth will change how we do things today
• Different communities -> different models -> some common solutions
• Solutions have to come from use cases
• BMS RIs need to define their requirements better
• We need to use technology more efficiently
• BMS community has to evaluate the practicality of storing everything
• Privacy issues make big data more challenging
• Difficult to separate big data from computation
• Shortage of expertise in how to deal with scientific data and IT services

Thank you

Data submission
[Diagram: raw data, processed data and metadata submitted to a centralized database.]

Data sharing
From: data on my disk and available to anyone who requests it
To: submission to data repositories

Data submissions
[Diagram labels: data repository, journal, submission, reads, journal request, curator, data management plan, data management.]

Data sharing
Will big data affect data deposition?
From: data on my disk and available to anyone who requests it
To: submission to data repositories

Data submissions
How much data?
How much available data?
