European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
Standardisation in BMS
European infrastructures
Managing Big Data Workshop
Setting the standards for analyzing and integrating big data
ELIXIR Hub technical coordinator
July 9-10 2014, Berlin Germany
TOC
• ELIXIR
• Standards
• BioMedBridges workshops update
• Standards
• Data deluge
2
ELIXIR
• European life sciences research
infrastructure for biological
information to facilitate research
• Safeguard data and build
sustainable data services
• Major bioinformatics service providers
participate, with support from
17 EU member states
• Creating a robust infrastructure for
biological information is a bigger
task than any individual
organisation or nation can take on
alone
3
[Figure 2: Together, the biomedical science research infrastructures address societal challenges]
BioMedBridges
Biomedical sciences research infrastructures
stronger through common links
• FP7-funded cluster project
• 21 partners in 9 countries
• Computational ‘data and
service’ bridges between the
BMS RIs
• Interoperability between
data and services in the
biological, medical,
translational and clinical
domains
4
Rafael C Jimenez
ELIXIR Hub technical coordinator
Standards
Data submission/access
[Diagram, two panels: Ideally and Reality. Legend: A = Annotator, DB = Database, QI = Query Interface, plus the User.]
Data resources in life science
• Many
• Diverse
• Dispersed
NAR online Molecular Biology Database Collection 2014
~1800 molecular biology
data resources
[Chart: axes "Utility of databases" and "Scientific impact"; labels "Too little information" and "Many, diverse and dispersed databases and interfaces". Credit: Tim Hubbard]
Data integration
Combining data residing in different sources
… providing users with a unified view of these data.
[Diagram, three panels: Ideally, Compromise and Reality. Legend: DB = Database, I = Interface, plus the User.]
[Chart: axes "Utility of bioinformatics" and "Scientific impact"; labels "Too little bioinformatics" and "Many, diverse and dispersed databases and interfaces"; one label is truncated to "Integration of".]
Data integration issues
Many data sources
• Maintain and update
• New appearing
• Many vanishing*
Different query interfaces
data integration?
Variable results
• Syntax
• Semantics
• Minimum information
* Merali Z. et al. Databases in peril. Nature 2005.
Where to find them?
Redundant data?
Standards
• Community-agreed specifications for how data types
should be represented and described.
• Standards facilitate:
• Interoperability
• Integration
• Exchange
• Portability
• Comparison
• Representation
• Sharing
• Replication
• Consistency
• Verification
• Compliance
• Reusability
• Access
• Submission
• Analysis
• Editing
• Visualization
• Conversion
• Validation
• Annotation
• Search
Data integration
[Diagram: heterogeneous integration versus homogeneous integration, with elements labelled A, B, C and 1, 2.]
Improving links between distributed European
resources
ELIXIR pilot: Interoperability of protein expression resources
The Human Protein Atlas portal is a publicly available database
with millions of high-resolution images showing the spatial
distribution of proteins in 46 different normal human tissues and
20 different cancer types, as well as 47 different human cell lines.
Standards
• Not just a format …
Components of a standard, with the molecular interactions example:
• Schema: PSI-MI
• Interfaces: PSICQUIC
• Guidelines: MIMIx/IMEX
• Ontologies: PSI-MI CV
• Format: XML/TAB
• Identifiers: IMEX/UniProt
Together these cover the definition, representation and access of data.
Standards in data sharing
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
Different formats for the same data
Molecular interaction (MI) data can be represented in:
• PSI-XML
• PSI-MITAB
• BioPAX
• RDF
• Cytoscape
• DAS
Formats differ in being:
• Comprehensive
• Simple
• Generic
• Domain specific
• Structured
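To make the "simple" end of this range concrete, here is a minimal sketch of reading one PSI-MITAB 2.5 record, which is a single line of 15 tab-separated columns. The column names in `MITAB25_COLUMNS`, the function `parse_mitab_line` and the example record are illustrative assumptions, not part of the PSI-MI specification or of any existing library, and the participant identifiers are placeholders rather than real accessions.

```python
# Short names (chosen here for illustration) for the 15 columns of PSI-MITAB 2.5.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publications",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidences",
]


def parse_mitab_line(line):
    """Turn one MITAB 2.5 line into a dict mapping column name to a list of values.

    A lone '-' marks an empty column. Multiple values are separated by '|'.
    This naive split does not handle '|' inside quoted text.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    record = {}
    for name, value in zip(MITAB25_COLUMNS, fields):
        record[name] = [] if value == "-" else value.split("|")
    return record


# Placeholder record for illustration only; it does not describe a real interaction.
EXAMPLE_LINE = "\t".join([
    "uniprotkb:EXAMPLE_A", "uniprotkb:EXAMPLE_B", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "-", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)',
    'psi-mi:"MI:0469"(IntAct)', "-", "-",
])

record = parse_mitab_line(EXAMPLE_LINE)
print(record["id_a"], record["detection_methods"])
```

The same interaction in PSI-XML would carry more detail at the cost of a heavier parser, which is the trade-off between comprehensive and simple formats.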
http://biosharing.org
Standards (formats, guidelines, ontologies) and databases
20
Registry - Identifiers
Registry - Minimum information guidelines
Registry - Controlled vocabularies
• Ontology browser: http://www.ebi.ac.uk/ontology-lookup
Ontology Lookup Service
Communities organized per domain
• Produce technical standards intended to address the needs of
a community of users.
develop, coordinate, promulgate, revise, amend, reissue, interpret
23
ELIXIR role
• Support communities developing standards
• Encourage communication among communities
• Links amongst standards
• Promote the adoption of standards
• Help to find the gaps among standards
• Recommend standards best practices in data sharing
24
Data deluge & standards
BMB workshops update
Knowledge Exchange Workshop: WP3 Standards
24-25 June 2014, VUMC, Amsterdam, The Netherlands
• Best practice for identifiers
• Development of the BMB standards registry
Identifiers Best Practice - purpose
• Recommendations for identifiers best practice
• Designing (format, re-use)
• Managing (creation, versioning, provenance, deprecation etc.)
• Using (resolving, mapping etc.)
• Publish a paper
• Introduction to identifier concepts
• Case Studies illustrating identifier usage in real-world scenarios
• Recommendations on best practice
• Show not tell
• Descriptive not normative
• for non-experts/newcomers
• Gap analysis
• list the biological entities and identifier types used by BMB partners
27
Identifiers Best Practice – topics 1/2
• Identifier formats
• syntax of database IDs, URI patterns
• Identifier management
• creation, versioning, provenance, deprecation
• Identifier resolution
• how to use an ID to get useful information about the entity
• services for this, e.g. Identifiers.org
• what info. should be given ?
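As a rough illustration of identifier resolution, the sketch below turns a compact identifier of the form `prefix:accession` into a resolvable Identifiers.org URL. It assumes the `https://identifiers.org/prefix:accession` URL pattern; the function names are hypothetical, and the code only builds the URL without contacting the service.

```python
IDENTIFIERS_ORG = "https://identifiers.org/"  # assumed resolver base URL


def split_compact_id(compact_id):
    """Split 'prefix:accession' at the first colon and lower-case the prefix."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return prefix.lower(), accession


def resolver_url(compact_id):
    """Build the URL a resolver would redirect to the entity's record."""
    prefix, accession = split_compact_id(compact_id)
    return f"{IDENTIFIERS_ORG}{prefix}:{accession}"


print(resolver_url("uniprot:P12345"))
```

What the resolver returns for that URL (a landing page, a machine-readable record, or both) is exactly the open question the last bullet raises.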
28
Identifiers Best Practice – topics 2/2
• Identifier mapping / aggregation
• how to map IDs on entries in one resource to those in another, to
assign equivalence / make useful links
• e.g. IDs for equivalent protein sequences in different organisms
• e.g. probe->gene->pathway->function (GO)
• Identifier cataloguing
• compilations of types of identifiers (e.g. EDAM ontology) or specific
identifiers (e.g. Cell Line Ontology)
• what info. should be given?
• Case Studies
• use of IDs in a particular domain
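The probe->gene->pathway->function example can be sketched as a chain of lookup tables, each mapping identifiers in one resource to identifiers in the next. Everything below is a toy under stated assumptions: the function `follow_chain` is hypothetical and the identifiers are placeholders, not real probe, gene, pathway or GO identifiers.

```python
def follow_chain(start_id, mapping_tables):
    """Follow an identifier through successive one-to-many mapping tables.

    Returns the set of identifiers reached in the last table's target
    resource, or an empty set if the chain breaks at any step.
    """
    current = {start_id}
    for table in mapping_tables:
        reached = set()
        for identifier in current:
            reached.update(table.get(identifier, ()))
        current = reached
    return current


# Placeholder mappings for illustration only.
PROBE_TO_GENE = {"probe_1": ["gene_A"], "probe_2": ["gene_A", "gene_B"]}
GENE_TO_PATHWAY = {"gene_A": ["pathway_X"], "gene_B": ["pathway_X", "pathway_Y"]}
PATHWAY_TO_FUNCTION = {"pathway_X": ["function_1"], "pathway_Y": ["function_2"]}

CHAIN = [PROBE_TO_GENE, GENE_TO_PATHWAY, PATHWAY_TO_FUNCTION]
print(sorted(follow_chain("probe_2", CHAIN)))
```

Because each step can be one-to-many, results fan out quickly, which is one reason mappings need clear statements of what kind of equivalence they assert.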
29
Standards format registry - purpose
• Discovery, Agreement, Benchmarking, …
• Facilitate syntactic interoperability across research infrastructures
so samples and data can be integrated and analysed across
ESFRI BMS domains.
30
Standards format registry - topics
• Catalogue of standards
• Interoperability among registries
• identifiers.org, BioSharing, EDAM, ELIXIR service registry
• Adapt to user needs
• Community of users vs. community of producers
• Microstandards
• Standards mapping
• Access to expert knowledge
• Assess fit for purpose
• Rating/metrics
31
BioMedBridges Knowledge Exchange Workshop
Tuesday 24 - Wednesday 25 June 2014, VUMC, Amsterdam, The Netherlands
Workshop organised by WP3 - Standards Description and
Harmonisation, to bring together BMB partners, biomedical
standards experts and representatives of external projects.
Best practice for identifiers
• Gap analysis of current identifiers
Development of the BMB standards registry
• Gap analysis for usage of the registry
• Integration of the registry with other tools
32
BioMedBridges workshops
Knowledge Exchange Workshop: WP3 Standards
24-25 June 2014
VUMC, Amsterdam, The Netherlands
E-Infrastructure support for the life sciences:
Preparing for the data deluge
15 May 2014
Genome Campus, Hinxton, UK
BioMedBridges workshop
E-Infrastructure support for the life sciences:
Preparing for the data deluge
15 May 2014
Genome Campus, Hinxton, UK
Knowledge exchange workshop
• Discussion of big data challenges in life sciences
• Focus on a few representative domains
• Looking 5 years ahead
• Jointly identify potential solutions to our problems
[Diagram labels: Data; ICT (e-infrastructures), physical facilities; LS (life sciences), scientific information; Transfer, Computation, Storage.]
How does it affect data sharing
in life sciences?
Large-scale data sharing in the life sciences
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
How does big data affect data sharing?
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
[Diagram: What, How and Where, each annotated with Compute, Storage and Transfer.]
Growing data
Guy Cochrane, EMBL-EBI
Cost of DNA sequencing
46
Data generation vs. data transfer
Data generated in 24 hours, and the time needed to move it by network file transfer:
Source              Data in 24 hours   1 Gb       100 Mb     10 Mb
DNA sequencing      ~100 GB            ~30 min    ~5 hours   ~2 days
Mass spectrometry   ~4 TB              ~9 hours   ~4 days    ~5 weeks
Microscopy          ~4 TB              ~9 hours   ~4 days    ~5 weeks
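The transfer times on this slide come from dividing data volume by link speed. The sketch below does that arithmetic, assuming the link speeds are in bits per second, decimal units (1 GB = 10^9 bytes), and an ideal link unless an `efficiency` factor is given. The function name and the efficiency parameter are illustrative assumptions. The slide's figures are rounded and may include protocol overhead, so the ideal values need not match them exactly.

```python
GB = 10**9           # bytes, decimal units assumed
TB = 10**12          # bytes
MBIT = 10**6         # bits per second
GBIT = 10**9         # bits per second


def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Time to move size_bytes over a link, ignoring latency.

    efficiency is the usable fraction of the nominal link speed (0 < e <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


for label, speed in [("1 Gb/s", GBIT), ("100 Mb/s", 100 * MBIT), ("10 Mb/s", 10 * MBIT)]:
    hours = transfer_seconds(4 * TB, speed) / 3600
    print(f"4 TB over {label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions 4 TB takes roughly 9 hours at 1 Gb/s, about 4 days at 100 Mb/s and about 5 weeks at 10 Mb/s, so a day of instrument output can take longer to move than it took to produce.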
Bottlenecks in Life Sciences?
• Data production grows faster than storage
• Cost of data production technologies declines faster than
the cost of storage
• It takes longer to transfer data than to produce it.
Data growth
how to reduce the IT budget shortfall?
http://www.eweek.com/
Optimization
Using technology more effectively
Selecting relevant data
Potential solutions
• Storage
• Data compression
• Select what we store
• Evaluate data reproducibility & value of data
• Network
• Faster protocols
• Partitioning
• Network upgrade
• Computation
• Clouds
• Data close to computation
Data compression
• Efficient representation
• Capacity for controlled data
reduction
• Efficient transformations
• Tool chain
[Chart: precision against compression]
CRAM
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based
compression. Genome Res. 21 (5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. GigaScience 2012, 1:2
http://www.ebi.ac.uk/ena/about/cram_toolkit
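The idea behind reference-based compression is to store where a read aligns and how it differs from the reference, rather than the read itself. The toy below shows only that idea, handling substitutions alone. It is not the CRAM format or the CRAM toolkit API, and the function names and sequences are made up for illustration.

```python
def encode_read(reference, read, position):
    """Encode an aligned read as (position, length, substitutions).

    Only bases that differ from the reference are stored, as
    (offset within the read, base) pairs.
    """
    segment = reference[position:position + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    substitutions = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return position, len(read), substitutions


def decode_read(reference, encoded):
    """Rebuild the read from the reference and the stored differences."""
    position, length, substitutions = encoded
    bases = list(reference[position:position + length])
    for offset, base in substitutions:
        bases[offset] = base
    return "".join(bases)


REFERENCE = "ACGTACGTACGT"   # made-up sequence
READ = "GTACCT"              # aligns at position 2 with one substitution

encoded = encode_read(REFERENCE, READ, 2)
print(encoded, decode_read(REFERENCE, encoded))
```

A read that matches the reference exactly costs only a position and a length, which is where most of the saving comes from. Controlled data reduction would go further, for example by dropping or coarsening per-base quality values.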
Data transfer optimization
• e.g. Getting more from available bandwidth
Guy Cochrane, EMBL-EBI
Data partitioning
• Organisation of data around biological concepts
• Indexing system around these concepts
• Support for requests for partitions along this index
Reference-oriented indexing
Guy Cochrane, EMBL-EBI
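A minimal sketch of reference-oriented partitioning, assuming records are indexed by the reference sequence and start coordinate they align to, so that a client can request one region instead of the whole data set. The class `ReferenceIndex`, its method names and the sample records are hypothetical.

```python
import bisect


class ReferenceIndex:
    """Index records by (reference name, start coordinate)."""

    def __init__(self, records):
        # records: iterable of (reference_name, start, payload)
        self._entries = {}
        for reference_name, start, payload in records:
            self._entries.setdefault(reference_name, []).append((start, payload))
        for entries in self._entries.values():
            entries.sort(key=lambda entry: entry[0])
        self._starts = {
            name: [start for start, _ in entries]
            for name, entries in self._entries.items()
        }

    def partition(self, reference_name, begin, end):
        """Return payloads whose start lies in the half-open range [begin, end)."""
        entries = self._entries.get(reference_name, [])
        starts = self._starts.get(reference_name, [])
        low = bisect.bisect_left(starts, begin)
        high = bisect.bisect_left(starts, end)
        return [payload for _, payload in entries[low:high]]


# Made-up records for illustration.
index = ReferenceIndex([
    ("chr1", 100, "read_1"),
    ("chr1", 5000, "read_2"),
    ("chr2", 150, "read_3"),
    ("chr1", 250, "read_4"),
])
print(index.partition("chr1", 0, 1000))
```

Serving only the requested partition reduces what has to cross the network, which ties partitioning to the transfer bottleneck discussed earlier.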
What data is relevant?
Life sciences diversity
Genomes
Nucleotides
Transcripts
Proteins
Complexes
Pathways
Small molecules
Structures
Domains
Cells
Biobanks
Tissues and organs
Human populations
Therapies
Disease prevention
Early diagnosis
Human individuals
Life sciences diversity
• Different communities
• Some similar requirements
• Not always the same solutions
Proteomics, Metabolomics, Clinical data, Genomics, Imaging
Some conclusions
• Opportunity for e-infrastructures to better understand BMS RI problems.
• Identification of bottlenecks
• Discussion of some potential solutions
• Data growth will change how we do things today
• Different communities -> different models -> some common solutions
• Solutions have to come from use cases
• BMS RIs need to define their requirements better
• We need to use technology more efficiently
• BMS community has to evaluate the practicality of storing everything
• Privacy issues make big data more challenging
• Difficult to separate big data from computation
• Shortage of expertise in how to deal with scientific data and IT services
Thank you
Data deposition
Data submission
[Diagram labels: raw data, processed data, metadata, centralized database.]
Data sharing
From: Data on my disk and available to anyone who requests it
To: Submission to data repositories
Data submissions
[Diagram labels: Data repository, Journal, submission, reads, Journal request, Curator, Data Management Plan, Data management.]
Data sharing
Will big data affect data deposition?
From: Data on my disk and available to anyone who requests it
To: Submission to data repositories
Data submissions
How much data?
How much available data?


Editor's Notes

  • #5 The previous example leads into the BioMedBridges project, which builds bridges between the infrastructures and is starting to develop data and service bridges to support research projects that will access and benefit from services involving several of them.
  • #7 As a biologist I would prefer to see all the information in one unique database. Centralized databases have this mission: they aim to collect all the information for one specific domain. However, medium-size labs and organizations are capable of producing large amounts of data, so it becomes harder to submit data to centralized repositories. Moreover, data producers like to control and structure their own databases, developing their own GUIs and access protocols. For us, the users, it becomes harder to access the information. For one specific domain we might find different databases using different GUIs. We might end up downloading data in different formats, which complicates the integration of results. After integration we might find high redundancy in our results.
  • #43 Data resource: Sustainability, availability and integration
  • #46 'Compute power' doubles every two years. Production of data doubles faster.
  • #47 Sequencing prices have fallen below the Moore's law curve. Moore's law predicts an exponential decline in computing cost, with 'compute power' doubling every two years. Storing data is more expensive than producing it.
  • #48 Technology gets cheaper and faster. ~15,000 hospitals, ~4,000 universities, ~2,000 life sciences research institutes. How much data will we produce? How will we store it?
  • #49 decline of computing cost
  • #63 necessary to understand, develop or reproduce published research
  • #64 Not all the data make it to the public repositories
  • #65 necessary to understand, develop or reproduce published research