Standards Key to Unlocking Biological Data Integration
1. European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
Standardisation in BMS
European infrastructures
Managing Big Data Workshop
Setting the standards for analyzing and integrating big data
ELIXIR Hub technical coordinator
July 9-10 2014, Berlin, Germany
3. ELIXIR
• European life sciences research
infrastructure for biological
information to facilitate research
• Safeguard data and build
sustainable data services
• Major bioinformatics service
providers participate, supported by
17 EU member states
• Creating a robust infrastructure for
biological information is a bigger
task than any individual
organisation or nation can take on
alone
4. BioMedBridges
Biomedical sciences research infrastructures: stronger through common links
Figure 2: Together, the biomedical science research infrastructures address societal challenges
By establishing interoperability between data and services in the biological,
medical, translational and clinical domains, BioMedBridges links basic …
• FP7-funded cluster project
• 21 partners in 9 countries
• Computational ‘data and
service’ bridges between the
BMS RIs
• Interoperability between
data and services in the
biological, medical,
translational and clinical
domains
5. Standards
Rafael C Jimenez
ELIXIR Hub technical coordinator
7. Data resources in life science
• Many
• Diverse
• Dispersed
~1800 molecular biology data resources
(NAR online Molecular Biology Database Collection 2014)
10. Many, diverse & dispersed databases and interfaces
[Figure labels: Utility of bioinformatics; Scientific impact; Too little bioinformatics; Integration of]
11. Data integration issues
Many data sources
• Maintain and update
• New appearing
• Many vanishing*
Different query interfaces
data integration?
Variable results
• Syntax
• Semantics
• Minimum information
* Merali Z. et al. Databases in peril. Nature 2005.
Where to find them?
Redundant data?
12. Standards
• Community-agreed specifications for how data types
should be represented and described.
• Standards facilitate:
Interoperability
Integration
Exchange
Portability
Comparison
Representation
Sharing
Replication
Consistency
Verification
Compliance
Reusability
Access
Submission
Analysis
Editing
Visualization
Conversion
Validation
Annotation
Search
14. Improving links between distributed European
resources
ELIXIR pilot: Interoperability of protein expression resources
The Human Protein Atlas portal is a publicly available database
with millions of high-resolution images showing the spatial
distribution of proteins in 46 different normal human tissues and
20 different cancer types, as well as 47 different human cell lines.
17. Standards in data sharing
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
18. Different formats for the same data
MI data
• PSI-XML
• PSI-MITAB
• BioPAX
• RDF
• Cytoscape
• DAS
• Comprehensive
• Simple
• Generic
• Domain specific
• Structured
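To make the "same data, different formats" point concrete, here is a minimal sketch that reads one tab-separated interaction row and re-emits it as JSON. It assumes the 15-column PSI-MITAB 2.5 layout; the short column names, the function names and the example row (all placeholder values) are inventions of this sketch, not part of the slides or of any official parser.

```python
import json

# Labels for the 15 tab-separated columns of a PSI-MITAB 2.5 row.
# The short names are this sketch's own, not official field names.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication", "taxid_a",
    "taxid_b", "interaction_type", "source_db", "interaction_id",
    "confidence",
]


def parse_mitab_line(line):
    """Turn one MITAB row into a dict of lists ('-' means no value)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError("expected at least 15 tab-separated columns")
    record = {}
    for name, value in zip(MITAB25_COLUMNS, fields):
        # Simplification: a '|' inside a quoted value would be split too.
        record[name] = [] if value == "-" else value.split("|")
    return record


def to_json(record):
    """Serialise the same interaction in a second, structured format."""
    return json.dumps(record, indent=2)


# Placeholder row: identifiers and publication are invented for illustration.
EXAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Example et al. (2014)",
    "pubmed:00000000", "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])

if __name__ == "__main__":
    print(to_json(parse_mitab_line(EXAMPLE_LINE)))
```

The flat row is simple to produce and scan, while the structured output is easier to extend; that is the trade-off between "simple" and "comprehensive" formats listed on the slide.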
23. Communities organized per domain
• Produce technical standards intended to address the needs of
a community of users.
develop, coordinate, promulgate, revise, amend, reissue, interpret
24. ELIXIR role
• Support communities developing standards
• Encourage communication among communities
• Links amongst standards
• Promote the adoption of standards
• Help to find the gaps among standards
• Recommend standards best practices in data sharing
25. Data deluge & standards
BMB workshops update
27. Identifiers Best Practice - purpose
• Recommendations for identifiers best practice
• Designing (format, re-use)
• Managing (creation, versioning, provenance, deprecation etc.)
• Using (resolving, mapping etc.)
• Publish a paper
• Introduction to identifier concepts
• Case Studies illustrating identifier usage in real-world scenarios
• Recommendations on best practice
• Show not tell
• Descriptive not normative
• for non-experts/newcomers
• Gap analysis
• list the biological entities and identifier types used by BMB partners
28. Identifiers Best Practice – topics 1/2
• Identifier formats
• syntax of database IDs, URI patterns
• Identifier management
• creation, versioning, provenance, deprecation
• Identifier resolution
• how to use an ID to get useful information about the entity
• services for this, e.g. Identifiers.org
• what info. should be given?
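The format and resolution topics above can be sketched in a few lines: check an accession against a syntax pattern, then build a resolvable URL. This assumes the compact `prefix:accession` URL form of Identifiers.org and uses the UniProt accession pattern as the single example; the `ID_PATTERNS` table and both function names are hypothetical and only illustrate the idea.

```python
import re

# Syntax pattern per identifier namespace. Only one example entry is given;
# the table itself is a hypothetical structure for this sketch.
ID_PATTERNS = {
    "uniprot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
}


def is_valid(prefix, accession):
    """Return True if the accession matches the known syntax for the prefix."""
    pattern = ID_PATTERNS.get(prefix)
    return bool(pattern and pattern.fullmatch(accession))


def resolver_url(prefix, accession):
    """Build a resolver URL (assumed compact form) for a valid identifier."""
    if not is_valid(prefix, accession):
        raise ValueError(f"unrecognised identifier: {prefix}:{accession}")
    return f"https://identifiers.org/{prefix}:{accession}"


if __name__ == "__main__":
    print(resolver_url("uniprot", "P12345"))
```

Validating syntax before resolving keeps malformed identifiers out of links, but it says nothing about whether the entry still exists; deprecation and versioning need a lookup against the resource itself.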
29. Identifiers Best Practice – topics 2/2
• Identifier mapping / aggregation
• how to map IDs on entries in one resource to those in another, to
assign equivalence / make useful links
• e.g. IDs for equivalent protein sequences in different organisms
• e.g. probe->gene->pathway->function (GO)
• Identifier cataloguing
• compilations of types of identifiers (e.g. EDAM ontology) or specific
identifiers (e.g. Cell Line Ontology)
• what info. should be given?
• Case Studies
• use of IDs in a particular domain
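The probe->gene->pathway->function example can be read as a chain of lookups across resources. The sketch below walks such a chain with plain dictionaries; every identifier and mapping in it is an invented placeholder, and a real implementation would draw the mappings from the resources' own cross-references.

```python
# All identifiers below are invented placeholders, not real database entries.
PROBE_TO_GENE = {"probe_001": "GENE_A", "probe_002": "GENE_B"}
GENE_TO_PATHWAYS = {
    "GENE_A": ["PATHWAY_1"],
    "GENE_B": ["PATHWAY_1", "PATHWAY_2"],
}
PATHWAY_TO_FUNCTIONS = {
    "PATHWAY_1": ["FUNCTION_X"],
    "PATHWAY_2": ["FUNCTION_Y", "FUNCTION_X"],
}


def functions_for_probe(probe_id):
    """Follow probe -> gene -> pathway -> function, dropping duplicates."""
    gene = PROBE_TO_GENE.get(probe_id)
    if gene is None:
        return []  # unmapped probe: the chain breaks at the first step
    functions = []
    for pathway in GENE_TO_PATHWAYS.get(gene, []):
        for function in PATHWAY_TO_FUNCTIONS.get(pathway, []):
            if function not in functions:
                functions.append(function)
    return functions


if __name__ == "__main__":
    for probe in ("probe_001", "probe_002", "probe_999"):
        print(probe, functions_for_probe(probe))
```

Each hop can lose or multiply entries, which is why mapping quality at every step matters for the final result.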
30. Standards format registry - purpose
• Discovery, Agreement, Benchmarking, …
• Facilitate syntactic interoperability across research infrastructures
so samples and data can be integrated and analysed across
ESFRI BMS domains.
31. Standards format registry - topics
• Catalogue of standards
• Interoperability among registries
• identifiers.org, BioSharing, EDAM, ELIXIR service registry
• Adapt to user needs
• Community of users vs. community of producers
• Microstandards
• Standards mapping
• Access to expert knowledge
• Assess fit for purpose
• Rating/metrics
32. BioMedBridges Knowledge Exchange Workshop
Tuesday 24 - Wednesday 25 June 2014. VUMC, Amsterdam, The Netherlands
Workshop organised by WP3 - Standards Description and
Harmonisation, to bring together BMB partners, biomedical
standards experts and representatives of external projects.
Best practice for identifiers
• Gap analysis of current identifiers
Development of the BMB standards registry
• Gap analysis for usage of the registry
• Integration of the registry with other tools
38. BioMedBridges workshops
Knowledge Exchange Workshop: WP3 Standards
24-25 June 2014
VUMC, Amsterdam, The Netherlands
E-Infrastructure support for the life sciences:
Preparing for the data deluge
15 May 2014
Genome Campus, Hinxton, UK
40. Knowledge exchange workshop
• Discussion of big data challenges in life sciences
• Focus on a few representative domains
• Looking 5 years ahead
• Jointly identify potential solutions to our problems
[Diagram labels: Data; ICT e-infrastructures; LS life sciences; Physical facilities; Scientific information; Transfer; Computation; Storage]
41. How does it affect data sharing
in life sciences?
42. Large-scale data sharing in the life sciences
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
43. How does big data affect data sharing?
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
[Diagram labels: What, How, Where; Compute, Storage, Transfer]
47. Data generation vs. data transfer
Instrument          Data in 24 hours   Network file transfer time
                                       1 Gb/s     100 Mb/s   10 Mb/s
DNA sequencing      ~100 GB            ~30 min    ~5 hours   ~2 days
Mass spectrometry   ~4 TB              ~9 hours   ~4 days    ~5 weeks
Microscopy          ~4 TB              ~9 hours   ~4 days    ~5 weeks
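The arithmetic behind the transfer times on this slide is size in bits divided by link speed. A minimal sketch, assuming decimal units (1 GB = 10^9 bytes), link speeds in bits per second and a fully used link with no protocol overhead; the function names are this sketch's own. The results are therefore lower bounds, and some of the slide's figures are higher, which may reflect real-world overhead (that reading is an assumption).

```python
GB = 10**9   # bytes, decimal units assumed
TB = 10**12  # bytes, decimal units assumed


def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: bits to send divided by link speed."""
    return size_bytes * 8 / link_bits_per_second


def transfer_hours(size_bytes, link_bits_per_second):
    """Same estimate expressed in hours."""
    return transfer_seconds(size_bytes, link_bits_per_second) / 3600


if __name__ == "__main__":
    links = {"1 Gb/s": 10**9, "100 Mb/s": 10**8, "10 Mb/s": 10**7}
    for label, size in (("100 GB", 100 * GB), ("4 TB", 4 * TB)):
        for name, speed in links.items():
            hours = transfer_hours(size, speed)
            print(f"{label} over {name}: {hours:.1f} h")
```

For 4 TB the ideal times come out close to the slide's ~9 hours, ~4 days and ~5 weeks, so a day's output from such an instrument cannot be moved within a day on anything slower than a 1 Gb/s link.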
48. Bottlenecks in Life Sciences?
• Data production grows faster than storage
• Cost of data production technologies declines faster than
the cost of storage
• It takes longer to transfer data than to produce it.
49. Data growth
how to reduce the IT budget shortfall?
http://www.eweek.com/
Optimization
Using technology more effectively
Selecting relevant data
51. Potential solutions
• Storage
• Data compression
• Select what we store
• Evaluate data reproducibility & value of data
• Network
• Faster protocols
• Partitioning
• Network upgrade
• Computation
• Clouds
• Data close to computation
52. Data compression
• Efficient representation
• Capacity for controlled data
reduction
• Efficient transformations
• Tool chain
[Figure labels: Precision, Compression, CRAM]
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. GigaScience 2012, 1:2
http://www.ebi.ac.uk/ena/about/cram_toolkit
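The idea of reference-based compression named in the Fritz et al. title can be shown with a toy example: instead of storing every base of a read, store where it aligns and only the bases that differ from the reference. This is an illustrative sketch only, not the CRAM format or its toolkit; the function names and the record layout are assumptions, and it ignores insertions, deletions and quality scores.

```python
def encode_read(reference, position, read):
    """Store a read as start position, length and mismatches vs the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return {"pos": position, "len": len(read), "diffs": diffs}


def decode_read(reference, encoded):
    """Rebuild the original read from the reference plus stored mismatches."""
    start = encoded["pos"]
    bases = list(reference[start:start + encoded["len"]])
    for offset, base in encoded["diffs"]:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    read = "GTACCT"  # differs from the reference at one position
    encoded = encode_read(reference, 2, read)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

A read that matches the reference exactly costs only a position and a length, which is where most of the saving comes from; controlled data reduction would go further by deciding which extra information to keep.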
54. Data partitioning
• Organisation of data around biological concepts
• Indexing system around these concepts
• Support for requests for partitions along this index
Reference-oriented indexing
Guy Cochrane, EMBL-EBI
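One way to picture partitioning along a reference-oriented index: bin records by reference sequence and coordinate, then answer a region request by touching only the matching bins. The sketch below is a generic binning scheme under stated assumptions (fixed bin size, inclusive coordinates, invented record names); it is not a description of any EMBL-EBI service.

```python
from collections import defaultdict

BIN_SIZE = 1000  # arbitrary bin width chosen for this sketch


def build_index(records):
    """Map (reference, bin number) to the ids of records touching that bin."""
    index = defaultdict(list)
    for record_id, (ref, start, end) in records.items():
        for bin_no in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            index[(ref, bin_no)].append(record_id)
    return index


def fetch_partition(index, records, ref, start, end):
    """Return ids of records overlapping ref:start-end (inclusive)."""
    hits = set()
    for bin_no in range(start // BIN_SIZE, end // BIN_SIZE + 1):
        for record_id in index.get((ref, bin_no), []):
            _, rec_start, rec_end = records[record_id]
            if rec_start <= end and rec_end >= start:
                hits.add(record_id)
    return sorted(hits)


if __name__ == "__main__":
    # Invented records: id -> (reference name, start, end)
    records = {
        "r1": ("chr1", 100, 250),
        "r2": ("chr1", 900, 1100),
        "r3": ("chr1", 5000, 5100),
        "r4": ("chr2", 100, 200),
    }
    index = build_index(records)
    print(fetch_partition(index, records, "chr1", 0, 1000))
```

A request for one region then transfers only that partition rather than the whole data set, which ties partitioning back to the network bottleneck.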
57. Life sciences diversity
• Different communities
• Some similar requirements
• Not always the same solutions
Proteomics, Metabolomics, Clinical data, Genomics, Imaging
58. Some conclusions
• Opportunity for e-infrastructures to better understand BMS RI problems.
• Identification of bottlenecks
• Discussion of some potential solutions
• Data growth will change how we do things today
• Different communities -> different models -> some common solutions
• Solutions have to come from use cases
• BMS RIs need to define their requirements better
• We need to use technology more efficiently
• BMS community has to evaluate the practicality of storing everything
• Privacy issues make big data more challenging
• Difficult to separate big data from computation
• Shortage of expertise in how to deal with scientific data and IT services
59. Thank you
Editor's Notes
The previous example leads into the BioMedBridges project, which builds bridges between the infrastructures and is starting to develop data and service bridges to support research projects that will access and benefit from services involving several of them.
As a biologist I would prefer to see all the information in a single database.
Centralized databases have this mission.
They aim to collect all the information for one specific domain.
However …
Medium-size labs and organizations are capable of producing large amounts of data.
Then it becomes harder to submit data to centralized repositories.
Moreover, data producers like to control and structure their own databases, developing their own GUIs and access protocols.
For us, the users, it becomes harder to access the information.
For one specific domain we might find different databases using different GUIs. We might end up downloading data in different formats, complicating the integration of results. After integration we might find a problem of high redundancy in our results.
Data resource: Sustainability, availability and integration
'Compute power' doubles every two years. Production of data doubles faster.
Sequencing prices have dropped below the Moore’s law curve
Moore’s law predicts an exponential decline in computing cost
Doubling of 'compute power' every two years
Storing data is more expensive than producing it
Technology gets cheaper and faster
~15,000 hospitals
~4,000 universities
~2,000 life sciences research institutes
How much data will we produce? How will we store it?
decline of computing cost
necessary to understand, develop or reproduce published research
Not all the data make it to the public repositories