1. E-Infrastructure support for the life sciences:
Preparing for the data deluge
Rafael Jimenez
ELIXIR CTO
16 May, 2014
BioMedBridges
Summary Day 1
2. How does it affect data sharing in life sciences?
3. Problems of big data
http://www.mrc.ac.uk/Utilities/Documentrecord/index.htm?d=MRC002552
[Figure: diagram with the labels Compute, Storage and Transfer, and the questions What, How, Where]
4. Knowledge exchange workshop
Discussion of big data challenges in life sciences
Focus on few representative domains
Looking 5 years ahead
Jointly identify potential solutions to our problems
Data
ICT (e-infrastructures): Transfer, Computation, Storage
LS (life sciences): Physical facilities, Scientific information
9. Group discussions
Group discussion, session 1 (5 groups)
How much data and what type of data?
Who are the stakeholders?
What factors can influence data availability?
Group discussion, session 2 (5 groups)
What are the potential bottlenecks per stakeholder?
What are the potential solutions to these bottlenecks?
11. Stakeholders
Researchers, Patients, Industry, Local users,
Clinical, Academic, Pressure groups, IT in hospitals,
Pharma, Agri, Structural biologists, LS, RI Nodes,
Institutions, Algorithm developers, Genomics
researchers, Personalized medicine, Funders,
Taxpayers, EuroBioimaging, Facilities, PDB, Institutes,
Commercial data providers, Data repositories
Types (Producer, Data resource, Consumer)
Production distribution +(Genomics, Clinical, Metabolomics,
Proteomics MS, Proteomics ST, Imaging)-.
Privacy
12. How much data and types
Raw data, Processed data, Metadata
+(Genomics, Clinical, Proteomics ST,
Proteomics MS, Metabolomics, Imaging)-
13. Factors that can influence data availability
scientific (e.g. data reproducibility, uniqueness, value of
processed and/or raw data)
financial (cost of data storage, transfer, reproduction)
technical (storage, network, computation…)
political (drivers e.g. from funding bodies/large
organisations/national interests)
social (data sharing mentality of the community in
question)
legal/ethical/formal (requirements/constraints for data
storage/transfer/access - e.g. need to store data on
German citizens in Germany; requirements from journal
publishers, data management plans, etc.)
14. Bottlenecks
Storage
Data grows faster than data storage (G,P,M,I,C?)
Security restricts how to store/share some of the data (G,C)
Keeping data close
Raw data is not always stored (Pst, I, Pms)
Missing repositories (I)
Repositories storing a small part of the data (Pms, M)
Transfer
Data submissions (Pms, G)
Transferring to repositories is slower than producing the data
Downloading (G,P,I)
Just copying data to a hard disk (Pst)
Takes the same time as producing the data
Computation
Preprocessing is slower than producing the data (Pst)
Producer, Data resource, Consumer
Key: G = Genomics, C = Clinical, P = Proteomics (Pst = ST, Pms = MS),
M = Metabolomics, I = Imaging
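The storage and transfer bottlenecks above come down to simple rate comparisons. The following Python sketch is one way to check them for a given facility. It is an illustration, not something taken from the workshop: the function names, the growth rates and the link speeds are all hypothetical, and the 5-year horizon only mirrors the "looking 5 years ahead" framing of the discussion.

```python
def transfer_hours(data_tb, link_gbit_s, efficiency=1.0):
    """Hours needed to move data_tb terabytes (10**12 bytes each) over a link.

    link_gbit_s is the nominal link speed in gigabits per second and
    efficiency is the assumed usable fraction of that speed (hypothetical).
    """
    if link_gbit_s <= 0 or efficiency <= 0:
        raise ValueError("link speed and efficiency must be positive")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 3600


def keeps_up(daily_output_tb, link_gbit_s, efficiency=1.0):
    """True if one day of produced data can be transferred within 24 hours."""
    return transfer_hours(daily_output_tb, link_gbit_s, efficiency) <= 24


def years_until_full(data_tb, capacity_tb, data_growth, capacity_growth, horizon=5):
    """First year (0 = now) in which the data no longer fits in storage.

    Growth rates are yearly fractions (1.0 means doubling every year).
    Returns None if the data still fits at the end of the horizon.
    """
    for year in range(horizon + 1):
        if data_tb > capacity_tb:
            return year
        data_tb *= 1 + data_growth
        capacity_tb *= 1 + capacity_growth
    return None


if __name__ == "__main__":
    # Hypothetical facility: 20 TB per day on a 1 Gbit/s link.
    print(round(transfer_hours(20, 1), 1), "hours to transfer one day of data")
    print("transfer keeps up:", keeps_up(20, 1))
    # Hypothetical growth: data doubles yearly, storage grows 20% yearly.
    print("storage exceeded in year:", years_until_full(100, 200, 1.0, 0.2))
```

With these made-up numbers the transfer of one day of data takes longer than a day, which is the "transferring slower than producing" situation listed under Transfer.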
15. Potential solutions
Storage
Solve problems with technology (e.g. compression)
Evaluate data reproducibility
Network
Faster protocols
Partitioning
Network upgrade
Computation
Clouds
General
Buy services instead of investing in infrastructure
Producer, Data resource, Consumer
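Two of the solutions listed above, compression and partitioning, can be combined in a few lines. The sketch below is a minimal illustration using only the Python standard library (zlib and hashlib). The helper names, the chunk size and the sample payload are hypothetical, and no real transfer protocol is implemented: the chunks are simply packed, shuffled and put back together in memory.

```python
import hashlib
import zlib
from typing import Dict, List


def split_into_chunks(payload: bytes, chunk_size: int) -> List[bytes]:
    """Partition a payload into fixed-size chunks (the last one may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def pack_chunk(index: int, chunk: bytes) -> Dict[str, object]:
    """Compress one chunk and record its position and checksum."""
    return {
        "index": index,
        "sha256": hashlib.sha256(chunk).hexdigest(),
        "body": zlib.compress(chunk, 6),
    }


def reassemble(packed: List[Dict[str, object]]) -> bytes:
    """Decompress chunks in index order and verify each checksum."""
    parts = []
    for item in sorted(packed, key=lambda p: p["index"]):
        chunk = zlib.decompress(item["body"])
        if hashlib.sha256(chunk).hexdigest() != item["sha256"]:
            raise ValueError("checksum mismatch in chunk %d" % item["index"])
        parts.append(chunk)
    return b"".join(parts)


if __name__ == "__main__":
    # Hypothetical, highly repetitive payload standing in for sequence data.
    payload = b"ACGT" * 10000
    packed = [pack_chunk(i, c) for i, c in enumerate(split_into_chunks(payload, 4096))]
    sent = sum(len(p["body"]) for p in packed)
    print(len(payload), "bytes reduced to", sent, "bytes in", len(packed), "chunks")
    # Chunks may arrive in any order, e.g. over parallel connections.
    assert reassemble(list(reversed(packed))) == payload
```

Because each chunk carries its own index and checksum, chunks can be sent independently and a damaged chunk can be resent on its own. How much compression helps depends entirely on the data: already compressed formats will shrink far less than this repetitive example.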