Analyzing Big Data in Medicine with
Virtual Research Environments and
Microservices
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences
Science for Life Laboratory
Uppsala University
Today: We have access to high-throughput
technologies to study biological phenomena
New challenges: Data management and
analysis
• Storage
• Analysis methods, pipelines
• Scaling
• Automation
• Data integration, security
• Predictions
• …
European Open Science Cloud (EOSC)
• The vast majority of all data in the world (in fact up to 90%) has been
generated in the last two years.
• Scientific data is in urgent need of openness, better handling, careful
management, machine actionability and broader re-use.
• European Open Science Cloud: A vision of a future infrastructure to
support Open Research Data and Open Science in Europe
– It should enable trusted access to services, systems and the re-use
of shared scientific data across disciplinary, social and geographical
borders
– Research data should be findable, accessible, interoperable and re-usable (FAIR)
– provide the means to analyze datasets of huge sizes
http://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
Contemporary Big Data analysis in
bioinformatics
• High-Performance Computing with shared storage
– Linux, Terminal, batch queue
• Problems/challenges
– Access to resources is limited
– Dependency management for tools is cumbersome; users need help from
system administrators to install software
– Privacy-related issues
– Difficult to share/integrate data
– Accessibility issues
• A common approach: Internet-based services
– Retrieve data
– Analysis tools
Workflows
Service-Oriented Architectures (SOA) in
the life sciences
• Standardize
– Agree on e.g. interfaces, data formats,
protocols etc.
• Decompose and compartmentalize
– Experts (scientists) should provide
services – do one thing and do it well
– Achieve interoperability by exposing
data and tools as Web services
• Integrate
– Users should access and integrate
remote services
[Diagram: scientists provide services through APIs; other scientists consume them]
Service-Oriented Architectures (SOA) in
the life sciences, ~2005
[Diagram: scientists depending on third-party Web services face downtime, changed APIs and unmaintained services; difficult to sustain, unreliable solutions]
Cloud Computing
• Cloud computing offers advantages over
contemporary e-infrastructures in the life sciences
– On-demand elastic resources and services
– No up-front costs, pay-per-use
• Many businesses (and much software development) are
moving into the cloud
– Vibrant ecosystem of frameworks and tools, including for
big data
• High potential for science
Virtual Machines and Containers
Virtual machines
• Package entire systems (heavy)
• Completely isolated
• Suitable in cloud environments
Containers:
• Share OS
• Smaller, faster, portable
• Docker!
Microservices
• Similar to Web services: Decompose functionality into smaller, loosely
coupled services communicating via API
– “Do one thing and do it well”
• Preferably small, lightweight and fast to instantiate on demand
• Easy to replace, language-agnostic
– Suitable for loosely coupled teams (which we have in science)
– Portable - easy to deploy and scale
– Maximize agility for developers
• Suitable to deploy as containers in cloud environments
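The idea above — a small, single-purpose service exposed through an API — can be sketched with Python's standard library alone. This is a hypothetical toy service (a "sequence length" endpoint, names invented for illustration); a real deployment would use a proper framework and run inside a container:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class LengthService(BaseHTTPRequestHandler):
    """A toy microservice: one endpoint, one job (report sequence length)."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/length":
            self.send_error(404)
            return
        seq = parse_qs(url.query).get("seq", [""])[0]
        body = json.dumps({"sequence": seq, "length": len(seq)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port=0):
    """Create the service; port=0 picks a free ephemeral port."""
    return HTTPServer(("127.0.0.1", port), LengthService)

if __name__ == "__main__":
    # Demo: answer one request from a background thread, then shut down.
    import threading, urllib.request
    srv = serve()
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    with urllib.request.urlopen(
            f"http://127.0.0.1:{srv.server_port}/length?seq=ACGT") as r:
        print(r.read().decode())  # {"sequence": "ACGT", "length": 4}
    srv.shutdown()
```

Because the service is stateless and speaks plain HTTP/JSON, it is language-agnostic from the caller's side and trivial to replace or scale, which is the property the bullet points above rely on.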
Scaling microservices
http://martinfowler.com/articles/microservices.html
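Scaling a stateless microservice usually means running identical replicas behind a load balancer rather than growing one big process. A minimal sketch of round-robin dispatch (all names here are illustrative, not from any particular framework):

```python
import itertools

class RoundRobinBalancer:
    """Illustration of horizontal scaling: identical replicas of a
    stateless service behind a round-robin dispatcher."""

    def __init__(self, replicas):
        # Cycle endlessly over the available replicas.
        self._cycle = itertools.cycle(replicas)

    def dispatch(self, request):
        replica = next(self._cycle)
        return replica(request)

if __name__ == "__main__":
    # Three hypothetical replicas of the same service, as callables.
    replicas = [lambda r, i=i: f"replica-{i} handled {r}" for i in range(3)]
    lb = RoundRobinBalancer(replicas)
    for req in ["a", "b", "c", "d"]:
        print(lb.dispatch(req))
```

Because each replica is interchangeable, adding capacity is just adding another entry to the pool — the property that makes containerized microservices cheap to scale out.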
Shipping
containers?
Orchestrating containers
Kubernetes: Orchestrating containers
• Origin: Google
• A declarative language for
launching containers
• Start, stop, update, and manage
a cluster of machines running
containers in a consistent and
maintainable way
• Suitable for microservices
[Diagram: containers scheduled and packed onto cluster nodes]
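The "declarative" point is the key one: you state the desired state, and a control loop repeatedly compares it with the actual state and acts to converge them. This is a conceptual Python sketch of one reconciliation pass, not Kubernetes code (service names are invented):

```python
def reconcile(desired, actual):
    """One pass of a Kubernetes-style control loop: compare desired
    replica counts with what is actually running and return the
    start/stop actions needed to converge."""
    actions = []
    for service, want in desired.items():
        have = actual.get(service, 0)
        if have < want:
            actions.append(("start", service, want - have))
        elif have > want:
            actions.append(("stop", service, have - want))
    # Anything running that is no longer declared should be removed.
    for service, have in actual.items():
        if service not in desired:
            actions.append(("stop", service, have))
    return actions

if __name__ == "__main__":
    desired = {"metabolomics-api": 3, "worker": 2}
    actual = {"metabolomics-api": 1, "old-tool": 1}
    print(reconcile(desired, actual))
```

Running the loop continuously is what makes the cluster self-healing: if a node dies and replicas disappear, the next pass simply emits the start actions needed to restore the declared state.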
Virtual Research Environment (VRE)
• Virtual (online) environments for research
– Easy and user-friendly access to computational resources, tools and
data, commonly for a scientific domain
• Multi-tenant VRE – log into shared system
• Private VRE
– Deploy on your favorite cloud provider
PhenoMeNal
• Horizon 2020 project, €8 M, 2015-2018
– “standardized e-infrastructure for the processing, analysis and information-mining of the massive amount of medical molecular phenotyping and genotyping data generated by metabolomics applications.”
• Enable users to provision their own virtual infrastructure (VRE)
– Public cloud, private cloud, local servers
– Easy access to compatible tools exposed as microservices
– Sets up and configures a complete data center in minutes (compute
nodes, storage, networks, DNS, firewall, etc.)
– Can achieve high-availability, scalability and fault tolerance
• Use modern and established tools and frameworks supported by industry
– Reduce risk and improve sustainability
• Offer an agile and scalable environment to use, and a straightforward
platform to extend
http://phenomenal-h2020.eu/
Users should not see this…
Deployment and user access
Launch on reference installation
Launch on public cloud
Private VRE
In-house deployment scenarios
MRC-NIHR Phenome Centre
• Medium-sized IT infrastructure
• Dedicated IT personnel
• Users: ICL staff
Hospital environment
• Dedicated server
• No IT personnel
• User: Clinical researcher
Private VRE
Development: Container lifecycle
[Diagram: source code repositories → PhenoMeNal Jenkins builds and tests tools, images and infrastructure → published to Docker Hub and the PhenoMeNal Container Hub]
Two proofs of concept so far
• Kultima group
• Pablo Moreno
Implications
• Improve sustainability
– Not dependent on specific data centers
• Improve reliability and security
– Users can run their own service environments (VREs) within isolated
environments
– High-availability and fault tolerance
• Scalability
– Deploy in elastic environments
• Agile development
– Automate “from develop to deploy”
• Agile science
– Simple access to discoverable, scalable tools on elastic compute
resources with no up-front costs
• NB: Many interoperability problems remain!
– Data
– APIs
– etc.
Ongoing research on VREs
• Data federation
• Compute federation
• Privacy preservation
• Workflows
• Big Data frameworks
• Data management and modeling
Acknowledgements
Wesley Schaal
Jonathan Alvarsson
Staffan Arvidsson
Arvid Berg
Samuel Lampa
Marco Capuccini
Martin Dahlö
Valentin Georgiev
Anders Larsson
Polina Georgiev
Maris Lapins
AstraZeneca
Lars Carlsson
Ernst Ahlberg
University Vienna
David Kreil
Maciej Kańduła
SNIC Science Cloud
Andreas Hellander
Salman Toor
Caramba.clinic
Kim Kultima
Stephanie Herman
Payam Emami
ToxHQ team
Barry Hardy
Thomas Exner
Joh Dokler
Daniel Bachler


Editor's Notes

  • #9 The idea with SOA (~2005): achieve interoperability by exposing data and functionality as Web services; experts (scientists) set up and host their own Web services; users integrate a multitude of distributed services, connect them into workflows (e.g. Taverna), and share (parts of) workflows. What happened? Users could not rely on Web services (downtime, API changes, abandonment), and services could not be mirrored. Workflows never gained widespread popularity. Today, stable web services mainly remain at large data and tool providers (EBI, NCBI, etc.).
  • #11 Drop applications into VMs running Docker in different clouds.