Climb bath

The Cloud Infrastructure for
Microbial Bioinformatics;
overcoming barriers to software
use and data sharing in biology
Bath HPC Seminar, 9th June 2016
Dr Thomas R Connor
Senior Lecturer
Cardiff University School of Biosciences
@tomrconnor ; connortr@cardiff.ac.uk
http://www.climb.ac.uk

Overview
• Microbiology
introduction
• Why Biology is now a
data intensive
science
• Biological problems
• CLIMB

Microbiology
• Microbiology is the study
of microorganisms
• Make up the majority of
life on earth
• Responsible for
everything from the
oxygen we breathe to
being able to digest our
food
• Microbes are everywhere
• Plus community is well
defined and has been at
the cutting edge of
applying genomics
approaches

More importantly, microbial
pathogens cause disease
• 200 Million people have
GI disease at any point
in time
• In a day, they will
produce ~60,000,000
litres of diarrhoea
• That is equivalent to all
the water passing over
Victoria Falls in one
minute

Why it matters
• 2 Billion cases of disease
every year worldwide
• ~5% of all deaths in low and
middle income countries are
due to diarrhoeal diseases
• Mostly kills children
• Not only limited to
low/middle income
countries
• GI pathogens cause tens of
thousands of cases of severe
disease in the UK, with many
deaths

Why Biology is now a data
intensive science: Genomics
• Bacteria are small; but
Biology is now a data
intensive science: Why?
• Genomics is a term
given to a group of
technologies
• These technologies
allow us to explore the
genome sequence of an
organism

What does that mean?
DNA encodes the blueprint for virtually
every cell of every organism on the planet
That blueprint defines the features of the
cell in which it is found
Genomics enables us to
read this blueprint

This is possible because of high
throughput sequencing
Sequencing platforms (reads per run where given):
• ABI “Sanger” sequencers: 96 reads/run
• Roche 454: 1,000,000 reads/run
• Solexa / Illumina: 500,000,000 reads/run (>1TB data)
• Ion Torrent
• Pac Bio
• Nanopore
Timeline:
• 1995: First bacterial genome sequence
• 1998: MLST
• 2000: Draft of the human genome
• 2003: Human genome finished
• 2008: Evolution of Typhi (19 genomes)
• 2010: Evolution of MRSA ST239 (63 genomes)
• 2011: Evolution of PMEN1 (240 genomes)
• 2012: 1,000 human genomes published
Example read:
@IL0_0000:1:1:1:1#0/1
gattatccttcgctcaatctggggcaggcggtgatggtctattgctatcaattagcaacattaatacaacaaccggcgaaaagtgatgcaacggcagacc
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
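The example read on this slide is a FASTQ record: a header line starting with @, the base calls, a + separator, and one quality character per base. A minimal Python sketch of reading such records might look like the following. The names parse_fastq and gc_fraction and the shortened toy read are illustrative assumptions, and the quality characters are left undecoded because the slide does not state which quality encoding was used.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (case-insensitive)."""
    if not sequence:
        raise ValueError("empty sequence")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    # Toy record for illustration only, not the read from the slide.
    toy = "@read1\nGATTATCC\n+\nBBBBBBBB\n"
    for record in parse_fastq(toy):
        print(record.name, len(record.sequence), gc_fraction(record.sequence))
```

At 500,000,000 reads per run, real files are far too large to hold in memory as a single string, so a production parser would stream line by line; the sketch only shows the record layout.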

Which really matters because: The ever decreasing
cost of sequencing a human genome*
(Figure: cost of sequencing a human genome, 2003 to 2015)
* Humans are boring. For the
same money we can sequence
~50 bacterial genomes

Global sequencing data is growing
fast

Challenges of Biological Big Data
There are many biological analysis platforms available now that make
producing large, rich, complex datasets relatively cheap and easy
However, the major costs and difficulties do not lie with the
generation of data; they lie with how we share, store and analyse the
data we generate
• Informatics expertise
• User accessibility of software/hardware
• Appropriate compute capacity
• Software development
• Storage availability
• Network capacity

Illustrating the size of the challenge
(Figure: Wave 1, Wave 2 and Wave 3, with groups labelled by date ranges between 1937 and 2009)
Mutreja, Kim, Thomson, Connor et al, Nature, 2011

Input: 320 samples, approx. 600-700GB of uncompressed data
Sequence assembly
• Each job 4-8GB RAM, 1 CPU core
• Each job generates intermediate files of ~6GB
• Runtime: 1+ hours/job
Sequence mapping
• 320 jobs
• Each job 4GB RAM, 1 CPU core
• Each job generates intermediate files of ~3GB
• Runtime: 1 hour/job
Phylogenomics
• 1 job, 1+ cores, up to 128GB RAM
• Intermediate file size ~2+GB
• Output file ~2GB
• Runtime: 1-2 days
Virulence and antimicrobial resistance screening
• 320 jobs, single core, 100MB RAM
• Runtime: 5 mins/job
• Generates 10-20 small files per job
Bayesian modelling
• 3 jobs, 1 core+, up to 1GB RAM
• CPU intensive
• Runtime: 2 days per job
• Output file ~10GB
• Can use GPUs
• Written in Java
Workload labels on the slide: Larger RAM, HTC, HTC, HPC, Possibly HPC
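To make the scale concrete, the per-stage figures on this slide can be added up. The sketch below is only a rough estimate under stated assumptions: lower bounds are used wherever the slide gives a range or a "+", assembly is assumed to run one job per sample (the slide gives no job count for that stage), the Bayesian output is assumed to be ~10GB per job, and the small files from screening are counted as zero because no size is given. The Stage class and its field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    jobs: int
    hours_per_job: float  # lower bound where the slide gives a range
    cores_per_job: int    # lower bound where the slide says "1+"
    files_gb_per_job: float  # intermediate plus output files, where stated


STAGES = [
    # Assumption: one assembly job per sample (320 samples).
    Stage("Sequence assembly", 320, 1.0, 1, 6.0),
    Stage("Sequence mapping", 320, 1.0, 1, 3.0),
    # ~2GB intermediate plus ~2GB output; 1 day taken from "1-2 days".
    Stage("Phylogenomics", 1, 24.0, 1, 4.0),
    # File sizes for screening are not given on the slide, so 0 is assumed.
    Stage("Virulence/AMR screening", 320, 5 / 60, 1, 0.0),
    # Assumption: the ~10GB output file is per job.
    Stage("Bayesian modelling", 3, 48.0, 1, 10.0),
]


def core_hours(stage: Stage) -> float:
    """Total core-hours for a stage if every job runs to completion once."""
    return stage.jobs * stage.hours_per_job * stage.cores_per_job


def files_gb(stage: Stage) -> float:
    """Total intermediate and output file volume for a stage, in GB."""
    return stage.jobs * stage.files_gb_per_job


total_core_hours = sum(core_hours(s) for s in STAGES)
total_files_gb = sum(files_gb(s) for s in STAGES)

if __name__ == "__main__":
    for s in STAGES:
        print(f"{s.name}: {core_hours(s):.1f} core-hours, {files_gb(s):.0f} GB")
    print(f"Total: {total_core_hours:.1f} core-hours, {total_files_gb:.0f} GB")
```

Under these assumptions, most of the core-hours and file volume come from the many small, independent assembly and mapping jobs.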

This is beginning to matter in a lot
of new places
• Drug Development
• Target Identification
• Diagnostics
• Treatment Selection
• Public Health

The rise and rise of biological shadow
IT
• Everything we do is underpinned by
having access to compute and
storage capacity
• Joined Cardiff from the Sanger in
2012
• When I arrived at Cardiff, I had the
joy of working out how I was going
to do my research in a new place
• How to get scripts/software
installed
• Where to install scripts/software
• And I had to do this mostly on my
own
• So I built my own system
• Not an unusual story

Understanding how
bioinformaticians work
Chart 1: Where do bioinformaticians do most of their work (axis 0 to 120)
• Cloud
• Institution-wide resource
• Local resource
• Personal computer
Chart 2 (axis 0% to 90%)
• Best for job
• Good documentation
• Word of mouth recommendation
• Used in similar analysis
• Quickest
• Already installed on server
• Other
• Graphical interface
Results from:
Loman, Nicholas; Connor, Thomas (2015):
Bioinformatics infrastructure and training
survey. figshare.
http://dx.doi.org/10.6084/m9.figshare.1572287

Compounded by fact that
biologists work in silos
• As a discipline we think in
terms of ‘labs’, ‘groups’ and
‘experiments’ being wet
work
• IT infrastructure is treated
the same way
• This means we develop
bespoke, local solutions to
informatics problems
• Because our software/data
is locally stored/setup – it is
often less portable than
wet lab methods /
approaches

Why am I here?
• We need a revolution in Biology
• We need to change how we do
bioinformatics
• We need to change how we share our
data
• We have to accept that biologists won’t (in
the short or medium term) become skilled
in IT
• But – we can’t change the field, or how
PIs think
• Cloud-based approaches provide a
mechanism that could accommodate this
“silo” mentality, but which also allows and
enables sharing of data/software
• Helps that our workloads are mostly HTC
rather than HPC
• Has the benefit that a user’s individual
“silo” can scale elastically with demand
• Microbiology is the perfect starting point

So wouldn't it be great if….
(Diagram: users, each connected to a VM running on one of the servers)
• A user wants a server; the system spins up a
VM (a self contained ‘module’ containing
operating system and software) and slots it
onto one of the servers
• Other users also want systems; the
cloud OS will load those up too,
sharing a large server between users
• Because these VMs are on a
common system, these are then
sharable between users
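The idea of slotting VMs onto shared servers can be sketched as a placement problem. The example below uses a simple first-fit rule as a stand-in; it is not how any particular cloud OS schedules VMs, and the Server class, the first_fit function and the sizes used are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    free_cores: int
    free_ram_gb: int
    vms: List[str] = field(default_factory=list)

    def fits(self, cores: int, ram_gb: int) -> bool:
        """True if the server still has room for a VM of this size."""
        return cores <= self.free_cores and ram_gb <= self.free_ram_gb

    def place(self, vm_name: str, cores: int, ram_gb: int) -> None:
        """Reserve resources for the VM and record it on this server."""
        self.free_cores -= cores
        self.free_ram_gb -= ram_gb
        self.vms.append(vm_name)


def first_fit(servers: List[Server], vm_name: str,
              cores: int, ram_gb: int) -> Optional[Server]:
    """Place the VM on the first server with enough free capacity.

    Returns the chosen server, or None if no server can host the VM.
    """
    for server in servers:
        if server.fits(cores, ram_gb):
            server.place(vm_name, cores, ram_gb)
            return server
    return None


if __name__ == "__main__":
    # Hypothetical servers and VM sizes, for illustration only.
    pool = [Server("server-1", 8, 32), Server("server-2", 8, 32)]
    for user in ("user-a", "user-b", "user-c"):
        chosen = first_fit(pool, f"{user}-vm", cores=4, ram_gb=16)
        print(user, "->", chosen.name if chosen else "no capacity")
```

Several users end up sharing one large server until it is full, at which point the next VM goes to the next server.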

Iceberg breaking with the Cloud?
• The cloud provides a mechanism to
provide core infrastructure at scale
• Clouds remove the need for local hardware
maintenance and support
• Storage, compute, networking are most
expensive when bought one by one;
building a large system represents better
value for money
• Sharing servers simplifies the process of
installing and maintaining software
• It provides a mechanism for software/data
reuse as well as sharing
• Sharing servers simplifies training; you
can use the system you trained on,
once you get back home

Introducing the CLoud Infrastructure
for Microbial Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for microbial
bioinformatics
• ~£4M of hardware, capable
of supporting >1000
individual virtual servers
• Providing a core, national
cyberinfrastructure for
Microbial Bioinformatics

The CLoud Infrastructure for Microbial
Bioinformatics (climb.ac.uk)
• We are creating a one stop shop for
• Public/private cloud for use by UK
academics
• Standardised cloud images that
implement key pipelines
• Storage repository for data/images
that are made available online and
within our system, anywhere
(‘eduroam for microbial genomics’)
• We will provide access to other
databases from within the
system
• As well as providing a place to
support orphan databases and
tools
• We are also eating our own
dogfood

System Outline
• 4 sites
• Connected over Janet
• Different sizes of VM available; personal, standard, large memory, huge memory
• Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)
• ~7PB of object storage across 4 sites (~3PB usable)
• 400-500TB of local high performance storage per site
• A single system, with common log in, and between site data replication*
• System has been designed to enable the addition of extra nodes / Universities

CLIMB Overview
• 4 sites, running OpenStack
• Hardware procured in a
two stage process
• IBM/OCF provided
compute, Dell/Red Hat
provided storage
• Networks provided by
Brocade
• Are defining a reference
architecture to enable
other sites to trivially be
added

Hardware (per site)
• 2 router/firewalls (capable of
routing >80Gb each)
• 3 Controllers
• 21x 64 vCore, 512GB RAM
nodes
• 3x 192 vCore, 3TB RAM nodes
• ~500TB GPFS (local)
• 4 controllers
• Infiniband, with 10Gb failover
• ~2-3PB Ceph (shared)
• 27x 64TB nodes/site
• Cross site replication
• 10Gb Backbone
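The per-site figures can be multiplied out to give system-wide totals. The short calculation below assumes that all four sites carry identical hardware, as the "per site" heading suggests, and that 3TB of RAM means 3 x 1024 GB; the variable names are illustrative.

```python
SITES = 4

# Compute nodes per site, as listed on the slide.
STANDARD_NODES, STANDARD_VCORES, STANDARD_RAM_GB = 21, 64, 512
LARGE_NODES, LARGE_VCORES, LARGE_RAM_GB = 3, 192, 3 * 1024  # 3TB taken as 3072 GB

# Ceph storage nodes per site, as listed on the slide.
CEPH_NODES, CEPH_NODE_TB = 27, 64

vcores_per_site = STANDARD_NODES * STANDARD_VCORES + LARGE_NODES * LARGE_VCORES
ram_gb_per_site = STANDARD_NODES * STANDARD_RAM_GB + LARGE_NODES * LARGE_RAM_GB
ceph_raw_tb_per_site = CEPH_NODES * CEPH_NODE_TB

total_vcores = SITES * vcores_per_site
total_ram_gb = SITES * ram_gb_per_site
total_ceph_raw_tb = SITES * ceph_raw_tb_per_site

if __name__ == "__main__":
    print(f"vCores: {vcores_per_site} per site, {total_vcores} in total")
    print(f"RAM: {ram_gb_per_site} GB per site, {total_ram_gb} GB in total")
    print(f"Ceph raw: {ceph_raw_tb_per_site} TB per site, {total_ceph_raw_tb} TB in total")
```

The raw Ceph total of 6,912TB agrees with the ~7PB of object storage quoted in the System Outline, and 7,680 vCores at a 1:1 ratio is in line with supporting more than 1,000 VMs at once.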

Configuration
• OpenStack Kilo
• GPFS provides block storage
• Ceph provides block and object
storage (via S3 gateway)
• VMs are spun up locally on GPFS;
research data etc. are stored
within Ceph
• Ceph is configured to replicate
between sites
• Theoretically means if one site
goes down, users can still access
their data
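The claim that users can still reach their data when one site goes down can be illustrated with a toy model. This is not Ceph and says nothing about how Ceph replicates; it only shows the principle that a read succeeds as long as at least one online site holds a copy. The ReplicatedStore class and the site names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


class ReplicatedStore:
    """Toy object store that copies every object to every online site."""

    def __init__(self, sites: Iterable[str]) -> None:
        self.objects: Dict[str, Dict[str, bytes]] = {s: {} for s in sites}
        self.online: Set[str] = set(self.objects)

    def put(self, key: str, data: bytes) -> None:
        """Write the object to every site that is currently online."""
        if not self.online:
            raise RuntimeError("no site is online")
        for site in self.online:
            self.objects[site][key] = data

    def fail(self, site: str) -> None:
        """Mark a site as offline; its copies become unreachable."""
        self.online.discard(site)

    def get(self, key: str) -> Tuple[str, bytes]:
        """Return (site, data) from the first online site holding the key."""
        for site, stored in self.objects.items():
            if site in self.online and key in stored:
                return site, stored[key]
        raise KeyError(key)


if __name__ == "__main__":
    store = ReplicatedStore(["site-1", "site-2", "site-3", "site-4"])
    store.put("sample-001.fastq", b"@read1\nGATTATCC\n+\nBBBBBBBB\n")
    store.fail("site-1")
    site, _ = store.get("sample-001.fastq")
    print("served from", site)
```

The slide's "theoretically" matters: the toy model ignores replication lag, partial failures and consistency, which are exactly the parts that are hard in a real system.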

Performance
(Figure: benchmark results on an axis from 0 to 3 for beast, blastn, gunzip, muscle, nhmmer, phyml, prokka, snippy, velvetg and velveth, plus their geometric mean)
• Generally extremely good
• Performance is quite consistent across workloads
• But we do see how critical configuration is
• We also see some issues with the large memory machines
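The benchmark chart summarises the individual tools with a geometric mean, which is the usual choice when averaging ratios because a result that is twice as fast and one that is twice as slow cancel out. A minimal sketch of that calculation follows; the numbers are hypothetical and are not the values from the chart.

```python
import math
from typing import Iterable


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of strictly positive values, computed via logarithms."""
    items = list(values)
    if not items:
        raise ValueError("at least one value is required")
    if any(v <= 0 for v in items):
        raise ValueError("all values must be positive")
    return math.exp(sum(math.log(v) for v in items) / len(items))


if __name__ == "__main__":
    # Hypothetical relative runtimes, for illustration only.
    relative = {"tool-a": 0.5, "tool-b": 2.0, "tool-c": 1.0}
    print(f"geometric mean: {geometric_mean(relative.values()):.3f}")
    print(f"arithmetic mean: {sum(relative.values()) / len(relative):.3f}")
```

With these toy numbers the arithmetic mean would suggest an overall slowdown even though the halved and doubled results balance, which is why the geometric mean is preferred for this kind of summary.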

Where are we now?
• Computational hardware was procured by March 2015 (~6 month process)
• Ahead of schedule - system is now online and in use for research (>50
users)
• Adopting two models for access
• Access for registered users to core OpenStack system online now
• “version 1.0” system providing universal access to predefined images – Launch in
July

Providing easy access to resource

Coupling sequencing to the
compute resource

Providing a mechanism for data
and software sharing

Key (ongoing) challenges
• Federated access
• VM Scheduling
• Storage configuration
• Block size on GPFS
• Getting Ceph running as we want
• Dealing with large volumes in OpenStack
• Networking
• VPN tunnels between sites
• Difficulties with Vyatta
• Complexity of OpenStack and Ceph
• User experience of Horizon
• Lack of useful error info
• Unexpected gotchas
• MySQL/log problems
• Update processes
• Expected gotchas
• High-use Users

Overcoming Biological reality
• We don’t need an LHC to generate
TBs of data a week
• We do this routinely in a (my) lab
• Poses serious (local)
computational challenges
• Forces us to consider better how
we design our (global)
infrastructure
• Virtualisation provides a way to
“meet the user where they are”
• Is a new solution for an old
problem. Time will tell if it works

The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) –
Joint PIs
• Professor Mark Achtman (Warwick), Professor Steve Busby FRS
(Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim
Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
• Dr Nick Loman (Birmingham site lead)* and Dr Chris Quince
(Warwick), Dr Daniel Falush (Bath) ; MRC Research Fellows
• Simon Thompson (Birmingham, Project Technical/OpenStack lead),
• Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt
Bull (Cardiff Sysadmin), Radoslaw Poplawski (Birmingham sysadmin)
• Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick
(Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr
Christine Kitchen, Professor Martyn Guest (Cardiff HPC team), Matt
Ismail (Warwick HPC lead)
* Principal bioinformaticians architecting and designing the system

Moving compute to the data isn’t
always possible
• Data generation: multiple TB, every few days
• Transport of data from lab
• Processing, analysis, storage and sharing of data
Constraints:
• Volume of data generated; network speed
• Data complexity; processing power
• Size of data generated; storage and transfer limitations
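How much network speed limits the transport step can be estimated with simple arithmetic. The sketch below computes the time needed to move a given volume of data over a link; the link speeds are assumptions chosen for illustration, decimal units are used (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second), and the efficiency parameter is a hypothetical allowance for protocol overhead and contention.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is the fraction of the nominal link speed actually achieved.
    """
    if data_tb < 0 or link_gbps <= 0:
        raise ValueError("data volume must be >= 0 and link speed must be > 0")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed link speeds; the slide does not state what a lab has available.
    for gbps in (0.1, 1.0, 10.0):
        print(f"1 TB at {gbps} Gb/s: {transfer_hours(1.0, gbps):.2f} hours")
```

At an ideal 1 Gb/s, a single terabyte takes a little over two hours, so several terabytes every few days can occupy a shared building link for a significant part of the working week.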
