The Cloud Infrastructure for
Microbial Bioinformatics;
overcoming barriers to software
use and data sharing in biology
Bath HPC Seminar, 9th June 2016
Dr Thomas R Connor
Senior Lecturer
Cardiff University School of Biosciences
@tomrconnor ; connortr@cardiff.ac.uk
http://www.climb.ac.uk
Overview
• Microbiology
introduction
• Why Biology is now a
data intensive
science
• Biological problems
• CLIMB
Microbiology
• Microbiology is the study
of microorganisms
• Make up the majority of
life on earth
• Responsible for
everything from the
oxygen we breathe to
being able to digest our
food
• Microbes are everywhere
• Plus the community is well
defined and has been at
the cutting edge of
applying genomics
approaches
More importantly, microbial
pathogens cause disease
• 200 Million people have
GI disease at any point
in time
• In a day, they will
produce ~60,000,000
litres of diarrhoea
• That is equivalent to all
the water passing over
Victoria Falls in one
minute
Why it matters
• 2 Billion cases of disease
every year worldwide
• ~5% of all deaths in low and
middle income countries are
due to diarrhoeal diseases
• Mostly kills children
• Not only limited to
low/middle income
countries
• GI pathogens cause tens of
thousands of cases of severe
disease in the UK, with many
deaths
Why Biology is now a data
intensive science: Genomics
• Bacteria are small; but
Biology is now a data
intensive science: Why?
• Genomics is a term
given to a group of
technologies
• These technologies
allow us to explore the
genome sequence of an
organism
What does that mean?
DNA encodes the blueprint for virtually
every cell of every organism on the planet
That blueprint defines the features of the
cell in which it is found
Genomics enables us to
read this blueprint
This is possible because of high
throughput sequencing
ABI “Sanger” sequencers
Ion Torrent
Pac Bio
Roche 454
Solexa / Illumina
Timeline of genome sequencing (figure):
1995: First bacterial genome sequence
1998: MLST
2000: Draft of the human genome
2003: Human genome finished
2008: Evolution of Typhi (19 genomes)
2010: Evolution of MRSA ST239 (63 genomes)
2011: Evolution of PMEN1 (240 genomes)
2012: 1,000 human genomes published
Sanger: 96 reads/run
@IL0_0000:1:1:1:1#0/1
gattatccttcgctcaatctggggcaggcggtgatggtctattgctatcaattagcaacattaatacaacaaccggcgaaaagtgatgcaacggcagacc
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Illumina: 500,000,000 reads/run (>1TB data)
454: 1,000,000 reads/run
Nanopore
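The four-line record shown above is a FASTQ read: a header, the bases, a "+" separator, and one quality character per base. A minimal Python sketch of reading such a record follows. The names `FastqRecord` and `parse_fastq_record` are hypothetical, and the default quality offset of 64 (used by older Illumina pipelines) is an assumption, not something stated in the talk.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq_record(lines: List[str], offset: int = 64) -> FastqRecord:
    """Parse one four-line FASTQ record into name, bases and Phred scores."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly four lines")
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("malformed FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes a score as its ASCII code minus the offset
    # (offset=64 is an assumption; many newer files use 33 instead).
    scores = [ord(ch) - offset for ch in quality]
    return FastqRecord(header[1:], sequence, scores)


# A shortened version of the read on the slide (first 10 bases only).
record = parse_fastq_record(["@IL0_0000:1:1:1:1#0/1", "gattatcctt", "+", "BBBBBBBBBB"])
print(record.name, len(record.sequence), record.qualities[0])
```

Multiplying this small record by hundreds of millions of reads per run is what turns a simple text format into a storage problem.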
Which really matters because: The ever decreasing
cost of sequencing a human genome*
2015
2003
* Humans are boring. For the
same money we can sequence
~50 bacterial genomes
Global sequencing data is growing
fast
Challenges of Biological Big Data
However, the major costs and difficulties do not lie with the
generation of data; they lie with how we share, store and analyse the
data we generate
Informatics expertise
User accessibility of software/hardware
Appropriate compute capacity
Software development
Storage availability
Network capacity
There are many biological analysis platforms available now that make
producing large, rich complex datasets relatively cheap and easy
[Figure: Wave 1, Wave 2 and Wave 3, annotated with date ranges between 1937 and 2009]
Mutreja, Kim, Thomson, Connor et al, Nature, 2011
Illustrating the size of the challenge
320 samples
Approx. 600-700GB
uncompressed data
Sequence Assembly
Each job 4-8GB RAM
1 CPU core
Each job generates
intermediate files of
~6GB
Runtime: 1+
hours/job
Sequence mapping
320 jobs
Each job 4GB RAM
1 CPU core
Each job generates
intermediate files of
~3GB
Runtime: 1 hour/job
Phylogenomics
1 job, 1+ cores, up to
128GB RAM
Intermediate file size
~2+GB
Output file ~2GB
Runtime 1-2 days
Virulence and
antimicrobial
resistance screening
320 jobs, single core
100MB RAM
Runtime: 5 mins/job
Generates 10-20 small
files per job
Bayesian modelling
3 jobs, 1 core+, up to
1 GB RAM
CPU intensive
Runtime: 2 days per
job
Output file ~10GB
Can use GPUs
Written in Java
Larger RAM HTC HTC
HPC Possibly HPC
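The per-job figures above can be added up to see the aggregate demand of the many small jobs. The sketch below is a rough calculation under stated assumptions: the `Stage` class is hypothetical, the assembly stage is assumed to run one job per sample (the slide gives no job count), and runtimes quoted as "1+ hours" are treated as exactly one hour, so the totals are lower bounds.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    jobs: int
    cores_per_job: int
    hours_per_job: float
    intermediate_gb_per_job: float

    def core_hours(self) -> float:
        return self.jobs * self.cores_per_job * self.hours_per_job

    def intermediate_gb(self) -> float:
        return self.jobs * self.intermediate_gb_per_job


stages = [
    # Assumption: one assembly job per sample (320 jobs).
    Stage("assembly", 320, 1, 1.0, 6.0),
    Stage("mapping", 320, 1, 1.0, 3.0),
    # 5 minutes per job; no intermediate size is given, so 0 is used here.
    Stage("screening", 320, 1, 5 / 60, 0.0),
]

for s in stages:
    print(f"{s.name}: {s.core_hours():.1f} core-hours, "
          f"{s.intermediate_gb():.0f} GB intermediate")
```

Under these assumptions the assembly and mapping stages alone produce close to 3TB of intermediate files, several times the size of the input data.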
At the other end of the scale
This is beginning to matter in a lot
of new places
Drug Development
Diagnostics
Treatment Selection
Target Identification
Public Health
The rise and rise of biological shadow
IT
• Everything we do is underpinned by
having access to compute and
storage capacity
• Joined Cardiff from the Sanger in
2012
• When I arrived at Cardiff, I had the
joy of working out how I was going
to do my research in a new place
• How to get scripts/software
installed
• Where to install scripts/software
• And I had to do this mostly on my
own
• So I built my own system
• Not an unusual story
Understanding how
bioinformaticians work
[Chart: Where do bioinformaticians do most of their work? Categories: Cloud; Institution-wide resource; Local resource; Personal computer]
[Chart (percentage of respondents): Best for job; Good documentation; Word of mouth recommendation; Used in similar analysis; Quickest; Already installed on server; Other; Graphical interface]
Results from:
Loman, Nicholas; Connor, Thomas (2015):
Bioinformatics infrastructure and training
survey.
figshare. http://dx.doi.org/10.6084/m9.figshare.1572287
Compounded by the fact that
biologists work in silos
• As a discipline we think in
terms of ‘labs’, ‘groups’ and
‘experiments’ being wet
work
• IT infrastructure is treated
the same way
• This means we develop
bespoke, local solutions to
informatics problems
• Because our software/data
is locally stored/setup – it is
often less portable than
wet lab methods /
approaches
Why am I here?
• We need a revolution in Biology
• We need to change how we do
bioinformatics
• We need to change how we share our
data
• We have to accept that biologists won’t (in
the short or medium term) become skilled
in IT
• But – we can’t change the field, or how
PIs think
• Cloud-based approaches provide a
mechanism that could accommodate this
“silo” mentality, but which also allows and
enables sharing of data/software
• Helps that our workloads are mostly HTC
rather than HPC
• Has the benefit that a user’s individual
“silo” can scale elastically with demand
• Microbiology is the perfect starting point
BIOINFORMATICS
One of the servers
A user
A user wants a server; the system spins up a
VM (a self contained ‘module’ containing
operating system and software) and slots it
onto one of the servers
Other users also want systems; the
cloud OS will load those up too
sharing a large server between users
A user
A user
So wouldn't it be great if….
Because these VMs are on a
common system, these are then
sharable between users
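The idea of slotting each user's VM onto a shared server can be sketched as a simple first-fit placement. This is an illustration only: the `Server` class, the `place_vm` function and the VM sizes are hypothetical, and a real cloud scheduler applies many more filters and weights than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    free_cores: int
    free_ram_gb: int
    vms: List[str] = field(default_factory=list)


def place_vm(servers: List[Server], user: str, cores: int, ram_gb: int) -> Optional[str]:
    """First-fit placement: put the VM on the first server with enough room."""
    for server in servers:
        if server.free_cores >= cores and server.free_ram_gb >= ram_gb:
            server.free_cores -= cores
            server.free_ram_gb -= ram_gb
            server.vms.append(user)
            return server.name
    return None  # no server has capacity left


# Two hypothetical servers, each with 64 cores and 512GB RAM.
servers = [Server("server-1", 64, 512), Server("server-2", 64, 512)]
placements = {
    user: place_vm(servers, user, cores=32, ram_gb=128)
    for user in ["alice", "bob", "carol"]
}
print(placements)
```

Because every VM lives on the same pool of servers, an image built by one user can in principle be started by another, which is the sharing property the slides point to.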
Iceberg breaking with the Cloud?
• The cloud provides a mechanism to
provide core infrastructure at scale
• Clouds remove the need for local hardware
maintenance and support
• Storage, compute, networking are most
expensive when bought one by one;
building a large system represents better
value for money
• Sharing servers simplifies the process of
installing and maintaining software
• It provides a mechanism for software/data
reuse as well as sharing
• Sharing servers simplifies training; you
can use the system you trained on,
once you get back home
Introducing the CLoud Infrastructure
for Microbial Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for microbial
bioinformatics
• ~£4M of hardware, capable
of supporting >1000
individual virtual servers
• Providing a core, national
cyberinfrastructure for
Microbial Bioinformatics
The CLoud Infrastructure for Microbial
Bioinformatics (climb.ac.uk)
• We are creating a one stop shop for
Microbial Bioinformatics
• Public/private cloud for use by UK
academics
• Standardised cloud images that
implement key pipelines
• Storage repository for data/images
that are made available online and
within our system, anywhere
(‘eduroam for microbial genomics’)
• We will provide access to other
databases from within the
system
• As well as providing a place to
support orphan databases and
tools
• We are also eating our own
dogfood
System Outline
• 4 sites
• Connected over Janet
• Different sizes of VM available; personal, standard, large memory, huge memory
• Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)
• ~7PB of object storage across 4 sites (~3PB usable)
• 400-500TB of local high performance storage per site
• A single system, with common log in, and between site data replication*
• System has been designed to enable the addition of extra nodes / Universities
CLIMB Overview
• 4 sites, running OpenStack
• Hardware procured in a
two stage process
• IBM/OCF provided
compute, Dell/Red Hat
provided storage
• Networks provided by
Brocade
• We are defining a reference
architecture so that
other sites can be
added easily
Hardware (per site)
• 2 router/firewalls (capable of
routing >80Gb each)
• 3 Controllers
• 21x 64 vCore, 512GB RAM
nodes
• 3x 192 vCore, 3TB RAM nodes
• ~500TB GPFS (local)
• 4 controllers
• Infiniband, with 10Gb failover
• ~2-3PB Ceph (shared)
• 27x 64TB nodes/site
• Cross site replication
• 10Gb Backbone
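The per-site node counts above can be multiplied out to give totals. The short calculation below is a sketch: it assumes all four sites have the same hardware, treats 3TB as 3072GB, and counts raw rather than usable Ceph capacity.

```python
SITES = 4  # assumption: every site has the hardware listed above

# Per-site compute nodes as (count, vCores per node, RAM in GB per node).
# 3TB is taken as 3072GB here, which is an assumption about the units.
compute_nodes = [(21, 64, 512), (3, 192, 3072)]

vcores_per_site = sum(count * vcores for count, vcores, _ in compute_nodes)
ram_gb_per_site = sum(count * ram for count, _, ram in compute_nodes)

# Ceph storage nodes per site: 27 nodes of 64TB each (raw capacity).
ceph_raw_tb_per_site = 27 * 64

print("vCores per site:", vcores_per_site)
print("vCores across all sites:", vcores_per_site * SITES)
print("RAM per site (GB):", ram_gb_per_site)
print("Raw Ceph per site (TB):", ceph_raw_tb_per_site)
print("Raw Ceph across all sites (TB):", ceph_raw_tb_per_site * SITES)
```

The raw Ceph total across four sites comes to just under 7PB, in line with the ~7PB of object storage quoted in the system outline.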
Configuration
• OpenStack Kilo
• GPFS provides block storage
• Ceph provides block and object
storage (via S3 gateway)
• VMs are spun up locally on GPFS;
research data etc. are stored
within Ceph
• Ceph is configured to replicate
between sites
• Theoretically means if one site
goes down, users can still access
their data
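The point of between-site replication can be shown with a toy model. This is not Ceph's API or its replication mechanism; `ReplicatedStore` and the site names are hypothetical, and every write is simply copied to each online site so that a read still succeeds when one site is offline.

```python
from typing import Dict, List, Optional


class ReplicatedStore:
    """Toy model of an object store that copies every object to all sites."""

    def __init__(self, site_names: List[str]) -> None:
        self.sites: Dict[str, Dict[str, bytes]] = {name: {} for name in site_names}
        self.online: Dict[str, bool] = {name: True for name in site_names}

    def put(self, key: str, data: bytes) -> None:
        # Copy the object to every site that is currently online.
        for name, objects in self.sites.items():
            if self.online[name]:
                objects[key] = data

    def get(self, key: str) -> Optional[bytes]:
        # Read from the first online site that holds the object.
        for name, objects in self.sites.items():
            if self.online[name] and key in objects:
                return objects[key]
        return None


store = ReplicatedStore(["site-a", "site-b", "site-c", "site-d"])
store.put("sample_001.fastq", b"@read1\nACGT\n+\nIIII\n")
store.online["site-a"] = False  # simulate one site going down
print(store.get("sample_001.fastq") is not None)
```

A real deployment also has to handle writes made while a site is down and the resynchronisation afterwards, which this sketch leaves out.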
[Chart: benchmark results (axis 0 to 3) for beast, blastn, gunzip, muscle, nhmmer, phyml, prokka, snippy, velvetg and velveth, with their geometric mean]
Performance
• Generally extremely good
• Performance is quite consistent across workloads
• But we do see how critical configuration is
• We also see some issues with the large memory machines
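Benchmark ratios across different tools are usually summarised with a geometric mean, as the chart does. A minimal sketch of that calculation follows; the `geometric_mean` helper and the tool names and values in `relative` are hypothetical placeholders, not the measurements from the talk.

```python
import math
from typing import Dict, Iterable


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean via the mean of logarithms; values must be positive."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one positive value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical relative timings, NOT the results shown in the chart.
relative: Dict[str, float] = {"tool_a": 2.0, "tool_b": 0.5, "tool_c": 1.0}
print(round(geometric_mean(relative.values()), 6))
```

Unlike an arithmetic mean, this summary treats a tool that runs twice as fast and one that runs twice as slow as cancelling out.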
Where are we now?
• Computational hardware was procured by March 2015 (~6 month process)
• Ahead of schedule - system is now online and in use for research (>50
users)
• Adopting two models for access
• Access for registered users to core OpenStack system online now
• “version 1.0” system providing universal access to predefined images – Launch in
July
Providing easy access to resource
Coupling sequencing to the
compute resource
Providing a mechanism for data
and software sharing
Key (ongoing) challenges
• Federated access
• VM Scheduling
• Storage configuration
• Block size on GPFS
• Getting Ceph running as we want
• Dealing with large volumes in OpenStack
• Networking
• VPN tunnels between sites
• Difficulties with Vyatta
• Complexity of OpenStack and Ceph
• User experience of Horizon
• Lack of useful error info
• Unexpected gotchas
• MySQL/log problems
• Update processes
• Expected gotchas
• High-use Users
Overcoming Biological reality
• We don’t need an LHC to generate
TBs of data a week
• We do this routinely in a (my) lab
• Poses serious (local)
computational challenges
• Forces us to consider better how
we design our (global)
infrastructure
• Virtualisation provides a way to
“meet the user where they are”
• Is a new solution for an old
problem. Time will tell if it works
The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) –
Joint PIs
• Professor Mark Achtman (Warwick), Professor Steve Busby FRS
(Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim
Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
• Dr Nick Loman (Birmingham site lead)* and Dr Chris Quince
(Warwick), Dr Daniel Falush (Bath) ; MRC Research Fellows
• Simon Thompson (Birmingham, Project Technical/OpenStack lead),
• Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt
Bull (Cardiff Sysadmin), Radoslaw Poplawski (Birmingham sysadmin)
• Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick
(Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr
Christine Kitchen, Professor Martyn Guest (Cardiff HPC team), Matt
Ismail (Warwick HPC lead)
* Principal bioinformaticians architecting and designing the system
Moving compute to the data isn’t
always possible
Volume of data generated;
network speed
Data complexity;
processing power
Size of data generated; Storage
and transfer limitations
Data generation
Multiple TB, every
few days
Transport of data
from lab
Processing, analysis
storage and sharing
of data
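How long it takes to move the data off the sequencer depends on volume and link speed. The arithmetic can be sketched as below; `transfer_hours` is a hypothetical helper, the 2TB batch size and the 1 and 10 Gbit/s links are illustrative assumptions, and real transfers rarely reach the full line rate.

```python
def transfer_hours(data_tb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link of the given speed."""
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# Illustrative batch of 2TB over two assumed link speeds.
for link in (1, 10):
    print(f"2 TB over {link} Gbit/s: {transfer_hours(2, link):.2f} hours")
```

Lowering the `efficiency` argument gives a feel for how quickly a shared or congested link makes moving the data slower than generating it.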

 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 

Climb bath

  • 1. The Cloud Infrastructure for Microbial Bioinformatics; overcoming barriers to software use and data sharing in biology Bath HPC Seminar, 9th June 2016 Dr Thomas R Connor Senior Lecturer Cardiff University School of Biosciences @tomrconnor ; connortr@cardiff.ac.uk http://www.climb.ac.uk
  • 2. Overview • Microbiology introduction • Why Biology is now a data intensive science • Biological problems • CLIMB
  • 3. Microbiology • Microbiology is the study of microorganisms • Make up the majority of life on earth • Responsible for everything from the oxygen we breathe to being able to digest our food • Microbes are everywhere • Plus community is well defined and has been at the cutting edge of applying genomics approaches
  • 4. More importantly, microbial pathogens cause disease • 200 Million people have GI disease at any point in time • In a day, they will produce ~60,000,000 litres of diarrhoea • That is equivalent to all the water passing over Victoria Falls in one minute
  • 5. Why it matters • 2 Billion cases of disease every year worldwide • ~5% of all deaths in low and middle income countries are due to diarrhoeal diseases • Mostly kills children • Not only limited to low/middle income countries • GI pathogens cause tens of thousands of cases of severe disease in the UK, with many deaths
  • 6. Why Biology is now a data intensive science: Genomics • Bacteria are small; but Biology is now a data intensive science: Why? • Genomics is a term given to a group of technologies • These technologies allow us to explore the genome sequence of an organism
  • 7. What does that mean? DNA encodes the blueprint for virtually every cell of every organism on the planet That blueprint defines the features of the cell in which it is found Genomics enables us to read this blueprint
  • 8. This is possible because of high throughput sequencing ABI “Sanger” sequencers Ion Torrent Pac Bio Roche 454 Solexa / Illumina 2008 2012 Evolution of Typhi (19 genomes) 2010 Evolution of MRSA ST239 (63 genomes) Evolution of PMEN1 (240 genomes) 2011 1,000 human genomes published First Bacterial Genome Sequence 1995 1998 MLST 2000 2003 Draft of the Human Genome Human Genome finished 96 reads/run @IL0_0000:1:1:1:1#0/1 gattatccttcgctcaatctggggcaggcggtgatggtctattgctatcaattagcaacattaatacaacaaccggcgaaaagtgatgcaacggcagacc + BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Illumina: 500,000,000 reads/run (>1TB data) 454: 1,000,000 reads/run Nanopore
  • 9. Which really matters because: The ever decreasing cost of sequencing a human genome* 2015 2003 * Humans are boring. For the same money we can sequence ~50 bacterial genomes
  • 10. Global sequencing data is growing fast
  • 11. Challenges of Biological Big Data However, the major costs and difficulties do not lie with the generation of data, they lie with how we share, store and analyse the data we generate Informatics expertise User accessibility of software/hardware Appropriate compute capacity Software development Storage availability Network capacity There are many biological analysis platforms available now that make producing large, rich complex datasets relatively cheap and easy
  • 12. Illustrating the size of the challenge [Figure: groups labelled Wave 1, Wave 2 and Wave 3, annotated with date ranges between 1937 and 2009; Mutreja, Kim, Thomson, Connor et al, Nature, 2011]
  • 13. 320 samples; approx. 600-700GB uncompressed data. Sequence assembly: each job 4-8GB RAM, 1 CPU core; each job generates intermediate files of ~6GB; runtime 1+ hours/job. Sequence mapping: 320 jobs; each job 4GB RAM, 1 CPU core; each job generates intermediate files of ~3GB; runtime 1 hour/job. Phylogenomics: 1 job, 1+ cores, up to 128GB RAM; intermediate file size ~2+GB; output file ~2GB; runtime 1-2 days. Virulence and antimicrobial resistance screening: 320 jobs, single core, 100MB RAM; runtime 5 mins/job; generates 10-20 small files per job. Bayesian modelling: 3 jobs, 1+ cores, up to 1GB RAM; CPU intensive; runtime 2 days per job; output file ~10GB; can use GPUs; written in Java. Workload labels on the slide: Larger RAM, HTC, HTC, HPC, Possibly HPC
  • 14. At the other end of the scale
  • 15. This is beginning to matter in a lot of new places Drug Development Diagnostics Treatment SelectionTarget Identification Public Health
  • 16. The rise and rise of biological shadow IT • Everything we do is underpinned by having access to compute and storage capacity • Joined Cardiff from the Sanger in 2012 • When I arrived at Cardiff, I had the joy of working out how I was going to do my research in a new place • How to get scripts/software installed • Where to install scripts/software • And I had to do this mostly on my own • So I built my own system • Not an unusual story
  • 17. Understanding how bioinformaticians work. Chart 1, “Where do bioinformaticians do most of their work”: Cloud, Institution-wide resource, Local resource, Personal computer. Chart 2 (percentage axis): Best for job, Good documentation, Word of mouth recommendation, Used in similar analysis, Quickest, Already installed on server, Other, Graphical interface. Results from: Loman, Nicholas; Connor, Thomas (2015): Bioinformatics infrastructure and training survey. figshare. http://dx.doi.org/10.6084/m9.figshare.1572287
  • 18. Compounded by fact that biologists work in silos • As a discipline we think in terms of ‘labs’, ‘groups’ and ‘experiments’ being wet work • IT infrastructure is treated the same way • This means we develop bespoke, local solutions to informatics problems • Because our software/data is locally stored/setup – it is often less portable than wet lab methods / approaches
  • 19. Why am I here? • We need a revolution in Biology • We need to change how we do bioinformatics • We need to change how we share our data • We have to accept that biologists won't (in the short or medium term) become skilled in IT • But – we can’t change the field, or how PIs think • Cloud-based approaches provide a mechanism that could accommodate this “silo” mentality, but which also allows and enables sharing of data/software • Helps that our workloads are mostly HTC rather than HPC • Has the benefit that a user's individual “silo” can scale elastically with demand • Microbiology is the perfect starting point
  • 20. So wouldn't it be great if… A user wants a server; the system spins up a VM (a self-contained ‘module’ containing operating system and software) and slots it onto one of the servers. Other users also want systems; the cloud OS will load those up too, sharing a large server between users. Because these VMs are on a common system, they are then sharable between users. [Diagram labels: One of the servers; A user]
  • 21. Iceberg breaking with the Cloud? • The cloud provides a mechanism to provide core infrastructure at scale • Clouds remove the need for local hardware maintenance and support • Storage, compute, networking are most expensive when bought one by one; building a large system represents better value for money • Sharing servers simplifies the process of installing and maintaining software • It provides a mechanism for software/data reuse as well as sharing • Sharing servers simplifies training; you can use the system you trained on, once you get back home
  • 22. Introducing the CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • ~£4M of hardware, capable of supporting >1000 individual virtual servers • Providing a core, national cyberinfrastructure for Microbial Bioinformatics
  • 23. The CLoud Infrastructure for Microbial Bioinformatics (climb.ac.uk) • We are creating a one stop shop for Microbial Bioinformatics • Public/private cloud for use by UK academics • Standardised cloud images that implement key pipelines • Storage repository for data/images that are made available online and within our system, anywhere (‘eduroam for microbial genomics’) • We will provide access to other databases from within the system • As well as providing a place to support orphan databases and tools • We are also eating our own dogfood
  • 24. System Outline • 4 sites • Connected over Janet • Different sizes of VM available; personal, standard, large memory, huge memory • Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM) • ~7PB of object storage across 4 sites (~3PB usable) • 400-500TB of local high performance storage per site • A single system, with common log in, and between-site data replication* • System has been designed to enable the addition of extra nodes / Universities
  • 25. CLIMB Overview • 4 sites, running OpenStack • Hardware procured in a two stage process • IBM/OCF provided compute, Dell/Red Hat provided storage • Networks provided by Brocade • We are defining a reference architecture so that other sites can be added easily
  • 26. Hardware (per site) • 2 router/firewalls (capable of routing >80Gb each) • 3 Controllers • 21x 64 vCore, 512GB RAM nodes • 3x 192 vCore, 3TB RAM nodes • ~500TB GPFS (local) • 4 controllers • Infiniband, with 10Gb failover • ~2-3PB Ceph (shared) • 27x 64TB nodes/site • Cross site replication • 10Gb Backbone
  • 27. Configuration • OpenStack Kilo • GPFS provides block storage • Ceph provides block and object storage (via S3 gateway) • VMs are spun up locally on GPFS ; research data etc are stored within Ceph • Ceph is configured to replicate between sites • Theoretically means if one site goes down, users can still access their data
  • 28. Performance • Generally extremely good • Performance is quite consistent across workloads • But we do see how critical configuration is • We also see some issues with the large memory machines [Chart: benchmark results for beast, blastn, gunzip, muscle, nhmmer, phyml, prokka, snippy, velvetg and velveth, plus their geometric mean; axis range 0 to 3]
  • 29. Where are we now? • Computational hardware was procured by March 2015 (~6 month process) • Ahead of schedule - system is now online and in use for research (>50 users) • Adopting two models for access • Access for registered users to core OpenStack system online now • “version 1.0” system providing universal access to predefined images – Launch in July
  • 30. Providing easy access to resource
  • 31. Coupling sequencing to the compute resource
  • 32. Providing a mechanism for data and software sharing
  • 33. Key (ongoing) challenges • Federated access • VM Scheduling • Storage configuration • Block size on GPFS • Getting Ceph running as we want • Dealing with large volumes in OpenStack • Networking • VPN tunnels between sites • Difficulties with Vyatta • Complexity of OpenStack and Ceph • User experience of Horizon • Lack of useful error info • Unexpected gotchas • MySQL/log problems • Update processes • Expected gotchas • High-use Users
  • 34. Overcoming Biological reality • We don’t need an LHC to generate TBs of data a week • We do this routinely in a (my) lab • Poses serious (local) computational challenges • Forces us to consider better how we design our (global) infrastructure • Virtualisation provides a way to “meet the user where they are” • It is a new solution for an old problem. Time will tell if it works
  • 35. The CLIMB Consortium Are • Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) – Joint PIs • Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is • Dr Nick Loman (Birmingham site lead)* and Dr Chris Quince (Warwick), Dr Daniel Falush (Bath); MRC Research Fellows • Simon Thompson (Birmingham, Project Technical/OpenStack lead) • Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt Bull (Cardiff Sysadmin), Radoslaw Poplawski (Birmingham sysadmin) • Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick (Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr Christine Kitchen, Professor Martyn Guest (Cardiff HPC team), Matt Ismail (Warwick HPC lead) * Principal bioinformaticians architecting and designing the system
  • 36. The CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • ~£4M of hardware, capable of supporting >1000 individual virtual servers • Providing a core, national cyberinfrastructure for Microbial Bioinformatics
  • 37. Moving compute to the data isn’t always possible Volume of data generated; network speed Data complexity; processing power Size of data generated; Storage and transfer limitations Data generation Multiple TB, every few days Transport of data from lab Processing, analysis storage and sharing of data
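
The per-stage figures on slide 13 can be tallied to see where the compute time goes. The sketch below is a rough illustration, not part of the talk. It uses the lower-bound runtimes and upper-bound RAM figures quoted on the slide. It assumes assembly runs as one job per sample (320 jobs), since the slide gives no job count for that stage. The names `Stage`, `STAGES`, `core_hours` and `many_small_jobs_fraction`, and the 100-job threshold, are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    jobs: int
    cores_per_job: int
    hours_per_job: float  # lower bound quoted on slide 13
    peak_ram_gb: float    # upper bound quoted on slide 13


# Figures from slide 13. The assembly job count is an assumption
# (one job per sample); the slide does not state it.
STAGES = [
    Stage("assembly", 320, 1, 1.0, 8.0),
    Stage("mapping", 320, 1, 1.0, 4.0),
    Stage("phylogenomics", 1, 1, 24.0, 128.0),     # "1-2 days", "1+ cores"
    Stage("amr_screening", 320, 1, 5 / 60, 0.1),   # 5 mins/job, 100MB RAM
    Stage("bayesian", 3, 1, 48.0, 1.0),            # 2 days per job
]


def core_hours(stage: Stage) -> float:
    """Core-hours for one stage: jobs x cores x hours per job."""
    return stage.jobs * stage.cores_per_job * stage.hours_per_job


def total_core_hours(stages) -> float:
    """Sum of core-hours over all stages."""
    return sum(core_hours(s) for s in stages)


def many_small_jobs_fraction(stages, min_jobs: int = 100) -> float:
    """Share of core-hours spent in stages made of many independent jobs.

    The min_jobs threshold is an illustrative cut-off, not from the talk.
    """
    small = sum(core_hours(s) for s in stages if s.jobs >= min_jobs)
    return small / total_core_hours(stages)


if __name__ == "__main__":
    for s in STAGES:
        print(f"{s.name:15s} {core_hours(s):8.1f} core-hours, "
              f"peak {s.peak_ram_gb:g} GB RAM per job")
    print(f"total: {total_core_hours(STAGES):.1f} core-hours")
    print(f"many-small-jobs share: {many_small_jobs_fraction(STAGES):.0%}")
```

Under these assumptions, roughly four fifths of the core-hours fall in stages made of hundreds of independent single-core jobs. This is in line with the remark on slide 19 that these workloads are mostly HTC rather than HPC. The single phylogenomics job is the one that sets the large-memory requirement.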