Microbial Bioinformatics in the
cloud
Introducing CLIMB
Dr Tom Connor
Cardiff University
Biological Cloud Computing Workshop
www.climb.ac.uk
@tomrconnor ; @mrcclimb
Overview
• Background
• View from a newly
minted academic
Bioinformatician at a
“regular” University
– Bioinformatics needs
and challenges
• Introducing CLIMB
Big Data
[Figure: Wave 1, Wave 2 and Wave 3, annotated with date ranges from 1937-61 to 2005-09]
Adapted from Mutreja, Kim, Thomson, Connor et al, Nature, 2011
Population genomics; using genomics to reconstruct the
global spread of pathogens
Bioinformatics; developing new approaches to analyse
massive datasets
From Marttinen, Hanage, Croucher,
Connor, Harris, Bentley and
Corander, Nucleic Acids Res. 2011
From Cheng, Connor, Sirén, Aanensen,
Corander, MBE, 2013
Grand challenges; fighting antimicrobial resistance
From Fookes et al, PLoS Pathogens 2011
From Reuter, Connor et al, PNAS 2014
From Okoro, Kingsley, Connor et al.
Nature Genetics 2012
From He et al, Nature Genetics
2013
From Dziva, Hauser, Connor et al,
I&I, 2013
Pathogen genomics: understanding how pathogens
evolve
About Me
• 2006-2010 - PhD at
Imperial; Population
Genetics / Molecular
Epidemiology of
bacterial pathogens
• 2010-2012 - Post-Doc,
Wellcome Trust Sanger
Institute; Pathogen
Genomics
• 2012 – present Lecturer
then Senior Lecturer,
Cardiff University
Hanage, Fraser, Tang, Connor, Corander, Science, 2009
Building bioinformatics capacity
• When I arrived at Cardiff, I
had the joy of working
out how I was going to do
my research in a new
place
• How to get
scripts/software installed
• Where to install
scripts/software
• And I had to do this
mostly on my own
• Not an unusual story
Key Challenges
• Infrastructure
– Storage
– HTC capacity
• Portability of software
• Portability of datasets
• But, we do have the
University
Supercomputer….
Advanced Research Computing @
Cardiff (ARCCA)
• 2048 core HPC cluster
• Second hand ~868
Westmere core “HTC”
partition
• 8 ‘large memory’ 128GB
nodes
• Lustre file system (scratch,
nominally unlimited, but
50TB total space initially)
• NFS /home mount (50GB
maximum quota)
• Freely available to University
Researchers
…. So the first thing I did was to buy
myself some servers
• Large HPC clusters often don’t
meet our needs
• Bioinformaticians aren’t the
ideal HPC users
– Disruptive software needs
– Disruptive usage patterns
– Disruptive storage needs
• Setting up our “own” system
seems the most intuitive way
to ensure that you have
something that works
Biologists often end up working in silos
• As a discipline we have probably been
taught to think in terms of ‘labs’,
‘groups’ and ‘experiments’ being wet
work
• We build capacity and teams locally,
and those are the resources that we
use every day
• For bioinformaticians this means we
are likely to develop our solutions
locally first, building a local group and
local capacity
• Our software, data set storage, LIMS
etc are usually bespoke
• Because our software/data is locally
stored/setup – it is often less portable
than wet lab methods / approaches
• Bioinformaticians should be working
differently
Key Challenges Overall
• We need systems that allow us to rapidly and easily share
complete systems, from the perspective of both novice users and
experienced developers
• We need to develop systems to share complete datasets, rather
than forcing users to install loads of bits of software to
reconstitute the development environment we used, or forcing
us to become proper software developers
• We need to lower the barrier to access for research scientists
with a limited understanding of UNIX/Computer Science
• We need a system that allows us to train users on systems that
they will then be able to use when they go home
• We need to understand the needs of individual fields
• We need to integrate activities across these needs, to avoid
reinventing the wheel
The cloud
• All of the infrastructure
issues are much easier when
tackled at scale
• This concept led to Amazon,
followed by others, offering
Cloud services
• A cloud infrastructure
provides a mechanism to
share systems/software and
data, at scale
• Let someone else do the
admin etc, and all you have
to worry about is running
the software
Why not use a commercial cloud?
• We often want lots of RAM – Amazon
max flavour size is ~250GB
• Prices are high ($1200/month for
~250GB RAM flavour)
• Storage costs also high – 1TB on Amazon
S3 costs $30/month (our current costs
are £7/month)
• Additionally Amazon isn’t designed to
facilitate sharing of data etc between
different people who have VMs
• There are possible issues around T&Cs,
governance etc
• Even if we overcome these, often these
services are too hard for novice users to
make use of
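The storage figures above can be turned into a rough yearly comparison. This is only a sketch: the per-TB prices and the VM price are the ones quoted on the slide, while the exchange rate and the 100TB project size are placeholder assumptions chosen for illustration.

```python
# Figures quoted on the slide; everything else is an illustrative assumption.
S3_USD_PER_TB_MONTH = 30.0      # "1TB on Amazon S3 costs $30/month"
LOCAL_GBP_PER_TB_MONTH = 7.0    # "our current costs are £7/month"
BIG_VM_USD_PER_MONTH = 1200.0   # "~250GB RAM flavour"

# ASSUMPTION: placeholder exchange rate, not taken from the talk.
USD_PER_GBP = 1.5


def yearly_storage_cost_usd(terabytes: float, usd_per_tb_month: float) -> float:
    """Cost in USD of keeping `terabytes` stored for twelve months."""
    return terabytes * usd_per_tb_month * 12


def compare_storage(terabytes: float) -> dict:
    """Return yearly S3 and local storage costs in USD, plus their ratio."""
    s3 = yearly_storage_cost_usd(terabytes, S3_USD_PER_TB_MONTH)
    local = yearly_storage_cost_usd(
        terabytes, LOCAL_GBP_PER_TB_MONTH * USD_PER_GBP
    )
    return {"s3_usd": s3, "local_usd": local, "ratio": s3 / local}


if __name__ == "__main__":
    result = compare_storage(terabytes=100)  # hypothetical 100TB project
    print(
        f"S3: ${result['s3_usd']:,.0f}/yr, "
        f"local: ${result['local_usd']:,.0f}/yr, "
        f"ratio: {result['ratio']:.1f}x"
    )
    print(f"One ~250GB RAM VM: ${BIG_VM_USD_PER_MONTH * 12:,.0f}/yr")
```

The ratio depends on the assumed exchange rate, so it should be read as an order-of-magnitude guide rather than a quoted price difference.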
Needs
• Core infrastructure now to
take advantage of new
technologies
• Systems to easily share data
• Repositories that can make
tools/methods/data
available rapidly and easily
• Better use of the existing
RCUK/University server
estate
• To change the view of
Biologists about working in
silos
Introducing the CLoud Infrastructure
for Microbial Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for microbial
bioinformatics
• ~£4M of hardware, capable
of supporting >1000
individual virtual servers
• Providing a core, national
cyberinfrastructure for
Microbial Bioinformatics
The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Dr Sam
Sheppard (Swansea) – Joint PIs
• Professor Mark Achtman (Warwick), Professor
Steve Busby FRS (Birmingham), Dr Tom
Connor (Cardiff)*, Professor Tim Walsh
(Cardiff), Dr Robin Howe (Public Health Wales)
– Co-Is
• Dr Nick Loman (Birmingham)* and Dr Chris
Quince (Warwick) ; MRC Research Fellows
* Principal bioinformaticians architecting and designing the system
The CLoud Infrastructure for Microbial
Bioinformatics (climb.ac.uk)
• We are creating a one-stop shop
for Microbial Bioinformatics
– Public/private cloud for use by
UK academics
– Standardised cloud images that
implement key pipelines
– Storage repository for
data/images that are made
available online and within our
system, anywhere (‘eduroam
for microbial genomics’)
• We will provide access to
other databases from within
the system
• As well as providing a place to
support orphan databases and
tools
System Outline
• 4 sites
• Connected over Janet
• Different sizes of VM available; personal, standard, large memory, huge memory
• Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)
• 7-8PB of object storage across 4 sites (~2-3PB usable with erasure coding)
• 400-500TB of local high performance storage per site
• A single system, with common login and between-site data replication
• System has been designed to enable the addition of extra nodes / Universities
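The gap between raw and usable object storage can be illustrated with a little arithmetic. The sketch below assumes the standard erasure-coding overhead formula, k data chunks out of k + m total; the talk does not state which scheme is used, so the k, m and copy-count values are hypothetical and chosen only to show how a figure in the quoted range could arise.

```python
def usable_fraction(k: int, m: int, replicas: int = 1) -> float:
    """Fraction of raw capacity left for data with k data chunks, m coding
    chunks, and `replicas` full copies of each erasure-coded object."""
    if k <= 0 or m < 0 or replicas <= 0:
        raise ValueError("k and replicas must be positive, m non-negative")
    return k / ((k + m) * replicas)


def implied_fraction_bounds(raw_pb, usable_pb):
    """Loosest bounds on usable/raw given (low, high) ranges for each."""
    raw_low, raw_high = raw_pb
    usable_low, usable_high = usable_pb
    return usable_low / raw_high, usable_high / raw_low


# Ranges quoted on the slide: 7-8PB raw, ~2-3PB usable.
low, high = implied_fraction_bounds(raw_pb=(7.0, 8.0), usable_pb=(2.0, 3.0))
print(f"Implied usable fraction: {low:.2f} to {high:.2f}")

# HYPOTHETICAL scheme, not taken from the talk: 4 data + 2 coding chunks,
# with every object also held as a second copy at another site.
example = usable_fraction(k=4, m=2, replicas=2)
print(f"k=4, m=2, 2 copies: {example:.2f} of raw capacity")
```

Any combination of erasure-coding profile and replication whose fraction falls inside the implied bounds would be consistent with the slide; the example is one such combination, not a description of the deployed configuration.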
CLIMB Overview
• 4 sites, running OpenStack
• Hardware procured in a two
stage process
• IBM/OCF provided
compute, Dell/Red Hat
provided storage
• Networks provided by
Brocade
• We are defining a reference
architecture so that other
sites can be added easily
Hardware (per site)
• 2 router/firewalls (capable of
routing >80Gb each)
• 3 Controllers
• 21x 64 vCore, 512GB RAM
nodes
• 3x 192 vCore, 3TB RAM nodes
• ~500TB GPFS (local)
– 4 controllers
– Infiniband, with 10Gb failover
• ~2PB Ceph (shared)
– 27x 64TB nodes/site
– Cross site replication
– 10Gb Backbone
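As a rough check on the per-site hardware list, the node counts can be summed into totals. This is a minimal sketch that uses only the numbers on the slide; treating 3TB as 3 x 1024GB and the node-type names are assumptions made for the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str      # label is illustrative, not from the talk
    count: int
    vcores: int
    ram_gb: int


# Per-site compute nodes as listed on the slide (3TB taken as 3 * 1024 GB).
SITE_NODES = [
    NodeType("standard", count=21, vcores=64, ram_gb=512),
    NodeType("large-memory", count=3, vcores=192, ram_gb=3 * 1024),
]
SITES = 4
CEPH_NODES_PER_SITE = 27
CEPH_TB_PER_NODE = 64


def site_totals(nodes):
    """Sum vCores and RAM (GB) over every compute node at one site."""
    vcores = sum(n.count * n.vcores for n in nodes)
    ram_gb = sum(n.count * n.ram_gb for n in nodes)
    return vcores, ram_gb


vcores, ram_gb = site_totals(SITE_NODES)
ceph_raw_tb = CEPH_NODES_PER_SITE * CEPH_TB_PER_NODE

print(f"Per site: {vcores} vCores, {ram_gb} GB RAM, {ceph_raw_tb} TB raw Ceph")
print(
    f"All {SITES} sites: {vcores * SITES} vCores, "
    f"{ram_gb * SITES} GB RAM, {ceph_raw_tb * SITES} TB raw Ceph"
)
```

On these numbers, 27 nodes of 64TB give 1,728TB of raw Ceph capacity per site, and the four sites together land close to the 7-8PB of object storage quoted in the system outline.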
Overview – 4 sites, (virtually) identical
hardware
Each site is connected to the
others over VPN tunnels.
Sites can be easily added.
System can use free router
software and commodity
hardware, paid-for software
or dedicated router/firewalls
Our intention is for the system
to be presented to users as a
single system, with single login,
via Shibboleth.
We are currently working on that
bit.
A single system makes it easy(er)
to share methods and data!
[Diagram labels: External clouds; External databases]
Flavours
• User configurable, with standard
flavours
• Regular; up to 8 vCPUs, 64GB
RAM
• xlarge; up to 16 vCPUs, 256GB
RAM
• Huge; up to 192 vCPUs, 3TB
RAM
• System also supports a scalable
virtual cluster (large
embarrassingly parallel projects)
– 2+ nodes with 2+ vCPUs, 2-4GB
RAM/vCPU
• Also provides for Long Term
Hosting (for orphan
datasets/tools)
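The standard flavours above can be written down as a small lookup table. The sketch below is illustrative only: the limits come from the slide, but the data structure and the `smallest_fitting_flavour` helper are hypothetical and are not part of CLIMB or OpenStack.

```python
from typing import NamedTuple, Optional


class Flavour(NamedTuple):
    name: str
    max_vcpus: int
    max_ram_gb: int


# Upper limits quoted on the slide, ordered smallest to largest.
FLAVOURS = [
    Flavour("regular", max_vcpus=8, max_ram_gb=64),
    Flavour("xlarge", max_vcpus=16, max_ram_gb=256),
    Flavour("huge", max_vcpus=192, max_ram_gb=3 * 1024),  # 3TB as 3072GB
]


def smallest_fitting_flavour(vcpus: int, ram_gb: int) -> Optional[Flavour]:
    """Return the first (smallest) flavour that covers the request,
    or None if the request exceeds even the largest flavour."""
    for flavour in FLAVOURS:
        if vcpus <= flavour.max_vcpus and ram_gb <= flavour.max_ram_gb:
            return flavour
    return None


if __name__ == "__main__":
    # Hypothetical requests: (vCPUs, RAM in GB)
    for request in [(4, 32), (4, 200), (32, 64), (256, 64)]:
        match = smallest_fitting_flavour(*request)
        print(request, "->", match.name if match else "no single VM fits")
```

A request that exceeds the largest flavour on either dimension returns `None`, which is the kind of workload the scalable virtual cluster is described as serving.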
Access
• Microbial researchers will be
able to access the system
in one of two ways
– Externally, via a federated access
system; login with a .ac.uk user
account in the first instance, later
(hopefully) open to anyone
who uses Shibboleth
– Internally, via user accounts
set up by the consortium for
collaborators
• Researchers will be able to
provision a set number of
VMs
Where are we now?
• Computational hardware was procured by March 2015 (~6 month process)
• Ahead of schedule - system is now online and in use for research
• Adopting two models for access
– Access for registered users to core OpenStack system online now
– “version 1.0” system providing universal access to predefined images starting
with the GVL – Autumn 2015
VMs are already up
Users are already using CLIMB to do
research
Challenges
• Future Planning (CLIMB will run for 5 years,
then what?)
• Cross-Cloud Integration
• Not reinventing the wheel
• Standardising software stacks for .ac.uk clouds
• Being able to embrace new technologies
• Meeting cloud development needs
The Sequencing Iceberg
All of the sequencing platforms available
now make producing large genomics
datasets relatively cheap and easy
However, the major costs and difficulties
do not lie with the generation of data,
they lie with the pre-requisites for storing
and analysing that data
Informatics expertise
Storage availability
Appropriate HTC capacity
These are
interlinked,
and
expensive
Iceberg breaking with the Cloud?
• It is a mechanism for sharing servers
– Clouds remove the need for hardware
maintenance and support
– Storage, compute, networking are most
expensive when bought one by one; building a
large system represents better value for
money
• Sharing servers means you can have
standardised systems, simplifying the process
of installing and maintaining software
– It provides a mechanism for software/data
reuse as well as sharing
– Also makes training easier; you can use the
system you trained on, once you get back
home
CLIMB Next Steps – and future needs
• New images/analytics tools (GVL!)
• Integration of datasets
• Expanding our userbase
• Collaboration with other cloud services
• Integrating with databases
• Integrating with other clouds
• Developing new sites
• Developing the system to meet the
needs of our users
• Developing policy
• Defining and developing security policy
• Developing/setting up federated access
• Possibly looking at capacity to burst out
or accept bursts from other resources
• Developing our training programme
and outreach
Cloud Infrastructure for Microbial
Bioinformatics
• A multi site system to provide a one-stop-bioinformatics-shop, designed
specifically to support Microbial researchers
• For both Bioinformaticians and wet lab scientists
• Combines hardware with training
• Free, simple interface, easy to use
• Common login
• Easy data and method sharing
• Already have multiple users from across UK academia and healthcare
The CLIMB Consortium Are
• Professor Mark Pallen (Warwick) and Dr Sam
Sheppard (Swansea) – Joint PIs
• Professor Mark Achtman (Warwick), Professor
Steve Busby FRS (Birmingham), Dr Tom
Connor (Cardiff)*, Professor Tim Walsh
(Cardiff), Dr Robin Howe (Public Health Wales)
– Co-Is
• Dr Nick Loman (Birmingham)* and Dr Chris
Quince (Warwick) ; MRC Research Fellows
* Principal bioinformaticians architecting and designing the system
CLoud Infrastructure for Microbial
Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for
microbial bioinformatics
• ~£4M of hardware,
capable of supporting
>1000 individual virtual
servers
• Amazon/Google cloud for
Academics

  • 10. Key Challenges Overall • We need systems that allow us to rapidly and easily share complete systems, from the perspective of both novice users and experienced developers • We need to develop systems to share complete datasets, rather than forcing users to install loads of bits of software to reconstitute the development environment we used, or forcing us to become proper software developers • We need to lower the barrier to access for research scientists with a limited understanding of UNIX/Computer Science • We need a system that allows us to train users on systems that they will then be able to use when they go home • We need to understand the needs of individual fields • We need to integrate activities across these needs, to avoid reinventing the wheel
  • 11. The cloud • All of the infrastructure issues are much easier when tackled at scale • This concept led to Amazon, followed by others, offering Cloud services • A cloud infrastructure provides a mechanism to share systems/software and data, at scale • Let someone else do the admin etc, and all you have to worry about is running the software
  • 12. Why not use a commercial cloud? • We often want lots of RAM – Amazon max flavour size is ~250GB • Prices are high ($1200/month for ~250GB RAM flavour) • Storage costs also high – 1TB on Amazon S3 costs $30/month (our current costs are £7/month) • Additionally Amazon isn’t designed to facilitate sharing of data etc between different people who have VMs • There are possible issues around T&C’s, governance etc • Even if we overcome these, often these services are too hard for novice users to make use of
  • 13. Needs • Core infrastructure now to take advantage of new technologies • Systems to easily share data • Repositories that can make tools/methods/data available rapidly and easily • Better use of the existing RCUK/University server estate • To change the view of Biologists about working in silos
  • 14. Introducing the CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • ~£4M of hardware, capable of supporting >1000 individual virtual servers • Providing a core, national cyberinfrastructure for Microbial Bioinformatics
  • 15. The CLIMB Consortium Are • Professor Mark Pallen (Warwick) and Dr Sam Sheppard (Swansea) – Joint PIs • Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is • Dr Nick Loman (Birmingham)* and Dr Chris Quince (Warwick); MRC Research Fellows * Principal bioinformaticians architecting and designing the system
  • 16. The CLoud Infrastructure for Microbial Bioinformatics (climb.ac.uk) • We are creating a one stop shop for Microbial Bioinformatics – Public/private cloud for use by UK academics – Standardised cloud images that implement key pipelines – Storage repository for data/images that are made available online and within our system, anywhere (‘eduroam for microbial genomics’) • We will provide access to other databases from within the system • As well as providing a place to support orphan databases and tools
  • 17. System Outline • 4 sites • Connected over Janet • Different sizes of VM available; personal, standard, large memory, huge memory • Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM) • 7-8PB of object storage across 4 sites (~2-3PB usable with erasure coding) • 400-500TB of local high performance storage per site • A single system, with common log in, and between-site data replication • System has been designed to enable the addition of extra nodes / Universities
  • 18. CLIMB Overview • 4 sites, running OpenStack • Hardware procured in a two stage process • IBM/OCF provided compute, Dell/Red Hat provided storage • Networks provided by Brocade • We are defining a reference architecture to enable other sites to be added trivially
  • 19. Hardware (per site) • 2 router/firewalls (capable of routing >80Gb each) • 3 Controllers • 21x 64 vCore, 512GB RAM nodes • 3x 192 vCore, 3TB RAM nodes • ~500TB GPFS (local) – 4 controllers – Infiniband, with 10Gb failover • ~2PB Ceph (shared) – 27x 64TB nodes/site – Cross site replication – 10Gb Backbone
  • 20. Overview – 4 sites, (virtually) identical hardware • Each site is connected to the others over VPN tunnels. Sites can be easily added. • The system can use free router software and commodity hardware, paid-for software, or dedicated router/firewalls • Our intention is for the system to be presented to users as a single system, with single login, via Shibboleth. We are currently working on that bit. • A single system makes it easy(er) to share methods and data! • Figure labels: External clouds, External databases
  • 21. Flavours • User configurable, with standard flavours • Regular; up to 8 vCPUs, 64GB RAM • xlarge; up to 16 vCPUs, 256GB RAM • Huge; up to 192 vCPUs, 3TB RAM • System also supports a scalable virtual cluster (large embarrassingly parallel projects) – 2+ nodes with 2+ vCPUs, 2-4GB RAM/vCPU • Also provides for Long Term Hosting (for orphan datasets/tools)
  • 22. Access • Microbial researchers will be able to access the system in one of two ways – Externally, via a federated access system, with login via .ac.uk user login in the first instance, later (hopefully) open to anyone who uses Shibboleth – Internally, via user accounts set up by the consortium for collaborators • Researchers will be able to provision a set number of VMs
  • 23. Where are we now? • Computational hardware was procured by March 2015 (~6 month process) • Ahead of schedule - system is now online and in use for research • Adopting two models for access – Access for registered users to core OpenStack system online now – “version 1.0” system providing universal access to predefined images starting with the GVL – Autumn 2015
  • 25. Users are already using CLIMB to do research
  • 26. Challenges • Future Planning (CLIMB will run for 5 years, then what?) • Cross-Cloud Integration • Not reinventing the wheel • Standardising software stacks for .ac.uk clouds • Being able to embrace new technologies • Meeting cloud development needs
  • 27. The Sequencing Iceberg All of the sequencing platforms available now make producing large genomics datasets relatively cheap and easy However, the major costs and difficulties do not lie with the generation of data, they lie with the pre-requisites for storing and analysing that data Informatics expertise Storage availability Appropriate HTC capacity These are interlinked, and expensive
  • 28. Iceberg breaking with the Cloud? • It is a mechanism for sharing servers – Clouds remove the need for hardware maintenance and support – Storage, compute, networking are most expensive when bought one by one; building a large system represents better value for money • Sharing servers means you can have standardised systems, simplifying the process of installing and maintaining software – It provides a mechanism for software/data reuse as well as sharing – Also makes training easier; you can use the system you trained on, once you get back home
  • 29. CLIMB Next Steps – and future needs • New images/analytics tools (GVL!) • Integration of datasets • Expanding our userbase • Collaboration with other cloud services • Integrating with databases • Integrating with other clouds • Developing new sites • Developing the system to meet the needs of our users • Developing policy • Defining and developing security policy • Developing/setting up federated access • Possibly looking at capacity to burst out or accept bursts from other resources • Developing our training programme and outreach
  • 30. Cloud Infrastructure for Microbial Bioinformatics • A multi site system to provide a one-stop-bioinformatics-shop, designed specifically to support Microbial researchers • For both Bioinformaticians and wet lab scientists • Combines hardware with training • Free, simple interface, easy to use • Common login • Easy data and method sharing • Already have multiple users from across UK academia and healthcare
  • 31. The CLIMB Consortium Are • Professor Mark Pallen (Warwick) and Dr Sam Sheppard (Swansea) – Joint PIs • Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is • Dr Nick Loman (Birmingham)* and Dr Chris Quince (Warwick); MRC Research Fellows * Principal bioinformaticians architecting and designing the system
  • 32. CLoud Infrastructure for Microbial Bioinformatics (CLIMB) • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics • ~£4M of hardware, capable of supporting >1000 individual virtual servers • Amazon/Google cloud for Academics
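
The per-site hardware figures on slide 19 can be cross-checked against the system-wide totals on slide 17 with a little arithmetic. The sketch below is only an illustration and is not part of the deck: the function and constant names are hypothetical, it assumes all 4 sites are identical (the deck says "(virtually) identical"), and it assumes 3TB of RAM means 3 × 1024GB.

```python
# Figures quoted on the "Hardware (per site)" slide.
SITES = 4                      # "4 sites"
STANDARD_NODES = 21            # "21x 64 vCore, 512GB RAM nodes"
STANDARD_VCORES = 64
STANDARD_RAM_GB = 512
LARGE_NODES = 3                # "3x 192 vCore, 3TB RAM nodes"
LARGE_VCORES = 192
LARGE_RAM_GB = 3 * 1024        # assumption: 1TB counted as 1024GB
CEPH_NODES = 27                # "27x 64TB nodes/site"
CEPH_NODE_TB = 64


def per_site_capacity():
    """Return vCores, RAM (GB) and raw Ceph storage (TB) for one site."""
    vcores = STANDARD_NODES * STANDARD_VCORES + LARGE_NODES * LARGE_VCORES
    ram_gb = STANDARD_NODES * STANDARD_RAM_GB + LARGE_NODES * LARGE_RAM_GB
    ceph_raw_tb = CEPH_NODES * CEPH_NODE_TB
    return {"vcores": vcores, "ram_gb": ram_gb, "ceph_raw_tb": ceph_raw_tb}


def total_capacity(sites=SITES):
    """Scale the per-site figures to the whole system (identical sites assumed)."""
    return {name: value * sites for name, value in per_site_capacity().items()}


if __name__ == "__main__":
    print("per site:", per_site_capacity())
    print("all sites:", total_capacity())
```

Under these assumptions each site has 1920 vCores and 1728TB of raw Ceph storage, so the four sites together hold 7680 vCores and 6912TB (about 6.9PB) of raw object storage. That is close to the lower end of the "7-8PB" quoted on slide 17 and leaves several vCores per VM at the stated ">1,000 VMs" with 1:1 allocation.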