CLIMB System Introduction Talk - CLIMB Launch


Talk outlining the CLoud Infrastructure for Microbial Bioinformatics (CLIMB) system given at the CLIMB Launch in July 2016. CLIMB is a UK national e-infrastructure providing Microbial Bioinformatics as a Service.

  1. Introducing the CLoud Infrastructure For Microbial Bioinformatics System
     CLIMB Launch, July 2016
     Dr Thomas R Connor, Senior Lecturer, Cardiff University School of Biosciences
     @tomrconnor ; connortr@cardiff.ac.uk
     http://www.climb.ac.uk
  2. The CLIMB Consortium Are
     • Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) – Joint PIs
     • Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
     • Dr Nick Loman (Birmingham site lead)*, Dr Chris Quince (Warwick) and Dr Daniel Falush (Bath) – MRC Research Fellows
     • Simon Thompson (Birmingham, Project Technical/OpenStack lead), Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt Bull (Cardiff sysadmin), Radoslaw Poplawski (Birmingham sysadmin), Andy Smith (Birmingham software development)
     • Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick (Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr Christine Kitchen and Professor Martyn Guest (Cardiff HPC team), Matt Ismail (Warwick HPC lead)
     * Principal bioinformaticians architecting and designing the system
  3. Over the next few days you will hear mostly from academic staff involved in the project. But what we have achieved to date would not be possible without the technical team
     • Simon Thompson (Birmingham), Marius Bakke (Warwick)
     • Radoslaw Poplawski (Birmingham sysadmin), Andy Smith (Birmingham software development), Dr Matt Bull (Cardiff sysadmin)
     • Matt Ismail (Warwick HPC lead), Kevin Munn and Dr Ian Merrick (Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr Christine Kitchen and Professor Martyn Guest (Cardiff HPC team), Simon Thompson (Swansea HPC team)
  4. Introducing the CLoud Infrastructure for Microbial Bioinformatics (CLIMB)
     • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics
     • ~£4M of hardware, capable of supporting >1,000 individual virtual servers
     • Providing a core, national cyberinfrastructure for Microbial Bioinformatics
  5. The Sequencing Iceberg
     There are many biological analysis platforms available now that make producing large, rich, complex datasets relatively cheap and easy. However, the major costs and difficulties do not lie with the generation of data; they lie with how we share, store and analyse the data we generate:
     • Informatics expertise
     • User accessibility of software/hardware
     • Appropriate compute capacity
     • Software development
     • Storage availability
     • Network capacity
  6. The rise and rise of biological shadow IT
     • Everything we do is underpinned by having access to compute and storage capacity
     • Joined Cardiff from the Sanger in 2012
     • When I arrived at Cardiff, I had the joy of working out how I was going to do my research in a new place
     • How to get scripts/software installed
     • Where to install scripts/software
     • And I had to do this mostly on my own
     • So I built my own system
     • Not an unusual story
  7. Shadow IT is compounded by the fact that biologists work in silos
     • As a discipline we think in terms of ‘labs’, ‘groups’ and ‘experiments’ being wet work
     • IT infrastructure is treated the same way
     • This means we develop bespoke, local solutions to informatics problems
     • Because our software/data is locally stored/set up, it is often less portable than wet lab methods/approaches
  8. Illustrating the size of the challenge
     [figure: three waves of spread annotated with date ranges from 1937 to 2009 – Mutreja, Kim, Thomson, Connor et al, Nature, 2011]
  9. Illustrating the size of the challenge
     320 samples; approx. 600–700GB of uncompressed data
     • Sequence assembly: each job 4–8GB RAM, 1 CPU core; generates intermediate files of ~6GB; runtime 1+ hours/job
     • Sequence mapping: 320 jobs; each job 4GB RAM, 1 CPU core; generates intermediate files of ~3GB; runtime 1 hour/job
     • Phylogenomics: 1 job, 1+ cores, up to 128GB RAM; intermediate file size ~2+GB; output file ~2GB; runtime 1–2 days
     • Virulence and antimicrobial resistance screening: 320 jobs, single core, 100MB RAM; runtime 5 mins/job; generates 10–20 small files per job
     • Bayesian modelling: 3 jobs, 1+ cores, up to 1GB RAM; CPU intensive; runtime 2 days per job; output file ~10GB; can use GPUs; written in Java
     These span HTC, HPC and larger-memory workload types.
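A back-of-envelope tally of the figures above makes the point concrete. This is only a sketch: the assembly job count (one per sample) and the per-file size for the AMR screening output are assumptions, not numbers from the slide.

```python
# Back-of-envelope totals for the 320-sample workflow described above.
# Per-job figures come from the slide; job counts for assembly and the
# AMR screening file size (~0.02GB) are assumptions for illustration.
stages = {
    # stage: (jobs, peak_ram_gb_per_job, intermediate_gb_per_job)
    "assembly": (320, 8, 6),
    "mapping": (320, 4, 3),
    "phylogenomics": (1, 128, 2),
    "amr_screening": (320, 0.1, 0.02),
    "bayesian_modelling": (3, 1, 10),
}

peak_ram_gb = max(ram for _, ram, _ in stages.values())
intermediate_gb = sum(jobs * inter for jobs, _, inter in stages.values())

print(f"largest single-job RAM need: {peak_ram_gb}GB")
print(f"total intermediate data: ~{intermediate_gb:.0f}GB")
```

Even this single, modest study mixes hundreds of small high-throughput jobs with one 128GB-RAM job, which is exactly the spread of requirements a fixed local workstation struggles with.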
  10. At the other end of the scale
     The key point is that, as microbiologists, we are likely to need a wide variety of systems, for a wide range of workloads
     This does not normally fit well with “standard” local systems, as our workloads can be disruptive or impossible to run
     It is also hard to reproduce this across different systems
  11. Basic Premise
     • Wouldn’t it be great if there were a single system that microbiologists could use to analyse and share data
     • Data is more easily shared when one uses a common environment
     • Software is more easily shared on a common platform
     • A common platform could also make the hardware required for complex analyses available to all, easily
  12. Thought process behind the project
     • Custom designed, properly engineered, institution-wide systems can work brilliantly for enabling data and software sharing
     • Works brilliantly at the Sanger
     • BUT how many other places have a critical mass of microbiologists to justify the expense of having such a system?
     • The answer is relatively few, so we thought a shared system open to all was the logical solution
  13. How to achieve this – Virtualisation
     • Virtualisation is a way of running multiple, distinct computers (could be desktops, workstations, servers) on one physical piece of hardware
     • Not a new concept; it is a mainstay of enterprise computing
     • A way of sharing resources
     • A way for businesses to cut costs by consolidating servers
     • A way for businesses to increase reliability; these physical pieces of hardware can be networked and the VMs they run can be moved around as required
     • Also provides a way for businesses to easily deploy and maintain software
     • Virtualisation answers a lot of the questions that are posed in bioinformatics around reproducibility
  14. So wouldn’t it be great if….
     A user wants a server; the system spins up a VM (a self-contained ‘module’ containing operating system and software) and slots it onto one of the servers
     Other users from other institutes also want systems; the system can load those up too, sharing a large server between users
     Because these VMs are on a common system, they are then sharable between users
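The sharing idea on this slide can be sketched as a toy first-fit scheduler. This is purely illustrative: real cloud placement (e.g. OpenStack's scheduler) weighs CPU, disk, network and affinity as well as RAM.

```python
# Toy first-fit placement: slot VM requests (RAM in GB) onto a pool of
# identical physical servers, opening a new server only when none of the
# existing ones has room. A sketch of the idea, not a real scheduler.
def place_vms(server_ram_gb, requests):
    """Return (placement, servers_used) for a list of VM RAM requests."""
    free = []        # remaining RAM on each server already in use
    placement = []   # index of the server each request landed on
    for ram in requests:
        for i, spare in enumerate(free):
            if spare >= ram:
                free[i] -= ram
                placement.append(i)
                break
        else:
            # no existing server has room: bring another into the pool
            free.append(server_ram_gb - ram)
            placement.append(len(free) - 1)
    return placement, len(free)

# Four 60GB VMs from different users fit comfortably on one 512GB server.
placement, used = place_vms(512, [60, 60, 60, 60])
print(placement, used)  # [0, 0, 0, 0] 1
```

The payoff is exactly the slide's point: many users' independent machines can coexist on one large physical box, and because they all live on a common system, their images can be shared.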
  15. Virtualisation and the cloud
     • Virtualisation is the central premise behind systems like AWS
     • Underpins services from Google to Netflix
     • Ultimately enables service providers to share out commodity servers as required
     • Means they can sell off access to small slices of servers, and make lots of money
     • The idea behind CLIMB was to build on this concept to provide a similar service for the UK Medical Microbiology community, without the profit, and designed to meet the needs of microbiologists
  16. Why not use a commercial cloud?
     • Bioinformatics workloads often require lots of RAM, a good number of CPUs and lots of storage
     • Commercial cloud providers are not targeting this market, so prices are very high
     • An Amazon storage-optimised VM with 244GB RAM and 32 vCPU cores costs ~$3,000 per month
     • Some configurations are simply not available
     • Storage costs are also high – 1TB on Amazon S3 costs $30/month (our current costs are £3/month)
     • In future these solutions might be suitable, but at the moment they are not cost effective and don’t really meet the needs of researchers
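To put the storage figures above side by side, a quick currency-adjusted comparison (the exchange rate is an assumption for illustration, not a number from the talk):

```python
# Rough comparison of the 2016 storage prices quoted above.
s3_usd_per_tb_month = 30.0      # Amazon S3, ~$30/TB/month
climb_gbp_per_tb_month = 3.0    # CLIMB internal cost, ~£3/TB/month
usd_per_gbp = 1.30              # assumed mid-2016 rate, for illustration

climb_usd = climb_gbp_per_tb_month * usd_per_gbp
ratio = s3_usd_per_tb_month / climb_usd
print(f"S3 is roughly {ratio:.1f}x the CLIMB storage cost per TB-month")
```

Even with generous assumptions about the exchange rate, commercial object storage was the better part of an order of magnitude more expensive per terabyte-month at the time.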
  17. So – what we said in 2014: CLIMB Aims
     • Create a public/private cloud for use by UK academics
     • Create a set of standardised cloud images that implement key pipelines
     • Create a storage repository for data that are made available online and within our system, anywhere (‘eduroam for microbial genomics’)
     • Provide access to other databases from within the system
  18. 2014: Expected specifications
     • 4 sites
     • Connected over Janet
     • Different sizes of VM available: personal, standard, large memory, huge memory
     • Able to support 1,000 VMs simultaneously
     • 4PB of object storage across 4 sites (~2–3PB usable with erasure coding)
     • 300TB of local high performance storage per site
  19. Where we are now
     • CLIMB aims to become a one-stop shop for microbial bioinformatics
     • Public/private cloud for use by those with a .ac.uk, .nhs.uk or .gov.uk email account
     • Standardised cloud images that implement key pipelines
     • Storage repository for data/images that are made available online and within our system, anywhere (‘eduroam for microbial genomics’)
     • We will provide access to other databases from within the system
     • As well as providing a place to support orphan databases and tools
     • Today we will be introducing you to the first set of VMs on the system, and how to gain access
     • It has been a lot of work, but hopefully it will be worth it
  20. Actual System Outline
     • 4 sites
     • Connected over Janet
     • Different sizes of VM available: personal, standard, large memory, huge memory
     • Able to support >1,000 VMs simultaneously (1:1 vCPUs/vRAM : CPUs/RAM)
     • ~7PB of object storage across 4 sites (~2–3PB usable)
     • 400–500TB of local high performance storage per site
     • A single system, with common log in, and between-site data replication*
     • System has been designed to enable the addition of extra nodes/Universities
  21. CLIMB Overview
     • 4 sites, running OpenStack
     • Hardware procured in a two-stage process
     • IBM/OCF provided compute; Dell/Red Hat provided storage
     • Networks provided by Brocade
     • Now have a fairly clear reference architecture that would enable other nodes to be added
  22. What the £3.6M bought
     • 8 router/firewalls (capable of routing >80Gb each)
     • 12 OpenStack controller nodes
     • 84x 64 vCore, 512GB RAM nodes
     • 12x 192 vCore, 3TB RAM nodes
     • ~2PB GPFS (locally distributed)
     • 16 GPFS controllers
     • Infiniband, with 10Gb failover
     • ~7PB Ceph (shared)
     • 108x 64TB nodes
  23. Performance
     [chart: relative performance (0–3 scale) for beast, blastn, gunzip, muscle, nhmmer, phyml, prokka, snippy, velvetg, velveth and their geometric mean]
     • Generally extremely good
     • Performance is quite consistent across workloads
     • Compares well to both HPC systems and other cloud systems
  24. What this means for you
     • CLIMB doesn’t (really) provide small machines; you have those in your office, or you can buy them from Amazon
     • Our “personal” servers start at 4 CPU cores and ~15GB of RAM
     • Our “group” servers are 8 CPU cores and ~60GB of RAM
     • Other flavour sizes available on request
     • Means you get free, immediate access to dedicated hardware to do your analysis
     • For comparison, a workstation with similar spec to a “personal” server retails at ~£1k; a server with similar spec to a “group” server retails at ~£3k
  25. Group quotas
     • Each group gets a default allocation
     • RAM: 64GB × 10
     • Instances: 10
     • Volumes: 20
     • Total disk: 10TB
     • Total cores: 128
     • Up to you to manage the allocation
     • Allocation is for the whole team
     • It can be increased; requests for increases will be considered by the management group
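Since the allocation is managed by the group itself, a quick way to think about it is as a set of limits that current usage plus any new launch must stay within. A minimal sketch (the dictionary keys here are illustrative, not OpenStack's actual quota names):

```python
# Sketch of a quota check against the default group allocation above.
# Keys are illustrative labels, not OpenStack quota field names.
DEFAULT_QUOTA = {
    "ram_gb": 64 * 10,   # RAM: 64GB x 10
    "instances": 10,
    "volumes": 20,
    "disk_tb": 10,
    "cores": 128,
}

def fits(usage, request, quota=DEFAULT_QUOTA):
    """True if current usage plus a new request stays within every limit."""
    return all(usage.get(k, 0) + request.get(k, 0) <= limit
               for k, limit in quota.items())

usage = {"ram_gb": 480, "instances": 8, "volumes": 5, "disk_tb": 4, "cores": 96}
print(fits(usage, {"ram_gb": 60, "instances": 1, "cores": 8}))   # True
print(fits(usage, {"ram_gb": 240, "instances": 1, "cores": 8}))  # False
```

In the second case the launch is refused because it would push the group past its 640GB RAM ceiling, even though instances and cores still have headroom; whichever limit is hit first is the one that bites.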
  26. System Access
     • Following testing from a number of groups, we have developed an initial service offering
     • Adopting two models for access
     • Access for registered users to the core OpenStack system via the Horizon dashboard
     • Access via our own launcher called Bryn (Welsh for “hill”)
  27. User access
     • All users are members of a project
     • Project owners are PIs
     • Means a PI has to register ahead of the group
     • The PI then chooses people to invite
     • Might seem a pain, but gives clear lines of responsibility, and enables us to better track impact of the system
     • Also creates a mechanism for overseas collaborators to access the system, and to ensure that they are contributing to UK research outputs
  28. System Status
     • Warwick is online and you will be mostly using this today
     • Birmingham is online and fully available, but due to a server fire (not CLIMB) will be taken down this weekend for some datacentre maintenance
     • Cardiff is online and available, but hasn’t been fully stress tested yet
     • Swansea is awaiting final configuration and integration into Bryn
  29. Other parts of the project
     • We have a forum
     • For anyone with an account, and for anonymous posting
     • The forum also has links to tutorials
     • CLIMB also has a Twitter account – please tweet us your successes using CLIMB
     • The Google group is dead; use the forum now
     • Now that the system is mostly up, we will be looking to deliver training events. Look out for these on our website and Twitter
  30. About Today and the Future
     • Today we are introducing the system
     • We want you to use the system and we have put time into setting up tools/images already
     • But, while CLIMB has/had money for hardware, there is no real money for software development
     • Means that either we need RCUK funds for developing tools and resources, or we need you to share your software on our system
     • It is a complex system, so there might be a few teething problems (this is the first time we will have had so many users hammering the system at the same time), but it will provide a resource that we expect will be of huge value in future
     • In the next couple of months there will be a CLIMB paper coming out; please cite this when you use the system
  31. The CLIMB Consortium Are
     • Professor Mark Pallen (Warwick) and Professor Sam Sheppard (Bath) – Joint PIs
     • Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff site lead)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales) – Co-Is
     • Dr Nick Loman (Birmingham site lead)*, Dr Chris Quince (Warwick) and Dr Daniel Falush (Bath) – MRC Research Fellows
     • Simon Thompson (Birmingham, Project Technical/OpenStack lead), Marius Bakke (Warwick, Systems administrator/Ceph lead), Dr Matt Bull (Cardiff sysadmin), Radoslaw Poplawski (Birmingham sysadmin)
     • Simon Thompson (Swansea HPC team), Kevin Munn and Ian Merrick (Cardiff Biosciences Research Computing team), Wayne Lawrence, Dr Christine Kitchen and Professor Martyn Guest (Cardiff HPC team), Matt Ismail (Warwick HPC lead)
     * Principal bioinformaticians who architected and designed the system
  32. The CLoud Infrastructure for Microbial Bioinformatics (CLIMB)
     • MRC funded project to develop Cloud Infrastructure for microbial bioinformatics
     • ~£4M of hardware, capable of supporting >1,000 individual virtual servers
     • Providing a core, national cyberinfrastructure for Microbial Bioinformatics
  33. OpenStack terminology cheat sheet
     • Instances – each instance is a virtual server. The instance can have an external IP address, and is accessed either via ssh or through the web
     • Volumes – a private disk, running on a large storage system. Volumes can be treated like disk drives, and can be attached to and detached from running instances
     • Snapshots – effectively a digital photo of everything in a volume at the point of snapshotting. Can be used as a basis to create a new volume containing the original data
     • Project/tenant – all VMs are part of a project/tenant; this is the level at which quotas apply
  34. Parts of OpenStack / CLIMB
     • Bryn – our interface for registering and spinning up VMs
     • Horizon – the OpenStack control panel (not recommended for anyone but power users)
     • Keystone – the OpenStack identity service
     • Nova – the OpenStack compute service
     • S3 – Amazon’s storage API
     • Cinder – the OpenStack block storage service
     • Glance – the OpenStack image service
     • Neutron – the OpenStack network service