Clouds: All fluff and no substance?
Wellcome Trust Sanger Institute
About the Sanger Institute.
Experience with cloud to date.
The Sanger Institute
Funded by the Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based on the Wellcome Trust Genome Campus in Hinxton, near Cambridge, UK.
Large scale genomic research.
• We have active cancer, malaria, pathogen and genomic variation / human genetics programmes.
• 1000 Genomes, UK10K and Cancer Genome projects.
All data is made publicly available.
• Websites, ftp, direct database access.
The cost of sequencing halves every 12 months.
• cf. Moore's Law.
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• Large centres are now doing studies with 1000s and
10,000s of genomes.
Changes in sequencing technology are
going to continue this trend.
• “Next-next” generation sequencers are on their way.
• $500 genome is probable within 5 years.
The scary graph
[Figure: yearly sequencing output vs. disk storage, 1994-2009, with peak yearly capillary output shown for scale.]
We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.
Moore's law will not save us.
• Transistor/disk density: Td = 18 months.
• Sequencing cost: Td = 12 months (a back-of-envelope comparison follows below).
My Job:
• Running the team who do the IT systems heavy-lifting to make it all work.
• Tech evaluations.
• Systems architecture.
• Day-to-day administration.
• All in conjunction with informaticians, programmers & investigators who are doing the science.
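To make "Moore's law will not save us" concrete, here is a back-of-envelope sketch in Python of the two doubling times quoted above (the 6-year window is an arbitrary illustration):

    # Growth factor over a period, given a doubling time Td in months.
    def growth(years, td_months):
        return 2 ** (years * 12 / td_months)

    print(growth(6, 18))  # transistor/disk density: ~16x over 6 years
    print(growth(6, 12))  # sequencing output per dollar: ~64x over 6 years
    # Sequencing pulls ahead of Moore's law by ~4x every 6 years.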
Cloud: Where are we at?
What is cloud?
• On-demand virtual machines.
• Root access, total ownership.
• Pay-as-you-go model.
• “Free” compute we can use to solve all of the hard problems thrown up by next-generation sequencing.
• (8 cents/hour is almost free, right...?)
• Web 2.0 / Friendface use it, so it must be good.
Out of the trough of disillusionment
We currently have three areas of activity:
• Web presence
• HPC workload
• Data Warehousing
Ensembl is a system for genome annotation.
Data visualisation (Web Presence)
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute Pipeline (HPC Workload)
• Take a raw genome and run it through a compute pipeline to find genes
and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
• Software is Open Source (Apache license).
• Data is free for download.
We have done cloud experiments with both the web site and the compute pipeline.
Ensembl has a worldwide audience.
Historically, web site performance was not great, especially for non-European institutes.
• Pages were quite heavyweight.
• Not properly cached etc.
Web team spent a lot of time re-designing the code to
make it more streamlined.
• Greatly improved performance.
Coding can only get you so far.
• “A canna' change the laws of physics.”
• 150-240 ms round-trip time from Europe to the US.
• We need a set of geographically dispersed mirrors (a quick illustration follows).
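Why round-trip time dominates, with illustrative numbers only (the round-trip count is an assumption, not a measurement of the Ensembl site):

    # Illustrative: a page needing ~25 sequential round trips (TCP setup,
    # HTML, then dependent assets) from Europe to a US-hosted site.
    rtt = 0.2          # seconds; mid-range of the 150-240 ms above
    round_trips = 25   # assumed for illustration
    print(rtt * round_trips)  # ~5 s of pure network latency per page view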
Traditional mirror: Real machines in a co-lo facility in California.
Hardware was initially configured on site.
• 16 servers, SAN storage, SAN switches, SAN management appliance,
Ethernet switches, firewall, out-of-band management etc.
Shipped to the co-lo for installation.
• Sent a person to California for 3 weeks.
• Spent 1 week getting stuff into/out of customs.
• ****ing FCC paperwork!
Additional infrastructure work.
• VPN between UK and US.
Incredibly time consuming.
• Really don't want to end up having to send someone on a plane to the US
to fix things.
US-West currently takes ~1/3rd of total Ensembl web traffic.
• Much lower latency and improved site usability.
What has this got to do with cloud?
We want an east coast US mirror to complement our west coast one.
Built the mirror in AWS.
• Initially a proof of concept / test-bed.
• Production-level in due course.
Gives us operational experience.
• We can compare to a “real” colo.
Building a mirror on AWS
Some software development / sysadmin work needed.
• Preparation of OS images, software stack configuration.
• West-coast was built as an extension of Sanger internal network via VPN.
• AWS images built as standalone systems.
Web code changes
• Significant code changes required to make the webcode “mirror aware”.
• Search, site login etc.
• We chose not to set up VPN into AWS.
• Work already done for the first mirror.
Significant amount of tuning required.
• Initial mysql performance was pretty bad, especially for the large Ensembl databases.
• Lots of people doing Apache/mysql on AWS, so there is a good amount of
best-practice etc available.
Does it work?
Is it better than the co-lo?
No physical hardware.
• Work can start as soon as we enter our credit card numbers...
• No US customs, Fedex etc.
Much simpler management infrastructure.
• AWS give you out of band management “for free”.
• Much simpler to deal with hardware problems.
• And we do remote-management all the time.
“Free” hardware upgrades.
• As faster machines become available we can take advantage of them.
• No need to get tin decommissioned /re-installed at Co-lo.
Is it cost effective?
Lots of misleading cost statements made about cloud.
• “Our analysis only cost $500.”
• CPU is only “$0.085 / hr”.
What are we comparing against?
• Doing the analysis once? Continually?
• Buying a $2000 server?
• Leasing a $2000 server for 3 years?
• Using $150 of time at your local supercomputing facility?
• Buying $2000 of server but having to build a $1M datacentre to put it in?
Requires the dreaded Total Cost of Ownership (TCO)
• hardware + power + cooling + facilities + admin/developers etc
• Incredibly hard to do.
Let's do it anyway...
Comparing costs to the co-lo is simpler.
• power, cooling costs are all included.
• Admin costs are the same, so we can ignore them.
• Same people responsible for both.
Cost for Co-location facility:
• $120,000 hardware + $51,000 /yr colo.
• $91,000 per year (3 years hardware lifetime).
Cost for AWS :
• $77,000 per year (estimated based on US-east traffic / IOPs)
Result: Estimated 16% cost saving (worked through below).
• It is not free!
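The comparison above, worked through with the same numbers as quoted:

    # Co-lo: hardware amortised over its 3-year lifetime, plus yearly fees.
    colo_per_year = 120_000 / 3 + 51_000   # = $91,000 / yr
    aws_per_year = 77_000                  # estimate from US-east traffic / IOPs
    saving = 1 - aws_per_year / colo_per_year
    print(f"{saving:.1%}")                 # ~15.4%, i.e. the ~16% quoted above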
Website + code is packaged together.
• Can be conveniently given away to end users in a “ready-to-run” config.
• Simplifies configuration for other users wanting to run Ensembl sites.
• Configuring an Ensembl site is non-trivial for non-informaticians.
• CVS, mysql setup, apache configuration etc.
Ensembl data is already available as an Amazon public dataset.
• Makes a complete system.
What about scale-up?
Current installation is a minimal config.
• Single web / database nodes.
• Main site and us-east use multiple load balanced servers.
AWS load-balancing architecture is different from what we use in-house.
• In theory there should be no problems...
• ...but we don't know until we try.
• Do we go for automatic scale-out?
Underestimated the time it would take to make the web-code portable.
• Not a cloud specific problem, but something to be aware of when you take
big applications and move them outside your home institution.
Packaging OS images, code and data needs to be done for every Ensembl release.
• Ensembl team now has a dedicated person responsible for the cloud.
• Somebody has to look after the systems.
Management overhead does not necessarily go down.
• But it does change.
useast.ensembl.org to go into production later this year.
• The Far-east Amazon region is also of interest.
• Likely to be next, assuming useast works.
The “virtual” co-location concept will be useful for a number of other projects.
• Other Sanger websites?
• Eg replicate critical databases / storage into AWS.
• Currently all of Sanger data lives in a single datacentre.
• We have a small amount of co-lo space for mirroring critical data.
• The same arguments apply as for the uswest mirror.
HPC element of Ensembl.
• Takes raw genomes and performs automated annotation on them, finding genes and other features of interest.
• OO perl pipeline manager.
• Core algorithms are C.
• 200 auxiliary binaries.
• Investigator describes analysis at high level.
• Pipeline manager splits the analysis into parallel chunks.
• Typically 50k-100k jobs.
• Sorts out the dependencies and then submits jobs to a DRM (sketched below).
• Typically LSF or SGE.
• Pipeline state and results are stored in a mysql database.
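A minimal sketch of the split-and-submit pattern, assuming LSF's bsub and a hypothetical worker script (the real manager is OO perl and also handles dependencies and state):

    import subprocess

    # Hypothetical illustration: submit one LSF job array covering all chunks.
    # run_chunk.sh would read $LSB_JOBINDEX to pick its slice of the analysis.
    n_chunks = 50_000
    subprocess.run(
        ["bsub",
         "-J", f"genebuild[1-{n_chunks}]",  # job array, one element per chunk
         "-o", "logs/chunk.%I.out",         # %I expands to the array index
         "./run_chunk.sh"],
        check=True)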
Workflow is embarrassingly parallel.
• Integer, not floating point.
• 64 bit memory address is nice, but not required.
• 64 bit file access is required.
• Single threaded jobs.
• Very IO intensive.
Running the pipeline in practice
Requires a significant amount of domain knowledge.
Software install is complicated.
• Lots of perl modules and dependencies.
Need a well tuned compute cluster.
• Pipeline takes ~500 CPU days for a moderate genome.
• Ensembl chewed up 160k CPU days last year.
• Code is IO bound in a number of places.
• Typically need a high performance filesystem.
• Lustre, GPFS, Isilon, Ibrix etc.
• Need a large mysql database.
• 100 GB-TB mysql instances, very high query load generated from the compute farm.
Proof of concept
• Is HPC even possible in cloud infrastructures?
Coping with the big increase in data
• Will we be able to provision new machines/datacentre space to keep up?
• What happens if we need to “out-source” our compute?
• Can we be in a position to shift peaks of demand to cloud facilities?
There are going to be lots of new genomes that need annotating.
• Sequencers moving into small labs, clinical settings.
• Limited informatics / systems experience.
• Typically postdocs/PhD who have a “real” job to do.
• They may want to run the genebuild pipeline on their data, but they may
not have the expertise to do so.
We have already done all the hard work on installing the
software and tuning it.
• Can we package up the pipeline, put it in the cloud?
Goal: End user should simply be able to upload their data,
insert their credit-card number, and press “GO”.
Porting HPC code to the cloud
Let's build a compute cluster in the cloud.
Software stack / machine image.
• Creating images with software is reasonably straightforward.
• No big surprises.
• Pipeline requires a queueing system (LSF/SGE).
• Licensing problems.
• Getting them to run took a lot of fiddling.
• Machines need to find each other once they are inside the cloud.
• Building an automated “self discovering” cluster takes some hacking (a sketch follows below).
• Hopefully others can re-use it.
• Lots of best practice on how to do that on EC2.
It took time, even for experienced systems people.
• (You will not be firing your system-administrators just yet!).
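For illustration, the “self discovering” idea with today's tooling: ask EC2 for the running instances carrying a shared tag and build a host list from them (boto3 and the tag name are assumptions that post-date this work):

    import boto3

    # Sketch: find cluster peers by tag, e.g. to populate an LSF/SGE host file.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:cluster", "Values": ["ensembl-pipeline"]},  # assumed tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    peers = [inst["PrivateIpAddress"]
             for res in resp["Reservations"] for inst in res["Instances"]]
    print("\n".join(peers))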
Did it work? NO!
“High performance computing is not facebook.”
-- Chris Dagdigian
The big problem: data.
• Moving data into the cloud is hard.
• Doing stuff with data once it is in the cloud is also hard.
If you look closely, most successful cloud projects have
small amounts of data (10-100 Mbytes).
Genomics projects have Tbytes → Pbytes of data.
Moving data is hard
Commonly used tools (FTP, ssh/rsync) are not suited to high-bandwidth WAN transfers.
• Need to use specialised WAN tools: gridFTP/FDT/Aspera.
There is a lot of broken internet.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link); the sums are worked through below.
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1TB to Dublin.
• 23 hours to move 1 TB to East coast.
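The times follow directly from the measured rates:

    # Hours to move 1 TB (decimal units) at the measured rates.
    for dest, mbytes_per_s in [("EC2 East coast", 12), ("EC2 Dublin", 25)]:
        hours = 1e12 / (mbytes_per_s * 1e6) / 3600
        print(f"{dest}: {hours:.0f} h")  # ~23 h East coast, ~11 h Dublin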
What speed should we get?
• Once we leave JANET (UK academic network) finding out what the
connectivity is and what we should expect is almost impossible.
• Finding out who to talk to when you diagnose a troublesome link is also very hard.
“But the physicists do this all the time.”
• No they don't.
• LHC Grid: dedicated networking between CERN and the T1 centres, who get all of the data.
Can we use this model?
• We have relatively short-lived and fluid collaborations (1-2 years).
• As more labs get sequencers, our potential collaborators also multiply.
• We need good connectivity to everywhere.
Using data within the cloud
Compute nodes need to have fast access to the data.
• We solve this with exotic and temperamental filesystems/storage.
No viable global filesystems on EC2.
• NFS has poor scaling at the best of times.
• EC2 has poor inter-node networking. > 8 NFS clients, everything stops.
• Subcloud: a commercial product that allows you to run a POSIX filesystem
on top of S3.
• Interesting performance, and you are paying by the hour...
[Diagram: two IO architectures. Left: CPUs on a fat network sharing a POSIX global filesystem, fed by a batch scheduler. Right: hadoop/S3 style, CPUs with local storage on a thin network.]
Why not S3 / hadoop / map-reduce?
• Lots of code expects files on a filesystem (a staging sketch follows this list).
• Limitations: cannot store objects > 5 GB.
• Throw away file formats?
Nobody wants to re-write existing applications.
• They already work on our compute farm.
• How do hadoop apps co-exist with non-hadoop ones?
• Do we have to have two different types of infrastructure and move data between them?
• Barrier for entry seems much lower for file-systems.
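What “code expects files” means in practice: every job first stages its input out of the object store (a sketch; bucket, key and tool names are hypothetical):

    import subprocess
    import boto3

    # Hypothetical: copy an S3 object to local disk so a POSIX-only tool can run.
    s3 = boto3.client("s3")
    s3.download_file("genomics-inputs", "runs/lane1.bam", "/scratch/lane1.bam")
    subprocess.run(["./legacy_tool", "/scratch/lane1.bam"], check=True)
    # The copy cost is paid per job; at TB scale it dominates the runtime.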
Am I being a reactionary old fart?
• 15 years ago clusters of PCs were not “real” supercomputers.
• ...then beowulf took over the world.
• Big difference: porting applications between the two architectures was easy.
• MPI/PVM etc.
Will the market provide “traditional” compute clusters in the cloud?
HPC app summary
You cannot take an existing data-rich HPC app and expect
it to work.
• IO architectures are too different.
There is some re-factoring going on for the ensembl
pipeline to make it EC2 friendly.
• Currently on a case-by-case basis.
• For the less-data intensive parts.
Waiting for the market to deliver...
Shared data archives
[Diagram: data flows from sequencing centres through a sequencing centre + DCC out to clinical researchers. Data size per genome, from flat files to structured databases:]
• Intensities / raw data: 2 TB (flat files)
• Sequence + quality data: 500 GB
• Alignments: 200 GB
• Variation data: 1 GB
• Individual features (structured data / databases)
The Problem With Current Archives
Data in current archives is passive.
• You can put/get data, but cannot
compute across it.
Data is all in one place.
• Problematic if you are not the DCC:
• You have to pull the data down to do something with it.
• Holding data in one place is bad for
disaster-recovery and network access.
Is data in an inaccessible
archive really useful?
A real example...
“We want to run our pipeline across 100TB of data
currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run the pipeline locally.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
• 46 days at 25 Mbytes/s (best transatlantic link).
• 10 days at 1 Gbit/s (sling a cable across the datacentre to EBI).
An easy problem to solve in the cloud?
Put data into a cloud.
• Big cloud providers already have replicated storage infrastructures.
Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.
How do you expose the data?
• Flat files? Database?
How do you make the compute efficient?
• Cloud IO problems still there.
• And you make the end user pay for them.
How do we deal with controlled access?
• Hard problem. Grid / delegated security mechanisms are complicated for end users.
Most of us are funded to hold data, not to fund everyone else's compute costs too.
• Now need to budget for raw compute power as well as disk.
• Implement virtualisation infrastructure, billing etc.
• Are you legally allowed to charge?
• Who underwrites it if nobody actually uses your service?
Strongly implies data has to be held on a commercial provider.
Can it solve our networking problems?
Moving data across the internet is hard.
• Fixing the internet is not going to be cost effective for us.
Fixing the internet may be cost effective for big cloud
• Core to their business model.
• All we need to do is get data into Amazon, and then everyone else can get
the data from there.
Do we invest in fast links to Amazon?
• It changes the business dynamic.
• We have effectively tied ourselves to a single provider.
Where are we?
Cloud works well for webservices.
Data rich HPC workloads are still hard.
Cloud based data archives look really interesting.
Acknowledgements
• James Beal
• Gen-Tao Chiang
• Pete Clapham
• Simon Kelley
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
• Glenn Proctor
• Stephen Keenan
Cancer Genome Project
• Adam Butler
• John Teague