3. The Sanger Institute
Funded by the Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based on Hinxton Genome Campus,
Cambridge, UK.
Large scale genomic research.
• We have active cancer, malaria,
pathogen and genomic variation / human
health studies.
• 1k genomes, & 10k-UK Genomes,
Cancer genome projects.
All data is made publicly
available.
• Websites, ftp, direct database access,
programmatic APIs.
4. Economic Trends:
The cost of sequencing halves every 12
months.
• cf Moore's Law
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and
10,000s of genomes.
Changes in sequencing technology are
going to continue this trend.
• “Next-next” generation sequencers are on their way.
• $500 genome is probable within 5 years.
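The "$500 genome within 5 years" estimate can be sanity-checked against the 12-month halving time. This is a minimal sketch that assumes the halving continues at a constant rate from the $10,000 figure quoted above; the function names are illustrative only.

```python
import math


def projected_cost(cost_now: float, years: float, halving_time_years: float = 1.0) -> float:
    """Cost after `years` if the cost halves every `halving_time_years`."""
    return cost_now * 0.5 ** (years / halving_time_years)


def years_to_reach(cost_now: float, target: float, halving_time_years: float = 1.0) -> float:
    """Years until the cost falls to `target` at a constant halving rate."""
    return halving_time_years * math.log2(cost_now / target)


if __name__ == "__main__":
    # Assumption: $10,000 per genome today, halving every 12 months.
    for year in range(6):
        print(f"year {year}: ${projected_cost(10_000, year):,.2f}")
    print(f"years to reach $500: {years_to_reach(10_000, 500):.1f}")
```

Under these assumptions the cost passes $500 a little after year 4, which is consistent with the 5-year estimate.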
5. The scary graph
[Figure: graph annotated with "Instrument upgrades" and "Peak yearly capillary sequencing".]
6. Managing Growth
We have exponential growth in
storage and compute.
• Storage / compute doubles every 12 months.
• 2009 ~7 PB raw
Moore's law will not save us.
• Transistor/disk density: Td=18 months
• Sequencing cost: Td=12 months
[Figure: Disk storage in terabytes (0 to 6000) by year, 1994 to 2009.]
My Job:
• Running the team who do the IT
systems heavy-lifting to make it all work.
• Tech evaluations.
• Systems architecture.
• Day-to-day administration.
• All in conjunction with informaticians,
programmers & investigators who are
doing the science.
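The two doubling times on this slide (Td = 12 months for sequencing, Td = 18 months for transistor/disk density) can be compared directly. This is a minimal sketch that assumes both trends stay exponential; `shortfall` is an illustrative name for the factor by which demand outgrows what hardware improvements alone provide.

```python
def growth_factor(years: float, doubling_time_months: float) -> float:
    """Growth multiple after `years` for a given doubling time."""
    return 2 ** (years * 12 / doubling_time_months)


def shortfall(years: float, demand_td_months: float = 12, supply_td_months: float = 18) -> float:
    """How much faster demand grows than hardware density over `years`."""
    return growth_factor(years, demand_td_months) / growth_factor(years, supply_td_months)


if __name__ == "__main__":
    for years in (3, 6, 9):
        print(f"after {years} years the gap is {shortfall(years):.0f}x")
```

With these doubling times the gap itself doubles every 3 years, which is why Moore's law alone does not close it.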
8. What is cloud?
Technical view:
• On demand, virtual machines.
• Root access, total ownership.
• Pay-as-you-go model.
Non-technical view:
• “Free” compute we can use to solve all of the hard problems thrown up by
new sequencing.
• (8cents/hour is almost free, right...?)
• Web 2.0 / Friendface use it, so it must be good.
13. Ensembl
Ensembl is a system for genome annotation.
Data visualisation (Web Presence)
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute Pipeline (HPC Workload)
• Take a raw genome and run it through a compute pipeline to find genes
and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate
genomes.
• Software is Open Source (Apache license).
• Data is free for download.
We have done cloud experiments with both the web site
and pipeline.
16. Web Presence
Ensembl has a worldwide audience.
Historically, web site performance was not great, especially
for non-European institutes.
• Pages were quite heavyweight.
• Not properly cached etc.
Web team spent a lot of time re-designing the code to
make it more streamlined.
• Greatly improved performance.
Coding can only get you so far.
• “A canna' change the laws of physics.”
• 150-240 ms round trip time from Europe to the US.
• We need a set of geographically dispersed mirrors.
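The effect of round trip time on a page load can be estimated with a lower bound: every request that has to wait for a previous one costs at least one round trip. This is a rough sketch; the count of 10 sequential round trips and the 20 ms RTT to a nearby mirror are assumptions for illustration, not measurements from the Ensembl site.

```python
def min_load_time_ms(rtt_ms: float, sequential_round_trips: int) -> float:
    """Lower bound on page load time from network latency alone."""
    return rtt_ms * sequential_round_trips


if __name__ == "__main__":
    ROUND_TRIPS = 10  # assumption: requests that cannot be overlapped
    for label, rtt in (("Europe to US, best", 150), ("Europe to US, worst", 240),
                       ("nearby mirror (assumed)", 20)):
        print(f"{label}: at least {min_load_time_ms(rtt, ROUND_TRIPS):.0f} ms")
```

No amount of server-side tuning removes this term, which is the argument for mirrors close to the users.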
17. uswest.ensembl.org
Traditional mirror: Real machines in a co-lo facility in
California.
Hardware was initially configured on site.
• 16 servers, SAN storage, SAN switches, SAN management appliance,
Ethernet switches, firewall, out-of-band management etc.
Shipped to the co-lo for installation.
• Sent a person to California for 3 weeks.
• Spent 1 week getting stuff into/out of customs.
• ****ing FCC paperwork!
Additional infrastructure work.
• VPN between UK and US.
Incredibly time consuming.
• Really don't want to end up having to send someone on a plane to the US
to fix things.
18. Usage
US-West currently takes ~1/3rd of total Ensembl web traffic.
• Much lower latency and improved site usability.
20. useast.ensembl.org
We want an east coast US mirror to complement our west
coast mirror.
Built the mirror in AWS.
• Initially a proof of concept / test-bed.
• Production-level in due course.
Gives us operational experience.
• We can compare to a “real” colo.
21. Building a mirror on AWS
Some software development / sysadmin work needed.
• Preparation of OS images, software stack configuration.
• West-coast was built as an extension of Sanger internal network via VPN.
• AWS images built as standalone systems.
Web code changes
• Significant code changes required to make the webcode “mirror aware”.
• Search, site login etc.
• We chose not to set up VPN into AWS.
• Work already done for the first mirror.
Significant amount of tuning required.
• Initial MySQL performance was pretty bad, especially for the large Ensembl
databases (~1 TB).
• Lots of people are doing Apache/MySQL on AWS, so there is a good amount of
best-practice advice available.
23. Is it better than the co-lo?
No physical hardware.
• Work can start as soon as we enter our credit card numbers...
• No US customs, Fedex etc.
Much simpler management infrastructure.
• AWS give you out of band management “for free”.
• Much simpler to deal with hardware problems.
• And we do remote-management all the time.
“Free” hardware upgrades.
• As faster machines become available we can take advantage of them
immediately.
• No need to get tin decommissioned / re-installed at the co-lo.
24. Is it cost effective?
Lots of misleading cost statements made about cloud.
• “Our analysis only cost $500.”
• CPU is only “$0.085 / hr”.
What are we comparing against?
• Doing the analysis once? Continually?
• Buying a $2000 server?
• Leasing a $2000 server for 3 years?
• Using $150 of time at your local supercomputing facility?
• Buying a $2000 server but having to build a $1M datacentre to put it
in?
Requires the dreaded Total Cost of Ownership (TCO)
calculation.
• hardware + power + cooling + facilities + admin/developers etc
• Incredibly hard to do.
25. Let's do it anyway...
Comparing costs to the co-lo is simpler.
• power, cooling costs are all included.
• Admin costs are the same, so we can ignore them.
• Same people responsible for both.
Cost for Co-location facility:
• $120,000 hardware + $51,000 /yr colo.
• $91,000 per year (3 years hardware lifetime).
Cost for AWS :
• $77,000 per year (estimated based on US-east traffic / IOPs)
Result: Estimated 16% cost saving.
• It is not free!
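The comparison above can be reproduced from the quoted figures. This is a minimal sketch using straight-line amortisation of the hardware over its 3-year lifetime; the function names are illustrative.

```python
def colo_annual_cost(hardware: float, lifetime_years: float, hosting_per_year: float) -> float:
    """Yearly co-lo cost: amortised hardware plus hosting fees."""
    return hardware / lifetime_years + hosting_per_year


def saving_fraction(baseline: float, alternative: float) -> float:
    """Fraction saved by choosing `alternative` over `baseline`."""
    return (baseline - alternative) / baseline


colo = colo_annual_cost(120_000, 3, 51_000)  # figures from the slide
aws = 77_000                                 # slide's estimate for AWS
saving = saving_fraction(colo, aws)

if __name__ == "__main__":
    print(f"co-lo: ${colo:,.0f}/yr, AWS: ${aws:,.0f}/yr, saving: {saving:.1%}")
```

With the rounded figures shown on the slide this comes out at roughly 15%, close to the quoted 16%; the small difference presumably reflects rounding in the published inputs.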
26. Additional Benefits
Website + code is packaged together.
• Can be conveniently given away to end users in a “ready-to-run” config.
• Simplifies configuration for other users wanting to run Ensembl sites.
• Configuring an Ensembl site is non-trivial for non-informaticians.
• CVS, MySQL setup, Apache configuration etc.
Ensembl data is already available as an Amazon public
dataset.
• Makes a complete system.
27. Unknowns
What about scale-up?
Current installation is a minimal config.
• Single web / database nodes.
• Main site and us-west use multiple load balanced servers.
AWS load-balancing architecture is different from what we
currently use.
• In theory there should be no problems...
• ...but we don't know until we try.
• Do we go for automatic scale-out?
28. Downsides
Underestimated the time it would take to make the web code
mirror-ready.
• Not a cloud specific problem, but something to be aware of when you take
big applications and move them outside your home institution.
Packaging OS images, code and data needs to be done for
every Ensembl release.
• Ensembl team now has a dedicated person responsible for the cloud.
• Somebody has to look after the systems.
Management overhead does not necessarily go down.
• But it does change.
29. Going forward
useast.ensembl.org to go into production later this year.
• Far-east Amazon availability zone is also of interest.
• Likely to be next, assuming useast works.
“Virtual” Co-location concept will be useful for a number of
other projects.
• Other Sanger websites?
Disaster recovery.
• Eg replicate critical databases / storage into AWS.
• Currently all of Sanger data lives in a single datacentre.
• We have a small amount of co-lo space for mirroring critical data.
• The same arguments apply as for the uswest mirror.
35. Gene Finding
[Figure: DNA, shown with:]
• HMM prediction.
• Alignment with fragments recovered in vivo.
• Alignment with known proteins.
• Alignment with other genes and other species.
37. Compute Pipeline
Architecture:
• OO Perl pipeline manager.
• Core algorithms are C.
• 200 auxiliary binaries.
Workflow:
• Investigator describes analysis at high level.
• Pipeline manager splits the analysis into parallel chunks.
• Typically 50k-100k jobs.
• Sorts out the dependencies and then submits jobs to a DRM.
• Typically LSF or SGE.
• Pipeline state and results are stored in a MySQL database.
Workflow is embarrassingly parallel.
• Integer, not floating point.
• 64 bit memory address is nice, but not required.
• 64 bit file access is required.
• Single threaded jobs.
• Very IO intensive.
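The workflow described above (split the analysis into chunks, sort out dependencies, submit in order) can be sketched as follows. This is not the Ensembl pipeline manager, which is written in Perl; it is an illustrative Python sketch using the standard library `graphlib` module (Python 3.9+). The job names `align_N` and `merge` are hypothetical, and the "submission" is only an ordered list rather than a call to LSF or SGE.

```python
from graphlib import TopologicalSorter


def split_into_chunks(items, chunk_size):
    """Split a list of work items into independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def build_jobs(sequences, chunk_size):
    """Create one hypothetical alignment job per chunk plus a final merge job.

    Returns the chunks and a mapping: job name -> set of jobs it waits for.
    """
    chunks = split_into_chunks(sequences, chunk_size)
    deps = {f"align_{n}": set() for n in range(len(chunks))}
    deps["merge"] = set(deps)  # merge waits for every alignment job
    return chunks, deps


def submission_order(deps):
    """Order jobs so that each one comes after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    chunks, deps = build_jobs(list(range(10)), chunk_size=4)
    for job in submission_order(deps):
        print("would submit to the DRM:", job)
```

Because the alignment jobs have no dependencies on each other, a scheduler is free to run them all at once; only the merge step has to wait.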
38. Running the pipeline in practice
Requires a significant amount of domain knowledge.
Software install is complicated.
• Lots of Perl modules and dependencies.
Need a well tuned compute cluster.
• Pipeline takes ~500 CPU days for a moderate genome.
• Ensembl chewed up 160k CPU days last year.
• Code is IO bound in a number of places.
• Typically need a high performance filesystem.
• Lustre, GPFS, Isilon, Ibrix etc.
• Need a large MySQL database.
• 100 GB to TB-sized MySQL instances, very high query load generated from the
cluster.
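The CPU-day figures above translate into cluster sizes with simple arithmetic. This is a back-of-the-envelope sketch that assumes perfect parallel scaling, which the IO-bound parts of the pipeline will not achieve in practice; the 100-core example is an assumption for illustration.

```python
def average_busy_cores(cpu_days_per_year: float, days_per_year: float = 365) -> float:
    """Cores that must be busy around the clock to deliver the yearly total."""
    return cpu_days_per_year / days_per_year


def wall_clock_days(cpu_days: float, cores: int) -> float:
    """Elapsed days for a job of `cpu_days`, assuming ideal scaling."""
    return cpu_days / cores


if __name__ == "__main__":
    print(f"160k CPU days/year is about {average_busy_cores(160_000):.0f} cores kept busy all year")
    print(f"a 500 CPU-day genome on 100 cores takes about {wall_clock_days(500, 100):.0f} days")
```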
39. Why Cloud?
Proof of concept
• Is HPC even possible in cloud infrastructures?
Coping with the big increase in data
• Will we be able to provision new machines/datacentre space to keep up?
• What happens if we need to “out-source” our compute?
• Can we be in a position to shift peaks of demand to cloud facilities?
40. Expanding markets
There are going to be lots of new genomes that need
annotating.
• Sequencers moving into small labs, clinical settings.
• Limited informatics / systems experience.
• Typically postdocs/PhD who have a “real” job to do.
• They may want to run the genebuild pipeline on their data, but they may
not have the expertise to do so.
We have already done all the hard work on installing the
software and tuning it.
• Can we package up the pipeline, put it in the cloud?
Goal: End user should simply be able to upload their data,
insert their credit-card number, and press “GO”.
41. Porting HPC code to the cloud
Let's build a compute cluster in the cloud.
Software stack / machine image.
• Creating images with software is reasonably straightforward.
• No big surprises.
Queuing system
• Pipeline requires a queuing system (LSF/SGE).
• Licensing problems.
• Getting them to run took a lot of fiddling.
• Machines need to find each other once they are inside the cloud.
• Building an automated “self discovering” cluster takes some hacking.
• Hopefully others can re-use it.
MySQL databases
• Lots of best practice on how to do that on EC2.
It took time, even for experienced systems people.
• (You will not be firing your system-administrators just yet!).
42. Did it work? NO!
“High performance computing is not facebook.”
-- Chris Dagdigian
The big problem is data:
• Moving data into the cloud is hard.
• Doing stuff with data once it is in the cloud is also hard.
If you look closely, most successful cloud projects have
small amounts of data (10-100 Mbytes).
Genomics projects have Tbytes → Pbytes of data.
43. Moving data is hard
Commonly used tools (FTP,ssh/rsync) are not suited to
wide-area networks.
• Need to use specialised WAN tools: gridFTP/FDT/Aspera.
There is a lot of broken internet.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1TB to Dublin.
• 23 hours to move 1 TB to East coast.
What speed should we get?
• Once we leave JANET (UK academic network) finding out what the
connectivity is and what we should expect is almost impossible.
• Finding out who to talk to when you diagnose a troublesome link is also
almost impossible.
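The transfer times quoted above follow directly from the measured rates, and the same numbers show how little of the 2 Gbit/s site link is actually being used. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes).

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def link_utilisation(rate_bytes_per_s: float, link_bits_per_s: float) -> float:
    """Fraction of the link capacity used by a transfer."""
    return rate_bytes_per_s * 8 / link_bits_per_s


if __name__ == "__main__":
    SITE_LINK = 2 * 10**9  # 2 Gbit/s site link
    for dest, rate in (("EC2 Dublin", 25 * MB), ("EC2 East coast", 12 * MB)):
        print(f"{dest}: {transfer_hours(TB, rate):.0f} h per TB, "
              f"{link_utilisation(rate, SITE_LINK):.1%} of the site link")
```

Even the better of the two routes uses only about a tenth of the site link, so the bottleneck is somewhere on the wide-area path rather than at the campus edge.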
44. Networking
“But the physicists do this
all the time.”
• No they don't.
• LHC Grid; Dedicated networking
between CERN and the T1
centres who get all of the data.
Can we use this model?
• We have relatively short lived and
fluid collaborations. (1-2 years,
many institutions).
• As more labs get sequencers, our
potential collaborators also
increase.
• We need good connectivity to
everywhere.
45. Using data within the cloud
Compute nodes need to have fast access to the data.
• We solve this with exotic and temperamental filesystems/storage.
No viable global filesystems on EC2.
• NFS has poor scaling at the best of times.
• EC2 has poor inter-node networking. With more than 8 NFS clients, everything stops.
Nasty-hacks:
• Subcloud; commercial product that allows you to run a POSIX filesystem
on top of S3.
• Interesting performance, and you are paying by the hour...
46. Compute architecture
[Figure: two compute architectures compared.]
• Batch scheduler: fat network; CPUs; POSIX global filesystem; data-store.
• hadoop/S3: thin network; CPUs, each with local storage; data-store.
47. Why not S3 / hadoop / map-reduce?
Not POSIX.
• Lots of code expects files on a filesystem.
• Limitations; cannot store objects > 5GB.
• Throw away file formats?
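One consequence of the object size limit is that large files have to be cut into pieces before they can be stored, and reassembled before code that expects a single file can read them. This is a minimal sketch of the bookkeeping involved; the 5 GB figure is the limit quoted on this slide, and the 200 GB file size is only an example.

```python
GB = 10**9  # bytes, decimal units assumed


def split_ranges(total_bytes: int, max_part_bytes: int):
    """Return (offset, length) pairs that cover a file in parts of bounded size."""
    if max_part_bytes <= 0:
        raise ValueError("max_part_bytes must be positive")
    ranges = []
    offset = 0
    while offset < total_bytes:
        length = min(max_part_bytes, total_bytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    parts = split_ranges(200 * GB, 5 * GB)
    print(f"a 200 GB file needs {len(parts)} objects of at most 5 GB")
```

Every application that reads the file then needs to know about this layout, which is part of why "not POSIX" is a real cost rather than a detail.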
Nobody wants to re-write existing applications.
• They already work on our compute farm.
• How do hadoop apps co-exist with non-hadoop ones?
• Do we have to have two different types of infrastructure and move data
between them?
• Barrier for entry seems much lower for file-systems.
Am I being a reactionary old fart?
• 15 years ago clusters of PCs were not “real” supercomputers.
• ...then beowulf took over the world.
• Big difference: porting applications between the two architectures was
easy.
• MPI/PVM etc.
Will the market provide “traditional” compute clusters in
the cloud?
49. HPC app summary
You cannot take an existing data-rich HPC app and expect
it to work.
• IO architectures are too different.
There is some re-factoring going on for the ensembl
pipeline to make it EC2 friendly.
• Currently on a case-by-case basis.
• For the less-data intensive parts.
Waiting for the market to deliver...
51. Past Collaborations
[Figure: several sequencing centres sending data to one sequencing centre that also acts as the DCC.]
52. Genomics Data
Data size per genome, from structured data (databases) to unstructured data (flat files):
• Individual features (3 MB).
• Variation data (1 GB).
• Alignments (200 GB).
• Sequence + quality data (500 GB).
• Intensities / raw data (2 TB).
Audience: clinical researchers and non-informaticians at the structured end;
sequencing informatics specialists at the unstructured end.
53. The Problem With Current Archives
Data in current archives is
“dark”.
• You can put/get data, but cannot
compute across it.
Data is all in one place.
• Problematic if you are not the DCC:
• You have to pull the data down to do
something with it.
• Holding data in one place is bad for
disaster-recovery and network access.
Is data in an inaccessible
archive really useful?
54. A real example...
“We want to run our pipeline across 100TB of data
currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run
the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
• Download:
• 46 days at 25 Mbytes/s (best transatlantic link).
• 10 days at 1 Gbit/s (sling a cable across the datacentre to EBI).
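The two download estimates can be checked by converting both rates to bytes per second. This is a minimal sketch assuming decimal units and a link running at its full nominal rate with no protocol overhead.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def transfer_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 86_400


if __name__ == "__main__":
    gbit_link = 10**9 / 8  # 1 Gbit/s expressed in bytes per second
    print(f"100 TB at 25 Mbytes/s: {transfer_days(100 * TB, 25 * MB):.0f} days")
    print(f"100 TB at 1 Gbit/s:    {transfer_days(100 * TB, gbit_link):.1f} days")
```

At full line rate the 1 Gbit/s link needs a little over 9 days, so the 10 days quoted above leaves a small allowance for overhead.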
55. Easy to solve problem in PowerPoint:
Put data into a cloud.
• Big cloud providers already have replicated storage infrastructures.
Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.
[Figure: two blocks of data, each with CPUs attached, and a VM placed on them.]
56. Practical Hurdles
How do you expose the data?
• Flat files? Database?
How do you make the compute efficient?
• Cloud IO problems still there.
• And you make the end user pay for them.
How do we deal with controlled access?
• Hard problem. Grid / delegated security mechanisms are complicated for
a reason.
57. Whose Cloud?
Most of us are funded to hold data, not to fund everyone
else's compute costs too.
• Now need to budget for raw compute power as well as disk.
• Implement virtualisation infrastructure, billing etc.
• Are you legally allowed to charge?
• Who underwrites it if nobody actually uses your service?
Strongly implies data has to be held on a commercial
provider.
58. Can it solve our networking problems?
Moving data across the internet is hard.
• Fixing the internet is not going to be cost effective for us.
Fixing the internet may be cost effective for big cloud
providers.
• Core to their business model.
• All we need to do is get data into Amazon, and then everyone else can get
the data from there.
Do we invest in fast links to Amazon?
• It changes the business dynamic.
• We have effectively tied ourselves to a single provider.
60. Summary
Cloud works well for web services.
Data rich HPC workloads are still hard.
Cloud based data archives look really interesting.
61. Acknowledgements
Phil Butcher
ISG Team
• James Beal
• Gen-Tao Chiang
• Pete Clapham
• Simon Kelley
Ensembl
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
• Glenn Proctor
• Stephen Keenan
Cancer Genome Project
• Adam Butler
• John Teague