Coates bosc2010 clouds-fluff-and-no-substance


Published in: Technology

  1. 1. Clouds: All fluff and no substance? Guy Coates Wellcome Trust Sanger Institute
  2. 2. Outline About the Sanger Institute. Experience with cloud to date. Future Directions.
  3. 3. The Sanger Institute Funded by Wellcome Trust. • 2nd largest research charity in the world. • ~700 employees. • Based on Hinxton Genome Campus, Cambridge, UK. Large scale genomic research. • We have active cancer, malaria, pathogen and genomic variation / human health studies. • 1k genomes, & 10k-UK Genomes, Cancer genome projects. All data is made publicly available. • Websites, ftp, direct database access, programmatic APIs.
  4. 4. Economic Trends: The cost of sequencing halves every 12 months. • cf Moore's Law The Human genome project: • 13 years. • 23 labs. • $500 Million. A Human genome today: • 3 days. • 1 machine. • $10,000. • Large centres are now doing studies with 1000s and 10,000s of genomes. Changes in sequencing technology are going to continue this trend. • “Next-next” generation sequencers are on their way. • $500 genome is probable within 5 years.
  5. 5. The scary graph Instrument upgrades Peak Yearly capillary sequencing
  6. 6. Managing Growth We have exponential growth in storage and compute. • Storage / compute doubles every 12 months. • 2009 ~7 PB raw. Moore's law will not save us. • Transistor/disk density: Td=18 months. • Sequencing cost: Td=12 months. My Job: • Running the team who do the IT systems heavy-lifting to make it all work. • Tech evaluations. • Systems architecture. • Day-to-day administration. • All in conjunction with informaticians, programmers & investigators who are doing the science. [Chart: Disk Storage, Terabytes (0 to 6000) by Year (1994 to 2009)]
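
The gap between the two doubling times on this slide compounds quickly. The following is a minimal sketch of that arithmetic, assuming pure exponential growth; the function name and the 1/3/5 year horizons are illustrative choices, not something taken from the talk.

```python
def growth_factor(years: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `years` for a given doubling time in months."""
    return 2.0 ** (years * 12.0 / doubling_time_months)


# Doubling times quoted on the slide.
SEQUENCING_TD_MONTHS = 12.0  # sequencing output, and so storage/compute demand
DENSITY_TD_MONTHS = 18.0     # transistor / disk density

for years in (1, 3, 5):
    demand = growth_factor(years, SEQUENCING_TD_MONTHS)
    density = growth_factor(years, DENSITY_TD_MONTHS)
    # The shortfall is how much more hardware (not just denser hardware) is needed.
    print(f"{years} yr: demand x{demand:.1f}, density x{density:.1f}, "
          f"shortfall x{demand / density:.1f}")
```

Under these assumptions demand grows eightfold in three years while density grows fourfold, so the same budget and floor space would cover roughly half of what is needed.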
  7. 7. Cloud: Where are we at?
  8. 8. What is cloud? Technical view: • On demand, virtual machines. • Root access, total ownership. • Pay-as-you-go model. Non-technical view: • “Free” compute we can use to solve all of the hard problems thrown up by new sequencing. • (8cents/hour is almost free, right...?) • Web 2.0 / Friendface use it, so it must be good.
  9. 9. Hype Cycle Awesome! Just works...
  10. 10. Out of the trough of disillusionment...
  11. 11. Victory!
  12. 12. Cloud Use-Cases We currently have three areas of activity: • Web presence • HPC workload • Data Warehousing
  13. 13. Ensembl Ensembl is a system for genome annotation. Data visualisation (Web Presence) • Provides web / programmatic interfaces to genomic data. • 10k visitors / 126k page views per day. Compute Pipeline (HPC Workload) • Take a raw genome and run it through a compute pipeline to find genes and other features of interest. • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes. • Software is Open Source (Apache license). • Data is free for download. We have done cloud experiments with both the web site and pipeline.
  14. 14. Web presence
  15. 15. Web Presence Ensembl has a worldwide audience. Historically, web site performance was not great, especially for non-European institutes. • Pages were quite heavyweight. • Not properly cached etc. Web team spent a lot of time re-designing the code to make it more streamlined. • Greatly improved performance. Coding can only get you so far. • “A canna' change the laws of physics.” • 150-240ms round trip time from Europe to the US. • We need a set of geographically dispersed mirrors.
  16. 16. Traditional mirror: Real machines in a co-lo facility in California. Hardware was initially configured on site. • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc. Shipped to the co-lo for installation. • Sent a person to California for 3 weeks. • Spent 1 week getting stuff into/out of customs. • ****ing FCC paperwork! Additional infrastructure work. • VPN between UK and US. Incredibly time consuming. • Really don't want to end up having to send someone on a plane to the US to fix things.
  17. 17. Usage US-West currently takes ~1/3rd of total Ensembl web traffic. • Much lower latency and improved site usability.
  18. 18. What has this got to do with clouds?
  19. 19. We want an east coast US mirror to complement our west coast mirror. Built the mirror in AWS. • Initially a proof of concept / test-bed. • Production-level in due course. Gives us operational experience. • We can compare to a “real” colo.
  20. 20. Building a mirror on AWS Some software development / sysadmin work needed. • Preparation of OS images, software stack configuration. • West-coast was built as an extension of Sanger internal network via VPN. • AWS images built as standalone systems. Web code changes • Significant code changes required to make the webcode “mirror aware”. • Search, site login etc. • We chose not to set up VPN into AWS. • Work already done for the first mirror. Significant amount of tuning required. • Initial mysql performance was pretty bad, especially for the large ensembl databases. (~1TB). • Lots of people doing Apache/mysql on AWS, so there is a good amount of best-practice etc available.
  21. 21. Does it work? BETA!
  22. 22. Is it better than the co-lo? No physical hardware. • Work can start as soon as we enter our credit card numbers... • No US customs, Fedex etc. Much simpler management infrastructure. • AWS give you out of band management “for free”. • Much simpler to deal with hardware problems. • And we do remote-management all the time. “Free” hardware upgrades. • As faster machines become available we can take advantage of them immediately. • No need to get tin decommissioned / re-installed at co-lo.
  23. 23. Is it cost effective? Lots of misleading cost statements made about cloud. • “Our analysis only cost $500.” • CPU is only “$0.085 / hr”. What are we comparing against? • Doing the analysis once? Continually? • Buying a $2000 server? • Leasing a $2000 server for 3 years? • Using $150 of time at your local supercomputing facility? • Buying a $2000 server but having to build a $1M datacentre to put it in? Requires the dreaded Total Cost of Ownership (TCO) calculation. • hardware + power + cooling + facilities + admin/developers etc • Incredibly hard to do.
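
To make the "what are we comparing against?" question concrete, here is a rough sketch of the simplest possible comparison: how long one instance can run at the quoted hourly rate before it has cost as much as buying the $2000 server. It assumes one cloud instance is equivalent to one server and deliberately ignores power, cooling, facilities and staff, which is exactly the part the slide says is hard. The function and variable names are illustrative.

```python
HOURLY_RATE_USD = 0.085    # per instance-hour, from the slide
SERVER_PRICE_USD = 2000.0  # purchase price, from the slide
HOURS_PER_YEAR = 24 * 365


def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Instance-hours after which renting has cost as much as buying."""
    return purchase_price / hourly_rate


hours = break_even_hours(SERVER_PRICE_USD, HOURLY_RATE_USD)
three_year_rental = HOURLY_RATE_USD * HOURS_PER_YEAR * 3

print(f"break-even after {hours:,.0f} hours ({hours / HOURS_PER_YEAR:.1f} years)")
print(f"3 years of continuous use: ${three_year_rental:,.0f}")
```

On hardware price alone, continuous use breaks even a little before the three-year mark, so the answer depends almost entirely on utilisation and on the costs left out of this sketch.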
  24. 24. Let's do it anyway... Comparing costs to the co-lo is simpler. • power, cooling costs are all included. • Admin costs are the same, so we can ignore them. • Same people responsible for both. Cost for Co-location facility: • $120,000 hardware + $51,000 /yr colo. • $91,000 per year (3 years hardware lifetime). Cost for AWS : • $77,000 per year (estimated based on US-east traffic / IOPs) Result: Estimated 16% cost saving. • It is not free!
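
The co-lo figure above can be reproduced with straight-line amortisation of the hardware over its three-year lifetime. The sketch below assumes that method (the slide does not say how the hardware was written down) and uses the rounded figures as printed; the function name is illustrative.

```python
def annual_cost(capex: float, lifetime_years: float, opex_per_year: float) -> float:
    """Yearly cost with hardware spread evenly over its lifetime."""
    return capex / lifetime_years + opex_per_year


colo_per_year = annual_cost(capex=120_000, lifetime_years=3, opex_per_year=51_000)
aws_per_year = 77_000  # slide's estimate from US-east traffic / IOPs

saving = (colo_per_year - aws_per_year) / colo_per_year

print(f"co-lo: ${colo_per_year:,.0f}/yr, AWS: ${aws_per_year:,.0f}/yr")
print(f"saving: {saving:.1%}")
```

With these rounded inputs the saving comes out at about 15%, close to the slide's estimated 16%; the small difference presumably comes from rounding in the published figures.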
  25. 25. Additional Benefits Website + code is packaged together. • Can be conveniently given away to end users in a “ready-to-run” config. • Simplifies configuration for other users wanting to run Ensembl sites. • Configuring an ensembl site is non-trivial for non-informaticians. • Cvs, mysql setup, apache configuration etc. Ensembl data is already available as an Amazon public dataset. • Makes a complete system.
  26. 26. Unknowns What about scale-up? Current installation is a minimal config. • Single web / database nodes. • Main site and us-west use multiple load balanced servers. AWS load-balancing architecture is different from what we currently use. • In theory there should be no problems... • ...but we don't know until we try. • Do we go for automatic scale-out?
  27. 27. Downsides Underestimated the time it would take to make the web code mirror-ready. • Not a cloud specific problem, but something to be aware of when you take big applications and move them outside your home institution. Packaging OS images, code and data needs to be done for every ensembl release. • Ensembl team now has a dedicated person responsible for the cloud. • Somebody has to look after the systems. Management overhead does not necessarily go down. • But it does change.
  28. 28. Going forward Aim to go into production later this year. • Far-east Amazon availability zone is also of interest. • Likely to be next, assuming us-east works. “Virtual” Co-location concept will be useful for a number of other projects. • Other Sanger websites? Disaster recovery. • E.g. replicate critical databases / storage into AWS. • Currently all of Sanger data lives in a single datacentre. • We have a small amount of co-lo space for mirroring critical data. • Same arguments apply as for the us-west mirror.
  29. 29. Hype Cycle Web services
  30. 30. Ensembl Pipeline HPC element of Ensembl. • Takes raw genomes and performs automated annotation on them.
  32. 32. Raw Sequence → Something useful
  33. 33. Example annotation
  34. 34. Gene Finding DNA HMM Prediction Alignment with fragments recovered in vivo Alignment with known proteins Alignment with other genes and other species
  35. 35. Workflow
  36. 36. Compute Pipeline Architecture: • OO perl pipeline manager. • Core algorithms are C. • 200 auxiliary binaries. Workflow: • Investigator describes analysis at high level. • Pipeline manager splits the analysis into parallel chunks. • Typically 50k-100k jobs. • Sorts out the dependencies and then submits jobs to a DRM. • Typically LSF or SGE. • Pipeline state and results are stored in a mysql database. Workflow is embarrassingly parallel. • Integer, not floating point. • 64 bit memory address is nice, but not required. • 64 bit file access is required. • Single threaded jobs. • Very IO intensive.
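
The split / resolve-dependencies / submit pattern described on this slide can be sketched in a few lines. This is not the Ensembl pipeline manager (which is object-oriented Perl); it is an illustration in Python, with hypothetical job names (`align_N`, `merge`), a made-up chunk size, and the standard library's `graphlib` standing in for the dependency logic. A real pipeline would hand each job to the DRM (LSF or SGE) rather than just listing an order.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def chunk(items, size):
    """Split a list of work items into fixed-size chunks, one chunk per job."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_jobs(sequences, chunk_size):
    """Return {job_name: set of job names that must finish first}."""
    deps = {}
    align_jobs = []
    for n, _part in enumerate(chunk(sequences, chunk_size)):
        name = f"align_{n}"          # hypothetical per-chunk analysis job
        deps[name] = set()           # chunks are independent of each other
        align_jobs.append(name)
    deps["merge"] = set(align_jobs)  # hypothetical final step, needs every chunk
    return deps


def submission_order(deps):
    """An order in which jobs can be submitted without violating dependencies."""
    return list(TopologicalSorter(deps).static_order())


jobs = build_jobs([f"contig_{i}" for i in range(10)], chunk_size=4)
order = submission_order(jobs)
print(order)
```

Because the per-chunk jobs have no dependencies on each other, they can all run at once, which is what makes the workload embarrassingly parallel; only the final step has to wait.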
  37. 37. Running the pipeline in practice Requires a significant amount of domain knowledge. Software install is complicated. • Lots of perl modules and dependencies. Need a well tuned compute cluster. • Pipeline takes ~500 CPU days for a moderate genome. • Ensembl chewed up 160k CPU days last year. • Code is IO bound in a number of places. • Typically need a high performance filesystem. • Lustre, GPFS, Isilon, Ibrix etc. • Need large mysql database. • 100GB-TB mysql instances, very high query load generated from the cluster.
  38. 38. Why Cloud? Proof of concept • Is HPC even possible in Cloud infrastructures? Coping with the big increase in data • Will we be able to provision new machines/datacentre space to keep up? • What happens if we need to “out-source” our compute? • Can we be in a position to shift peaks of demand to cloud facilities?
  39. 39. Expanding markets There are going to be lots of new genomes that need annotating. • Sequencers moving into small labs, clinical settings. • Limited informatics / systems experience. • Typically postdocs/PhD who have a “real” job to do. • They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so. We have already done all the hard work on installing the software and tuning it. • Can we package up the pipeline, put it in the cloud? Goal: End user should simply be able to upload their data, insert their credit-card number, and press “GO”.
  40. 40. Porting HPC code to the cloud Let's build a compute cluster in the cloud. Software stack / machine image. • Creating images with software is reasonably straightforward. • No big surprises. Queuing system • Pipeline requires a queueing system: (LSF/SGE) • Licensing problems. • Getting them to run took a lot of fiddling. • Machines need to find each other once they are inside the cloud. • Building an automated “self discovering” cluster takes some hacking. • Hopefully others can re-use it. Mysql databases • Lots of best practice on how to do that on EC2. It took time, even for experienced systems people. • (You will not be firing your system-administrators just yet!).
  41. 41. Did it work? NO! “High performance computing is not facebook.” -- Chris Dagdigian The big problem: data. • Moving data into the cloud is hard. • Doing stuff with data once it is in the cloud is also hard. If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes). Genomics projects have Tbytes → Pbytes of data.
  42. 42. Moving data is hard Commonly used tools (FTP,ssh/rsync) are not suited to wide-area networks. • Need to use specialised WAN tools: gridFTP/FDT/Aspera. There is a lot of broken internet. Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link). • Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s) • Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s) • 11 hours to move 1TB to Dublin. • 23 hours to move 1 TB to East coast. What speed should we get? • Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible. • Finding out who to talk to when you diagnose a troublesome link is also almost impossible.
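
The transfer times on this slide follow directly from the measured rates. A small sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained rate with no protocol overhead; the function name is illustrative.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6   # bytes


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600.0


dublin = transfer_hours(1 * TB, 25 * MB)      # Cambridge -> EC2 Dublin
east = transfer_hours(1 * TB, 12 * MB)        # Cambridge -> EC2 East coast
site_link = transfer_hours(1 * TB, 2e9 / 8)   # full 2 Gbit/s site link, for comparison

print(f"Dublin: {dublin:.1f} h, East coast: {east:.1f} h, "
      f"site link at line rate: {site_link:.1f} h")
```

The same terabyte would take just over an hour if the 2 Gbit/s site link could be filled end to end, which shows how far the achieved wide-area rates fall below the local capacity.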
  43. 43. Networking “But the physicists do this all the time.” • No they don't. • LHC Grid; Dedicated networking between CERN and the T1 centres who get all of the data. Can we use this model? • We have relatively short lived and fluid collaborations. (1-2 years, many institutions). • As more labs get sequencers, our potential collaborators also increase. • We need good connectivity to everywhere.
  44. 44. Using data within the cloud Compute nodes need to have fast access to the data. • We solve this with exotic and temperamental filesystems/storage. No viable global filesystems on EC2. • NFS has poor scaling at the best of times. • EC2 has poor inter-node networking. > 8 NFS clients, everything stops. Nasty-hacks: • Subcloud; commercial product that allows you to run a POSIX filesystem on top of S3. • Interesting performance, and you are paying by the hour...
  45. 45. Compute architecture [Diagram: two layouts compared. Traditional cluster: batch scheduler, CPUs on a fat network sharing a POSIX global filesystem data-store. hadoop/S3: CPUs each with local storage, connected by a thin network to the data-store.]
  46. 46. Why not S3 / hadoop / map-reduce? Not POSIX. • Lots of code expects file on a filesystem. • Limitations; cannot store objects > 5GB. • Throw away file formats? Nobody wants to re-write existing applications. • They already work on our compute farm. • How do hadoop apps co-exist with non-hadoop ones? • Do we have to have two different types of infrastructure and move data between them? • Barrier for entry seems much lower for file-systems. Am I being a reactionary old fart? • 15 years ago clusters of PCs were not “real” supercomputers. • ...then beowulf took over the world. • Big difference: porting applications between the two architectures was easy. • MPI/PVM etc. Will the market provide “traditional” compute clusters in the cloud?
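
The 5 GB object limit quoted on this slide means a large file has to be cut into pieces before it can be stored, and every application that reads it has to know how to put the pieces back together. A minimal sketch of the bookkeeping involved, assuming decimal gigabytes and a hypothetical 200 GB input file; the function name is illustrative and no storage API is called.

```python
MAX_OBJECT_BYTES = 5 * 10**9  # per-object limit from the slide (decimal GB assumed)


def object_ranges(file_size: int, max_object: int = MAX_OBJECT_BYTES):
    """Byte ranges (start, end), end exclusive, each no larger than `max_object`."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    return [(start, min(start + max_object, file_size))
            for start in range(0, file_size, max_object)]


file_size = 200 * 10**9  # hypothetical 200 GB flat file
ranges = object_ranges(file_size)

print(f"{len(ranges)} objects needed")
print("first:", ranges[0], "last:", ranges[-1])
```

Code that simply opens a path and seeks around in one file gets none of this for free, which is the practical cost behind "lots of code expects file on a filesystem".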
  47. 47. Hype cycle HPC
  48. 48. HPC app summary You cannot take an existing data-rich HPC app and expect it to work. • IO architectures are too different. There is some re-factoring going on for the ensembl pipeline to make it EC2 friendly. • Currently on a case-by-case basis. • For the less-data intensive parts. Waiting for the market to deliver...
  49. 49. Shared data archives
  50. 50. Past Collaborations [Diagram: several sequencing centres sending data to a single sequencing centre that also acts as the DCC.]
  51. 51. Genomics Data Data size per genome, from structured data (databases) down to unstructured data (flat files): • Individual features (3 MB). • Variation data (1 GB). • Alignments (200 GB). • Sequence + quality data (500 GB). • Intensities / raw data (2 TB). Users: • Clinical researchers, non-informaticians (structured end). • Sequencing informatics specialists (unstructured end).
  52. 52. The Problem With Current Archives Data in current archives is “dark”. • You can put/get data, but cannot compute across it. Data is all in one place. • Problematic if you are not the DCC: • You have to pull the data down to do something with it, • Holding data in one place is bad for disaster-recovery and network access. Is data in an inaccessible archive really useful?
  53. 53. A real example... “We want to run our pipeline across 100TB of data currently in EGA/SRA.” We will need to de-stage the data to Sanger, and then run the compute. • Extra 0.5 PB of storage, 1000 cores of compute. • 3 month lead time. • ~$1.5M capex. • Download: • 46 days at 25 Mbytes/s (best transatlantic link). • 10 days at 1 Gbit/s. (sling a cable across the datacentre to EBI).
  54. 54. Easy to solve problem in powerpoint: Put data into a cloud. • Big cloud providers already have replicated storage infrastructures. Upload workload onto VMs. • Put VMs on compute that is “attached” to the data. [Diagram: a VM placed on CPUs attached to each copy of the data.]
  55. 55. Practical Hurdles How do you expose the data? • Flat files? Database? How do you make the compute efficient? • Cloud IO problems still there. • And you make the end user pay for them. How do we deal with controlled access? • Hard problem. Grid / delegated security mechanisms are complicated for a reason.
  56. 56. Whose Cloud? Most of us are funded to hold data, not to fund everyone else's compute costs too. • Now need to budget for raw compute power as well as disk. • Implement visualisation infrastructure, billing etc. • Are you legally allowed to charge? • Who underwrites it if nobody actually uses your service? Strongly implies data has to be held on a commercial provider.
  57. 57. Can it solve our networking problems? Moving data across the internet is hard. • Fixing the internet is not going to be cost effective for us. Fixing the internet may be cost effective for big cloud providers. • Core to their business model. • All we need to do is get data into Amazon, and then everyone else can get the data from there. Do we invest in fast links to Amazon? • It changes the business dynamic. • We have effectively tied ourselves to a single provider.
  58. 58. Where are we? Computable archives
  59. 59. Summary Cloud works well for web services. Data rich HPC workloads are still hard. Cloud based data archives look really interesting.
  60. 60. Acknowledgements Phil Butcher ISG Team • James Beal • Gen-Tao Chiang • Pete Clapham • Simon Kelley Ensembl • Steve Searle • Jan-Hinnerk Vogel • Bronwen Aken • Glenn Proctor • Stephen Keenan Cancer Genome Project • Adam Butler • John Teague