Cloud Experiences

Sanger Institute's experiences with the cloud.

Given at Green Datacentre & Cloud Control 2011

Transcript

  • 1. Cloud Experiences
    • Guy Coates
    • 2. Wellcome Trust Sanger Institute
    • 3. [email_address]
  • 4. The Sanger Institute
    • Funded by Wellcome Trust.
      • 2nd largest research charity in the world.
      • 5. ~700 employees.
      • 6. Based in Hinxton Genome Campus, Cambridge, UK.
    • Large scale genomic research.
      • Sequenced 1/3 of the human genome (the largest single contributor).
      • 7. We have active cancer, malaria, pathogen and genomic variation / human health studies.
    • All data is made publicly available.
      • Websites, FTP, direct database access, programmatic APIs.
  • 8. DNA Sequencing: 250 million fragments of 75-108 bases → human genome (3 Gbases). (Slide illustrated with a wall of raw sequence text.)
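To see why so many fragments are needed, here is a rough coverage calculation; the ~90-base average read length is an assumed midpoint of the 75-108 range, not a figure from the slides.

```python
# Back-of-the-envelope sequencing coverage from the figures on the slide.
reads = 250e6        # 250 million fragments
read_length = 90     # bases; assumed midpoint of the 75-108 range
genome_size = 3e9    # human genome, 3 Gbases

bases_sequenced = reads * read_length
print("Total bases sequenced: ~%.1f Gbases" % (bases_sequenced / 1e9))            # ~22.5
print("Redundant coverage of the genome: ~%.1fx" % (bases_sequenced / genome_size))  # ~7.5x
```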
  • 9. Moore's Law: compute/disk doubles every 18 months; sequencing doubles every 12 months.
  • 10. Economic Trends:
    • The Human genome project:
    • A Human genome today:
    • Trend will continue:
      • $500 genome is probable within 3-5 years.
  • 15. The scary graph: peak yearly capillary sequencing was 30 Gbase; current weekly sequencing is 6,000 Gbase.
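A sketch of why the graph is scary: if sequencing doubles every 12 months while compute/disk doubles every 18 months (the rates from the Moore's Law slide), the gap itself compounds.

```python
# Gap between sequencing growth (doubling every 12 months) and
# compute/disk growth (doubling every 18 months), year by year.
for year in range(7):
    sequencing = 2 ** (year / 1.0)   # 12-month doubling
    compute    = 2 ** (year / 1.5)   # 18-month doubling
    print("year %d: sequencing x%-5.1f compute x%-5.1f shortfall x%.1f"
          % (year, sequencing, compute, sequencing / compute))
# By year 6 the shortfall is 4x, and it doubles again every 3 years.
```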
  • 16. Our Science
  • 17. UK 10K Project
    • Decode the genomes of 10,000 people in the UK.
    • 18. Will improve the understanding of human genetic variation and disease.
    Genome Research Limited Wellcome Trust launches study of 10,000 human genomes in UK; 24 June 2010 www.sanger.ac.uk/about/press/2010/100624-uk10k.html
  • 19. New scale, new insights . . . to common disease
  • 28. Cancer Genome Project
    • Cancer is a disease caused by abnormalities in a cell's genome.
  • 29. Detailed Changes:
    • Sequencing hundreds of cancer samples
    • 30. First comprehensive look at cancer genomes:
      • Lung Cancer
      • 31. Malignant melanoma
      • 32. Breast cancer
    • Identify driver mutations for:
      • Improved diagnostics
      • 33. Development of novel therapies
      • 34. Targeting of existing therapeutics
    Lung Cancer and melanoma laid bare; 16 December 2009 www.sanger.ac.uk/about/press/2009/091216.html
  • 35. IT Challenges
  • 36. Managing Growth
    • Analysing the data takes a lot of compute and disk space
      • Finished sequence is the start of the problem, not the end.
    • Growth of compute & storage
      • Storage/compute doubles every 12 months.
        • 2010 ~12 PB raw
    • Moore's law will not save us.
    • 37. The $1000 genome*
    • 38. *Informatics not included
  • 39. Sequencing data flow (diagram): Sequencer → Processing/QC → comparative analysis → Internet, with per-sample data volumes shrinking from raw data (10 TB) to sequence (500 GB), alignments (200 GB), variation data (1 GB) and features (3 MB); the datastore holds a mix of structured data (databases) and unstructured data (flat files).
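For a feel of the numbers in that flow, a minimal sketch of the per-sample reduction from one stage to the next (volumes taken from the slide):

```python
# Data volume at each stage of the sequencing data flow (per sample).
stages = [
    ("Raw data",       10e12),   # 10 TB
    ("Sequence",       500e9),   # 500 GB
    ("Alignments",     200e9),   # 200 GB
    ("Variation data", 1e9),     # 1 GB
    ("Features",       3e6),     # 3 MB
]
for (name, size), (next_name, next_size) in zip(stages, stages[1:]):
    print("%-14s -> %-14s %6.0fx smaller" % (name, next_name, size / next_size))
```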
  • 40. Data centre
    • 4 × 250 m² data centres.
      • 2-4 kW/m² cooling.
      • 41. 1.8 MW power draw
      • 42. 1.5 PUE (worked through in the sketch after this slide)
    • Overhead aircon, power and networking.
      • Allows counter-current cooling.
      • 43. Focus on power & space efficient storage and compute.
    • Technology Refresh.
      • 1 data centre is an empty shell.
        • Rotate into the empty room every 4 years and refurb.
      • “Fallow Field” principle.
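A worked example of the power figures above, assuming the 1.8 MW is the total facility draw (PUE = total facility power / IT power):

```python
# Data centre power, assuming 1.8 MW is the total facility draw.
total_power_mw = 1.8
pue = 1.5                       # PUE = total facility power / IT power

it_power_mw = total_power_mw / pue
print("IT equipment load:      %.1f MW" % it_power_mw)                     # 1.2 MW
print("Cooling/power overhead: %.1f MW" % (total_power_mw - it_power_mw))  # 0.6 MW
```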
  • 44. Our HPC Infrastructure
    • Compute
      • 8500 cores
      • 45. 10GigE / 1GigE networking.
    • High performance storage
      • 1.5 PB of DDN 9000 & 10000 storage
      • 46. Lustre filesystem
    • LSF queuing system
  • 47. Ensembl
    • Data visualisation / mining web services.
      • www.ensembl.org
      • 48. Provides web / programmatic interfaces to genomic data (a minimal access sketch follows this slide).
      • 49. ~10k visitors / 126k page views per day.
    • Compute Pipeline (HPTC Workload)
      • Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
      • 50. Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
      • Software is open source (Apache license).
      • 51. Data is free for download.
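A minimal sketch of the programmatic route mentioned above, using the public Ensembl MySQL server (ensembldb.ensembl.org, anonymous read-only access) with the Python MySQLdb driver; the port can vary with release.

```python
import MySQLdb   # the MySQL-python driver

# Connect to the public Ensembl MySQL server (read-only anonymous access).
conn = MySQLdb.connect(host="ensembldb.ensembl.org", user="anonymous", port=5306)
cur = conn.cursor()

# List the human core databases available on the server.
cur.execute("SHOW DATABASES LIKE %s", ("homo_sapiens_core%",))
for (db_name,) in cur.fetchall():
    print(db_name)
```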
  • 52. Sequencing data flow (the same diagram as before), with the HPC compute pipeline overlaid on the processing/analysis stages and the web/database infrastructure on the serving side.
  • 53. (Slide showing a page of raw, unannotated DNA sequence.)
  • 54. Annotation
  • 55. Annotation
  • 56. Why Cloud?
  • 57. Web Services
    • Ensembl has a worldwide audience.
    • 58. Historically, web site performance was not great, especially for non-European institutes.
      • Pages were quite heavyweight.
      • 59. Not properly cached etc.
    • Web team spent a lot of time re-designing the code to make it more streamlined.
      • Greatly improved performance.
    • Coding can only get you so far.
      • 150-240ms round trip time from Europe to the US.
      • 60. We need a set of geographically dispersed mirrors.
  • 61. Colocation
    • Real machines in a co-lo facility in California.
      • Traditional mirror.
    • Hardware was initially configured on site.
      • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.
    • Shipped to the co-lo for installation.
      • Sent a person to California for 3 weeks.
      • 62. Spent 1 week getting stuff into/out of customs.
        • ****ing FCC paperwork!
    • Additional infrastructure work.
      • VPN between UK and US.
    • Incredibly time consuming.
      • Really don't want to end up having to send someone on a plane to the US to fix things.
  • 63. Cloud Opportunities
    • We wanted more mirrors.
      • US East Coast, Asia-Pacific.
    • Investigations into AWS already ongoing.
    • 64. Many people would like to run the Ensembl webcode to visualise their own data.
      • Non-trivial for the non-expert user.
        • MySQL, Apache, Perl.
    • Can we distribute AMIs instead?
      • Ready to run.
    • Can we eat our own dog-food?
      • Run mirror site from the AMIs?
  • 65. What we actually did (diagram): the Sanger site linked to AWS over a VPN.
  • 66. Building a mirror on AWS
    • Application development was required
      • Significant code changes required to make the webcode “mirror aware”.
        • Mostly done for the original co-location site.
    • Some software development / sysadmin work needed.
      • Preparation of OS images, software stack configuration.
      • 67. VPN configuration
    • Significant amount of tuning required.
      • Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
      • 68. Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available (a minimal launch sketch follows this list).
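For a flavour of the mechanics, a hedged sketch of bringing up one mirror web node from a prepared image with the boto library. The region, AMI ID, key pair and security group below are illustrative placeholders, not the real Ensembl mirror configuration.

```python
import boto.ec2

# Connect to the region hosting the mirror (credentials come from the environment).
conn = boto.ec2.connect_to_region("us-east-1")

# Launch one web node from a pre-built Apache/Perl/MySQL image.
# Every identifier here is a placeholder.
reservation = conn.run_instances(
    image_id="ami-00000000",
    instance_type="m1.large",
    key_name="mirror-admin",
    security_groups=["mirror-web"],
)
instance = reservation.instances[0]
print("Launched %s (state: %s)" % (instance.id, instance.state))
```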
  • 69. Traffic
  • 70. Is it cost effective?
    • Lots of misleading cost statements made about cloud.
      • “Our analysis only cost $500.”
      • 71. CPU is only “$0.085 / hr”.
    • What are we comparing against?
      • Doing the analysis once? Continually?
      • 72. Buying a $2000 server?
      • 73. Leasing a $2000 server for 3 years?
      • 74. Using $150 of time at your local supercomputing facility?
      • 75. Buying $2000 of server but having to build a $1M datacentre to put it in?
    • Requires the dreaded Total Cost of Ownership (TCO) calculation.
      • hardware + power + cooling + facilities + admin/developers etc
        • Incredibly hard to do.
  • 76. Breakdown:
    • Comparing costs to the “real” Co-lo
      • Power and cooling costs are all included.
      • 77. Admin costs are the same, so we can ignore them.
        • Same people responsible for both.
    • Cost for Co-location facility:
      • $120,000 hardware + $51,000 /yr colo.
      • 78. $91,000 per year (3 years hardware lifetime).
    • Cost for AWS site:
      • $84,000 per year.
    • We can run 3 mirrors for roughly 90% of the cost of 1 co-lo mirror (arithmetic sketched below).
    • 79. It is not free!
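The arithmetic behind those figures, with the assumption (implied rather than stated on the slide) that the $84,000/yr AWS figure covers all three mirrors:

```python
# Co-location mirror: hardware amortised over 3 years plus annual co-lo fees.
colo_hardware = 120000.0          # one-off hardware spend
colo_fees_per_year = 51000.0      # annual co-lo charges
colo_per_year = colo_hardware / 3 + colo_fees_per_year
print("Co-lo mirror:      $%.0f / year" % colo_per_year)    # $91,000

aws_per_year = 84000.0            # assumed to cover the three AWS mirrors
print("AWS mirrors:       $%.0f / year" % aws_per_year)
print("AWS as %% of co-lo: %.0f%%" % (100 * aws_per_year / colo_per_year))  # ~92%
```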
  • 80. Advantages
    • No physical hardware.
      • Work can start as soon as we enter our credit card numbers...
      • 81. No US customs, Fedex etc.
    • Less hardware:
      • No Firewalls, SAN management appliances etc.
    • Much simpler management infrastructure.
        • AWS give you out of band management “for free”.
        • 82. No hardware issues.
    • Easy path for growth.
      • No space constraints.
        • No need to get tin decommissioned/re-installed at the co-lo.
      • Add more machines until we run out of cash.
  • 83. Downsides
    • Underestimated the time it would take to make the web-code mirror-ready.
      • Not a cloud specific problem, but something to be aware of when you take big applications and move them outside your home institution.
    • Curation of software images takes time.
      • Regular releases of new data and code.
      • 84. Ensembl team now has a dedicated person responsible for the cloud.
      • 85. Somebody has to look after the systems.
    • Management overhead does not necessarily go down.
      • But it does change.
  • 86. Going forward
    • Change code to remove all dependencies on Sanger.
      • Full DR capability.
    • Make the AMIs publicly available.
      • Today we have MySQL servers + data.
        • Data generously hosted on Amazon public datasets.
      • Allow users to simply run their own sites.
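A hedged sketch of how "run their own sites" could work with boto: create an EBS volume from a public dataset snapshot and attach it to an instance started from the public AMI. The snapshot, instance and device IDs are placeholders.

```python
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Create a volume from a public dataset snapshot and attach it to a running
# instance started from the public AMI.  All identifiers are placeholders.
volume = conn.create_volume(size=1000,                # GB, >= snapshot size
                            zone="us-east-1a",
                            snapshot="snap-00000000")
conn.attach_volume(volume.id, "i-00000000", "/dev/sdf")
print("Attached data volume %s" % volume.id)
```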
  • 87. HPC Workloads
  • 88. Why HPC in the Cloud?
    • We already have a data-centre.
      • Not seeking to replace our existing infrastructure.
      • 89. Not cost effective.
    • But: Long lead-times for installing kit.
      • ~3-6 months from idea to going live.
      • 90. Longer than the science can wait.
      • 91. Ability to burst capacity might be useful.
    • Test environments.
      • Test at scale.
      • 92. Large clusters for a short amount of time.
  • 93. Distributing analysis tools
    • Sequencing is becoming a commodity.
    • 94. Informatics / analysis tools need to be commodity too.
    • 95. Requires a significant amount of domain knowledge.
      • Complicated software installs, relational databases etc.
    • Goal:
      • Researcher with no IT knowledge can take their sequence data, upload it to AWS, get it analysed and view the results.
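The first step of that goal, getting the sequence data into AWS, might look like this with boto and S3; the bucket and key names are placeholders.

```python
import boto

# Upload a file of sequencing reads to S3 (bucket/key names are placeholders).
conn = boto.connect_s3()                         # credentials from the environment
bucket = conn.create_bucket("my-sequencing-runs")
key = bucket.new_key("run42/reads.fastq")
key.set_contents_from_filename("reads.fastq")
print("Uploaded s3://%s/%s" % (bucket.name, key.name))
```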
  • 96. Life sciences HPC workloads (chart): axes running from tightly coupled (MPI) to embarrassingly parallel and from CPU-bound to IO-bound, with modelling/docking, simulation and genomics workloads placed on them.
  • 97. Our Workload
    • Embarrassingly Parallel.
      • Lots of single threaded jobs.
      • 98. 10,000s of jobs.
      • 99. Core algorithms in C
      • 100. Perl pipeline manager to generate and manage workflow.
      • 101. Batch scheduler to execute jobs on nodes.
      • 102. MySQL database to hold results & state (a toy sketch of this job shape follows this slide).
    • Moderate memory sizes.
      • 3 GB/core
    • IO bound.
      • Fast parallel filesystems.
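The shape of that workload as a toy sketch (not the actual Perl pipeline): thousands of independent single-threaded jobs generated over chunks of input, with job state tracked in a database. sqlite3 stands in for MySQL here, and the chunk size is arbitrary.

```python
import sqlite3

TOTAL_READS = 250000000        # reads in one run
READS_PER_JOB = 25000          # arbitrary chunk size -> 10,000 independent jobs

db = sqlite3.connect("pipeline_state.db")
db.execute("CREATE TABLE IF NOT EXISTS jobs "
           "(id INTEGER PRIMARY KEY, first_read INTEGER, last_read INTEGER, state TEXT)")

# One row per embarrassingly-parallel job; a scheduler would pick up PENDING rows.
for job_id, start in enumerate(range(0, TOTAL_READS, READS_PER_JOB)):
    db.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, 'PENDING')",
               (job_id, start, min(start + READS_PER_JOB, TOTAL_READS) - 1))
db.commit()
print("%d jobs queued" % db.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])
```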
  • 103. Life sciences HPC workloads (the same chart, repeated).
  • 104. Different architectures (diagram): a traditional HPC cluster (CPUs on a fat network with a POSIX global filesystem and a batch scheduler) vs a cloud-style cluster (CPUs on a thin network with per-node local storage, S3, Hadoop?).
  • 105. Life sciences HPC workloads (the same chart, repeated).
  • 106. Careful choice of problem:
    • Choose a simple part of the pipeline
      • Re-factor all the code that expects a global filesystem and make it use S3.
    • Why not use hadoop?
      • Production code that works nicely inside Sanger.
      • 107. Vast effort to port code, for little benefit.
      • 108. Questions about stability for multi-user systems internally.
    • Build a self-assembling HPC cluster.
      • Code which spins up AWS instances that self-assemble into an HPC cluster with a batch scheduler (see the sketch after this list).
    • Cloud allows you to simplify.
      • Sanger compute cluster is shared.
        • Lots of complexity in ensuring applications/users play nicely together.
      • AWS clusters are unique to a user/application.
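A hedged sketch of the self-assembling cluster: launch a head node, then launch workers whose user-data tells them where to register. The AMI, key pair and security group are placeholders, and the boot script that reads the user-data is assumed to be baked into the image.

```python
import time
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
AMI, KEY, GROUP = "ami-00000000", "hpc-key", "hpc-cluster"   # placeholders

# 1. Launch the head node (runs the batch scheduler master).
head = conn.run_instances(AMI, instance_type="m1.large", key_name=KEY,
                          security_groups=[GROUP]).instances[0]
while head.state != "running":
    time.sleep(10)
    head.update()
print("Head node %s at %s" % (head.id, head.private_ip_address))

# 2. Launch workers; a boot script in the image reads HEAD_NODE from the
#    user-data and registers the node with the scheduler.
conn.run_instances(AMI, min_count=50, max_count=50, instance_type="c1.xlarge",
                   key_name=KEY, security_groups=[GROUP],
                   user_data="HEAD_NODE=%s" % head.private_ip_address)
```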
  • 109. The real problem: Internet
    • Data transfer rates (GridFTP/FDT via our 2 Gbit/s site link):
      • Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s)
      • 110. Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
      • 111. 11 hours to move 1 TB to Dublin.
      • 112. 23 hours to move 1 TB to the East coast (arithmetic sketched after this list).
    • What speed should we get?
      • Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible.
    • Do you have fast enough disks at each end to keep the network full?
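The quoted transfer times follow directly from the measured rates; a small sketch for checking what a different rate or data size would mean:

```python
def transfer_hours(size_tb, rate_mbytes_per_s):
    """Hours to move size_tb terabytes at a sustained rate in Mbytes/s."""
    return size_tb * 1e12 / (rate_mbytes_per_s * 1e6) / 3600

print("1 TB to EC2 Dublin     (25 Mbytes/s): %.0f hours" % transfer_hours(1, 25))  # ~11
print("1 TB to EC2 East coast (12 Mbytes/s): %.0f hours" % transfer_hours(1, 12))  # ~23
```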
  • 113. Networking
    • How do we improve data transfers across the public internet?
      • CERN approach; don't.
      • 114. 10 Gbit dedicated network between CERN and the T1 centres.
    • Can it work for cloud?
      • Buy dedicated bandwidth to a provider.
        • Ties you in.
        • 115. Should they pay?
    • What happens when you want to move?
  • 116. Summary
    • Moving existing HPC applications is painful.
    • 117. Small data / high CPU applications work really well.
    • 118. Large data applications less well.
  • 119. Data Security
  • 120. Are you allowed to put data on the cloud?
    • Default policy:
    • 121. “Our data is confidential/important/critical to our business.
    • 122. We must keep our data on our computers.”
    • 123. “Apart from when we outsource it already.”
  • 124. Reasons to be optimistic:
    • Most (all?) data security issues can be dealt with.
      • But the devil is in the details.
      • 125. Data can be put on the cloud, if care is taken.
    • It is probably more secure there than in your own data-centre.
      • Can you match AWS data availability guarantees?
    • Are cloud providers different from any other organisation you outsource to?
  • 126. Outstanding Issues
    • Audit and compliance:
      • If you need IP agreements above your provider's standard T&Cs, how do you push them through?
    • Geographical boundaries mean little in the cloud.
      • Data can be replicated across national boundaries without the end user being aware.
    • Moving personally identifiable data outside of the EU is potentially problematic.
      • (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
      • 127. More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).
  • 128. Private Cloud to the rescue?
    • Can we do something different?
  • 129. Traditional collaboration (diagram): multiple sequencing centres, each with its own IT, feeding a central DCC (sequencing centre + archive).
  • 130. Dark Archives
    • Storing data in an archive is not particularly useful.
      • You need to be able to access the data and do something useful with it.
    • Data in current archives is “dark”.
      • You can put/get data, but cannot compute across it.
      • 131. Is data in an inaccessible archive really useful?
  • 132. Private cloud collaborations (diagram): sequencing centres sharing private clouds that provide IaaS / SaaS.
  • 133. Private Cloud
    • Advantages:
      • Small organisations leverage expertise of big IT organisations.
      • 134. Academia tends to be linked by fast research networks.
        • Moving data is easier (and compute can move to the data via VMs).
      • Consortium will be signed up to data-access agreements.
        • Simplifies data governance.
    • Problems:
      • Big change in funding model.
      • 135. Are big centres set up to provide private cloud services?
        • Selling services is hard if you are a charity.
      • Can we do it as well as the big internet companies?
  • 136. Summary
    • Cloud is a useful tool.
      • Will not replace our local IT infrastructure.
    • Porting existing applications can be hard.
      • Do not underestimate time / people.
    • Still need IT staff.
      • End up doing different things.
  • 137. Acknowledgements
