
Cloud Technical Challenges


This talk covers the current challenges and opportunities in using cloud computing for data-heavy research computing.

Talk given at the Marcus Evans "Cloud Computing in the Pharmaceutical Industry" conference, Frankfurt 2011.


  1. Cloud Technical Challenges
     Guy Coates, Wellcome Trust Sanger Institute, gmpc@sanger.ac.uk
  2. Outline
     • Background
     • Cloud Experiences
     • Barriers
     • Future Directions
  3. The Sanger Institute
     Funded by Wellcome Trust.
     • 2nd largest research charity in the world.
     • ~700 employees.
     • Based in Hinxton Genome Campus, Cambridge, UK.
     Large scale genomic research.
     • Sequenced 1/3 of the human genome (largest single contributor).
     • We have active cancer, malaria, pathogen and genomic variation / human health studies.
     All data is made publicly available.
     • Websites, ftp, direct database access, programmatic APIs.
  4. Lost in the clouds...
  5. Victory!
  6. Our Cloud Experiences
  7. Hype Cycle
     [Figure labels: "Awesome!", "Just works..."]
  8. Ensembl
     Ensembl is a system for genome annotation.
     Data visualisation / mining web services.
     • www.ensembl.org
     • Provides web / programmatic interfaces to genomic data.
     • 10k visitors / 126k page views per day.
     Compute pipeline (HPTC workload).
     • Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
     • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
     • Software is open source (Apache license).
     • Data is free for download.
     We have web services and HPTC workloads running on IaaS.
  9. Why Cloud?
     Web services
     • Was hosted in a single datacentre at the Genome Campus, UK.
     • 1 datacentre = single point of failure.
     • Access slow if you were not in western Europe.
     Cloud application
     • Build worldwide network of mirrors on IaaS.
     HPC
     • People want to run the Ensembl HPC pipeline on their own data.
     • Requires a skilled bioinformatician to get the software running, and access to an HPC cluster.
     Cloud application
     • Build HPC SaaS.
     • Users deploy ready-to-run Ensembl code on AWS; it self-assembles into an HPC cluster and analyses their data.
  10. Hype Cycle
      [Figure label: Web services / some HPC]
  11. That was easy...
  12. Hype Cycle
      [Figure label: Sequencing informatics]
  13. DNA sequencing
  14. Economic Trends
      The cost of sequencing halves every 12 months.
      • cf. Moore's Law
      The Human Genome Project:
      • 13 years.
      • 23 labs.
      • $500 million.
      A human genome today:
      • 3 days.
      • 1 machine.
      • $10,000.
      • Large centres are now doing studies with 10,000s of genomes.
      Trend will continue:
      • Generation 3 sequencers are on their way.
      • $500 genome is probable within 5 years.
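The cost trend on the Economic Trends slide can be checked with a few lines of arithmetic. The sketch below assumes a clean exponential with a 12-month halving time, starting from the $10,000 figure quoted on the slide. The function names are illustrative only, and real prices will not follow the curve exactly.

```python
import math


def projected_cost(cost_now: float, years: float, halving_time_years: float = 1.0) -> float:
    """Cost after `years`, assuming it halves every `halving_time_years`."""
    return cost_now * 0.5 ** (years / halving_time_years)


def years_to_reach(cost_now: float, target: float, halving_time_years: float = 1.0) -> float:
    """Years until the cost falls to `target` under the same assumption."""
    return halving_time_years * math.log2(cost_now / target)


# Starting point taken from the slide: $10,000 per human genome.
for year in range(6):
    print(f"year {year}: ${projected_cost(10_000, year):,.0f}")

print(f"years to reach $500: {years_to_reach(10_000, 500):.1f}")
```

Under this assumption the $500 mark is crossed a little after four years, which is consistent with the slide's "within 5 years".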
  15. The scary graph
      • Peak yearly capillary sequencing: 30 Gbase
      • Current weekly sequencing: 3000 Gbase
  16. Managing Growth
      We have exponential growth in storage and compute.
      • Storage / compute doubles every 12 months.
      • 2009: ~7 PB raw.
      Gigabase of sequence ≠ Gigabyte of storage.
      • 16 bytes per base for sequence data.
      • Intermediate analysis typically needs 10x the disk space of the raw data.
      Moore's law will not save us.
      • Transistor/disk density: Td = 18 months
      • Sequencing cost: Td = 12 months
      • Sequencing output: Td = 3-6 months
      [Figure: Disk Storage, in terabytes (0 to 6000), by year, 1994 to 2009]
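The Managing Growth figures can be turned into a rough worked example. The sketch below reads Td as a doubling time and applies the 16 bytes per base figure to the 3000 Gbase per week quoted on the previous slide. Applying the 10x intermediate multiplier to that same number is an assumption, since the slide does not say exactly what "raw data" covers. All names are illustrative.

```python
BYTES_PER_BASE = 16            # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_MULTIPLIER = 10   # from the slide: analysis needs ~10x the raw data


def growth_factor(months: float, doubling_time_months: float) -> float:
    """How many times a quantity grows in `months` at the given doubling time."""
    return 2.0 ** (months / doubling_time_months)


def sequence_bytes(gigabases: float) -> float:
    """Bytes needed to store `gigabases` of sequence at BYTES_PER_BASE."""
    return gigabases * 1e9 * BYTES_PER_BASE


weekly = sequence_bytes(3000)  # 3000 Gbase per week
print(f"sequence data per week: {weekly / 1e12:.0f} TB")
print(f"plus intermediate files (assumed 10x): {weekly * INTERMEDIATE_MULTIPLIER / 1e12:.0f} TB")

# Why Moore's law does not help: compare growth over three years.
months = 36
disk = growth_factor(months, 18)    # transistor/disk density, Td = 18 months
output = growth_factor(months, 6)   # sequencing output, slower end of Td = 3-6 months
print(f"disk density x{disk:.0f}, sequencing output x{output:.0f}, gap x{output / disk:.0f}")
```

Even at the slower end of the quoted range, sequencing output outgrows disk density by a factor of 16 over three years.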
  17. What do you need to do sequencing?
      [Diagram boxes: LIMS system / data tracking; sample prep; sequencer; integrated compute; analysis software; data repository; external repository; HPC resource]
  18. What IT do you need to do sequencing?
      [Same diagram as slide 17, with one section marked "Part covered in the grant"]
  19. This is really hard...
      We have a whole division of HPC specialists, LIMS developers, bio-informaticians.
      What about smaller labs with 1 or 2 sequencers?
  20. ...and then change it.
      Sequencing informatics is massively fluid.
      • New chemistry.
      • More sequencing machines.
      • New analysis software.
      Constant cycle of development and deployment.
  21. How can cloud help?
  22. What can we put on the Cloud?
      [Same diagram as slide 17]
  23. Does it Cloud?
      How do we decide what to cloud?
      Rule of thumb borrowed from HPC.
      • Small data / high CPU work better in distributed environments.
      [Scale: IO bound / large data ↔ CPU bound / small data]
  24. Sequencing Data
      Data size per genome, smallest to largest:
      • Tracking / LIMS (100s Kbytes)
      • Individual features (3 MB)
      • Variation data (1 GB)
      • Alignments (200 GB)
      • Sequence + quality data (500 GB)
      • (Raw data (TB))
      Structured data (databases) sits at the small end; unstructured data (flat files) at the large end.
  25. Sequencing Data
      [Same list as slide 24, with the small end marked "Cloud Friendly" and the large end marked "Cloud Unfriendly"]
  26. Can we Cloudify Sequencing?
      [Same diagram as slide 17]
  27. What are the blockers?
      HPC infrastructure is now available in the cloud.
      • Good enough for 95% of sequencing.
      Doing big data is hard:
      1. You have to get the data there first.
      2. You may not be allowed to put the data there.
  28. Moving data is hard
      Tools:
      • Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
      • WAN tools: gridFTP / FDT / Aspera.
      Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link):
      • Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
      • Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
      • 11 hours to move 1 TB to Dublin.
      • 23 hours to move 1 TB to East coast.
      What speed should we get?
      • Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
      Do you have fast enough disks at each end to keep the network full?
      Why not just ship disks?
      • Logistical nightmare.
      • Format issues, corruption, slow.
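The transfer times on the Moving Data slide follow directly from the measured rates. The sketch below reproduces them, assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained rate for the whole transfer. The function names are illustrative.

```python
def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained `rate_bytes_per_s`."""
    return size_bytes / rate_bytes_per_s / 3600.0


def link_utilisation(rate_bytes_per_s: float, link_bits_per_s: float) -> float:
    """Fraction of the link capacity that the observed rate actually uses."""
    return rate_bytes_per_s * 8 / link_bits_per_s


TB = 1e12          # decimal terabyte (assumption)
SITE_LINK = 2e9    # 2 Gbit/s site link, from the slide

# Measured rates from the slide, in bytes per second.
rates = {"EC2 Dublin": 25e6, "EC2 East coast": 12e6}

for destination, rate in rates.items():
    hours = transfer_hours(1 * TB, rate)
    used = link_utilisation(rate, SITE_LINK)
    print(f"{destination}: {hours:.1f} h per TB, {used:.1%} of the site link")
```

The utilisation figure makes the slide's point concrete: the best measured rate uses about a tenth of the site link, so the bottleneck lies somewhere beyond the local connection.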
  29. Networking
      How do we improve data transfers across the public internet?
      • CERN approach: don't.
      • Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.
      Can it work for cloud?
      • Buy dedicated bandwidth to a provider.
        • Ties you in.
        • Should they pay?
      We need good connectivity to everywhere.
  30. Data Security
  31. Are you allowed to put data on the cloud?
      Default policy: “Our data is confidential/important/critical to our business. We must keep our data on our computers.”
  32. What does “My System” mean?
      Scale from "My system" to "Not my system":
      • Purchased computer in my data centre
      • Leased computer in my data centre
      • Purchased computer in a co-lo facility
      • Traditionally outsourced IT service
      • IaaS on a cloud provider
      • SaaS on a cloud provider
      Questions to ask:
      • Root / admin access?
      • VPN / inside or outside firewall?
      • Encrypted / non-encrypted?
      • Legal / IP agreement in place?
  33. How confidential is the data?
      Items placed on a scale from low risk to high risk:
      • Publicly available datasets
      • Anonymised genome data (eg individual genomes with no identifiers)
      • Personally identifiable datasets
      • Trade secret / patentable data
  34. Reasons to be optimistic
      Most (all?) data security issues can be dealt with.
      • But the devil is in the details.
      • Data can be put on the cloud, if care is taken.
      It is probably more secure there than in your own data centre.
      • Can you match AWS data availability guarantees?
      Are cloud providers different from any other organisation you outsource to?
  35. Outstanding Issues
      Audit and compliance:
      • If you need IP agreements above your provider's standard T&Cs, how do you push them through?
      Geographical boundaries mean little in the cloud.
      • Data can be replicated across national boundaries without the end user being aware.
      Moving personally identifiable data outside of the EU is potentially problematic.
      • (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
      • More sequencing experiments are trying to link with phenotype data (ie personally identifiable medical records).
  36. Private Cloud to rescue?
      Sequencing increasingly takes place in large consortiums.
      • Eg International Cancer Genome Consortium (http://www.icgc.org)
      Can we do private clouds within the consortium?
  37. Traditional Collaboration
      [Diagram: several sequencing centres and a sequencing centre + DCC, each with its own IT]
  38. Cloud Collaborations
      [Diagram: sequencing centres and private cloud IaaS / SaaS]
  39. Private Cloud
      Advantages:
      • LIMS / analysis software easily shared with consortium.
        • Small organisations leverage expertise of big IT organisations.
      • Academia tends to be linked by fast research networks.
        • Moving data is easier.
      • Consortium will be signed up to data-access agreements.
        • Simplifies data governance.
      Problems:
      • Big change in funding model.
      • Are big centres set up to provide private cloud services?
        • Selling services is hard if you are a charity.
      • Can we do it as well as the big internet companies?
  40. Cloud data archives
  41. Dark Archives
      Storing data in an archive is not particularly useful.
      • You need to be able to access the data and do something useful with it.
      Data in current archives is “dark”.
      • You can put/get data, but cannot compute across it.
      • Is data in an inaccessible archive really useful?
  42. Example problem
      “We want to run our pipeline across 100 TB of data currently in EGA/SRA.”
      We will need to de-stage the data to Sanger, and then run the compute.
      • Extra 0.5 PB of storage, 1000 cores of compute.
      • 3 month lead time.
      • ~$1.5M capex.
  43. Cloud / Computable archives
      Move the compute to the data.
      • Upload workload onto VMs.
      • Put VMs on compute that is “attached” to the data.
      Federated between centres.
      • Grid software built on top of cloud components.
      • Avoids scaling problems inherent in putting everything in one place.
      [Diagram: groups of CPUs attached to data stores, with VMs placed on them]
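The "move the compute to the data" idea can be illustrated with a toy placement routine. This is a minimal sketch, not the grid software the slide refers to: the catalogue of which centre holds which dataset, the job names, and the `place_jobs` function are all hypothetical, and each job is assumed to read exactly one dataset.

```python
from typing import Dict, List


def place_jobs(job_inputs: Dict[str, str], dataset_site: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each job to the centre that already holds its input dataset.

    job_inputs maps job name -> dataset name.
    dataset_site maps dataset name -> centre holding that dataset.
    Raises KeyError if a job refers to a dataset missing from the catalogue.
    """
    placement: Dict[str, List[str]] = {}
    for job, dataset in job_inputs.items():
        site = dataset_site[dataset]
        placement.setdefault(site, []).append(job)
    return placement


# Hypothetical federation of two centres, each holding one dataset.
dataset_site = {"cancer_cohort": "centre_a", "malaria_panel": "centre_b"}
job_inputs = {
    "align_1": "cancer_cohort",
    "align_2": "cancer_cohort",
    "variants_1": "malaria_panel",
}

placement = place_jobs(job_inputs, dataset_site)
for site, jobs in placement.items():
    print(f"{site}: start VMs for {', '.join(jobs)}")
```

No dataset crosses the wide-area network here; only the (small) workload description travels, which is the property the slide is after.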
  44. Acknowledgements
      Sanger:
      • Phil Butcher
      • James Beal
      • Pete Clapham
      • Simon Kelley
      • Gen-Tao Chiang
      • Steve Searle
      • Jan-Hinnerk Vogel
      • Bronwen Aken
      EBI:
      • Glenn Proctor
      • Steve Keenan
