Clouds, Grids and Data



The next-generation sequencing data deluge requires storage and compute services to be provisioned at an ever-increasing rate. Can cloud (and last decade's buzzword, grid) computing help us?

Talk given at the NHGRI Cloud computing workshop, 2010.


1. Clouds, Grids and Data
   Guy Coates, Wellcome Trust Sanger Institute, [email_address]
2. The Sanger Institute
   • Funded by the Wellcome Trust.
     • The 2nd-largest research charity in the world.
     • ~700 employees.
     • Based on the Hinxton Genome Campus, Cambridge, UK.
   • Large-scale genomic research.
     • Sequenced 1/3 of the human genome (the largest single contributor).
     • Active cancer, malaria, pathogen and genomic variation / human health studies.
   • All data is made publicly available.
     • Websites, FTP, direct database access, programmatic APIs.
3. Shared data archives
4. Past Collaborations
   [Diagram: several sequencing centres feeding data into a single sequencing centre + data coordination centre (DCC)]
5. Future Collaborations
   Collaborations are short term: 18 months to 3 years.
   [Diagram: Sequencing Centres 1, 2A, 2B and 3 linked by federated access]
6. Genomics Data
   Data size per genome, from unstructured data (flat files) down to structured data (databases; DAS, BioMart etc.):
   • Intensities / raw data: 2 TB
   • Sequence + quality data: 500 GB
   • Alignments: 200 GB
   • Variation data: 1 GB
   • Individual features: 3 MB
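The per-genome volumes above make the scale of any multi-genome project easy to estimate. A minimal sketch (the tier labels and helper function are mine, not from the deck):

```python
# Per-genome data volumes from the slide, in gigabytes.
SIZES_GB = {
    "intensities/raw": 2000,   # 2 TB
    "sequence+quality": 500,
    "alignments": 200,
    "variation": 1,
    "features": 0.003,         # 3 MB
}

def project_volume_tb(n_genomes, tiers=SIZES_GB):
    """Total archive volume in terabytes if every tier is kept for n genomes."""
    return n_genomes * sum(tiers.values()) / 1000.0

# A 1,000-genome project implies roughly 2.7 PB of storage.
print(round(project_volume_tb(1000)))  # -> 2701
```

Most of that total is the raw intensities, which is why the later slides ask whether they need to be kept at all once alignments exist.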
7. Sharing Unstructured Data
   • Large data volumes, flat files.
   • Federated access.
     • The data is not going to be in one place.
     • A single institute will have its data distributed for DR / worldwide access.
       • Some parts of the data may be on cloud stores.
   • Controlled access.
     • Many archives will be public.
     • Some will hold patient-identifiable data.
     • Plan for it now.
8. iRODS
   • iRODS: Integrated Rule-Oriented Data System.
     • Produced by the DICE (Data Intensive Cyber Environments) group at U. North Carolina, Chapel Hill.
   • Successor to SRB.
     • SRB is used by the High-Energy Physics (HEP) community.
       • 20 PB/year of LHC data.
     • The HEP community has lots of "lessons learned" that we can benefit from.
   • A promising "glue" layer to pull archives together.
9. iRODS
   [Diagram: user interfaces (WebDAV, icommands, FUSE) talk to iRODS servers backed by data on disk, in databases and in S3; the ICAT catalogue database tracks metadata, and a rule engine implements policies]
10. Useful Features
   • Efficient.
     • Copes with PB of data and 100,000M+ files.
     • Fast parallel data transfers across local- and wide-area network links.
   • Extensible.
     • The system can be linked out to external services.
       • E.g. external databases holding metadata, external authentication systems.
   • Federated.
     • Physically and logically separate iRODS installs can be federated.
     • Allows a user at institute A to seamlessly access data at institute B in a controlled manner.
11. What are we doing with it?
   • Piloting it for internal use.
     • Helping groups keep track of their data.
     • Moving files between different storage pools.
       • Fast scratch space ↔ warehouse disk ↔ offsite DR centre.
     • Linking metadata back to our LIMS/tracking databases.
   • We need to share data with other institutions.
     • Public data is easy: FTP/HTTP.
     • Controlled data is hard:
       • Encrypt files and place them on private FTP dropboxes.
       • Cumbersome to manage, and insecure.
   • Ports trivially to the cloud.
     • Built with federation from day 1.
     • The software knows about S3 storage layers.
12. Identity Management
   • Which identity management system should we use for controlled access?
   • Culture shock.
   • Lots of solutions:
     • OpenID, Shibboleth (ASPIS), Globus/X.509, etc.
   • What features are important?
     • How much security?
     • Single sign-on?
     • Delegated authentication?
   • Finding consensus will be hard.
13. Cloud Archives
14. Dark Archives
   • Storing data in an archive is not, by itself, particularly useful.
     • You need to be able to access the data and do something useful with it.
   • Data in current archives is "dark".
     • You can put/get data, but cannot compute across it.
     • Is data in an inaccessible archive really useful?
15. Last week's bombshell
   • "We want to run our pipeline across 100 TB of data currently in EGA/SRA."
   • We would need to de-stage the data to Sanger and then run the compute:
     • An extra 0.5 PB of storage and 1,000 cores of compute.
     • 3-month lead time.
     • ~$1.5M capex.
16. Elephant in the room
17. Network Speeds
   • Moving large amounts of data across the public internet is hard.
   • Data transfer rates (gridFTP/FDT):
     • Cambridge -> EC2 East coast: 12 Mbyte/s (96 Mbit/s)
     • NCBI -> Sanger: 15 Mbyte/s (120 Mbit/s)
     • Oxford -> Sanger: 60 Mbyte/s (480 Mbit/s)
   • 77 days to pull down 100 TB from NCBI.
   • 20 days to pull down 100 TB from Oxford.
   • Can we use the CERN model?
     • Lay dedicated 10-gigabit lines between Geneva and the 10 Tier-1 centres.
     • Our collaborations are too fluid.
       • 1.5-3 years, vs 15 years for the LHC.
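The transfer-time figures follow directly from the measured rates; a quick sanity check (the helper function name is mine):

```python
def transfer_days(volume_tb, rate_mbyte_s):
    """Days to move volume_tb terabytes at a sustained rate in Mbyte/s."""
    seconds = volume_tb * 1e6 / rate_mbyte_s  # 1 TB = 1e6 Mbyte
    return seconds / 86400.0                  # 86,400 seconds per day

print(round(transfer_days(100, 15)))  # NCBI -> Sanger at 15 Mbyte/s: -> 77
print(round(transfer_days(100, 60)))  # Oxford -> Sanger at 60 Mbyte/s: -> 19 (~20 on the slide)
```

At these rates, wide-area transfer time, not disk or CPU, dominates any plan that de-stages data between sites.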
18. Cloud / Computable Archives
   • Can we move the compute to the data?
     • Upload the workload onto VMs.
     • Put the VMs on compute that is "attached" to the data.
   [Diagram: VMs migrating to CPUs co-located with the data, instead of data moving to remote CPUs]
19. Proto-Example: SSAHA Trace Search
   Trace database: ~30 TB; hash table: 320 GB.
   1. Hash the database.
   2. Distribute the hash across machines.
   3. Run the query in parallel.
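SSAHA itself is a compiled, production tool; the three steps above can still be sketched as a toy in Python, with the hash partitioned across "machines" (here, plain dicts) and the query fanned out in parallel. All names, the tiny k-mer size, and the sequences are hypothetical, chosen only to show the partition-and-query pattern:

```python
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy k-mer size; the real tool uses longer words

def kmers(seq):
    return (seq[i:i + K] for i in range(len(seq) - K + 1))

def build_partitions(database, n_nodes):
    """Steps 1-2: hash the database, splitting the table across n_nodes nodes."""
    parts = [dict() for _ in range(n_nodes)]
    for name, seq in database.items():
        for pos, word in enumerate(kmers(seq)):
            parts[hash(word) % n_nodes].setdefault(word, []).append((name, pos))
    return parts

def query(parts, q):
    """Step 3: search every partition in parallel and merge the hits."""
    def search(part):
        hits = []
        for word in kmers(q):
            hits.extend(part.get(word, []))
        return hits
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return [h for hits in pool.map(search, parts) for h in hits]

db = {"trace1": "ACGTACGT", "trace2": "TTTTACGT"}
parts = build_partitions(db, n_nodes=4)
print(sorted({name for name, _ in query(parts, "ACGT")}))  # -> ['trace1', 'trace2']
```

The point of the slide is that only the small query moves over the network; the 320 GB table stays resident in RAM across the nodes next to the data.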
20. Practical Hurdles
21. Where does it live?
   • Most of us are funded to hold data, not to fund everyone else's compute costs too.
     • We would now need to budget for raw compute power as well as disk.
     • And implement virtualisation infrastructure, billing, etc.
       • Are you legally allowed to charge?
       • Who underwrites it if nobody actually uses your service?
   • This strongly implies the data has to be held on a commercial provider.
22. Networking
   • We still need to get data in.
     • Fixing the internet is not going to be cost-effective for us.
   • Fixing the internet may be cost-effective for the big cloud providers.
     • It is core to their business model.
     • All we need to do is get the data into Amazon; everyone else can then get it from there.
   • Do we invest in fast links to Amazon?
     • It changes the business dynamic.
     • We would have effectively tied ourselves to a single provider.
23. Compute Architecture
   [Diagram: traditional HPC (fat network, POSIX global filesystem, batch scheduler) vs cloud (thin network, local storage per node, Hadoop/S3 data store)]
24. Architecture
   • Our existing pipelines do not port well to clouds.
     • They expect a POSIX shared filesystem.
     • Re-writing apps to use S3 or Hadoop/HDFS is a real hurdle.
       • Fork existing apps into HPTC and cloud streams?
     • New apps:
       • How do you run them internally? Build a cloud?
   • Am I being a reactionary old fart?
     • 15 years ago, "clusters of PCs are not real supercomputers"...
     • ...then Beowulf took over the world.
     • The big difference: porting applications between those two architectures was easy.
   • Will the market provide "traditional" compute clusters in the cloud?
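The porting gap is concrete: the same "read a slice of a file" step a pipeline relies on looks quite different against POSIX and against an object store. A sketch, using today's boto3 SDK for the S3 side purely as an illustration (not the 2010-era tooling; bucket and key names would be your own):

```python
def read_chunk_posix(path, offset, size):
    """What existing pipelines assume: seek anywhere in a shared POSIX file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_chunk_s3(bucket, key, offset, size):
    """The object-store equivalent: an HTTP ranged GET, no seek, no filesystem."""
    import boto3  # AWS SDK; needs credentials and network access to run
    s3 = boto3.client("s3")
    rng = "bytes={}-{}".format(offset, offset + size - 1)
    return s3.get_object(Bucket=bucket, Key=key, Range=rng)["Body"].read()
```

Every open/seek/read in an existing codebase needs the second treatment (or a POSIX shim such as a FUSE layer), which is the hurdle the slide describes.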
25. Summary
   • Good tools are available for building federated data archives.
   • The challenge is computing across the data at scale.
   • Network infrastructure and cloud architectures are still problematic.
26. Acknowledgements
   • Phil Butcher
   • ISG team: James Beal, Gen-Tao Chiang, Pete Clapham, Simon Kelley
   • 1k Genomes Project: Thomas Keane, Jim Stalker
   • Cancer Genome Project: Adam Butler, John Teague
27. Backup
28. Other cloud projects
29. Virtual Colo
   • The Ensembl website.
     • Access from outside Europe has been slow.
     • We built a mirror in a US West-coast commercial co-lo.
       • ~25% of total traffic uses the West-coast mirror.
   • We would like to extend mirrors to other parts of the world.
     • US East coast.
   • Building a mirror inside Amazon.
     • LAMP stack: a common workload.
     • Not (technically) challenging.
       • But there is management overhead.
     • The cost comparisons will be interesting.
30. HPTC Workloads on the Cloud
   • There are going to be lots of new genomes that need annotating.
     • Small labs have limited informatics / systems experience.
       • Typically postdocs/PhD students who have a "real" job to do.
     • Getting the Ensembl pipeline up and running takes a lot of domain expertise.
   • We have already done all the hard work of installing and tuning the software.
     • Can we package up the pipeline and put it in the cloud?
   • Goal: the end user should simply upload their data, insert their credit-card number, and press "GO".