Storage for next-generation sequencing


The computational requirements of next-generation sequencing are placing a huge demand on IT organisations.

Building compute clusters is now a well-understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high IO rates.

This talk details our approach to providing storage for next-gen sequencing applications.

Talk given at BIO-IT World Europe, 2009.
New Sequencing Storage
Guy Coates, Wellcome Trust Sanger Institute, [email_address]
About the Institute
- Funded by the Wellcome Trust.
  - 2nd largest research charity in the world.
  - ~700 employees.
- Large-scale genomic research.
  - Sequenced 1/3 of the human genome (largest single contributor).
  - Active cancer, malaria, pathogen and genomic variation studies.
- All data is made publicly available.
  - Websites, FTP, direct database access, programmatic APIs.
- Recent initiatives: 1000 Genomes.
Overview
1. Scramble for next-gen sequencing
2. The data explosion
3. Building flexible systems
4. Future directions
Scramble for Next-gen Sequencing
Classic Sanger “Stealth Project”
- Summer 2007: first early-access sequencer.
- Not long after: “15 sequencers have been ordered. They are arriving in 8 weeks. Can we have some storage and computers?”
- A fun summer was had by all!
What are we dealing with?
- It all started getting very Rumsfeld-ian:
- “There are known knowns.”
  - Must be in place by Oct 1st.
  - How much data will be coming off?
- “There are known unknowns.”
  - How does the analysis work?
  - Will we be CPU bound or IO bound?
  - What will the growth rate be?
- “But there are also unknown unknowns.”
Modular
- Small: 1 machine + disk pool per sequencer.
  - Simple.
  - Easy to scale.
  - Small unit of failure; a system failure should only affect one sequencer.
  - Hard to right-size, especially if things change.
(Diagram: a CPU + disk block attached to each sequencer.)
Monolithic
- Large pool of machines and storage for all sequencers.
  - Flexible: can cope with fluctuations in CPU/storage requirements.
  - Complicated: clustered storage at scale is hard.
  - Eliminating SPOFs is hard.
  - Scale-out can be expensive.
(Diagram: all sequencers feeding a CPU farm and cluster filesystem over a shared network / storage fabric.)
Supersize me
- We went monolithic.
  - We had experience with large storage and large clusters on our compute farm.
- Sequencing compute farm:
  - 512 cores for pipeline and downstream analysis.
  - 384 cores for the sequencers, the rest for other analysis.
- 300TB of storage.
  - 3x 100TB Lustre filesystems (HP SFS / Lustre 1.4).
  - Lots of performance if we need it.
  - Plenty of space to cope with changing needs.
- Lustre storage was scratch space.
  - Intermediate image data thrown away.
- Final data (sequence + quality) kept in an Oracle data warehouse.
(Diagram: Seq 1 ... Seq 38 -> EVA staging area -> data pull onto the 300TB Lustre scratch, where the compute farm runs the analysis/QC pipeline -> final repository (Oracle), ~100TB/yr.)
Hurrah!
- Sequencing storage problems solved, back home in time for tea and medals.
The Data Explosion
The scary graph
(Graph: sequencing output over time, annotated with instrument upgrades and the peak yearly capillary-sequencing level.)
Sequencing is not everything...
- We had been focused on the sequencing pipeline.
  - Taking instrument output and producing DNA sequence + quality.
- For many investigators, finished sequence is where they start.
  - Investigators take the mass of finished sequence data and start computing on it.
- Big increase in data all across the institute.
Expanding Everything
- Sequencers: 15 -> 38, numerous upgrades, run-length increases, paired end, etc.
- Sequencing compute farm: ~500 -> 1000 cores, 300 -> 600TB.
- General compute farm: 2000 -> 5000 cores, 20TB -> 300TB.
- General storage: 2PB -> 4PB.
- Most of the increases were not in the “sequencing” area.
(Diagram: the pipeline picture again, now with the general compute farm, compute-farm disk and collaborators / 3rd-party sequencing feeding in; only the LIMS-managed data is tracked, the rest is unmanaged.)
Pipeline data is managed
- Data in the sequencing pipeline is tracked.
  - We know how much there is, and who it belongs to.
- Data has a defined life-cycle.
  - Intermediate data (images etc.) is deleted after the runs pass QC.
  - Important data (finished sequence) is automatically moved to our archive, backed up and replicated off-site.
- Good communication between the pipeline / LIMS developers and the systems team.
  - We know who to talk to.
  - We get a good heads-up for changes/plans.
Data outside the pipeline is not
- Investigators take data and “do stuff” with it.
  - Analysis takes lots of space; 10x the space of the “raw” data.
- Data is left in the wrong place.
  - Typically left where it was created.
  - Moving data is hard and slow.
  - Important data is left in scratch areas, or high-IO analysis is run against slow storage.
- Capacity planning becomes impossible.
  - Who is using our disk space? “du” on 4PB is not going to work... (see the accounting sketch after this slide).
  - Are we getting duplication of datasets?
- How do we account for it?
  - We need to help investigators come up with costings that include analysis costs as well as the costs of the initial sequencing.
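As a concrete (and entirely hypothetical) illustration of how per-user accounting can be done without running “du” across 4PB on demand: summarise a nightly metadata scan instead. The scan file, its tab-separated uid/size/path layout and the path below are assumptions for the sketch, not a description of the Institute's actual tooling.

# Minimal sketch, assuming a nightly filesystem scan that emits "uid<TAB>bytes<TAB>path"
# lines; aggregate it per user rather than walking 4PB of files with du.
import csv
from collections import defaultdict

SCAN_FILE = "/reports/nightly_scan.tsv"  # hypothetical location of the scan output


def usage_by_user(scan_file):
    totals = defaultdict(int)
    with open(scan_file, newline="") as fh:
        for uid, size, _path in csv.reader(fh, delimiter="\t"):
            totals[uid] += int(size)
    return totals


if __name__ == "__main__":
    for uid, used in sorted(usage_by_user(SCAN_FILE).items(), key=lambda kv: -kv[1]):
        print("%s\t%.2f TB" % (uid, used / 1e12))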
Old architecture
- Separate compute silos for separate groups.
  - E.g. cancer, pathogen, sequencing.
- “Fast” cluster storage (Lustre) for IO-bound work.
  - Constrained by network topology/bandwidth.
- Problems:
  - Projects span the organisational domains.
  - Projects can outgrow their compute + storage.
  - Fast storage is always full.
  - Cannot afford to buy more.
(Diagram: Farm 1 and Farm 2, each on its own network with its own filesystem.)
Another scary graph
(Graph: growth over time in storage (4PB), compute (5K cores) and sequencing output (TBases), compared with headcount and budget.)
Building Flexible Systems
Agile
- Science is changing very rapidly.
  - Changing science usually means more data.
- The LIMS / pipeline software development teams use agile methods to cope with changes.
  - Short, iterative development process.
  - Bi-weekly updates to the pipeline.
  - Evolutionary process.
  - Allows changes to be put in place very quickly.
- Can we do “agile” systems?
  - If systems can adapt easily, we do not have to worry about changes in science.
- Buzzword compliant!
Plan of attack
1. Look at the workflow and data.
2. Start managing data.
3. Tie it all together into a flexible infrastructure.
Workflows
- Identified 3 data patterns:
- High-IO, active datasets.
  - Data being crunched on our compute farm.
  - Needs high-performance storage.
- “Active archive” datasets.
  - Projects that have been finished but still need to be around: reference datasets, previous tranches of data.
  - Does not need to be fast, but it needs to be cheap, as we are going to have lots of it.
- Stuff in the middle.
  - Home directories etc.
High speed disk
- Lots of options around for high-speed cluster storage.
  - Lustre, GPFS, Isilon, Panasas, pNFS etc.
  - These are exotic and expensive.
  - Expect to break them, and expect to spend time on care and feeding.
  - But you need them at scale.
  - We use DDN + Lustre 1.6 + a support contract (a small client-side sketch follows below).
- Single-namespace filesystems across clusters are nice; our investigators really like them.
  - When they work :)
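For concreteness, a small sketch of driving a single-namespace Lustre scratch from any client node with the standard lfs utility; the mount point and directory name below are assumptions, not the Institute's actual layout.

# Minimal sketch: inspect and tune a Lustre filesystem from a client node.
# The mount point is hypothetical; lfs is the standard Lustre client tool.
import subprocess

LUSTRE_MOUNT = "/lustre/scratch"  # hypothetical single-namespace mount, visible cluster-wide

# Show per-OST usage for the whole filesystem.
subprocess.run(["lfs", "df", "-h", LUSTRE_MOUNT], check=True)

# Stripe a directory across all OSTs (-c -1) so large sequencing files see full bandwidth.
subprocess.run(["lfs", "setstripe", "-c", "-1", LUSTRE_MOUNT + "/bigfiles"], check=True)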
Low speed / bulk disk
- This will be the bulk of our data; we need lots of it.
  - Price / TB is the critical factor.
  - Shortly followed by space and power footprints.
    - Power and space are constraints for us.
    - Dense disk shelves.
    - MAID functionality.
  - Needs to be manageable.
- And don't forget the backup.
  - Backup to tape is probably not practical.
  - Use disk -> disk replication (sketched below).
  - You need 2x of whatever you buy.
    - Do we do this in hardware? Software? Both?
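A minimal sketch of the disk-to-disk replication option done in software, assuming a primary and a replica bulk pool mounted at made-up paths and using rsync for the mirror; it stands in for whatever hardware or software replication is eventually chosen.

# Minimal sketch: software disk -> disk replication of a bulk-storage tree with rsync.
# Both mount points are hypothetical.
import subprocess

PRIMARY = "/bulk/primary/"   # trailing slash: replicate the contents, not the directory itself
REPLICA = "/bulk/replica/"

subprocess.run(
    ["rsync",
     "-a",         # archive mode: preserve permissions, times, ownership
     "--delete",   # mirror deletions so the replica tracks the primary
     PRIMARY, REPLICA],
    check=True,
)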
Start managing data
- How do we distribute data between these two disk pools?
- Manually.
  - This is less than ideal, but you have to start somewhere.
  - Work with the researchers to identify data and then map it onto storage.
  - Works well with the power users.
  - Can be difficult with transient projects that do one-off large-scale analysis.
    - Data is orphaned; it takes a long time to track down who is actually responsible for it.
  - Stick rather than carrot: quotas, limits.
- Still a massive improvement.
Flexible Infrastructure
- Make storage visible from everywhere.
  - Key enabler: lots of 10GigE.
- This allows us to move compute jobs between farms.
  - Logically rather than physically separated.
  - Currently using LSF to manage workflow (see the submission sketch below). VMs in the future?
- Modular design.
  - Blocks of network, compute and storage.
  - Assume from day 1 that we will be adding more.
  - Expand simply by adding more blocks.
(Diagram: LSF scheduling across the blocks, with fast scratch disk and archival/warehouse disk visible to all of them.)
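Because every farm sees the same storage, moving work is just a scheduler decision. A minimal sketch of submitting an analysis job to LSF from Python; the queue name, resource string, log path and script are assumptions.

# Minimal sketch: submit a job to LSF; retargeting another farm is just a queue change.
import subprocess

QUEUE = "seq-pipeline"           # hypothetical queue name
SCRIPT = "/software/run_qc.sh"   # hypothetical analysis script

subprocess.run(
    ["bsub",
     "-q", QUEUE,                                   # which farm/queue to run on
     "-n", "8",                                     # cores requested
     "-R", "select[mem>16000] rusage[mem=16000]",   # LSF resource requirement string
     "-o", "/lustre/scratch/logs/qc.%J.out",        # %J expands to the LSF job ID
     SCRIPT],
    check=True,
)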
Future Directions
A.K.A. stuff that looks good in PowerPoint but we still have to actually do.
Modular Systems
- Modular approach for compute and network is in place.
  - Add racks / chassis of compute and tie it all together with a load of 10GigE networking.
  - OS management / deployment tools take care of managing configs for us.
- Fast disk modules are in place.
  - DDN / Lustre.
- We have not identified our building block for the bulk storage.
  - How big should a storage module be?
    - Storage fails (even enterprise storage).
    - What is the largest amount of storage we are comfortable with losing?
  - How much disk behind a controller?
    - Affects price / TB dramatically.
  - And all the other features: MAID, dense, power-efficient, scalable, replicable, etc.
Data management
- Data management / movement needs to be done automatically.
  - People are error-prone; software is not.
  - We need some sort of HSM / ILM system (a minimal policy sketch follows below).
- Can we find one that:
  - Works at scale (10s of PBs, billions of files)?
  - Does not tie us in to a filesystem/storage setup that will be obsolete in 5 years?
- How far should we empower end users to manage their data?
  - They know the most about their data and their workflow.
  - However, they are scientists, not sysadmins.
  - Their data is our responsibility.
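To make the ILM idea concrete, a minimal policy sketch (not any particular product): sweep files that have not been read for a set number of days from the fast tier to the bulk/archive tier, preserving relative paths. The paths and the 90-day threshold are assumptions, and a real system would work from the filesystem's own metadata rather than a full directory walk.

# Minimal ILM-style policy sketch: migrate files untouched for AGE_DAYS from the
# fast scratch tier to the bulk/archive tier, keeping their relative paths.
import os
import shutil
import time

SCRATCH = "/lustre/scratch"      # hypothetical fast tier
ARCHIVE = "/warehouse/archive"   # hypothetical bulk tier
AGE_DAYS = 90


def migrate_cold_files():
    cutoff = time.time() - AGE_DAYS * 86400
    for root, _dirs, files in os.walk(SCRATCH):
        for name in files:
            src = os.path.join(root, name)
            if os.stat(src).st_atime < cutoff:       # last access older than the cutoff
                dst = os.path.join(ARCHIVE, os.path.relpath(src, SCRATCH))
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)                # copies then unlinks across filesystems


if __name__ == "__main__":
    migrate_cold_files()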
Cloud
Cloud...
- Cloud as compute on demand:
  - We have spiky compute demands, especially when “stealth” sequencing or analysis projects break cover.
- Cloud as an external datastore:
  - Cheap, off-site data archiving (disaster recovery).
- Cloud as a distributed datastore:
  - Our data is publicly available, but downloading 5TB of data across the public internet is not a pleasant experience.
  - Cloud providers can do worldwide replication / content delivery.
- Cloud as a collaboration:
  - Data on its own is not very useful.
  - Bundle data and analysis pipelines for others to use.
Our Cloud Experiments
- Can we take an Illumina run (4TB), run the image analysis pipeline and then align the results?
- The pipeline ran, but it needed some re-writing.
  - IO in Amazon is slow unless you use S3.
    - NFS on EBS performance is unusable with > 8 clients.
  - S3 is not POSIX.
    - Even with a FUSE layer, code re-writing was required (illustrated below).
- Cloud compute is easy; cloud storage is hard.
- Getting data in and out is slow:
  - We realised ~10% of our theoretical bandwidth to Amazon, even with gridFTP.
- Promising, but more work needed...
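To illustrate the “S3 is not POSIX” point: code that used to open() files on a shared filesystem has to be rewritten against the object-store API. A minimal sketch using the boto3 library (today's AWS SDK for Python, not necessarily the tooling used in these experiments); the bucket, keys and local paths are made up.

# Minimal sketch: staging run data through S3 instead of a POSIX filesystem.
# Bucket, keys and local paths are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-sequencing-runs"

# Upload one lane's intensity file: an explicit API call, not a filesystem write.
s3.upload_file("/staging/run42/lane1.int.gz", BUCKET, "run42/lane1.int.gz")

# Pull it back onto a worker node's local disk before the pipeline can open() it.
s3.download_file(BUCKET, "run42/lane1.int.gz", "/scratch/run42/lane1.int.gz")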
Acknowledgements
