Next-generation sequencing: Data management


Next-generation sequencing is producing vast amounts of data. Providing storage and compute is only half the battle. Researchers and IT staff need to be able to "manage" data, in order to stay productive.

Talk given at BIO-IT World, Europe 2010.

Published in: Technology

  1. 1. Next-Gen Sequencing: Data Management Guy Coates Wellcome Trust Sanger Institute [email_address]
  2. 2. About the Institute <ul><li>Funded by Wellcome Trust. </li><ul><li>2nd largest research charity in the world.
  3. 3. ~700 employees. </li></ul><li>Large scale genomic research. </li><ul><li>Sequenced 1/3 of the human genome (largest single contributor).
  4. 4. We have active cancer, malaria, pathogen and genomic variation studies. </li></ul><li>All data is made publicly available. </li><ul><li>Websites, FTP, direct database access, programmatic APIs. </li></ul></ul>
  5. 5. BioIT Europe:
  6. 6. The Scary Graph [Figure annotations: "Instrument upgrades"; "Peak yearly capillary sequencing"]
  7. 7. The Scary Graph
  8. 8. Managing Growth <ul><li>We have exponential growth in storage and compute. </li><ul><li>Storage/compute doubles every 12 months. </li><ul><li>2009 ~7 PB raw </li></ul></ul><li>Moore's law will not save us. </li><ul><li>Transistor/disk density: doubling time T_d = 18 months
  9. 9. Sequencing cost: T_d = 12 months </li></ul></ul>
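A minimal worked example of why the two doubling times diverge. It assumes pure exponential growth, with data volume doubling every 12 months and hardware density doubling every 18 months as stated on the slide. The function names are illustrative and not from the talk.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Factor by which a quantity grows after `years`, given its doubling time."""
    return 2.0 ** (years / doubling_time_years)


def capacity_gap(years: float, demand_td: float = 1.0, supply_td: float = 1.5) -> float:
    """Ratio of data growth (T_d = 12 months) to density growth (T_d = 18 months)."""
    return growth_factor(years, demand_td) / growth_factor(years, supply_td)


if __name__ == "__main__":
    for years in (3, 6):
        print(
            f"after {years} years: data x{growth_factor(years, 1.0):.0f}, "
            f"density x{growth_factor(years, 1.5):.0f}, "
            f"gap x{capacity_gap(years):.0f}"
        )
```

Under these assumptions, a fixed budget buys 4x the capacity after three years while the data has grown 8x, so the shortfall itself doubles every three years.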
  10. 10. Classic Sanger “Stealth project” <ul><li>Summer 2007 </li><ul><li>first early access sequencer. </li></ul><li>Not long after: </li><ul><li>“15 sequencers have been ordered. They are arriving in 8 weeks. Can we have some storage and computers?” </li></ul><li>A fun summer was had by all! </li></ul>
  11. 11. Classic Sanger “Stealth project” <ul><li>Early 2010 </li><ul><li>HiSeq announced. </li></ul><li>Not long after: </li><ul><li>“30 sequencer upgrades have been ordered. They are arriving in 8 weeks. Can we have some storage and computers?” </li></ul><li>A fun summer was had by all! </li></ul>
  12. 12. What we learned... <ul><li>“Masterly inactivity” </li><ul><li>We had 6 months where we bought no storage.
  13. 13. Nobody stops to tidy up until they have no more disk space. </li></ul><li>Data-triage: </li><ul><li>We are much more aggressive about throwing away data we no longer need. </li><ul><li>No raw image files, SRF or FASTQ.
  14. 14. BAM only. </li></ul></ul><li>Storage-Tax: </li><ul><li>PIs requesting sequencing have a storage surcharge applied to them.
  15. 15. Historically sequencing and IT were budgeted separately.
  16. 16. Makes PIs aware of the IT costs, even if it does not cover 100%. </li></ul></ul>
  17. 17. Flexible Infrastructure <ul><li>Modular design. </li><ul><li>Blocks of network, compute and storage.
  18. 18. Assume from day 1 we will be adding more.
  19. 19. Expand simply by adding more blocks. </li></ul><li>Make storage visible from everywhere. </li><ul><li>Key enabler; lots of 10Gig. </li></ul><li>This allows us to move compute jobs between farms. </li><ul><li>Logically rather than physically separated.
  20. 20. Currently using LSF to manage workflow. </li></ul></ul>[Diagram labels: LSF; fast scratch disk; archival / warehouse disk; network]
  21. 21. Our Modules: <ul><li>KISS: Keep It Simple, Stupid. </li><ul><li>Tendency to go for the “clever” solution.
  22. 22. Simple might not be so robust, but it is much simpler and faster to fix if it breaks. More reliable in practice. </li></ul><li>Compute: </li><ul><li>Racks of blades. </li></ul><li>Bulk Storage: </li><ul><li>Nexsan SATABeast. RAID 6 SATA disks. </li><ul><li>Directly attached via FC or served via Linux / NFS.
  23. 23. 50-100TB chunks. </li></ul></ul><li>Fast Storage: </li><ul><li>DDN9900/10000 + Lustre (250TB chunks) </li><ul><li>(KISS violation). </li></ul></ul><li>Reasonably successful. </li><ul><li>Takes longer than we would like to physically install it. </li></ul></ul>
  24. 24. Data management <ul><li>100TB filesystem, 136M files. </li><ul><li>“Where is the stuff we can delete so we can continue production...?” </li></ul></ul>
<pre>
# df -h
Filesystem         Size  Used  Avail  Use%  Mounted on
lus02-mds1:/lus02  108T  107T     1T   99%  /lustre/scratch102
# df -i
Filesystem         Inodes     IUsed      IFree      IUse%  Mounted on
lus02-mds1:/lus02  300296107  136508072  163788035    45%  /lustre/scratch102
</pre>
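One way to start answering "where is the stuff we can delete?" is to total file counts and bytes per top-level directory, optionally restricted to files that have not been modified recently. The sketch below is an illustration only and is not a tool described in the talk. The function name and the age threshold are assumptions.

```python
import os
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple


def summarise_usage(
    root: str, min_age_days: float = 0.0, now: Optional[float] = None
) -> Dict[str, Tuple[int, int]]:
    """Return {top_level_dir: (file_count, total_bytes)} under `root`.

    Only files whose mtime is at least `min_age_days` old are counted.
    Files directly in `root` are reported under ".".
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    totals = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime <= cutoff:
                totals[top][0] += 1
                totals[top][1] += st.st_size
    return {key: (count, size) for key, (count, size) in totals.items()}


if __name__ == "__main__":
    # Example: report directories holding files untouched for 90 days.
    for top, (count, size) in sorted(summarise_usage(".", min_age_days=90).items()):
        print(f"{top}: {count} files, {size} bytes")
```

A single-threaded walk like this would likely be far too slow on a filesystem holding 136M files, which is part of the argument for tracking data in a catalogue rather than rediscovering it by scanning.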
  25. 25. Sequencing data flow. [Diagram labels: automated processing and data management; sequencer; analysis / alignment; internal repository; EGA / SRA (EBI); compute-farm; high-performance storage; manual data movement]
  26. 26. Unmanaged data <ul><li>Investigators take sequence data off the pipeline and “do stuff” with it. </li><ul><li>Data inflation. </li><ul><li>10x the space of the “raw” data. </li></ul></ul><li>Data is left in the wrong place. </li><ul><li>Typically left where it was created. </li><ul><li>Moving data is hard and slow. </li></ul><li>Important data left in scratch areas, or high IO analysis being run against slow storage. </li></ul><li>Finding data is impossible. </li><ul><li>Where is the important data? </li><ul><li>Everyone creates a copy for themselves, “just to be sure.” </li></ul><li>Are we backing up the important stuff?
  27. 27. Are we keeping control of our “private” datasets? </li></ul></ul>
  28. 28. Managing unstructured data <ul><li>Automation is key: </li><ul><li>Computers, not people moving data around.
  29. 29. Works well for the pipelines where it is currently used. </li></ul><li>Hard to get buy-in from our “non production” users. </li><ul><li>Added complication that gets in the way of doing ad-hoc analysis. </li></ul><li>Our Breakthrough Moment: </li><ul><li>One of our informatics teams mentioned that they had written a simple data tracking application for their team. </li><ul><li>“We kept losing our files, or running out of disk space halfway through an analysis.” </li></ul></ul><li>Big benefits: </li><ul><li>Big increase in productivity.
  30. 30. 50% reduction in disk utilisation. </li><ul><li>50% of 2PB is a lot of $. </li></ul><li>Easy to do capacity planning. </li></ul></ul>
  31. 31. Bottlenecks: <ul><li>Data management now impacting on productivity. </li><ul><li>Groups who control their data get much more done.
  32. 32. As data sizes increase, even “small data” groups get hit. </li></ul><li>Money talks: </li><ul><li>“Group A only needs ½ the storage budget of group B to do the same analysis.” </li><ul><li>Powerful message. </li></ul></ul><li>We do not want lots of distinct data tracking systems. </li><ul><li>Avoid wheel reinvention.
  33. 33. Groups need to exchange data.
  34. 34. Small groups do not have the manpower to hack something together. </li></ul><li>We need something with a simple interface so it can easily support ad-hoc requests. </li></ul>
  35. 35. Sequencing data flow. [Diagram labels: automated processing and data management; manual; sequencer; analysis / alignment; internal repository; EGA / SRA (EBI); compute-farm; high-performance storage; managed data movement]
  36. 36. What are we using? <ul><li>iRODS: Integrated Rule-Oriented Data System. </li><ul><li>Produced by DICE Group (Data Intensive Cyber Environments) at U. North Carolina, Chapel Hill. </li></ul><li>Successor to SRB. </li><ul><li>SRB used by the High-Energy-Physics (HEP) community. </li><ul><li>20PB/year LHC data. </li></ul><li>HEP community has lots of “lessons learned” that we can benefit from. </li></ul></ul>
  37. 37. iRODS [Diagram labels: ICAT catalogue database; rule engine (implements policies); iRODS server (data on disk); iRODS server (data in database); iRODS server (data in S3); user interface (WebDAV, icommands, FUSE)]
  38. 38. iRODS Features <ul><li>Store data and metadata. </li><ul><li>Metadata can be queried. </li></ul><li>Scalable: </li><ul><li>Copes with PB of data and 100,000M+ files.
  39. 39. Replicates data.
  40. 40. Fast parallel data transfers across local and wide area network links. </li></ul><li>Extensible </li><ul><li>System can be linked out to external services. </li><ul><li>E.g. external databases holding metadata, external authentication systems. </li></ul></ul><li>Federated </li><ul><li>Physically and logically separated iRODS installs can be federated across institutions. </li></ul></ul>
  41. 41. First implementation Automated processing and data management Manual Sequencer Analysis/ alignment Internal repository EGA / SRA (EBI) compute-farm High-performance storage
  42. 42. First Implementation <ul><li>Simple archive system. </li><ul><li>It is our first production system: KISS.
  43. 43. Holds BAM files, and a small amount of metadata. </li></ul><li>Rules:
  44. 44. Replicate: </li><ul><li>All files replicated across storage held in two different data centres. </li></ul><li>Set access controls: </li><ul><li>Enforce access-controls for some “confidential” datasets. </li><ul><li>Automatically triggered from study metadata. </li></ul></ul></ul>
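The two rules above can be sketched as plain Python to show the logic. This is an illustrative model only: it is not iRODS rule-language code, and the resource names, the confidential study name and the group mapping are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical resource names, one per data centre (not the real configuration).
RESOURCES = ("dc1-archive", "dc2-archive")
# Hypothetical set of studies whose data must be access-controlled.
CONFIDENTIAL_STUDIES = {"EXAMPLE CONFIDENTIAL STUDY"}


@dataclass
class DataObject:
    path: str
    metadata: dict
    replicas: set = field(default_factory=set)
    readers: set = field(default_factory=lambda: {"public"})


def replicate_rule(obj: DataObject) -> None:
    """Ensure the object has a replica in both data centres."""
    obj.replicas.update(RESOURCES)


def access_rule(obj: DataObject, study_groups: dict) -> None:
    """Restrict readers when the study metadata marks the data as confidential."""
    study = obj.metadata.get("study")
    if study in CONFIDENTIAL_STUDIES:
        obj.readers = set(study_groups.get(study, ()))


def on_ingest(obj: DataObject, study_groups: dict) -> DataObject:
    """Apply both rules automatically when a file enters the archive."""
    replicate_rule(obj)
    access_rule(obj, study_groups)
    return obj


if __name__ == "__main__":
    groups = {"EXAMPLE CONFIDENTIAL STUDY": ["team-x"]}
    obj = on_ingest(
        DataObject("/seq/0000/0000_1.bam", {"study": "EXAMPLE CONFIDENTIAL STUDY"}),
        groups,
    )
    print(sorted(obj.replicas), sorted(obj.readers))
```

The point of the design is that nobody has to remember to replicate or lock down a file. Both actions follow from the metadata attached at ingest.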
  45. 45. Example access:
<pre>
$ icd /seq/5307
$ ils
/seq/5307:
  5307_1.bam
  5307_2.bam
  5307_3.bam
$ ils -l 5307_1.bam
  srpipe  0  res-g2  1987106409  2010-09-24.13:35  &  5307_1.bam
  srpipe  1  res-r2  1987106409  2010-09-24.13:36  &  5307_1.bam
</pre>
  46. 46. Metadata
<pre>
imeta ls -d /seq/5307/5307_1.bam
AVUs defined for dataObj /seq/5307/5307_1.bam:
attribute: type
value: bam
units:
----
attribute: sample
value: BG81
units:
----
attribute: id_run
value: 5307
units:
----
attribute: lane
value: 1
units:
----
attribute: study
value: TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE
units:
----
attribute: library
value: BG81 449223
units:
</pre>
  47. 47. Query
<pre>
imeta qu -d study = "TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE"
collection: /seq/5307
dataObj: 5307_1.bam
----
collection: /seq/5307
dataObj: 5307_2.bam
----
collection: /seq/5307
dataObj: 5307_3.bam
----
</pre>
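To make the attribute/value model behind these commands concrete, here is a small in-memory sketch of a metadata catalogue that supports the same two operations: listing the attribute-value pairs on an object, and finding objects by attribute and value. It is a toy model under stated assumptions, not the iRODS API. The class and method names are invented for illustration, and units are omitted.

```python
from collections import defaultdict


class MetadataCatalogue:
    """Toy attribute/value store keyed by object path (units omitted)."""

    def __init__(self):
        self._avus = defaultdict(set)  # path -> {(attribute, value)}

    def add(self, path, attribute, value):
        self._avus[path].add((attribute, value))

    def avus_for(self, path):
        """All (attribute, value) pairs on one object, like `imeta ls -d`."""
        return sorted(self._avus.get(path, ()))

    def query(self, attribute, value):
        """Paths whose metadata contains attribute = value, like `imeta qu -d`."""
        return sorted(
            path for path, avus in self._avus.items() if (attribute, value) in avus
        )


if __name__ == "__main__":
    cat = MetadataCatalogue()
    study = "TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE"
    for lane in ("1", "2", "3"):
        path = f"/seq/5307/5307_{lane}.bam"
        cat.add(path, "study", study)
        cat.add(path, "lane", lane)
    print(cat.query("study", study))
```

The useful property is that a file is found by what it is (study, run, lane) rather than by where someone happened to leave it.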
  48. 48. So what...?
  49. 49. Next steps [Diagram labels: Sanger iRODS; datacentre 1; datacentre 2; replicate; EGA/ERA; automated release/purge; collaborator iRODS; federate]
  50. 50. Wishlist: HPC Integration [Diagram labels: data is staged in/out to filesystem; archive / metadata system; fast storage / POSIX filesystem; compute farm; fast storage / POSIX filesystem + metadata system; compute farm] System can do rule/metadata based ops and standard POSIX ops too.
  51. 51. Managing Workflow
  52. 52. Modular Compute <ul><li>We have a very heterogeneous network. </li><ul><li>Modules of storage and compute.
  53. 53. Storage and servers spread across several locations. </li></ul></ul>[Diagram labels: storage modules; CPU modules; fast link; medium link; slow link]
  54. 54. How do we manage data and workflow? <ul><li>Some compute and data is “closer” than other parts. </li><ul><li>Jobs should use compute that is near their data. </li></ul><li>How do we steer workload to where we want it? </li><ul><li>We may want to mark modules offline for maintenance, and steer workload away from them. </li></ul></ul>
  55. 55. LSF Data Aware Scheduler <ul><li>How it works: </li><ul><li>LSF has a map describing the storage pool /compute topology. </li><ul><li>Simple weighting matrix.
  56. 56. LSF knows how much free space is available on each pool. </li></ul><li>Users can optionally register datasets as being on a particular storage pool. </li></ul><ul><li>Users submit a job request. </li><ul><li>May include a dataset request, an amount of storage, or a storage-distance request. </li></ul></ul><ul><li>LSF finds free machines and storage. </li><ul><li>Storage location is passed as an environment variable into the job at runtime. </li></ul></ul></ul>
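The placement logic described above can be sketched as follows. This is not LSF code or configuration. It is a minimal Python model of the idea, and every name in it (module names, pool names, the dataset name, the environment variable) is a hypothetical stand-in.

```python
# Hypothetical weighting matrix: DISTANCE[compute_module][storage_pool].
# Lower numbers mean the pool is "closer" to that compute module.
DISTANCE = {
    "farm-a": {"pool-1": 1, "pool-2": 5},
    "farm-b": {"pool-1": 5, "pool-2": 1},
}
# Hypothetical free space per pool, in TB.
FREE_TB = {"pool-1": 2, "pool-2": 40}
# Datasets that users have registered as living on a particular pool.
DATASETS = {"run5307": "pool-1"}


def place_job(free_modules, dataset=None, storage_tb=0, max_distance=None):
    """Return the (module, pool) pair with the smallest distance, or None.

    A dataset request pins the pool to wherever the dataset is registered;
    otherwise any pool with at least `storage_tb` free is a candidate.
    """
    if dataset is not None:
        pools = [DATASETS[dataset]]
    else:
        pools = [pool for pool, free in FREE_TB.items() if free >= storage_tb]
    candidates = [
        (DISTANCE[module][pool], module, pool)
        for module in free_modules
        for pool in pools
        if max_distance is None or DISTANCE[module][pool] <= max_distance
    ]
    if not candidates:
        return None
    _, module, pool = min(candidates)
    return module, pool


def job_environment(pool):
    """Environment handed to the job; the variable name is an assumption."""
    return {"JOB_STORAGE_POOL": pool}


if __name__ == "__main__":
    module, pool = place_job(["farm-a", "farm-b"], dataset="run5307")
    print(module, job_environment(pool))
```

Marking a module offline for maintenance then amounts to leaving it out of `free_modules`, which steers work elsewhere without users changing their job requests.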
  57. 57. Future Work <ul><li>Let the system know about load on storage. </li><ul><li>Perhaps storage that is further away is better, if the nearest storage is really busy. </li></ul><li>Let the system move data. </li><ul><li>The system currently moves jobs. Users are responsible for placing and registering datasets.
  58. 58. Hot datasets change over time.
  59. 59. Replicate/move the datasets to faster storage, or a greater number of storage pools. </li></ul><li>Making LSF do data migration/replication will be hard. </li><ul><li>If only there was some data-grid software that did that already... </li></ul></ul>
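The first item, letting the system take storage load into account, could be modelled as a cost that combines topological distance with current utilisation. The sketch below is purely illustrative. The cost formula, the weight and the pool figures are assumptions, not anything described in the talk.

```python
def effective_cost(distance: float, load: float, load_weight: float = 10.0) -> float:
    """Combine distance with load (0.0 idle to 1.0 saturated) into one score."""
    return distance + load_weight * load


def pick_pool(pools: dict) -> str:
    """Choose the pool with the lowest effective cost.

    `pools` maps pool name -> (distance, load).
    """
    return min(pools, key=lambda name: effective_cost(*pools[name]))


if __name__ == "__main__":
    # The near pool is almost saturated, so the far pool wins.
    print(pick_pool({"near": (1, 0.95), "far": (5, 0.10)}))
```

With these made-up numbers the busy nearby pool scores 10.5 and the quiet distant pool scores 6.0, so the job is steered further away, which is the behaviour the slide suggests might be preferable.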
  60. 60. Acknowledgements <ul><li>Sanger Institute
  61. 61. Phil Butcher
  62. 62. ISG </li><ul><li>James Beal
  63. 63. Gen-Tao Chiang
  64. 64. Pete Clapham
  65. 65. Simon Kelley </li></ul><li>Platform Computing </li><ul><li>Chris Smith
  66. 66. Chris Duddington
  67. 67. Da Xu </li></ul></ul>