Sharing data: Sanger Experiences

Sharing large amounts of data is easier said than done. This talk gives an overview of our experiences doing big-data science over wide-area networks.

Data Sharing: Sanger Experiences
Guy Coates, Wellcome Trust Sanger Institute, [email_address]

Background
- Moving large amounts of data:
  - Cloud Experiments: moving data to the cloud
  - Production Pipelines: moving data to EBI
  - Do we need to move this data at all?
Cloud Experiments
- Can we move some Solexa image files to AWS and run our processing pipeline?
- Answer: No.
  - Moving the data took much longer than processing it.
  - First attempt: 14 Mbit/s out of a 2 Gbit/s link.
Do some reading
- http://fasterdata.es.net
  - From the Department of Energy Office of Science.
  - Covers all of the technical bits and pieces required to make wide-area transfers go fast.
Getting better
- Use the right tools:
  - Use WAN tools (gridFTP/FDT/Aspera), not rsync/ssh.
  - Tune your TCP stack (see the sketch below).
- Data transfer rates:
  - Cambridge -> EC2 East coast: 12 Mbyte/s (96 Mbit/s)
  - Cambridge -> EC2 Dublin: 25 Mbyte/s (200 Mbit/s)
- What speed should we get?
  - Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
- How do we get the broken bits in the middle fixed?
  - Finding the person responsible for a broken router on the "internet" is hard.
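A quick way to see why stack tuning matters is the bandwidth-delay product: TCP can only keep bandwidth × round-trip time worth of data in flight, so an untuned window caps throughput regardless of link speed. A minimal sketch of the arithmetic; the 1 Gbit/s path and 90 ms RTT are illustrative assumptions, not measurements from the talk:

```python
# Bandwidth-delay product: the TCP buffer needed to keep a long, fat pipe full.
# Illustrative numbers only; substitute your own measured RTT and link speed.

def bdp_bytes(link_mbit_s: float, rtt_ms: float) -> float:
    """Return bandwidth * RTT in bytes."""
    return (link_mbit_s * 1e6 / 8) * (rtt_ms / 1e3)

def capped_rate_mbit_s(window_bytes: float, rtt_ms: float) -> float:
    """Maximum throughput when the TCP window is the limiting factor."""
    return window_bytes * 8 / (rtt_ms / 1e3) / 1e6

# An assumed transatlantic path at ~90 ms RTT and 1 Gbit/s:
print(bdp_bytes(1000, 90) / 1e6, "MB of buffer to fill the path")      # ~11 MB
# With an untuned 64 KB window, the same path is capped at roughly:
print(capped_rate_mbit_s(64 * 1024, 90), "Mbit/s")                     # ~5.8 Mbit/s
```

A few Mbit/s from an untuned window is the same order of magnitude as the 14 Mbit/s first attempt above, which is why buffer tuning and parallel-stream tools make such a large difference.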
What about the Physicists?
- The LHC moves 20 PBytes/year across the internet to their processing sites.
  - Not really: the traffic runs over dedicated 10GigE networking between CERN and the 10 Tier 1 centres (see the arithmetic below).
- Even with dedicated paths, it is still hard.
  - Multiple telcos are involved, even for a point-to-point link.
  - Constant monitoring / bandwidth tests to ensure it stays working.
  - See the HEPIX talks for the gory details.
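To see why dedicated links are used, 20 PB/year is a substantial rate even when averaged over the whole year; a rough back-of-the-envelope calculation:

```python
# Rough sustained rate implied by 20 PB/year of LHC data movement.
# Back-of-the-envelope only: real traffic is bursty and split across Tier 1 sites.

PB = 1e15                          # bytes in a petabyte (decimal)
SECONDS_PER_YEAR = 365 * 24 * 3600

volume_bytes = 20 * PB
avg_rate_gbit_s = volume_bytes * 8 / SECONDS_PER_YEAR / 1e9
print(f"average rate: {avg_rate_gbit_s:.1f} Gbit/s")   # ~5.1 Gbit/s sustained

# A large fraction of a 10GigE link even as a yearly average, before any bursts,
# which is why general-purpose internet paths are not enough.
```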
We need bigger networks
- A fast network is fundamental to moving data.
- Is it the only thing we need to do?
Sanger Production Pipeline
- Provides a nice example of moving large amounts of data in real life.

Sequencing data flow
[Diagram: Sanger sequencer -> analysis/alignment -> internal repository -> EGA/SRA (EBI)]
Data movement between Sanger/EBI
- This should be easy...
  - We are on the same campus.
  - There is a 10 Gbit/s (1.2 Gbyte/s) link between EBI and Sanger.
  - We share a data-centre.
  - Physically near, so we do not need to worry about WAN issues.
It is not just networks
- Speed will only be as fast as the slowest link in the end-to-end chain (see the sketch below).
- Speed was not a design point for our holding area.
  - $ per TB was the overriding design goal, not speed.

[Diagram: the end-to-end path between Sanger and EBI: server, disk, network, firewall, internet, firewall, network, disk, server]
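A toy model of that point: end-to-end throughput is bounded by the slowest stage in the path, so a fast WAN buys nothing if the holding-area disks cannot keep up. The stage names and rates below are illustrative assumptions, not measurements of the actual Sanger/EBI systems:

```python
# End-to-end transfer speed is limited by the slowest stage in the path.
# Stage names and rates are illustrative, not measurements of the real systems.

stages_mbyte_s = {
    "source disk (cheap bulk storage)": 60,
    "source firewall": 400,
    "10 Gbit/s WAN link": 1200,
    "destination firewall": 400,
    "destination disk": 150,
}

bottleneck = min(stages_mbyte_s, key=stages_mbyte_s.get)
print(f"effective rate ~{stages_mbyte_s[bottleneck]} Mbyte/s, limited by: {bottleneck}")
# Upgrading the network alone would not change this number.
```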
Organisational issues
- Data movement was not considered until after Sanger/EBI started building the systems.
  - It is hard to do fast data transfers if your disk subsystem is not up to the job.
- Expectation management:
  - "How fast should I be able to move data?"
- Good communication:
  - Multi-institute teams.
  - Need to take end-to-end ownership across institutions.
- Application led:
  - Nobody cares about raw data rates; they care how fast their application can move data.
  - Need application developers and sys-admins to work together.
- This needs to be in place before you start projects!
Do we need to move the data?

Centralised data
[Diagram: multiple sequencing centres all copy their data to a central sequencing centre + DCC]
Example problem
- "We want to run our pipeline across 100 TB of data currently in EGA/SRA."
- We would need to de-stage the data to Sanger, and then run the compute (see the estimate below):
  - An extra 0.5 PB of storage and 1,000 cores of compute.
  - 3 month lead time.
  - ~$1.5M capex.
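Even ignoring the capex, simply copying 100 TB into place takes a long time. A rough estimate using the Cambridge -> EC2 Dublin rate measured earlier, plus an assumed (optimistic) share of a tuned 10 Gbit/s link:

```python
# How long does it take just to de-stage 100 TB? Rough estimate only.
# 25 Mbyte/s is the measured Cambridge -> EC2 Dublin rate above;
# 500 Mbyte/s is an assumed, optimistic share of a well-tuned 10 Gbit/s link.

DATA_TB = 100
data_bytes = DATA_TB * 1e12

for label, rate_mbyte_s in [("25 Mbyte/s (measured WAN rate)", 25),
                            ("500 Mbyte/s (optimistic tuned rate)", 500)]:
    days = data_bytes / (rate_mbyte_s * 1e6) / 86400
    print(f"{label}: ~{days:.1f} days")
# ~46 days at 25 Mbyte/s, ~2.3 days at 500 Mbyte/s -- and that assumes the
# storage at both ends can keep up, which the previous slides show is not a given.
```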
Federation: a better way
- Collaborations are short term: 18 months to 3 years.
[Diagram: sequencing centres keep their own data and expose it through federated access]

Federation software
[Diagram: data size per genome, from unstructured flat files to structured databases: intensities / raw data (2 TB), alignments (200 GB), sequence + quality data (500 GB), variation data (1 GB), individual features (3 MB); iRODS (data grid software) serves the unstructured end, BioMart the structured end]
Cloud / computable archives
- Can we move the compute to the data?
  - Upload the workload onto VMs.
  - Put the VMs on compute that is "attached" to the data.
[Diagram: VMs are pushed to CPUs sitting next to each data store, instead of copying the data out]
Summary
- We need fast network links.
- We need cross-site teams who can troubleshoot all potential trouble spots.
- Teams need application & systems people.
Acknowledgements
- The HEPIX community: http://www.hepix.org
- Team ISG:
  - James Beal
  - Gen-Tao Chiang
  - Pete Clapham
  - Simon Kelley
