Sharing data: Sanger Experiences

Sharing large amounts of data is easier said than done. This talk gives an overview of our experiences doing big-data science over wide-area networks.

Data Sharing: Sanger Experiences
Guy Coates, Wellcome Trust Sanger Institute, [email_address]

Background
- Moving large amounts of data:
  - Cloud Experiments: moving data to the cloud.
  - Production Pipelines: moving data to the EBI.
  - Do we need to move this data at all?

Cloud Experiments
- Can we move some Solexa image files to AWS and run our processing pipeline?
- Answer: No.
  - Moving the data took much longer than processing it.
  - First attempt: 14 Mbit/s out of a 2 Gbit/s link (a rough transfer-time estimate follows below).
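As a sanity check on what 14 Mbit/s means in practice, here is a minimal back-of-the-envelope sketch (not from the slides). The ~2 TB figure for a run's worth of raw image/intensity data is an assumption, borrowed from the per-genome sizes quoted later in the talk.

```python
# Hedged sketch: how long does it take to move ~2 TB of raw data at the
# rates quoted above?  The 2 TB volume is an assumption for illustration.

def transfer_hours(data_bytes: float, rate_bits_per_s: float) -> float:
    """Hours needed to move data_bytes at a sustained line rate."""
    return (data_bytes * 8) / rate_bits_per_s / 3600

raw_data = 2e12  # ~2 TB of intensities / image data (assumed)
for label, rate in [("first attempt, 14 Mbit/s", 14e6),
                    ("full 2 Gbit/s link", 2e9)]:
    print(f"{label}: {transfer_hours(raw_data, rate):,.0f} hours")
```

At the first-attempt rate the copy alone takes roughly two weeks, which is consistent with the slide's point that moving the data dominated the processing; even the full 2 Gbit/s link would only bring it down to a couple of hours.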
Do some reading:
- http://fasterdata.es.net (US Department of Energy Office of Science).
  - Covers all of the technical bits and pieces required to make wide-area transfers go fast.

Getting better:
- Use the right tools:
  - Use WAN tools (gridFTP/FDT/Aspera), not rsync/ssh.
  - Tune your TCP stack (see the window-sizing sketch after this slide).
- Data transfer rates:
  - Cambridge -> EC2 East coast: 12 Mbyte/s (96 Mbit/s).
  - Cambridge -> EC2 Dublin: 25 Mbyte/s (200 Mbit/s).
- What speed should we get?
  - Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
- How do we get the broken bits in the middle fixed?
  - Finding the person responsible for a broken router on the "internet" is hard.
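To make the "tune your TCP stack" point concrete, here is a minimal sketch of the bandwidth-delay-product arithmetic; the round-trip times are illustrative assumptions, not measurements from the talk.

```python
# A single TCP stream can carry at most window_size / RTT bytes per second,
# so the window must be at least the bandwidth-delay product (BDP).
# The RTTs below are assumed values for illustration only.

def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the TCP window needed to fill the pipe."""
    return bandwidth_bits_per_s * rtt_s / 8

def max_throughput_mbit(window_bytes: float, rtt_s: float) -> float:
    """Best-case throughput of one stream with a fixed window."""
    return window_bytes * 8 / rtt_s / 1e6

paths = {"Cambridge -> EC2 US East": 0.090,   # assumed ~90 ms RTT
         "Cambridge -> EC2 Dublin": 0.020}    # assumed ~20 ms RTT
for name, rtt in paths.items():
    print(f"{name}: need ~{bdp_bytes(1e9, rtt)/1e6:.1f} MB of TCP window to fill 1 Gbit/s; "
          f"a default 64 KB window caps a single stream at ~{max_throughput_mbit(65536, rtt):.0f} Mbit/s")
```

An untuned 64 KB window tops out in the single-digit to low tens of Mbit/s on paths like these, the same order of magnitude as the first attempt; large windows and parallel streams (which gridFTP/FDT/Aspera provide) are what get you towards the 96-200 Mbit/s quoted above.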
What about the physicists?
- The LHC moves 20 PByte/year "across the internet" to its processing sites.
  - Not really: there is dedicated 10 GigE networking between CERN and the 10 Tier 1 centres.
- Even with dedicated paths, it is still hard.
  - Multiple telcos are involved, even for a point-to-point link.
  - Constant monitoring and bandwidth tests are needed to ensure it stays working.
  - See the HEPIX talks for the gory details.

We need bigger networks:
- A fast network is fundamental to moving data.
- Is it the only thing we need to do?

Sanger Production Pipeline
- Provides a nice real-life example of moving large amounts of data.

Sequencing data flow
- (Diagram: Sequencer -> Analysis/alignment -> Internal repository at Sanger, feeding the EGA / SRA archives at the EBI.)

Data movement between Sanger/EBI
- This should be easy...
  - We are on the same campus.
  - There is a 10 Gbit/s (1.2 Gbyte/s) link between the EBI and Sanger.
  - We share a data centre.
  - We are physically near, so we do not need to worry about WAN issues.
It is not just networks:
- Speed will only be as fast as the slowest link (see the bottleneck sketch below).
- Speed was not a design point for our holding area.
  - $ per TB was the overriding design goal, not speed.
- (Diagram: the end-to-end path between Sanger and the EBI runs disk -> server -> network -> firewall -> internet -> firewall -> network -> server -> disk.)
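A minimal sketch of the slowest-link point: the component figures below are illustrative assumptions (only the 10 Gbit/s Sanger-EBI link comes from the slides), but the shape of the calculation is the same whatever the real numbers are.

```python
# End-to-end transfer speed is bounded by the weakest component in the path,
# not by the headline network figure.  All values except the 10 Gbit/s link
# are assumed for illustration.

path_mbyte_per_s = {
    "source disk array ($/TB-optimised)": 60,    # assumed
    "source server / firewall": 400,             # assumed
    "Sanger <-> EBI link (10 Gbit/s)": 1200,
    "destination firewall / server": 400,        # assumed
    "destination disk": 80,                      # assumed
}

bottleneck = min(path_mbyte_per_s, key=path_mbyte_per_s.get)
print(f"End-to-end ceiling: {path_mbyte_per_s[bottleneck]} MB/s, set by {bottleneck}")
```

With a cheap, $/TB-optimised disk tier at either end, the headline 10 Gbit/s link has little bearing on the rate you actually achieve.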
Organisational issues:
- Data movement was not considered until after Sanger/EBI started building the systems.
  - It is hard to do fast data transfers if your disk subsystem is not up to the job.
- Expectation management:
  - "How fast should I be able to move data?"
- Good communication:
  - Multi-institute teams.
  - Need to take end-to-end ownership across institutions.
- Application led:
  - Nobody cares about raw data rates; they care how fast their application can move data.
  - Application developers and sysadmins need to work together.
- This needs to be in place before you start projects!

Do we need to move the data?

Centralised data
- (Diagram: several sequencing centres all copying their data into a central sequencing centre + DCC.)

Example problem:
- "We want to run our pipeline across 100 TB of data currently in EGA/SRA."
- We would need to de-stage the data to Sanger and then run the compute (a rough transfer-time sketch follows below):
  - An extra 0.5 PB of storage and 1,000 cores of compute.
  - A 3-month lead time.
  - ~$1.5M capex.
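For scale, a hedged sketch of the transfer component of the de-staging plan; the 50% sustained-efficiency figure is an assumption.

```python
# Rough sketch: how long does the copy itself take, assuming a clean
# 10 Gbit/s path sustained at an assumed 50% efficiency?

DATA_BYTES = 100e12          # 100 TB to pull from EGA/SRA
LINK_BITS_PER_S = 10e9       # 10 Gbit/s
EFFICIENCY = 0.5             # assumed sustained fraction of line rate

seconds = DATA_BYTES * 8 / (LINK_BITS_PER_S * EFFICIENCY)
print(f"Transfer alone: ~{seconds / 86400:.1f} days at 50% of 10 Gbit/s")
```

The wire time (a couple of days under these assumptions) is small next to the 3-month hardware lead time and the ~$1.5M capex, which is rather the point: the cost of de-staging is dominated by duplicating storage and compute, not by the copy itself.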
Federation: a better way
- Collaborations are short term: 18 months to 3 years.
- (Diagram: sequencing centres sharing data via federated access rather than copying everything to one site.)

Federation software (data size per genome, from unstructured flat files to structured databases; see the volume sketch below):
- Intensities / raw data: 2 TB
- Alignments: 200 GB
- Sequence + quality data: 500 GB
- Variation data: 1 GB
- Individual features: 3 MB
- iRODS (data-grid software) covers the unstructured flat-file end of this spectrum; BioMart covers the structured database end.
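A small sketch of why the choice of representation matters so much for federation: the per-genome sizes are taken from the slide above, while the 1,000-genome cohort is an assumed example.

```python
# The volume you have to move depends enormously on which representation
# you share.  Per-genome sizes come from the slide; the cohort size is an
# assumption for illustration.

PER_GENOME_BYTES = {
    "intensities / raw data": 2e12,
    "alignments": 200e9,
    "sequence + quality data": 500e9,
    "variation data": 1e9,
    "individual features": 3e6,
}

COHORT = 1000  # assumed number of genomes in a collaboration
for tier, size in PER_GENOME_BYTES.items():
    total_tb = size * COHORT / 1e12
    print(f"{tier:>24}: {total_tb:>10,.3f} TB for {COHORT} genomes")
```

Raw intensities for such a cohort run to petabytes, while the variation data fits in about a terabyte and the individual features on a laptop; hence data-grid tools such as iRODS for the bulky flat files and BioMart for the structured end.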
Cloud / computable archives
- Can we move the compute to the data?
  - Upload the workload onto VMs.
  - Put the VMs on compute that is "attached" to the data.
- (Diagram: a VM image is shipped to CPUs co-located with each data store, instead of the data being shipped to the VM.)

Summary
- We need fast network links.
- We need cross-site teams who can troubleshoot all potential trouble spots.
- Teams need application & systems people.

Acknowledgements:
- The HEPIX community: http://www.hepix.org
- Team ISG: James Beal, Gen-Tao Chiang, Pete Clapham, Simon Kelley.
