chris@bioteam.net / @chris_dag
PRACTICAL PETABYTE PUSHING
Jan 2019 / Lightning Talk / Foundation Medicine
Boston Computational Biology and Bioinformatics Meetup
Chris Dagdigian; chris@bioteam.net
30-Second Background
● 24x7 Production HPC Environment
● 100s of user accounts; 10+ power users; 50+ frequent users
● Many integrated “cluster aware” commercial apps leverage this system
● ~2 petabytes of scientific & user data (Linux & Windows clients)
● Multiple catastrophic NAS outages in 2018
○ Demoralized scientists; shell-shocked IT staff; angry management
○ Replacement storage platform procured; 100% NAS-to-NAS migration ordered
● Mandate / Mission - 2 petabyte live data migration
○ IT must re-earn trust and confidence of scientific end-users & leadership
○ User morale/confidence is low; Stability/Uptime is key; Zero Unplanned Outages
○ “Jobs must flow” -- HPC remains in production during data migration
Lightning Talk ProTip: CONCLUSIONS FIRST
Things we already knew + things we wished we knew beforehand
1. NEVER commingle “data management” & “data movement” at the same time
Clean up / manage your data BEFORE or AFTER the migration; never DURING
2. Understand vendor-specific data protection overhead up front (especially for small files)
The new NAS needed +20% more raw disk to store the same data, a non-trivial CapEx cost at petascale
3. Interrogate/Understand your data before you move it (or buy new storage!)
Massive replication bandwidth is meaningless if you have 200+ million tiny files;
this was our real-world data movement bottleneck (see the profiling sketch below)
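One cheap way to interrogate a share before sizing the move (or the new array): crawl it once and summarize file count, total bytes, and the small-file fraction. A minimal sketch, assuming GNU find (for -printf), a hypothetical /mnt/old_nas/share mount, and an arbitrary 64 KiB “tiny file” cutoff; illustrative, not the exact profiling we ran.

# Profile a share: how many files, how much data, and how much of it is tiny files.
# GNU find is assumed (-printf); the mount point and 64 KiB cutoff are illustrative.
find /mnt/old_nas/share -type f -printf '%s\n' |
  awk '{ n++; bytes += $1; if ($1 < 65536) small++ }
       END { printf "files=%d  total=%.2f TiB  files<64KiB=%d (%.0f%%)\n",
             n, bytes/2^40, small, (n ? 100*small/n : 0) }'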
Lightning Talk ProTip: CONCLUSIONS FIRST
4. Be proactive in setting (and re-setting) management expectations
Data transfer time estimates based on aggregate network bandwidth were insanely
wrong. Real-world throughput ranged from roughly 2 MB/sec to 13 GB/sec (see the
back-of-envelope sketch after this list)
5. Tasks that take days/weeks require visibility & transparency
Users & management will want a dashboard or progress view
6. Work against full filesystems or network shares ONLY (See tip #1 …)
Attempts to get clever with curated “exclude-these-files-and-folders” lists add
complexity and introduce vectors for human/operator error
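For context on why the estimates were so far off: the same 2 PB payload spans “days” to “decades” depending on sustained rate. A back-of-envelope sketch; the rates below are illustrative round numbers, not our measured figures.

# Wall-clock time to move 2 PB at a few sustained transfer rates (illustrative).
awk -v bytes=2e15 'BEGIN {
  n = split("13000 1000 100 2", rate_mb, " ")   # MB/sec scenarios
  for (i = 1; i <= n; i++) {
    days = bytes / (rate_mb[i] * 1e6) / 86400
    printf "%6d MB/sec  ->  %10.1f days\n", rate_mb[i], days
  }
}'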
Materials & Methods - Tooling
● We are not special/unique in life science informatics - plagiarizing methods
from Amazon, supercomputing sites & high-energy physics is a legit strategy
● Our tooling choice: fpart/fpsync from https://github.com/martymac/fpart
○ ‘fpart’ - Does the hard work of filesystem crawling to build ‘partition’ lists that can be used as
input data for whatever tool you want to use to replicate/copy data
○ ‘fpsync’ - Wrapper script to parallelize, distribute and manage a swarm of replication jobs
○ ‘rsync’ - https://rsync.samba.org/
● Actual data replication via ‘rsync’ (managed by fpsync)
○ fpsync wrapper script is pluggable and supports different data mover/copy binaries
○ We explicitly chose ‘rsync’ because it is well known, well tested, and had the fewest potential
edge and corner cases to deal with (a hand-rolled sketch of the pattern follows)
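For the curious, here is the fpart + parallel rsync pattern hand-rolled as a minimal sketch. Paths are hypothetical, the partition caps are arbitrary, and the fpart flags should be checked against your installed version; fpsync wraps this same loop (including ssh-distributed workers, scheduling and logging), which is why we used it in production rather than a script like this.

#!/usr/bin/env bash
# Hand-rolled sketch of the fpart + parallel rsync pattern that fpsync automates.
# Source/destination paths are hypothetical; verify flags against your man pages.
set -euo pipefail

SRC=/mnt/old_nas/project        # hypothetical source NAS export
DST=/mnt/new_nas/project        # hypothetical destination NAS export
PARTS=/var/tmp/fpart_lists
mkdir -p "$PARTS"

# 1) Crawl the source once and cut the file list into partitions capped by
#    file count and aggregate size (whichever limit is hit first).
cd "$SRC"
fpart -f 100000 -s $((500 * 1024**3)) -o "$PARTS/part" .

# 2) Fan the partition lists out to 8 concurrent rsync workers.
#    -aHAX preserves perms/times/symlinks plus hard links, ACLs and xattrs.
ls "$PARTS"/part.* | xargs -P 8 -I{} \
  rsync -aHAX --numeric-ids --files-from={} "$SRC/" "$DST/"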
Materials & Methods - Process
The Process (one filesystem or share at a time):
● [A] Perform initial full replication in background on live “in-use” file system
● [B] Perform additional ‘re-sync’ replications to stay current
● [C] Perform ‘delete pass’ sync to catch data that was deleted from the source filesystem while
replication(s) were occurring (see the rsync sketch after this list)
● Repeat tasks [B] and [C] until time window for full sync + delete-pass is small enough to fit
within an acceptable maintenance/outage window
● Schedule outage window; make source filesystem Read-Only at a global level; perform final
replication sync; migrate client mounts; have backout plan handy
● Test, test, test, test, test, test (admins & end-users should both be involved testing)
● Have a plan to document & support the previously unknown storage users who will come out of the
woodwork once you mark the source filesystem read-only (!)
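The [B] and [C] passes are plain rsync semantics; a minimal sketch with hypothetical mount points is below. The only non-obvious part is running the delete pass with --dry-run first and actually reading the output before letting it loose.

# [A]/[B] Initial replication and later catch-up passes (rsync only copies changes).
rsync -aHAX --numeric-ids /mnt/old_nas/share/ /mnt/new_nas/share/

# [C] Delete pass: remove files from the target that were deleted on the source
#     while earlier passes ran. Always preview with --dry-run first.
rsync -aHAX --numeric-ids --delete --dry-run /mnt/old_nas/share/ /mnt/new_nas/share/
rsync -aHAX --numeric-ids --delete           /mnt/old_nas/share/ /mnt/new_nas/share/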
Wrap Up
Commercial Alternative
● If management requires fancy live dashboards & other UI candy --OR-- you have limited IT/ops capacity
available to support scripted OSS tooling …
● You can purchase petascale data migration capability commercially
○ Recommendation: Talk to DataDobi (https://datadobi.com)
○ (Yes this is a different niche than IBM Aspera or GridFTP type tooling …)
Acknowledgements
● Aaron Gardner (aaron@bioteam.net)
○ One of several Bioteam infrastructure gurus with extreme storage & filesystem expertise
○ He did the hard work on this
○ I just scripted things & monitored progress #lazy
More Info/Details: If you want to see this topic expanded into a long-form blog post / technical write-up
or BioITWorld conference talk then please let me know via email!
