Practical Petabyte Pushing

Chris Dagdigian, Co-founder; Principal Consultant at The BioTeam, Inc.
chris@bioteam.net / @chris_dag
Jan 2019 / Lightning Talk / Foundation Medicine
Boston Computational Biology and Bioinformatics Meetup
30 Second Background
● 24x7 Production HPC Environment
● 100s of user accounts; 10+ power users; 50+ frequent users
● Many integrated “cluster aware” commercial apps leverage this system
● ~2 petabytes scientific & user data (Linux & Windows clients)
● Multiple catastrophic NAS outages in 2018
○ Demoralized scientists; shell-shocked IT staff; angry management
○ Replacement storage platform procured; 100% NAS-to-NAS migration ordered
● Mandate / Mission - 2 petabyte live data migration
○ IT must re-earn trust and confidence of scientific end-users & leadership
○ User morale/confidence is low; Stability/Uptime is key; Zero Unplanned Outages
○ “Jobs must flow” -- HPC remains in production during data migration
Lightning Talk ProTip: CONCLUSIONS FIRST
Things we already knew + things we wished we knew beforehand
1. NEVER commingle “data management” & “data movement” at the same time
Clean up/manage your data BEFORE or AFTER; never DURING
2. Understand vendor-specific data protection overhead upfront (especially for small files)
New NAS needed 20% more raw disk to store the same data, a non-trivial CapEx cost at petascale
3. Interrogate/understand your data before you move it (or buy new storage!)
Massive replication bandwidth is meaningless if you have 200+ million tiny files;
this was our real-world data movement bottleneck
4. Be proactive in setting (and re-setting) management expectations
Data transfer time estimates based on aggregate network bandwidth were
insanely wrong. Real-world throughput range was: [ 2mb/sec -- 13GB/sec ]
5. Tasks that take days/weeks require visibility & transparency
Users & management will want a dashboard or progress view
6. Work against full filesystems or network shares ONLY (See tip #1 …)
Attempts to get clever with curated “exclude-these-files-and-folders” lists add
complexity and introduce vectors for human/operator error
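Tips 3 and 4 can be made concrete with a rough estimator. The sketch below is a minimal illustration, not the tooling used in this migration: `profile_tree`, `estimate_hours`, the 64 KiB "tiny file" cutoff and both throughput rates are hypothetical names and assumed values. It walks a tree to count files and bytes, then compares a bandwidth-only estimate with one that also charges a per-file cost.

```python
import os


def profile_tree(root, tiny_bytes=64 * 1024):
    """Walk `root` and return (file_count, total_bytes, tiny_file_count).

    `tiny_bytes` is an assumed cutoff for what counts as a "tiny" file.
    """
    files = total = tiny = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable on a live filesystem
            files += 1
            total += size
            if size < tiny_bytes:
                tiny += 1
    return files, total, tiny


def estimate_hours(total_bytes, file_count, bytes_per_sec, files_per_sec):
    """Return (bandwidth_only_hours, file_aware_hours).

    The file-aware figure takes the slower of two limits: moving the
    bytes at `bytes_per_sec`, or handling files at `files_per_sec`.
    Both rates are inputs you must measure yourself.
    """
    by_bandwidth = total_bytes / bytes_per_sec / 3600
    by_file_rate = file_count / files_per_sec / 3600
    return by_bandwidth, max(by_bandwidth, by_file_rate)


if __name__ == "__main__":
    # 2 PB and 200 million files come from the talk; 10 GB/sec and
    # 500 files/sec are purely illustrative assumptions.
    bw, aware = estimate_hours(2e15, 200e6, 10e9, 500)
    print(f"bandwidth-only: {bw:.0f} h, file-aware: {aware:.0f} h")
```

Running `profile_tree` against each filesystem before sizing the project gives the file count that the bandwidth-only estimate ignores. The per-file rate is the number to measure early, since it is what dominates when most files are tiny.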
Materials & Methods - Tooling
● We are not special/unique in life science informatics - plagiarizing methods
from Amazon, supercomputing sites & high-energy physics is a legit strategy
● Our tooling choice: fpart/fpsync from https://github.com/martymac/fpart
○ ‘fpart’ - Does the hard work of filesystem crawling to build ‘partition’ lists that can be used as
input data for whatever tool you want to use to replicate/copy data
○ ‘fpsync’ - Wrapper script to parallelize, distribute and manage a swarm of replication jobs
○ ‘rsync’ - https://rsync.samba.org/
● Actual data replication via ‘rsync’ (managed by fpsync)
○ fpsync wrapper script is pluggable and supports different data mover/copy binaries
○ We explicitly chose ‘rsync’ because it is well known, well tested and had the fewest
potential edge and corner cases to deal with
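To illustrate the idea behind partition lists, here is a minimal Python sketch. It is an assumption-laden illustration of the concept, not fpart's actual algorithm or interface: `build_partitions`, `write_partition_lists` and the default limits are hypothetical. It crawls a tree and groups files into lists capped by file count and total size, so that each list can be handed to a separate copy job.

```python
import os


def build_partitions(root, max_files=10000, max_bytes=10 * 1024**3):
    """Crawl `root` and yield lists of relative file paths.

    A partition is closed as soon as adding the next file would exceed
    `max_files` or `max_bytes`. A single file larger than `max_bytes`
    gets a partition of its own. Both limits are illustrative defaults.
    """
    current, current_bytes = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic crawl order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished on a live filesystem
            if current and (len(current) >= max_files
                            or current_bytes + size > max_bytes):
                yield current
                current, current_bytes = [], 0
            current.append(os.path.relpath(path, root))
            current_bytes += size
    if current:
        yield current


def write_partition_lists(partitions, out_dir):
    """Write one file list per partition; return the list-file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, files in enumerate(partitions):
        list_path = os.path.join(out_dir, f"part.{index}")
        with open(list_path, "w") as handle:
            handle.write("\n".join(files) + "\n")
        paths.append(list_path)
    return paths
```

Each resulting list file could then be passed to a copy tool, for example through rsync's `--files-from` option. Capping partitions by file count as well as by bytes keeps jobs over millions of tiny files from running far longer than jobs over a few large ones.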
Materials & Methods - Process
The Process (one filesystem or share at a time):
● [A] Perform initial full replication in background on live “in-use” file system
● [B] Perform additional ‘re-sync’ replications to stay current
● [C] Perform ‘delete pass’ sync to catch data that was deleted from source filesystem while
replication(s) were occurring
● Repeat tasks [B] and [C] until time window for full sync + delete-pass is small enough to fit
within an acceptable maintenance/outage window
● Schedule outage window; make source filesystem Read-Only at a global level; perform final
replication sync; migrate client mounts; have backout plan handy
● Test, test, test, test, test, test (admins & end-users should both be involved in testing)
● Have a plan to document & support the previously unknown storage users that will come out of the
woodwork once you mark the source filesystem read-only (!)
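The [A]/[B]/[C] loop above can be sketched in Python. This is a simplified illustration under stated assumptions, not the scripts used in this migration: it drives plain single-stream `rsync` rather than fpsync, and `rsync_command`, `run_pass`, `converge`, `window_seconds` and `max_rounds` are hypothetical names. Only the widely documented rsync options `-a`, `--delete` and `--dry-run` are used. The final cutover (read-only source, client mount migration, backout) is deliberately left out.

```python
import subprocess
import time


def rsync_command(src, dst, delete=False, dry_run=False):
    """Build an rsync command line for one pass.

    Trailing slashes make rsync copy the contents of `src` into `dst`.
    """
    cmd = ["rsync", "-a"]
    if delete:
        cmd.append("--delete")
    if dry_run:
        cmd.append("--dry-run")
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


def run_pass(cmd, runner=subprocess.run, clock=time.monotonic):
    """Run one pass and return its wall-clock duration in seconds."""
    start = clock()
    runner(cmd, check=True)
    return clock() - start


def converge(src, dst, window_seconds, max_rounds=20,
             runner=subprocess.run, clock=time.monotonic):
    """Repeat re-sync [B] + delete pass [C] until both fit the window.

    Returns the number of rounds used, or None if `max_rounds` was hit
    without fitting inside `window_seconds`.
    """
    run_pass(rsync_command(src, dst), runner, clock)  # [A] initial full copy
    for round_number in range(1, max_rounds + 1):
        elapsed = run_pass(rsync_command(src, dst), runner, clock)  # [B]
        elapsed += run_pass(rsync_command(src, dst, delete=True),
                            runner, clock)  # [C]
        if elapsed <= window_seconds:
            return round_number
    return None
```

The `runner` and `clock` parameters exist so the loop can be rehearsed without touching real storage. Passing `dry_run=True` to `rsync_command` is a cautious way to inspect a delete pass before running it for real.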
Wrap Up
Commercial Alternative
● If management requires fancy live dashboards & other UI candy --OR-- you have limited IT/ops capacity to
support scripted OSS tooling …
● You can purchase petascale data migration capability commercially
○ Recommendation: Talk to DataDobi (https://datadobi.com)
○ (Yes this is a different niche than IBM Aspera or GridFTP type tooling …)
Acknowledgements
● Aaron Gardner (aaron@bioteam.net)
○ One of several BioTeam infrastructure gurus with extreme storage & filesystem expertise
○ He did the hard work on this
○ I just scripted things & monitored progress #lazy
More Info/Details: If you want to see this topic expanded into a long-form blog post / technical write-up
or BioITWorld conference talk then please let me know via email!