SlideShare a Scribd company logo
User #1
Name, Tile
Dr. Gene Cooperman
Professor
Northeastern University
Why Use Checkpoint-Restart?
Gene Cooperman December 10, 2020
 My application takes 1 week to run, and I only have a
48-hour batch allocation slot.
 There are upcoming shutdowns (either planned or
unplanned; due to electricity shutdown, building work,
etc.).
 Computer nodes sometimes crash..
DMTCP: What does it stand for?
 DMTCP:
Distributed MultiThreaded CheckPointing
 Current and past industrial support from:
 Academic support from:
Gene Cooperman December 10, 2020
So, what is Checkpoint-Restart/DMTCP?
Checkpoint-Restart is the ability to save a running application to
disk (and restart it).
 DMTCP is a widely used C-R package for HPC.
 120 publications by others using DMTCP:
http://dmtcp.sourceforge.net/publications.html
 Periodic ckpt or under program control.
 Runtime overhead typically less than 1%.
 End-user plugins for added flexibility.
 Leveraging support at NERSC Supercomputer Center (best
practices)
Gene Cooperman December 10, 2020
Philosophy of DMTCP
 Open Source: LGPL-v3 (non-contagious license)
 Considering switching to Apache license
 Transparent checkpointing:
 No modification to application binaries; no modification to Linux kernel or
system libraries
• Long-term Standards for robustness:
 Mostly built on top of POSIX syscalls
 Synergy between academia (publishing) and industry (impact
on current practices)
Gene Cooperman December 10, 2020
Long-Term Research Questions
FUNDAMENTAL QUESTION:
What are the limits of
transparent checkpointing?
 What about MPI over InfiniBand or TCP?
(Yes, DMTCP: last 7 years)
 And newer networks? (CRAY GNI, HPE Slingshot, Intel Omni-Path,
InfiniBand extensions, …):
(Yes, DMTCP/MANA: last two years)
 What about CUDA? (for NVIDIA GPUs)
(Yes, DMTCP/CRAC: last year)
*EXTENDING DMTCP: plugins, virtual ids, split processes (see later)
Gene Cooperman December 10, 2020
Checkpointing to Stable Storage
The time to checkpoint is a long-standing question.
The time can be decomposed into:
1) Saving kernel/network status: a few seconds
2) Saving the memory image to stable storage:
depends on bandwidth of storage
EXAMPLE: storage to disk: ~100 MB/s
storage to SSD: ~500 MB/s
storage to Optane: ~5 GB/s
Big Data Applications:
Is there a CoW (copy-on-write) filesystem for snapshot?
Gene Cooperman December 10, 2020
Persistent Storage/Memory and HPC
Newer persistent memories in clusters are used in at least three application-
visible ways:
1) Substitute for RAM: provides persistence in database applications
2) Local burst-buffer: serves to accelerate writes to secondary storage
3) Intermediate level in the storage hierarchy:
fast local-node storage or shared storage among nearby nodes
DMTCP plugins are used to recognize these situations during ckpt; and
then restore during restart.
Gene Cooperman December 10, 2020
DMTCP Internal Architecture
Gene Cooperman December 10, 2020
Plugins,
Virtual Ids,
and Split
Process:
Pt. 1: Plugins
Gene Cooperman December 10, 2020
Plugins, Virtual Ids, and Split Process
Part 2: Virtual Ids: PID Plugin
Gene Cooperman December 10, 2020
Plugins, Virtual Ids, and Split Process
Part 3a: Split Processes: MANA Plugin for MPI
Gene Cooperman December 10, 2020
Plugins, Virtual Ids, and Split Process
Part 3b:
Split Processes:
CRAC Plugin
for CUDA
Gene Cooperman December 10, 2020
Acknowledgment
THANKS TO THE MANY STUDENTS AND OTHERS WHO HAVE
CONTRIBUTED TO DMTCP OVER THE YEARS:
Jason Ansel, Kapil Arya, Alex Brick, Jiajun Cao, Prashant Chouhan, Tyler Denniston, Xin
Dong, Younes El Idrissi Yazami, William Enright, Rohan Garg, Twinkle Jain, Samaneh
Kazemi, Jay Kim, Gregory Kerr, Apoorve Mohan, Neil Resnik, Manuel Rodríguez
Pascual, Artem Y. Polyakov, Michael Rieker, Praveen S. Solanki, Ana-Maria Visan
*This work was partially supported by National Science Foundation Grant OAC-
1740218 and grants from Intel Corporation, and Raytheon Technologies.
If you want to hear more about checkpointing, consider the
upcoming SuperCheck'21 conference: Feb. 4-5, 2021
First Int. Symp. on Checkpointing for Supercomputing
Gene Cooperman December 10, 2020

More Related Content

What's hot

08 Supercomputer Fugaku
08 Supercomputer Fugaku08 Supercomputer Fugaku
08 Supercomputer Fugaku
RCCSRENKEI
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
Esteban Hernandez
 
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrowJong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
Vu Hung Nguyen
 
HPC at HP Update
HPC at HP UpdateHPC at HP Update
HPC at HP Update
inside-BigData.com
 
Enterprise Mass Storage TCO Case Study
Enterprise Mass Storage TCO Case StudyEnterprise Mass Storage TCO Case Study
Enterprise Mass Storage TCO Case Study
IT Brand Pulse
 
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
RCCSRENKEI
 
Business Continuity Presentation[1]
Business Continuity Presentation[1]Business Continuity Presentation[1]
Business Continuity Presentation[1]
jrm1224
 
Best Practices in Data Center Energy Resource Management
Best Practices in Data Center Energy Resource ManagementBest Practices in Data Center Energy Resource Management
Best Practices in Data Center Energy Resource Management
Viridity Software
 
SUPERMICRO Innovative Computing Architecture
SUPERMICRO Innovative Computing ArchitectureSUPERMICRO Innovative Computing Architecture
SUPERMICRO Innovative Computing Architecture
Intel IT Center
 

What's hot (9)

08 Supercomputer Fugaku
08 Supercomputer Fugaku08 Supercomputer Fugaku
08 Supercomputer Fugaku
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for research
 
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrowJong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
Jong Won Koh - HP apollo reinventing HPC to accelerate the world of tomorrow
 
HPC at HP Update
HPC at HP UpdateHPC at HP Update
HPC at HP Update
 
Enterprise Mass Storage TCO Case Study
Enterprise Mass Storage TCO Case StudyEnterprise Mass Storage TCO Case Study
Enterprise Mass Storage TCO Case Study
 
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
 
Business Continuity Presentation[1]
Business Continuity Presentation[1]Business Continuity Presentation[1]
Business Continuity Presentation[1]
 
Best Practices in Data Center Energy Resource Management
Best Practices in Data Center Energy Resource ManagementBest Practices in Data Center Energy Resource Management
Best Practices in Data Center Energy Resource Management
 
SUPERMICRO Innovative Computing Architecture
SUPERMICRO Innovative Computing ArchitectureSUPERMICRO Innovative Computing Architecture
SUPERMICRO Innovative Computing Architecture
 

Similar to Checkpointing the Uncheckpointable

Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Impetus Technologies
 
FYP1 Progress Report (final)
FYP1 Progress Report (final)FYP1 Progress Report (final)
FYP1 Progress Report (final)
waqas khan
 
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach
Checkpointing the Un-checkpointable: MANA and the Split-Process ApproachCheckpointing the Un-checkpointable: MANA and the Split-Process Approach
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach
inside-BigData.com
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
actifio
 
Corporate Laptop Backup and Recovery
Corporate Laptop Backup and RecoveryCorporate Laptop Backup and Recovery
Corporate Laptop Backup and Recovery
Jaspreet Singh
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
Wilhelm van Belkum
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
niallmilton
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
anil0878
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
Volodymyr Saviak
 
Os Vandeven
Os VandevenOs Vandeven
Os Vandeven
oscon2007
 
Final presentasi gnome asia
Final presentasi gnome asiaFinal presentasi gnome asia
Final presentasi gnome asia
Anton Siswo
 
Analytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at TwitterAnalytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at Twitter
Imply
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
DataWorks Summit
 
Advanced Computer Arch Assign II
Advanced Computer Arch Assign IIAdvanced Computer Arch Assign II
Advanced Computer Arch Assign II
Joshua Gorinson
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
programmermag
 
Implementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch governmentImplementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch government
DoKC
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
Himanshu Koli
 
Are your ready for in memory applications?
Are your ready for in memory applications?Are your ready for in memory applications?
Are your ready for in memory applications?
G2MCommunications
 
Data recovery from storage device
Data recovery from storage deviceData recovery from storage device
Data recovery from storage device
Mohit Shah
 
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in VacouverNGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
Scott Shadley, MBA,PMC-III
 

Similar to Checkpointing the Uncheckpointable (20)

Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
FYP1 Progress Report (final)
FYP1 Progress Report (final)FYP1 Progress Report (final)
FYP1 Progress Report (final)
 
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach
Checkpointing the Un-checkpointable: MANA and the Split-Process ApproachCheckpointing the Un-checkpointable: MANA and the Split-Process Approach
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
 
Corporate Laptop Backup and Recovery
Corporate Laptop Backup and RecoveryCorporate Laptop Backup and Recovery
Corporate Laptop Backup and Recovery
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Cassandra in Operation
Cassandra in OperationCassandra in Operation
Cassandra in Operation
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
Os Vandeven
Os VandevenOs Vandeven
Os Vandeven
 
Final presentasi gnome asia
Final presentasi gnome asiaFinal presentasi gnome asia
Final presentasi gnome asia
 
Analytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at TwitterAnalytics over Terabytes of Data at Twitter
Analytics over Terabytes of Data at Twitter
 
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financia...
 
Advanced Computer Arch Assign II
Advanced Computer Arch Assign IIAdvanced Computer Arch Assign II
Advanced Computer Arch Assign II
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
Implementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch governmentImplementing data and databases on K8s within the Dutch government
Implementing data and databases on K8s within the Dutch government
 
Enery efficient data prefetching
Enery efficient data prefetchingEnery efficient data prefetching
Enery efficient data prefetching
 
Are your ready for in memory applications?
Are your ready for in memory applications?Are your ready for in memory applications?
Are your ready for in memory applications?
 
Data recovery from storage device
Data recovery from storage deviceData recovery from storage device
Data recovery from storage device
 
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in VacouverNGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
NGD Systems and Microsoft Keynote Presentation at IPDPS MPP in Vacouver
 

More from MemVerge

Analytical Biosciences Accelerates Single Cell Sequencing with Big Memory
Analytical Biosciences Accelerates Single Cell Sequencing with Big MemoryAnalytical Biosciences Accelerates Single Cell Sequencing with Big Memory
Analytical Biosciences Accelerates Single Cell Sequencing with Big Memory
MemVerge
 
Impact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPCImpact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPC
MemVerge
 
Live Data: For When Data is Greater than Memory
Live Data: For When Data is Greater than MemoryLive Data: For When Data is Greater than Memory
Live Data: For When Data is Greater than Memory
MemVerge
 
Big Memory for HPC
Big Memory for HPCBig Memory for HPC
Big Memory for HPC
MemVerge
 
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big MemoryTech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
MemVerge
 
MemVerge Company Overview
MemVerge Company OverviewMemVerge Company Overview
MemVerge Company Overview
MemVerge
 
IDC Technology Spotlight: Big Memory Computing Emerges to Better Enable Dat...
IDC Technology Spotlight:   Big Memory Computing Emerges to Better Enable Dat...IDC Technology Spotlight:   Big Memory Computing Emerges to Better Enable Dat...
IDC Technology Spotlight: Big Memory Computing Emerges to Better Enable Dat...
MemVerge
 
Big Memory Webcast
Big Memory WebcastBig Memory Webcast
Big Memory Webcast
MemVerge
 

More from MemVerge (8)

Analytical Biosciences Accelerates Single Cell Sequencing with Big Memory
Analytical Biosciences Accelerates Single Cell Sequencing with Big MemoryAnalytical Biosciences Accelerates Single Cell Sequencing with Big Memory
Analytical Biosciences Accelerates Single Cell Sequencing with Big Memory
 
Impact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPCImpact of Intel Optane Technology on HPC
Impact of Intel Optane Technology on HPC
 
Live Data: For When Data is Greater than Memory
Live Data: For When Data is Greater than MemoryLive Data: For When Data is Greater than Memory
Live Data: For When Data is Greater than Memory
 
Big Memory for HPC
Big Memory for HPCBig Memory for HPC
Big Memory for HPC
 
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big MemoryTech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
Tech Talk: Moneyball - Hitting real-time apps out of the park with Big Memory
 
MemVerge Company Overview
MemVerge Company OverviewMemVerge Company Overview
MemVerge Company Overview
 
IDC Technology Spotlight: Big Memory Computing Emerges to Better Enable Dat...
IDC Technology Spotlight:   Big Memory Computing Emerges to Better Enable Dat...IDC Technology Spotlight:   Big Memory Computing Emerges to Better Enable Dat...
IDC Technology Spotlight: Big Memory Computing Emerges to Better Enable Dat...
 
Big Memory Webcast
Big Memory WebcastBig Memory Webcast
Big Memory Webcast
 

Recently uploaded

Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
shanihomely
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Zilliz
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
RaminGhanbari2
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
Steven Carlson
 
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
aslasdfmkhan4750
 
The importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT StandardizationThe importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT Standardization
Axel Rennoch
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
aakash malhotra
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
SynapseIndia
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
Matthias Neugebauer
 
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
Priyanka Aash
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Nicolás Lopéz
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
sunilverma7884
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
Shiv Technolabs
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Kunal Gupta
 
Amul milk launches in US: Key details of its new products ...
Amul milk launches in US: Key details of its new products ...Amul milk launches in US: Key details of its new products ...
Amul milk launches in US: Key details of its new products ...
chetankumar9855
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
Anant Gupta
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 

Recently uploaded (20)

Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
 
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
High Profile Girls Call ServiCe Hyderabad 0000000000 Tanisha Best High Class ...
 
The importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT StandardizationThe importance of Quality Assurance for ICT Standardization
The importance of Quality Assurance for ICT Standardization
 
Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024Three New Criminal Laws in India 1 July 2024
Three New Criminal Laws in India 1 July 2024
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
 
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
(CISOPlatform Summit & SACON 2024) Keynote _ Power Digital Identities With AI...
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
 
The Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF GuideThe Role of IoT in Australian Mobile App Development - PDF Guide
The Role of IoT in Australian Mobile App Development - PDF Guide
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
 
Amul milk launches in US: Key details of its new products ...
Amul milk launches in US: Key details of its new products ...Amul milk launches in US: Key details of its new products ...
Amul milk launches in US: Key details of its new products ...
 
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes..."Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
"Mastering Graphic Design: Essential Tips and Tricks for Beginners and Profes...
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 

Checkpointing the Uncheckpointable

  • 1. User #1 Name, Tile Dr. Gene Cooperman Professor Northeastern University
  • 2. Why Use Checkpoint-Restart? Gene Cooperman December 10, 2020  My application takes 1 week to run, and I only have a 48-hour batch allocation slot.  There are upcoming shutdowns (either planned or unplanned; due to electricity shutdown, building work, etc.).  Computer nodes sometimes crash..
  • 3. DMTCP: What does it stand for?  DMTCP: Distributed MultiThreaded CheckPointing  Current and past industrial support from:  Academic support from: Gene Cooperman December 10, 2020
  • 4. So, what is Checkpoint-Restart/DMTCP? Checkpoint-Restart is the ability to save a running application to disk (and restart it).  DMTCP is a widely used C-R package for HPC.  120 publications by others using DMTCP: http://dmtcp.sourceforge.net/publications.html  Periodic ckpt or under program control.  Runtime overhead typically less than 1%.  End-user plugins for added flexibility.  Leveraging support at NERSC Supercomputer Center (best practices) Gene Cooperman December 10, 2020
  • 5. Philosophy of DMTCP  Open Source: LGPL-v3 (non-contagious license)  Considering switching to Apache license  Transparent checkpointing:  No modification to application binaries; no modification to Linux kernel or system libraries • Long-term Standards for robustness:  Mostly built on top of POSIX syscalls  Synergy between academia (publishing) and industry (impact on current practices) Gene Cooperman December 10, 2020
  • 6. Long-Term Research Questions FUNDAMENTAL QUESTION: What are the limits of transparent checkpointing?  What about MPI over InfiniBand or TCP? (Yes, DMTCP: last 7 years)  And newer networks? (CRAY GNI, HPE Slingshot, Intel Omni-Path, InfiniBand extensions, …): (Yes, DMTCP/MANA: last two years)  What about CUDA? (for NVIDIA GPUs) (Yes, DMTCP/CRAC: last year) *EXTENDING DMTCP: plugins, virtual ids, split processes (see later) Gene Cooperman December 10, 2020
  • 7. Checkpointing to Stable Storage The time to checkpoint is a long-standing question. The time can be decomposed into: 1) Saving kernel/network status: a few seconds 2) Saving the memory image to stable storage: depends on bandwidth of storage EXAMPLE: storage to disk: ~100 MB/s storage to SSD: ~500 MB/s storage to Optane: ~5 GB/s Big Data Applications: Is there a CoW (copy-on-write) filesystem for snapshot? Gene Cooperman December 10, 2020
  • 8. Persistent Storage/Memory and HPC Newer persistent memories in clusters are used in at least three application- visible ways: 1) Substitute for RAM: provides persistence in database applications 2) Local burst-buffer: serves to accelerate writes to secondary storage 3) Intermediate level in the storage hierarchy: fast local-node storage or shared storage among nearby nodes DMTCP plugins are used to recognize these situations during ckpt; and then restore during restart. Gene Cooperman December 10, 2020
  • 9. DMTCP Internal Architecture Gene Cooperman December 10, 2020
  • 10. Plugins, Virtual Ids, and Split Process: Pt. 1: Plugins Gene Cooperman December 10, 2020
  • 11. Plugins, Virtual Ids, and Split Process Part 2: Virtual Ids: PID Plugin Gene Cooperman December 10, 2020
  • 12. Plugins, Virtual Ids, and Split Process Part 3a: Split Processes: MANA Plugin for MPI Gene Cooperman December 10, 2020
  • 13. Plugins, Virtual Ids, and Split Process Part 3b: Split Processes: CRAC Plugin for CUDA Gene Cooperman December 10, 2020
  • 14. Acknowledgment THANKS TO THE MANY STUDENTS AND OTHERS WHO HAVE CONTRIBUTED TO DMTCP OVER THE YEARS: Jason Ansel, Kapil Arya, Alex Brick, Jiajun Cao, Prashant Chouhan, Tyler Denniston, Xin Dong, Younes El Idrissi Yazami, William Enright, Rohan Garg, Twinkle Jain, Samaneh Kazemi, Jay Kim, Gregory Kerr, Apoorve Mohan, Neil Resnik, Manuel Rodríguez Pascual, Artem Y. Polyakov, Michael Rieker, Praveen S. Solanki, Ana-Maria Visan *This work was partially supported by National Science Foundation Grant OAC- 1740218 and grants from Intel Corporation, and Raytheon Technologies. If you want to hear more about checkpointing, consider the upcoming SuperCheck'21 conference: Feb. 4-5, 2021 First Int. Symp. on Checkpointing for Supercomputing Gene Cooperman December 10, 2020