Hard and Soft Error Resilience for One-
sided Dense Linear Algebra Algorithms
A Dissertation Defense
in Support of the
Doctor of Philosophy Degree
Peng Du
Advisor: Prof. Jack Dongarra
October 26, 2016
Agenda
• Motivation
• Dissertation Statement
• Original Work
• Background
• Hard Errors
• Soft Errors
• Contributions
• Publications
Motivation
• HPC systems are getting larger
• Chips are getting denser and denser
TOP500 List - June 2012
Motivation
• Proprietary components: IBM BG/P
• Full-system MTBF of 1 week (1000+ year MTBF per node)
• Commodity components: x86 (Intel + AMD)
• Full-system MTBF of 1 day
• Energy budgets limit the use of error detection/redundancy
• Full-system MTBF of 1 hour
Resilience Exascale Workshop slides, Franck Cappello
(MTBF: Mean Time Between Failures)
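The scaling behind these numbers is simple division: assuming independent node failures, the full-system MTBF is roughly the per-node MTBF divided by the node count. The node counts below are illustrative, not measurements:

```python
# Rough model: with independent node failures, the full-system MTBF
# shrinks linearly with the number of nodes.
HOURS_PER_YEAR = 8766
mtbf_node_years = 1000            # the per-node figure quoted above

for n_nodes in (10_000, 100_000, 1_000_000):
    mtbf_sys_hours = mtbf_node_years * HOURS_PER_YEAR / n_nodes
    print(f"{n_nodes:>9} nodes -> system MTBF ~ {mtbf_sys_hours:8.1f} hours")
```

Even with 1000-year nodes, a million-node machine would fail about every 9 hours under this model.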
Dissertation Statement
• The goal of this dissertation is to demonstrate that one-sided dense linear algebra factorizations and solvers can be made tolerant to both hard errors (fail-stop failures) and soft errors. The following problems are studied:
• Full matrix protection (the left and right factors)
• MPI support for runtime system recovery
• Detection of multiple soft errors in time and space, and recovery of both factorization and solver
• Performance management on large-scale systems
• Soft errors on hybrid platforms with GPGPUs
Original Work
• Hard errors
• A performance-efficient method to protect the left factor (e.g., L in LU factorization)
• Recovery of the running stack in QR factorization after a hard error, using on-demand checkpointing
• Soft errors
• Scalable local (diskless) checkpointing to protect the left factor from soft errors
• Floating-point weighted checksum encoding
• A detection and recovery algorithm for multiple soft errors in the right factor and trailing matrix, using the weighted checksum encoding
• LU-based linear system solver
• Factorization (demonstrated with QR)
• A complexity reduction algorithm
Related Work
• Hardware protection
• Memory, cache
• Single-bit-error correction and double-bit-error detection (SEC/DED)
• Compute logic
• Logic circuits with verification functionality
• Space or execution redundancy
• Disk checkpointing/restart
• Coordinated and uncoordinated checkpointing
• Incremental checkpointing, forked (copy-on-write) checkpointing, etc.
Related Work
• Diskless checkpointing
• Parity-based checksum (XOR of bits)
• Neighbor- and parity-based diskless checkpointing
• Algorithm-Based Fault Tolerance (ABFT)
• The checksum is generated only once, before the computation
• The checksum is updated by the host algorithm
• Check and fix are performed only after the computation
• Backward error assertions
• Iterative refinement to correct small errors
• The application is notified of uncorrectable errors
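The ABFT idea can be seen on a toy example (pure Python, hypothetical 3×3 data): append a row-sum checksum column once before the computation, run plain Gaussian elimination on the widened matrix, and the row operations of the host algorithm keep every row's checksum consistent on their own:

```python
A = [[4.0, 2.0, 1.0],
     [2.0, 5.0, 2.0],
     [1.0, 2.0, 6.0]]
W = [row + [sum(row)] for row in A]      # checksum generated once, up front

n = len(A)
for k in range(n):                       # Gaussian elimination on [A | c]
    for i in range(k + 1, n):
        m = W[i][k] / W[k][k]
        for j in range(n + 1):           # the checksum column is updated
            W[i][j] -= m * W[k][j]       # by the host algorithm itself
        # invariant: the last entry still equals the sum of the row
        assert abs(W[i][n] - sum(W[i][:n])) < 1e-12
```

Because each row operation is linear, the checksum relation survives every elimination step; a final check (and fix) is all that is needed after the computation.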
Failure Types
• Hard Error ("fail-stop failure")
(Figure: processes P0, P1, P2 over time; a failed process stops and is lost)
Failure Types
• Soft Error ("transient error", "silent data corruption", etc.)
• Radiation induced:
• Alpha particles
• High-energy neutrons
• Thermal neutrons
(Figure: a single bit flip, 00110101 → 10110101; processes P0, P1, P2 keep running, but some now carry corrupted data)
Factorization
• Dense matrix factorizations: LU, Cholesky, QR
• Solving Ax = b:  A = LU,  x = U \ (L \ b)
• Block LU factorization: GETF2 on the panel, TRSM, then GEMM on the trailing matrix, repeated panel by panel
• Hybrid & blocked QR: DGEQRF & DLARFT on the CPU, DLARFB on the GPU, repeated panel by panel, producing Q and R
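The two triangular solves can be sketched in pure Python on a tiny hypothetical system; production codes of course use pivoted, blocked, parallel kernels (GETF2/TRSM/GEMM) instead:

```python
def lu(A):
    """Unblocked LU without pivoting (fine for this diagonally dominant toy)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]        # elimination multiplier
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                       # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

def backward(U, y):                      # solve U x = y
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [7.0, 9.0, 9.0]                      # chosen so that x = (1, 1, 1)
L, U = lu(A)
x = backward(U, forward(L, b))           # x = U \ (L \ b)
```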
Zones of the Matrix
(Figure: the matrix split into the left factor and the right factor)
Hard Error
• Key problem: protection of the left factor
• 2D block-cyclic data distribution
• Checkpointing with high parallelism
• Efficient recovery
• Running-stack recovery
ABFT for the Right Factor
Q-Parallel Checkpointing for the Left Factor
• Q-parallel checkpoint (P × Q process grid)
• Checkpointing runs in parallel horizontally every Q iterations
• Scalable, with no need for extra storage
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations", Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra, 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12)
ABFT Checksum
Diskless Checkpoint
Recovery example
Overhead
• M (rows) × N (columns) input matrix, P (rows) × Q (columns) process grid
• Storage for checksum: MN/Q entries
• Ratio over the matrix: 1/Q
• Computation overhead: O(N²)
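Plugging illustrative sizes into these formulas shows the relative storage cost shrinking as the process grid widens:

```python
M = N = 100_000                      # hypothetical matrix size
for Q in (8, 16, 32):
    checksum_entries = M * N // Q    # storage for the checksum: MN/Q
    ratio = checksum_entries / (M * N)
    gb = checksum_entries * 8 / 1e9  # double precision, 8 bytes per entry
    print(f"Q={Q:2}: checksum {gb:6.1f} GB, ratio {ratio:.4f} (= 1/Q)")
```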
Experiment Platforms
• "Dancer" @ UT
• 16 nodes; each node has two 2.27 GHz quad-core Intel E5520 CPUs
• A 20 GB/s InfiniBand interconnect
• Solid-state drives
• "Kraken" @ ORNL
• Cray XT5 machine with 9,408 compute nodes
• Each node has two 2.6 GHz six-core AMD Opteron (Istanbul) processors and 16 GB of memory
• Nodes connected through the SeaStar2+ interconnect
• The scalable cluster file system Lustre
Experiment Results (Kraken)
(Three slides of result figures)
Recovery of the Running Stack
• There is no official MPI support for determining a failed process's identity
• In fact…
• The running stack on the failed process must be recovered:
• Matrix data
• Control variables (e.g., loop counts)
(Figure: processes P0, P1, P2 out of synchronization)
Checkpoint-on-Failure (CoF)
"A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
FT-QR with CoF
(1) Surviving processes checkpoint to disk
(2) Restart the program and dry-run to the failure point
(3) Surviving processes load their checkpoints from disk
(4) Parallel-Q and ABFT recovery
(5) Recovery done; execution resumed
(Figure: processes P0, P1, P2 stepping through the protocol)
Experiment Result (Dancer)
(8 processes/node × 16 nodes)
Experiment Result (Kraken)
(24 × 24 processes)
Soft Error
• Key problems
• Silent error
• Error propagation
• Detection
• Off-line recovery
Multiple Soft Errors
• Propagation in the right factor
(Figure: the original errors and the errors due to propagation)
General work flow (LU solver)
(1) Generate a checksum for the input matrix as additional columns
(2) Perform the LU factorization WITH the additional checksum columns
(3) Solve Ax = b using the LU factors from the factorization (even if a soft error occurred during it)
(4) Check for soft errors
(5) Correct the solution x
Soft Error
Error modeling · Encoding for checksum · Left factor protection
How to detect & recover soft errors in L?
• The recovery of Ax = b requires a correct L
• L does not change once produced
• Diskless checkpointing for L
• Pivoting on L is delayed to prevent the checksum of L from being invalidated
Checkpointing for L, idea 1
• PDGEMM-based checkpointing
• Checkpointing time increases when scaling to more processes and larger matrices
NOT SCALABLE
Checkpointing for L, idea 2
• Local checkpointing
• Each process checkpoints its own involved data locally
• Constant checkpointing time
SCALABLE
Error modeling (1 error)
• When does the error strike?
• Answer: it doesn't really matter
(Figure: the zones L, U, and the trailing matrix)
Based on work by Luk et al. in the 1980s on systolic arrays
Locate Error
ˆP [ ˆA, A×e, A×w ] = ˆL [ ˆU, c, v ],   ˆA = A + d e_j^T   (initial error in column j)
⇒ ˆP ˆA = ˆL ˆU,   ˆP (A e) = ˆL c,   ˆP (A w) = ˆL v
Generator matrix: G = [ e, w ]^T, with e = (1, 1, …, 1)^T and w = (w1, w2, …, wn)^T
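This test can be sketched in pure Python (hypothetical 3×3 data, no pivoting to keep it short): we hold the factors of the corrupted matrix ˆA while the checksum columns still encode the correct A, and the component-wise ratio of the two residuals exposes the weight w_j of the erroneous column:

```python
def lu(M):
    """Unblocked LU without pivoting, for a small well-behaved example."""
    n = len(M)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in M]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                    # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
e = [1.0, 1.0, 1.0]
w = [1.0, 2.0, 3.0]                   # distinct weights: w_j identifies j

jerr, d = 1, [0.5, -1.0, 2.0]         # silent error hits column 1
Ah = [row[:] for row in A]            # Ah plays the role of ^A
for i in range(3):
    Ah[i][jerr] += d[i]

Lh, Uh = lu(Ah)                       # the factorization we actually hold
Ae = [sum(A[i][j] * e[j] for j in range(3)) for i in range(3)]
Aw = [sum(A[i][j] * w[j] for j in range(3)) for i in range(3)]
c, v = forward(Lh, Ae), forward(Lh, Aw)   # checksums still encode A

r = [c[i] - sum(Uh[i][j] * e[j] for j in range(3)) for i in range(3)]
s = [v[i] - sum(Uh[i][j] * w[j] for j in range(3)) for i in range(3)]
ratios = [s[i] / r[i] for i in range(3)]  # every component equals w[jerr]
```

Since r = ˆL⁻¹ applied to the error vector and s = w_j times the same vector, every component of s ./ r equals w_j, and searching for that value in w recovers the error column.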
Error modeling (2 errors)
(Figure: L, U, and the trailing matrix with two erroneous columns j1 and j2)
Floating Point Encoding
Error-free checksums of the left-factor entries l1, …, ln:
l1 + l2 + … + ln = c1
w1 l1 + w2 l2 + … + wn ln = c2
u1 l1 + u2 l2 + … + un ln = c3
If entries li and lj are corrupted into ˆli and ˆlj, i.e. ˆl = (l1, …, ˆli, …, ˆlj, …, ln), the recomputed checksums become ˆc1, ˆc2, ˆc3:
l1 + … + ˆli + … + ˆlj + … + ln = ˆc1
w1 l1 + … + wi ˆli + … + wj ˆlj + … + wn ln = ˆc2
u1 l1 + … + ui ˆli + … + uj ˆlj + … + un ln = ˆc3
Let ui = wi^2. Then the check equation for the left factor is
(ˆc3 − c3) − (wi + wj)(ˆc2 − c2) + wi wj (ˆc1 − c1) = 0
and finding wi and wj costs O(N^2).
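A small numeric check of this quadratic (pure Python, made-up values): with weights 1, w_i, and u_i = w_i², the expression vanishes only at the true error pair, which an O(n²) sweep over pairs finds:

```python
l = [3.0, 1.0, 4.0, 1.0, 5.0]         # protected entries (hypothetical)
w = [1.0, 2.0, 3.0, 4.0, 5.0]         # distinct weights
u = [wi * wi for wi in w]             # u_i = w_i^2

def checksums(vals):
    return (sum(vals),
            sum(wi * x for wi, x in zip(w, vals)),
            sum(ui * x for ui, x in zip(u, vals)))

c1, c2, c3 = checksums(l)             # stored before any error strikes
bad = l[:]
bad[1] += 0.7                         # two silent errors, positions 1 and 3
bad[3] -= 2.0
h1, h2, h3 = checksums(bad)           # recomputed, corrupted checksums

def check(i, j):
    return (h3 - c3) - (w[i] + w[j]) * (h2 - c2) + w[i] * w[j] * (h1 - c1)

# O(n^2) search over pairs: only the true pair satisfies the equation
hits = [(i, j) for i in range(5) for j in range(i + 1, 5)
        if abs(check(i, j)) < 1e-9]
```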
Multiple Soft Errors
• Locate and recover the solution from multiple errors in the right factor
But, what about performance?
Check equation for the right factor (a vector equation; here u = w² component-wise):
(ˆs − ˆU u) − (w_j1 + w_j2)(ˆv − ˆU w) + w_j1 w_j2 (c − ˆU e) = 0
O(N^3) for two errors, but so is LU!
Complexity Reduction
(Figures: performance of the complexity-reduced check; 16 × 16 cores, 2 errors)
Recovery
• Solver: the Sherman-Morrison formula recovers the solution of Ax = b
• Factorization: the left & right factors are recovered by reducing a spiked matrix
Based on work by Luk et al. in the 1980s on systolic arrays
Experiment Results
(Figures; CPU: two 6-core Xeon 5660, GPU: NVIDIA M2070)
Contributions
• Hard errors
• A performance-efficient method to protect the left factor (e.g., L in LU factorization)
• Recovery of the running stack in QR factorization after a hard error, using on-demand checkpointing
• Soft errors
• Scalable local (diskless) checkpointing to protect the left factor from soft errors
• Floating-point weighted checksum encoding
• A detection and recovery algorithm for multiple soft errors in the right factor and trailing matrix, using the weighted checksum encoding
• LU-based linear system solver
• Factorization (demonstrated with QR)
• A complexity reduction algorithm
Publications
• Chapter 3
• "Algorithm-based Fault Tolerance for Dense Matrix Factorizations". Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12). February 2012, New Orleans, LA.
• "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
• Chapter 4
• "High Performance Dense Linear System Solver with Soft Error Resilience". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. IEEE Cluster 2011. Austin, TX.
• "High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors". Peng Du, Piotr Luszczek and Jack Dongarra. The International Conference on Computational Science (ICCS) 2012. Omaha, NE.
• Chapter 5
• "Soft Error Resilient QR Factorization for Hybrid System with GPGPU". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. The Second Workshop on Scalable Algorithms for Large-scale Systems (ScalA) 2011. Seattle, Washington.
Backup Slides
Failures in PBLAS
(Figures: LARFT+LARFB, A2Q)
Data Layout
• 2D block-cyclic distribution
(Figure: a 2 × 3 process grid cyclically mapped over the matrix blocks; process (0,1) dies)
Recover QR factorization on Hybrid Platform with GPGPU
ˆA = A + d e_j^T
(Figure: the factors ˆQ, ˆR of the corrupted matrix; which Q and R factor the correct A?)
QR update
Given ˆQ ˆR = ˆA, find Q R = A = ˆA + u v^T:
A − ˆA = ( a_j − ˆQ ˆR_j ) e_j^T = u v^T
A = ˆQ ˆR + u v^T = ˆQ ( ˆR + ˆQ^T u v^T )
A = ˆQ ( ˆR + w v^T ),  with w = ˆQ^T u = ˆQ^T a_j − ˆR_j and v = e_j
ˆR + w v^T is returned to triangular form by an orthogonal transformation (fast Givens rotations on the GPU)
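The update identity can be checked numerically (pure Python, classical Gram-Schmidt QR on a hypothetical 3×3 example; the real implementation instead re-triangularizes ˆR + w v^T with fast Givens rotations on the GPU):

```python
def qr(M):
    """Classical Gram-Schmidt; returns Q as a list of orthonormal rows q_k."""
    n = len(M)
    cols = [[M[i][j] for i in range(n)] for j in range(n)]
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(len(Q)):
            R[k][j] = sum(Q[k][i] * cols[j][i] for i in range(n))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q.append([x / R[j][j] for x in v])
    return Q, R

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
j = 1
Ah = [row[:] for row in A]                 # ^A = A + d e_j^T
for i, di in enumerate([0.5, -1.0, 2.0]):
    Ah[i][j] += di

Qh, Rh = qr(Ah)                            # the erroneous factorization
aj = [A[i][j] for i in range(3)]           # correct column j of A
wv = [sum(Qh[k][i] * aj[i] for i in range(3)) - Rh[k][j] for k in range(3)]

Rfix = [row[:] for row in Rh]              # ^R + w e_j^T
for k in range(3):
    Rfix[k][j] += wv[k]

rebuilt = [[sum(Qh[k][i] * Rfix[k][col] for k in range(3))
            for col in range(3)] for i in range(3)]   # = ^Q (^R + w e_j^T)
err = max(abs(rebuilt[i][col] - A[i][col])
          for i in range(3) for col in range(3))
```

The rebuilt product matches A entrywise, confirming w = ˆQ^T a_j − ˆR_j repairs the corrupted column.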
Encoding for Q
(Figure: CPU/GPU execution trace for matrix size 17480: DGEQRF and DLARFT on the CPU, the panel sent to the GPU with cublasSetMatrix()/cudaMemcpy2DAsync(), look-ahead DLARFB and trailing DLARFB on the GPU)
Error modeling (for "where")
Input matrix: A_0 = A
One step of LU: A_t = L_{t-1} P_{t-1} A_{t-1}
If no soft error occurs: U = (L_n P_n) … (L_1 P_1)(L_0 P_0) A_0
If a soft error occurs at step t:
A_t = L_{t-1} P_{t-1} A_{t-1} − λ e_i e_j^T
    = L_{t-1} P_{t-1} (L_{t-2} P_{t-2} … L_0 P_0) A_0 − λ e_i e_j^T
Define an erroneous initial matrix ˆA:
ˆA ≜ (L_{t-1} P_{t-1} L_{t-2} P_{t-2} … L_0 P_0)^{-1} A_t
   = A − (L_{t-1} P_{t-1} L_{t-2} P_{t-2} … L_0 P_0)^{-1} λ e_i e_j^T
   = A − d e_j^T
Locate Error
ˆP [ ˆA, A×e, A×w ] = ˆL [ ˆU, c, v ],   ˆA = A + d e_j^T   (error in column j)
⇒ ˆP ˆA = ˆL ˆU,   ˆP (A e) = ˆL c,   ˆP (A w) = ˆL v
Generator matrix: G = [ e, w ]^T, with e = (1, 1, …, 1)^T and w = (w1, w2, …, wn)^T
LU based linear solver
A x = b,  A = LU,  x = U \ (L \ b)
General work flow:
(1) Generate a checksum for the input matrix as additional columns
(2) Perform the LU factorization WITH the additional checksum columns
(3) Solve Ax = b using the LU factors from the factorization (even if a soft error occurred during it)
(4) Check for soft errors
(5) Correct the solution x
Why is soft error hard to handle?
• Soft errors occur silently
• They propagate
Recover Ax=b
• The Sherman-Morrison formula
Given: the erroneous factorization ˆP ˆA = ˆL ˆU
To solve: A x = b
Recover Ax=b
A x = b ⇒ x = A^{-1} b = A^{-1} ( ˆP^{-1} ˆP ) b = ( ˆP A )^{-1} ˆP b
( ˆP A )^{-1} = ?
Recover Ax=b
Recall: A − ˆA = d e_j^T
Therefore:
ˆP A − ˆP ˆA = ( ˆP a_j − ˆL ˆU_j ) e_j^T
ˆP A = ˆL ˆU + ˆL ( ˆL^{-1} ˆP a_j − ˆU_j ) e_j^T = ˆL ( ˆU + t e_j^T )
     = ˆL ˆU ( I + ˆU^{-1} t e_j^T ) = ˆL ˆU ( I + v e_j^T )
where t = ˆL^{-1} ˆP a_j − ˆU_j and v = ˆU^{-1} t
(a_j and ˆU_j denote column j of A and of ˆU)
Recover Ax=b
( ˆP A )^{-1} = ( ˆL ˆU ( I + v e_j^T ) )^{-1}
             = ( I + v e_j^T )^{-1} ( ˆL ˆU )^{-1}
             = ( I − (1 / (1 + v_j)) v e_j^T ) ( ˆL ˆU )^{-1}   (Sherman-Morrison)
Recover Ax=b
A x = b ⇒ x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx, where ˆx solves ˆL ˆU ˆx = ˆP b
Recover Ax=b
(1) Solve ˆL ˆU ˆx = ˆP b
(2) Compute t = ˆL^{-1} ˆP a_j − ˆU_j,  v = ˆU^{-1} t,  and
    x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx
Recover Ax=b
(1) Solve ˆL ˆU ˆx = ˆP b
(2) Compute t = ˆL^{-1} ˆP a_j − ˆU_j,  v = ˆU^{-1} t,  and x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx
Every step applies ˆL and ˆU: the left factor needs protection.
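The two-step recovery is small enough to run end to end (pure Python, hypothetical data, no pivoting so ˆP = I): we factor the corrupted matrix, solve with the wrong factors, and repair the solution with one rank-one Sherman-Morrison correction:

```python
def lu(M):
    """Unblocked LU without pivoting, for a small well-behaved example."""
    n = len(M)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in M]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                     # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

def backward(U, y):                    # solve U x = y
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [7.0, 9.0, 9.0]                    # the true solution is (1, 1, 1)
jerr = 1
Ah = [row[:] for row in A]             # ^A = A - d e_j^T
for i, di in enumerate([0.5, -1.0, 2.0]):
    Ah[i][jerr] -= di

Lh, Uh = lu(Ah)                        # erroneous factors of ^A
xh = backward(Uh, forward(Lh, b))      # (1) solve ^L ^U ^x = b

aj = [A[i][jerr] for i in range(3)]    # correct column j of A
t = forward(Lh, aj)                    # (2) t = ^L \ a_j - ^U_j
t = [t[i] - Uh[i][jerr] for i in range(3)]
v = backward(Uh, t)                    # v = ^U \ t
x = [xh[i] - v[i] * xh[jerr] / (1.0 + v[jerr]) for i in range(3)]
```

The corrected x matches the true solution; only one extra column of A (or a way to regenerate it) and the factors themselves are needed.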
Locate Error
ˆP A e = ˆL c
⇒ c = ˆL^{-1} ˆP A e = ˆL^{-1} ˆP ( ˆA + d e_j^T ) e
    = ˆL^{-1} ( ˆP ˆA + ˆP d e_j^T ) e
    = ˆL^{-1} ( ˆL ˆU + ˆP d e_j^T ) e
    = ˆU e + ˆL^{-1} ˆP d
⇒ c − ˆU e = ˆL^{-1} ˆP d = r
Locate Error
ˆP A w = ˆL v
⇒ v = ˆL^{-1} ˆP A w = ˆL^{-1} ˆP ( ˆA + d e_j^T ) w
    = ˆL^{-1} ( ˆP ˆA + ˆP d e_j^T ) w
    = ˆL^{-1} ( ˆL ˆU + ˆP d e_j^T ) w
    = ˆU w + ˆL^{-1} ˆP d w_j
⇒ v − ˆU w = ˆL^{-1} ˆP d w_j = s
Locate Error
c − ˆU e = ˆL^{-1} ˆP d = r
v − ˆU w = w_j ˆL^{-1} ˆP d = s
⇒ s = w_j × r ⇒ w_j (1, 1, …, 1)^T = s ./ r
• w_j is the j-th element of the vector w in the generator matrix
• The component-wise division of s by r reveals w_j
• Searching for w_j in w reveals the column of the initial soft error
Extra Storage
• For an input matrix of size M × N on a P × Q process grid:
• A copy of the original matrix
• Not necessary when the required column of the original matrix is easy to re-generate
• 2 additional checksum columns: 2 × M
• Each process holds 2 checksum rows of width N/Q; 2 × P × N in total
• Ratio:
extra storage / matrix storage = (2×M + 2×P×N) / (M×N) = 2/N + 2P/M → 0 as M, N → ∞

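Evaluating the reconstructed ratio (2M + 2PN)/(MN) = 2/N + 2P/M for growing square matrices on a fixed 16 × 16 grid shows the relative extra storage vanishing:

```python
P = 16                                 # fixed process-grid height
for M in (10_000, 100_000, 1_000_000):
    N = M                              # square matrices, illustrative sizes
    ratio = (2 * M + 2 * P * N) / (M * N)
    print(f"M = N = {M:>9}: extra/matrix = {ratio:.2e}")
```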
More Related Content

What's hot

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine LearningFabian Pedregosa
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28Aritra Sarkar
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23Aritra Sarkar
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Naren P.R.
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Quantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOQuantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOCarl Belle
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Kento Aoyama
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmNECST Lab @ Politecnico di Milano
 
20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revisedrerngvit yanggratoke
 
Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Aritra Sarkar
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedVasia Kalavri
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Flink Forward
 
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...NAVER Engineering
 
Quantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AIQuantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AISasha Lazarevic
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 

What's hot (20)

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Quantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOQuantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OO
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 
ASQ Talk v4
ASQ Talk v4ASQ Talk v4
ASQ Talk v4
 
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
 
20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised
 
Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
 
QUANTUM COMP 22
QUANTUM COMP 22QUANTUM COMP 22
QUANTUM COMP 22
 
Quantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AIQuantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AI
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 

Similar to defense_slides_pengdu

Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Apache Apex
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Yuanyuan Tian
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...Victor Asanza
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
The Search for Gravitational Waves
The Search for Gravitational WavesThe Search for Gravitational Waves
The Search for Gravitational Wavesinside-BigData.com
 
Microprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationMicroprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationArkhom Jodtang
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Research Data Alliance
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfDuy-Hieu Bui
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Emiliano
 

Similar to defense_slides_pengdu (20)

Ph.D. Defense
Ph.D. DefensePh.D. Defense
Ph.D. Defense
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
 
Isorc18 keynote
Isorc18 keynoteIsorc18 keynote
Isorc18 keynote
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Lecture_2_v2_qc.pptx
Lecture_2_v2_qc.pptxLecture_2_v2_qc.pptx
Lecture_2_v2_qc.pptx
 
The Search for Gravitational Waves
The Search for Gravitational WavesThe Search for Gravitational Waves
The Search for Gravitational Waves
 
Microprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationMicroprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operation
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Cadancesimulation
CadancesimulationCadancesimulation
Cadancesimulation
 
HPC Performance tools, on the road to Exascale
HPC Performance tools, on the road to ExascaleHPC Performance tools, on the road to Exascale
HPC Performance tools, on the road to Exascale
 
Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!
 

defense_slides_pengdu

  • 1. Hard and Soft Error Resilience for One- sided Dense Linear Algebra Algorithms A Dissertation Defense in Support of the Doctor of Philosophy Degree Peng Du Advisor: Prof. Jack Dongarra October 26, 2016
  • 2. Agenda October 26, 2016 2 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 3. Motivation • HPC systems are getting larger • Chip is getting more and more dense October 26, 2016 3 TOP500 List - June 2012
  • 4. Motivation • Proprietary Components: IBM BG/P • Full system MBTF of 1 week: (1000+ year MTBF per node) October 26, 2016 4 • Commodity Components: (X86, Intel + AMD) • Full system MTBF of 1 day • Energy budget limit the use of error/detection/redundancy • Full system MTBF of 1 hour Resilience Exascale Workshop Slides, Franck Cappello (MTBF: Mean Time Between Failure)
  • 5. Agenda October 26, 2016 5 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 6. Dissertation Statement • The goal of the dissertation is to demonstrate that one-sided dense linear algebra factorizations and solvers can be made fault tolerant to both hard errors (fail-stop failures) and soft errors. The following problems are studied: • Full matrix protection (the left and right factors) • MPI support for runtime system recovery • Detection of multiple soft errors in time and space, and recovery of both the factorization and the solver • Performance management on large-scale systems • Soft errors on hybrid platforms with GPGPUs October 26, 2016 6
  • 7. Agenda October 26, 2016 7 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 8. Original Work • Hard Error • A performance-efficient method to protect the left factor (e.g. L in LU factorization) • Recovery of the running stack in QR factorization after a hard error using on-demand checkpointing • Soft Error • Scalable local (diskless) checkpointing to protect the left factor from soft errors • Floating-point-number weighted checksum encoding • Multiple-soft-error detection and recovery algorithm for the right factor and trailing matrix using the weighted checksum encoding • LU-based linear system solver • Factorization (demonstrated by QR) • Complexity reduction algorithm October 26, 2016 8
  • 9. Related Work • Hardware Protection • Memory, cache • Single-bit-error-correction and double-bit-error-detection (SEC/DED) • Compute logic • Logic circuits with verification functionalities • Space or execution redundancy • Disk Checkpointing/Restart • coordinated and uncoordinated checkpointing • incremental checkpointing, forked (copy-on-write) checkpointing, etc. October 26, 2016 9
  • 10. Related Work • Diskless Checkpointing • Parity based checksum (XOR of bits) • Neighbor- and parity-based diskless checkpointing • Algorithm Based Fault Tolerance (ABFT) • Checksum is generated only once, before the computation • Checksum is updated by the host algorithms • Check and fix are performed only after computation • Backward Error Assertions • Iterative refinement to correct small errors • Uncorrectable errors are notified to the applications October 26, 2016 10
  • 11. Agenda October 26, 2016 11 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 12. Failure Types • Hard Error (“fail-stop failure”) October 26, 2016 12 P0 P1 P2 Time ✔ ✔ ✔
  • 13. Failure Types • Soft Error (“Transient error”, “Silent Data Corruption”, etc.) • Radiation Induced • Alpha particle • High energy neutron • Thermal neutron October 26, 2016 13 0 0 1 1 0 10 1 1 0 1 1 0 10 1 P0 P1 P2 Time ✖ ✔ ✖
  • 14. Factorization • Dense matrix factorizations • LU, Cholesky, QR • Ax = b October 26, 2016 14 A = LU, x = U^{-1}(L^{-1} b)
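The factor-then-solve pattern on this slide can be illustrated with a minimal NumPy/SciPy sketch (not the dissertation's ScaLAPACK code; the matrix and right-hand side below are just examples):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_via_lu(A, b):
    # Factor PA = LU once, then apply two triangular solves:
    # x = U^{-1} (L^{-1} (P b))
    lu, piv = lu_factor(A)
    return lu_solve((lu, piv), b)

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])
x = solve_via_lu(A, b)
```

Once the factors exist, additional right-hand sides cost only the two triangular solves.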
  • 15. Block LU factorization GETF2 TRSM GEMM GETF2 TRSM
  • 16. Hybrid & Blocked QR DGEQRF & DLARFT (CPU) DLARFB (GPU) DGEQRF & DLARFT (CPU) DLARFB(GPU) … Q R
  • 17. Zones of the Matrix October 26, 2016 17 Right Factor Left Factor
  • 18. Agenda October 26, 2016 18 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 19. Hard Error • Key problem: The protection of the left factor • 2D block cyclic data distribution • Checkpointing with high parallelism • Efficient Recovery • Running Stack Recovery October 26, 2016 19
  • 20. ABFT for the Right Factor October 26, 2016 20
  • 21. Q-Parallel Checkpointing for the Left Factor • Q-parallel checkpoint (P x Q process grid) • Checkpointing in parallel horizontally every Q iterations • Scalable and no need for extra storage "Algorithm-based Fault Tolerance for Dense Matrix Factorizations", Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra, 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP' 12) ABFT Checksum Diskless Checkpoint
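The core of the Q-parallel checkpoint, summing a group of Q just-factored panels into one checksum block so any single lost panel can be rebuilt, can be sketched serially (a toy model; in the dissertation the sums run horizontally across the process grid in parallel, and the function names here are illustrative):

```python
import numpy as np

def q_panel_checkpoint(panels):
    """Sum Q just-factored panels (all the same shape) into one checksum
    panel. Losing any one panel lets us rebuild it from the checksum
    minus the surviving panels."""
    ck = panels[0].copy()
    for p in panels[1:]:
        ck += p
    return ck

def recover_panel(ck, surviving):
    """Rebuild the lost panel from the checksum and the survivors."""
    lost = ck.copy()
    for p in surviving:
        lost -= p
    return lost
```

This is the additive analogue of the figure: the checksum column plays the role of the diskless checkpoint on the right of the matrix.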
  • 23. Overhead • M (rows) x N (columns) input matrix • P (rows) x Q (columns) process grid • Storage for checksum: MN/Q • Ratio over the matrix: 1/Q • Computation overhead: O(N^2)
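A quick sanity check of the storage counts on this slide (illustrative numbers only):

```python
def checkpoint_overhead(M, N, Q):
    """Checksum storage under Q-parallel checkpointing (MN/Q entries)
    and its ratio over the M x N matrix itself (1/Q)."""
    storage = M * N / Q
    ratio = storage / (M * N)
    return storage, ratio
```

Growing the process grid's Q dimension shrinks the relative storage cost, which is why the scheme scales.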
  • 24. Experiment Platforms • “Dancer” @ UT • 16-node • Each node has two 2.27GHz quad-core Intel E5520 CPUs • a 20GB/s Infiniband interconnect. • Solid State Drive disks. October 26, 2016 24 • “Kraken” @ ORNL • Cray XT5 machine • 9,408 compute nodes. • Each node has two Istanbul 2.6 GHz six-core AMD Opteron processors, 16 GB of memory • connected through the SeaStar2+ interconnect • The scalable cluster file system “Lustre”
  • 28. Recovery of the Running Stack • No official MPI support to determine the failed process’s identity • Recovery of the running stack on the failed process • Matrix data • Control variables (e.g. loop counts) October 26, 2016 28 P0 P1 P2 Out of Synchronization
  • 29. Checkpoint-on-Failure (CoF) October 26, 2016 29 "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
  • 30. FT-QR with CoF October 26, 2016 30 P0 P1 P2 Surviving Processes Checkpointing to disk Restart program, dry run to failed point Surviving Processes load checkpoint from disk Parallel-Q and ABFT recovery Recovery done; Execution Resumed
  • 31. Experiment Result (Dancer) October 26, 2016 31 (8 processes/node ×16 nodes)
  • 32. Experiment Result (Kraken) October 26, 2016 32 (24 ×24 processes)
  • 33. Agenda October 26, 2016 33 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 34. Soft Error • Key problems • Silent error • Error propagation • Detection • Off-line recovery October 26, 2016 34
  • 35. Multiple Soft Errors October 26, 2016 35 • Propagation in the right factor Original errors Errors due to propagation
  • 36. General work flow (LU solver) (1) Generate checksum for the input matrix as additional columns (2) Perform LU factorization WITH the additional checksum columns (3) Solve Ax=b using LU from the factorization (even if soft error occurs during LU factorization) (4) Check for soft error (5) Correct solution x
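The five steps above can be exercised serially with a small NumPy/SciPy sketch. The generator vectors (e of ones and w of distinct weights) are an illustrative choice, and this is a toy model, not the dissertation's distributed implementation:

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# (1) generate checksum columns from generator vectors e and w
e = np.ones(n)
w = np.arange(1.0, n + 1)              # distinct weights, illustrative
Ac = np.column_stack([A, A @ e, A @ w])

# (2) factor the widened matrix: [A, Ae, Aw] = P L [U, c, v]  (SciPy's P)
P, L, Uc = sla.lu(Ac)
U, c, v = Uc[:, :n], Uc[:, n], Uc[:, n + 1]

# (3) solve Ax = b with the produced factors
y = sla.solve_triangular(L, P.T @ b, lower=True)
x = sla.solve_triangular(U, y)

# (4) check for soft error: with no error, both checksum residuals vanish
r = c - U @ e
s = v - U @ w
error_free = np.allclose(r, 0.0) and np.allclose(s, 0.0)
```

Step (5), correcting x when a residual is nonzero, is what the Sherman-Morrison slides later in the deck derive.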
  • 37. Soft Error October 26, 2016 37 Error modeling Encoding for checksum Left Factor Protection
  • 38. Soft Error October 26, 2016 38 Error modeling Encoding for checksum Left Factor Protection
  • 39. How to detect & recover soft errors in L? • The recovery of Ax=b requires a correct L • L does not change once produced • Diskless checkpointing for L • Delay pivoting on L to prevent checksum of L from being invalidated L U
  • 40. • PDGEMM based checkpointing • Checkpointing time increases when scaled to more processes and larger matrices Checkpointing for L, idea 1 NOT SCALABLE
  • 41. Checkpointing for L, idea 2 • Local Checkpointing • Each process checkpoints their local involved data • Constant checkpointing time SCALABLE
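The contrast between the two ideas can be captured in a rough cost model (these formulas are a hand-waving illustration, not measured data; the log term stands in for reduction communication steps):

```python
import math

def global_checkpoint_cost(P, local_rows, local_cols):
    """Idea 1 (PDGEMM-style): a reduction across the process column
    touches P local contributions plus ~log2(P) communication steps,
    so the cost grows with the grid."""
    return P * local_rows * local_cols + math.ceil(math.log2(P))

def local_checkpoint_cost(local_rows, local_cols):
    """Idea 2 (local checkpointing): each process copies only its own
    blocks, a cost that is constant in the grid size."""
    return local_rows * local_cols
```

Scaling P up inflates idea 1's cost while idea 2 stays flat, which is the "SCALABLE" claim on the slide.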
  • 42. Soft Error October 26, 2016 42 Error modeling Encoding for checksum Left Factor Protection
  • 43. Error modeling (1 error) • When? • Answer: Doesn’t really matter October 26, 2016 43 L U A A Based on works by Luk et al. in the 1980s for systolic arrays
  • 44. Locate Error P[A~, Ae, Aw] = L~[U~, c, v], with A~ = A + d e_j^T ⇒ { P A~ = L~ U~, PAe = L~ c, PAw = L~ v }. Generator matrix G = [e, w]^T with e = (1, 1, …, 1)^T and w = (w_1, w_2, …, w_n)^T. Initial error in column j.
  • 45. Error modeling (2 errors) October 26, 2016 45 L U B B A A j1 j2
  • 46. Soft Error October 26, 2016 46 Error modeling Encoding for checksum Left Factor Protection
  • 47. Floating Point Encoding October 26, 2016 47 Error-free checksums of l = (l_1, …, l_n): l_1 + l_2 + … + l_n = c_1; w_1 l_1 + … + w_n l_n = c_2; u_1 l_1 + … + u_n l_n = c_3. With two errors, at positions i and j, the same sums over the corrupted l~ give c~_1, c~_2, c~_3. Let u_i = w_i^2; then (c_3 − c~_3) − (w_i + w_j)(c_2 − c~_2) + w_i w_j (c_1 − c~_1) = 0, and it takes O(N^2) to find w_i and w_j. Check equation for the left factor.
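The check equation with the choice u_i = w_i^2 can be verified numerically. A hedged sketch (weights, error positions, and magnitudes below are arbitrary test values): injecting two errors makes the combination vanish only for the correct weight pair.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
l = rng.standard_normal(n)             # e.g. a column of the left factor
w = np.arange(1.0, n + 1)              # distinct weights (illustrative)
u = w ** 2                             # the slide's choice u_i = w_i^2

def checksums(vec):
    return vec.sum(), w @ vec, u @ vec

c1, c2, c3 = checksums(l)

# inject two soft errors, at positions i and j
i, j = 2, 5
lt = l.copy()
lt[i] += 0.7
lt[j] -= 1.3
t1, t2, t3 = checksums(lt)

# the check equation vanishes exactly for the pair (w_i, w_j)
residual = (c3 - t3) - (w[i] + w[j]) * (c2 - t2) + w[i] * w[j] * (c1 - t1)
# ...and stays nonzero for a wrong pair, e.g. (w_1, w_j)
residual_wrong = (c3 - t3) - (w[1] + w[j]) * (c2 - t2) + w[1] * w[j] * (c1 - t1)
```

Scanning all pairs for a vanishing residual is the O(N^2) search for w_i and w_j mentioned on the slide.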
  • 48. Multiple Soft Errors October 26, 2016 48 Error modeling Encoding for checksum Locate and recover solution from multiple errors in the right factor But, what about performance?
  • 49. October 26, 2016 49 Check equation for the right factor: (s^ − U^ w) − (w_{j1} + w_{j2})(v^ − U^ w) + w_{j1} w_{j2} (c − U^ e) = 0. Vector form: O(N^3) for two errors, but so is LU!
  • 51. Complexity Reduction October 26, 2016 51 16×16 cores, 2 errors
  • 52. Recovery • Solver • Sherman-Morrison formula to recover the solution of Ax=b • Factorization • Through reducing a spiked matrix to recover the left & right factors October 26, 2016 52 Based on works by Luk et al. in the 1980s for systolic arrays
  • 54. Experiment Results October 26, 2016 54 CPU: 2 6-core Xeon 5660 GPU: NVIDIA M2070
  • 55. Agenda October 26, 2016 55 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 56. Contributions • Hard Error • A performance-efficient method to protect the left factor (e.g. L in LU factorization) • Recovery of the running stack in QR factorization after a hard error using on-demand checkpointing • Soft Error • Scalable local (diskless) checkpointing to protect the left factor from soft errors • Floating-point-number weighted checksum encoding • Multiple-soft-error detection and recovery algorithm for the right factor and trailing matrix using the weighted checksum encoding • LU-based linear system solver • Factorization (demonstrated by QR) • Complexity reduction algorithm October 26, 2016 56
  • 57. Publication • Chapter 3 • "Algorithm-based Fault Tolerance for Dense Matrix Factorizations". Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP' 12). February 2012, New Orleans, LA. • "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece. • Chapter 4 • "High Performance Dense Linear System Solver with Soft Error Resilience". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. IEEE Cluster 2011. Austin, TX. • "High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors". Peng Du, Piotr Luszczek and Jack Dongarra. The International Conference on Computational Science (ICCS) 2012. Omaha, NE. • Chapter 5 • "Soft Error Resilient QR Factorization for Hybrid System with GPGPU". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. The second workshop on Scalable algorithms for large-scale systems (Scala) 2011. Seattle, Washington. October 26, 2016 57
  • 60. Failures in PBLAS October 26, 2016 60 LARFT+LARFB A2Q
  • 61. Data Layout • 2D block cyclic distribution October 26, 2016 61 0 1 2 0 1 2 0 1 0 1 0 1 0 1 0 1 2 × 3 process grid Process (0,1) dies
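The 2D block-cyclic mapping on this slide reduces to taking block indices modulo the grid shape; a one-line sketch (zero-based indices, illustrative helper name):

```python
def block_owner(i, j, P, Q):
    """Process coordinates (p, q) owning block (i, j) of a matrix laid
    out block-cyclically on a P x Q process grid."""
    return (i % P, j % Q)

# e.g. on the slide's 2 x 3 grid, block row i goes to process row i mod 2
# and block column j goes to process column j mod 3.
```

When process (0,1) dies, every block (i, j) with i % 2 == 0 and j % 3 == 1 is lost, which is exactly the scattered pattern drawn on the slide.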
  • 62. Recover QR factorization on Hybrid Platform with GPGPU A~ = A + d e_j^T; given Q R = A, find Q~ R~ = A~.
  • 63. QR update October 26, 2016 63 Given A = Q R, find Q~ R~ = A~ = A + u v^T. A~ − A = (a_{•j} − Q R_{•j}) e_j^T = u v^T. A~ = Q R + u v^T = Q(R + Q^T u v^T) = Q(R + w v^T), with w = Q^T u = Q^T a_{•j} − R_{•j}. R + w v^T is restored to triangular form by an orthogonal transformation (fast Givens rotations on the GPU).
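The folding of the rank-one change into the triangular factor can be checked with NumPy (a serial sketch with made-up sizes; the retriangularization by Givens rotations, which the dissertation runs on the GPU, is deliberately not shown):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
Q, R = np.linalg.qr(A)

# a rank-one change confined to column j, as on the slide: At = A + u v^T
j = 3
u = rng.standard_normal(n)
v = np.zeros(n)
v[j] = 1.0
At = A + np.outer(u, v)

# fold the update into the triangular factor: At = Q (R + w v^T), w = Q^T u
w = Q.T @ u
middle = R + np.outer(w, v)   # upper triangular plus one column spike
recon = Q @ middle
```

Since Q is square and orthogonal here, Q (R + w v^T) reproduces the perturbed matrix exactly; the remaining work is rotating the spiked column back into the triangle.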
  • 64. Encoding for Q October 26, 2016 64 CPU GPU DGEQRF DLARFT Sending panel to GPU Look-ahead DLARFB Trailing DLARFB Matrix size: 17480 cublasSetMatrix() cudaMemcpy2DAsync()
  • 65. Error modeling (for “where”) Input matrix: A_0 = A. One step of LU: A_t = L_{t-1} P_{t-1} A_{t-1}. If no soft error occurs: U = (L_n P_n) … (L_1 P_1)(L_0 P_0) A_0. If a soft error occurs at step t: A~_t = L_{t-1} P_{t-1} A~_{t-1} − λ e_i e_j^T = (L_{t-1} P_{t-1} … L_0 P_0) A_0 − λ e_i e_j^T. Define an initial erroneous matrix A~ ≅ (L_{t-1} P_{t-1} … L_0 P_0)^{-1} A~_t = A − (L_{t-1} P_{t-1} … L_0 P_0)^{-1} λ e_i e_j^T = A − d e_j^T.
  • 66. Locate Error P[A~, Ae, Aw] = L~[U~, c, v], with A~ = A + d e_j^T ⇒ { P A~ = L~ U~, PAe = L~ c, PAw = L~ v }. Generator matrix G = [e, w]^T with e = (1, 1, …, 1)^T and w = (w_1, w_2, …, w_n)^T. Column j.
  • 67. LU based linear solver Ax = b, A = LU, x = U^{-1}(L^{-1} b)
  • 68. General work flow (1) Generate checksum for the input matrix as additional columns (2) Perform LU factorization WITH the additional checksum columns (3) Solve Ax=b using LU from the factorization (even if soft error occurs during LU factorization) (4) Check for soft error (5) Correct solution x
  • 69. Why are soft errors hard to handle? • Soft errors occur silently • Propagation
  • 70. Recover Ax=b • Sherman-Morrison formula
  • 71. Recover Ax=b Given: P A~ = L~ U~ and A~ x~ = b. To solve: Ax = b.
  • 72. Recover Ax=b Ax = b ⇒ x = A^{-1} b ⇒ x = A^{-1}(P^{-1} P) b = (PA)^{-1} P b. (PA)^{-1} = ?
  • 73. Recover Ax=b Recall: A − A~ = d e_j^T, so PA − P A~ = (P a_{•j} − L~ U~_{•j}) e_j^T. Therefore PA = L~ U~ + L~ (L~^{-1} P a_{•j} − U~_{•j}) e_j^T = L~ (U~ + t e_j^T) = L~ U~ (I + U~^{-1} t e_j^T) = L~ U~ (I + v e_j^T), with t = L~^{-1} P a_{•j} − U~_{•j} and v = U~^{-1} t.
  • 74. Recover Ax=b (PA)^{-1} = (L~ U~ (I + v e_j^T))^{-1} = (I + v e_j^T)^{-1} (L~ U~)^{-1} = (I − (1/(1 + v_j)) v e_j^T)(L~ U~)^{-1} (Sherman-Morrison)
  • 75. Recover Ax=b The solution of Ax = b is x = (I − (1/(1 + v_j)) v e_j^T) x~.
  • 76. Recover Ax=b (1) Solve L~ U~ x~ = P b. (2) Compute t = L~^{-1} P a_{•j} − U~_{•j}, v = U~^{-1} t, and x = (I − (1/(1 + v_j)) v e_j^T) x~.
  • 77. Recover Ax=b (1) Solve L~ U~ x~ = P b. (2) Compute t = L~^{-1} P a_{•j} − U~_{•j}, v = U~^{-1} t, and x = (I − (1/(1 + v_j)) v e_j^T) x~. Needs protection.
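The two-step recovery can be exercised end to end in a serial sketch: simulate a soft error in one column, factor only the erroneous matrix, then recover the true solution from the correct column a_{•j} via Sherman-Morrison. Sizes, the error column j, and the perturbation d are made-up test values:

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# simulate a soft error: what actually got factored is At = A - d e_j^T
j = 2
d = rng.standard_normal(n)
At = A.copy()
At[:, j] -= d

# factor only the *erroneous* matrix  (SciPy convention: At = P L U)
P, L, U = sla.lu(At)

# (1) solve L U xt = P^T b
xt = sla.solve_triangular(U, sla.solve_triangular(L, P.T @ b, lower=True))

# (2) Sherman-Morrison correction, using the true column a_{.,j}
t = sla.solve_triangular(L, P.T @ A[:, j], lower=True) - U[:, j]
v = sla.solve_triangular(U, t)
x = xt - (v * xt[j]) / (1.0 + v[j])
```

No refactorization of the corrected matrix is needed: two extra triangular solves and a rank-one correction recover x.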
  • 78. Locate Error PAe = L~ c ⇒ c = L~^{-1} P A e = L~^{-1} P (A~ + d e_j^T) e = L~^{-1} (P A~ + P d e_j^T) e = L~^{-1} (L~ U~ + P d e_j^T) e = U~ e + L~^{-1} P d ⇒ c − U~ e = L~^{-1} P d = r.
  • 79. Locate Error PAw = L~ v ⇒ v = L~^{-1} P A w = L~^{-1} P (A~ + d e_j^T) w = L~^{-1} (P A~ + P d e_j^T) w = U~ w + L~^{-1} P d w_j ⇒ v − U~ w = L~^{-1} P d w_j = s.
  • 80. Locate Error c − U~ e = L~^{-1} P d = r and v − U~ w = w_j L~^{-1} P d = s ⇒ s = w_j × r ⇒ w_j (1, 1, …, 1)^T = s ./ r • w_j is the j-th element of vector w in the generator matrix • Component-wise division of s and r reveals w_j • Searching for w_j in w reveals the initial soft error’s column
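The s ./ r location test can be demonstrated numerically. A hedged serial sketch (sizes, weights, and the injected error are arbitrary; the checksum columns encode the correct matrix before the error strikes):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
e = np.ones(n)
w = np.arange(1.0, n + 1)     # distinct weights so the column search is unique

# checksum columns encode the *correct* A
Ac = np.column_stack([A, A @ e, A @ w])

# inject a single soft error into column j before factorization
j = 4
d = rng.standard_normal(n)
Ac[:, j] += d                 # the factored matrix is A~ = A + d e_j^T

P, L, Uc = sla.lu(Ac)
U, c, v = Uc[:, :n], Uc[:, n], Uc[:, n + 1]

# residuals of the two check equations (both proportional to L^{-1} P d)
r = c - U @ e
s = v - U @ w

# component-wise division recovers w_j; searching w finds the error column
ratios = s / r
wj = np.median(ratios)        # robust pick, since every entry equals w_j
located = int(np.argmin(np.abs(w - wj)))
```

The median guards against the occasional near-zero entry of r amplifying roundoff in a single ratio.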
  • 81. Extra Storage • For an input matrix of size M x N on a P x Q grid • A copy of the original matrix • Not necessary when it’s easy to re-generate the required column of the original matrix • 2 additional columns: 2 x M entries • Each process holds 2 checksum rows of N/Q entries each: 2 x P x N in total • Ratio: extra storage / matrix storage = (2M + 2PN) / (MN) = 2/N + 2P/M → 0 as M → ∞
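The storage ratio above can be evaluated directly; a one-line helper under the slide's counts (2M entries for the columns, 2PN for the replicated rows, both as read off the bullet list):

```python
def extra_storage_ratio(M, N, P):
    """Extra checksum storage (2 columns of M entries plus 2 rows
    replicated across the P process rows, 2*P*N entries) relative to
    the M x N matrix itself; tends to 0 as the matrix grows."""
    return (2 * M + 2 * P * N) / (M * N)
```

For a 1000 x 1000 matrix on a grid with P = 4 the overhead is already only 1 percent.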

Editor's Notes

  1. Good morning. The topic of my defense is xxxx.
  2. Here is the agenda. We’ll start with motivation, then the contributions of this dissertation will be discussed. Then we’ll go into the two main parts of the work, hard and soft errors, and we’ll conclude the talk with the related publications.
  3. In high performance computing, fault tolerance is not a new topic, but it is a very critical one: computation is meaningful only if it can be finished successfully. One of the most important reasons is scaling, both in system size and in chip density. Here is the newest TOP500 list of supercomputers in the world; notice the number of cores on the list. The smallest system uses more than 180,000 cores and the fastest one uses more than one and a half million cores.
  4. And if we take into consideration the mean time between failures, overall system dependability becomes a huge concern. According to statistics from several national labs, and depending on the kind of system being used, say commodity components such as AMD and Intel CPUs, the whole-system MTBF could be as low as a day, or 24 hours; this is actually the case with Kraken, which uses AMD CPUs. This becomes an issue if the application run time is longer than 24 hours, which means the application might not even finish before being interrupted by a failure.
  5. In order to deal with this situation,
  6. We conducted this work to provide fault tolerance for dense linear algebra computations on large-scale systems. Our focus includes both hard and soft errors, which we’ll discuss in detail in just a minute, and we focus on both the error-correcting capability and the performance impact on large-scale systems. Also, as more and more systems are using GPUs as accelerators, we have extended our work to such hybrid systems.
  7. And here is the original work.
  8. For both hard and soft errors, we provide efficient and scalable algorithms. For hard errors, we also give a method to recover the running stack without heavy support from MPI. Soft errors are more challenging because of error propagation. The problem is tackled from three perspectives: the encoding, the error detection and location, and complexity management for large-scale computation.
  9. Fault tolerance techniques have been developed for many years. At the hardware level, for example in the memory system and compute logic, the most common methods are parity- and coding-based ECC. For performance overhead reasons, most systems only adopt simpler ECC such as single-bit error correction, double-bit error detection. Nowadays the most commonly used fault tolerance mechanism is still disk-based checkpoint and restart: at a certain interval, program state is written to stable disk storage, and at the time of failure, the program is restored by loading state from disk. Such disk checkpointing is easy to implement but has large overhead due to the slow I/O during checkpointing; methods such as these have been proposed to reduce the overhead.
  10. To further reduce overhead, diskless checkpointing was proposed to checkpoint data in memory rather than on disk. For a certain type of application, for example one-sided dense linear algebra operations, algorithm-based fault tolerance, or ABFT, provides an even cheaper option for fault tolerance. With ABFT, the checksum is generated only once and is updated along with the host algorithm. Specifically for soft errors, other methods such as backward error assertion have also been proposed, where iterative refinement is used to correct soft errors. In our work, the proposed methods use a combination of disk, diskless, and ABFT techniques to fight both hard and soft errors.
  12. Among the types of failures, the two most common are the hard error and the soft error. A hard error, or “fail-stop failure”, interrupts the running application immediately when it happens. Once the failure is fixed, meaning the bad device has been replaced and the data has been restored, the execution can resume and run to the end with the correct answer.
  13. Soft errors, on the other hand, are mostly caused by radiation such as the aforementioned alpha particles. The radiation affects the memory device by flipping bits silently. When this happens to an application, a silent error occurs but is visible to neither the application nor the system, and therefore the application is not interrupted; it runs to the end, and the computing result is incorrect with no clear reason.
  14. We’ll use LU factorization as the example for the rest of the talk, but the same method can be applied to other factorizations such as QR, and we already have results for QR as well. LU factorization produces L and U, which can be used to solve a linear system Ax=b in this way. This is how it looks in Matlab.
  15. For better performance, block LU is often used. It starts with a panel factorization, followed by the triangular solve and the trailing matrix update using matrix multiplication.
  16. From both LU and QR, we can see that roughly two sections exist, and in this work they are referred to as the left factor and the right factor. In the left factor, data does not change after being computed. In the right factor, however, data keeps being updated, which can cause error propagation if an error occurs in this area.
  17. Now that we have seen the background of dense linear algebra computation, let’s move on to the hard error fault tolerance.
  18. For hard errors, the key problem is how to protect the left factor, because the right factor can be protected easily by ABFT. This has been shown by others’ work on Cholesky factorization and HPL.
  19. Here is the algorithm that we propose for this case, which is one hard error, for both LU and QR. The details are in the paper cited below. This figure shows the checkpointing example of the same 8 by 8 matrix on a 2 by 3 process grid. Every Q panel factorizations, in this case 3, the panels that were just factored are checkpointed horizontally and put on the right of the matrix in reverse order. For example, the checksum for the first 3 panels goes here, and the checksum for the second 3 panels goes here, and so on. The solid color means the ABFT checksum, which updates itself automatically. This algorithm does not use extra storage for the left-factor checkpoint, and the performance overhead of both the checkpointing and the recovery is small.
  20. Here is a running example, shown as heat maps. We did two runs, one with an error and one without, and the brightness corresponds to the difference between these two runs: the lighter it is, the bigger the difference. Matrix size 800 x 800, block size 100, on a 2 by 3 grid. We have the matrix here and the checksum is here. First the checksums are recovered, then the data in the U and A’ sections. Finally, since the error occurred during the Q-panel iteration, all three of these Q panels are rolled back and re-calculated, which concludes the recovery. After this, the computation resumes as if no failure had occurred.
  21. Here is the overhead for this algorithm. Both storage overhead and the computational overhead.
  23. Soft errors are more challenging because they happen in silence, and because of that, even one initial error could easily propagate into many errors over a large area. Also, detection and recovery have to be done after everything is finished, which leaves more time for the propagation damage.
  24. The floating point encoding works straightforwardly for the left factor, but extending it to the right factor is a major challenge, because now we have the propagation issue at hand, and the computing complexity could be higher than that of the factorization, depending on how many errors we want to tolerate. We have developed solutions to both issues, which I’m not going into in detail here.
  25. Before we dive into the details, here is the general work flow of the soft error resilient algorithm. Again we use LU as the example, and it can be extended to QR.
  26. Here are the three main parts of the soft error fault tolerance mechanism. Namely the xxx. And we’ll discuss them one by one.
  27. Let’s start with the left factor.
  28. We need to protect the left factor not only because it’s part of the factorization result, but also because it is used in the recovery of other parts of the computation result. For example, the recovery of the solution to Ax=b requires a correct L. The good news is that, similar to the hard error case, L can also be protected by some form of diskless checkpointing.
  29. The first idea that comes to mind is a vertical checkpointing based on parallel matrix-matrix multiplication. Similar to the hard error case, this works, but the performance overhead is too large. In fact, as we look closely at the specific operation, we notice that the vertical global sum is not really necessary, and hence comes the idea that is scalable.
  30. In this so-called “local checkpointing” method, each process in the 2D block-cyclic grid checkpoints its own local data. For example…
  31. The second part of soft error fault tolerance is the error modeling. Error modeling helps locate the error when combined with the encoding part that we’ll see in a bit.
  33. Second let’s talk about the floating point number encoding.
  35. Let’s start with 2D block cyclic. Here is an example of a matrix divided into 8x8 blocks, and distributed on a 2 by 3 process grid.
  42. Requires a copy of A at the beginning.
  43. w_j on the left of s reminds us what w is; look for a match of w_j in w.