Hard and Soft Error Resilience for One-
sided Dense Linear Algebra Algorithms
A Dissertation Defense
in Support of the
Doctor of Philosophy Degree
Peng Du
Advisor: Prof. Jack Dongarra
October 26, 2016
Agenda
• Motivation
• Dissertation Statement
• Original Work
• Background
• Hard Errors
• Soft Errors
• Contributions
• Publications
Motivation
• HPC systems are getting larger
• Chips are getting denser and denser
TOP500 List - June 2012
Motivation
• Proprietary components: IBM BG/P
• Full-system MTBF of 1 week (1000+ year MTBF per node)
• Commodity components: x86 (Intel + AMD)
• Full-system MTBF of 1 day
• Energy budgets limit the use of error detection/redundancy
• Full-system MTBF of 1 hour
Resilience Exascale Workshop slides, Franck Cappello
(MTBF: Mean Time Between Failures)
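The scaling behind these numbers is simple division: assuming independent node failures, the full-system MTBF is roughly the per-node MTBF divided by the node count. The node counts below are illustrative, not measurements:

```python
# Rough model: with independent node failures, the full-system MTBF
# shrinks linearly with the number of nodes.
HOURS_PER_YEAR = 8766
mtbf_node_years = 1000            # the per-node figure quoted above

for n_nodes in (10_000, 100_000, 1_000_000):
    mtbf_sys_hours = mtbf_node_years * HOURS_PER_YEAR / n_nodes
    print(f"{n_nodes:>9} nodes -> system MTBF ~ {mtbf_sys_hours:8.1f} hours")
```

Even with 1000-year nodes, a million-node machine would fail about every 9 hours under this model.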
Dissertation Statement
• The goal of this dissertation is to demonstrate that one-sided dense linear algebra factorizations and solvers can be made tolerant to both hard errors (fail-stop failures) and soft errors. The following problems are studied:
• Full matrix protection (the left and right factors)
• MPI support for runtime system recovery
• Detection of multiple soft errors in time and space, and recovery of both factorization and solver
• Performance management on large-scale systems
• Soft errors on hybrid platforms with GPGPUs
Original Work
• Hard errors
• A performance-efficient method to protect the left factor (e.g., L in LU factorization)
• Recovery of the running stack in QR factorization after a hard error, using on-demand checkpointing
• Soft errors
• Scalable local (diskless) checkpointing to protect the left factor from soft errors
• Floating-point weighted checksum encoding
• A detection and recovery algorithm for multiple soft errors in the right factor and trailing matrix, using the weighted checksum encoding
• LU-based linear system solver
• Factorization (demonstrated with QR)
• A complexity reduction algorithm
Related Work
• Hardware protection
• Memory, cache
• Single-bit-error correction and double-bit-error detection (SEC/DED)
• Compute logic
• Logic circuits with verification functionality
• Space or execution redundancy
• Disk checkpointing/restart
• Coordinated and uncoordinated checkpointing
• Incremental checkpointing, forked (copy-on-write) checkpointing, etc.
Related Work
• Diskless checkpointing
• Parity-based checksum (XOR of bits)
• Neighbor- and parity-based diskless checkpointing
• Algorithm-Based Fault Tolerance (ABFT)
• The checksum is generated only once, before the computation
• The checksum is updated by the host algorithm
• Check and fix are performed only after the computation
• Backward error assertions
• Iterative refinement to correct small errors
• The application is notified of uncorrectable errors
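The ABFT idea can be seen on a toy example (pure Python, hypothetical 3×3 data): append a row-sum checksum column once before the computation, run plain Gaussian elimination on the widened matrix, and the row operations of the host algorithm keep every row's checksum consistent on their own:

```python
A = [[4.0, 2.0, 1.0],
     [2.0, 5.0, 2.0],
     [1.0, 2.0, 6.0]]
W = [row + [sum(row)] for row in A]      # checksum generated once, up front

n = len(A)
for k in range(n):                       # Gaussian elimination on [A | c]
    for i in range(k + 1, n):
        m = W[i][k] / W[k][k]
        for j in range(n + 1):           # the checksum column is updated
            W[i][j] -= m * W[k][j]       # by the host algorithm itself
        # invariant: the last entry still equals the sum of the row
        assert abs(W[i][n] - sum(W[i][:n])) < 1e-12
```

Because each row operation is linear, the checksum relation survives every elimination step; a final check (and fix) is all that is needed after the computation.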
Failure Types
• Hard Error ("fail-stop failure")
(Figure: processes P0, P1, P2 over time; a failed process stops and is lost)
Failure Types
• Soft Error ("transient error", "silent data corruption", etc.)
• Radiation induced:
• Alpha particles
• High-energy neutrons
• Thermal neutrons
(Figure: a single bit flip, 00110101 → 10110101; processes P0, P1, P2 keep running, but some now carry corrupted data)
Factorization
• Dense matrix factorizations: LU, Cholesky, QR
• Solving Ax = b:  A = LU,  x = U \ (L \ b)
• Block LU factorization: GETF2 on the panel, TRSM, then GEMM on the trailing matrix, repeated panel by panel
• Hybrid & blocked QR: DGEQRF & DLARFT on the CPU, DLARFB on the GPU, repeated panel by panel, producing Q and R
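The two triangular solves can be sketched in pure Python on a tiny hypothetical system; production codes of course use pivoted, blocked, parallel kernels (GETF2/TRSM/GEMM) instead:

```python
def lu(A):
    """Unblocked LU without pivoting (fine for this diagonally dominant toy)."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]        # elimination multiplier
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                       # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

def backward(U, y):                      # solve U x = y
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [7.0, 9.0, 9.0]                      # chosen so that x = (1, 1, 1)
L, U = lu(A)
x = backward(U, forward(L, b))           # x = U \ (L \ b)
```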
Zones of the Matrix
(Figure: the matrix split into the left factor and the right factor)
Hard Error
• Key problem: protection of the left factor
• 2D block-cyclic data distribution
• Checkpointing with high parallelism
• Efficient recovery
• Running-stack recovery
ABFT for the Right Factor
Q-Parallel Checkpointing for the Left Factor
• Q-parallel checkpoint (P × Q process grid)
• Checkpointing runs in parallel horizontally every Q iterations
• Scalable, with no need for extra storage
"Algorithm-based Fault Tolerance for Dense Matrix Factorizations", Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra, 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12)
ABFT Checksum
Diskless Checkpoint
Recovery example
Overhead
• M (rows) × N (columns) input matrix, P (rows) × Q (columns) process grid
• Storage for checksum: MN/Q entries
• Ratio over the matrix: 1/Q
• Computation overhead: O(N²)
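Plugging illustrative sizes into these formulas shows the relative storage cost shrinking as the process grid widens:

```python
M = N = 100_000                      # hypothetical matrix size
for Q in (8, 16, 32):
    checksum_entries = M * N // Q    # storage for the checksum: MN/Q
    ratio = checksum_entries / (M * N)
    gb = checksum_entries * 8 / 1e9  # double precision, 8 bytes per entry
    print(f"Q={Q:2}: checksum {gb:6.1f} GB, ratio {ratio:.4f} (= 1/Q)")
```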
Experiment Platforms
• "Dancer" @ UT
• 16 nodes; each node has two 2.27 GHz quad-core Intel E5520 CPUs
• A 20 GB/s InfiniBand interconnect
• Solid-state drives
• "Kraken" @ ORNL
• Cray XT5 machine with 9,408 compute nodes
• Each node has two 2.6 GHz six-core AMD Opteron (Istanbul) processors and 16 GB of memory
• Nodes connected through the SeaStar2+ interconnect
• The scalable cluster file system Lustre
Experiment Results (Kraken)
(Three slides of result figures)
Recovery of the Running Stack
• There is no official MPI support for determining a failed process's identity
• In fact…
• The running stack on the failed process must be recovered:
• Matrix data
• Control variables (e.g., loop counts)
(Figure: processes P0, P1, P2 out of synchronization)
Checkpoint-on-Failure (CoF)
"A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
FT-QR with CoF
(1) Surviving processes checkpoint to disk
(2) Restart the program and dry-run to the failure point
(3) Surviving processes load their checkpoints from disk
(4) Parallel-Q and ABFT recovery
(5) Recovery done; execution resumed
(Figure: processes P0, P1, P2 stepping through the protocol)
Experiment Result (Dancer)
(8 processes/node × 16 nodes)
Experiment Result (Kraken)
(24 × 24 processes)
Soft Error
• Key problems
• Silent error
• Error propagation
• Detection
• Off-line recovery
Multiple Soft Errors
• Propagation in the right factor
(Figure: the original errors and the errors due to propagation)
General work flow (LU solver)
(1) Generate a checksum for the input matrix as additional columns
(2) Perform the LU factorization WITH the additional checksum columns
(3) Solve Ax = b using the LU factors from the factorization (even if a soft error occurred during it)
(4) Check for soft errors
(5) Correct the solution x
Soft Error
Error modeling · Encoding for checksum · Left factor protection
How to detect & recover soft errors in L?
• The recovery of Ax = b requires a correct L
• L does not change once produced
• Diskless checkpointing for L
• Pivoting on L is delayed to prevent the checksum of L from being invalidated
Checkpointing for L, idea 1
• PDGEMM-based checkpointing
• Checkpointing time increases when scaling to more processes and larger matrices
NOT SCALABLE
Checkpointing for L, idea 2
• Local checkpointing
• Each process checkpoints its own involved data locally
• Constant checkpointing time
SCALABLE
Error modeling (1 error)
• When does the error strike?
• Answer: it doesn't really matter
(Figure: the zones L, U, and the trailing matrix)
Based on work by Luk et al. in the 1980s on systolic arrays
Locate Error
ˆP [ ˆA, A×e, A×w ] = ˆL [ ˆU, c, v ],   ˆA = A + d e_j^T   (initial error in column j)
⇒ ˆP ˆA = ˆL ˆU,   ˆP (A e) = ˆL c,   ˆP (A w) = ˆL v
Generator matrix: G = [ e, w ]^T, with e = (1, 1, …, 1)^T and w = (w1, w2, …, wn)^T
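This test can be sketched in pure Python (hypothetical 3×3 data, no pivoting to keep it short): we hold the factors of the corrupted matrix ˆA while the checksum columns still encode the correct A, and the component-wise ratio of the two residuals exposes the weight w_j of the erroneous column:

```python
def lu(M):
    """Unblocked LU without pivoting, for a small well-behaved example."""
    n = len(M)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in M]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                    # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
e = [1.0, 1.0, 1.0]
w = [1.0, 2.0, 3.0]                   # distinct weights: w_j identifies j

jerr, d = 1, [0.5, -1.0, 2.0]         # silent error hits column 1
Ah = [row[:] for row in A]            # Ah plays the role of ^A
for i in range(3):
    Ah[i][jerr] += d[i]

Lh, Uh = lu(Ah)                       # the factorization we actually hold
Ae = [sum(A[i][j] * e[j] for j in range(3)) for i in range(3)]
Aw = [sum(A[i][j] * w[j] for j in range(3)) for i in range(3)]
c, v = forward(Lh, Ae), forward(Lh, Aw)   # checksums still encode A

r = [c[i] - sum(Uh[i][j] * e[j] for j in range(3)) for i in range(3)]
s = [v[i] - sum(Uh[i][j] * w[j] for j in range(3)) for i in range(3)]
ratios = [s[i] / r[i] for i in range(3)]  # every component equals w[jerr]
```

Since r = ˆL⁻¹ applied to the error vector and s = w_j times the same vector, every component of s ./ r equals w_j, and searching for that value in w recovers the error column.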
Error modeling (2 errors)
(Figure: L, U, and the trailing matrix with two erroneous columns j1 and j2)
Floating Point Encoding
Error-free checksums of the left-factor entries l1, …, ln:
l1 + l2 + … + ln = c1
w1 l1 + w2 l2 + … + wn ln = c2
u1 l1 + u2 l2 + … + un ln = c3
If entries li and lj are corrupted into ˆli and ˆlj, i.e. ˆl = (l1, …, ˆli, …, ˆlj, …, ln), the recomputed checksums become ˆc1, ˆc2, ˆc3:
l1 + … + ˆli + … + ˆlj + … + ln = ˆc1
w1 l1 + … + wi ˆli + … + wj ˆlj + … + wn ln = ˆc2
u1 l1 + … + ui ˆli + … + uj ˆlj + … + un ln = ˆc3
Let ui = wi^2. Then the check equation for the left factor is
(ˆc3 − c3) − (wi + wj)(ˆc2 − c2) + wi wj (ˆc1 − c1) = 0
and finding wi and wj costs O(N^2).
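A small numeric check of this quadratic (pure Python, made-up values): with weights 1, w_i, and u_i = w_i², the expression vanishes only at the true error pair, which an O(n²) sweep over pairs finds:

```python
l = [3.0, 1.0, 4.0, 1.0, 5.0]         # protected entries (hypothetical)
w = [1.0, 2.0, 3.0, 4.0, 5.0]         # distinct weights
u = [wi * wi for wi in w]             # u_i = w_i^2

def checksums(vals):
    return (sum(vals),
            sum(wi * x for wi, x in zip(w, vals)),
            sum(ui * x for ui, x in zip(u, vals)))

c1, c2, c3 = checksums(l)             # stored before any error strikes
bad = l[:]
bad[1] += 0.7                         # two silent errors, positions 1 and 3
bad[3] -= 2.0
h1, h2, h3 = checksums(bad)           # recomputed, corrupted checksums

def check(i, j):
    return (h3 - c3) - (w[i] + w[j]) * (h2 - c2) + w[i] * w[j] * (h1 - c1)

# O(n^2) search over pairs: only the true pair satisfies the equation
hits = [(i, j) for i in range(5) for j in range(i + 1, 5)
        if abs(check(i, j)) < 1e-9]
```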
Multiple Soft Errors
• Locate and recover the solution from multiple errors in the right factor
But, what about performance?
Check equation for the right factor (a vector equation; here u = w² component-wise):
(ˆs − ˆU u) − (w_j1 + w_j2)(ˆv − ˆU w) + w_j1 w_j2 (c − ˆU e) = 0
O(N^3) for two errors, but so is LU!
Complexity Reduction
(Figures: performance of the complexity-reduced check; 16 × 16 cores, 2 errors)
Recovery
• Solver: the Sherman-Morrison formula recovers the solution of Ax = b
• Factorization: the left & right factors are recovered by reducing a spiked matrix
Based on work by Luk et al. in the 1980s on systolic arrays
Experiment Results
(Figures; CPU: two 6-core Xeon 5660, GPU: NVIDIA M2070)
Contributions
• Hard errors
• A performance-efficient method to protect the left factor (e.g., L in LU factorization)
• Recovery of the running stack in QR factorization after a hard error, using on-demand checkpointing
• Soft errors
• Scalable local (diskless) checkpointing to protect the left factor from soft errors
• Floating-point weighted checksum encoding
• A detection and recovery algorithm for multiple soft errors in the right factor and trailing matrix, using the weighted checksum encoding
• LU-based linear system solver
• Factorization (demonstrated with QR)
• A complexity reduction algorithm
Publications
• Chapter 3
• "Algorithm-based Fault Tolerance for Dense Matrix Factorizations". Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12). February 2012, New Orleans, LA.
• "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
• Chapter 4
• "High Performance Dense Linear System Solver with Soft Error Resilience". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. IEEE Cluster 2011. Austin, TX.
• "High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors". Peng Du, Piotr Luszczek and Jack Dongarra. The International Conference on Computational Science (ICCS) 2012. Omaha, NE.
• Chapter 5
• "Soft Error Resilient QR Factorization for Hybrid System with GPGPU". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. The Second Workshop on Scalable Algorithms for Large-scale Systems (ScalA) 2011. Seattle, Washington.
Backup Slides
Failures in PBLAS
(Figures: LARFT+LARFB, A2Q)
Data Layout
• 2D block-cyclic distribution
(Figure: a 2 × 3 process grid cyclically mapped over the matrix blocks; process (0,1) dies)
Recover QR factorization on Hybrid Platform with GPGPU
ˆA = A + d e_j^T
(Figure: the factors ˆQ, ˆR of the corrupted matrix; which Q and R factor the correct A?)
QR update
Given ˆQ ˆR = ˆA, find Q R = A = ˆA + u v^T:
A − ˆA = ( a_j − ˆQ ˆR_j ) e_j^T = u v^T
A = ˆQ ˆR + u v^T = ˆQ ( ˆR + ˆQ^T u v^T )
A = ˆQ ( ˆR + w v^T ),  with w = ˆQ^T u = ˆQ^T a_j − ˆR_j and v = e_j
ˆR + w v^T is returned to triangular form by an orthogonal transformation (fast Givens rotations on the GPU)
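The update identity can be checked numerically (pure Python, classical Gram-Schmidt QR on a hypothetical 3×3 example; the real implementation instead re-triangularizes ˆR + w v^T with fast Givens rotations on the GPU):

```python
def qr(M):
    """Classical Gram-Schmidt; returns Q as a list of orthonormal rows q_k."""
    n = len(M)
    cols = [[M[i][j] for i in range(n)] for j in range(n)]
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(len(Q)):
            R[k][j] = sum(Q[k][i] * cols[j][i] for i in range(n))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q.append([x / R[j][j] for x in v])
    return Q, R

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
j = 1
Ah = [row[:] for row in A]                 # ^A = A + d e_j^T
for i, di in enumerate([0.5, -1.0, 2.0]):
    Ah[i][j] += di

Qh, Rh = qr(Ah)                            # the erroneous factorization
aj = [A[i][j] for i in range(3)]           # correct column j of A
wv = [sum(Qh[k][i] * aj[i] for i in range(3)) - Rh[k][j] for k in range(3)]

Rfix = [row[:] for row in Rh]              # ^R + w e_j^T
for k in range(3):
    Rfix[k][j] += wv[k]

rebuilt = [[sum(Qh[k][i] * Rfix[k][col] for k in range(3))
            for col in range(3)] for i in range(3)]   # = ^Q (^R + w e_j^T)
err = max(abs(rebuilt[i][col] - A[i][col])
          for i in range(3) for col in range(3))
```

The rebuilt product matches A entrywise, confirming w = ˆQ^T a_j − ˆR_j repairs the corrupted column.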
Encoding for Q
(Figure: CPU/GPU execution trace for matrix size 17480: DGEQRF and DLARFT on the CPU, the panel sent to the GPU with cublasSetMatrix()/cudaMemcpy2DAsync(), look-ahead DLARFB and trailing DLARFB on the GPU)
Error modeling (for "where")
Input matrix: A_0 = A
One step of LU: A_t = L_{t-1} P_{t-1} A_{t-1}
If no soft error occurs: U = (L_n P_n) … (L_1 P_1)(L_0 P_0) A_0
If a soft error occurs at step t:
A_t = L_{t-1} P_{t-1} A_{t-1} − λ e_i e_j^T
    = L_{t-1} P_{t-1} (L_{t-2} P_{t-2} … L_0 P_0) A_0 − λ e_i e_j^T
Define an erroneous initial matrix ˆA:
ˆA ≜ (L_{t-1} P_{t-1} L_{t-2} P_{t-2} … L_0 P_0)^{-1} A_t
   = A − (L_{t-1} P_{t-1} L_{t-2} P_{t-2} … L_0 P_0)^{-1} λ e_i e_j^T
   = A − d e_j^T
Locate Error
ˆP [ ˆA, A×e, A×w ] = ˆL [ ˆU, c, v ],   ˆA = A + d e_j^T   (error in column j)
⇒ ˆP ˆA = ˆL ˆU,   ˆP (A e) = ˆL c,   ˆP (A w) = ˆL v
Generator matrix: G = [ e, w ]^T, with e = (1, 1, …, 1)^T and w = (w1, w2, …, wn)^T
LU based linear solver
A x = b,  A = LU,  x = U \ (L \ b)
General work flow:
(1) Generate a checksum for the input matrix as additional columns
(2) Perform the LU factorization WITH the additional checksum columns
(3) Solve Ax = b using the LU factors from the factorization (even if a soft error occurred during it)
(4) Check for soft errors
(5) Correct the solution x
Why is soft error hard to handle?
• Soft errors occur silently
• They propagate
Recover Ax=b
• The Sherman-Morrison formula
Given: the erroneous factorization ˆP ˆA = ˆL ˆU
To solve: A x = b
Recover Ax=b
A x = b ⇒ x = A^{-1} b = A^{-1} ( ˆP^{-1} ˆP ) b = ( ˆP A )^{-1} ˆP b
( ˆP A )^{-1} = ?
Recover Ax=b
Recall: A − ˆA = d e_j^T
Therefore:
ˆP A − ˆP ˆA = ( ˆP a_j − ˆL ˆU_j ) e_j^T
ˆP A = ˆL ˆU + ˆL ( ˆL^{-1} ˆP a_j − ˆU_j ) e_j^T = ˆL ( ˆU + t e_j^T )
     = ˆL ˆU ( I + ˆU^{-1} t e_j^T ) = ˆL ˆU ( I + v e_j^T )
where t = ˆL^{-1} ˆP a_j − ˆU_j and v = ˆU^{-1} t
(a_j and ˆU_j denote column j of A and of ˆU)
Recover Ax=b
( ˆP A )^{-1} = ( ˆL ˆU ( I + v e_j^T ) )^{-1}
             = ( I + v e_j^T )^{-1} ( ˆL ˆU )^{-1}
             = ( I − (1 / (1 + v_j)) v e_j^T ) ( ˆL ˆU )^{-1}   (Sherman-Morrison)
Recover Ax=b
A x = b ⇒ x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx, where ˆx solves ˆL ˆU ˆx = ˆP b
Recover Ax=b
(1) Solve ˆL ˆU ˆx = ˆP b
(2) Compute t = ˆL^{-1} ˆP a_j − ˆU_j,  v = ˆU^{-1} t,  and
    x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx
Recover Ax=b
(1) Solve ˆL ˆU ˆx = ˆP b
(2) Compute t = ˆL^{-1} ˆP a_j − ˆU_j,  v = ˆU^{-1} t,  and x = ( I − (1 / (1 + v_j)) v e_j^T ) ˆx
Every step applies ˆL and ˆU: the left factor needs protection.
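The two-step recovery is small enough to run end to end (pure Python, hypothetical data, no pivoting so ˆP = I): we factor the corrupted matrix, solve with the wrong factors, and repair the solution with one rank-one Sherman-Morrison correction:

```python
def lu(M):
    """Unblocked LU without pivoting, for a small well-behaved example."""
    n = len(M)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in M]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U

def forward(L, b):                     # solve L y = b
    y = []
    for i in range(len(b)):
        y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
    return y

def backward(U, y):                    # solve U x = y
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[4.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 6.0]]
b = [7.0, 9.0, 9.0]                    # the true solution is (1, 1, 1)
jerr = 1
Ah = [row[:] for row in A]             # ^A = A - d e_j^T
for i, di in enumerate([0.5, -1.0, 2.0]):
    Ah[i][jerr] -= di

Lh, Uh = lu(Ah)                        # erroneous factors of ^A
xh = backward(Uh, forward(Lh, b))      # (1) solve ^L ^U ^x = b

aj = [A[i][jerr] for i in range(3)]    # correct column j of A
t = forward(Lh, aj)                    # (2) t = ^L \ a_j - ^U_j
t = [t[i] - Uh[i][jerr] for i in range(3)]
v = backward(Uh, t)                    # v = ^U \ t
x = [xh[i] - v[i] * xh[jerr] / (1.0 + v[jerr]) for i in range(3)]
```

The corrected x matches the true solution; only one extra column of A (or a way to regenerate it) and the factors themselves are needed.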
Locate Error
ˆP A e = ˆL c
⇒ c = ˆL^{-1} ˆP A e = ˆL^{-1} ˆP ( ˆA + d e_j^T ) e
    = ˆL^{-1} ( ˆP ˆA + ˆP d e_j^T ) e
    = ˆL^{-1} ( ˆL ˆU + ˆP d e_j^T ) e
    = ˆU e + ˆL^{-1} ˆP d
⇒ c − ˆU e = ˆL^{-1} ˆP d = r
Locate Error
ˆP A w = ˆL v
⇒ v = ˆL^{-1} ˆP A w = ˆL^{-1} ˆP ( ˆA + d e_j^T ) w
    = ˆL^{-1} ( ˆP ˆA + ˆP d e_j^T ) w
    = ˆL^{-1} ( ˆL ˆU + ˆP d e_j^T ) w
    = ˆU w + ˆL^{-1} ˆP d w_j
⇒ v − ˆU w = ˆL^{-1} ˆP d w_j = s
Locate Error
c − ˆU e = ˆL^{-1} ˆP d = r
v − ˆU w = w_j ˆL^{-1} ˆP d = s
⇒ s = w_j × r ⇒ w_j (1, 1, …, 1)^T = s ./ r
• w_j is the j-th element of the vector w in the generator matrix
• The component-wise division of s by r reveals w_j
• Searching for w_j in w reveals the column of the initial soft error
Extra Storage
• For an input matrix of size M × N on a P × Q process grid:
• A copy of the original matrix
• Not necessary when the required column of the original matrix is easy to re-generate
• 2 additional checksum columns: 2 × M
• Each process holds 2 checksum rows of width N/Q; 2 × P × N in total
• Ratio:
extra storage / matrix storage = (2×M + 2×P×N) / (M×N) = 2/N + 2P/M → 0 as M, N → ∞

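Evaluating the reconstructed ratio (2M + 2PN)/(MN) = 2/N + 2P/M for growing square matrices on a fixed 16 × 16 grid shows the relative extra storage vanishing:

```python
P = 16                                 # fixed process-grid height
for M in (10_000, 100_000, 1_000_000):
    N = M                              # square matrices, illustrative sizes
    ratio = (2 * M + 2 * P * N) / (M * N)
    print(f"M = N = {M:>9}: extra/matrix = {ratio:.2e}")
```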
More Related Content

What's hot

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine LearningFabian Pedregosa
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28Aritra Sarkar
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23Aritra Sarkar
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Naren P.R.
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Grigory Yaroslavtsev
 
Quantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOQuantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOCarl Belle
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsUnai Lopez-Novoa
 
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Kento Aoyama
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmNECST Lab @ Politecnico di Milano
 
20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revisedrerngvit yanggratoke
 
Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Aritra Sarkar
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedVasia Kalavri
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Flink Forward
 
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...NAVER Engineering
 
Quantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AIQuantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AISasha Lazarevic
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinAcunu
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 

What's hot (20)

Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28QX Simulator and quantum programming - 2020-04-28
QX Simulator and quantum programming - 2020-04-28
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
 
Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13Cape2013 scilab-workshop-19Oct13
Cape2013 scilab-workshop-19Oct13
 
Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)Parallel Algorithms for Geometric Graph Problems (at Stanford)
Parallel Algorithms for Geometric Graph Problems (at Stanford)
 
Quantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OOQuantum Computing Fundamentals via OO
Quantum Computing Fundamentals via OO
 
Harnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern CoprocessorsHarnessing OpenCL in Modern Coprocessors
Harnessing OpenCL in Modern Coprocessors
 
ASQ Talk v4
ASQ Talk v4ASQ Talk v4
ASQ Talk v4
 
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
Reading: "Pi in the sky: Calculating a record-breaking 31.4 trillion digits o...
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
 
20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised20151021_DataScienceMeetup_revised
20151021_DataScienceMeetup_revised
 
Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25Genomics algorithms on digital NISQ accelerators - 2019-01-25
Genomics algorithms on digital NISQ accelerators - 2019-01-25
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
Recommender Systems with Implicit Feedback Challenges, Techniques, and Applic...
 
QUANTUM COMP 22
QUANTUM COMP 22QUANTUM COMP 22
QUANTUM COMP 22
 
Quantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AIQuantum Machine Learning for IBM AI
Quantum Machine Learning for IBM AI
 
Virtual nodes: Operational Aspirin
Virtual nodes: Operational AspirinVirtual nodes: Operational Aspirin
Virtual nodes: Operational Aspirin
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 

Similar to defense_slides_pengdu

Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Apache Apex
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Yuanyuan Tian
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...Victor Asanza
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
The Search for Gravitational Waves
The Search for Gravitational WavesThe Search for Gravitational Waves
The Search for Gravitational Wavesinside-BigData.com
 
Microprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationMicroprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationArkhom Jodtang
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Research Data Alliance
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfDuy-Hieu Bui
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesIntel® Software
 
Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Emiliano
 

Similar to defense_slides_pengdu (20)

Ph.D. Defense
Ph.D. DefensePh.D. Defense
Ph.D. Defense
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)Smart Partitioning with Apache Apex (Webinar)
Smart Partitioning with Apache Apex (Webinar)
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
 
Isorc18 keynote
Isorc18 keynoteIsorc18 keynote
Isorc18 keynote
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and ApplicationsApache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications Apache Apex: Stream Processing Architecture and Applications
Apache Apex: Stream Processing Architecture and Applications
 
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
⭐⭐⭐⭐⭐ Device Free Indoor Localization in the 28 GHz band based on machine lea...
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Lecture_2_v2_qc.pptx
Lecture_2_v2_qc.pptxLecture_2_v2_qc.pptx
Lecture_2_v2_qc.pptx
 
The Search for Gravitational Waves
The Search for Gravitational WavesThe Search for Gravitational Waves
The Search for Gravitational Waves
 
Microprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operationMicroprocessor Week 4-5 MCS-51 Arithmetic operation
Microprocessor Week 4-5 MCS-51 Arithmetic operation
 
Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...Handling data and workflows in computational materials science: the AiiDA ini...
Handling data and workflows in computational materials science: the AiiDA ini...
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdfdigitaldesign-s20-lecture3b-fpga-afterlecture.pdf
digitaldesign-s20-lecture3b-fpga-afterlecture.pdf
 
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splinesOptimize Single Particle Orbital (SPO) Evaluations Based on B-splines
Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines
 
Cadancesimulation
CadancesimulationCadancesimulation
Cadancesimulation
 
HPC Performance tools, on the road to Exascale
HPC Performance tools, on the road to ExascaleHPC Performance tools, on the road to Exascale
HPC Performance tools, on the road to Exascale
 
Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!Measuring Docker Performance: what a mess!!!
Measuring Docker Performance: what a mess!!!
 

defense_slides_pengdu

  • 1. Hard and Soft Error Resilience for One- sided Dense Linear Algebra Algorithms A Dissertation Defense in Support of the Doctor of Philosophy Degree Peng Du Advisor: Prof. Jack Dongarra October 26, 2016
  • 2. Agenda October 26, 2016 2 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 3. Motivation • HPC systems are getting larger • Chip is getting more and more dense October 26, 2016 3 TOP500 List - June 2012
  • 4. Motivation • Proprietary Components: IBM BG/P • Full system MBTF of 1 week: (1000+ year MTBF per node) October 26, 2016 4 • Commodity Components: (X86, Intel + AMD) • Full system MTBF of 1 day • Energy budget limit the use of error/detection/redundancy • Full system MTBF of 1 hour Resilience Exascale Workshop Slides, Franck Cappello (MTBF: Mean Time Between Failure)
  • 5. Agenda October 26, 2016 5 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 6. Dissertation Statement • The goal of the dissertation is to demonstrate that one-sided dense linear algebra factorizations and solvers can be made fault tolerant to both hard errors (fail-stop failures) and soft errors. The following problems are studied: • Full matrix protection (the left and right factors) • MPI support for runtime system recovery • Detection of multiple soft errors in time and space, and recovery of both the factorization and the solver • Performance management on large-scale systems • Soft errors on hybrid platforms with GPGPUs October 26, 2016 6
  • 7. Agenda October 26, 2016 7 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 8. Original Work • Hard Error • A performance-efficient method to protect the left factor (e.g. L in LU factorization) • Recovery of the running stack in QR factorization after a hard error using on-demand checkpointing • Soft Error • Scalable local (diskless) checkpointing to protect the left factor from soft errors • Floating-point-number weighted checksum encoding • Multiple-soft-error detection and recovery algorithm for the right factor and trailing matrix using the weighted checksum encoding • LU-based linear system solver • Factorization (demonstrated by QR) • Complexity reduction algorithm October 26, 2016 8
  • 9. Related Work • Hardware Protection • Memory, cache • Single-bit-error-correction and double-bit-error-detection (SEC/DED) • Compute logic • Logic circuits with verification functionalities • Space or execution redundancy • Disk Checkpointing/Restart • coordinated and uncoordinated checkpointing • incremental checkpointing, forked (copy-on-write) checkpointing, etc. October 26, 2016 9
  • 10. Related Work • Diskless Checkpointing • Parity based checksum (XOR of bits) • Neighbor- and parity-based diskless checkpointing • Algorithm Based Fault Tolerance (ABFT) • Checksum is generated only once, before the computation • Checksum is updated by the host algorithms • Check and fix are performed only after computation • Backward Error Assertions • Iterative refinement to correct small errors • Uncorrectable errors are notified to the applications October 26, 2016 10
  • 11. Agenda October 26, 2016 11 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 12. Failure Types • Hard Error (“fail-stop failure”) October 26, 2016 12 P0 P1 P2 Time ✔ ✔ ✔
  • 13. Failure Types • Soft Error (“Transient error”, “Silent Data Corruption”, etc.) • Radiation Induced • Alpha particle • High energy neutron • Thermal neutron October 26, 2016 13 0 0 1 1 0 10 1 1 0 1 1 0 10 1 P0 P1 P2 Time ✖ ✔ ✖
  • 14. Factorization • Dense matrix factorizations • LU, Cholesky, QR • Ax = b October 26, 2016 14 A = LU, x = U^{-1}(L^{-1} b)
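The factor-then-solve pattern on this slide can be illustrated with a minimal NumPy/SciPy sketch (not the dissertation's ScaLAPACK code; the matrix and right-hand side below are just examples):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_via_lu(A, b):
    # Factor PA = LU once, then apply two triangular solves:
    # x = U^{-1} (L^{-1} (P b))
    lu, piv = lu_factor(A)
    return lu_solve((lu, piv), b)

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])
x = solve_via_lu(A, b)
```

Once the factors exist, additional right-hand sides cost only the two triangular solves.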
  • 15. Block LU factorization GETF2 TRSM GEMM GETF2 TRSM
  • 16. Hybrid & Blocked QR DGEQRF & DLARFT (CPU) DLARFB (GPU) DGEQRF & DLARFT (CPU) DLARFB(GPU) … Q R
  • 17. Zones of the Matrix October 26, 2016 17 Right Factor Left Factor
  • 18. Agenda October 26, 2016 18 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 19. Hard Error • Key problem: The protection of the left factor • 2D block cyclic data distribution • Checkpointing with high parallelism • Efficient Recovery • Running Stack Recovery October 26, 2016 19
  • 20. ABFT for the Right Factor October 26, 2016 20
  • 21. Q-Parallel Checkpointing for the Left Factor • Q-parallel checkpoint (P x Q process grid) • Checkpointing in parallel horizontally every Q iterations • Scalable and no need for extra storage "Algorithm-based Fault Tolerance for Dense Matrix Factorizations", Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra, 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP' 12) ABFT Checksum Diskless Checkpoint
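The core of the Q-parallel checkpoint, summing a group of Q just-factored panels into one checksum block so any single lost panel can be rebuilt, can be sketched serially (a toy model; in the dissertation the sums run horizontally across the process grid in parallel, and the function names here are illustrative):

```python
import numpy as np

def q_panel_checkpoint(panels):
    """Sum Q just-factored panels (all the same shape) into one checksum
    panel. Losing any one panel lets us rebuild it from the checksum
    minus the surviving panels."""
    ck = panels[0].copy()
    for p in panels[1:]:
        ck += p
    return ck

def recover_panel(ck, surviving):
    """Rebuild the lost panel from the checksum and the survivors."""
    lost = ck.copy()
    for p in surviving:
        lost -= p
    return lost
```

This is the additive analogue of the figure: the checksum column plays the role of the diskless checkpoint on the right of the matrix.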
  • 23. Overhead • M (rows) x N (columns) input matrix • P (rows) x Q (columns) process grid • Storage for checksum: MN/Q • Ratio over the matrix: 1/Q • Computation overhead: O(N^2)
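A quick sanity check of the storage counts on this slide (illustrative numbers only):

```python
def checkpoint_overhead(M, N, Q):
    """Checksum storage under Q-parallel checkpointing (MN/Q entries)
    and its ratio over the M x N matrix itself (1/Q)."""
    storage = M * N / Q
    ratio = storage / (M * N)
    return storage, ratio
```

Growing the process grid's Q dimension shrinks the relative storage cost, which is why the scheme scales.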
  • 24. Experiment Platforms • “Dancer” @ UT • 16-node • Each node has two 2.27GHz quad-core Intel E5520 CPUs • a 20GB/s Infiniband interconnect. • Solid State Drive disks. October 26, 2016 24 • “Kraken” @ ORNL • Cray XT5 machine • 9,408 compute nodes. • Each node has two Istanbul 2.6 GHz six-core AMD Opteron processors, 16 GB of memory • connected through the SeaStar2+ interconnect • The scalable cluster file system “Lustre”
  • 28. Recovery of the Running Stack • No official MPI support to determine the failed process’s identity • Recovery of the running stack on the failed process • Matrix data • Control variables (e.g. loop counts) October 26, 2016 28 P0 P1 P2 Out of Synchronization
  • 29. Checkpoint-on-Failure (CoF) October 26, 2016 29 "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece.
  • 30. FT-QR with CoF October 26, 2016 30 P0 P1 P2 Surviving Processes Checkpointing to disk Restart program, dry run to failed point Surviving Processes load checkpoint from disk Parallel-Q and ABFT recovery Recovery done; Execution Resumed
  • 31. Experiment Result (Dancer) October 26, 2016 31 (8 processes/node ×16 nodes)
  • 32. Experiment Result (Kraken) October 26, 2016 32 (24 ×24 processes)
  • 33. Agenda October 26, 2016 33 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 34. Soft Error • Key problems • Silent error • Error propagation • Detection • Off-line recovery October 26, 2016 34
  • 35. Multiple Soft Errors October 26, 2016 35 • Propagation in the right factor Original errors Errors due to propagation
  • 36. General work flow (LU solver) (1) Generate checksum for the input matrix as additional columns (2) Perform LU factorization WITH the additional checksum columns (3) Solve Ax=b using LU from the factorization (even if soft error occurs during LU factorization) (4) Check for soft error (5) Correct solution x
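The five steps above can be exercised serially with a small NumPy/SciPy sketch. The generator vectors (e of ones and w of distinct weights) are an illustrative choice, and this is a toy model, not the dissertation's distributed implementation:

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# (1) generate checksum columns from generator vectors e and w
e = np.ones(n)
w = np.arange(1.0, n + 1)              # distinct weights, illustrative
Ac = np.column_stack([A, A @ e, A @ w])

# (2) factor the widened matrix: [A, Ae, Aw] = P L [U, c, v]  (SciPy's P)
P, L, Uc = sla.lu(Ac)
U, c, v = Uc[:, :n], Uc[:, n], Uc[:, n + 1]

# (3) solve Ax = b with the produced factors
y = sla.solve_triangular(L, P.T @ b, lower=True)
x = sla.solve_triangular(U, y)

# (4) check for soft error: with no error, both checksum residuals vanish
r = c - U @ e
s = v - U @ w
error_free = np.allclose(r, 0.0) and np.allclose(s, 0.0)
```

Step (5), correcting x when a residual is nonzero, is what the Sherman-Morrison slides later in the deck derive.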
  • 37. Soft Error October 26, 2016 37 Error modeling Encoding for checksum Left Factor Protection
  • 38. Soft Error October 26, 2016 38 Error modeling Encoding for checksum Left Factor Protection
  • 39. How to detect & recover soft errors in L? • The recovery of Ax=b requires a correct L • L does not change once produced • Diskless checkpointing for L • Delay pivoting on L to prevent checksum of L from being invalidated L U
  • 40. • PDGEMM based checkpointing • Checkpointing time increases when scaled to more processes and larger matrices Checkpointing for L, idea 1 NOT SCALABLE
  • 41. Checkpointing for L, idea 2 • Local Checkpointing • Each process checkpoints their local involved data • Constant checkpointing time SCALABLE
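The contrast between the two ideas can be captured in a rough cost model (these formulas are a hand-waving illustration, not measured data; the log term stands in for reduction communication steps):

```python
import math

def global_checkpoint_cost(P, local_rows, local_cols):
    """Idea 1 (PDGEMM-style): a reduction across the process column
    touches P local contributions plus ~log2(P) communication steps,
    so the cost grows with the grid."""
    return P * local_rows * local_cols + math.ceil(math.log2(P))

def local_checkpoint_cost(local_rows, local_cols):
    """Idea 2 (local checkpointing): each process copies only its own
    blocks, a cost that is constant in the grid size."""
    return local_rows * local_cols
```

Scaling P up inflates idea 1's cost while idea 2 stays flat, which is the "SCALABLE" claim on the slide.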
  • 42. Soft Error October 26, 2016 42 Error modeling Encoding for checksum Left Factor Protection
  • 43. Error modeling (1 error) • When? • Answer: Doesn’t really matter October 26, 2016 43 L U A A Based on works by Luk et al. in the 1980s for systolic arrays
  • 44. Locate Error P[A~, Ae, Aw] = L~[U~, c, v], with A~ = A + d e_j^T ⇒ { P A~ = L~ U~, PAe = L~ c, PAw = L~ v }. Generator matrix G = [e, w]^T with e = (1, 1, …, 1)^T and w = (w_1, w_2, …, w_n)^T. Initial error in column j.
  • 45. Error modeling (2 errors) October 26, 2016 45 L U B B A A j1 j2
  • 46. Soft Error October 26, 2016 46 Error modeling Encoding for checksum Left Factor Protection
  • 47. Floating Point Encoding October 26, 2016 47 Error-free checksums of l = (l_1, …, l_n): l_1 + l_2 + … + l_n = c_1; w_1 l_1 + … + w_n l_n = c_2; u_1 l_1 + … + u_n l_n = c_3. With two errors, at positions i and j, the same sums over the corrupted l~ give c~_1, c~_2, c~_3. Let u_i = w_i^2; then (c_3 − c~_3) − (w_i + w_j)(c_2 − c~_2) + w_i w_j (c_1 − c~_1) = 0, and it takes O(N^2) to find w_i and w_j. Check equation for the left factor.
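The check equation with the choice u_i = w_i^2 can be verified numerically. A hedged sketch (weights, error positions, and magnitudes below are arbitrary test values): injecting two errors makes the combination vanish only for the correct weight pair.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
l = rng.standard_normal(n)             # e.g. a column of the left factor
w = np.arange(1.0, n + 1)              # distinct weights (illustrative)
u = w ** 2                             # the slide's choice u_i = w_i^2

def checksums(vec):
    return vec.sum(), w @ vec, u @ vec

c1, c2, c3 = checksums(l)

# inject two soft errors, at positions i and j
i, j = 2, 5
lt = l.copy()
lt[i] += 0.7
lt[j] -= 1.3
t1, t2, t3 = checksums(lt)

# the check equation vanishes exactly for the pair (w_i, w_j)
residual = (c3 - t3) - (w[i] + w[j]) * (c2 - t2) + w[i] * w[j] * (c1 - t1)
# ...and stays nonzero for a wrong pair, e.g. (w_1, w_j)
residual_wrong = (c3 - t3) - (w[1] + w[j]) * (c2 - t2) + w[1] * w[j] * (c1 - t1)
```

Scanning all pairs for a vanishing residual is the O(N^2) search for w_i and w_j mentioned on the slide.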
  • 48. Multiple Soft Errors October 26, 2016 48 Error modeling Encoding for checksum Locate and recover solution from multiple errors in the right factor But, what about performance?
  • 49. October 26, 2016 49 Check equation for the right factor: (s^ − U^ w) − (w_{j1} + w_{j2})(v^ − U^ w) + w_{j1} w_{j2} (c − U^ e) = 0. Vector form: O(N^3) for two errors, but so is LU!
  • 51. Complexity Reduction October 26, 2016 51 16×16 cores, 2 errors
  • 52. Recovery • Solver • Sherman-Morrison formula to recover the solution of Ax=b • Factorization • Through reducing a spiked matrix to recover the left & right factors October 26, 2016 52 Based on works by Luk et al. in the 1980s for systolic arrays
  • 54. Experiment Results October 26, 2016 54 CPU: 2 6-core Xeon 5660 GPU: NVIDIA M2070
  • 55. Agenda October 26, 2016 55 Motivation Dissertation Statement Original Work Background Hard Errors Soft Errors Contributions Publications
  • 56. Contributions • Hard Error • A performance-efficient method to protect the left factor (e.g. L in LU factorization) • Recovery of the running stack in QR factorization after a hard error using on-demand checkpointing • Soft Error • Scalable local (diskless) checkpointing to protect the left factor from soft errors • Floating-point-number weighted checksum encoding • Multiple-soft-error detection and recovery algorithm for the right factor and trailing matrix using the weighted checksum encoding • LU-based linear system solver • Factorization (demonstrated by QR) • Complexity reduction algorithm October 26, 2016 56
  • 57. Publication • Chapter 3 • "Algorithm-based Fault Tolerance for Dense Matrix Factorizations". Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault and Jack Dongarra. 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP' 12). February 2012, New Orleans, LA. • "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI". Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra. 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012). August 2012, Rhodes Island, Greece. • Chapter 4 • "High Performance Dense Linear System Solver with Soft Error Resilience". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. IEEE Cluster 2011. Austin, TX. • "High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors". Peng Du, Piotr Luszczek and Jack Dongarra. The International Conference on Computational Science (ICCS) 2012. Omaha, NE. • Chapter 5 • "Soft Error Resilient QR Factorization for Hybrid System with GPGPU". Peng Du, Piotr Luszczek, Stan Tomov and Jack Dongarra. The second workshop on Scalable algorithms for large-scale systems (Scala) 2011. Seattle, Washington. October 26, 2016 57
  • 60. Failures in PBLAS October 26, 2016 60 LARFT+LARFB A2Q
  • 61. Data Layout • 2D block cyclic distribution October 26, 2016 61 0 1 2 0 1 2 0 1 0 1 0 1 0 1 0 1 2 × 3 process grid Process (0,1) dies
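The 2D block-cyclic mapping on this slide reduces to taking block indices modulo the grid shape; a one-line sketch (zero-based indices, illustrative helper name):

```python
def block_owner(i, j, P, Q):
    """Process coordinates (p, q) owning block (i, j) of a matrix laid
    out block-cyclically on a P x Q process grid."""
    return (i % P, j % Q)

# e.g. on the slide's 2 x 3 grid, block row i goes to process row i mod 2
# and block column j goes to process column j mod 3.
```

When process (0,1) dies, every block (i, j) with i % 2 == 0 and j % 3 == 1 is lost, which is exactly the scattered pattern drawn on the slide.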
  • 62. Recover QR factorization on Hybrid Platform with GPGPU A~ = A + d e_j^T; given Q R = A, find Q~ R~ = A~.
  • 63. QR update October 26, 2016 63 Given A = Q R, find Q~ R~ = A~ = A + u v^T. A~ − A = (a_{•j} − Q R_{•j}) e_j^T = u v^T. A~ = Q R + u v^T = Q(R + Q^T u v^T) = Q(R + w v^T), with w = Q^T u = Q^T a_{•j} − R_{•j}. R + w v^T is restored to triangular form by an orthogonal transformation (fast Givens rotations on the GPU).
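The folding of the rank-one change into the triangular factor can be checked with NumPy (a serial sketch with made-up sizes; the retriangularization by Givens rotations, which the dissertation runs on the GPU, is deliberately not shown):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
Q, R = np.linalg.qr(A)

# a rank-one change confined to column j, as on the slide: At = A + u v^T
j = 3
u = rng.standard_normal(n)
v = np.zeros(n)
v[j] = 1.0
At = A + np.outer(u, v)

# fold the update into the triangular factor: At = Q (R + w v^T), w = Q^T u
w = Q.T @ u
middle = R + np.outer(w, v)   # upper triangular plus one column spike
recon = Q @ middle
```

Since Q is square and orthogonal here, Q (R + w v^T) reproduces the perturbed matrix exactly; the remaining work is rotating the spiked column back into the triangle.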
  • 64. Encoding for Q October 26, 2016 64 CPU GPU DGEQRF DLARFT Sending panel to GPU Look-ahead DLARFB Trailing DLARFB Matrix size: 17480 cublasSetMatrix() cudaMemcpy2DAsync()
  • 65. Error modeling (for “where”) Input matrix: A_0 = A. One step of LU: A_t = L_{t-1} P_{t-1} A_{t-1}. If no soft error occurs: U = (L_n P_n) … (L_1 P_1)(L_0 P_0) A_0. If a soft error occurs at step t: A~_t = L_{t-1} P_{t-1} A~_{t-1} − λ e_i e_j^T = (L_{t-1} P_{t-1} … L_0 P_0) A_0 − λ e_i e_j^T. Define an initial erroneous matrix A~ ≅ (L_{t-1} P_{t-1} … L_0 P_0)^{-1} A~_t = A − (L_{t-1} P_{t-1} … L_0 P_0)^{-1} λ e_i e_j^T = A − d e_j^T.
  • 66. Locate Error P[A~, Ae, Aw] = L~[U~, c, v], with A~ = A + d e_j^T ⇒ { P A~ = L~ U~, PAe = L~ c, PAw = L~ v }. Generator matrix G = [e, w]^T with e = (1, 1, …, 1)^T and w = (w_1, w_2, …, w_n)^T. Column j.
  • 67. LU based linear solver Ax = b, A = LU, x = U^{-1}(L^{-1} b)
  • 68. General work flow (1) Generate checksum for the input matrix as additional columns (2) Perform LU factorization WITH the additional checksum columns (3) Solve Ax=b using LU from the factorization (even if soft error occurs during LU factorization) (4) Check for soft error (5) Correct solution x
  • 69. Why are soft errors hard to handle? • Soft errors occur silently • Propagation
  • 70. Recover Ax=b • Sherman-Morrison formula
  • 71. Recover Ax=b Given: P A~ = L~ U~ and A~ x~ = b. To solve: Ax = b.
  • 72. Recover Ax=b Ax = b ⇒ x = A^{-1} b ⇒ x = A^{-1}(P^{-1} P) b = (PA)^{-1} P b. (PA)^{-1} = ?
  • 73. Recover Ax=b Recall: A − A~ = d e_j^T, so PA − P A~ = (P a_{•j} − L~ U~_{•j}) e_j^T. Therefore PA = L~ U~ + L~ (L~^{-1} P a_{•j} − U~_{•j}) e_j^T = L~ (U~ + t e_j^T) = L~ U~ (I + U~^{-1} t e_j^T) = L~ U~ (I + v e_j^T), with t = L~^{-1} P a_{•j} − U~_{•j} and v = U~^{-1} t.
  • 74. Recover Ax=b (PA)^{-1} = (L~ U~ (I + v e_j^T))^{-1} = (I + v e_j^T)^{-1} (L~ U~)^{-1} = (I − (1/(1 + v_j)) v e_j^T)(L~ U~)^{-1} (Sherman-Morrison)
  • 75. Recover Ax=b The solution of Ax = b is x = (I − (1/(1 + v_j)) v e_j^T) x~.
  • 76. Recover Ax=b (1) Solve L~ U~ x~ = P b. (2) Compute t = L~^{-1} P a_{•j} − U~_{•j}, v = U~^{-1} t, and x = (I − (1/(1 + v_j)) v e_j^T) x~.
  • 77. Recover Ax=b (1) Solve L~ U~ x~ = P b. (2) Compute t = L~^{-1} P a_{•j} − U~_{•j}, v = U~^{-1} t, and x = (I − (1/(1 + v_j)) v e_j^T) x~. Needs protection.
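The two-step recovery can be exercised end to end in a serial sketch: simulate a soft error in one column, factor only the erroneous matrix, then recover the true solution from the correct column a_{•j} via Sherman-Morrison. Sizes, the error column j, and the perturbation d are made-up test values:

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# simulate a soft error: what actually got factored is At = A - d e_j^T
j = 2
d = rng.standard_normal(n)
At = A.copy()
At[:, j] -= d

# factor only the *erroneous* matrix  (SciPy convention: At = P L U)
P, L, U = sla.lu(At)

# (1) solve L U xt = P^T b
xt = sla.solve_triangular(U, sla.solve_triangular(L, P.T @ b, lower=True))

# (2) Sherman-Morrison correction, using the true column a_{.,j}
t = sla.solve_triangular(L, P.T @ A[:, j], lower=True) - U[:, j]
v = sla.solve_triangular(U, t)
x = xt - (v * xt[j]) / (1.0 + v[j])
```

No refactorization of the corrected matrix is needed: two extra triangular solves and a rank-one correction recover x.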
  • 78. Locate Error PAe = L~ c ⇒ c = L~^{-1} P A e = L~^{-1} P (A~ + d e_j^T) e = L~^{-1} (P A~ + P d e_j^T) e = L~^{-1} (L~ U~ + P d e_j^T) e = U~ e + L~^{-1} P d ⇒ c − U~ e = L~^{-1} P d = r.
  • 79. Locate Error PAw = L~ v ⇒ v = L~^{-1} P A w = L~^{-1} P (A~ + d e_j^T) w = L~^{-1} (P A~ + P d e_j^T) w = U~ w + L~^{-1} P d w_j ⇒ v − U~ w = L~^{-1} P d w_j = s.
  • 80. Locate Error c − U~ e = L~^{-1} P d = r and v − U~ w = w_j L~^{-1} P d = s ⇒ s = w_j × r ⇒ w_j (1, 1, …, 1)^T = s ./ r • w_j is the j-th element of vector w in the generator matrix • Component-wise division of s and r reveals w_j • Searching for w_j in w reveals the initial soft error’s column
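The s ./ r location test can be demonstrated numerically. A hedged serial sketch (sizes, weights, and the injected error are arbitrary; the checksum columns encode the correct matrix before the error strikes):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
e = np.ones(n)
w = np.arange(1.0, n + 1)     # distinct weights so the column search is unique

# checksum columns encode the *correct* A
Ac = np.column_stack([A, A @ e, A @ w])

# inject a single soft error into column j before factorization
j = 4
d = rng.standard_normal(n)
Ac[:, j] += d                 # the factored matrix is A~ = A + d e_j^T

P, L, Uc = sla.lu(Ac)
U, c, v = Uc[:, :n], Uc[:, n], Uc[:, n + 1]

# residuals of the two check equations (both proportional to L^{-1} P d)
r = c - U @ e
s = v - U @ w

# component-wise division recovers w_j; searching w finds the error column
ratios = s / r
wj = np.median(ratios)        # robust pick, since every entry equals w_j
located = int(np.argmin(np.abs(w - wj)))
```

The median guards against the occasional near-zero entry of r amplifying roundoff in a single ratio.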
  • 81. Extra Storage • For an input matrix of size M x N on a P x Q grid • A copy of the original matrix • Not necessary when it’s easy to re-generate the required column of the original matrix • 2 additional columns: 2 x M entries • Each process holds 2 checksum rows of N/Q entries each: 2 x P x N in total • Ratio: extra storage / matrix storage = (2M + 2PN) / (MN) = 2/N + 2P/M → 0 as M → ∞
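The storage ratio above can be evaluated directly; a one-line helper under the slide's counts (2M entries for the columns, 2PN for the replicated rows, both as read off the bullet list):

```python
def extra_storage_ratio(M, N, P):
    """Extra checksum storage (2 columns of M entries plus 2 rows
    replicated across the P process rows, 2*P*N entries) relative to
    the M x N matrix itself; tends to 0 as the matrix grows."""
    return (2 * M + 2 * P * N) / (M * N)
```

For a 1000 x 1000 matrix on a grid with P = 4 the overhead is already only 1 percent.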

Editor's Notes

  1. Good morning. The topic of my defense is xxxx.
  2. Here is the agenda. We’ll start with motivation, then the contributions of this dissertation will be discussed. Then we’ll go into the two main parts of the work, hard and soft errors, and we’ll conclude the talk with the related publications.
  3. In high performance computing, fault tolerance is not a new topic, but it is a very critical one: computation is meaningful only if it can be finished successfully. One of the most important reasons is scaling, both in system size and in chip density. Here is the newest TOP500 list of supercomputers in the world; notice the number of cores on the list. The smallest system uses more than 180,000 cores and the fastest one uses more than one and a half million cores.
  4. And if we take into consideration the mean time between failures, overall system dependability becomes a huge concern. According to statistics from several national labs, and depending on the kind of system being used, say commodity components such as AMD and Intel CPUs, the whole-system MTBF could be as low as a day, or 24 hours; this is actually the case with Kraken, which uses AMD CPUs. This becomes an issue if the application run time is longer than 24 hours, which means the application might not even finish before being interrupted by a failure.
  5. In order to deal with this situation,
  6. We conducted this work to provide fault tolerance for dense linear algebra computations on large-scale systems. Our focus includes both hard and soft errors, which we’ll discuss in detail in just a minute, and we focus on both the error-correcting capability and the performance impact on large-scale systems. Also, as more and more systems are using GPUs as accelerators, we have extended our work to such hybrid systems.
  7. And here is the original work.
  8. For both hard and soft errors, we provide efficient and scalable algorithms. For hard errors, we also give a method to recover the running stack without heavy support from MPI. Soft errors are more challenging because of error propagation. The problem is tackled from three perspectives: the encoding, the error detection and location, and complexity management for large-scale computation.
  9. Fault tolerance techniques have been developed for many years. At the hardware level, for example in the memory system and compute logic, the most common methods are parity- and coding-based ECC. For performance overhead reasons, most systems only adopt simpler ECC such as single-bit error correction, double-bit error detection. Nowadays the most commonly used fault tolerance mechanism is still disk-based checkpoint and restart: at a certain interval, program state is written to stable disk storage, and at the time of failure, the program is restored by loading state from disk. Such disk checkpointing is easy to implement but has large overhead due to the slow I/O during checkpointing; methods such as these have been proposed to reduce the overhead.
  10. To further reduce overhead, diskless checkpointing was proposed to checkpoint data in memory rather than on disk. For a certain type of application, for example one-sided dense linear algebra operations, algorithm-based fault tolerance, or ABFT, provides an even cheaper option for fault tolerance. With ABFT, the checksum is generated only once and is updated along with the host algorithm. Specifically for soft errors, other methods such as backward error assertion have also been proposed, where iterative refinement is used to correct soft errors. In our work, the proposed methods use a combination of disk, diskless, and ABFT techniques to fight both hard and soft errors.
  12. Among the types of failures, the two most common are the hard error and the soft error. A hard error, or “fail-stop failure”, interrupts the running application immediately when it happens. Once the failure is fixed, meaning the bad device has been replaced and the data has been restored, the execution can resume and run to the end with the correct answer.
  13. Soft errors, on the other hand, are mostly caused by radiation such as the aforementioned alpha particles. The radiation affects the memory device by flipping bits silently. When this happens to an application, a silent error occurs but is visible to neither the application nor the system, and therefore the application is not interrupted; it runs to the end, and the computing result is incorrect with no clear reason.
  14. We’ll use LU factorization as the example for the rest of the talk, but the same method can be applied to other factorizations such as QR, and we already have results for QR as well. LU factorization produces L and U, which can be used to solve a linear system Ax=b in this way. This is how it looks in Matlab.
  15. For better performance, block LU is often used. It starts with a panel factorization, followed by the triangular solve and the trailing matrix update using matrix multiplication.
  16. From both LU and QR, we can see that roughly two sections exist, and in this work they are referred to as the left factor and the right factor. In the left factor, data does not change after being computed. In the right factor, however, data keeps being updated, which can cause error propagation if an error occurs in this area.
  17. Now that we have seen the background of dense linear algebra computation, let’s move on to the hard error fault tolerance.
  18. For hard errors, the key problem is how to protect the left factor, because the right factor can be protected easily by ABFT. This has been shown by others’ work on Cholesky factorization and HPL.
  19. Here is the algorithm that we propose for this case, which is one hard error, for both LU and QR. The details are in the paper cited below. This figure shows the checkpointing example of the same 8 by 8 matrix on a 2 by 3 process grid. Every Q panel factorizations, in this case 3, the panels that were just factored are checkpointed horizontally and put on the right of the matrix in reverse order. For example, the checksum for the first 3 panels goes here, and the checksum for the second 3 panels goes here, and so on. The solid color means the ABFT checksum, which updates itself automatically. This algorithm does not use extra storage for the left-factor checkpoint, and the performance overhead of both the checkpointing and the recovery is small.
  20. Here is a running example, shown as heat maps. We did two runs, one with an error and one without, and the brightness corresponds to the difference between these two runs: the lighter it is, the bigger the difference. Matrix size 800 x 800, block size 100, on a 2 by 3 grid. We have the matrix here and the checksum is here. First the checksums are recovered, then the data in the U and A’ sections. Finally, since the error occurred during the Q-panel iteration, all three of these Q panels are rolled back and re-calculated, which concludes the recovery. After this, the computation resumes as if no failure had occurred.
  21. Here is the overhead for this algorithm. Both storage overhead and the computational overhead.
  23. Soft errors are more challenging because they happen in silence, and because of that, even one initial error could easily propagate into many errors over a large area. Also, detection and recovery have to be done after everything is finished, which leaves more time for the propagation damage.
  24. The floating point encoding works straightforwardly for the left factor, but extending it to the right factor is a major challenge, because now we have the propagation issue at hand, and the computing complexity could be higher than that of the factorization, depending on how many errors we want to tolerate. We have developed solutions to both issues, which I’m not going into in detail here.
  25. Before we dive into the details, here is the general work flow of the soft error resilient algorithm. Again we use LU as the example, and it can be extended to QR.
  26. Here are the three main parts of the soft error fault tolerance mechanism. Namely the xxx. And we’ll discuss them one by one.
  27. Let’s start with the left factor.
  28. We need to protect the left factor not only because it’s part of the factorization result, but also because it is used in the recovery of other parts of the computation result. For example, the recovery of the solution to Ax=b requires a correct L. The good news is that, similar to the hard error case, L can also be protected by some form of diskless checkpointing.
  29. The first idea that comes to mind is a vertical checkpointing based on parallel matrix-matrix multiplication. Similar to the hard error case, this works, but the performance overhead is too large. In fact, as we look closely at the specific operation, we notice that the vertical global sum is not really necessary, and hence comes the idea that is scalable.
  30. In this so-called “local checkpointing” method, each process in the 2D block-cyclic grid checkpoints its own local data. For example…
  31. The second part of soft error fault tolerance is the error modeling. Error modeling helps locate the error when combined with the encoding part that we’ll see in a bit.
  33. Second let’s talk about the floating point number encoding.
  35. Let’s start with 2D block cyclic. Here is an example of a matrix divided into 8x8 blocks, and distributed on a 2 by 3 process grid.
  42. Requires a copy of A at the beginning.
  43. w_j on the left of s reminds us what w is; look for a match of w_j in w.