H-cholesky on manycore

H-Cholesky Factorization on Many-Core Accelerators
Gang Liao
August 2, 2015

Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Background
If A is a symmetric positive definite matrix, its Cholesky factorization is $A = LL^{T}$, where $L$ is a lower triangular matrix.
Data matrices that represent numerical observations, such as proximity
matrices or correlation matrices, are often huge and hard to analyze.
Decomposing them into lower-order or lower-rank canonical forms reveals
their inherent characteristics and structure and makes their meaning
easier to interpret.
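
The factorization of A into a lower triangular factor and its transpose can be illustrated with a minimal dense sketch. This is plain textbook Cholesky in pure Python, not the code used in this work; the function name `cholesky` and the 3x3 test matrix are chosen here for illustration only.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that a == L * L^T.

    `a` is assumed to be a symmetric positive definite matrix
    given as a list of rows (illustrative helper, not from the slides).
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns 0..j-1.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

For this matrix the factor has rows (2, 0, 0), (6, 1, 0) and (-8, 5, 3). In practice the dense case is handled by a LAPACK routine such as dpotrf rather than by hand-written loops.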

Hierarchical Matrix
Hierarchical matrices (H-matrices) are a powerful tool to represent dense
matrices coming from integral equations or partial differential equations in a
hierarchical, block-oriented, data-sparse way with log-linear memory costs.


Intel Confidential
Implementation: Inadmissible Leaves
The product index set resolves into admissible and inadmissible leaves of the tree.
Assembly, storage, and matrix-vector multiplication differ between the two
corresponding classes of submatrices.

Implementation: Admissible Leaves
Admissible leaves form the second class of submatrices. Their assembly, storage,
and matrix-vector multiplication differ from those of the inadmissible leaves.
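
The difference between the two leaf classes can be sketched for the matrix-vector product. The sketch assumes the usual H-matrix convention that an inadmissible leaf is stored as a dense block, while an admissible leaf is stored as a rank-k factorization with factors `a` (m x k) and `b` (n x k). The names `a` and `b` echo the `r->a` and `r->b` fields mentioned later in this deck, but the function names and the exact data layout are assumptions made for illustration.

```python
def dense_matvec(block, x):
    """Inadmissible leaf: y = block * x with a full m x n block, O(m*n) work."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in block]

def lowrank_matvec(a, b, x):
    """Admissible leaf: y = a * (b^T * x), O(k*(m+n)) work.

    a is m x k and b is n x k, so the block a * b^T is never formed.
    """
    k = len(a[0])
    # t = b^T * x, a vector of length k.
    t = [sum(b[j][c] * x[j] for j in range(len(x))) for c in range(k)]
    # y = a * t, a vector of length m.
    return [sum(row[c] * t[c] for c in range(k)) for row in a]

# Rank-1 example: the same 3 x 3 block stored both ways.
a = [[1.0], [2.0], [3.0]]
b = [[1.0], [0.0], [2.0]]
full = [[ai[0] * bj[0] for bj in b] for ai in a]
x = [1.0, 1.0, 1.0]

y_dense = dense_matvec(full, x)
y_lowrank = lowrank_matvec(a, b, x)
print(y_dense, y_lowrank)
```

Both calls return the same vector, but the factorized form stores k(m+n) numbers instead of mn, which is where the data-sparse storage of H-matrices comes from.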

Hierarchical Matrix Representation

Profiling

Compiler Optimization – Full matrix
icc opt1: icc with optimizations such as -O2.
icc opt2: icc with optimizations such as -msse4.2 -O3.
icc mkl: icc opt2 plus MKL functions.

Numerical Libraries Optimization – Full matrix
dpotrf_ vs plasma_dpotrf vs magma_dpotrf
MKL: Intel Math Kernel Library (Intel MKL) accelerates math processing routines.
PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures
MAGMA: Matrix Algebra on GPU and Multicore Architectures

Parallel Optimization
Task-based DAG computation splits the H-Cholesky factorization into individual
tasks and defines the dependencies between them, so that the tasks form a
directed acyclic graph (DAG).
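
The task-and-dependency idea can be sketched on a flat grid of nb x nb tiles. This is a simplification: the actual H-Cholesky DAG is built over the hierarchical block structure, and the task names (`potrf`, `trsm`, `update`) and helper functions below are illustrative assumptions rather than the author's implementation. The sketch uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def cholesky_dag(nb):
    """Map each task of a tiled Cholesky to the set of tasks it depends on."""
    deps = {}
    for k in range(nb):
        # Factor the diagonal tile (k, k) once its last update has finished.
        potrf = ("potrf", k)
        deps[potrf] = {("update", k, k, k - 1)} if k > 0 else set()
        for i in range(k + 1, nb):
            # Triangular solve for tile (i, k) needs the factored diagonal tile.
            d = {potrf}
            if k > 0:
                d.add(("update", i, k, k - 1))
            deps[("trsm", i, k)] = d
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                # Tile (i, j) is updated with the panels (i, k) and (j, k).
                d = {("trsm", i, k), ("trsm", j, k)}
                if k > 0:
                    # Updates to the same tile are applied in order of k.
                    d.add(("update", i, j, k - 1))
                deps[("update", i, j, k)] = d
    return deps

def parallel_waves(deps):
    """Group tasks into waves; tasks within a wave have no mutual dependencies."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        waves.append(ready)
        ts.done(*ready)
    return waves

waves = parallel_waves(cholesky_dag(3))
for n, wave in enumerate(waves):
    print(n, wave)
```

Every task in one wave could be handed to a different core. A real runtime would start each task as soon as its own dependencies finish instead of waiting for the whole wave.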

Code Analysis

Multicore Optimization – H-Cholesky Factorization
Example 1:
Example 2:

Manycore Optimization – H-Cholesky Factorization
1. Allocate accelerator memory and copy r->a[row_offset] and r->b[col_offset] to the accelerators.
2. Copy the result ft->e from the accelerators back to host (CPU) memory.

Result & Conclusion
[Figure: H-Cholesky decomposition where the problem size (vertices) is 10002. x-axis: nmin (leaf size), 0 to 4500; y-axis: time (sec), 0 to 12; series: MKL, Hybrid.]
H-Cholesky factorization on many-core accelerators is extremely efficient
and scales well to large H-matrices.
