H-Cholesky Factorization on Many-Core Accelerators
Gang Liao
August 2, 2015
Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Background
If A is a symmetric positive definite matrix, its Cholesky factorization is A = L L^T, where L is lower triangular.
Data matrices that represent numerical observations, such as
proximity or correlation matrices, are often huge and hard to
analyze. Decomposing them into lower-order or lower-rank
canonical forms reveals their inherent characteristics and
structure and makes their meaning easier to interpret.
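As a concrete illustration of the factorization A = L L^T defined on this slide, here is a minimal pure-Python sketch of the textbook column-by-column algorithm. The function name `cholesky` and the 3-by-3 test matrix are illustrative choices, not taken from the slides. The slides themselves use library routines such as dpotrf rather than hand-written code.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a = L L^T.

    `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what remains of a[j][j] after removing earlier columns.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

For this matrix the factor is L = [[2, 0, 0], [6, 1, 0], [-8, 5, 3]], which can be checked by multiplying L by its transpose.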
Hierarchical Matrix
Hierarchical matrices (H-matrices) are a powerful tool to represent dense
matrices coming from integral equations or partial differential equations in a
hierarchical, block-oriented, data-sparse way with log-linear memory costs.
Intel Confidential
Implementation: Inadmissible Leaves
The product index set resolves into admissible and inadmissible leaves of the tree.
Assembly, storage, and matrix-vector multiplication differ between the two corresponding
classes of submatrices.
Implementation: Admissible Leaves
The product index set resolves into admissible and inadmissible leaves of the tree.
Assembly, storage, and matrix-vector multiplication differ between the two corresponding
classes of submatrices.
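The difference in storage and matrix-vector multiplication between the two leaf classes can be sketched as follows. The figures for these slides are not available, so this assumes the usual H-matrix convention: an inadmissible leaf is stored as a full (dense) block, and an admissible leaf is stored as a rank-k factorization A B^T. The function names and the small test data are hypothetical.

```python
def dense_matvec(e, x):
    """y = E x for an inadmissible leaf stored as a full m-by-n block (list of rows)."""
    return [sum(e_ij * x_j for e_ij, x_j in zip(row, x)) for row in e]

def lowrank_matvec(a, b, x):
    """y = A (B^T x) for an admissible leaf stored as A (m-by-k) and B (n-by-k)."""
    k = len(a[0])
    # t = B^T x has only k entries, so the cost is O(k (m + n)) instead of O(m n).
    t = [sum(b[j][r] * x[j] for j in range(len(x))) for r in range(k)]
    return [sum(a_i[r] * t[r] for r in range(k)) for a_i in a]

# A rank-1 block with m = 3 rows and n = 4 columns.
a = [[1.0], [2.0], [3.0]]
b = [[1.0], [0.0], [2.0], [1.0]]
x = [1.0, 1.0, 1.0, 1.0]

# The same block written out entry by entry, as a dense leaf would store it.
full = [[a_i[0] * b_j[0] for b_j in b] for a_i in a]

y_dense = dense_matvec(full, x)
y_lowrank = lowrank_matvec(a, b, x)
print(y_dense, y_lowrank)
```

Both calls give the same vector, but the dense block stores m * n = 12 numbers while the factors store k * (m + n) = 7. That gap grows with block size, which is where the data-sparse storage of an H-matrix comes from.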
Hierarchical Matrix Representation
Profiling
Compiler Optimization – Full matrix
icc opt1: icc with optimizations such as -O2.
icc opt2: icc with default optimizations such as -msse4.2 -O3.
icc mkl: icc opt2 plus MKL functions.
Numerical Libraries Optimization – Full matrix
dpotrf_ vs. plasma_dpotrf vs. magma_dpotrf
MKL: Intel Math Kernel Library (Intel MKL) accelerates math processing routines.
PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures
MAGMA: Matrix Algebra on GPU and Multicore Architectures
Parallel Optimization
The concept of task-based DAG computations is used to split the H-Cholesky
factorization into single tasks and to define corresponding dependencies to form
a DAG.
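To make the task-and-dependency idea concrete, the sketch below builds the dependency graph for an ordinary tiled (dense) Cholesky factorization and groups the tasks into waves that could run in parallel. This is a simplified stand-in, not the H-Cholesky DAG from the slides, whose tasks follow the block tree. The task names ("potrf", "trsm", "update") and both function names are illustrative assumptions. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def cholesky_task_graph(n_blocks):
    """Map each task to the set of tasks it depends on (tiled dense Cholesky)."""
    deps = {}
    for k in range(n_blocks):
        # Factor diagonal block (k, k) once its last update has finished.
        deps[("potrf", k)] = {("update", k, k, k - 1)} if k > 0 else set()
        for i in range(k + 1, n_blocks):
            # Triangular solve for block (i, k) needs the factored diagonal block.
            d = {("potrf", k)}
            if k > 0:
                d.add(("update", i, k, k - 1))
            deps[("trsm", i, k)] = d
        for i in range(k + 1, n_blocks):
            for j in range(k + 1, i + 1):
                # Block (i, j) -= L[i][k] * L[j][k]^T needs both solved panels.
                d = {("trsm", i, k), ("trsm", j, k)}
                if k > 0:
                    d.add(("update", i, j, k - 1))
                deps[("update", i, j, k)] = d
    return deps

def parallel_waves(deps):
    """Group tasks into waves; all tasks in one wave are mutually independent."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        waves.append(ready)
        ts.done(*ready)
    return waves

deps = cholesky_task_graph(3)
order = list(TopologicalSorter(deps).static_order())
waves = parallel_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

A real scheduler would start each task as soon as its own predecessors finish rather than waiting for a whole wave. The waves are only meant to show how much parallelism the dependencies allow.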
Code Analysis
Multicore Optimization – H-Cholesky Factorization
Example 1:
Example 2:
Manycore Optimization – H-Cholesky Factorization
1. Allocate and copy r->a[row_offset] and r->b[col_offset] to the accelerators.
2. Copy the result ft->e from the accelerators back to CPU host memory.
Result & Conclusion
[Figure: H-Cholesky decomposition where the problem size (vertices) is 10002. x-axis: nmin (leaf size), 0 to 4500; y-axis: Time (sec), 0 to 12; series: MKL, Hybrid.]
H-Cholesky factorization on many-core accelerators is
extremely efficient and also scales well to large-scale
H-matrices.