Matt Challacombe, Nicolas Bock & Terry Haut
Los Alamos National Laboratory
matt.challacombe@freeon.org
Thanks be to LANL, operated by Los Alamos National Security, LLC for the NNSA of the
USDoE under Contract No. DE-AC52-06NA25396; released under LA-UR-15-26489
Electronic Structure Theory as
Generic N-Body Problem
Electronic Structure Methods for Large Systems
August ACS, Boston 2015
Barriers for Large Systems
What's Holding O(N) Methods Back?
• Sparse approximation based on matrix decay
• SpGEMM kernels cannot access strong parallel scaling
• Optimization inhibits evolution:
o Entrenched data structures dictate research (row-col)
o Funding cycles limit data innovation
Our Vision:
• Fast, generic and data local
• New math, one programming model
Conventional O(N) Quantum Chemistry
• Pretend interesting quantum mechanics is highly local
• Conventional SpGEMM kernels (row-col)
• Pray for stability, error control …
[Figure: matrix decay and “sparsification” by radial cutoff and numerical threshold]
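The two conventional sparsification strategies named on this slide can be sketched on a toy matrix. This is a minimal illustration, assuming an exponentially decaying model matrix and NumPy; the helper names (`decaying_matrix`, `radial_cutoff`, `numerical_threshold`) are hypothetical and not from the talk.

```python
import numpy as np

def decaying_matrix(n, decay=0.5):
    """Toy symmetric matrix with elements exp(-decay * |i - j|) (an assumed model)."""
    idx = np.arange(n)
    return np.exp(-decay * np.abs(idx[:, None] - idx[None, :]))

def radial_cutoff(a, radius):
    """Zero every element whose index separation |i - j| exceeds `radius`."""
    idx = np.arange(a.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= radius
    return np.where(mask, a, 0.0)

def numerical_threshold(a, tau):
    """Zero every element whose magnitude is below `tau`."""
    return np.where(np.abs(a) >= tau, a, 0.0)

def fill_fraction(a):
    """Fraction of stored (non-zero) elements."""
    return np.count_nonzero(a) / a.size

a = decaying_matrix(200, decay=0.5)
a_cut = radial_cutoff(a, radius=20)
a_thr = numerical_threshold(a, tau=1e-6)

print("fill after radial cutoff:", fill_fraction(a_cut))
print("fill after thresholding: ", fill_fraction(a_thr))
print("max dropped element (threshold):", np.max(np.abs(a - a_thr)))
```

Both filters only pay off when the decay is fast; with a slow decay rate the fill fraction stays close to one, which is the situation the later slides address.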
The Parallel SpGEMM in Quantum Chemistry (I)
Also:
• Solvers entangled with DBCSR require locality
• Massive redistribution on each SCF cycle
[Figure labels: Randomize; Cannon, SUMMA; p = 3]
Best parallel SpGEMM is Aydın Buluç’s method for work homogenization.
• Cannon’s algorithm for redistribution throttled as $O(\sqrt{p})$
• Cannot access strong scaling regime, $p \gg N$
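For readers unfamiliar with Cannon's algorithm, the sketch below simulates it serially on a q × q grid of blocks (so p = q² notional processes) and counts the shift rounds, of which there are q = √p. It is a dense, single-process illustration written for this transcript; `cannon_matmul` is a hypothetical name and nothing here reflects DBCSR or any production code.

```python
import numpy as np

def cannon_matmul(a, b, q):
    """Simulate Cannon's algorithm on a q x q process grid; returns (C, shift rounds)."""
    n = a.shape[0]
    assert a.shape == b.shape == (n, n) and n % q == 0
    nb = n // q

    def block(m, i, j):
        return m[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb].copy()

    # Block (i, j) of each matrix starts on notional process (i, j).
    A = [[block(a, i, j) for j in range(q)] for i in range(q)]
    B = [[block(b, i, j) for j in range(q)] for i in range(q)]
    C = [[np.zeros((nb, nb)) for _ in range(q)] for _ in range(q)]

    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]

    rounds = 0
    for _ in range(q):
        # Local multiply-accumulate on every process.
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][j] @ B[i][j]
        # Shift A left by one and B up by one (nearest-neighbour communication).
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
        rounds += 1
    return np.block(C), rounds

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
b = rng.standard_normal((6, 6))
c, rounds = cannon_matmul(a, b, q=3)
print("shift rounds:", rounds, "for p =", 3 * 3, "processes")
print("matches dense product:", np.allclose(c, a @ b))
```

The number of communication rounds grows with the grid edge q rather than staying constant, which is the throttling the bullet above refers to.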
Fock Build
$F \leftarrow J[P] + a\,K_F[P] + b\,K_{xc}[P]$
Spectral Projection
$P \leftarrow \theta[F - \mu I]$
[Figure labels: Randomize; Localize; p = 2; p = 1; p = 0]
Bowler et al., arXiv:1402.6828
• So far, no QC code has
demonstrated a parallel
SpGEMM beyond
~ 4 atoms/core
• Enables bigger systems, but
not higher throughput
The Parallel SpGEMM in Quantum Chemistry (II)
Problem:
• row-col data structures do not respect locality
• row-col data structures do not support queries
• row-col decompositions lack flexibility
Also… High Value Correlations are Extended
• Delocalized physics and local support lead to matrix
ill-conditioning and dense matrices
• Ill-conditioning is a feature, not a bug
• Allows control of the interpolation range:
[Figure labels: Quantum Transport (LCAOs); Radial Basis Function Networks; High Resolution PDE solvers (RBFs)]
• A hierarchical solver with generic database operations,
data locality and local approximation
• Genericity: Treat metric queries same as range queries,
same as higher dimensional queries, same as …
• Kernel independent skeletonizations
Datacentric, Generic N-Body Solvers
More Locality in Higher Dimensions (I)
• For ill-conditioned systems, matrix decay can be very, very
slow, with matrices that remain dense
• Instead of looking for data compression or sparsity, look
for locality in product tensor volume
• First, a generic quadtree matrix:
• block-by-magnitude ordering
• metric locality resolved by quadtree
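A quadtree matrix of the kind described above can be sketched as a recursive 2 × 2 split in which every node carries the Frobenius norm of its sub-block, so that magnitude (metric) queries can prune whole subtrees. This is a minimal sketch assuming a square matrix whose dimension is the leaf size times a power of two; `QuadNode`, `build_quadtree` and `count_leaves_above` are hypothetical names, and the block-by-magnitude ordering itself is not attempted here.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuadNode:
    norm: float                          # Frobenius norm of this sub-block
    block: Optional[np.ndarray] = None   # dense data, leaves only
    children: Optional[list] = None      # 2 x 2 nested list of QuadNode, internal nodes only

def build_quadtree(a, leaf=4):
    """Recursively split `a` into quadrants, decorating each node with its norm."""
    n = a.shape[0]
    norm = float(np.linalg.norm(a))      # Frobenius norm for a 2-D array
    if n <= leaf:
        return QuadNode(norm=norm, block=a.copy())
    h = n // 2
    children = [[build_quadtree(a[i * h:(i + 1) * h, j * h:(j + 1) * h], leaf)
                 for j in range(2)] for i in range(2)]
    return QuadNode(norm=norm, children=children)

def count_leaves_above(node, tau):
    """Metric query: number of leaf blocks reachable through nodes with norm >= tau."""
    if node.norm < tau:
        return 0                          # the whole subtree is pruned
    if node.children is None:
        return 1
    return sum(count_leaves_above(c, tau) for row in node.children for c in row)

idx = np.arange(32)
a = np.exp(-1.0 * np.abs(idx[:, None] - idx[None, :]))   # toy decaying matrix
root = build_quadtree(a, leaf=4)
print("all leaves:        ", count_leaves_above(root, 0.0))
print("leaves above 1e-6: ", count_leaves_above(root, 1e-6))
```

Because the squared norms of the four children sum to the squared norm of the parent, a small parent norm guarantees that everything beneath it is small as well.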
More Locality in Higher Dimensions (II)
• SpAMM is a recursive occlusion-cull of product intermediates
• Double metric query of modified Cauchy-Schwarz criteria
• More task locality in higher dimensional product volume (figure labels: occlusion, cull)
• The relative error in the product is bounded
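The recursive occlusion-cull can be illustrated with a short dense sketch: at every level of the 2 × 2 × 2 recursion, a sub-product is skipped when the product of the Frobenius norms of its two operands falls below τ. The plain criterion ‖a‖·‖b‖ < τ used here is an assumption for illustration (the slide refers to a modified criterion that did not survive extraction), and `spamm` is a hypothetical name. A real kernel would read cached norms from the quadtree instead of recomputing them.

```python
import numpy as np

def spamm(a, b, tau, leaf=4):
    """Sketch of a sparse approximate matrix multiply by recursive culling."""
    n = a.shape[0]
    # Occlusion test: by sub-multiplicativity ||a @ b||_F <= ||a||_F * ||b||_F,
    # so a culled sub-product contributes less than tau to the error.
    if np.linalg.norm(a) * np.linalg.norm(b) < tau:
        return np.zeros((n, n))
    if n <= leaf:
        return a @ b
    h = n // 2
    c = np.zeros((n, n))
    for i in range(2):
        for j in range(2):
            for k in range(2):
                c[i * h:(i + 1) * h, j * h:(j + 1) * h] += spamm(
                    a[i * h:(i + 1) * h, k * h:(k + 1) * h],
                    b[k * h:(k + 1) * h, j * h:(j + 1) * h],
                    tau, leaf)
    return c

idx = np.arange(64)
a = np.exp(-1.0 * np.abs(idx[:, None] - idx[None, :]))   # toy decaying matrix
exact = a @ a
approx = spamm(a, a, tau=1e-6)
err = np.linalg.norm(exact - approx)
print("absolute error:", err, " relative:", err / np.linalg.norm(exact))
```

With τ = 0 nothing is culled and the recursion reproduces the dense product; increasing τ trades accuracy for skipped work in the three-index product volume.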
More Locality in Higher Dimensions (III)
[Figure labels: physics; blocking; block-by-magnitude; slow decay; metric locality; locality principle; more local in product volume; occlusion-cull $\sim a_i b_i$; space filling curve]
A N-Body Solver for Square Root Iteration
In addition to metric locality, exploit algebraic locality in resolution of the identity:
$I(s) = s^{1/2} \cdot s^{-1/2}$
Square root iteration equivalent to matrix sign problem under Higham’s identity:
$\mathrm{sign}\begin{pmatrix} 0 & s \\ I & 0 \end{pmatrix} = \begin{pmatrix} 0 & s^{1/2} \\ s^{-1/2} & 0 \end{pmatrix}$
Challacombe, Haut & Bock, arXiv (2015)
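Higham's identity can be checked numerically on a small symmetric positive definite matrix. The sketch below computes the matrix sign with the standard Newton iteration X ← (X + X⁻¹)/2 and compares the off-diagonal blocks with square roots obtained from an eigendecomposition. The test matrix and the helper names (`matrix_sign_newton`, `spd_power`) are assumptions made for this illustration.

```python
import numpy as np

def matrix_sign_newton(m, iters=50):
    """Matrix sign function via Newton's iteration X <- (X + X^{-1}) / 2."""
    x = m.copy()
    for _ in range(iters):
        x_new = 0.5 * (x + np.linalg.inv(x))
        if np.linalg.norm(x_new - x) <= 1e-14 * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x

def spd_power(s, power):
    """s**power for a symmetric positive definite s, by eigendecomposition."""
    w, v = np.linalg.eigh(s)
    return (v * w ** power) @ v.T

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 6))
s = g @ g.T + 6.0 * np.eye(6)            # well-conditioned SPD test matrix
n = s.shape[0]

# Block matrix [[0, s], [I, 0]] from the identity above.
m = np.block([[np.zeros((n, n)), s], [np.eye(n), np.zeros((n, n))]])
sgn = matrix_sign_newton(m)

print("upper-right block is s^(1/2): ", np.allclose(sgn[:n, n:], spd_power(s, 0.5)))
print("lower-left block is s^(-1/2): ", np.allclose(sgn[n:, :n], spd_power(s, -0.5)))
```

The diagonal blocks of the computed sign vanish, so a single sign evaluation delivers both the square root and the inverse square root.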
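Instances of Square Root Iteration
• Square root iteration (sqi) with map $h_\alpha[\cdot]$ and $\otimes_\tau$ algebra
• $z_k \to s^{-1/2}$, $y_k \to s^{1/2}$, $x_k \to I(s)$ with $\tau \to 0$
• Two instances we care about, single and dual channel:
$\mathrm{sqi}_{\mathrm{dual}}(s, \tau) :=$
  $x_0 = s/s_0$, $y_0 = x_0$, $z_0 = I$, $\tau_s \sim .01 \times \tau$
  while $|\mathrm{tr}\,x_k - n|/n > \tau$
    $z_k \leftarrow z_{k-1} \otimes_\tau h_\alpha[x_{k-1}]$
    $y_k \leftarrow h_\alpha[x_{k-1}] \otimes_{\tau_s} y_{k-1}$
    $x_k \leftarrow y_k \otimes_\tau z_k$
  return $\{z_\tau \leftarrow z_k\}$
$\mathrm{sqi}_{\mathrm{single}}(s, \tau) :=$
  $x_0 = s/s_0$, $z_0 = I$, $\tau_s \sim .01 \times \tau$
  while $|\mathrm{tr}\,x_k - n|/n > \tau$
    $z_k \leftarrow z_{k-1} \otimes_\tau h_\alpha[x_{k-1}]$
    $x_k \leftarrow z_k^T \otimes_\tau (s \otimes_{\tau_s} z_k)$
  return $\{z_\tau \leftarrow z_k\}$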
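The dual channel instance can be exercised with exact dense products standing in for the SpAMM product, which corresponds to the τ → 0 limit of the culling. The sketch assumes the unscaled Newton-Schulz map h[x] = (3I − x)/2 for the map h, takes the scale s₀ to be the spectral norm of s, and uses τ only as the convergence tolerance; these choices and the names `h_map` and `sqi_dual` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def h_map(x):
    """Newton-Schulz map h[x] = (3I - x) / 2 (assumed unscaled form)."""
    return 0.5 * (3.0 * np.eye(x.shape[0]) - x)

def sqi_dual(s, tau, max_iter=100):
    """Dual channel square root iteration with exact products.

    Returns (approx s^{-1/2}, approx s^{1/2}, iterations used).
    """
    n = s.shape[0]
    s0 = np.linalg.norm(s, 2)        # assumed scale: puts the spectrum of x0 in (0, 1]
    x = s / s0
    y = x.copy()
    z = np.eye(n)
    iters = 0
    while abs(np.trace(x) - n) / n > tau and iters < max_iter:
        h = h_map(x)
        z = z @ h                    # z_k <- z_{k-1} h[x_{k-1}]
        y = h @ y                    # y_k <- h[x_{k-1}] y_{k-1}
        x = y @ z                    # x_k <- y_k z_k
        iters += 1
    # Undo the initial scaling by s0.
    return z / np.sqrt(s0), y * np.sqrt(s0), iters

rng = np.random.default_rng(0)
g = rng.standard_normal((6, 6))
s = g @ g.T + 6.0 * np.eye(6)        # well-conditioned SPD test matrix
z, y, iters = sqi_dual(s, tau=1e-10)

print("iterations:", iters)
print("z s z = I:", np.allclose(z @ s @ z, np.eye(6), atol=1e-7))
print("y z = I:  ", np.allclose(y @ z, np.eye(6), atol=1e-7))
```

Replacing the `@` products with a thresholded product would turn this into the approximate iteration discussed on the following slides, at which point the stability questions raised there become relevant.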
• Super-linear convergence contracts error about identity:
• stability is guaranteed for < 1
• First order variation along unit errors:
• Derivatives are strongly contractive towards identity iteration
(orientational convergence kills error accumulation):
Contractive Identity Iteration (Stability)
Contractive Identity Iteration (Lensing)
• Matrix Market bcsstk14 (Roof of Omni Coliseum)
• Condition number is $10^{10}$, $\tau = 10^{-5}$
• Contraction in the product
Metric locality:
• Locality principle + Cartesian or non-Euclidean separations
• Ordering: space filling curve, graph theory, etc. (random ordering destroys locality)
Algebraic locality:
• Iteration collapses volumes to one plane (lensing)
• SpAMM bound strengthens
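As one concrete example of a locality-preserving ordering, the sketch below sorts 2-D grid points along a Z-order (Morton) curve by interleaving the bits of their coordinates. The slides do not say which space filling curve is used, so the Morton curve is an assumption chosen for brevity, and `interleave_bits` and `morton_order` are hypothetical names.

```python
def interleave_bits(x, y, bits=16):
    """Morton key: bit b of x goes to position 2b, bit b of y to position 2b + 1."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def morton_order(points, bits=16):
    """Indices that sort non-negative integer (x, y) points along the Z-order curve."""
    return sorted(range(len(points)),
                  key=lambda i: interleave_bits(points[i][0], points[i][1], bits))

# A 4 x 4 grid listed row by row, then reordered along the curve.
pts = [(x, y) for y in range(4) for x in range(4)]
ordered = [pts[i] for i in morton_order(pts)]
print(ordered[:4])    # the first 2 x 2 cell is visited before the curve moves on
```

Points that are close in space tend to receive nearby keys, so matrix rows and columns indexed in this order place large elements in the same quadtree blocks; a random permutation scatters them.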
Bifurcations for Ill-Conditioned LCAOs
• (3,3) nanotube metric, U.C. × 36 @ $\kappa(s) = 10^{10}$
• Sensitivity due to full inverse in $\delta z_{k-1}$:
• Use $\tau_s \ll \tau_0$
• Calculations dense through U.C. × 128!
• Most approximate $\tau_0$ controlled by condition $\kappa(s)$
• Control ill-conditioning incrementally with level shifts:
• Product representation: a sandwich of thin, generic SpAMM products that are highly lensed:
• Most approximate $\tau_0\ \mu_0;\ s^{-1/2}$ sets cost
• How low in $\tau_0$ can we go?
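The effect of a level shift on conditioning can be seen with a few lines of NumPy. The sketch assumes the shift has the Tikhonov form s + μI (the slide's own equation did not survive extraction) and builds a synthetic symmetric positive definite matrix with a prescribed spectrum; `condition_number` is a hypothetical helper.

```python
import numpy as np

def condition_number(s):
    """Spectral condition number of a symmetric positive definite matrix."""
    w = np.linalg.eigvalsh(s)        # eigenvalues in ascending order
    return w[-1] / w[0]

# Synthetic SPD matrix with eigenvalues spread from 1e-10 to 1.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
w = np.logspace(-10, 0, 8)
s = (q * w) @ q.T

print("kappa(s)          = %.3e" % condition_number(s))
for mu in (1e-1, 1e-2, 1e-4):
    shifted = s + mu * np.eye(8)     # assumed form of the level shift
    print("kappa(s + %g I) = %.3e" % (mu, condition_number(shifted)))
```

For this spectrum the shifted condition number is (1 + μ)/(λ_min + μ), so each tenfold reduction in μ costs roughly a factor of ten in conditioning, which is what makes an incremental schedule of shifts possible.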
Precision Scoping and Iterative Regularization
Most Approximate but Effective by 10
• Thin generic slice: improve $\kappa$ by 10, with 1 digit precision
• (3,3) U.C. × 36 → × 128, w/ $\kappa(s) = 10^{10}$ metric
• Looking at % volume for the SpAMM products $y_k$ and $z_k$:
dual: spectral resolution tends to quadtree copy in place
single: spectral resolution becomes increasingly broad
[Figure panels: $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$; and $\mu_0 = .1$, $\tau_0 = 10^{-2}$, $\tau_{s_0} = 10^{-4}$]
[Figure: × 8, $k = 0$, $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$]
[Figure: × 8, $k = 14$, $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$]
[Figure: × 8, $k = 37$, $\mu_0 = .1$, $\tau_0 = .1$, $\tau_{s_0} = .001$]
Compression for Thin Slice: $.1\ .1;\ s^{-1/2}$
• Nanotubes, $\kappa(s) = 10^{10}$ & × 36 → × 128
• Volume of terminal product relative to $n^3$
$\tau_{s_0} \sim .01 \times \tau_0$
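Ill-Conditioning: $\kappa(s) = 10^{11}$, (3,3)x8 nanotube
$\mathrm{sqi}(s,\ \tau_0 = 10^{-3})$
$\mathrm{sqi}(z_{\tau_0}^T \otimes_{\tau_1} s \otimes_{\tau_1} z_{\tau_0},\ \tau_1 = 10^{-7})$
$\mathrm{sqi}(z_{\tau_1}^T \otimes_{\tau_2} z_{\tau_0}^T \otimes_{\tau_1} s \otimes_{\tau_1} z_{\tau_0} \otimes_{\tau_2} z_{\tau_1},\ \tau_2 = 10^{-11})$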
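The nested scheme above can be mimicked with dense arithmetic: a loosely converged inverse factor preconditions the next, tighter solve, and the final inverse factor is the product of the stages. In this sketch the tolerances 10⁻³, 10⁻⁷ and 10⁻¹¹ act only as convergence thresholds, since exact products replace the SpAMM product; the single channel form, the starting scale and the name `sqi_single` are assumptions for illustration.

```python
import numpy as np

def sqi_single(s, tau, max_iter=200):
    """Single channel Newton-Schulz sketch: returns z with z.T @ s @ z close to I."""
    n = s.shape[0]
    eye = np.eye(n)
    # Assumed scaling: fold 1/sqrt(s0) into z0 so that x0 has spectrum in (0, 1].
    z = eye / np.sqrt(np.linalg.norm(s, 2))
    x = z.T @ s @ z
    for _ in range(max_iter):
        if abs(np.trace(x) - n) / n <= tau:
            break
        z = z @ (0.5 * (3.0 * eye - x))    # z_k <- z_{k-1} h[x_{k-1}]
        x = z.T @ (s @ z)                  # x_k <- z_k^T (s z_k)
    return z

# Synthetic SPD matrix with condition number 1e3.
rng = np.random.default_rng(3)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
s = (q * np.logspace(-3, 0, 8)) @ q.T
n = s.shape[0]

z0 = sqi_single(s, tau=1e-3)               # most approximate stage
s1 = z0.T @ s @ z0                         # preconditioned: close to the identity
z1 = sqi_single(s1, tau=1e-7)
s2 = z1.T @ s1 @ z1
z2 = sqi_single(s2, tau=1e-11)

z = z0 @ z1 @ z2                           # product representation of the inverse factor
print("residual ||z^T s z - I|| =", np.linalg.norm(z.T @ s @ z - np.eye(n)))
```

Only the first stage sees the ill-conditioned matrix; the later stages start from operands that are already near the identity and converge in one or two steps.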
An Optimized SpAMM Kernel (I)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72
• Assembly coded SpAMM in single precision:
• 50% of peak with 4x4 blocking
• Crossover w/ SGEMM at n = 2000, same error
[Figure label: 6-31G**, Matrix Sign Function]
• In single, SpAMM can beat MKL SGEMM in error also
• Recursion w/locality more accurate than row-col:
An Optimized SpAMM Kernel (II)
Bock & Challacombe, SIAM J. Sci. Comput., 35(1), C72
• N-body potentially communication optimal:
• N-body model supported by common runtimes (sort of):
o OpenMP: attention to memory (blocks & chunks)
o Charm++: does not support recursion, roll your own
• Quantum locality → metric locality
• Temporal locality → persistence load balancing
• Decomposition in 3-D task space, not row-col data space
SpAMM in the Strong Scaling Limit (I)
Bock, Challacombe & Kale arXiv: 1403.7458
SpAMM in the Strong Scaling Limit (II)
• OpenMP: Memory access impacts recursive task parallelism
• Contiguous chunks of memory ($N_c$) and blocksize ($N_b$) vs wall time, 48-core Opteron, 6-31G** spectral projection, (H2O)90:
[Figure panels: 1 thread (L1 cache exceeded; complexity reduction vs prefetch); 48 threads (NUMA effects; ~80% parallel efficiency)]
SpAMM in the Strong Scaling Limit (III)
[Figure panels: first step and final step, each with $\otimes_{\tau = 10^{-10}}$]
• Charm++ is a modern runtime with persistence based load
balancing, but does not currently support recursion
• Build unrolled octree mesh of chares
• Dynamic load balancing for 𝑝~30 × 𝑁 (first iteration)
• GreedyCommLB persistence load balancing for 𝑝~500 × 𝑁
A N-Body Solver for Fock Exchange (I)
Challacombe & Bock, J. Chem. Phys. 140 (2014) p. 111101
Recursive occlusion-cull with triple metric query on the Cauchy-Schwarz (direct SCF) criteria:
Quadtree of shell pairs:
Fock exchange as hextree recursion:
4 of these
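The Cauchy-Schwarz screening behind this slide can be illustrated without the hextree recursion. The sketch below builds synthetic two-electron integrals from random symmetric factors, which guarantees the Schwarz inequality |(ac|bd)| ≤ Q_ac Q_bd with Q_ij = (ij|ij)^(1/2), and skips every exchange term whose bound Q_ac Q_bd |P_cd| is below τ. All data are synthetic and the flat, non-recursive loop structure is a simplification; `exchange` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)
n, naux = 8, 12
idx = np.arange(n)
decay = np.exp(-1.5 * np.abs(idx[:, None] - idx[None, :]))    # assumed pair decay

# Synthetic integrals (ij|kl) = sum_m L[m,i,j] L[m,k,l], positive semidefinite by construction.
L = rng.standard_normal((naux, n, n))
L = 0.5 * (L + L.transpose(0, 2, 1)) * decay
eri = np.einsum('mij,mkl->ijkl', L, L)
Q = np.sqrt(np.einsum('mij,mij->ij', L, L))                   # Schwarz factors sqrt((ij|ij))

P = decay * rng.standard_normal((n, n))
P = 0.5 * (P + P.T)                                           # synthetic density matrix

def exchange(eri, P, Q, tau):
    """K_ab = sum_cd (ac|bd) P_cd, skipping terms with Q_ac * Q_bd * |P_cd| < tau."""
    bound = np.einsum('ac,bd,cd->acbd', Q, Q, np.abs(P))
    keep = bound >= tau
    K = np.einsum('acbd,cd->ab', np.where(keep, eri, 0.0), P)
    return K, keep.mean()

K_exact, _ = exchange(eri, P, Q, tau=0.0)
tau = 1e-6
K_scr, kept = exchange(eri, P, Q, tau)
print("fraction of terms kept:", kept)
print("max element error:     ", np.max(np.abs(K_exact - K_scr)))
```

Each skipped term is smaller than τ in magnitude, so the error in any element of K is below n²τ; the recursive version applies the same test to whole blocks of shell pairs before descending.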
• With permutational symmetry, expect a 4x speedup. Get less
with occlusion & culling:
• Data problem w/ 4 × sub-blocks of exchange and density
tracked, resulting in 7 unique contraction blocks:
A N-Body Solver for Fock Exchange (II)
• Generalized N-body methods resolve new, low
complexity structures that row-col cannot
• 𝒔−1/2 as deferred product of generic solves:
o Apply to target without forming bad inverse
o Compression by orders of magnitude (lensing)
• Generic programming:
o Generic matrix algebra, Fock exchange,
Hartree & exchange-correlation, all N-body
o Easy access to higher dimensional problems
(tensor multiplication, derivatives, … )
o Greatly reduce code base, lower barriers to
entry, minimize bugs, hidden approximations
Fast, Generic and Data Local
• Generic N-body programming model allows focus
on new math (row-col free):
o Rigorous bounds
o Mixed metrics on same footing
o Precision and regularization scoping
o Algebraic locality + other fast methods
(sketching, probing, joining, etc.)
• N-Body problem is communication optimal:
o Strong parallel scaling for fast matrix
multiplication in electronic structure
o Works for runtimes you know
o Kernel independent skeletonization
New Math, One Programming Model
