This talk is an updated version based on some of our recent results on tall-and-skinny QR factorizations in MapReduce. In particular, the "fast" sample-based iterative refinement approximation is new.
Asynchronous parallel algorithms are developed to solve massive optimization problems on distributed data systems; they can run in parallel on multiple nodes with little or no synchronization. Recently they have been successfully applied to a range of difficult practical problems. However, existing theory mostly relies on fairly restrictive assumptions on the delays and cannot explain the convergence and speedup properties of such algorithms. In this talk we give an overview of distributed optimization and discuss new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Multi-scalar multiplication: state of the art and new ideas - Gus Gutoski
A 90-minute online presentation for zkStudyClub, delivered 2020-06-01. I present a new idea with a demonstrated 5% speed-up for multi-scalar multiplication. When combined with precomputation, this method could yield upwards of 20% speed-up.
Summary of the article: "Band selection for dimension reduction in hyperspectral image using integrated information gain and principal component analysis technique"
Effectively alleviates the curse of dimensionality in optimal reservoir operation.
It has better optimization capabilities than the DP aggregated water demand approach and can solve more complex problems for which that approach is not feasible.
Computationally, it is very efficient and runs fast on standard personal computers. The presented case study optimization was executed in less than five minutes.
The algorithm allows for a dense and variable discretization of the reservoir volume and release.
It supports using a variable weight at each time step for every objective function.
Ml srhwt-machine-learning-based-superlative-rapid-haar-wavelet-transformation... - Jumlesha Shaik
Abstract: This paper introduces a digital image coding technique called ML-SRHWT (Machine Learning based image coding by Superlative Rapid HAAR Wavelet Transform). Digital images are compressed using the Superlative Rapid HAAR Wavelet Transform (SRHWT) model, and Least Squares Support Vector Machine regression predicts the hyper coefficients obtained using the QPSO model. The mathematical models discussed briefly in this paper are SRHWT, which gives good performance with lower complexity than FHAAR, and EQPSO, which replaces the worst particle in QPSO with the newly obtained best particle. Compared with the JPEG and JPEG2000 standards, ML-SRHWT is found to perform better.
We develop a multi-scale streaming anomaly score that takes into account a family of window sizes, making the algorithm scale-invariant across different types of time series with varying pseudo-periodic structure. We explore different methods of aggregating the multi-scale anomaly scores into a final anomaly score. We evaluate the performance on the Yahoo! and Numenta Anomaly Benchmark (NAB) datasets.
Big data matrix factorizations and Overlapping community detection in graphs - David Gleich
In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.
Anti-differentiating Approximation Algorithms: PageRank and MinCut - David Gleich
We study how Google's PageRank method relates to mincut and a particular type of electrical flow in a network. We also explain the details of how the "push method" for computing PageRank helps to accelerate it. This has implications for semi-supervised learning and machine learning, as well as social network analysis.
Gaps between the theory and practice of large-scale matrix-based network comp... - David Gleich
I discuss some runtimes for the personalized PageRank vector and how they relate to open questions in how we should tackle these network-based measures via matrix computations.
A history of PageRank from the numerical computing perspective - David Gleich
We'll survey some of the underlying ideas from Google's PageRank algorithm along the lines of Massimo Franceschet's CACM history.
There are some slight liberties I've taken to make it more accessible.
MapReduce Tall-and-skinny QR and applications - David Gleich
A talk at the Simons Institute workshop on Parallel and Distributed Algorithms for Inference and Optimization on how to do tall-and-skinny QR factorizations on MapReduce using a communication-avoiding algorithm.
Anti-differentiating approximation algorithms: A case study with min-cuts, sp... - David Gleich
This talk covers the idea of anti-differentiating approximation algorithms, which is an idea to explain the success of widely used heuristic procedures. Formally, this involves finding an optimization problem solved exactly by an approximation algorithm or heuristic.
Spacey random walks and higher-order data analysis - David Gleich
My talk at TMA 2016 (The workshop on Tensors, Matrices, and their Applications) on the relationship between a spacey random walk process and tensor eigenvectors
How does Google Google: A journey into the wondrous mathematics behind your f... - David Gleich
A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/)
Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)
Relaxation methods for the matrix exponential on large networks - David Gleich
My talk from the Stanford ICME seminar series on doing network analysis and link prediction using a fast algorithm for the matrix exponential on graph problems.
Fast relaxation methods for the matrix exponential - David Gleich
The matrix exponential is a matrix computation primitive used in link prediction and community detection. We describe a fast method to compute it using relaxation on a large linear system of equations. This enables us to compute a column of the matrix exponential in sublinear time, or under a second on a standard desktop computer.
Vertex neighborhoods, low conductance cuts, and good seeds for local communit... - David Gleich
My talk from KDD2012 about vertex neighborhoods and low conductance cuts. See the paper here: http://arxiv.org/abs/1112.0031 and http://dl.acm.org/citation.cfm?id=2339628
Higher-order organization of complex networks - David Gleich
A talk I gave at the Park City Mathematics Institute about our recent work on using motifs to analyze and cluster networks. This involves a higher-order Cheeger inequality in terms of motifs.
PageRank Centrality of dynamic graph structures - David Gleich
A talk I gave at the SIAM Annual Meeting Mini-symposium on the mathematics of the power grid organized by Mahantesh Halappanavar. I discuss a few ideas on how our dynamic centrality could help analyze such situations.
Scalable and Adaptive Graph Querying with MapReduce - Kyong-Ha Lee
We address the problem of processing multiple graph queries over a massive set of data graphs in this letter. As the number of data graphs grows rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm that employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
Spatially resolved pair correlation functions for point cloud data - Tony Fast
Presentation on computing spatial correlation functions for point cloud materials science information. This presentation uses tree algorithms and Fourier methods to compute the statistics. The analysis is performed on Al-Cu interface information provided by John Gibbs and Peter Voorhees at Northwestern University as funded by the Mosaic of Microstructure MURI program.
Architecture Aware Algorithms and Software for Peta and Exascale - inside-BigData.com
Jack Dongarra from the University of Tennessee presented these slides at Ken Kennedy Institute of Information Technology on Feb 13, 2014.
Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/
Christoph Koch is a professor of Computer Science at EPFL, specializing in data management. Until 2010, he was an Associate Professor in the Department of Computer Science at Cornell University. Before that, from 2005 to 2007, he was an Associate Professor of Computer Science at Saarland University. Earlier, he obtained his PhD in Artificial Intelligence from TU Vienna and CERN (2001), was a postdoctoral researcher at TU Vienna and the University of Edinburgh (2001-2003), and an assistant professor at TU Vienna (2003-2005). He has won Best Paper Awards at PODS 2002, ICALP 2005, and SIGMOD 2011, an Outrageous Ideas and Vision Paper Award at CIDR 2013, a Google Research Award (in 2009), and an ERC Grant (in 2011). He is a PI of the FET Flagship Human Brain Project and of NCCR MARVEL, a new Swiss national research center for materials research. He (co-)chaired the program committees of DBPL 2005, WebDB 2008, ICDE 2011, VLDB 2013, and was PC vice-chair of ICDE 2008 and ICDE 2009. He has served on the editorial board of ACM Transactions on Internet Technology and as Editor-in-Chief of PVLDB.
Fast matrix primitives for ranking, link-prediction and more - David Gleich
I gave this talk at Netflix about some of the recent work I've been doing on fast matrix primitives for link prediction and also some non-standard uses of the nuclear norm for ranking.
How to Prepare Weather and Climate Models for Future HPC Hardware - inside-BigData.com
In this video from GTC 2018, Peter Dueben from ECMWF presents: How to Prepare Weather and Climate Models for Future HPC Hardware.
"Learn how one of the leading institutes for global weather predictions, the European Centre for Medium-Range Weather Forecasts (ECMWF), is preparing for exascale supercomputing and the efficient use of future HPC computing hardware. I will name the main reasons why it is difficult to design efficient weather and climate models and provide an overview on the ongoing community effort to achieve the best possible model performance on existing and future HPC architectures. I will present the EU H2020 projects ESCAPE and ESiWACE and discuss recent approaches to increase computing performance in weather and climate modeling such as the use of reduced numerical precision and deep learning."
Watch the video: https://wp.me/p3RLHQ-ixu
Learn more: https://www.ecmwf.int/
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg... - LDBC council
Lijun Chang, DECRA Fellow at the University of New South Wales talked about how to make subgraph matching more efficient thanks to postponing Cartesian products.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/new-methods-for-implementation-of-2-d-convolution-for-convolutional-neural-networks-a-presentation-from-santa-clara-university/
Tokunbo Ogunfunmi, Professor of Electrical Engineering and Director of the Signal Processing Research Laboratory at Santa Clara University, presents the “New Methods for Implementation of 2-D Convolution for Convolutional Neural Networks” tutorial at the September 2020 Embedded Vision Summit.
The increasing usage of convolutional neural networks (CNNs) in various applications on mobile and embedded devices and in data centers has led researchers to explore application specific hardware accelerators for CNNs. CNNs typically consist of a number of convolution, activation and pooling layers, with convolution layers being the most computationally demanding. Though popular for accelerating CNN training and inference, GPUs are not ideal for embedded applications because they are not energy efficient.
ASIC and FPGA accelerators have the potential to run CNNs in a highly efficient manner. Ogunfunmi presents two new methods for 2-D convolution that offer significant reduction in power consumption and computational complexity. The first method computes convolution results using row-wise inputs, as opposed to traditional tile-based processing, yielding considerably reduced latency. The second method, single partial product 2-D (SPP2D) convolution, avoids recalculation of partial weights and reduces input reuse. Hardware implementation results are presented.
Correlation clustering and community detection in graphs and networks - David Gleich
We show a new relationship between various community detection objectives and a correlation clustering framework. This enables us to detect communities with good bounds on the solution.
Spectral clustering with motifs and higher-order structures - David Gleich
I presented these slides at the #strathna meeting in Glasgow in June 2017. They are an updated and enhanced version of the earlier talks on the subject.
A copy of my slides from the SILO Seminar at UW Madison on our recent developments for the NEO-K-Means methods including new optimization routines and results.
Using Local Spectral Methods to Robustify Graph-Based Learning - David Gleich
This is my KDD2015 talk on robustness in semi-supervised learning. The paper is already on Michael Mahoney's website: http://www.stat.berkeley.edu/~mmahoney/pubs/robustifying-kdd15.pdf See the KDD paper for all the details, which this talk is a bit light on.
Spacey random walks and higher order Markov chains - David Gleich
My talk at the SIAM NetSci workshop (2015) on our new spacey random walk and spacey random surfer models and how we derived them. There are many potential extensions and opportunities to use this for analyzing big data as tensors.
Localized methods in graph mining exploit the local structures in a graph instead of attempting to find global structures. They are widely successful at all sorts of problems, including community detection, label propagation, and a few others.
Localized methods for diffusions in large graphs - David Gleich
I describe a few ongoing research projects on diffusions in large graphs and how we can create efficient matrix computations to determine them.
Recommendation and graph algorithms in Hadoop and SQL - David Gleich
A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
1. Tall-and-skinny QRs and SVDs in MapReduce
David F. Gleich, Computer Science, Purdue University
With Yangyang Hou (Purdue, CS); Paul G. Constantine, Austin Benson, Joe Nichols (Stanford University); James Demmel (UC Berkeley); Joe Ruthruff, Jeremy Templeton (Sandia CA).
Mainly funded by the Sandia National Labs CSAR project, recently by NSF CAREER and the Purdue Research Foundation.
ICML2013
2. Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000). [Figure: a tall-and-skinny matrix A from the tinyimages collection.]
Used in: regression and general linear models with many samples, block iterative methods, panel factorizations, approximate kernel k-means, big-data SVD/PCA.
3. A matrix A : m × n, m ≥ n, is tall and skinny when O(n²) work and storage is "cheap" compared to m. -- Austin Benson
4. Quick review of QR. Let A : m × n, m ≥ n, real. Then A = QR, where Q is m × n orthogonal (QᵀQ = I) and R is n × n upper triangular. QR is block normalization: "normalizing" a vector usually generalizes to computing R in the QR factorization.
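A minimal numpy illustration (mine, not from the talk) of these properties:

    import numpy as np

    A = np.random.randn(1000, 10)          # tall and skinny: m >> n
    Q, R = np.linalg.qr(A)                 # thin QR: Q is 1000 x 10, R is 10 x 10

    print(np.allclose(A, Q @ R))           # A = QR
    print(np.allclose(Q.T @ Q, np.eye(10)))  # Q'Q = I
    print(np.allclose(R, np.triu(R)))      # R is upper triangular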
5. Tall-and-skinny SVD and RSVD. Let A : m × n, m ≥ n, real. Then A = U𝞢Vᵀ, where U is m × n orthogonal, 𝞢 is n × n nonnegative and diagonal, and V is n × n orthogonal. [Figure: pipeline A → Q, R via TSQR, then the SVD of the small R gives 𝞢 and V.]
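A minimal numpy sketch (not from the talk) of this QR-then-SVD pipeline:

    import numpy as np

    A = np.random.randn(1000, 10)
    Q, R = np.linalg.qr(A)              # the TSQR step (here, one in-memory QR)
    Ur, s, Vt = np.linalg.svd(R)        # small n x n SVD of R
    U = Q @ Ur                          # left singular vectors of A

    print(np.allclose(A, U @ np.diag(s) @ Vt))   # A = U * Sigma * V'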
6. There are good MPI implementations. What's left to do?
7. Moving data to an MPI cluster may be infeasible or costly.
8. Data computers I've used:
Nebula Cluster @ Sandia CA: 2TB/core storage, 64 nodes, 256 cores, GB ethernet. Cost $150k.
Student Cluster @ Stanford: 3TB/core storage, 11 nodes, 44 cores, GB ethernet. Cost $30k.
Magellan Cluster @ NERSC: 128GB/core storage, 80 nodes, 640 cores, infiniband.
These systems are good for working with enormous matrix data!
9. By 2013(?) all Fortune 500 companies will have a data computer.
10. How do you program them?
11. A graphical view of the MapReduce programming model. [Figure: input data flows into Map tasks, each emitting (key, value) pairs; a Shuffle groups the pairs by key, a global communication like a group-by or an MPI All-to-all; Reduce tasks then process each key with its list of values.] Map tasks read batches of data in parallel and do some initial filtering. Reduce is often where the computation happens.
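A toy in-memory simulation (plain Python, not Hadoop) of the map-shuffle-reduce flow, counting words:

    from collections import defaultdict

    records = ["a b", "b c", "a a"]

    # Map: read records and emit (key, value) pairs
    pairs = [(word, 1) for record in records for word in record.split()]

    # Shuffle: a global group-by on the key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: the computation often happens here
    counts = {key: sum(values) for key, values in groups.items()}
    print(counts)   # {'a': 3, 'b': 2, 'c': 1}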
12. MapReduce looks like the first step of any MPI program that loads data. Read data -> assign to process. Receive data -> do compute.
13. Still, isn't this easy to do? Current MapReduce algorithms use the normal equations: from A = QR, form AᵀA = RᵀR, factor it with Cholesky, then Q = AR⁻¹.
Map: Aᵢ to AᵢᵀAᵢ. Reduce: RᵀR = Sum(AᵢᵀAᵢ). Map 2: Aᵢ to AᵢR⁻¹.
Two problems: R is inaccurate if A is ill-conditioned, and Q is not numerically orthogonal (Householder QR assures this).
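A minimal numpy sketch (mine, not from the talk) of this Cholesky approach and its failure mode on an ill-conditioned matrix:

    import numpy as np

    n = 10
    A = np.random.randn(100000, n) @ np.diag(np.logspace(0, -7, n))  # cond ~ 1e7

    G = A.T @ A                       # Gram matrix: squares the condition number
    R = np.linalg.cholesky(G).T       # A'A = R'R with R upper triangular
    Q = A @ np.linalg.inv(R)          # Q = A R^{-1}

    # Q is far from orthogonal: the error grows like cond(A)^2 * machine epsilon
    print(np.linalg.norm(Q.T @ Q - np.eye(n)))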
14. Numerical stability was a problem for prior approaches. [Figure: norm(QᵀQ − I) versus condition number (10⁵ to 10²⁰) for Q = AR⁻¹; previous methods couldn't ensure that the matrix Q was orthogonal.]
15. Four things that are better (Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013):
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A "fast" algorithm to get Q numerically orthogonal in most cases.
3. A multi-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
16. Numerical stability was a problem for prior approaches. [Figure: norm(QᵀQ − I) versus condition number for AR⁻¹ (prior work; Constantine & Gleich, MapReduce 2011), AR⁻¹ with iterative refinement, and Direct TSQR (Benson, Gleich, Demmel, submitted); the refined and direct methods keep Q much closer to orthogonal.]
17. How to store tall-and-skinny matrices in Hadoop. A : m × n, m ≫ n. The key is an arbitrary row-id; the value is the 1 × n array for a row (or a b × n block). Each submatrix Aᵢ is the input to a map task.
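A minimal sketch of this record layout in plain Python (illustrative only; the random-key convention mirrors the hadoopy code on slide 25):

    import random
    import numpy as np

    A = np.random.randn(8, 5)            # a tiny stand-in for the big matrix
    # one record per row: an arbitrary row-id key, the 1 x n row as the value
    records = [(random.randint(0, 2000000000), row.tolist()) for row in A]
    print(records[0])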
18. MapReduce is great for TSQR! You don't need AᵀA.
Data: a tall-and-skinny (TS) matrix stored by rows. Input: a 500,000,000-by-50 matrix; each record is a 1-by-50 row; HDFS size 183.6 GB.
Time to read A: 253 sec.; to write A: 848 sec.
Time to compute R in qr(A): 526 sec.; with Q = AR⁻¹: 1618 sec.
Time to compute Q in qr(A) (numerically stable): 3090 sec.
19. Communication-avoiding QR (Demmel et al. 2008). Communication-avoiding TSQR: first, do QR factorizations of each local matrix; second, compute a QR factorization of the new "R"s. (Demmel et al. 2008. Communication-avoiding parallel and sequential QR.)
20. Serial QR factorizations (Demmel et al. 2008). Fully serial TSQR: compute the QR of the first block, read the next block and update the QR, and so on.
21. Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011).
[Figure: Mapper 1 runs serial TSQR on blocks A1-A4 and emits R4; Mapper 2 runs serial TSQR on blocks A5-A8 and emits R8; Reducer 1 runs serial TSQR on R4 and R8 and emits the final Q and R.]
Algorithm: Data = rows of a matrix; Map = QR factorization of rows; Reduce = QR factorization of rows.
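A minimal in-memory numpy sketch (not the Hadoop implementation) of this map/reduce TSQR tree:

    import numpy as np

    A = np.random.randn(4000, 10)
    blocks = np.split(A, 4)                              # map inputs A1..A4

    Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]   # map: local QR, keep only R
    R = np.linalg.qr(np.vstack(Rs), mode='r')            # reduce: QR of the stacked R's

    # the result matches the R from a direct QR of A, up to row signs
    print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A, mode='r'))))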
22. Too many maps cause too much data to go to one reducer! From the tinyimages collection: 80,000,000 images of 1000 pixels each. Each image is 5k; each HDFS block holds 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix, so one reducer gets a 6.25M-by-1000 matrix (50GB).
23. [Figure: an iterated two-stage TSQR. Iteration 1: Mappers 1-1 through 1-4 run serial TSQR on blocks of A and emit R1-R4; a shuffle distributes these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1, R2,2, and R2,3. Iteration 2: an identity map passes these to a single Reducer 2-1, whose serial TSQR emits the final R.]
25. Full TSQR code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        # QR of the locally buffered rows; keep only the R factor
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        # buffer a row; compress when the buffer exceeds blocksize*n rows
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        # emit the final rows of R under random keys
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values: self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
26. Numerical stability was a problem for prior approaches. [Figure: the same plot as slide 16: norm(QᵀQ − I) versus condition number for AR⁻¹, AR⁻¹ with iterative refinement, and Direct TSQR.]
27. Iterative refinement helps.
[Figure: pass 1 computes R via TSQR, distributes R⁻¹, and each mapper forms Qᵢ = AᵢR⁻¹; pass 2 computes T via TSQR of Q, distributes T⁻¹, and each mapper forms Qᵢ ← QᵢT⁻¹.]
Iterative refinement is like using Newton's method to solve Ax = b. There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
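A minimal numpy sketch (mine, not from the talk) of one refinement pass:

    import numpy as np

    n = 10
    A = np.random.randn(100000, n) @ np.diag(np.logspace(0, -7, n))  # ill-conditioned

    R = np.linalg.qr(A, mode='r')     # pass 1: get R via TSQR
    Q = A @ np.linalg.inv(R)          # Q = A R^{-1}: loses orthogonality, ~cond(A)*eps

    T = np.linalg.qr(Q, mode='r')     # pass 2: TSQR of Q
    Q = Q @ np.linalg.inv(T)          # refine: Q <- Q T^{-1}; the final R is T @ R

    print(np.linalg.norm(Q.T @ Q - np.eye(n)))   # now near machine precision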
28. Numerical stability was a problem for prior approaches. [Figure: the plot from slide 16, repeated: AR⁻¹ with iterative refinement and Direct TSQR keep Q orthogonal where plain AR⁻¹ does not.]
29. What if iterative refinement is too slow?
[Figure: pass 1 samples rows S of A, computes the QR of the sample, and distributes R⁻¹ so each mapper forms Qᵢ = AᵢR⁻¹; pass 2 computes T via TSQR, distributes T⁻¹, and each mapper forms Qᵢ ← QᵢT⁻¹; the "norm" is estimated from the sample S.]
Based on recent work by Ipsen et al. on approximating QR with a random subset of rows. This also assumes that you can get a subset of rows "cheaply" – possible, but nontrivial in Hadoop.
30. Orthogonality of the sampled solution. [Figure: norm(QᵀQ − I) versus condition number for AR⁻¹ (prior work; Constantine & Gleich, MapReduce 2011), the sampled method, and Direct TSQR (Benson, Gleich, Demmel, submitted).]
31. Four things that are better (Constantine & Gleich, MapReduce 2011; Benson, Gleich & Demmel, arXiv 2013):
1. A simple algorithm to compute R accurately (but it doesn't help get Q orthogonal).
2. A fast algorithm to get Q numerically orthogonal in most cases.
3. A two-pass algorithm to get Q numerically orthogonal in virtually all cases.
4. A direct algorithm for a numerically orthogonal Q in all cases.
32. Recreate Q by storing the history of the factorization.
[Figure: Mapper 1 factors each block Aᵢ = QᵢRᵢ; Task 2 collects R1-R4, computes the QR of the stacked R's to get the inner factors Q11-Q41 and the final R; Mapper 3 multiplies each local Qᵢ by its piece Qᵢ1 to form the true Q.]
1. Output the local Q and R in separate files.
2. Collect R on one node and compute the Qs for each piece.
3. Distribute the pieces of Q*1 and form the true Q.
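A minimal in-memory numpy sketch (not the Hadoop implementation) of these three steps:

    import numpy as np

    A = np.random.randn(4000, 10)
    blocks = np.split(A, 4)

    # step 1: factor each block, keeping the local Q's and R's
    Qs, Rs = zip(*[np.linalg.qr(Ai) for Ai in blocks])

    # step 2: collect the R's, factor the stack, split the inner Q into pieces
    Q2, R = np.linalg.qr(np.vstack(Rs))
    Q2s = np.split(Q2, 4)

    # step 3: multiply each local Q by its piece of the inner Q to form the true Q
    Q = np.vstack([Qi @ Q2i for Qi, Q2i in zip(Qs, Q2s)])

    print(np.allclose(A, Q @ R))                  # exact factorization
    print(np.linalg.norm(Q.T @ Q - np.eye(10)))   # orthogonal to machine precision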
33. Theoretical lower bound on runtime for a few cases on our small cluster (10-node cluster, 60 disks). All values are in seconds. Only two parameters are needed – the read and write bandwidth of the cluster – to derive a performance model of the algorithm. This simple model is almost within a factor of two of the true runtime.

Model:
Rows  Cols  Old   R-only + no IR  R-only + PIR  R-only + IR  Direct TSQR
4.0B  4     1803  1821            1821          2343         2525
2.5B  10    1645  1655            1655          2062         2464
0.6B  25    804   812             812           1000         1237
0.5B  50    1240  1250            1250          1517         2103

Actual:
Rows  Cols  Old   R-only + no IR  R-only + PIR  R-only + IR  Direct TSQR
4.0B  4     2931  3460            3620          4741         6128
2.5B  10    2508  2509            3354          4034         4035
0.6B  25    1098  1104            1476          2006         1910
0.5B  50    921   1618            1960          2655         3090
35. Nonlinear heat transfer model in random media. Each run takes 5 hours on 8 processors; we did 8192 runs: 4.2M nodes, 9 time-steps, 128 realizations, 64 size parameters. SVD of a 5B x 64 data matrix. 50x reduction in wall-clock time including all pre- and post-processing; "∞x" speedup for a particular study without pre-processing (< 60 sec on a laptop). Interpolating adaptive low-rank approximations. Constantine, Gleich, Hou & Templeton, arXiv 2013.
[Figure 4.5 from the paper: error in the reduced-order model compared to the prediction standard deviation for one realization of the bubble locations at the final time, for two values of the bubble radius, s = 0.39 cm and s = 1.95 cm.]
[Truncated excerpt from the paper: working with the simulation data involved interpreting 4TB of Exodus II files from Aria, computing the TSSVD, and computing predictions and errors, taking approximately 8-15 hours; the times are from a multi-tenant cluster running jobs with sizes ranging between 100GB and 2TB.]
37. Is double-precision sufficient? Double-precision floating point was designed for the era where "big" meant 1000-10000.

s = 0; for i = 1 to n: s = s + x[i]

This simple summation formula has error that is not always small if n is a billion. Each addition satisfies fl(x + y) = (x + y)(1 + ε) with |ε| ≤ µ, so |fl(Σᵢ xᵢ) − Σᵢ xᵢ| ≤ n µ Σᵢ |xᵢ|, where µ ≈ 10⁻¹⁶. Watch out for this issue in important computations.
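A small Python experiment (mine, not from the talk) showing the effect:

    import math

    n = 10**7
    xs = [0.1] * n                 # 0.1 is not exactly representable in binary

    s = 0.0
    for v in xs:                   # the naive recurrence from the slide
        s += v

    exact = math.fsum(xs)          # compensated summation: correctly rounded

    print(s, exact, abs(s - exact))   # the naive sum is off by roughly 1e-4 here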