Research Statement
My research focuses on the fundamental tradeoffs among cache-obliviousness,
cache-optimality, and parallelism of algorithms and data structures on modern
multi-core and many-core 1 architectures with hierarchical caches. My approach
combines both theory and experiments. I have been working mainly on stencil
computation, general dynamic programming computation, and numerical algorithms.
Since 2009, I have been working with Prof. Charles E. Leiserson at MIT on stencil
computation. The project ``The Pochoir Stencil Compiler’’ [6, 7] was funded by both
NSF (USD 983,017 in total) and Intel Corp. (RMB 904,627.72). In this project, we
achieved the following results: 1) asymptotically improved the parallelism of a
cache-oblivious parallel algorithm for multi-dimensional simple stencil computation2,
while keeping the same cache-efficiency, by inventing the ``hyperspace cut’’; 2)
handled periodic and aperiodic boundary conditions in one unified algorithm; 3)
designed a domain-specific language (DSL) embedded in C++ for stencil computation; 4)
designed and developed a novel two-phase compilation strategy, in which the first
phase calls any common C++ tool chain to verify correctness, and only afterwards
invokes the Pochoir compiler to perform a source-to-source transformation into highly
optimized code. The two-phase compilation strategy saves the massive cost of parsing
and type-checking C++. In this project, I contributed to the algorithms, code
generation, benchmarking, and core compiler software of the Pochoir system.
After the ``Pochoir’’ project, I continued working on a joint project on general
dynamic programming problems. Note that stencil computation can be viewed as a
special case of dynamic programming with constant but non-orthogonal dependencies.
In the research on general dynamic programming problems, my focus lies on the
fundamental tradeoff between time and cache complexity.
Modern multicore systems with hierarchical caches demand both parallelism and
cache-locality from software to yield good performance. In the analysis of parallel
computations, theory usually considers two metrics: time complexity and cache
complexity. The traditional objective in scheduling a parallel computation is to
minimize the time complexity; i.e., if we represent the parallel computation as a DAG
(Directed Acyclic Graph), the time complexity is the length of the critical path in
the DAG. Alternatively, one can focus on minimizing the cache complexity, i.e., the
number of cache misses incurred during the execution of the program. Theoretical
analyses often consider these metrics separately; in reality, the actual completion
time of a program depends on both, since the number of cache misses has a direct
impact on the running time, and the time complexity bound often serves as a good
indicator of scalability, load balance, and scheduling overheads.
1 For example: Intel MIC (Many-Integrated-Core) coprocessor
2 A simple stencil is a stencil computation without heterogeneity in space or time.
Tuning algorithms for time and/or cache complexity is usually not preferred.
It has several disadvantages: the code structure becomes more complicated; the
parameter space to explore is usually exponentially sized; and the tuned code is
non-portable, i.e., separate tunings are required for different hardware systems.
Moreover, since the tuning environment can never be exactly the same as the running
environment, e.g., different numbers of background daemon processes, different loads
of network traffic, etc., the long-tuned code is almost always sub-optimal. Classic
cache-oblivious algorithms largely eliminate the need to tune for optimality on
hierarchical caches. Can we further eliminate the need to tune between time and cache
complexity while remaining cache-oblivious? What is the fundamental tradeoff between
time and cache complexity? What can obliviousness buy us and cost us? These questions
lie at the center of my research.
For generic parallel computation, there is a tension between the objectives of
minimizing time and cache complexity. Take LCS (Longest Common Subsequence) as
an example: given two sequences L = ⟨l1 l2 … ln⟩ and R = ⟨r1 r2 … rn⟩, we find
their longest common subsequence by filling out a 2D table using the recurrence

X[i, j] = 0,                            if i = 0 ∨ j = 0
X[i, j] = X[i−1, j−1] + 1,              if i, j > 0 ∧ li = rj
X[i, j] = max{X[i−1, j], X[i, j−1]},    otherwise
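As a concrete illustration, the recurrence can be evaluated bottom-up in O(n²) work. The following is a minimal serial sketch (Python is used here purely for illustration; the function name is mine):

```python
def lcs_length(L, R):
    """Bottom-up fill of the LCS table X per the recurrence above.
    Strings are 0-indexed; table rows/columns are 1-indexed, with
    row 0 and column 0 holding the i = 0 / j = 0 base case."""
    n, m = len(L), len(R)
    X = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if L[i - 1] == R[j - 1]:                 # case l_i = r_j
                X[i][j] = X[i - 1][j - 1] + 1
            else:                                    # otherwise
                X[i][j] = max(X[i - 1][j], X[i][j - 1])
    return X[n][m]
```

For instance, lcs_length("ABCBDAB", "BDCABA") returns 4, matching the textbook example.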
Figure 1. 2-way to 𝑛-way Divide-And-Conquer algorithms for LCS
In the literature, there are two classes of algorithms for this problem: one is the
divide-and-conquer based cache-oblivious algorithm; the other is based on looping,
possibly with tiling. Suppose we adopt the 2-way divide-and-conquer parallel
algorithm shown on the left-hand side of Figure 1, i.e., at each recursion level we
cut each dimension into two halves. Since at each recursion level three out of the
four sub-quadrants sit on the critical path, the time recurrence T∞(n) = 3T∞(n/2)
solves to T∞(n) = O(n^(log₂3)), where n is the input problem size.

We usually count only the serial cache complexity in the ideal-cache model, a.k.a.
the cache-oblivious model, because the parallel cache complexity is determined by
fitting both the time complexity and the serial cache complexity into a formula
determined by the underlying run-time scheduler. In the ideal-cache model, there are
two levels of memory: the upper level is a fully associative cache of size M, and
the lower level is an infinitely sized main memory. On a cache miss in the upper
level, the system employs an omniscient cache-replacement policy to exchange a cache
line of size B between the two levels. The parameters M and B are correlated by the
tall-cache assumption, i.e., M = Ω(B^(1+ϵ)), where ϵ > 0 is a constant. In the
ideal-cache model, the serial cache complexity is calculated by summing the cache
misses incurred by the four individual sub-quadrants at each recursion level, i.e.,
Q(n) = 4Q(n/2). The recursive summation stops at the level where the problem size of
a sub-quadrant just fits into the cache while its parent’s does not, i.e., ∃n0 s.t.
n0 = ϵ0·M ∧ 2n0 > M, where ϵ0 ∈ (0, 1] is a constant; below this level, further
recursive divide-and-conquer causes no more cache misses than Q(n0) = O(n0/B).
Solving the recurrence, we have Q(n) = O(n²/(ϵ0·B·M)).

Doing more-way divide-and-conquer, the algorithm eventually reduces to the n-way
divide-and-conquer algorithm shown on the right-hand side of Figure 1, which is
essentially the parallel looping algorithm without tiling. For the n-way
divide-and-conquer algorithm, the time complexity (span) reduces to T∞(n) = O(n),
with the serial cache complexity increased to Q(n) = O(n²/B). From the above
analyses, we can see that the 2-way divide-and-conquer algorithm has the worst time
complexity but the best cache complexity, while the parallel looping algorithm (the
n-way divide-and-conquer in Figure 1), on the contrary, has the best time complexity
but the worst cache complexity. Traditional wisdom may suggest tuning to a balanced
point between these two extremes to get good performance on a real machine.
Apparently, the intuition behind such balancing is that we cannot achieve both
optima at the same time.
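To make the 2-way schedule concrete, here is a minimal serial sketch (in Python, with illustrative names of my own) of the 2-way divide-and-conquer order: the top-left sub-quadrant goes first, the top-right and bottom-left could run in parallel, and the bottom-right goes last, which is why three of the four sub-quadrants lie on the critical path and T∞(n) = 3T∞(n/2):

```python
def lcs_dac(X, L, R, i0, j0, n, base):
    """Fill the n-by-n sub-quadrant of the LCS table X whose top-left
    cell is (i0, j0), in the classic 2-way divide-and-conquer order.
    Serial sketch: in a parallel version, the two middle recursive
    calls (top-right and bottom-left) may run concurrently."""
    if n <= base:  # base case: plain row-major fill
        for i in range(i0, i0 + n):
            for j in range(j0, j0 + n):
                if L[i - 1] == R[j - 1]:
                    X[i][j] = X[i - 1][j - 1] + 1
                else:
                    X[i][j] = max(X[i - 1][j], X[i][j - 1])
        return
    h = n // 2
    lcs_dac(X, L, R, i0,     j0,     h, base)  # top-left: first
    lcs_dac(X, L, R, i0,     j0 + h, h, base)  # top-right:   could run
    lcs_dac(X, L, R, i0 + h, j0,     h, base)  # bottom-left: in parallel
    lcs_dac(X, L, R, i0 + h, j0 + h, h, base)  # bottom-right: last

def lcs_dac_length(L, R, base=2):
    # Assumes len(L) == len(R) == n, a power of two divisible by base.
    n = len(L)
    X = [[0] * (n + 1) for _ in range(n + 1)]  # row/column 0: zeros
    lcs_dac(X, L, R, 1, 1, n, base)
    return X[n][n]
```

The recursion order respects every true dependency (each cell needs only its left, upper, and upper-left neighbors), so the result matches the straightforward row-major fill.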
Figure 2. Scheduling of the classic 2-way divide-and-conquer algorithm for LCS on the
left and the cache-oblivious wavefront algorithm on the right. Solid arrows indicate
true dependencies derived from the defining recurrence equations, while dashed arrows
indicate false dependencies introduced by the scheduling of the algorithm.
Figure 3. Performance comparison of four algorithms for LCS: Parallel Loop (without
tiling), Blocked Loop (parallel loop with tiling), the classic 2-way
divide-and-conquer based cache-oblivious parallel (2-way COP) algorithm, and the
cache-oblivious wavefront (COW) algorithm. In the performance plot, we fix the same
base case size for all four algorithms except Parallel Loop (without tiling), and use
exactly the same non-inlined kernel function to compute the base case for all
algorithms, so that the only difference is how the algorithms schedule the base cases.
In [1] we showed that both optimal time and optimal cache complexity are achievable
at the same time via a more compact scheduling, shown on the right-hand side of
Figure 2. The new scheduling policy eliminates all false dependencies introduced by
the prior divide-and-conquer based cache-oblivious parallel algorithm and retains
only the true dependencies from the defining recurrence equations. From a high-level
point of view, the new algorithm proceeds by dynamically unfolding sub-quadrants on
the divide-and-conquer tree, with the progress of the unfolded sub-quadrants aligned
to a wavefront. In other words, the wavefront sweeping through the divide-and-conquer
tree is generated from a 2-way divide-and-conquer algorithm, so we name the technique
``cache-oblivious wavefront’’ (COW for short)3. From Figure 3, we can see that by
combining the best of both the cache-oblivious and the looping worlds, the
cache-oblivious wavefront algorithm beats both the classic 2-way divide-and-conquer
based cache-oblivious parallel algorithm and the parallel loop with tiling (the
Blocked Loop algorithm in Figure 3). Some natural questions following this direction
of research are: Does this or a similar technique apply to all divide-and-conquer
based cache-oblivious parallel algorithms? What can ``cache-obliviousness’’ buy us
and cost us? What is the fundamental tradeoff between time and cache complexity?
These are the fundamental problems I am working on. Recent progress in this direction
includes a successful application of the cache-oblivious wavefront technique to some
numerical algorithms, such as Cholesky factorization and LU factorization without
pivoting.
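The full COW algorithm of [1] unfolds the wavefront dynamically over the divide-and-conquer tree; as a much-simplified sketch of the underlying idea only (hypothetical Python, names mine), the following code visits base-case blocks anti-diagonal by anti-diagonal. Every block on one anti-diagonal depends only on blocks of earlier anti-diagonals, so in a parallel version all blocks on a wavefront could execute concurrently:

```python
def lcs_wavefront_length(L, R, block=2):
    """Fill the LCS table by sweeping a wavefront over base-case blocks.
    Assumes len(L) == len(R) == n with n divisible by `block`."""
    n = len(L)
    nb = n // block                       # number of blocks per dimension
    X = [[0] * (n + 1) for _ in range(n + 1)]
    for d in range(2 * nb - 1):           # block anti-diagonal (wavefront) index
        for bi in range(max(0, d - nb + 1), min(d, nb - 1) + 1):
            bj = d - bi                   # block (bi, bj) sits on diagonal d
            # This block needs only blocks (bi-1, bj), (bi, bj-1), (bi-1, bj-1),
            # which all lie on diagonals d-1 and d-2: already completed.
            for i in range(bi * block + 1, (bi + 1) * block + 1):
                for j in range(bj * block + 1, (bj + 1) * block + 1):
                    if L[i - 1] == R[j - 1]:
                        X[i][j] = X[i - 1][j - 1] + 1
                    else:
                        X[i][j] = max(X[i - 1][j], X[i][j - 1])
    return X[n][n]
```

Since only true dependencies constrain the wavefront, the span is proportional to the number of wavefronts rather than to the 3-way critical path of the 2-way divide-and-conquer schedule.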
[Figure 3 plot: LCS performance with base case size 64; y-axis: updated points/second
(×1e9, linear scale); x-axis: side length (n); series: Parallel Loop, Blocked Loop,
2-way COP, COW.]
3 Thanks to Prof. Charles E. Leiserson at MIT CSAIL for coining the name.
Besides the continuing study of the fundamental tradeoff between time and cache
complexity, I also have research interests in the wider area of parallel algorithms
and data structures. For example, my recent study of Range 1 Query algorithms, a
special case of Range Partial Sum Query algorithms but with only values of 0 or 1 on
discrete grid cells, was published in COCOON’14 [3]; another recent study of mine, on
weight balancing on
boundaries and skeletons [4] is an inverse of the barycenter problem. The barycenter
problem is: given a set of n weights W = {w1, w2, …, wn} and n arbitrary locations
X = {x1, x2, …, xn} on the boundary of an arbitrary multi-dimensional polygon, it is
easy to calculate the barycenter in O(n) time by the formula x = ∑_{i=1}^{n} wi·xi.
The inverse problem is: given an arbitrary point inside or outside the convex/concave
polygon and the weight set W, how fast can we identify n locations
X = {x1, x2, …, xn} on the boundary of the polygon at which to place the n weights
such that their barycenter is the given point? The results were published in
SoCG’14 [4].
References:
1) Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, Rezaul
A. Chowdhury, Cache-Oblivious Wavefront: Improving Parallelism of Recursive
Dynamic Programming Algorithms without Losing Cache-Efficiency, 20th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15),
Feb. 9-11, 2015, San Francisco, CA, USA.
2) Rezaul A. Chowdhury, Pramod Ganapathi, Yuan Tang, Jesmin Jahan Tithi, Improving
Parallelism of Recursive Stencil Algorithms without Sacrificing Cache Performance, 2nd
Annual Workshop on Stencil Computations (WOSC’14), held in conjunction with
SPLASH’14, Oct. 20-24, 2014, Portland, Oregon, USA, published in the ACM Digital Library.
3) Michael A. Bender, Rezaul A. Chowdhury, Pramod Ganapathi, Samuel McCauley and
Yuan Tang, The Range 1 Query (R1Q) Problem, The 20th International Computing and
Combinatorics Conference (COCOON'14), August 4-6, Atlanta, Georgia, USA, 2014.
4) Luis Barba, Otfried Cheong, Jean-Lou De Carufel, Michael Gene Dobbins, Rudolf
Fleischer, Akitoshi Kawamura, Matias Korman, Yoshio Okamoto, János Pach, Yuan
Tang, Takeshi Tokuyama, Sander Verdonschot, Tianhao Wang, Weight Balancing on
Boundaries and Skeletons, The 30th Annual ACM Symposium on Computational
Geometry (SoCG’14), June 8-11, 2014, Kyoto, Japan.
5) Pramod Ganapathi, Rezaul Chowdhury, and Yuan Tang (11/9/12). The R1Q Problem.
22nd Annual Fall Workshop on Computational Geometry (FWCG’12). College Park,
Maryland.
6) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
Charles E. Leiserson. The Pochoir stencil compiler. 23rd ACM Symposium on Parallelism
in Algorithms and Architectures (SPAA'11), 2011.
7) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
Charles E. Leiserson. Coding stencil computations using the Pochoir stencil-specification
language. 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'11), 2011.