Research Statement
My research studies the fundamental tradeoffs among cache-obliviousness, cache-optimality, and parallelism of algorithms and data structures on modern multi-core and many-core1 architectures with hierarchical caches. My approach combines theory and experiments. I have mainly been working on stencil computation, general dynamic programming, and numerical algorithms.
Since 2009, I have been working with Prof. Charles E. Leiserson at MIT on stencil
computation. The project ``The Pochoir Stencil Compiler'' [6, 7] was funded by NSF
(USD $983,017 in total) and by Intel Corp. (RMB 904,627.72). In this project, we
achieved the following results: 1) asymptotically improved the parallelism, at the same
cache-efficiency, of a cache-oblivious parallel algorithm for multi-dimensional simple
stencil computation2 by inventing the ``hyperspace cut''; 2) handled periodic and
aperiodic boundary conditions in one unified algorithm; 3) designed a domain-specific
language (DSL) embedded in C++ for stencil computation; 4) designed and developed a
novel two-phase compilation strategy, in which the first phase calls any common C++
tool chain to verify correctness, and only afterwards invokes the Pochoir compiler to
perform a source-to-source transformation into highly optimized code. The two-phase
strategy saves the massive cost of parsing and type-checking full C++. In this project, I
contributed to the algorithms, code generation, benchmarking, and core compiler
software of the Pochoir system.
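To make concrete the kind of kernel Pochoir targets, a one-dimensional heat-equation stencil can be sketched as follows. This is my own minimal illustration in Python (function name `heat_1d` and the double-buffered loop structure are illustrative), not Pochoir's actual DSL syntax:

```python
def heat_1d(u, steps, alpha=0.25):
    """Jacobi-style 1D heat stencil: each interior point is updated from its
    neighbors at the previous time step (aperiodic boundary: endpoints fixed)."""
    n = len(u)
    for _ in range(steps):
        nxt = u[:]  # double buffering: read the old time step, write the new one
        for i in range(1, n - 1):
            nxt[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
        u = nxt
    return u
```

A "simple" stencil in the sense of footnote 2 means this update rule is the same at every point and every time step; Pochoir's hyperspace cut parallelizes the time-space iteration of such kernels cache-obliviously.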
After the ``Pochoir'' project, I continued working on a joint project on general
dynamic programming. Note that stencil computation can be viewed as a special case
of dynamic programming with constant but non-orthogonal dependencies. In this line
of research, my focus lies on the fundamental tradeoff between time and cache
complexity.
Modern multicore systems with hierarchical caches demand both parallelism and
cache-locality from software to yield good performance. In the analysis of parallel
computations, theory usually considers two metrics: time complexity and cache
complexity. The traditional objective when scheduling a parallel computation is to
minimize the time complexity; i.e., if we represent the computation as a DAG
(Directed Acyclic Graph), the time complexity is the length of the critical path in the
DAG. Alternatively, one can focus on minimizing the cache complexity, i.e., the number
of cache misses incurred during the execution of the program. Theoretical analyses
often consider these metrics separately; in reality, the actual completion time of a
program depends on both, since the number of cache misses has a direct impact on
the running time, and the time complexity bound often serves as a good indicator of
scalability, load balance, and scheduling overheads.
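As a tiny illustration of the first metric, the critical-path length (span) of a computation DAG is its longest path, computable in one topological-order pass. This is my own sketch (names `span`, `dag`, `cost` are illustrative), not a particular scheduler's implementation:

```python
def span(dag, cost):
    """Critical-path length of a DAG given as {node: [predecessors]} with
    per-node costs. Assumes the dict lists nodes in topological order."""
    finish = {}
    for v in dag:  # dicts preserve insertion order, topological here
        finish[v] = cost[v] + max((finish[p] for p in dag[v]), default=0)
    return max(finish.values())
```

For a diamond-shaped DAG with unit costs (one source, two independent middle nodes, one sink), the span is 3 even though the total work is 4; that gap between work and span is what bounds parallel scalability.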
1 For example: Intel MIC (Many-Integrated-Core) coprocessor
2 Simple stencil is a stencil computation without heterogeneity in space or time.
Tuning algorithms for time and/or cache complexity is usually not preferred. It has
several disadvantages: the code structure becomes more complicated; the parameter
space to explore is usually exponential in size; and the tuned code is non-portable, i.e.,
different hardware systems require separate tunings. Moreover, since the tuning
environment can never exactly match the running environment (e.g., different
numbers of background daemon processes, different network traffic loads), long-tuned
code is almost always sub-optimal. Classic cache-oblivious algorithms largely eliminate
the need to tune for optimality on hierarchical caches. Can we further eliminate the
need to tune between time and cache complexity while remaining cache-oblivious?
What is the fundamental tradeoff between time and cache complexity? What can
obliviousness buy us, and what does it cost us? These questions lie at the center of my
research.
For generic parallel computation, there is a tension between the objectives of
minimizing time and cache complexity. Take LCS (Longest Common Subsequence) as
an example: given two sequences L = ⟨l1 l2 … ln⟩ and R = ⟨r1 r2 … rn⟩, we find
their longest common subsequence by filling out a 2D table using the recurrence

    X[i, j] = 0                              if i = 0 ∨ j = 0
    X[i, j] = X[i−1, j−1] + 1                if i, j > 0 ∧ li = rj
    X[i, j] = max{X[i−1, j], X[i, j−1]}      otherwise
Figure 1. 2-way to 𝑛-way Divide-And-Conquer algorithms for LCS
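The recurrence translates directly into a bottom-up table fill. Here is a straightforward serial sketch in Python (the function name `lcs_length` is illustrative):

```python
def lcs_length(L, R):
    """Fill the 2D LCS table X row by row, following the recurrence:
    X[i][j] = X[i-1][j-1] + 1 if L[i-1] == R[j-1], else max of up/left."""
    n, m = len(L), len(R)
    X = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 / column 0: base case
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if L[i - 1] == R[j - 1]:
                X[i][j] = X[i - 1][j - 1] + 1
            else:
                X[i][j] = max(X[i - 1][j], X[i][j - 1])
    return X[n][m]
```

The interesting question is not the recurrence itself but the order in which the n² cells are scheduled; the algorithms below differ only in that order.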
In the literature there are two classes of algorithms for this problem: divide-and-
conquer based cache-oblivious algorithms, and looping algorithms, possibly with tiling.
Consider the 2-way divide-and-conquer parallel algorithm shown on the left-hand side
of Figure 1: at each recursion level, we cut each dimension into two halves. Since three
of the four sub-quadrants sit on the critical path at each level, the span recurrence
T∞(n) = 3T∞(n/2) solves to T∞(n) = O(n^log₂3), where n is the input problem size.
We usually count only the serial cache complexity in the ideal cache model (a.k.a. the
cache-oblivious model), because the parallel cache complexity is then determined by
plugging both the time complexity and the serial cache complexity into a formula fixed
by the underlying run-time scheduler. In the ideal cache model there are two levels of
memory: the upper level is a fully associative cache of size M, and the lower level is an
infinitely sized main memory. On a cache miss in the upper level, the system uses an
omniscient replacement policy to exchange a cache line of size B between the two
levels. The parameters M and B are related by the tall-cache assumption, i.e.,
M = Ω(B^(1+ϵ)) for some constant ϵ > 0. In the ideal cache model, the serial cache
complexity is computed by summing the cache misses incurred by the four individual
sub-quadrants at each recursion level, i.e., Q(n) = 4Q(n/2). The recursive summation
stops at the level where a sub-quadrant just fits into the cache while its parent does
not, i.e., there exists n₀ such that n₀ = ϵ₀M and 2n₀ > M, where ϵ₀ ∈ (0, 1] is a
constant; below this level, further recursive divide-and-conquer incurs no more cache
misses than Q(n₀) = O(n₀/B). Solving the recurrence, we have Q(n) = O(n²/(ϵ₀BM)).
If we keep increasing the branching factor of the divide-and-conquer, the algorithm
eventually reduces to the n-way divide-and-conquer algorithm shown on the right-
hand side of Figure 1, which is essentially a parallel looping algorithm without tiling.
For the n-way algorithm, the time complexity (span) reduces to T∞(n) = O(n), while
the serial cache complexity increases to Q(n) = O(n²/B). From these analyses we see
that the 2-way divide-and-conquer algorithm has the worst span but the best cache
complexity, whereas the parallel looping algorithm (the n-way divide-and-conquer in
Figure 1), on the contrary, has the best span but the worst cache complexity.
Traditional wisdom suggests tuning a balanced point between these two extremes to
get good performance on a real machine. The intuition behind balancing is, apparently,
that we cannot be optimal in both at the same time.
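The dependency structure behind the 3T∞(n/2) recurrence can be seen in a serial simulation of the 2-way divide-and-conquer schedule. This is my own sketch (the name `dnc_order` and the callback interface are illustrative); the real algorithm runs the two middle quadrants in parallel, which is exactly why three of the four quadrants lie on the critical path:

```python
def dnc_order(i0, j0, n, visit):
    """Visit the cells of an n-by-n LCS region in 2-way divide-and-conquer
    order: top-left quadrant first, then top-right and bottom-left (these two
    are independent and run in parallel in the real algorithm), then
    bottom-right."""
    if n == 1:
        visit(i0, j0)
        return
    h = n // 2
    dnc_order(i0, j0, h, visit)          # quadrant 00: on the critical path
    dnc_order(i0, j0 + h, h, visit)      # quadrant 01 \ mutually independent,
    dnc_order(i0 + h, j0, h, visit)      # quadrant 10 / one is on the path
    dnc_order(i0 + h, j0 + h, h, visit)  # quadrant 11: on the critical path
```

Any serialization of this schedule visits every cell after its up, left, and diagonal predecessors, so it computes the LCS table correctly; the cost is that quadrant 10 must wait for all of quadrant 01 (or vice versa at some level of the scheduler), a false dependency the wavefront technique removes.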
Figure 2. Scheduling of the classic 2-way divide-and-conquer algorithm for LCS (left) and the
cache-oblivious wavefront algorithm (right). Solid arrows indicate true dependencies derived
from the defining recurrence equations, while dashed arrows indicate false dependencies
introduced by the scheduling of the algorithm.
Figure 3. Performance comparison of four algorithms for LCS: Parallel Loop (without tiling),
Blocked Loop (parallel loop with tiling), the classic 2-way divide-and-conquer based cache-
oblivious parallel (2-way COP) algorithm, and the cache-oblivious wavefront (COW) algorithm.
In the performance plot, we fix the same base-case size for all algorithms except the parallel
loop (without tiling) and use exactly the same non-inlined kernel function to compute the base
case in all algorithms, so that the only difference is how the algorithms schedule the base cases.
In [1] we have shown that optimal time and cache complexity are achievable
simultaneously via a more compact scheduling, as shown on the right-hand side of
Figure 2. The new scheduling policy eliminates all false dependencies introduced by
the prior divide-and-conquer based cache-oblivious parallel algorithm and retains only
the true dependencies from the defining recurrence equations. From a high-level point
of view, the new algorithm dynamically unfolds sub-quadrants of the divide-and-
conquer tree, and the progress of the unfolded sub-quadrants is aligned to a
wavefront. In other words, the wavefront sweeping through the divide-and-conquer
tree is generated from a 2-way divide-and-conquer algorithm, so we name the
technique ``cache-oblivious wavefront'' (COW for short)3. From Figure 3, we can see
that, by combining the best of the cache-oblivious and looping worlds, the cache-
oblivious wavefront algorithm beats both the classic 2-way divide-and-conquer based
cache-oblivious parallel algorithm and the parallel loop with tiling (the Blocked Loop
algorithm in Figure 3). Natural questions along this direction of research are: Does this
or a similar technique apply to all divide-and-conquer based cache-oblivious parallel
algorithms? What can cache-obliviousness buy us, and what does it cost us? What is
the fundamental tradeoff between time and cache complexity? These are all
fundamental problems I am working on. Recent progress in this direction includes a
successful application of the cache-oblivious wavefront technique to numerical
algorithms such as Cholesky factorization and LU factorization without pivoting.
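The looping counterpart of the wavefront idea is easy to sketch: process the DP table along anti-diagonals, all of whose cells are mutually independent. This is my own minimal illustration of wavefront ordering (the name `wavefront_lcs` is illustrative), not the COW algorithm itself, which applies the same idea to the divide-and-conquer tree rather than to individual cells:

```python
def wavefront_lcs(L, R):
    """LCS via anti-diagonal (wavefront) order: every cell on diagonal
    d = i + j depends only on diagonals d-1 and d-2, so all cells of a
    diagonal could be updated in parallel."""
    n, m = len(L), len(R)
    X = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):                      # anti-diagonal index
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if L[i - 1] == R[j - 1]:
                X[i][j] = X[i - 1][j - 1] + 1
            else:
                X[i][j] = max(X[i - 1][j], X[i][j - 1])
    return X[n][m]
```

This cell-level wavefront attains the O(n) span of the looping algorithm but also its poor O(n²/B) cache behavior; COW instead advances a wavefront of recursively generated sub-quadrants, keeping the divide-and-conquer cache bound.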
Besides the continuing study of the fundamental tradeoff between time and cache
complexity, I also have research interests in the wider area of parallel algorithms and
data structures. For example, my recent study of Range 1 Query (R1Q) algorithms, a
special case of range partial-sum queries in which the discrete grid cells hold only
values 0 or 1, was published in COCOON'14 [3]. Another recent study, on weight
balancing on boundaries and skeletons [4], is an inverse of the barycenter problem.
The barycenter problem is: given a set of n weights W = {w1, w2, …, wn} and
arbitrary n locations X = {x1, x2, …, xn} on the boundary of an arbitrary multi-
dimensional polygon, it is easy to calculate the barycenter in O(n) time by the formula
x = Σᵢ₌₁ⁿ wi · xi. The inverse problem asks: given an arbitrary point inside or outside a
convex or concave polygon and the weight set W, how fast can we identify n locations
X = {x1, x2, …, xn} on the boundary of the polygon at which to place the n weights so
that their barycenter is the given point? The results were published in SoCG'14 [4].

3 Thanks to Prof. Charles E. Leiserson at MIT CSAIL for coining the name.
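The forward direction (computing a barycenter from given weights and locations) is a single O(n) pass. A minimal sketch with illustrative names, written with explicit normalization by the total weight (when the weights are normalized to sum to 1, the denominator disappears and this matches the formula above):

```python
def barycenter(weights, points):
    """O(n) weighted barycenter of d-dimensional points:
    x = (sum_i w_i * x_i) / (sum_i w_i), computed coordinate-wise."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[k] for w, p in zip(weights, points)) / total
        for k in range(dim)
    )
```

The inverse problem is harder precisely because the locations, not the weights, are the unknowns, and they are constrained to lie on the polygon's boundary.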
References:
1) Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, Rezaul
A. Chowdhury, Cache-Oblivious Wavefront: Improving Parallelism of Recursive
Dynamic Programming Algorithms without Losing Cache-Efficiency, 20th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15),
Feb. 9-11, 2015, San Francisco, CA, USA.
2) Rezaul A. Chowdhury, Pramod Ganapathi, Yuan Tang, Jesmin Jahan Tithi, Improving
Parallelism of Recursive Stencil Algorithms without Sacrificing Cache Performance, 2nd
Annual Workshop on Stencil Computations (WOSC'14), held in conjunction with
SPLASH'14, Portland, Oregon, USA, Oct. 20-24, 2014, published in the ACM digital library.
3) Michael A. Bender, Rezaul A. Chowdhury, Pramod Ganapathi, Samuel McCauley and
Yuan Tang, The Range 1 Query (R1Q) Problem, The 20th International Computing and
Combinatorics Conference (COCOON'14), August 4-6, Atlanta, Georgia, USA, 2014.
4) Luis Barba, Otfried Cheong, Jean-Lou De Carufel, Michael Gene Dobbins, Rudolf
Fleischer, Akitoshi Kawamura, Matias Korman, Yoshio Okamoto, János Pach, Yuan
Tang, Takeshi Tokuyama, Sander Verdonschot, Tianhao Wang, Weight Balancing on
Boundaries and Skeletons, The 30th Annual ACM Symposium on Computational
Geometry (SoCG'14), June 8-11, 2014, Kyoto, Japan.
5) Pramod Ganapathi, Rezaul Chowdhury, and Yuan Tang (11/9/12). The R1Q Problem.
22nd Annual Fall Workshop on Computational Geometry (FWCG’12). College Park,
Maryland.
6) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
Charles E. Leiserson. The Pochoir stencil compiler. 23rd ACM Symposium on Parallelism
in Algorithms and Architectures (SPAA'11), 2011.
7) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
Charles E. Leiserson. Coding stencil computations using the Pochoir stencil-specification
language. 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'11), 2011.

More Related Content

What's hot

Վարդանանց պատերազմ
Վարդանանց պատերազմՎարդանանց պատերազմ
Վարդանանց պատերազմ
susannachalikyan
 

What's hot (9)

Վարդանանց պատերազմ
Վարդանանց պատերազմՎարդանանց պատերազմ
Վարդանանց պատերազմ
 
पत्र लेखन ( अनौपचारिक पत्र ) कक्षा -२ के लिए
पत्र  लेखन ( अनौपचारिक पत्र ) कक्षा -२ के लिए पत्र  लेखन ( अनौपचारिक पत्र ) कक्षा -२ के लिए
पत्र लेखन ( अनौपचारिक पत्र ) कक्षा -२ के लिए
 
अवतारवाद
अवतारवाद  अवतारवाद
अवतारवाद
 
वार्ता.pptx
वार्ता.pptxवार्ता.pptx
वार्ता.pptx
 
sindh.pdf
sindh.pdfsindh.pdf
sindh.pdf
 
Ancient indian taxation its nature, pattern and types
Ancient indian taxation its nature, pattern and typesAncient indian taxation its nature, pattern and types
Ancient indian taxation its nature, pattern and types
 
Later (Second) Pandya Dynasty
Later (Second) Pandya DynastyLater (Second) Pandya Dynasty
Later (Second) Pandya Dynasty
 
पुष्यभूति वंश .pptx
पुष्यभूति वंश .pptxपुष्यभूति वंश .pptx
पुष्यभूति वंश .pptx
 
Religion and sacrifice of later vedic period
Religion and sacrifice of later vedic periodReligion and sacrifice of later vedic period
Religion and sacrifice of later vedic period
 

Viewers also liked

Dr. Muhsinah L Morris Research Statement
Dr. Muhsinah L Morris Research StatementDr. Muhsinah L Morris Research Statement
Dr. Muhsinah L Morris Research Statement
Muhsinah Morris, Ph.D
 
Chris Mahar Research Statement
Chris  Mahar Research StatementChris  Mahar Research Statement
Chris Mahar Research Statement
Chris Mahar
 
Corrosion engineer performance appraisal
Corrosion engineer performance appraisalCorrosion engineer performance appraisal
Corrosion engineer performance appraisal
lopedhapper
 
Lecture 3.1 how to write a cover letter student notes
Lecture 3.1 how to write a cover letter   student notesLecture 3.1 how to write a cover letter   student notes
Lecture 3.1 how to write a cover letter student notes
Nancy Bray
 
Research and Teaching Statement
Research and Teaching StatementResearch and Teaching Statement
Research and Teaching Statement
Dario Aguilar
 
Research Interests Dr. Bassam Alameddine
Research Interests Dr. Bassam AlameddineResearch Interests Dr. Bassam Alameddine
Research Interests Dr. Bassam Alameddine
balameddine
 
ERC Project - Martin Schroder
ERC Project - Martin SchroderERC Project - Martin Schroder
ERC Project - Martin Schroder
David Young
 
Zeolitic imidazolate frameworks
Zeolitic imidazolate frameworksZeolitic imidazolate frameworks
Zeolitic imidazolate frameworks
Ujjwal Surin
 

Viewers also liked (20)

Dr. Muhsinah L Morris Research Statement
Dr. Muhsinah L Morris Research StatementDr. Muhsinah L Morris Research Statement
Dr. Muhsinah L Morris Research Statement
 
Writing Research Statement
Writing Research StatementWriting Research Statement
Writing Research Statement
 
Chris Mahar Research Statement
Chris  Mahar Research StatementChris  Mahar Research Statement
Chris Mahar Research Statement
 
An Abridged Version of My Statement of Research Interests
An Abridged Version of My Statement of Research InterestsAn Abridged Version of My Statement of Research Interests
An Abridged Version of My Statement of Research Interests
 
Corrosion engineer performance appraisal
Corrosion engineer performance appraisalCorrosion engineer performance appraisal
Corrosion engineer performance appraisal
 
Lecture 3.1 how to write a cover letter student notes
Lecture 3.1 how to write a cover letter   student notesLecture 3.1 how to write a cover letter   student notes
Lecture 3.1 how to write a cover letter student notes
 
Cover Letter Guide
Cover Letter GuideCover Letter Guide
Cover Letter Guide
 
How to Write a Cover Letter
How to Write a Cover LetterHow to Write a Cover Letter
How to Write a Cover Letter
 
Jurix 2014 welcome presentation
Jurix 2014 welcome presentationJurix 2014 welcome presentation
Jurix 2014 welcome presentation
 
Statement of interest filed Lane v. Kitzhaber
Statement of interest filed Lane v. KitzhaberStatement of interest filed Lane v. Kitzhaber
Statement of interest filed Lane v. Kitzhaber
 
7 Cover Letter Hos Od
7 Cover Letter Hos Od7 Cover Letter Hos Od
7 Cover Letter Hos Od
 
Cover letter and recommendation
Cover letter and recommendationCover letter and recommendation
Cover letter and recommendation
 
Research and Teaching Statement
Research and Teaching StatementResearch and Teaching Statement
Research and Teaching Statement
 
Lee - Organic Materials Chemistry - Spring Review 2013
Lee - Organic Materials Chemistry - Spring Review 2013Lee - Organic Materials Chemistry - Spring Review 2013
Lee - Organic Materials Chemistry - Spring Review 2013
 
6.anilkumar shoibam
6.anilkumar shoibam6.anilkumar shoibam
6.anilkumar shoibam
 
Research Interests Dr. Bassam Alameddine
Research Interests Dr. Bassam AlameddineResearch Interests Dr. Bassam Alameddine
Research Interests Dr. Bassam Alameddine
 
ERC Project - Martin Schroder
ERC Project - Martin SchroderERC Project - Martin Schroder
ERC Project - Martin Schroder
 
Open access and the ERC - EARMA Conference, 3 July 2013
Open access and the ERC - EARMA Conference, 3 July 2013Open access and the ERC - EARMA Conference, 3 July 2013
Open access and the ERC - EARMA Conference, 3 July 2013
 
Zeolitic imidazolate frameworks
Zeolitic imidazolate frameworksZeolitic imidazolate frameworks
Zeolitic imidazolate frameworks
 
STATEMENT OF RESEARCH INTERESTS-Dr.TKS.
STATEMENT OF RESEARCH INTERESTS-Dr.TKS.STATEMENT OF RESEARCH INTERESTS-Dr.TKS.
STATEMENT OF RESEARCH INTERESTS-Dr.TKS.
 

Similar to Research Statement

Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Subhajit Sahu
 
DSP IEEE paper
DSP IEEE paperDSP IEEE paper
DSP IEEE paper
prreiya
 
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
On the-joint-optimization-of-performance-and-power-consumption-in-data-centersOn the-joint-optimization-of-performance-and-power-consumption-in-data-centers
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
Cemal Ardil
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)
Sangamesh Ragate
 
Approaches to online quantile estimation
Approaches to online quantile estimationApproaches to online quantile estimation
Approaches to online quantile estimation
Data Con LA
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Frederic Desprez
 

Similar to Research Statement (20)

Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
 
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower BoundsParallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds
 
PPoPP15
PPoPP15PPoPP15
PPoPP15
 
DSP IEEE paper
DSP IEEE paperDSP IEEE paper
DSP IEEE paper
 
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
Design and Implementation of a Cache Hierarchy-Aware Task Scheduling for Para...
 
A Novel Approach of Caching Direct Mapping using Cubic Approach
A Novel Approach of Caching Direct Mapping using Cubic ApproachA Novel Approach of Caching Direct Mapping using Cubic Approach
A Novel Approach of Caching Direct Mapping using Cubic Approach
 
Modern processors
Modern processorsModern processors
Modern processors
 
Shor's discrete logarithm quantum algorithm for elliptic curves
 Shor's discrete logarithm quantum algorithm for elliptic curves Shor's discrete logarithm quantum algorithm for elliptic curves
Shor's discrete logarithm quantum algorithm for elliptic curves
 
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
On the-joint-optimization-of-performance-and-power-consumption-in-data-centersOn the-joint-optimization-of-performance-and-power-consumption-in-data-centers
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
 
Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)Colfax-Winograd-Summary _final (1)
Colfax-Winograd-Summary _final (1)
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
 
ACES_Journal_February_2012_Paper_07
ACES_Journal_February_2012_Paper_07ACES_Journal_February_2012_Paper_07
ACES_Journal_February_2012_Paper_07
 
An Index-first Addressing Scheme for Multi-level Caches
An Index-first Addressing Scheme for Multi-level CachesAn Index-first Addressing Scheme for Multi-level Caches
An Index-first Addressing Scheme for Multi-level Caches
 
Approaches to online quantile estimation
Approaches to online quantile estimationApproaches to online quantile estimation
Approaches to online quantile estimation
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHMADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
ADAPTIVE FUZZY KERNEL CLUSTERING ALGORITHM
 
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to PracticeWorkflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
Workflow Allocations and Scheduling on IaaS Platforms, from Theory to Practice
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
C optimization notes
C optimization notesC optimization notes
C optimization notes
 
Integrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architectureIntegrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architecture
 

Research Statement

  • 1. Research Statement My research focuses on studying the fundamental tradeoffs between cache- obliviousness, cache-optimality, and parallelism of algorithms and data structures on modern multi-core and many-core 1 architectures with hierarchical cache. My approach combines both theory and experiments. I have been mainly working on stencil computation, general dynamic programming computation, and numerical algorithms. Since 2009, I have been working with Prof. Charles E. Leiserson at MIT on stencil computation. The project ``The Pochoir Stencil Compiler’’ [6, 7] was funded both by NSF at total amount USD $983,017 and Intel Corp at amount RMB (Chinese Yuan) 904,627.72. In this project, we achieved following results: 1) improved the parallelism asymptotically with the same cache-efficiency of a cache-oblivious parallel algorithm for multi-dimensional simple stencil computation2 by inventing ``hyperspace cut’’; 2) handles periodic and aperiodic boundary condition in one unified algorithm; 3) designed domain-specific language (DSL) embedded in C++ for stencil computation; 4) designed and developed a novel two-phase compilation strategy that the first phase call any common C++ tool chain to verify the correctness and will invoke the Pochoir compiler only afterwards to do a source-to-source transformation for a highly optimized code. The two-phase compilation strategy saves massive cost of parsing and type-checking of C++ language. In this project, I contributed to the algorithm, code generation, benchmarking and core compiler software for the Pochoir system. After the ``Pochoir’’ project, I continue working on a joint project on general dynamic programming problem. Note that stencil computation can be viewed as a special case of dynamic programming with constant but non-orthogonal dependencies. In the research of general dynamic programming problem, my focus lies on the fundamental tradeoff between time and cache complexity. 
Modern multicore systems with hierarchical caches demand both parallelism and cache-locality from software to yield good performance. In the analysis of parallel computations, theory usually considers two metrics: time complexity and cache complexity. The traditional objective for scheduling a parallel computation is to minimize the time complexity, i.e., if we represent the parallel computation as a DAG (Directed Acyclic Graph), time complexity is the length of critical path in DAG. Alternatively, one can focus on minimizing the cache complexity, i.e., the number of cache misses incurred during the execution of program. Theoretical analyses often consider these metrics separately; in reality, the actual completion time of a program depends on both, since the number of cache misses has a direct impact on the running time and the time complexity bound often serves as a good indicator of scalability, load-balance and scheduling overheads. 1 For example: Intel MIC (Many-Integrated-Core) coprocessor 2 Simple stencil is a stencil computation without heterogeneity in space or time.
  • 2. Tuning of algorithms for time and/or cache complexity are usually not preferred. It has several disadvantages: the code structure becomes more complicated; the parameter space to explore is usually exponentially sized; and the tuned code is non- portable, i.e., for different hardware systems, separate tunings are required. Moreover, since the tuning environment cannot be exactly the same as running environment, e.g. different numbers of background daemon processes, different loads of network traffic, etc. , it means that the long-tuned code is almost always sub-optimal. Classic cache- oblivious algorithms eliminate the need of tuning of optimality for hierarchical cache largely. Can we further eliminate the need of tuning between time and cache complexity while still remaining cache-obliviousness? What’s the fundamental tradeoff between time and cache complexity? What obliviousness can buy us and cost us? These are the questions lying in the center of my research. For generic parallel computation, there is a tension between the objectives of minimizing time and cache complexity. Take LCS (Longest Common Subsequence) as an example: Given two sequences L =< l1l2 … ln > and R =< r1r2 … rn >, we find its longest common subsequence by filling out a 2D table using recurrences X[i, j] = { 0, 𝑖𝑓 𝑖 = 0 ∨ 𝑗 = 0 𝑋[𝑖 − 1, 𝑗 − 1] + 1, 𝑖𝑓 𝑖, 𝑗 > 0 ∧ 𝑙𝑖 = 𝑟𝑗 max{𝑋[𝑖 − 1, 𝑗], 𝑋[𝑖, 𝑗 − 1]}, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 Figure 1. 2-way to 𝑛-way Divide-And-Conquer algorithms for LCS In literature, there are two classes of algorithms to solve the problem. One is the divide-and-conquer based cache-oblivious algorithm, the other is based on looping possibly with tiling. If we adopt a 2-way divide-and-conquer parallel algorithm as shown on the left-hand side in Figure 1, i.e., at each recursion level, we cut each dimension into two halves. 
Since at each recursion level, we have three out of four sub-quadrants sitting on the critical paths, the time recurrence T∞(𝑛) = 3𝑇∞ ( 𝑛 2 ) solves to T∞(𝑛) = 𝑂(𝑛𝑙𝑜𝑔23), where n is the input problem size. We usually only count serial cache complexity in ideal cache model, a.k.a. cache-oblivious model because parallel cache complexity will be determined by fitting both time complexity and serial cache complexity into a formula that is determined by the underlying run- time scheduler. In ideal cache model, there are two levels of memory. The upper level is a fully associative cache of size M and lower level is an infinitely sized main memory. When there is a cache miss in upper level, the system employs an omniscient cache replacement policy to exchange a cache line in size B between the two levels. Parameters M and B are correlated by a tall cache assumption, i.e., M = Ω(B1+ϵ),
  • 3. where ϵ > 0 is a constant. In ideal cache model, serial cache complexity is calculated by summing up cache misses caused by four individual sub-quadrants at each recursion level, i.e., Q(n) = 4Q ( n 2 ). The recursive summation stops at some level when the problem size of a sub-quadrant just fits into the cache and its parent doesn’t, i.e., ∃n0 𝑠. 𝑡. n0 = 𝜖0 𝑀 ∧ 2𝑛0 > 𝑀, where ϵ0 ∈ (0,1], i.e., 0 < ϵ0 ≤ 1, is a constant. Because after this level, further recursive divide-and-conquer won’t cause any more cache misses than Q(n0) = O ( n0 B ) . Solving the recurrence, we have Q(n) = O ( n2 ϵ0BM ). Keep doing more-way divide-and-conquer, eventually the algorithm will reduce to n-way divide-and-conquer algorithm as shown on the right-hand side in Figure 1, which is essentially parallel looping algorithm without tiling. For n-way divide- and-conquer algorithm, the time complexity (span) reduces to T∞(𝑛) = 𝑂(𝑛) with serial cache complexity increased to Q(n) = O ( n2 B ). From above analyses, we can see that 2-way divide-and-conquer algorithm has the worst time complexity while the best cache complexity, parallel looping algorithm or n-way divide-and-conquer in Figure 1, on the contrary, has the best time complexity while the worst cache complexity. Traditional wisdom may suggest to tune a balanced point between these two extremes to get a good performance on a real machine. Apparently, the intuition behind balance is that we cannot get both optimal at the same time. wavefront wavefront Figure 2. Scheduling of classic 2-way divide-and-conquer algorithm for LCS on the left and cache-oblivious wavefront algorithm on the right. Solid arrows indicate true dependencies derived from the defining recurrence equations, while dashed arrows indicate false dependencies introduced by the scheduling of algorithm.
  • 4. Figure 3. Performance comparison of four algorithms for LCS, i.e., Parallel Loop without tiling, Blocked Loop (Parallel Loop with tiling), classic 2-way divide-and-conquer based cache-oblivious parallel (2-way COP) algorithm, and cache-oblivious wavefront (COW) algorithm. In the performance plot, we fix the same base case size for all four algorithms except parallel loop (without tiling) and use exactly the same non-inlined kernel function to compute the base case for all algorithms so that the only difference is how different algorithms schedule the base cases. In [1] we have shown that both optimal time and cache complexity is achievable at the same time via a more compacted scheduling as shown on the right-hand side of Figure 2. The new scheduling policy eliminates all false dependencies introduced by prior divide-and-conquer based cache-oblivious parallel algorithm and retains only true dependency from the defining recurrence equations. From a high level point of view, the new algorithm proceeds like dynamically unfolding sub-quadrants on the divide-and-conquer tree and the progress of unfolded sub-quadrants are aligned to a wavefront. In other words, the proceeding wavefront swept throughout the divide- and-conquer tree are generated from a 2-way divide-and-conquer algorithm so we name the technique ``cache-oblivious wavefront (COW for short)’’3. From Figure 3, we can see that by combining the best of both cache-oblivious and looping world, the cache-oblivious wavefront algorithm beat both classic 2-way divide-and-conquer based cache-oblivious parallel algorithm and parallel loop with tiling (Blocked Loop algorithm in Figure 3.) algorithm. Some natural questions following the direction of research are: Does this or similar technique apply to all divide-and-conquer based cache-oblivious parallel algorithm? What ``cache-obliviousness’’ can buy us and cost us? What’s the fundamental tradeoff between time and cache complexity? 
These are all fundamental problems I am working on. Recent progress in this direction includes a successful application of the cache-oblivious wavefront technique to some numerical algorithms, such as Cholesky factorization and LU factorization without pivoting.

Besides the continuing study of the fundamental tradeoff between time and cache complexity, I also have research interests in the wider area of parallel algorithms and data structures. For example, my recent study of Range 1 Query algorithms, a special case of the Range Partial Sum Query problem in which the discrete grid cells hold only the values 0 or 1, was published in COCOON'14 [3]. Another recent study, on weight balancing on boundaries and skeletons [4], is an inverse of the barycenter problem. The barycenter problem is: given a set of n weights W = {w1, w2, ..., wn} at arbitrary locations X = {x1, x2, ..., xn} on the boundary of an arbitrary multi-dimensional polygon, it is easy to calculate the barycenter in O(n) time by the formula x = (∑_{i=1}^{n} wi·xi) / (∑_{i=1}^{n} wi). The inverse problem asks: given an arbitrary point inside or outside a convex or concave polygon and the weight set W, how fast can we identify n locations X = {x1, x2, ..., xn} on the boundary of the polygon at which to place the n weights so that their barycenter is the given point? The results were published in SoCG'14 [4].

³ Thanks to Prof. Charles E. Leiserson at MIT CSAIL for coining the name.

References:
1) Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and Rezaul A. Chowdhury. Cache-Oblivious Wavefront: Improving Parallelism of Recursive Dynamic Programming Algorithms without Losing Cache-Efficiency. 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'15), Feb. 9-11, 2015, San Francisco, CA, USA.
2) Rezaul A. Chowdhury, Pramod Ganapathi, Yuan Tang, and Jesmin Jahan Tithi. Improving Parallelism of Recursive Stencil Algorithms without Sacrificing Cache Performance. 2nd Annual Workshop on Stencil Computations (WOSC'14), held in conjunction with SPLASH'14, Oct. 20-24, 2014, Portland, Oregon, USA. Published in the ACM Digital Library.
3) Michael A. Bender, Rezaul A. Chowdhury, Pramod Ganapathi, Samuel McCauley, and Yuan Tang. The Range 1 Query (R1Q) Problem. 20th International Computing and Combinatorics Conference (COCOON'14), Aug. 4-6, 2014, Atlanta, Georgia, USA.
4) Luis Barba, Otfried Cheong, Jean-Lou De Carufel, Michael Gene Dobbins, Rudolf Fleischer, Akitoshi Kawamura, Matias Korman, Yoshio Okamoto, János Pach, Yuan Tang, Takeshi Tokuyama, Sander Verdonschot, and Tianhao Wang. Weight Balancing on Boundaries and Skeletons. 30th Annual ACM Symposium on Computational Geometry (SoCG'14), June 8-11, 2014, Kyoto, Japan.
5) Pramod Ganapathi, Rezaul Chowdhury, and Yuan Tang. The R1Q Problem. 22nd Annual Fall Workshop on Computational Geometry (FWCG'12), Nov. 9, 2012, College Park, Maryland, USA.
6) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. The Pochoir Stencil Compiler. 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'11), 2011.
7) Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. Coding Stencil Computations Using the Pochoir Stencil-Specification Language. 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar'11), 2011.