Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures – A Design Methodology
When implementing computation-intensive algorithms on fine-grained parallel architectures, adjusting the resource-to-performance tradeoff is a major challenge. This paper proposes a methodology for handling these tradeoffs by adjusting parallelism at different levels. In a case study, interpolation kernels are implemented on a fine-grained architecture (an FPGA) using a high-level language (Mitrion-C). For both cubic and bi-cubic interpolation, one single-kernel, one cross-kernel, and two multi-kernel parallel implementations are designed and evaluated. Our results demonstrate that no single level of parallelism suffices for tradeoff adjustment; instead, the appropriate degree of parallelism at each level must be found according to the available resources and the performance requirements of the application. Basing the design on high-level programming simplifies the tradeoff process. This research is a step towards automating the choice of parallelization based on a combination of parallelism levels.
Detecting Occurrences of Refactoring with Heuristic Search (Shinpei Hayashi)
This document describes a technique for detecting refactorings between two versions of a program using heuristic search. Refactorings are detected by generating intermediate program states through applying refactorings, and finding a path from the original to modified program that minimizes differences. Structural differences are used to identify likely refactorings. Candidate refactorings are evaluated and applied to generate new states, with the search terminating when the state matches the modified program. A supporting tool was developed and a case study found the technique could correctly detect an actual series of refactorings between program versions.
Sentence-to-Code Traceability Recovery with Domain Ontologies (Shinpei Hayashi)
The document describes a technique for recovering traceability between natural language sentences and source code using domain ontologies. An automated tool was implemented and evaluated on a case study using the JDraw software. Results showed the technique worked well, recovering traceability between 7 sentences and code with higher accuracy than without using the ontology. The ontology helped improve recall and detect traceability in cases where word similarity alone did not work well. Future work is needed to evaluate on larger cases and domains.
Directive-based approach to Heterogeneous Computing (Ruymán Reyes)
The document discusses a directive-based approach to heterogeneous computing. It describes how applications used in HPC centers commonly use MPI and OpenMP programming models. It also discusses how complexity arises from mixing different Fortran dialects and the need for faster ways to migrate code to new architectures like accelerators without rewriting the code. The document proposes using directives to enhance legacy code for heterogeneous systems in a portable way.
20100522 from object_to_database_replication_pedone_lecture01-02 (Computer Science Club)
The document discusses moving from object replication to database replication, noting that object replication focuses on fault tolerance using non-transactional objects while database replication aims for both performance and fault tolerance using transactions. It outlines replication models for both objects and databases, describing consistency criteria like linearizability and sequential consistency for objects, and transaction properties like atomicity, consistency, isolation and durability for databases.
Recording Finer-Grained Software Evolution with IDE: An Annotation-Based Appr... (Shinpei Hayashi)
This document proposes an annotation-based approach to recording finer-grained software evolution using an IDE. Developers can classify their edit operations according to different modes, and the system structures the edits to generate source code deltas for each intentional change. A prototype implementation as an Eclipse plug-in automates reordering edits based on the modes. This approach aims to avoid mixed changesets and better capture the relationships between changes.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document discusses the implementation of an OFDM kernel for WiMAX systems. It begins with an introduction to OFDM and how it is used in WiMAX networks. It then provides an overview of the key components in the WiMAX physical layer, including bit-level processing, OFDMA symbol-level processing, and digital intermediate frequency processing blocks. It specifically focuses on the OFDM kernel, which includes the inverse fast Fourier transform, cyclic prefix insertion, fast Fourier transform, and cyclic prefix removal blocks. Finally, it discusses how FPGAs are well-suited for implementing OFDM kernels due to their high speed complex multiplication capabilities.
Energy-Efficient LDPC Decoder using DVFS for binary sources (IDES Editor)
This paper deals with reducing transmission power usage in wireless sensor networks. A system with FEC can provide a target reliability using less power than a system without FEC. We propose to study LDPC codes to provide reliable communication while saving power in sensor networks; as shown later, LDPC codes are more energy efficient than BCH codes. Another way to reduce the transmission cost is to compress the correlated data among a number of sensor nodes before transmission: a suitable source encoder that removes the redundant information bits can save transmission power. Such a system requires distributed source coding. We propose to apply LDPC codes to both distributed source coding and source-channel coding to obtain a two-fold energy saving. Source and channel coding with LDPC for two correlated nodes under an AWGN channel is implemented in this paper. An iterative decoding algorithm is used to decode the data, and its efficiency is compared with a new decoding algorithm, the layered decoding algorithm, which is based on the offset min-sum algorithm. Using the layered decoding algorithm and adaptive LDPC decoding for the AWGN channel reduces the decoding complexity and the number of iterations, so power is saved, and the decoder can be implemented in hardware.
This document provides an overview of a tutorial on VHDL synthesis, place and route for FPGA and ASIC technologies. The tutorial covers VHDL coding styles, FPGA synthesis, place and route, a demo of FPGA synthesis and place and route, ASIC synthesis, place and route, and a demo of ASIC synthesis and place and route. The outline indicates it will also cover conclusions and further reading.
This document discusses lexical analysis and scanners. It explains that a scanner's main task is to identify the tokens in a program, such as keywords, identifiers, numbers, and punctuation. It then describes how regular expressions can be used to formally define the tokens in a language and how deterministic finite automata (DFAs) can recognize these regular sets of tokens. The document provides examples of using DFA tables and regular expressions to define number tokens in Pascal and a simple C program for a lexical analyzer that uses a DFA to check for a double 'aa' in a string.
This document discusses core concepts of C++ including memory management, value categories, and memory storage types. It covers memory addressing modes in real mode and protected mode on early Intel processors. It also explains static, heap and stack memory storage in C++, structure padding, false sharing, and value categories including lvalues, rvalues, xvalues and prvalues introduced in C++11. Memory hierarchy from registers to hard drives is also outlined.
Here are the key points about scanf():
- scanf() is used to read/input values from the keyboard or standard input device.
- It follows the same format specifier conventions as printf() for data types (%d for int, %f for float, %c for char etc).
- The address-of operator & is used before variables in the argument list to tell scanf() to store the input directly into the memory locations of the variables.
- This is necessary because normally function arguments in C are passed by value, so scanf() would just get a copy of the variable instead of modifying the original one. The & takes the address to allow direct modification.
So in summary, scanf() reads formatted input from standard input using the same format specifiers as printf(), and it needs the addresses of its target variables (via &) so it can store the results directly into them.
TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVM (MediaEval2012)
This document describes a spoken web search system that uses dynamic time warping (DTW) and an unsupervised support vector machine (SVM). It consists of 3 sections:
1) System architecture - outlines the segmentation, feature extraction, SVM method, and searching algorithm components of the system.
2) Experimental results - provides results from testing the system but no details.
3) Conclusion - the concluding remarks for the system but no specifics are given.
Multinode Cooperative Communications with Generalized Combining Schemes (akrambedoui)
This document summarizes a graduation project presentation on cooperative communications using multiple relay nodes. It describes the system model involving a source, destination, and multiple relays. An incremental relaying protocol is proposed where relays transmit in successive phases if the combined signal strength from previous relays falls below a threshold. Analytical expressions for symbol error rate are developed for both maximal ratio combining and generalized selection combining schemes. Results show symbol error rate decreases as the number of relays increases. The average number of time slots needed per transmission is also analyzed. Finally, joint adaptive modulation and incremental relaying is proposed to further improve performance.
Verilog HDL (Hardware Description Language) training course for self-taught instruction. The user should be familiar with basic digital and logic design. It is helpful to have a Verilog simulator available while going through the examples.
NV_path_rendering is an OpenGL extension for CUDA-capable NVIDIA GPUs for performing resolution-independent 2D rendering. Standards such as Scalable Vector Graphics (SVG), PostScript, PDF, Adobe Flash, and TrueType fonts rely on path rendering. With NV_path_rendering, this important class of rendering is accelerated by the GPU in a way that co-exists with conventional 3D rendering.
For more information see:
http://developer.nvidia.com/nv-path-rendering
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio... (RSIS International)
In this paper, we design VLSI hardware for a novel RS decoding algorithm suitable for multi-Gb/s communication systems. We show that the performance benefit of the algorithm is fully realized when it is implemented in hardware, avoiding the extra processing time of the fetch-decode-execute cycle of traditional microprocessor-based computing systems. The new algorithm's lower time complexity, combined with its application-specific hardware implementation, makes it suitable for high-speed real-time systems with hard timing constraints. The design is implemented as digital hardware using VHDL.
This document discusses intentional software and its goal of separating domain knowledge (A) from implementation (B) to reduce complexity. It provides context on the history of software development from the 1950s to present. Key figures discussed include John von Neumann, John Backus, Grace Hopper, John McCarthy, and Peter Naur who developed early domain-specific languages and programming concepts. The founder of Intentional Software, Charles Simonyi, is also introduced for his work developing early graphical editors and research on intentional programming.
The document contains slides summarizing key concepts in security for distributed systems. It includes figures explaining common names used in security protocols like Alice and Bob, cryptography notations, digital signatures, encryption algorithms like TEA and stream ciphers, and security protocols like TLS, Kerberos, and WEP. Descriptions of public/private key certificates, digital signatures, and encryption standards are provided.
This document provides an introduction to VHDL, including:
- VHDL allows modeling and developing digital systems through modules that can be reused, with in/out ports and behavioral or structural specification.
- Models can be tested through simulation and used for synthesis.
- There are three ways to specify models: dataflow, behavioral, and structural. Behavioral models describe algorithms, structural models compose subsystems.
- A test bench applies inputs to verify a model's outputs through simulation.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Monday, May 14, 2012.
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Topics covered include the latest advances available for Cg 3.1, the OpenGL Shading Language (GLSL); programmable tessellation; improved support for Direct3D conventions; integration with Direct3D and CUDA resources; bindless graphics; and more. When you utilize the latest OpenGL innovations from NVIDIA in your graphics applications, you benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
This document proposes a thesis project to develop a transcoder that can efficiently transcode an H.264 bitstream to a VC-1 bitstream. The objective is to implement an H.264 to VC-1 transcoder for progressive compression. Motivation for the project includes the coexistence of different video coding standards like H.264, VC-1, and MPEG-2 in applications such as Blu-ray discs, which creates a need for transcoding between these formats. While there has been work on other types of transcoding, published work on H.264 to VC-1 transcoding is limited. The document discusses various transcoding architectures and argues that a cascaded pixel-domain architecture is best suited for the heterogeneous...
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N (VLSICS Design)
In this paper, we present a low-complexity quasi-cyclic low-density parity-check (QC-LDPC) encoder hardware design based on the Richardson–Urbanke lower-triangular algorithm for the IEEE 802.11n wireless LAN standard, for a block length of 648 and a code rate of 1/2. The LDPC encoder hardware implementation works at 301.433 MHz and can process a throughput of 12.12 Gbps. We apply the concept of multiplication by constant matrices in GF(2), which also optimizes the required hardware. The proposed QC-LDPC encoder architecture is compatible with high-speed applications, and it is less complex as it avoids the conventionally used block memories and cyclic shifters.
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...Mark Kilgard
This document provides an overview of the programming interface for NV path rendering, an OpenGL extension for accelerating vector graphics on GPUs. It describes the supported path commands, which are designed to match all major path standards. Paths can be specified explicitly from commands and coordinates, from path strings using grammars like SVG or PostScript, or by generating paths from font glyphs. Additional functions allow copying, interpolating, weighting, or transforming existing path objects.
1) The document proposes a Contract-extended Push-Pull-Clone (C-PPC) model for distributed collaborative editing that allows expressing usage restrictions through contracts.
2) In the C-PPC model, document modifications and contracts are logged, and the logs can be audited to detect violations of contracts and update trust levels between collaborators.
3) The model aims to make trust and contracts explicit in collaborative environments without requiring a central collaboration provider by expressing contracts "inside the system" and synchronizing data changes with contracts.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document presents an implementation of an Elliptic Curve Diffie-Hellman (ECDH) key exchange protocol using VB.NET. ECDH is based on the elliptic curve discrete logarithm problem and allows two parties to generate a shared secret key over an insecure channel. The implementation uses an elliptic curve group over the field F29 with parameters a=1 and b=1. It demonstrates the steps to generate and exchange public keys between two users to compute the same shared secret key. This allows encryption of messages using a symmetric key algorithm. ECDH is suitable for applications requiring security where resources are limited, as smaller key sizes provide the same level of security as larger keys in other cryptosystems.
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM... (ijcseit)
This paper presents the bit error rate (BER) performance of a low-density parity-check (LDPC) based space-time trellis coded 2×2 multiple-input multiple-output orthogonal frequency-division multiplexing (STTC-MIMO-OFDM) system on text message transmission. The system under investigation incorporates a 1/2-rate LDPC encoding scheme under various digital modulations (BPSK, QPSK and QAM) over an additive white Gaussian noise (AWGN) channel and other fading (Rayleigh and Rician) channels, for two transmit and two receive antennas. At the receiving section of the simulated system, Minimum Mean-Square-Error (MMSE) channel equalization is implemented to extract the transmitted symbols without enhancing the noise power level. The effectiveness of the proposed system is analyzed in terms of BER versus signal-to-noise ratio (SNR). The Matlab-based simulation study shows that the proposed system performs best with BPSK, as compared to the other digital modulation schemes, at relatively low SNRs under AWGN, Rayleigh and Rician fading channels. The transmitted text message is retrieved effectively at the receiver using the iterative sum-product LDPC decoding algorithm. As anticipated, the performance of the LDPC-based STTC-MIMO-OFDM system degrades as the noise power increases.
1. The study explored the influence of occupational stress and organizational climate on job satisfaction of managers and engineers at Indian Oil Corporation Limited in India.
2. The results found no significant difference in job satisfaction between managers and engineers, but found that managers perceived a more favorable organizational climate and lower occupational stress than engineers.
3. Higher-income managers reported greater job satisfaction than lower-income managers, but income had no influence on job satisfaction levels among engineers.
This document contains several traditional Spanish nursery rhymes and children's songs. It includes rhymes about a brown duckling, arroz con leche, frogs, and a sick doll, as well as the popular song "De Colores". The rhymes and songs use simple language and rhythms to entertain and instruct children.
This document provides an overview of a tutorial on VHDL synthesis, place and route for FPGA and ASIC technologies. The tutorial covers VHDL coding styles, FPGA synthesis, place and route, a demo of FPGA synthesis and place and route, ASIC synthesis, place and route, and a demo of ASIC synthesis and place and route. The outline indicates it will also cover conclusions and further reading.
This document discusses lexical analysis and scanners. It explains that a scanner's main task is to identify the tokens in a program, such as keywords, identifiers, numbers, and punctuation. It then describes how regular expressions can be used to formally define the tokens in a language and how deterministic finite automata (DFAs) can recognize these regular sets of tokens. The document provides examples of using DFA tables and regular expressions to define number tokens in Pascal and a simple C program for a lexical analyzer that uses a DFA to check for a double 'aa' in a string.
This document discusses core concepts of C++ including memory management, value categories, and memory storage types. It covers memory addressing modes in real mode and protected mode on early Intel processors. It also explains static, heap and stack memory storage in C++, structure padding, false sharing, and value categories including lvalues, rvalues, xvalues and prvalues introduced in C++11. Memory hierarchy from registers to hard drives is also outlined.
Here are the key points about scanf():
- scanf() is used to read/input values from the keyboard or standard input device.
- It follows the same format specifier conventions as printf() for data types (%d for int, %f for float, %c for char etc).
- The address of operator & is used before variables in the argument list to tell scanf() to store the input directly into the memory location of the variables.
- This is necessary because normally function arguments in C are passed by value, so scanf() would just get a copy of the variable instead of modifying the original one. The & takes the address to allow direct modification.
So in summary, scanf
TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVMMediaEval2012
This document describes a spoken web search system that uses dynamic time warping (DTW) and an unsupervised support vector machine (SVM). It consists of 3 sections:
1) System architecture - outlines the segmentation, feature extraction, SVM method, and searching algorithm components of the system.
2) Experimental results - provides results from testing the system but no details.
3) Conclusion - the concluding remarks for the system but no specifics are given.
Multinode Cooperative Communications with Generalized Combining Schemesakrambedoui
This document summarizes a graduation project presentation on cooperative communications using multiple relay nodes. It describes the system model involving a source, destination, and multiple relays. An incremental relaying protocol is proposed where relays transmit in successive phases if the combined signal strength from previous relays falls below a threshold. Analytical expressions for symbol error rate are developed for both maximal ratio combining and generalized selection combining schemes. Results show symbol error rate decreases as the number of relays increases. The average number of time slots needed per transmission is also analyzed. Finally, joint adaptive modulation and incremental relaying is proposed to further improve performance.
Verilog HDL (Hardware Description Language) Training Course for self-taught instructional. User should be familiar with basic digital and logic design. Helpful to have a Verilog simulator while going through examples.
NV_path_rendering is an OpenGL extension for CUDA-capable NVIDIA GPUs for performing resolution-independent 2D rendering. Standards such as Scalable Vector Graphics (SVG), PostScript, PDF, Adobe Flash, and TrueType fonts rely on path rendering. With NV_path_rendering, this important class of rendering is accelerated by the GPU in a way that co-exists with conventional 3D rendering.
For more information see:
http://developer.nvidia.com/nv-path-rendering
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
In this paper, we have designed the VLSI hardware for a novel RS decoding algorithm suitable for Multi-Gb/s Communication Systems. Through this paper we show that the performance benefit of the algorithm is truly witnessed when implemented in hardware thus avoiding the extra processing time of Fetch-Decode-Execute cycle of traditional microprocessor based computing systems. The new algorithm with less time complexity combined with its application specific hardware implementation makes it suitable for high speed real-time systems with hard timing constraints. The design is implemented as a digital hardware using VHDL
This document discusses intentional software and its goal of separating domain knowledge (A) from implementation (B) to reduce complexity. It provides context on the history of software development from the 1950s to present. Key figures discussed include John von Neumann, John Backus, Grace Hopper, John McCarthy, and Peter Naur who developed early domain-specific languages and programming concepts. The founder of Intentional Software, Charles Simonyi, is also introduced for his work developing early graphical editors and research on intentional programming.
The document contains slides summarizing key concepts in security for distributed systems. It includes figures explaining common names used in security protocols like Alice and Bob, cryptography notations, digital signatures, encryption algorithms like TEA and stream ciphers, and security protocols like TLS, Kerberos, and WEP. Descriptions of public/private key certificates, digital signatures, and encryption standards are provided.
This document provides an introduction to VHDL, including:
- VHDL allows modeling and developing digital systems through modules that can be reused, with in/out ports and behavioral or structural specification.
- Models can be tested through simulation and used for synthesis.
- There are three ways to specify models: dataflow, behavioral, and structural. Behavioral models describe algorithms, structural models compose subsystems.
- A test bench applies inputs to verify a model's outputs through simulation.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Monday, May 14, 2012.
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Topics covered include the latest advances available for Cg 3.1, the OpenGL Shading Language (GLSL); programmable tessellation; improved support for Direct3D conventions; integration with Direct3D and CUDA resources; bindless graphics; and more. When you utilize the latest OpenGL innovations from NVIDIA in your graphics applications, you benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
This document proposes a thesis project to develop a transcoder that can efficiently transcode a H.264 bitstream to a VC-1 bitstream. The objective is to implement a H.264 to VC-1 transcoder for progressive compression. Motivation for the project includes the coexistence of different video coding standards like H.264, VC-1, and MPEG-2 in applications such as Blu-ray discs, which creates a need for transcoding between these formats. While there has been work on other types of transcoding, published work on H.264 to VC-1 transcoding is limited. The document discusses various transcoding architectures and argues that a cascaded pixel domain architecture is best suited for the heterogeneous
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N (VLSICS Design)
In this paper, we present a low-complexity quasi-cyclic low-density parity-check (QC-LDPC) encoder hardware based on Richardson and Urbanke's lower-triangular algorithm, for the IEEE 802.11n wireless LAN standard at a block length of 648 and a code rate of 1/2. The LDPC encoder hardware implementation runs at 301.433 MHz and can process a throughput of 12.12 Gbps. We apply the concept of multiplication by constant matrices in GF(2), which also optimizes the required hardware. The proposed QC-LDPC encoder architecture is suitable for high-speed applications. This hardwired architecture is less complex because it avoids the conventionally used block memories and cyclic shifters.
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper... (Mark Kilgard)
This document provides an overview of the programming interface for NV path rendering, an OpenGL extension for accelerating vector graphics on GPUs. It describes the supported path commands, which are designed to match all major path standards. Paths can be specified explicitly from commands and coordinates, from path strings using grammars like SVG or PostScript, or by generating paths from font glyphs. Additional functions allow copying, interpolating, weighting, or transforming existing path objects.
1) The document proposes a Contract-extended Push-Pull-Clone (C-PPC) model for distributed collaborative editing that allows expressing usage restrictions through contracts.
2) In the C-PPC model, document modifications and contracts are logged, and the logs can be audited to detect violations of contracts and update trust levels between collaborators.
3) The model aims to make trust and contracts explicit in collaborative environments without requiring a central collaboration provider by expressing contracts "inside the system" and synchronizing data changes with contracts.
International Journal of Computational Engineering Research (IJCER) (ijceronline)
This document presents an implementation of an Elliptic Curve Diffie-Hellman (ECDH) key exchange protocol using VB.NET. ECDH is based on the elliptic curve discrete logarithm problem and allows two parties to generate a shared secret key over an insecure channel. The implementation uses an elliptic curve group over the field F29 with parameters a=1 and b=1. It demonstrates the steps to generate and exchange public keys between two users to compute the same shared secret key. This allows encryption of messages using a symmetric key algorithm. ECDH is suitable for applications requiring security where resources are limited, as smaller key sizes provide the same level of security as larger keys in other cryptosystems.
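The summary's parameters (the curve y² = x³ + x + 1 over F_29, i.e. a = 1 and b = 1) are small enough to sketch the whole exchange in a few lines. The base point and the private keys below are illustrative choices, not values from the document:

```python
# Toy ECDH over y^2 = x^3 + x + 1 on F_29 (a = 1, b = 1 as in the summary).
P_MOD = 29
A = 1

def ec_add(p, q):
    """Add two curve points; None represents the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                     # p + (-p) = point at infinity
    if p == q:
        s = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    return (x3, (s * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, p):
    """Scalar multiplication by double-and-add."""
    acc = None
    while k:
        if k & 1:
            acc = ec_add(acc, p)
        p = ec_add(p, p)
        k >>= 1
    return acc

G = (0, 1)                   # on the curve: 1^2 = 0^3 + 0 + 1 (mod 29)
d_a, d_b = 5, 7              # toy private keys for the two users
Q_a, Q_b = ec_mul(d_a, G), ec_mul(d_b, G)   # public keys, exchanged openly
secret_a = ec_mul(d_a, Q_b)  # user A computes d_a * Q_b
secret_b = ec_mul(d_b, Q_a)  # user B computes d_b * Q_a
assert secret_a == secret_b  # both arrive at the same shared point
```

The shared point (or its x-coordinate) would then seed a symmetric cipher, as the summary describes; a real deployment uses a standardized curve over a field of at least ~256 bits.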
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM... (ijcseit)
This paper presents the bit error rate (BER) performance of a low-density parity-check (LDPC) based space-time trellis coded 2×2 multiple-input multiple-output orthogonal frequency-division multiplexing (STTC-MIMO-OFDM) system on text message transmission. The system under investigation incorporates a 1/2-rate LDPC encoding scheme under various digital modulations (BPSK, QPSK and QAM) over an additive white Gaussian noise (AWGN) channel and fading (Rayleigh and Rician) channels, for two transmit and two receive antennas. At the receiving section of the simulated system, minimum mean-square-error (MMSE) channel equalization is implemented to extract the transmitted symbols without enhancing the noise power level. The effectiveness of the proposed system is analyzed in terms of BER versus signal-to-noise ratio (SNR). The MATLAB-based simulation study shows that the proposed system performs best with BPSK, compared with the other digital modulation schemes, at relatively low SNRs under AWGN, Rayleigh and Rician fading channels. The transmitted text message is retrieved effectively at the receiver using the iterative sum-product LDPC decoding algorithm. As expected, the performance of the LDPC-based STTC-MIMO-OFDM system degrades as the noise power increases.
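As background on the MMSE equalization step mentioned in the abstract, a one-tap (flat-channel) version can be sketched as follows. The channel coefficient and noise variance are made-up values; the paper's system applies this per OFDM subcarrier:

```python
# One-tap MMSE equalizer for a flat channel y = h*x + n (illustrative values,
# not the paper's simulation setup).
def mmse_equalize(y, h, noise_var):
    # MMSE weight: conj(h) / (|h|^2 + noise_var). Unlike zero-forcing (1/h),
    # the noise_var term prevents noise amplification when |h| is small.
    w = h.conjugate() / (abs(h) ** 2 + noise_var)
    return w * y

h = 0.5 + 0.5j                              # illustrative fading coefficient
y = h * 1.0                                 # noiseless received BPSK symbol x = +1
x_hat = mmse_equalize(y, h, noise_var=0.1)  # biased toward zero, but sign intact
```

With `noise_var = 0` the weight reduces to the zero-forcing equalizer 1/h; a positive noise variance shrinks the estimate toward zero instead of amplifying noise, which is the property the abstract refers to.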
1. The study explored the influence of occupational stress and organizational climate on job satisfaction of managers and engineers at Indian Oil Corporation Limited in India.
2. The results found no significant difference in job satisfaction between managers and engineers, but found that managers perceived a more favorable organizational climate and lower occupational stress than engineers.
3. Higher-income managers reported greater job satisfaction than lower-income managers, but income had no influence on job satisfaction levels among engineers.
This document contains several traditional Spanish nursery rhymes and children's songs. It includes rhymes about a brown duckling ("patito café"), "arroz con leche", frogs, and a sick doll, as well as the popular song "De Colores". The rhymes and songs use simple language and rhythms to entertain and instruct children.
The document summarizes a study on leadership styles from the perspective of Brazilian executives. Four main styles were identified: the people-oriented leader, the visionary leader, the mobilizing leader, and the conciliatory leader. It also discusses current leadership challenges, such as dealing with cultural diversity and building diverse teams.
The document presents an introduction to arbitration as an alternative dispute-resolution mechanism in the legal field. It briefly explains the theories of arbitration and defines it as a process by which the parties voluntarily submit a dispute to the decision of one or more arbitrators. It also classifies arbitration as international or domestic, and as institutional or ad hoc, according to the parties involved.
This document deals with educational technology. It discusses how technology has transformed the world in ambivalent ways and how schooling itself can be regarded as a social technology. It argues that fully understanding educational technology requires a broad view that includes not only technological artifacts but also the symbolic and organizational technologies that shape education and society.
Through the My.SSS portal on the Philippine Social Security System (SSS) website, members can register to access online services. To register, members provide required personal details and registration is pending validation against SSS records. Once registered, members can view contribution and membership records, set branch appointments, and submit transactions online for greater convenience beyond office hours. Employers can also register to view records and submit reports. Registration allows contact mainly by email.
This document is about private-label ("white-label") brands. It explains that these are product brands that supermarkets sell at lower prices than the original brands, even though they are often the same products in different packaging. It mentions advantages such as lower prices, but also disadvantages such as a perception of weaker quality control. Finally, it notes that the rise of these brands has hurt some original brands by reducing their sales and advertising.
The document summarizes Tiark Rompf's talk on using the Delite framework to build domain-specific languages (DSLs) that can be optimized and compiled to different low-level architectures. It provides examples of existing DSLs created with Delite for machine learning, data querying, graph analysis, and collections. The talk discussed how DSLs allow writing programs at a high-level that can then be optimized and generated into high-performance code.
The document provides an introduction to DirectX and its components for 3D graphics programming. DirectX includes Direct3D for 3D rendering, DXGI for managing graphics resources, and HLSL for writing shaders. Direct3D uses a graphics pipeline with stages like vertex shading, rasterization, and pixel shading. Programmers interface with Direct3D through COM objects and interfaces.
This document discusses RISC-V instructions, specifically arithmetic and logical instructions. It begins with background on RISC vs CISC architectures and an overview of the RISC-V ISA. The document then covers different categories of RISC-V instructions like data processing, memory access, and branches. It provides examples of RISC-V arithmetic instructions like add and explains the instruction format and fields. Register conventions and an example register file implementation in Verilog are also summarized.
This document summarizes the key steps and optimizations required to develop an efficient FPGA accelerator for non-binary LDPC decoding using high-level synthesis. It describes how a naive implementation based solely on a software description achieves limited performance, and how careful tuning of the high-level application description through optimizations like directive-based transformations can achieve performance close to hand-optimized RTL. The document aims to guide readers on designing efficient HLS-based hardware accelerators by considering limitations of the target device and characteristics of the application.
This document discusses optimizing a non-binary LDPC decoder implementation using high-level synthesis with Vivado HLS. It begins with an overview of non-binary LDPC codes and decoding algorithms. It then discusses the challenges of implementing an efficient LDPC decoder on an FPGA using HLS and the need for code refactoring and directives to optimize resource utilization and performance. The document provides guidelines for mapping the decoding algorithm to HLS C code and optimizing the implementation through techniques like loop unrolling and pipelining. It shows that a naive implementation provides limited performance but tuned optimizations can achieve performance comparable to hand-coded RTL.
DryadLINQ allows users to write LINQ queries over distributed data using Dryad for execution. It provides serialization for data types and factories, channel readers and writers for communication between vertices, and context for LINQ queries to run over distributed data and channels. Ongoing research includes performance modeling, scheduling, profiling, incremental computation, and hardware optimizations.
I presented this talk a while back, at S4 Fall 2012.
S4 is a San Francisco/Bay Area local meetup event for security professionals. Check out the past events here.
http://s4con.blogspot.com/
This document provides an overview of various scientific programming models for distributed computing. It introduces reference parallel programming models like MPI and OpenMP, and discusses their strengths and weaknesses. Novel programming models are also covered, such as Microsoft Dryad, MapReduce, and COMP Superscalar (COMPSs). The document concludes that while scientific problems are complex, reference models are often unsuitable, leading to new flexible models that aim to simplify programming workflows for distributed systems.
This document proposes a project to investigate using LLVM as a shader virtual machine for the RenderMan Shading Language (RSL). The goals are to compile RSL to LLVM bitcode, implement runtime support including just-in-time compilation and optimization, and explore LLVM's performance and functionality for shader VMs. Key aspects discussed include compiling RSL to LLVM IR, handling built-in functions, runtime JIT compilation and specialization, code generation strategies like SPMD and SIMD, and interfacing RSL and LLVM. The current status and projected timeline are also presented.
Low power LDPC decoder implementation using layer decoding (ajithc0003)
This document proposes a low-power LDPC decoder implementation using layered decoding. It discusses how LDPC codes can be used for reliable data transmission and are finding increasing use. It describes layered decoding as an efficient approach that can decrease power consumption. The proposed method is vectored layer decoding, which overcomes limitations of traditional layered decoding. It involves encoding data using an LDPC generator matrix derived from the parity check matrix. Simulations were conducted to generate the generator matrix and encode data. The goal is to efficiently implement a low-power LDPC decoder using this vectored layer decoding approach.
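The step of deriving a generator matrix from the parity-check matrix can be illustrated on a toy (7,4) code in systematic form. This is not an actual LDPC matrix, just the standard H = [P | I] → G = [I | Pᵀ] construction over GF(2):

```python
# Toy (7,4) code: derive the generator matrix G from a systematic
# parity-check matrix H = [P | I], then check orthogonality G * H^T = 0.
P = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]]

def transpose(M):
    return [list(col) for col in zip(*M)]

def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def hconcat(A, B):
    return [ra + rb for ra, rb in zip(A, B)]

def gf2_matmul(A, B):
    """Matrix product over GF(2): each entry is a parity (XOR) of products."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) % 2 for col in Bt]
            for row in A]

H = hconcat(P, identity(3))             # 3 x 7 parity-check matrix [P | I]
G = hconcat(identity(4), transpose(P))  # 4 x 7 generator matrix [I | P^T]

# Every codeword c = m * G satisfies H * c^T = 0, because G * H^T = P^T + P^T = 0:
assert all(bit == 0 for row in gf2_matmul(G, transpose(H)) for bit in row)
```

Encoding is then a GF(2) vector-matrix product, e.g. `gf2_matmul([[1, 0, 1, 1]], G)` appends three parity bits to the four message bits; real LDPC encoders exploit the sparsity and quasi-cyclic structure of H rather than a dense G.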
Building High-Performance Language Implementations With Low Effort (Stefan Marr)
This talk shows how languages can be implemented as self-optimizing interpreters, and how Truffle and RPython just-in-time compile these interpreters to efficient native code.
Programming languages are never perfect, so people start building domain-specific languages to be able to solve their problems more easily. However, custom languages are often slow, or take enormous amounts of effort to be made fast by building custom compilers or virtual machines.
With the notion of self-optimizing interpreters, researchers proposed a way to implement languages easily and generate a JIT compiler from a simple interpreter. We explore the idea and experiment with it on top of RPython (of PyPy fame) with its meta-tracing JIT compiler, as well as Truffle, the JVM framework of Oracle Labs for self-optimizing interpreters.
In this talk, we show how a simple interpreter can reach the same order of magnitude of performance as the highly optimizing JVM for Java. We discuss the implementation on top of RPython as well as on top of Java with Truffle so that you can start right away, independent of whether you prefer the Python or JVM ecosystem.
While our own experiments focus on SOM, a little Smalltalk variant that keeps things simple, other people have used this approach to improve the peak performance of JRuby, or to build languages such as JavaScript, R, and Python 3.
This document discusses implementing concurrency abstractions for programming multi-core embedded systems in Scheme. It describes modifying the Bit Scheme interpreter to support the event-driven XMOS chip. New primitives were added for message passing, I/O, and time to exploit the chip's concurrency. The modified interpreter and compiler allow hardware to be programmed from Scheme and were demonstrated with an LED pulse width modulation application.
1. CUDA provides a programming environment and APIs that allow developers to leverage GPUs for general purpose computing. The CUDA C API offers both a high-level runtime API and a lower-level driver API.
2. CUDA programs define kernels that execute many parallel threads on the GPU. Threads are organized into blocks that can cooperate through shared memory, and blocks are organized into grids.
3. The CUDA memory model includes a hierarchy from fast per-thread registers to slower shared, global, and host memories. This hierarchy allows threads within blocks to communicate efficiently through shared memory.
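The indexing scheme in point 2 can be mimicked on the CPU to see how a global element index is derived from block and thread coordinates. This is a plain-Python emulation, not actual CUDA code:

```python
# CPU emulation of CUDA's grid/block/thread indexing for a vector addition:
# each "thread" handles the element at blockIdx.x * blockDim.x + threadIdx.x.
def vector_add_emulated(a, b, block_dim, grid_dim):
    n = len(a)
    out = [0] * n
    for block_idx in range(grid_dim):             # blocks within the grid
        for thread_idx in range(block_dim):       # threads within a block
            i = block_idx * block_dim + thread_idx  # global index, as in CUDA
            if i < n:                             # bounds guard: the last block
                out[i] = a[i] + b[i]              # may have surplus threads
    return out
```

In a real kernel the two loops disappear: every (block, thread) pair executes concurrently on the GPU, and the `if i < n` guard is written the same way inside the kernel body.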
Vishwanath Swamy is an experienced Electronics and Communication Engineering professional seeking a career opportunity where he can contribute his 3.5 years of experience in research and development. He is currently working as an Engineer Research and Development at Indian Telephone Industries in Bangalore. He has expertise in areas such as FPGA programming using Verilog and VHDL, system verification, DDR3, and layout design using Cadence tools. He is a quick learner, self-motivated, and has strong analytical and problem-solving skills. His previous projects include work on next-generation networks and programmable multiplexers for the Indian Army and Indian Railways.
Madeo - a CAD Tool for Reconfigurable Hardware (ESUG)
This document discusses Madeo, a CAD tool for programming reconfigurable hardware using an object-oriented methodology. Madeo was developed over 10 years and allows describing circuits as objects in a high-level language. It supports various reconfigurable architectures by modeling them and can generate configuration bitstreams. The tool aims to improve on existing solutions by providing retargetability, exploiting flexibility of reconfigurable hardware, and applying principles like code reuse and portability through a virtual machine-like approach. The document outlines key aspects of Madeo like its architecture modeling, compilation flow, and results demonstrating its capabilities on different targets. It also discusses lessons learned like using meta-modeling for evolution and interchange support.
OpenGL - point & line design
Introduces the construction of display devices (CRT, flat-panel, LCD, PDP, projector, ...)
and shows how rendering on them is built from the basic graphics primitives: points and lines.
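Line rendering on raster displays classically uses an integer-only algorithm such as Bresenham's; a minimal sketch (the slides' own examples may differ):

```python
def bresenham(x0, y0, x1, y1):
    """Integer-only line rasterization (Bresenham's algorithm).

    Returns the pixels approximating the segment. Uses only additions and
    comparisons, which is why it maps well onto simple display hardware."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                      # running error term
    pixels = []
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:                   # step along x
            err += dy
            x0 += sx
        if e2 <= dx:                   # step along y
            err += dx
            y0 += sy
    return pixels
```

For example, `bresenham(0, 0, 3, 1)` walks the shallow segment one pixel column at a time, stepping down in y exactly once.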
Scalability for All: Unreal Engine* 4 with Intel (Intel® Software)
Unreal Engine* 4 is a high-performance game engine for game developers. Learn how Intel and Epic Games* worked together to improve engine performance both for CPUs and GPUs and how developers can take advantage of it.
High Performance FPGA Based Decimal-to-Binary Conversion Schemes (Silicon Mentor)
Here we present a high-performance FPGA-based decimal-to-binary conversion scheme to support BCD arithmetic on binary hardware. The architecture presented here requires fewer LUTs than comparable designs, and delay is also reduced by using shifters in place of multipliers.
For more info visit us at:
http://www.siliconmentor.com/
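The shifters-for-multipliers substitution the summary mentions can be seen in the BCD-to-binary direction, where each multiply-by-10 decomposes into two shifts and an add. This is a software sketch of the hardware idea, not the paper's architecture:

```python
def bcd_to_binary(digits):
    """Convert BCD digits (most-significant first) to a binary integer."""
    value = 0
    for d in digits:
        # value * 10 realized as (value << 3) + (value << 1):
        # shifts and adders only, no general-purpose multiplier.
        value = (value << 3) + (value << 1) + d
    return value
```

For example, `bcd_to_binary([2, 5, 9])` accumulates ((2*10) + 5)*10 + 9 = 259 using nothing but shift-and-add steps.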
This document summarizes a presentation given by Mihai Budiu on DryadLINQ and cloud computing. The presentation introduced DryadLINQ as an integration of LINQ queries with Dryad, Microsoft's data-parallel execution engine. This allows expressing data-parallel algorithms like joins, aggregations, and machine learning in a high-level language like C#, and having them automatically parallelized and run on a cluster. The document provides examples of expressing algorithms like histograms, word counting, and machine learning using DryadLINQ.
Similar to Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures ─A Design Methodology (20)
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar, with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
A Comprehensive Guide to DeFi Development Services in 2024 (Intelisync)
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Best 20 SEO Techniques To Improve Website Visibility In SERP (Pixlogix Infotech)
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Digital Marketing Trends in 2024 | Guide for Staying Ahead (Wask)
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application... (Alex Pruden)
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol, based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm, no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin... (Tatiana Kojar)
Skybuffer AI, built on the robust SAP Business Technology Platform (SAP BTP), is the latest and most advanced version of our AI development, reaffirming our commitment to delivering top-tier AI solutions. Skybuffer AI harnesses all the innovative capabilities of the SAP BTP in the AI domain, from Conversational AI to cutting-edge Generative AI and Retrieval-Augmented Generation (RAG). It also helps SAP customers safeguard their investments into SAP Conversational AI and ensure a seamless, one-click transition to SAP Business AI.
With Skybuffer AI, various AI models can be integrated into a single communication channel such as Microsoft Teams. This integration empowers business users with insights drawn from SAP backend systems, enterprise documents, and the expansive knowledge of Generative AI. And the best part of it is that it is all managed through our intuitive no-code Action Server interface, requiring no extensive coding knowledge and making the advanced AI accessible to more users.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Dandelion Hashtable: beyond billion requests per second on a commodity server (Antonios Katsarakis)
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
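For contrast with the open-addressing designs the deck argues against, a plain closed-addressing (chained) table looks like this. The real DLHT adds lock-free operations, cache-line-bounded chains, prefetching, and parallel resizing, none of which this toy sketch attempts; it only shows why chaining lets a delete free its slot immediately:

```python
# Minimal closed-addressing hashtable: each bucket holds a chain of entries.
class ChainedTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # update an existing entry in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise extend the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        # The slot is reusable immediately -- no tombstones, unlike
        # open addressing, where a freed slot can break probe sequences.
        bucket[:] = [(k, v) for k, v in bucket if k != key]

table = ChainedTable()
table.put("alice", 1)
table.put("alice", 2)   # update in place
table.delete("bob")     # deleting a missing key is a no-op
```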
leewayhertz.com - AI in predictive maintenance: use cases, technologies, benefits ... (alexjohnson7307)
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures – A Design Methodology
1. Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
– A Design Methodology
Fahad Islam Cheema, Zain-Ul-Abdin,
Professor Bertil Svensson
Halmstad University, Halmstad, Sweden
2. Engr. Fahad Islam Cheema
4-Year Bachelor in Computer Engineering (BCE) from COMSATS Lahore in 2006
2-Year Industrial Experience as Embedded Software/System Engineer in Lahore and Islamabad
Five Rivers Technologies Lahore
Streaming Networks Islamabad
Delta Indus Systems Lahore
Master's in Computer System Engineering from Halmstad University, Sweden, in 2009
Master's standalone thesis accepted for publication at the FPGAWorld 2010 international conference (www.fpgaworld2010.com), Copenhagen, Denmark, September 2010
1-Year Academic Experience
Halmstad University Sweden
LUMS Lahore
Bahria University Islamabad
3. Engr. Fahad Islam Cheema
3-Year Experience (Embedded Systems)
2-year Industrial (Streaming Networks)
1-Year Academic
Universities (Halmstad, LUMS, Bahria)
Courses
Linux Programming and Shell Scripting, OS Administration, Databases
Embedded Systems
System Programming
17-Year Education (Computer Engineering)
Masters From Sweden
Computer Engineering from COMSATS
Specialization in Embedded Systems
PEC # Comp/6774
1 Publication
Master's thesis accepted for publication in FPGAWorld 2010
4. Resource to Performance Tradeoff
Adjustment for Fine-Grained Architectures
– A Design Methodology
Fahad Islam Cheema, Zain-Ul-Abdin,
Professor Bertil Svensson
Halmstad University, Halmstad, Sweden
5. Agenda
Overview and Problem Definition
Main Idea
Experimental Setup
Mitrion Parallel Architecture
Interpolation Kernels
Parallelization Levels
Conclusions
Future Work
6. Overview
Motivation
Computation-intensive algorithms
Fine-grained architectures
Problem Definition
Parallelism
Resource to Performance Tradeoffs
Hardware/logic gates to performance tradeoffs
Memory to performance tradeoffs
10. Main Idea
Parallelism Levels
Bit-Level Parallelism (BLP)
Kernel-Level Parallelism (KLP)
Problem-Level Parallelism (PLP)
Maximum parallelism at one level is not the ultimate solution
Customized parallelism at different levels
Can better adjust resource-performance tradeoffs
Gates-performance tradeoff
11. Main Idea (Contd.)
Maximum parallelism at one level is not the ultimate solution
Combine parallelism at different parallelism levels to produce parallelization levels
Parallelization Levels
Single Kernel (SKZ)
Cross Kernel (CKZ)
Multi-SKZ
Multi-CKZ
Figure-3: Parallelism and Parallelization Levels
12. Experimental Setup
Computation-intensive algorithms
Interpolation Kernels
Fine-Grained Architecture
FPGA
Fine-Grained Parallelism
Mitrion Virtual Processor
Extracts fine-grained parallelism
Mitrion-C high-level language (HLL)
Hardware Platform
Cray XD1 supercomputer with Virtex-4 FPGA
13. Interpolation Kernels
What is interpolation
Process of calculating new values within the range of available values [1]
Cubic interpolation
Bi-cubic interpolation
Applying cubic in 2D
5 cubic kernels
Figure-4: 2D Interpolation
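The slides do not spell out the kernel equations, so as an illustration here is a minimal Python sketch of a cubic kernel (using the common Catmull-Rom form, which is our assumption, not necessarily the paper's) and of bi-cubic interpolation built from the "5 cubic kernels" the slide mentions: four along the rows of a 4x4 patch, then one along the resulting column.

```python
def cubic(p0, p1, p2, p3, t):
    # Catmull-Rom cubic: interpolates between p1 and p2 for t in [0, 1]
    # (assumed kernel form; the slides do not give one).
    return p1 + 0.5 * t * (
        p2 - p0
        + t * (2 * p0 - 5 * p1 + 4 * p2 - p3
               + t * (3 * (p1 - p2) + p3 - p0)))

def bicubic(patch, tx, ty):
    # Bi-cubic = cubic applied in 2D: 4 row kernels + 1 column kernel,
    # matching the "5 cubic kernels" of the slide.
    rows = [cubic(*patch[i], tx) for i in range(4)]
    return cubic(*rows, ty)
```

At t = 0 the kernel returns p1 exactly, and on linearly spaced samples such as (0, 1, 2, 3) it reproduces the straight line, e.g. cubic(0, 1, 2, 3, 0.5) gives 1.5.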
14. Mitrion Parallel Architecture
Mitrion Virtual Processor (MVP)
Fine-Grained, Soft-Core Processor
Almost 60 IP blocks defined in HDL [2]
Non-von Neumann architecture
Mitrion-C
HLL for FPGA
Data dependence instead of order-of-execution
Parallelism Language Constructs [3]
Pipelining
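The "data dependence instead of order-of-execution" point can be illustrated outside Mitrion-C as well. Below is a toy Python sketch (the mini-"language" and its three statements are hypothetical and are not Mitrion-C syntax) in which a statement fires as soon as its inputs exist, regardless of the textual order in which it was written:

```python
# Each "statement": name -> (function of current state, names it depends on).
# "y" is written first but cannot fire until "x" has been produced.
defs = {
    "y": (lambda s: s["x"] * s["x"], ["x"]),
    "x": (lambda s: s["a"] + 1,      ["a"]),
    "z": (lambda s: s["y"] + s["x"], ["y", "x"]),
}

def dataflow_eval(defs, state):
    pending = dict(defs)
    while pending:
        # Fire every statement whose inputs are all available.
        ready = [n for n, (_, deps) in pending.items()
                 if all(d in state for d in deps)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependences")
        for n in ready:
            fn, _ = pending.pop(n)
            state[n] = fn(state)
    return state
```

Starting from a = 2 this produces x = 3, y = 9, z = 12; all statements in one ready set are mutually independent, which is the kind of parallelism a dataflow machine like the MVP can exploit in hardware.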
15. Parallelization Levels
Single Kernel Parallelization (SKZ)
Only kernel-level parallelism (KLP)
All data-independent blocks are internally parallel but externally pipelined
Figure-5: SKZ
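A software analogue of SKZ, as we read the slide: within one cubic kernel the coefficient computations are data independent (internally parallel), while successive sample windows stream through the kernel one per iteration (externally pipelined). The coefficient form below is the standard Catmull-Rom expansion, which is our assumption:

```python
def skz_cubic_stream(windows, t):
    # One kernel instance; windows of 4 samples stream through it (pipelining).
    for p0, p1, p2, p3 in windows:
        # Four data-independent blocks: on the FPGA these map to
        # parallel multiplier/adder trees (kernel-level parallelism).
        a = -0.5 * p0 + 1.5 * p1 - 1.5 * p2 + 0.5 * p3
        b = p0 - 2.5 * p1 + 2.0 * p2 - 0.5 * p3
        c = -0.5 * p0 + 0.5 * p2
        d = p1
        # The dependent Horner summation forms the pipelined tail of the kernel.
        yield ((a * t + b) * t + c) * t + d
```

In hardware the four blocks cost four parallel arithmetic trees but accept a new window every cycle; serializing them would save logic at the price of throughput, which is exactly the gates-to-performance tradeoff the methodology adjusts.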
16. Parallelization Levels (Contd.)
Cross Kernel Parallelization (CKZ)
Extend a kernel by mixing more than one kernel
Replicate computation-intensive, data-independent blocks
Balance between resources and computation
Figure-6: CKZ
18. Parallelization Levels (Contd.)
Multi-CKZ
Replicate kernels which already have CKZ
Figure-8: Multi-CKZ
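As we read Figure-8, Multi-CKZ instantiates several CKZ bicubic units side by side, each with its own read-compute-write loop over a share of the output pixels. A self-contained Python sketch of that replication (the unit count and the round-robin work distribution are our assumptions for illustration):

```python
def cubic(p0, p1, p2, p3, t):
    # Catmull-Rom cubic kernel (assumed form; the slides do not give one).
    return p1 + 0.5 * t * (p2 - p0 + t * (2*p0 - 5*p1 + 4*p2 - p3
                                          + t * (3*(p1 - p2) + p3 - p0)))

def multi_ckz(patches, tx, ty, n_units=4):
    # n_units models the number of replicated CKZ hardware units; each unit
    # reads, interpolates, and writes every n_units-th output (round-robin).
    out = [None] * len(patches)
    for unit in range(n_units):                        # one loop body = one replica
        for i in range(unit, len(patches), n_units):
            rows = [cubic(*r, tx) for r in patches[i]] # 4 row kernels (CKZ)
            out[i] = cubic(*rows, ty)                  # 1 column kernel
    return out
```

In software the units run one after another, but in the FPGA each replica is its own logic, so throughput scales roughly with n_units while resource usage does too: the design-complexity cost noted in the conclusions.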
20. Conclusions
Specific conclusions
For very limited resources, SKZ is better
CKZ is better for applications with highly unbalanced computation distribution
SKZ and CKZ are better for large applications
Multi-CKZ can provide a high level of parallelism at the cost of design complexity
Multi-SKZ and Multi-CKZ are attractive for small real-time applications
Using parallelization levels
Can adjust trade-offs
Can achieve highly custom parallelism
A mix of parallelization levels can produce
Application-specific parallelism
Resource-specific parallelism
21. Future Work
Automation of parallelization levels
Parallelization levels to deal with other tradeoffs
Generalized parallelization levels for all applications
Generalized parallelization levels for graphics processors to adjust tradeoffs
Floating point and accuracy