Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures: A Design Methodology
When implementing computation-intensive algorithms on fine-grained
parallel architectures, adjusting the resource-to-performance
tradeoff is a major challenge. This paper proposes a methodology
for dealing with some of these tradeoffs by adjusting parallelism
at different levels. In a case study, interpolation kernels are
implemented on a fine-grained architecture (FPGA) using a
high-level language (Mitrion-C). For both cubic and bi-cubic
interpolation, one single-kernel, one cross-kernel and two
multi-kernel parallel implementations are designed and evaluated.
Our results demonstrate that no single level of parallelism is
sufficient for tradeoff adjustment. Instead, the appropriate degree
of parallelism at each level needs to be found, according to the
available resources and the performance requirements of the
application. Basing the design on high-level programming
simplifies the tradeoff process. This research is a step towards
automating the choice of parallelization based on a combination of
parallelism levels.

Transcript

  • 1. Resource to Performance Tradeoff Adjustment for Fine-Grained Architectures: A Design Methodology
      Fahad Islam Cheema, Zain-Ul-Abdin, Professor Bertil Svensson
      Halmstad University, Halmstad, Sweden
  • 2. Engr. Fahad Islam Cheema
      • 4-year Bachelor in Computer Engineering (BCE) from COMSATS Lahore in 2006
      • 2 years of industrial experience as Embedded Software/System Engineer in Lahore and Islamabad
          • Five Rivers Technologies, Lahore
          • Streaming Networks, Islamabad
          • Delta Indus Systems, Lahore
      • Master's in Computer System Engineering from Halmstad University, Sweden, in 2009
      • Master's standalone thesis accepted for publication at the FPGAWorld 2010 international conference
          • www.fpgaworld2010.com
          • Copenhagen, Denmark, September 2010
      • 1 year of academic experience
          • Halmstad University, Sweden
          • LUMS, Lahore
          • Bahria University, Islamabad
  • 3. Engr. Fahad Islam Cheema
      • 3 years of experience (embedded systems)
          • 2 years industrial (Streaming Networks)
          • 1 year academic
              • Universities (Halmstad, LUMS, Bahria)
              • Courses
                  • Linux Programming and Shell Scripting, Administration of OS, Databases
                  • Embedded Systems
                  • System Programming
      • 17 years of education (computer engineering)
          • Master's from Sweden
          • Computer Engineering from COMSATS
          • Specialization in embedded systems
      • PEC # Comp/6774
      • 1 publication
          • Master's thesis accepted for publication in FPGAWorld2010
  • 5. Agenda
      • Overview and Problem Definition
      • Main Idea
      • Experimental Setup
      • Mitrion Parallel Architecture
      • Interpolation Kernels
      • Parallelization Levels
      • Conclusions
      • Future Work
  • 6. Overview
      • Motivation
          • Computation-intensive algorithms
          • Fine-grained architectures
      • Problem Definition
          • Parallelism
          • Resource-to-performance tradeoffs
              • Hardware/logic gates to performance tradeoffs
              • Memory to performance tradeoffs
  • 7. Problem Definition
          d0 = x_int - x0;                          Pipeline Step 1 (first data-independent block)
          d1 = x_int - x1;
          d2 = x_int - x2;
          d3 = x_int - x3;
          p01 = (y0*d1 - y1*d0) / (x0 - x1);        Step 2
          p12 = (y1*d2 - y2*d1) / (x1 - x2);
          p23 = (y2*d3 - y3*d2) / (x2 - x3);
          p02 = (p01*d2 - p12*d0) / (x0 - x2);      Step 3
          p13 = (p12*d3 - p23*d1) / (x1 - x3);
          p03 = (p02*d3 - p13*d0) / (x0 - x3);      Step 4
      Figure-1: Problem at Kernel Level
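
The kernel on this slide has the shape of Neville's scheme for a cubic through four points. The following is a minimal Python sketch of it (Python rather than Mitrion-C; the function name `cubic_kernel` and the tuple arguments are assumptions made for illustration). The comments mark which statements are mutually independent and could therefore run in parallel within one pipeline step.

```python
def cubic_kernel(xs, ys, x_int):
    """Cubic interpolation through four points (hypothetical helper name)."""
    x0, x1, x2, x3 = xs
    y0, y1, y2, y3 = ys

    # Step 1: four differences, independent of each other
    d0 = x_int - x0
    d1 = x_int - x1
    d2 = x_int - x2
    d3 = x_int - x3

    # Step 2: three first-order terms, each needs only Step 1 results
    p01 = (y0 * d1 - y1 * d0) / (x0 - x1)
    p12 = (y1 * d2 - y2 * d1) / (x1 - x2)
    p23 = (y2 * d3 - y3 * d2) / (x2 - x3)

    # Step 3: two second-order terms, each needs Step 2 results
    p02 = (p01 * d2 - p12 * d0) / (x0 - x2)
    p13 = (p12 * d3 - p23 * d1) / (x1 - x3)

    # Step 4: the final cubic value
    p03 = (p02 * d3 - p13 * d0) / (x0 - x3)
    return p03


# Sample points taken from f(x) = x**3 - 2*x + 1 (an arbitrary test cubic)
xs = (0.0, 1.0, 2.0, 3.0)
ys = (1.0, 0.0, 5.0, 22.0)
result = cubic_kernel(xs, ys, 1.5)
print(result)  # 1.375
```

Because the test function is itself a cubic, the kernel reproduces it exactly at the query point; for general data the result is the value of the interpolating cubic.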
  • 8. Problem Definition (Contd.)
      • Points to consider for parallelism
          • Performance improved
          • Required hardware resources increased
              • Hardware gates
              • Memory interface
              • Memory size
              • Memory access speed
      • Tradeoffs
      • Fine-Grained Reconfigurable Architectures
  • 9. Problem Definition (Contd.)
          for(i in <0 .. 90000>)                        Problem Level Parallelism (PLP)
          {                                             Kernel Level Parallelism (KLP)
              d0 = x_int - x0;                          Pipeline Step 1
              d1 = x_int - x1;
              d2 = x_int - x2;
              d3 = x_int - x3;
              p01 = (y0*d1 - y1*d0) / (x0 - x1);        Step 2
              p12 = (y1*d2 - y2*d1) / (x1 - x2);
              p23 = (y2*d3 - y3*d2) / (x2 - x3);
              p02 = (p01*d2 - p12*d0) / (x0 - x2);      Step 3
              p13 = (p12*d3 - p23*d1) / (x1 - x3);
              p03 = (p02*d3 - p13*d0) / (x0 - x3);      Step 4
          }
      Figure-2: Parallelism at different levels
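
Problem-level parallelism relies on the loop iterations being independent of one another. The Python sketch below illustrates that property only; it assumes each iteration receives its own `x_int`, uses fixed sample points, and runs far fewer iterations than the 90000 on the slide. The thread pool stands in for replicated hardware kernels and is not meant to show a speedup. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def cubic_kernel(x_int, xs=(0.0, 1.0, 2.0, 3.0), ys=(1.0, 0.0, 5.0, 22.0)):
    """One loop iteration: the kernel-level body from the slide."""
    x0, x1, x2, x3 = xs
    y0, y1, y2, y3 = ys
    d0, d1, d2, d3 = x_int - x0, x_int - x1, x_int - x2, x_int - x3
    p01 = (y0 * d1 - y1 * d0) / (x0 - x1)
    p12 = (y1 * d2 - y2 * d1) / (x1 - x2)
    p23 = (y2 * d3 - y3 * d2) / (x2 - x3)
    p02 = (p01 * d2 - p12 * d0) / (x0 - x2)
    p13 = (p12 * d3 - p23 * d1) / (x1 - x3)
    return (p02 * d3 - p13 * d0) / (x0 - x3)


def run_sequential(points):
    """Baseline: one kernel instance visits every point in order."""
    return [cubic_kernel(x) for x in points]


def run_replicated(points, replicas=4):
    """Several kernel instances share the iterations (order of results is kept)."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        return list(pool.map(cubic_kernel, points))


points = [i / 100.0 for i in range(300)]
same = run_sequential(points) == run_replicated(points)
print(same)  # True: no iteration depends on another
```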
  • 10. Main Idea
      • Parallelism levels
          • Bit Level Parallelism (BLP)
          • Kernel Level Parallelism (KLP)
          • Problem Level Parallelism (PLP)
      • Maximum parallelism at one level is not the ultimate solution
      • Customized parallelism at different levels
          • Can better adjust resource-performance tradeoffs
          • Gates-performance tradeoff
  • 11. Main Idea (Contd.)
      • Maximum parallelism at one level is not the ultimate solution
      • Combine parallelism at different parallelism levels to produce parallelization levels
      • Parallelization levels
          • Single Kernel (SKZ)
          • Cross Kernel (CKZ)
          • Multi-SKZ
          • Multi-CKZ
      Figure-3: Parallelism and Parallelization Levels
  • 12. Experimental Setup
      • Computation-intensive algorithms
          • Interpolation kernels
      • Fine-grained architecture
          • FPGA
      • Fine-grained parallelism
          • Mitrion Virtual Processor
              • Extracts fine-grained parallelism
          • Mitrion-C high-level language (HLL)
      • Hardware platform
          • Cray XD1 supercomputer with Virtex-4 FPGA
  • 13. Interpolation Kernels
      • What is interpolation?
          • Process of calculating new values within the range of available values [1]
      • Cubic interpolation
      • Bi-cubic interpolation
          • Applying cubic in 2D
          • 5 cubic kernels
      Figure-4: 2D Interpolation
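
One way to read "5 cubic kernels" is four cubic interpolations along one axis of a 4x4 neighbourhood followed by a fifth along the other axis. The Python sketch below works under that assumption; the grid layout (`grid[j][i]` holds the value at `xs[i]`, `ys[j]`) and all names are hypothetical, not taken from the slides.

```python
def cubic_kernel(xs, ys, x_int):
    """Cubic interpolation through four points."""
    x0, x1, x2, x3 = xs
    y0, y1, y2, y3 = ys
    d0, d1, d2, d3 = x_int - x0, x_int - x1, x_int - x2, x_int - x3
    p01 = (y0 * d1 - y1 * d0) / (x0 - x1)
    p12 = (y1 * d2 - y2 * d1) / (x1 - x2)
    p23 = (y2 * d3 - y3 * d2) / (x2 - x3)
    p02 = (p01 * d2 - p12 * d0) / (x0 - x2)
    p13 = (p12 * d3 - p23 * d1) / (x1 - x3)
    return (p02 * d3 - p13 * d0) / (x0 - x3)


def bicubic(xs, ys, grid, x_int, y_int):
    """Bi-cubic interpolation built from five cubic kernels."""
    # Kernels 1 to 4: interpolate along x in each of the four rows.
    # The four calls do not depend on each other.
    row_values = [cubic_kernel(xs, grid[j], x_int) for j in range(4)]
    # Kernel 5: interpolate the four row results along y.
    return cubic_kernel(ys, row_values, y_int)


def f(x, y):
    """Arbitrary test surface, cubic in x and quadratic in y."""
    return x ** 3 + x * y ** 2 - y


xs = (0.0, 1.0, 2.0, 3.0)
ys = (0.0, 1.0, 2.0, 3.0)
grid = [[f(x, y) for x in xs] for y in ys]
value = bicubic(xs, ys, grid, 1.5, 0.5)
print(abs(value - f(1.5, 0.5)) < 1e-9)  # True
```

The four row kernels are data independent, so they are natural candidates for replication, while the fifth kernel must wait for all of them.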
  • 14. Mitrion Parallel Architecture
      • Mitrion Virtual Processor (MVP)
          • Fine-grained, soft-core processor
          • Almost 60 IP blocks defined in HDL [2]
          • Non-von Neumann architecture
      • Mitrion-C
          • HLL for FPGA
          • Data dependence instead of order of execution
          • Parallelism language constructs [3]
          • Pipelining
  • 15. Parallelization Levels
      • Single Kernel Parallelization (SKZ)
          • Only kernel-level parallelism (KLP)
          • All data-independent blocks are internally parallel but externally pipelined
      Figure-5: SKZ
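
The grouping into data-independent blocks can be derived from the data dependencies of the cubic kernel. The Python sketch below assigns each value the earliest step at which all of its inputs are ready; the dependency table is transcribed from the kernel statements on slide 7, while the function name and the treatment of the inputs (x and y values available before Step 1) are assumptions.

```python
# Each value maps to the computed values it reads.
# Kernel inputs (x0..x3, y0..y3, x_int) are assumed available from the start.
DEPENDENCIES = {
    "d0": [], "d1": [], "d2": [], "d3": [],
    "p01": ["d0", "d1"],
    "p12": ["d1", "d2"],
    "p23": ["d2", "d3"],
    "p02": ["p01", "p12", "d0", "d2"],
    "p13": ["p12", "p23", "d1", "d3"],
    "p03": ["p02", "p13", "d0", "d3"],
}


def pipeline_steps(deps):
    """Group values into steps: parallel inside a step, pipelined between steps."""
    level = {}

    def visit(name):
        # A value's step is one more than the latest step among its inputs.
        if name not in level:
            level[name] = 1 + max((visit(d) for d in deps[name]), default=0)
        return level[name]

    for name in deps:
        visit(name)

    steps = {}
    for name, lv in level.items():
        steps.setdefault(lv, []).append(name)
    return [sorted(steps[lv]) for lv in sorted(steps)]


for number, block in enumerate(pipeline_steps(DEPENDENCIES), start=1):
    print(number, block)
# 1 ['d0', 'd1', 'd2', 'd3']
# 2 ['p01', 'p12', 'p23']
# 3 ['p02', 'p13']
# 4 ['p03']
```

The block widths (4, 3, 2, 1) show how unevenly the computation is spread across the steps.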
  • 16. Parallelization Levels (Contd.)
      • Cross Kernel Parallelization (CKZ)
          • Extend the kernel by mixing more than one kernel
          • Replicate computation-intensive data-independent blocks
          • Resource-computation balance
      Figure-6: CKZ
  • 17. Parallelization Levels (Contd.)
      • Multi-SKZ
          • Replicate kernels which already have SKZ
      Figure-7: Multi-SKZ
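
Replication trades hardware for time. The Python sketch below is a toy cost model, not the authors' evaluation: it assumes every replica is a pipeline of fixed depth that accepts one new iteration per cycle, that work is split evenly, and that resource cost scales linearly with the number of replicas. The unit cost, budget and cycle limit are made-up illustrative numbers.

```python
import math


def pipelined_cycles(n_items, depth, replicas):
    """Cycles to push n_items through `replicas` identical pipelines (toy model)."""
    per_replica = math.ceil(n_items / replicas)
    # The first result appears after `depth` cycles, then one per cycle.
    return depth + per_replica - 1


def smallest_replication(n_items, depth, unit_cost, budget, max_cycles):
    """Fewest replicas that meet the cycle limit within the resource budget."""
    max_replicas = budget // unit_cost
    for replicas in range(1, max_replicas + 1):
        if pipelined_cycles(n_items, depth, replicas) <= max_cycles:
            return replicas
    return None  # the requirement cannot be met with this budget


# 90000 iterations (as in the slide's loop) through a 4-step kernel
print(pipelined_cycles(90000, 4, 1))  # 90003
print(pipelined_cycles(90000, 4, 4))  # 22503
print(smallest_replication(90000, 4, unit_cost=10, budget=45, max_cycles=30000))  # 4
print(smallest_replication(90000, 4, unit_cost=10, budget=35, max_cycles=30000))  # None
```

Even in this simplified form, the smallest acceptable design depends on both the resource budget and the performance requirement.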
  • 18. Parallelization Levels (Contd.)
      • Multi-CKZ
          • Replicate kernels which already have CKZ
      Figure-8: Multi-CKZ (replicated CKZ kernels; each reads from memory, computes the D values, P01, P12, P23, p02, p13 and P03, writes to memory, and goes on to the next iteration)
  • 19. Results
      Table-1: Results
  • 20. Conclusions
      • Specific conclusions
          • For very limited resources, SKZ is better
          • CKZ is better for applications with highly unbalanced computation distribution
          • SKZ and CKZ are better for large applications
          • Multi-CKZ can provide a high level of parallelism at the cost of design complexity
          • Multi-SKZ and Multi-CKZ are attractive for small real-time applications
      • Using parallelization levels
          • Can adjust tradeoffs
          • Can achieve highly custom parallelism
      • A mix of parallelization levels can produce
          • Application-specific parallelism
          • Resource-specific parallelism
  • 21. Future Work
      • Automation of parallelization levels
      • Parallelization levels to deal with other tradeoffs
      • Generalized parallelization levels for all applications
      • Generalized parallelization levels for graphics processors to adjust tradeoffs
      • Floating point and accuracy
  • 22. References
      [1] William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling, "Numerical Recipes: The Art of Scientific Computing", Cambridge University Press
      [2] Stefan Möhl, "The Mitrion Virtual Processor, Using FPGAs in HPC", Sixteenth ACM/SIGDA International Symposium on FPGAs <http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2005%20Mitrionics%20-%20Mohl.pdf> Date 14-05-2009
      [3] "Mitrion User Guide", Copyright © 2005 - 2008 by Mitrionics AB. <http://www.mitrionics.com/?page=developers_resources> Date 03-03-2009
  • 23. Thank you (hope you enjoyed)