CISL talk


Research talk given on April 28th 2009 at CISL


  1. Synchronization Synthesis for Large Scale Parallel Streaming Applications
     Vivek Venugopal
     Committee: Dr. Cameron Patterson, Dr. Peter Athanas, Dr. Paul Plassmann, Dr. Jeffrey Reed, Dr. Kevin Shinpaugh
  2. Outline
     • Research Overview • Introduction • Related Work • Research Statement • Methodology • Target Applications & Evaluation • Results • Contributions
  3. Research Overview
     A streaming application passes through a set of transformations onto a specialized hardware platform; the research scope covers:
     • How is the algorithm partitioned?
     • What is mapped, and where?
     • What are the communication resources?
     • How is synchronization guaranteed?
  4. Streaming Architecture without Flow Control (SAFC)
     [Diagram: ML310 boards connected in a mesh over Aurora links, driven by the same clock value but different clock sources; inside each ML310, PEs are connected through FSLs and Aurora switches. A streaming architecture with a large number of PEs requires more than one board.]
  5. Clock domains: GALS scenario
     [Diagram: data crossing between the clock domains of PE1 and PE2.]
     • GALS (Globally Asynchronous Locally Synchronous)
  6. Synchronization
     [Diagram: source IC sending data and a clock to a destination IC.]
     • System synchronous
     • Synchronization synthesis
  7. Data type: packet-based vs. streaming
     • Start and stop are easy for packet-based data; streaming data cannot be stopped.
     • Packet-based synchronization is easier due to flow control; streaming synchronization is difficult and, if not done properly, leads to data loss.
     • Packet-based data allows better dynamic scheduling of resources; streaming data allows better static scheduling of resources.
     • Packet-based transport offers best-effort service; streaming transport offers guaranteed service.
  8. Communication framework
     System-level communication frameworks:
     • Point-to-point interconnect (custom or uniform)
     • Bus-based architecture (shared or split)
     • Network-on-Chip
  9. Point-to-point Interconnect
     [Diagram: Ring, 2D Torus, and 2D Mesh topologies over nodes 1–9.]
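The mesh and torus topologies on this slide differ only in their wrap-around links. A small sketch (a hypothetical helper, not part of the talk's tooling) makes the link structure concrete, using the slide's row-major node numbering:

```python
def mesh_links(rows, cols, torus=False):
    """Enumerate the point-to-point links of a 2D mesh; with torus=True,
    add the wrap-around links that turn the mesh into a 2D torus.
    Nodes are numbered 1..rows*cols in row-major order, as on the slide."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c + 1
            right = r * cols + (c + 1) % cols + 1   # horizontal neighbor
            down = ((r + 1) % rows) * cols + c + 1  # vertical neighbor
            if c + 1 < cols or torus:
                links.add(tuple(sorted((node, right))))
            if r + 1 < rows or torus:
                links.add(tuple(sorted((node, down))))
    return sorted(links)
```

For the 3×3 example on the slide this yields 12 links as a mesh and 18 as a torus, which is why the torus halves the worst-case hop count at the cost of extra wiring.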
  10. Bus-based architecture
      [Diagram: a PowerPC and memory interface on the PLB with its arbiter and high-speed peripherals, bridged through the OPB bridge and OPB arbiter to low-speed peripherals and a Block RAM I/O interface.]
      • IBM CoreConnect architecture
  11. Network-on-Chip (NoC)
      [Diagram: a grid of cores, each attached to a router; routers joined by links.]
  12. Multi-core Streaming Architecture
      Network Of FPGAs with Integrated Aurora Switches (NOFIS)
      [Diagram: two ML310 boards, each containing nine PEs connected through FSLs and Aurora switches; the boards are linked through their Aurora switches.]
  13. NOFIS: On-board communication
      [Diagram: master and slave sides of a FIFO, with FSL_M_Clk, FSL_M_Data, FSL_M_Control, FSL_M_Write, and FSL_M_Full on the master side, and FSL_S_Clk, FSL_S_Data, FSL_S_Control, FSL_S_Read, and FSL_S_Exists on the slave side.]
      • Fast Simplex Link: uni-directional FIFO interface
      • Configurable FIFO depth and clocking modes
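The FSL handshake above can be modeled behaviorally. This is a software sketch of the flag semantics only (FSL_M_Full gating writes, FSL_S_Exists gating reads) in a non-blocking style; it is not the RTL core and does not model clock-domain crossing or timing:

```python
from collections import deque

class FSL:
    """Behavioral sketch of a Fast Simplex Link: a uni-directional FIFO
    with a master write side and a slave read side."""
    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    @property
    def full(self):        # FSL_M_Full
        return len(self.fifo) >= self.depth

    @property
    def exists(self):      # FSL_S_Exists
        return bool(self.fifo)

    def write(self, data, control=False):   # FSL_M_Write
        if self.full:
            return False                    # non-blocking: write rejected
        self.fifo.append((data, control))   # data word plus control bit
        return True

    def read(self):                         # FSL_S_Read
        if not self.exists:
            return None
        return self.fifo.popleft()
```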
  14. NOFIS: Off-board communication
      [Diagram: an Aurora channel of lanes 1..n connecting user applications on two ML310 boards through Aurora interfaces.]
      • High-speed (3.125 Gb/sec) and self-synchronous
  15. Model of Computation (MoC): SDF
      Synchronous Data Flow (SDF)
      [Diagram: actors A, B, and C connected by buffers with fixed token production/consumption rates and delay elements.]
      • SDF exhibits ideal systolic dataflow behavior
      • Varying data rates are not supported
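The fixed rates of SDF are what make static scheduling possible: the balance equations q[src]·production = q[dst]·consumption must have a positive integer solution. A sketch of that check (actor names and rates here are illustrative, not the slide's exact graph):

```python
from fractions import Fraction
from math import lcm
from collections import defaultdict, deque

def repetition_vector(edges):
    """Solve the SDF balance equations q[src]*p == q[dst]*c for every edge
    (src, dst, p, c) and return the smallest positive integer repetition
    vector, or None if the rates are inconsistent (no static schedule)."""
    graph = defaultdict(list)
    actors = set()
    for src, dst, p, c in edges:
        graph[src].append((dst, Fraction(p, c)))  # q[dst] = q[src] * p / c
        graph[dst].append((src, Fraction(c, p)))
        actors.update((src, dst))
    q = {}
    for start in actors:
        if start in q:
            continue
        q[start] = Fraction(1)
        todo = deque([start])
        while todo:
            a = todo.popleft()
            for b, ratio in graph[a]:
                if b in q:
                    if q[b] != q[a] * ratio:      # inconsistent rates
                        return None
                else:
                    q[b] = q[a] * ratio
                    todo.append(b)
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}
```

A graph where A produces 2 tokens per firing toward B and C, and C forwards 1 token to B, balances with A firing once per two firings of B and C; an edge whose rates contradict the rest of the graph makes the function return None.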
  16. Model of Computation (MoC): PSDF
      Parametrized Synchronous Data Flow (PSDF)
      [Diagram: actors A, B, and C with parameterized port rates A1, A2, B1, B2, C1, C2 and buffer sizes a1, a2, a3.]
      firings of A ⨉ production rate of A1 = firings of B ⨉ consumption rate of B1
      firings of A ⨉ production rate of A2 = firings of C ⨉ consumption rate of C1
      firings of C ⨉ production rate of C2 = firings of B ⨉ consumption rate of B2
      • Supports reconfiguration and varying data rates
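The three relations above can be checked directly for any one parameter setting. A minimal sketch, assuming hypothetical dictionaries keyed by the slide's actor and port labels:

```python
def psdf_balanced(firings, rates):
    """Check the three PSDF balance relations from the slide for one
    parameter setting. `firings` maps actor -> firing count; `rates`
    maps port label (A1, B1, ...) -> tokens per firing."""
    return (firings["A"] * rates["A1"] == firings["B"] * rates["B1"] and
            firings["A"] * rates["A2"] == firings["C"] * rates["C1"] and
            firings["C"] * rates["C2"] == firings["B"] * rates["B2"])
```

Under PSDF, reconfiguration amounts to changing `rates` between iterations and re-verifying (or re-solving) these relations before the next schedule runs.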
  17. Multi-core trend
      [Chart: number of cores (log scale, 1–1000) versus year of production (2003–2009); multi-core growth outpaces Moore's Law.]
      • A single unified parallel programming tool is needed to exploit parallel processing at the core level
      • Kill Rule by Agarwal: correlate to communication cores
  18. Related work
      • Compaan-Laura: Matlab compilers, KPN MoC; suboptimal bandwidth utilization due to infinite-length FIFOs
      • CERBERO: custom architecture, MPI model; the synchronization scheme is always fixed with the master PE
      • TMD: custom architecture, MPI model; PEs are connected using an MPI communication library, with no automation
      • CORES/HASIS: custom architecture plus transformations, SDF/FSM model; manual partitioning and scheduling of communication resources, with run-time reconfiguration outside its scope
  19. Research Question
      What transformations are required to map a streaming application onto a systolic-like architecture, hiding the low-level communication interface details from the end user while still supporting automated implementation of streaming applications on the platform?
  20. Conventional design flow
      design capture → partitioning and scheduling of processes → select parameters for the customizable cores → map the values onto the hardware, with a manual redesign loop back to the earlier steps
  21. Proposed PRO-PART Design Flow
      Input specifications: SFG representation of the application (streaming structure and algorithm dataflow captured as components using the SFG specification), component specification, and NOFIS platform specification.
      Automated steps: partitioning and communication resource specification → configure and generate values for the communication cores → mapping to hardware.
      Objectives:
      • Partition the algorithm
      • Identify communication resources
      • Schedule and embed flow control
      • Guarantee overall synchronization without re-design loops
  22. SAFC Flow Graph (SFG)
      [Diagram: functions F1–F9 spread over FPGAs 1–16, with data arriving from North, South, East, and West inputs and synchronized data blocks recorded onto disk.]
      • Provides an abstraction to view a streaming system with a single universal clock (the I/O rate)
  23. Platform specification
      [Diagram: two ML310 boards, each with four PEs connected through FSLs and Aurora switches, linked over Aurora.]
  24. Process Partitioning
      ML310 board 1: read image1 and image2 from memory and create zone1 and zone2; f1 = FFT(zone1); f2 = FFT(zone2)
      ML310 board 2: f3 = mult(f1, f2); f4 = IFFT(f3); sub-pixel(f4)
      • Assign process IDs in order of execution and partition the processes between the boards
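The ordering-and-splitting idea can be sketched as follows. This is a toy illustration assuming a simple contiguous split across boards; it is not the actual PRO-PART partitioner, and the process names are placeholders:

```python
def partition(processes, n_boards):
    """Assign consecutive IDs to processes listed in execution order and
    split them into contiguous chunks, one chunk per board."""
    ids = {name: i for i, name in enumerate(processes)}
    size = -(-len(processes) // n_boards)   # ceiling division
    boards = [processes[i:i + size] for i in range(0, len(processes), size)]
    return ids, boards
```

On the slide's PIV pipeline, a two-board split of the seven steps puts the reads and FFTs on board 1 and mult, IFFT, and sub-pixel on board 2.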
  25. Configuring communication resources
      [Diagram: board 1 holds the image reads and FFTs; board 2 holds mult, IFFT, and sub-pixel. The FSLs run in synchronous non-blocking mode, and channels are multiplexed over Aurora with a flow-control mode.]
      • Configure buffer depths for the FSLs
      • Map channels to physical links
      • Schedule data over channels
      • Multiplex virtual channels
  26. Mapping to Hardware
      [Diagram: inside an ML310, PEs exchange data through FSLs and I/O units.]
      • The I/O unit generates parameter values for the FIFO generator, FSL block, sync counter, and Aurora FSL switch
  27. Particle Image Velocimetry (PIV)
      • Cardiovascular Disease (CVD) is the leading cause of death in the United States, accounting for more than 37.1% of all fatalities in 2005.
      • The AEThER Lab at Virginia Tech models cardiovascular fluid dynamics.
  28. PIV algorithm
      [Diagram: zone 1 of image 1 (time t) and zone 2 of image 2 (time t + dt) each pass through an FFT; the results are multiplied, inverse-transformed (IFFT), and reduced to a motion vector.]
      • Data-intensive: each case results in 1250 image pairs ⨉ 5 MB = 6.25 GB
      • The custom FlowIQ program takes 16 minutes for one image pair on a 2 GHz Xeon processor, resulting in 2.6 years for the analysis
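The FFT → multiply → IFFT → reduce chain on this slide is standard FFT-based cross-correlation. A numpy sketch of the integer-pixel core (the sub-pixel refinement step from the slide is omitted here):

```python
import numpy as np

def piv_displacement(zone1, zone2):
    """Estimate the motion vector between two interrogation zones by
    FFT cross-correlation: FFT both zones, multiply one spectrum by the
    conjugate of the other, IFFT, and take the correlation peak."""
    corr = np.fft.ifft2(np.fft.fft2(zone1) * np.conj(np.fft.fft2(zone2))).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the half-way point wrap around to negative displacements.
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, corr.shape))
```

Because both FFTs, the multiply, and the IFFT are fixed-rate kernels, the whole chain maps naturally onto the systolic, statically scheduled dataflow this talk targets.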
  29. PIV performance
      [Chart: execution time in seconds per image pair — CPU 4.50, FPGA 2.25, GPU 1.279.]
      • The GPU is the fastest platform, but data transfer between the device (GPU) and the host (CPU) is expensive.
      • PRO-PART + NOFIS is slower per image pair but yields higher throughput due to efficient pipelining and customized communication cores. (work in progress)
  30. ETA Beamforming application
      [Diagram: antenna inputs arrive over LVDS connections at receiver nodes (ML310 boards with S25 boards); inner and outer nodes communicate over a 2.5 Gbit/sec serial interconnect network (Aurora); recording nodes write to PC disks.]
  31. ETA using PRO-PART + NOFIS
      Current ETA implementation:
      • Time-consuming and extensive simulations
      • Hardware-efficient due to hand-coded RTL
      • Manual flow: design capture → partitioning and scheduling of processes → select parameters for the customizable cores → map the values onto the hardware, with a manual redesign loop
      Proposed implementation (work in progress):
      • Shorter design cycle
      • Potential increase in resources, but meets the performance goals
      • Systolic dataflow structure captured using components and the SFG specification
      • Automated flow: design capture → partitioning and communication resource specification → configure and generate values for the communication cores → mapping to hardware
  32. Contributions
      • Map streaming applications to a GALS architecture
      • SAFC Flow Graph (SFG) representation
      • PRO-PART design methodology
      • Configurable communication cores
      • Guaranteed synchronization by meeting the I/O clock rate
      • Increased designer productivity
  33. Discussion: Questions
  34. Supporting slides
  35. Synchronization methods
      • Source synchronous: the source IC forwards a clock alongside the data to the destination IC.
      • Self synchronous: data and clock travel together on the same link from the source IC to the destination IC.
  36. Message Passing Interface (MPI)
      [Diagram: a user submits a job to the head node, which distributes the application and dataset across compute nodes 1..n and collects a formatted output file with the combined results.]
  37. Models of Computation: CSP
      [Diagram: processes 1–5 communicating over explicit sending and receiving channels.]
  38. Models of Computation: KPN
      [Diagram: processes P1, P2, and P3 connected by infinite-sized FIFOs.]
  39. NVIDIA Tesla C1060 GPU
      [Diagram: multiprocessors 1..N, each with shared memory, registers, an instruction unit, and processors 1..M, plus constant cache, texture cache, and GPU memory.]
      • PCI Express architecture that delivers up to 6 GB/s in both upstream and downstream data transfers
      • Standard industry form factors, for both desktop and rack-mounted configurations
      • NVIDIA unified driver architecture (UDA)
      The NVIDIA Tesla C1060 GPU computing board is an add-in card based on the Tesla C1060 GPU. It has a PCI Express full-height form factor and is targeted as a high-performance computing (HPC) solution for PCI Express host workstations. GPU computing board-level products do not have display connectors and are specifically designed for computing.
  40. PIV performance: cuda_profile
      [Screenshot: CUDA profiler output for the PIV kernels.]
  41. NOFIS Hardware
      [Photo: two ML310 boards connected through an Infiniband adapter board.]