REFERENCE: Published by the IEEE Computer Society, July 2008. Presented by: Md. Merazul Islam (0507036), Dept. of CSE, KUET
WARP PROCESSING?
Dynamically optimizes software to improve execution time and energy consumption.
A new architecture implemented with both hardware and software.
Transforms a binary kernel into an FPGA circuit.
Fully dynamic; generates entire coprocessing circuits, going beyond functional units.
It can also work with multiple processors.
FPGA CIRCUIT?
Field-Programmable Gate Array (FPGA): a programmable logic device.
FPGAs perform bit manipulation fast.
FPGAs are not part of mainstream computing.
Supports any compiler, any language, multiple sources, etc.
Figure: In the CAD-oriented FPGA, the configurable logic block (CLB) inputs and outputs are directly connected to the switch matrices (SMs).
WARP ARCHITECTURE
Components: microprocessor (µP) with instruction and data caches (I$, D$), FPGA, profiler, Dynamic Partitioning Module (DPM).
1. Initially execute the application in software only (µP, I$, D$).
2. Profile the application to determine critical regions (profiler).
3. Partition critical regions to hardware (DPM).
4. Program the configurable logic and update the software binary (FPGA).
5. The partitioned application executes faster with lower energy consumption.
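The profiling step (step 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the presented implementation: it assumes critical regions are hot loops identified by counting taken backward branches, and the `find_critical_regions` helper, the branch trace, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter


def find_critical_regions(branch_trace, threshold=0.5):
    """Return loops whose share of all backward branches meets the threshold.

    branch_trace: iterable of (source_addr, target_addr) for taken branches.
    A branch is treated as loop-closing when target_addr <= source_addr.
    Each loop is reported as a (loop_start, loop_end) address pair.
    """
    counts = Counter(
        (target, source) for source, target in branch_trace if target <= source
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [loop for loop, n in counts.most_common() if n / total >= threshold]


# Hypothetical trace: a hot loop taken 90 times, a cold loop taken 10 times,
# and one forward branch that the profiler ignores.
trace = [(0x120, 0x100)] * 90 + [(0x220, 0x200)] * 10 + [(0x50, 0x80)]
critical = find_critical_regions(trace)
```

With this trace only the loop spanning 0x100 to 0x120 is reported, so it would be the single candidate handed to the partitioning step.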
WARP PROCESSING STEPS
Figure: tool flow inside the DPM (CAD), alongside the µP, I$, D$, FPGA, and profiler. Stages shown: decompilation, partitioning, RT synthesis, JIT FPGA compilation (logic synthesis, technology mapping/packing, placement, routing), and binary updater. Input: software binary. Outputs: hardware bitstream and updated binary.
WARP PROCESSING STEPS
Dynamic Binary Translation
Decompilation: Recovers high-level information lost during compilation, using sophisticated decompilation methods.
RT Synthesis: Converts the decompiled control/data flow graph (CDFG) to Boolean expressions. Detects reads and writes, memory access patterns, and memory read/write ordering.
Related techniques: discovering loops and if-else structures, reducing operation sizes, rerolling loops.
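Loop discovery on a recovered control flow graph can be sketched as below. This is an illustrative textbook approach (depth-first search for back edges), not necessarily the method used by the presented tools; the `find_back_edges` helper and the example graph are hypothetical.

```python
def find_back_edges(cfg, entry):
    """Return edges (u, v) where v is still on the DFS stack when u is visited.

    cfg maps each basic-block name to a list of successor block names.
    Each returned edge closes a loop whose header is v.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in cfg}
    back_edges = []

    def visit(u):
        color[u] = GREY
        for v in cfg[u]:
            if color[v] == GREY:
                back_edges.append((u, v))
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    visit(entry)
    return back_edges


# Hypothetical CFG: one loop containing an if-else diamond.
cfg = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["then", "else"],
    "then": ["latch"],
    "else": ["latch"],
    "latch": ["header"],
    "exit": [],
}
loops = find_back_edges(cfg, "entry")
```

Here the only back edge runs from `latch` to `header`, which marks `header` as the loop entry; the `then`/`else` blocks that rejoin at `latch` form the if-else structure inside the loop.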
WARP PROCESSING STEPS
Logic Synthesis: Optimizes the hardware circuit created during RT synthesis.
Technology Mapping/Packing: Decomposes the hardware circuit into basic logic gates, then traverses the logic network, combining nodes to form single-output nodes.
Placement: Identifies the critical path and places critical nodes in the center of the configurable logic fabric.
Routing: Finds a path within the FPGA to connect the source and sinks of each net. Routing nets between configurable logic blocks (CLBs) are represented as routing between switch matrices (SMs).
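The routing idea, finding a path between switch matrices for a net, can be sketched with a plain breadth-first search on a grid. This is a deliberately simplified stand-in and not the router used by the presented tools: it routes one two-terminal net at a time and ignores congestion. The `route_net` helper and the grid coordinates are hypothetical.

```python
from collections import deque


def route_net(width, height, source, sink, blocked=frozenset()):
    """Breadth-first search for a shortest path between two switch matrices.

    Switch matrices are (x, y) grid cells; `blocked` holds cells that are
    unavailable. Returns the path as a list of cells, or None if unroutable.
    """
    prev = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:  # walk predecessors back to the source
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            in_grid = 0 <= nx < width and 0 <= ny < height
            if in_grid and nxt not in blocked and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical 4x4 fabric with two occupied switch matrices on the direct route.
path = route_net(4, 4, (0, 0), (3, 0), blocked={(1, 0), (2, 0)})
```

Because the direct row is occupied, the returned path detours through the neighboring row before reaching the sink.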
RESULTS
Execution Time and Memory Requirements

Configuration                                               Memory    Time
(a) Commercial FPGA CAD tool, desktop workstation           120 MB    3 min
(b) Riverside Dynamic CAD (RDCAD) tools, same workstation   3.6 MB    0.108 s
(c) RDCAD tools, lean 40-MHz ARM7 processor                 3.6 MB    1.11 s
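The ratios implied by the table can be worked out directly. The sketch below only restates the table's figures (with 3 min converted to 180 s); the dictionary layout and the `ratio` helper are illustrative names, not part of the original slides.

```python
# Figures copied from the results table; times converted to seconds.
results = {
    "a": {"memory_mb": 120.0, "time_s": 3 * 60.0},  # commercial tool, workstation
    "b": {"memory_mb": 3.6, "time_s": 0.108},       # RDCAD, same workstation
    "c": {"memory_mb": 3.6, "time_s": 1.11},        # RDCAD, 40-MHz ARM7
}


def ratio(baseline, other, key):
    """How many times larger the baseline value is than the other value."""
    return results[baseline][key] / results[other][key]


memory_reduction = ratio("a", "b", "memory_mb")
workstation_time_reduction = ratio("a", "b", "time_s")
arm7_time_reduction = ratio("a", "c", "time_s")
```

By these figures, the RDCAD tools use about 33 times less memory than the commercial tool, run about 1,667 times faster on the same workstation, and still run about 162 times faster when hosted on the 40-MHz ARM7.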
SPEEDUP COMPARISON
Figure (a): Speedup of software execution on a digital signal processor (DSP) and of warped execution on a warp processor, both relative to a 200-MHz ARM9, for single-threaded applications.
Figure (b): Multithreaded application speedups on various 400-MHz ARM11-based multiprocessors and on warp processors.
CONCLUSION
The work demonstrates the warp processing technique and opens the door to new challenges.
Speedup of 2X to 100X, or even more.
20X less memory usage.
10% more routing resource usage.
38% to 94% power reduction.
In the near future, we expect warp processors to achieve speedups much greater than an order of magnitude.
