Track A-Compilation guiding and adjusting - IBM
  • Proebsting was talking about performance; what about power/energy? How can compilers help improve power? Tell me if you know.
  • This slide is quite self-explanatory
  • This slide shows the general overview of the ERA platform. Basically, there are different components for “processing”, “networking”, and “memories” that we can choose from in order to build the platform. On top, we want to be able to adapt to different applications by choosing from libraries of these components; an additional advantage is that we want to do this dynamically. For this, we need a hardware scheduler or an OS/software scheduler that works in tandem with the hardware scheduler. The monitoring block monitors, for example, the power and performance of the system, and this information can be fed into the schedulers. Finally, we need a smarter compiler that is better aware of the dynamic behavior of the platform.
  • This slide shows all the partners within the project.
  • This slide summarizes the slide with the figure of the ERA platform.
  • -mcpu: architecture (ISA); -mtune: micro-architecture. Several PowerPC versions; code size, flexibility, switch versions at specific places in code.
  • Memory params: static analysis of memory access patterns, temporal and spatial reuse. Partition code into sections representing phases of distinct ILP/MEM.
  • In the table, you can highlight the fact that we can parameterize the issue width of the ρ-VEX processor and that different instantiations have different resource utilizations.
  • On this slide, we can see that with the same resources, we can instantiate different cores: two smaller ones to handle TLP, or one big core that combines them to exploit ILP. The idea in the ERA project is to be able to do this on-the-fly, dynamically.
  • This slide shows results on EDP (energy-delay product) measurements obtained by varying the instruction window size (this has a clear relation with the parallelism of an application, its ILP) and cache sizes. We see in this slide that when we increase the cache size, the EDP decreases. However, more interesting is the fact that the EDP is similar (almost the same) across varying configurations; see the arrows pointing to different ILP-cache configurations. This means that we can optimize our design by changing the parameters and still achieve the same EDP. Please note that the information on this slide has not been published yet, so it is copyrighted!

Track A-Compilation guiding and adjusting - IBM Presentation Transcript

  • Compilation guiding and adjusting to hardware changes in Embedded Reconfigurable Architecture (ERA). May 4, 2011. Ayal Zaks, IBM Haifa Research Lab
  • Motivation
    • Moore's Law asserts that advances in hardware double computing power every 18 months.
    • Proebsting's Law asserts that advances in compiler optimizations double computing power every 18 years.
    • Motivation: improve power/performance with compilation? – by leveraging HW advances!
  • Challenges of ERA (EU FP7 STREP)
    • Develop a platform to meet the following requirements:
    • Limited power budgets ( power wall )
    • Break through the memory wall
    • Support for many applications
    • Stringent (real-time) performance needs
    • Improve/introduce functionalities without re-designs
    • Reduce design time and costs ( re-use & programmability )
    • Maintain performance without increasing design costs
    • Dynamically ride the performance-power trade-off curve
    The adaptive ERA platform will be able to meet these challenges!
  • Abstract overview of the ERA platform
    • Applications
    • C/C++/Java compiler
    • OS (or software scheduler)
    • Hardware scheduler
    • Monitoring: power vs. performance
    • LIBRARIES
      • Processing component: ARM, VEX, DSP, accelerators, etc.
      • Network component: crossbar, bus, NoC, etc.
      • Memory component: multi-level caches, controllers, etc.
  • Partners of ERA
    Participant no.   Participant organisation name       Short name   Country
    1 (Coordinator)   Technische Universiteit Delft       TUD          NL
    2                 Industrial Systems Institute        ISI          GR
    3                 Universita' degli Studi di Siena    UNISI        IT
    4                 Chalmers University                 CHALMERS     SE
    5                 University of Edinburgh             UEDIN        UK
    6                 Evidence                            EVI          IT
    7                 ST Microelectronics                 ST           IT
    8                 IBM                                 IBM          IL
    9                 Universidade do Rio Grande do Sul   UFRGS        BR
    10                Uppsala University                  UPP          SE
  • Key elements of the ERA platform
    • Summary of main components (in no particular order) & expertise:
    • Reconfigurable & parameterized computing elements: TUD, UEDIN, UFRGS
    • Reconfigurable & parameterized network components: UFRGS
    • Reconfigurable memory hierarchy and organization: ISI, CHALMERS, UPP
    • Hardware scheduler: TUD, CHALMERS
    • Hardware monitors: CHALMERS
    • OS support: EVI
    • Hardware-aware compilers: UEDIN, IBM
    • Application profiler: UNISI, ST
  • Work packages and leaders
    • WP0: Management (TUD)
    • WP1: Embedded Application Analysis (UNISI)
    • WP2: Dynamic Embedded Processor Arch. (UEDIN)
    • WP3: Memory Hierarchy (ISI)
    • WP4: Software Interface and Tools (IBM)
    • WP5: Dissemination (TUD)
    • WP6: Exploitation (STMICRO)
  • Goals of the ERA project
    • Develop the ERA platform
    • Investigate the performance/power trade-offs in all main components within the ERA platform
    • Investigate and incorporate techniques in compilers to deal with dynamically parameterizable hardware
    • Investigate and incorporate OS scheduling techniques
    • Investigate and develop a hardware scheduler (no code changes needed!)
    • Determine and incorporate metrics in the hardware monitor that allow the application, OS, and hardware scheduler to decide on reconfiguration.
  • ... compilers to deal with dynamically parameterizable hardware
    • Compilers generate assembly-code for specific targets (ISA dictionary, ABI conventions), optimizing it accordingly (latencies, conflicts)
    • What about targets that change?
      • Compile several versions statically, ahead-of-time
      • Dynamic binary translation
      • Re-compile just-in-time (JIT); split compilation process:
        • first compile source code to abstract intermediate language
        • then compile from intermediate language to the specific target
      • Migrating programs (& data) in virtual Cloud environments
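The first option, compiling several versions ahead of time, can be sketched in C as a dispatch on the current hardware configuration. This is only a minimal sketch and not the ERA toolchain's mechanism: `current_issue_width` stands in for whatever the platform would report, and `dot_narrow`, `dot_wide`, `select_dot` and `dot` are hypothetical names. The two kernels merely mimic versions tuned for a narrow and a wide issue width.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the configuration reported by the platform. */
int current_issue_width = 2;

typedef long (*dot_fn)(const short *a, const short *b, size_t n);

/* Version for a narrow configuration: one multiply-add per iteration. */
long dot_narrow(const short *a, const short *b, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}

/* Version for a wide configuration: unrolled by four with independent
   partial sums, so that more operations can be issued together. */
long dot_wide(const short *a, const short *b, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += (long)a[i]     * b[i];
        s1 += (long)a[i + 1] * b[i + 1];
        s2 += (long)a[i + 2] * b[i + 2];
        s3 += (long)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)            /* leftover elements */
        s0 += (long)a[i] * b[i];
    return s0 + s1 + s2 + s3;
}

/* Pick the precompiled version that matches the given issue width.
   The threshold of 4 is an assumption made for this sketch. */
dot_fn select_dot(int issue_width)
{
    assert(issue_width > 0);
    return issue_width >= 4 ? dot_wide : dot_narrow;
}

/* Entry point used by callers: dispatches on the current configuration. */
long dot(const short *a, const short *b, size_t n)
{
    return select_dot(current_issue_width)(a, b, n);
}
```

Both versions compute the same result; only the shape of the code differs, which is what lets the dispatch follow a hardware change without recompiling.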
  • The architecture: ρ-VEX
    • Reconfigurable VLIW Example
    • Load/store, registers
    • can reconfigure: #of syllables, #of FU’s, per syllable #of registers, MEM/Cache params
    • See r-vex.googlecode.com
  • WP4 1st Year Achievements
    • 1. Toolchain for ρ-VEX – compiler work (IBM):
    • Initial implementation of ρ-VEX port in GCC 4.5
    • The compiler assumes the ρ-VEX processor organization:
      • generates up to 4 syllables in VLIW instruction
      • uses GCC’s existing list scheduling and Swing Modulo Scheduling for loops
      • ongoing testing with simulator
    Source code:
        __attribute__ ((noinline))
        short pipe (short a, short b, short c, short d)
        {
          short f, g, t;
          t = a * b;
          f = c * d;
          g = t * f;
          return g+3;
        }
    ρ-VEX Assembly code generated by GCC:
        pipe::
          c0 shl $r0.3 = $r0.3,16
          c0 shl $r0.4 = $r0.4,16
          c0 shl $r0.5 = $r0.5,16
          c0 shl $r0.6 = $r0.6,16
        ;; ;; ;;
          c0 shr $r0.4 = $r0.4,16
          c0 shr $r0.6 = $r0.6,16
          c0 shr $r0.3 = $r0.3,16
          c0 shr $r0.5 = $r0.5,16
        ;; ;; ;;
          c0 mpyll $r0.3 = $r0.3,$r0.4
          c0 mpyll $r0.5 = $r0.5,$r0.6
        ;; ;; ;;
          c0 mpyll $r0.3 = $r0.3,$r0.5
        ;; ;; ;;
          c0 add $r0.3 = $r0.3,3
        ;; ;; ;;
          c0 sxth $r0.3 = $r0.3
        ;; ;; ;;
          c0 return $r0.1 = $r0.1,(0x0),$l0.0
        ;; ;; ;; ;;
    • 2. Characterizing ILP of ERA benchmarks
    • Extract the hot code using standard profiling tools (oprofile, gcov) on standard platforms (PowerPC, x86)
    • Estimate ILP of the hot code by examining the code generated by GCC for different architectures
      • examine Modulo-Scheduled code generated for ρ-VEX
      • modify GCC to assume larger issue rates, additional machine resources, and shorter instruction latencies
    WP4 1st Year Achievements (cont.)
  • Characterizing ILP of ERA benchmarks
    • examine Modulo-Scheduled code generated for ρ-VEX
    Source code:
        void foo (unsigned char *dst, unsigned char *src)
        {
          int x;
          for (x = 0; x < 100; x += 1)
            dst[x] = (src[x] + 1);
        }
    ρ-VEX Assembly (transcribed), modulo scheduled by GCC. Original program (loop body):
        128 = b + 119
        119 = 119 + 4
        127 = a + 119
        129 = MEM[128]
        130 = 129 + 1
        MEM[127] = 130
    [Figure: copies of the loop body from successive iterations overlapped on cycles [0] to [11], with the Initiation Interval, prologue and epilogue marked]
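A lower bound on the initiation interval for this loop can be estimated from its resource usage. The following is a minimal sketch under assumed machine parameters: `issue_width` and `mem_units` are hypothetical and not taken from the slides. It uses the standard resource-constrained bound, the largest value of ceil(uses / available units) over the resources considered, and ignores recurrence constraints. The loop body above has six operations, two of which access memory (one load, one store).

```c
#include <assert.h>

/* Ceiling of a / b for a non-negative a and a positive b. */
int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource-constrained lower bound on the initiation interval.
   Only two resources are modelled here (issue slots and memory units);
   both machine parameters are assumptions of this sketch. */
int res_mii(int total_ops, int mem_ops, int issue_width, int mem_units)
{
    int by_issue = ceil_div(total_ops, issue_width);
    int by_mem   = ceil_div(mem_ops, mem_units);
    return by_issue > by_mem ? by_issue : by_mem;
}
```

For example, `res_mii(6, 2, 4, 1)` gives 2 for a 4-issue configuration with one memory unit, and `res_mii(6, 2, 2, 1)` gives 3 for a 2-issue configuration. This illustrates how the achievable interval depends on the hardware parameters the compiler assumes.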
  • Example - X264 List of hot functions:
  • Poster at
  • Recent Developments
    • Identified benchmarks representative for the embedded market followed by characterization
    • Modeling of memory system and network configurations
    • Fully functional ρ-VEX processor softcore (tested on Virtex-4, -5, -6, and Altera Stratix FPGAs)
    • Results on resource utilization on Virtex-6:
    Issue-width   Slice Registers   Slice LUTs    BRAMs
    2-issue       586 (0%)          6375 (4%)     4 (1%)
    4-issue       1046 (0%)         12899 (8%)    16 (4%)
    8-issue       1868 (0%)         26252 (17%)   64 (15%)
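The scaling in the table above can be checked with a little arithmetic. The sketch below copies the Virtex-6 figures into a C array and computes LUTs per issue slot; the type and function names are hypothetical. It only reports what the numbers show and makes no claim about why the design scales this way.

```c
#include <assert.h>

typedef struct {
    int issue_width;
    int slice_regs;
    int slice_luts;
    int brams;
} utilization;

/* Values copied from the Virtex-6 resource utilization table. */
const utilization virtex6[3] = {
    {2,  586,  6375,  4},
    {4, 1046, 12899, 16},
    {8, 1868, 26252, 64},
};

/* Slice LUTs consumed per issue slot. */
double luts_per_slot(utilization u)
{
    assert(u.issue_width > 0);
    return (double)u.slice_luts / u.issue_width;
}

/* Returns 1 if the BRAM count equals the issue width squared. */
int brams_are_width_squared(utilization u)
{
    return u.brams == u.issue_width * u.issue_width;
}
```

In these three rows the LUT count stays close to 3,200 per issue slot, while the BRAM count equals the square of the issue width.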
  • TLP vs. ILP
    • Leverage reconfigurable multi-core to adapt resources from TLP to ILP or fault-tolerance
    • Key ideas:
      • Add direct pair-wise fine-grain communication support to interconnect and ISA
      • Compiler manages ILP through advanced clustering techniques
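The choice between thread-level and instruction-level parallelism can be pictured as a small policy function. This is a hypothetical sketch, not the ERA scheduler: the names, the inputs (`runnable_threads`, `est_ilp`) and the decision rule are all assumptions made for illustration. It models the case of two narrow cores that can be merged into one wide core.

```c
#include <assert.h>

typedef enum { TWO_NARROW_CORES, ONE_WIDE_CORE } core_config;

/* Hypothetical policy: with two or more runnable threads, keep the
   cores separate (TLP). With a single thread, merge the cores only
   when its estimated ILP exceeds what one narrow core can issue. */
core_config choose_config(int runnable_threads, double est_ilp, int narrow_width)
{
    assert(runnable_threads >= 0 && narrow_width > 0);
    if (runnable_threads >= 2)
        return TWO_NARROW_CORES;
    return est_ilp > (double)narrow_width ? ONE_WIDE_CORE : TWO_NARROW_CORES;
}
```

In a real system the inputs would presumably come from the hardware monitors and from compiler analysis; here they are plain arguments.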
  • Core vs. Cache
    • [Figure: EDP plot, labelled GCC]
    • Different configurations, same EDP!
    • Copyright © Keramidis & Kaxiras, ERA project
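The energy-delay product is energy multiplied by execution time, so two configurations with different energy and delay can land on the same EDP. The sketch below shows the arithmetic; the struct, the function names and any numbers passed in are hypothetical and are not the measurements from the slide.

```c
#include <assert.h>

typedef struct {
    const char *name;   /* label for the configuration (illustrative) */
    double energy_j;    /* energy in joules */
    double delay_s;     /* execution time in seconds */
} config_result;

/* Energy-delay product: energy multiplied by delay. */
double edp(config_result r)
{
    assert(r.energy_j >= 0.0 && r.delay_s >= 0.0);
    return r.energy_j * r.delay_s;
}

/* Returns 1 if the two EDP values differ by at most rel_tol,
   measured relative to the larger of the two. */
int same_edp(config_result a, config_result b, double rel_tol)
{
    double ea = edp(a), eb = edp(b);
    double diff = ea > eb ? ea - eb : eb - ea;
    double larger = ea > eb ? ea : eb;
    return diff <= rel_tol * larger;
}
```

With made-up inputs, a configuration using 2 J over 3 s and another using 3 J over 2 s both have an EDP of 6 J·s, so `same_edp` reports them as equivalent even though the hardware settings differ.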
  • Conclusions
    • “Hardware has become more flexible than software” ( )
    • Couple HW reconfiguration and monitoring with SW analysis, transformation and management (compiler, OS) to provide a flexible, power/performance-efficient platform
    • Requires interdisciplinary, open collaboration; in line with “Application Driven Design” and “Pursuing Growth through Collaboration” objectives of
  • Thanks! To you and:
    • partners
    • network of excellence
    • IBM
  • Contact information
    • Visit http://www.era-project.eu for more information
    • Coordinator: Stephan Wong (Delft University of Technology) [email_address] http://ce.et.tudelft.nl/~stephan/
    • IBM representative, Work Package 4 leader: Ayal Zaks (IBM Haifa Research Lab) [email_address] https://www.research.ibm.com/haifa/dept/svt/code_compiler.html
  • Strengths of ERA partners
    • TUD: reconfigurable architectures, parameterized VLIW cores
    • ISI: power-efficient memories, processor architectures
    • UNISI: application modeling, power modeling
    • Chalmers: memory hierarchies, hardware monitoring
    • UEDIN: compilers, parallel architectures
    • Evidence: embedded OS
    • ST Micro: large industrial player, ES design experience
    • IBM: tools, compilers
    • UFRGS: computer architectures, reconfigurable technologies