A Lightweight Instruction Scheduling Algorithm for Just-In-Time Compiler on XScale

In this paper, we present a lightweight algorithm of instruction scheduling to reduce the pipeline stalls on XScale.

1. A Lightweight Instruction Scheduling Algorithm for Just-In-Time Compiler on XScale
   Xiaohua Shi, Peng Guo
   Programming System Lab, Microprocessor Technology Labs
   {xiaohua.shi,peng.guo }@intel.com
   Presenter: Shuai-wei Huang, Date: 2007/11/21
2. Contents
   1. Introduction
   2. XScale Core Pipelines
   3. LIS Algorithm
   4. Performance Evaluation
   5. Conclusions
3. Introduction
   - For a J2ME JIT, the scheduling algorithms face two challenges:
     - a small memory budget
     - a compilation-time constraint
   - In this paper, we present a lightweight instruction scheduling algorithm to reduce pipeline stalls on XScale.
   - It uses only a small, constant-size piece of memory, about 1 KB per thread.
4. Introduction: Related Work
   - List scheduling is widely adopted in compilers. In practice its time complexity can be close to linear, but it is O(n^2) in the code length in the worst case.

     Author                                  Published  Data structure  Construction complexity
     Gibbons & Muchnick (List Scheduling)    1986       DAG             O(n^2)
     Goodman & Hsu (IPS)                     1988       DAG             O(n^2)
     Kurlander, Proebsting & Fischer (DLS)   1995       DAG             O(n^2)
5. Introduction: DAG Example
   - LIS is not based on Directed Acyclic Graphs (DAGs) or expression trees, but on a novel data structure, the extended dependency matrix (EDM).
6. Introduction: XORP JIT
   - XORP (XScale Open Runtime Platform) is Intel's J2ME JVM for both CDC and CLDC configurations on XScale.
   - Most optimizations in the JIT compiler have linear, or in practice nearly linear, time complexity, under a constrained memory budget.
   - The instruction scheduling module is the last optimization before the JIT emits the result code.
   - XORP JIT does not pay the price of a global scheduling mechanism with much higher complexity.
7. XScale Core Pipelines: Superpipeline
   - The XScale core consists of a main execution pipeline, a multiply/accumulate (MAC) pipeline, and a memory access pipeline.
8. XScale Core Pipelines: Out-of-Order Completion
   - Instructions in different pipelines may complete out of order if no data dependencies exist:

         I0: ldr R1, [R0]
         I1: add R2, R2, R3
         I2: add R4, R1, R2

   - Instruction I1 can complete before I0, because they are processed in different pipelines.
   - Instruction I2 depends on the results of both I0 and I1, so it must wait for all of its predecessors to complete.
9. XScale Core Pipelines: Resource Conflicts
   - Multiply instructions can cause pipeline stalls due to either result latencies or resource conflicts: no more than two instructions can occupy the MAC pipeline concurrently.
   - For instance, the following two instructions, with no data dependency between them, will incur a stall of 0~3 cycles due to a resource conflict, depending on the actual execution cycles of I0:

         I0: mul R0, R4, R5
         I1: mul R1, R6, R7
10. XScale Core Pipelines: Load-Use Stalls
   - In many Java applications, the typical pipeline stalls come from load-use instruction pairs.
   - Pipeline stalls occur before instructions I1 and I2, respectively:

         ...                          ; prepare outgoing arguments
         I0: ldr R12, [R0]            ; get vtable from object handle
         I1: ldr R12, [R12 + offset]  ; vtable + offset is the address of the method entry
         I2: blx R12                  ; indirect branch to the method entry
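As a sketch of how such stalls can be estimated, the helper below models the load-use penalty, assuming a 3-cycle load result latency and one issue slot per instruction; the function name and the simple distance-based model are illustrative, not taken from the paper.

```python
# Minimal load-use stall model (illustrative assumptions):
# a load's result becomes available LOAD_RESULT_LATENCY cycles after it
# issues, and each instruction occupies one issue slot.
LOAD_RESULT_LATENCY = 3

def stall_cycles(producer_idx, consumer_idx, latency=LOAD_RESULT_LATENCY):
    """Cycles the consumer waits if it uses a load result too early."""
    distance = consumer_idx - producer_idx  # issue slots between the two
    return max(0, latency - distance)

# I1 uses R12 loaded by I0 in the very next slot: a 2-cycle stall.
print(stall_cycles(0, 1))  # -> 2
# I2 uses R12 loaded by I1, again back-to-back: another 2-cycle stall.
print(stall_cycles(1, 2))  # -> 2
```

With three independent instructions between a load and its user, the same model predicts no stall, which is exactly the gap LIS tries to fill by reordering.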
11. LIS Algorithm: EDM
   - DM (dependency matrix):
     - -1: no dependency
     - 0: dependency, but no stall
     - positive integer: pipeline stall cycles
   - Cyl: estimated execution cycles from I0 to each instruction.
   - Stl: pipeline stall cycles before issuing an instruction.
   - Ceil: the smallest-indexed instruction causing the stall.
   - UP, DWN: the boundaries to which an instruction can be safely moved without breaking data dependencies.

         I0: add R0, R5, R6
         I1: sub R1, R7, R8
         I2: ldr R2, [R4, 0x4]
         I3: add R3, R2, R1
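A minimal sketch of building these EDM columns for the four-instruction example, assuming a 3-cycle load result latency, a 1-cycle latency for other instructions, and a simplified register def/use legality rule; the encoding and field names are illustrative, not XORP's implementation.

```python
# Build a simplified EDM for the slide's example. Only register true
# dependencies and load-use latency are modeled (assumptions, see above).
LOAD_LATENCY = 3  # assumed XScale load result latency

# (opcode, destination register, source registers)
insts = [
    ("add", "R0", ("R5", "R6")),  # I0
    ("sub", "R1", ("R7", "R8")),  # I1
    ("ldr", "R2", ("R4",)),       # I2
    ("add", "R3", ("R2", "R1")),  # I3
]

n = len(insts)
# DM[j][i]: -1 no dependency, 0 dependency without stall, >0 stall cycles.
DM = [[-1] * n for _ in range(n)]
for j, (_, _, srcs) in enumerate(insts):
    for i in range(j):
        op_i, dst_i, _ = insts[i]
        if dst_i in srcs:  # true dependency of I_j on I_i
            latency = LOAD_LATENCY if op_i == "ldr" else 1
            DM[j][i] = max(0, latency - (j - i))

# Stl[j]: stall cycles before issuing I_j; Ceil[j]: smallest-indexed culprit.
Stl = [max([0] + [d for d in DM[j] if d > 0]) for j in range(n)]
Ceil = [next((i for i in range(n) if DM[j][i] > 0), -1) for j in range(n)]

# DWN[i]: farthest index I_i can sink to without passing a reader or
# re-writer of its destination register (a simplified legality rule).
DWN = []
for i, (_, dst, _) in enumerate(insts):
    limit = n
    for j in range(i + 1, n):
        _, dst_j, srcs_j = insts[j]
        if dst in srcs_j or dst_j == dst:
            limit = j
            break
    DWN.append(limit)

print(Stl)   # -> [0, 0, 0, 2]: a 2-cycle stall before I3
print(Ceil)  # -> [-1, -1, -1, 2]: the culprit of I3's stall is I2
print(DWN)   # -> [4, 3, 3, 4]: I0 and I1 can sink to index 3 or beyond
```

The computed values match the slide's observations: a 2-cycle stall before I3, caused by the load I2, with DWN of I0 and I1 equal to or larger than 3.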
12. LIS Algorithm: Re-ordered Native Instructions
   - According to the values in the column "Stl", there is a 2-cycle stall before instruction I3.
   - The "DWN" values of the first two instructions, I0 and I1, are equal to or larger than 3.
   - Both I0 and I1 can therefore be safely moved before I3, to overlap the pipeline stall between I2 and I3:

         I2: ldr R2, [R4, 0x4]
         I0: add R0, R5, R6
         I1: sub R1, R7, R8
         I3: add R3, R2, R1
13. LIS Algorithm: Stall Enclosure
   - The motivation of the LIS algorithm is to look for instructions that can be moved before the stalled ones.
   - Stl_En_n is a "stall enclosure", which includes all instructions from index Ceil_n to n.
   - LIS avoids moving instructions that belong to a Stl_En; instead, it moves surrounding instructions into one.
14. LIS Algorithm

       for (every Stl_n > 0) {
           t = Stl_n;
           /* scan upward, from just above the stall enclosure */
           for (m = Ceil_n - 1; m >= 0 && t > 0; m--) {
               if (I_m belongs to another Stl_En) break;
               if (I_m has been moved before) continue;
               if (DWN_m > Ceil_n) {
                   move I_m after Ceil_n;
                   t = t - issue-latency(I_m);
               }
           }
           /* scan downward, from just below the stalled instruction */
           for (m = n + 1; m <= last instruction && t > 0; m++) {
               if (I_m belongs to another Stl_En) break;
               if (I_m has been moved before) continue;
               if (UP_m < n) {
                   move I_m before I_n;
                   t = t - issue-latency(I_m);
               }
           }
       }

   (Figure: regions (A)-(E) around two stall enclosures, Stl_En_n for the n-th instruction and Stl_En_y for the y-th.)
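The pseudocode above can be sketched as runnable code. The Python version below is specialized to the four-instruction example and to load-use stalls: it implements the upward scan, while the downward scan and the "moved before" bookkeeping are omitted because the example never exercises them. The dependency test, latencies, and helper names are simplifying assumptions, not XORP's implementation.

```python
# Runnable sketch of the upward half of LIS (illustrative assumptions).
LOAD_LATENCY = 3  # assumed load result latency; others take 1 cycle

insts = [
    ("add", "R0", ("R5", "R6")),  # I0
    ("sub", "R1", ("R7", "R8")),  # I1
    ("ldr", "R2", ("R4",)),       # I2
    ("add", "R3", ("R2", "R1")),  # I3
]

def deps(insts):
    """Compute the Stl, Ceil and DWN columns from register def/use."""
    n = len(insts)
    Stl, Ceil, DWN = [0] * n, [-1] * n, [n] * n
    for j, (_, _, srcs) in enumerate(insts):
        for i in range(j):
            op_i, dst_i, _ = insts[i]
            if dst_i in srcs:
                lat = LOAD_LATENCY if op_i == "ldr" else 1
                stall = max(0, lat - (j - i))
                if stall > 0:
                    Stl[j] = max(Stl[j], stall)
                    if Ceil[j] < 0:
                        Ceil[j] = i  # smallest-indexed culprit
    for i, (_, dst, _) in enumerate(insts):
        for j in range(i + 1, n):
            _, dst_j, srcs_j = insts[j]
            if dst in srcs_j or dst_j == dst:
                DWN[i] = j
                break
    return Stl, Ceil, DWN

def schedule(insts):
    """Upward scan: sink independent earlier instructions into the
    stall enclosure, right after the Ceil instruction."""
    order = list(insts)
    Stl, Ceil, DWN = deps(insts)
    for idx, stall in enumerate(Stl):
        if stall <= 0:
            continue
        t, ceil = stall, Ceil[idx]
        ceil_inst = insts[ceil]
        for m in range(ceil - 1, -1, -1):
            if t <= 0:
                break
            if DWN[m] > ceil:  # legal to move I_m past Ceil_n
                inst = insts[m]
                order.remove(inst)
                order.insert(order.index(ceil_inst) + 1, inst)
                t -= 1  # assumed 1-cycle issue latency
    return order

print([f"{op} {dst}" for op, dst, _ in schedule(insts)])
# -> ['ldr R2', 'add R0', 'sub R1', 'add R3']
```

The output reproduces the re-ordered sequence of slide 12: the load I2 is issued first, I0 and I1 fill its 2-cycle shadow, and I3 no longer stalls.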
15. LIS Algorithm: Complexity
   - In practice, only a few instructions around the stalled ones (i.e. instructions with positive Stl values) are visited during scheduling. This is one of the key reasons this algorithm runs faster than previous approaches.
   - In the worst case, every instruction is visited at most twice, as introduced in the previous section, so the time complexity is still linear in the code length.
16. LIS Algorithm: Static Counts of Total and Stalled Instructions
   - On average there are about 6.8 times more total instructions than stalled instructions, and up to 8.31 times more for kXML. This large gap is what makes LIS run fast.
17. LIS Algorithm: Complexity (Building the EDM)
   - In the XORP JIT, we use a scheduling window with a constant size to build the EDM.
   - All columns except "DWN" can be calculated as the EDM grows.
   - When reaching a basic-block boundary, or when the scheduling window overflows, the algorithm updates all values in the "DWN" column.
   - For every thread, most of the memory required by this algorithm is the constant-size EDM. With 16 rows, its total size is less than 1 KB.
18. Performance Evaluation
   - The XORP JIT compiles every Java method to XScale native instructions the first time the method is called at runtime.
   - The implementation of list scheduling used for comparison is based on simple heuristic rules and does not deal with register allocation.
   - List scheduling traverses the DAG from the roots toward the leaves, selects the node with:
     - the earliest execution time
     - the maximum possible delay
   - and updates the current time and the earliest execution time of its children.
19. Performance Evaluation
   - We chose six Java workloads from EEMBC [3], namely Chess, Cryptography, kXML, Parallel, PNG Decoding, and Regular Expression, to compare list scheduling and LIS in terms of compilation time and runtime performance.
20. Performance Evaluation
   - Figure 7 shows that the average compilation time for EEMBC occupies 25.3% of the first-round execution time. List scheduling consumes 15.9% of the total compilation time, and 4% of the total execution time on average.
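As a quick consistency check of the percentages quoted from Figure 7: if compilation is 25.3% of first-round execution and list scheduling is 15.9% of compilation time, its share of total execution should indeed come out at about 4%.

```python
# Cross-check the Figure 7 percentages quoted above.
compile_share = 0.253  # compilation time / 1st-round execution time
sched_share = 0.159    # list-scheduling time / compilation time
print(round(compile_share * sched_share * 100, 1))  # -> 4.0
```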
21. Performance Evaluation
   - Figure 8 compares the compilation time of LIS and list scheduling, including both the scheduling itself and building the EDM or the DAGs.
22. Performance Evaluation
   - Figure 9 presents the efficiency of the result code produced by LIS.
23. Performance Evaluation
   - As Figure 10 shows, the runtime performance improvement from LIS is significant for the workloads we studied.
24. Conclusions
   - Because of the resource constraints on embedded systems, especially power constraints, processor capability and memory footprint are still bottlenecks for a high-performance JIT compiler.
   - On XScale, the 3-cycle L1 cache latency can produce significant pipeline stalls at runtime, as introduced above.
   - Lightweight instruction scheduling mechanisms like LIS can reduce these pipeline stalls in an easier and faster way.
25. Thank You!