A Speculative Technique for Auto-Memoization Processor with Multithreading

Yushi KAMIYA †, Tomoaki TSUMURA †, Hiroshi MATSUO †, Yasuhiko NAKASHIMA ‡
† Nagoya Institute of Technology
‡ Nara Institute of Science and Technology
The 10th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Hiroshima, Japan, December 9, 2009


Research background
Auto-Memoization Processor
How to skip execution

Memoization for functions and loops
Memoizable instruction regions:
(A) Functions: the region between a callee label and a return instruction
(B) Loops: the region between a backward branch and its branch target label
[Figure: assembly sketch in which main calls func (ending in "return %x"), and a loop runs from the label .LL3 to the backward branch "ba .LL3"]

Auto-Memoization Processor
[Figure: processor block diagram with Regs, ALU, D$1, D$2, MemoBuf (temporary buffer), and MemoTbl. Labels: detect a function or a loop; input matching; match, write back; computing, save the input/output sequence; end of computation, store.]
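The idea in the diagram can be imitated in software. The following is a minimal sketch, not the hardware mechanism: it treats the argument and the globals that `opr` reads (`x` and `y[1]`) as the input sequence and the return value as the output. The names `opr_memo`, `opr_body`, `memo_tbl`, and `opr_exec_count`, the table size, the round-robin replacement, and the initial value of `y[0]` are assumptions made for illustration. The input set is hard-coded here, whereas the slides describe inputs being recorded as the region runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MEMO_ENTRIES 8   /* hypothetical table size */
#define N_INPUTS 3       /* argument a, global x, global y[1] */

/* One recorded execution: the inputs that were read and the output produced. */
typedef struct {
    bool valid;
    int inputs[N_INPUTS];
    int output;
} MemoEntry;

static MemoEntry memo_tbl[MEMO_ENTRIES];
static size_t next_slot;          /* round-robin replacement (an assumption) */

/* Globals read by the region count as inputs, just like the argument. */
int x = 2;
int y[5] = {0, 1, 0, 0, 0};
int opr_exec_count = 0;           /* counts real executions of the body */

static int opr_body(int a) {
    int v;
    opr_exec_count++;
    v = x + a;
    v = v * y[1];
    return v;
}

/* Returns the stored output if the current inputs match a recorded
 * execution; otherwise runs the body and records inputs and output. */
int opr_memo(int a) {
    const int in[N_INPUTS] = {a, x, y[1]};
    for (size_t i = 0; i < MEMO_ENTRIES; i++) {
        const MemoEntry *e = &memo_tbl[i];
        if (e->valid && memcmp(e->inputs, in, sizeof in) == 0)
            return e->output;     /* reuse: the body is skipped */
    }
    MemoEntry *e = &memo_tbl[next_slot];
    next_slot = (next_slot + 1) % MEMO_ENTRIES;
    memcpy(e->inputs, in, sizeof in);
    e->output = opr_body(a);
    e->valid = true;
    return e->output;
}
```

Calling `opr_memo(4)` twice executes the body once. Changing `x` between calls makes the second call miss, because the stored inputs no longer match.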

Registration of an input sequence
Example code:
    int x, y[5];
    ...
    opr(4);
    ...
    opr(int a) {
        int v;
        v = x + a;
        v = v * y[1];
        return (v);
    }
Memory (cache): x at 0x1000, y[0] at 0x1004, y[1] at 0x1008.
MemoBuf contents recorded for opr(4):
    val  %i0    00000004
    adr  x      00001000
    val  x      00000002
    adr  y[1]   00001008
    val  y[1]   00000001
[Figure: the recorded sequence is stored from MemoBuf into MemoTbl, which consists of RF (RAM), RB (CAM), RA (RAM), and W1 (RAM). The inputs are registered as RB/RA entries in three steps labeled (A), (B), (C), and a W1 pointer leads to the stored output (v=6). Another entry in the table holds v=140.]

Input Matching
[Figure: the opr(4) example with the MemoTbl (RF, RB, RA, W1) and memory contents. The current inputs are compared with the RB entries using the following sequence of search keys:]
    FF:00000004
    00:00000002
    02:00000001
[The W1 pointer of the matched entry gives the stored output (v=6).]
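The search keys in this slide (`FF:00000004`, `00:00000002`, `02:00000001`) suggest that each key pairs an index with an input value. The sketch below is one way to model that, under the assumption that the index is that of the previously matched entry and that each entry also names the next address to read. The type `Node`, the functions `register_seq` and `match`, the `END_ADDR` marker, and the value of `y[0]` in `memory` are hypothetical. A linear scan stands in for the associative (CAM) search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TBL_SIZE  16
#define ROOT      0xFF     /* parent index of the first input ("FF" in the slide) */
#define END_ADDR  0u       /* hypothetical marker: no further input to compare */
#define MEM_BASE  0x1000u
#define MEM_WORDS 3

typedef struct {
    bool     valid;
    uint8_t  parent;       /* index of the previous entry in the chain */
    uint32_t value;        /* input value to compare */
    uint32_t next_addr;    /* address of the next input, or END_ADDR */
    uint32_t output;       /* stored output, meaningful on the last entry */
} Node;

static Node tbl[TBL_SIZE];
uint32_t memory[MEM_WORDS] = {2, 0, 1};  /* x, y[0], y[1] at 0x1000.. */

static uint32_t load(uint32_t addr) {
    assert(addr >= MEM_BASE && (addr - MEM_BASE) / 4 < MEM_WORDS);
    return memory[(addr - MEM_BASE) / 4];
}

/* Emulates the associative search: find the entry with this (parent, value). */
static int find(uint8_t parent, uint32_t value) {
    for (int i = 0; i < TBL_SIZE; i++)
        if (tbl[i].valid && tbl[i].parent == parent && tbl[i].value == value)
            return i;
    return -1;
}

static int find_or_add(uint8_t parent, uint32_t value, uint32_t next_addr) {
    int i = find(parent, value);
    if (i >= 0) return i;                 /* shared prefix: reuse the entry */
    for (i = 0; i < TBL_SIZE; i++) {
        if (!tbl[i].valid) {
            tbl[i] = (Node){true, parent, value, next_addr, 0};
            return i;
        }
    }
    return -1;                            /* table full */
}

/* Registers the argument, then the values currently stored at addrs[0..n-1]. */
bool register_seq(uint32_t arg, const uint32_t *addrs, int n, uint32_t output) {
    int idx = find_or_add(ROOT, arg, n > 0 ? addrs[0] : END_ADDR);
    for (int k = 0; k < n && idx >= 0; k++) {
        uint32_t next = (k + 1 < n) ? addrs[k + 1] : END_ADDR;
        idx = find_or_add((uint8_t)idx, load(addrs[k]), next);
    }
    if (idx < 0) return false;
    tbl[idx].output = output;
    return true;
}

/* Walks the chain, reading each next input from memory; true on a full match. */
bool match(uint32_t arg, uint32_t *output) {
    int idx = find(ROOT, arg);
    while (idx >= 0) {
        if (tbl[idx].next_addr == END_ADDR) {
            *output = tbl[idx].output;
            return true;
        }
        idx = find((uint8_t)idx, load(tbl[idx].next_addr));
    }
    return false;
}
```

Registering `opr(4)` with the addresses of `x` and `y[1]` and the output 6 lets a later `match(4, &out)` succeed while memory is unchanged. Modifying `x` in `memory` makes the chain break at the second key.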

Reuse Overhead
Reuse overheads:
1. Comparing the input sequence with the values of the RB entries
2. Writing back the output sequence to Regs and D$1
[Figure: the MemoTbl (RF, RB, RA, W1) and memory contents for the opr(4) example.]

Speculative Multithreading
[Figure: one main core and three SpMT cores sharing MemoTbl. Using stride value prediction, the SpMT cores calculate the function in advance, and the main core reuses the function results from MemoTbl.]
Example with fact (factorial, n!): fact(1) = 1, fact(2) = 2, fact(3) = 6, fact(4) = 24, fact(5) = 120
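The interplay of stride value prediction and reuse in the `fact` example can be sketched sequentially. This is a single-threaded simulation under stated assumptions, not the multi-core mechanism: `speculate` stands in for what the SpMT cores would do in parallel, the table is indexed directly by the argument, and the names `StridePredictor`, `observe`, `speculate`, and the lookahead depth of 3 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_N 20   /* 20! still fits in 64 bits */

typedef struct { bool valid; unsigned long long result; } Memo;
static Memo memo_tbl[MAX_N + 1];   /* stands in for MemoTbl, indexed by argument */

static unsigned long long fact_body(int n) {
    unsigned long long r = 1;
    for (int i = 2; i <= n; i++) r *= (unsigned long long)i;
    return r;
}

/* Stride predictor: remembers the last argument and the last difference. */
typedef struct { int last; int stride; int seen; } StridePredictor;

static void observe(StridePredictor *p, int arg) {
    if (p->seen > 0) p->stride = arg - p->last;
    p->last = arg;
    p->seen++;
}

/* Stand-in for the SpMT cores: compute results for predicted arguments. */
static void speculate(const StridePredictor *p, int depth) {
    if (p->seen < 2) return;                 /* no stride known yet */
    for (int k = 1; k <= depth; k++) {
        int predicted = p->last + k * p->stride;
        if (predicted < 0 || predicted > MAX_N) continue;
        if (!memo_tbl[predicted].valid) {
            memo_tbl[predicted].result = fact_body(predicted);
            memo_tbl[predicted].valid = true;
        }
    }
}

/* Stand-in for the main core: reuse a stored result when one exists. */
unsigned long long fact(int n, StridePredictor *p, bool *reused) {
    assert(n >= 0 && n <= MAX_N);
    *reused = memo_tbl[n].valid;
    if (!*reused) {
        memo_tbl[n].result = fact_body(n);
        memo_tbl[n].valid = true;
    }
    unsigned long long r = memo_tbl[n].result;
    observe(p, n);
    speculate(p, 3);
    return r;
}
```

With calls in the order `fact(1)`, `fact(2)`, `fact(3)`, the first two run the body because no stride is known yet. From the third call on, the result is already in the table because the stride of 1 was predicted.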

Memoization and Multithreading
Our proposal

Reduction of Reuse Overhead
No-memoization thread: assumes that the input matching will fail
Preceding thread: assumes that the input matching will succeed
Area (A) is executed normally; area (B) is executed speculatively.
[Figure: the code fragment "... v = u / w; sum(); y = x + 4; ..." annotated with areas (A) and (B)]

Execution model
[Figure: timelines comparing the former model with the proposal model, which adds a preceding thread and a no-memoization thread to the main thread. Legend: execution, search, write back. Panels (A), (B), (C) cover three situations: the first several input values match the values of RB entries, the inputs match completely, and the inputs do not match. The reuse overhead is reduced by (α + β).]
Code in the figure:
    ...
    v = u / w;
    x = sum(5, 3);
    y = x + 4;
    z = x + y;
    ...
    x = sum(3, 6);
    z = x + y;
    ...
    int sum(a, b) {
        int i, sum = 0;
        for(i=0; i<a; i++)
            sum += i + b;
        return(sum);
    }

Prediction Pointer
[Figure: the MemoTbl (RF, RB, RA, W1) and memory contents for the opr(4) example, with a prediction pointer shown alongside the W1 pointer. Labels: W1 pointer, prediction pointer, match.]

Architecture: the proposal model
[Figure: block diagram of the proposal model.]
D$2 and MemoTbl: shared with all cores
Cores for the main thread, the preceding thread, and the no-memoization thread: Regs, D$1, ALU, and SpRF (additional register file set), with a shared MemoBuf
SpMT cores: Regs, D$1, ALU, a local MemoBuf, and input prediction; SpMT cores don't use the shared MemoBuf

Register Synchronization
[Figure: the register files (RF) and speculative register files (SpRF) of the main, preceding, and no-memoization threads at steps [0], [1], [2], with a register mask holding one bit per register (g0, g1, g2, ...) and example register values 0FFF1000, 00000040, 00000050. Labels: search, WB, RF ⇔ SpRF, not synchronized.]
Code in the figure, divided into regions (A), (B), (C):
    ...
    sum();
    a = b * c;
    ...
    min(a, b, c);
    ...
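A register mask can limit how much has to be copied when register files are synchronized. The sketch below assumes one particular meaning for the mask: a bit is set when the corresponding register has been written since the last synchronization, and only those registers are copied. That interpretation, the `RegFile` type, and the functions `reg_write`, `sync_masked`, and `sync_all` are assumptions for illustration; the slide itself only shows the mask bits and register values. The cycle counts use the "1 cycle per 32 bits" register copy cost from the evaluation parameters.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32

typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t mask;   /* bit r set: register r written since the last sync */
} RegFile;

/* Writes a register and marks it in the mask. */
void reg_write(RegFile *rf, int r, uint32_t v) {
    assert(r >= 0 && r < NUM_REGS);
    rf->regs[r] = v;
    rf->mask |= (uint32_t)1 << r;
}

/* Copies only the masked registers from src to dst and clears the mask.
 * Returns the copy cost in cycles, at 1 cycle per 32-bit register. */
int sync_masked(RegFile *dst, RegFile *src) {
    int cycles = 0;
    for (int r = 0; r < NUM_REGS; r++) {
        if (src->mask & ((uint32_t)1 << r)) {
            dst->regs[r] = src->regs[r];
            cycles++;
        }
    }
    src->mask = 0;
    return cycles;
}

/* Baseline for comparison: copies every register regardless of the mask. */
int sync_all(RegFile *dst, const RegFile *src) {
    for (int r = 0; r < NUM_REGS; r++)
        dst->regs[r] = src->regs[r];
    return NUM_REGS;
}
```

If a speculative thread has written two registers, `sync_masked` costs 2 cycles in this model where `sync_all` costs 32.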

Performance Evaluation
Parameters:
    MemoBuf (shared + local), RAM                160 KBytes
    MemoTbl, CAM                                 128 KBytes
    MemoTbl, RAM                                 448 KBytes
    Comparison (register and CAM)                9 cycles / 32 bytes
    Comparison (cache and CAM)                   10 cycles / 32 bytes
    Write back (MemoTbl ⇒ register or cache)     1 cycle / 32 bytes
    Register copy                                1 cycle / 32 bits
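The latency parameters can be turned into a rough reuse-overhead estimate. The sketch below takes the per-32-byte costs from the slide and adds two assumptions of its own: partial 32-byte blocks are rounded up, and the comparison and write-back costs simply add. The function names `reuse_overhead_cycles` and `reg_copy_cycles` are hypothetical.

```c
#include <stddef.h>

#define BLOCK_BYTES       32u
#define CMP_REG_CYCLES     9u   /* comparison, register and CAM, per 32 bytes */
#define CMP_CACHE_CYCLES  10u   /* comparison, cache and CAM, per 32 bytes */
#define WRITE_BACK_CYCLES  1u   /* write back, per 32 bytes */

/* Number of 32-byte blocks needed for the given byte count (rounded up). */
static size_t blocks(size_t bytes) {
    return (bytes + BLOCK_BYTES - 1) / BLOCK_BYTES;
}

/* Estimated cycles to match the inputs and write back the outputs of one
 * reuse, given the bytes of register inputs, cache inputs, and outputs. */
size_t reuse_overhead_cycles(size_t reg_in_bytes, size_t cache_in_bytes,
                             size_t out_bytes) {
    return blocks(reg_in_bytes) * CMP_REG_CYCLES
         + blocks(cache_in_bytes) * CMP_CACHE_CYCLES
         + blocks(out_bytes) * WRITE_BACK_CYCLES;
}

/* Register copy cost: 1 cycle per 32-bit register. */
size_t reg_copy_cycles(size_t n_regs) {
    return n_regs;
}
```

Under these assumptions, a region with one 4-byte register input, two 4-byte cache inputs, and one 4-byte output would cost 9 + 10 + 1 = 20 cycles to reuse.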

Performance: SPEC CPU95
[Figure: cycles per benchmark for each model, broken down into exec, reuse_ovh, regcopy, window, D$1, and D$2.]
CINT: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
CFP: 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, 107.mgrid, 110.applu, 125.turb3d, 141.apsi, 145.fpppp, 146.wave5
Models: (N) w/o Memoization, (M) Memoization, (S) Memoization + SpMT, (P) Memoization + Proposal, (A) Memoization + SpMT + Proposal
Reduced cycles:
    Model                                  max      ave.
    (M) Memoization                        13.9%    -0.1%
    (S) Memoization + SpMT                 35.2%     5.6%
    (P) Memoization + Proposal             21.7%     2.1%
    (A) Memoization + SpMT + Proposal      36.0%     9.0%


Register copy overhead
[Figure: register copy overhead per benchmark, comparing "Copy all values" with "The proposal model"; latency: 32 bits/cycle.]
Benchmarks shown: 099.go, 147.vortex, 126.gcc, 130.li, 146.wave5, 124.m88ksim, 129.compress, 132.ijpeg, 145.fpppp, 141.apsi, 101.tomcatv, 110.applu, 104.hydro2d, 102.swim
