Hardware-assisted x86 emulation on Godson-3
  • Transcript

    • 1. Hardware-assisted x86 emulation on Loongson-3 syuu@openbsd.org
    • 2. What is Loongson?
      • “A Chinese Challenge to Intel”
      • Microprocessor development project at ICT (Institute of Computing Technology, Chinese Academy of Sciences)
      • STMicroelectronics manufactures and sells the chips
      • MIPS compatible, but independently developed
    • 3. History of Loongson
      • 2002 Loongson 1: 200MHz, 180nm, MIPS32
      • 2003 Loongson 2B: 250MHz, 180nm, MIPS64
      • 2004 Loongson 2C: 450MHz, 180nm, MIPS64
      • 2006 Loongson 2E: 1GHz, 90nm, MIPS64, 512KB L2, DDR, 5~7W
      • 2006 Loongson 2C based small computer released (Lemote Longmeng)
      • 2007 Loongson 2F: 1GHz, 90nm, MIPS64, 512KB L2, DDR2, PCI/PCI-X, 3~5W
      • 2007 Loongson 2F based HPC revealed (KD-50-I, 330 core, 1TFLOPS)
      • 2008 Loongson 2E/2F based netbooks released (Jisus, Gdium, Lemote Yeeloong)
      • 2009 ICT licensed the MIPS32/64 architecture from MIPS Technologies
      • 2010 Loongson 3A: 4 core, 1GHz, 65nm, 4MB L2, DDR2/3, HyperTransport 1.0, PCI/PCI-X, 10W
      • 2010 Loongson 3A based HPC announced (KD-60-I, 4x80 core, 1TFLOPS)
    • 4. SPEC CPU2000 rate: Godson development [Figure: SPEC CPU2000 rate, Godson versus Intel/AMD/HP/IBM/SGI/Sparc]
    • 5. Yes, it runs OpenBSD!
    • 6. Also other OSes
      • Linux: Debian, RedFlag, Mandriva...
      • NetBSD
      • Windows CE
    • 7. GS464 (Loongson 3A)
      • Scalable architecture
      • Reconfigurable CPU core and L2
      • Hardware-assisted x86 emulation
      • Low power consumption
    • 8. Scalable Architecture
      • Scalable interconnection network: crossbar + mesh
        • 8x8 crossbar
        • A single crossbar connects cores, L2s, and four directions
      • Directory-based cache coherence protocol
        • Distributed L2 caches are globally addressed
        • Each cache block has a directory entry
        • Both data and instruction caches are recorded in the directory
      • Further chips: 65nm (3B), 32nm (3C)
      • [Figure: cores P0 to P3 and four L2 banks attached to 8x8 crossbars with N/S/E/W mesh ports]
    • 9. Reconfigurable CPU core and L2
      • Reconfigurable architecture: special-purpose core (GStera) and general-purpose core (GS464)
      • The DMA engine can be configured to achieve high performance
      • 8 configurable address windows on each master port allow page migration across L2 and memory
    • 10. Hardware-assisted x86 emulation
      • With software-only binary translation, some x86 instructions require tens of MIPS instructions because of the differences between the two ISAs
      • 200+ new instructions were added to reduce the instruction count of translated code
    • 11. Virtual machine
      • It’s just QEMU on Linux
      • QEMU was modified to improve performance, using the new instructions
      • [Figure 1. GS464 microarchitecture. GS464 adopts a nine-stage dynamic pipeline.]
      • [Figure 2. The GS464 virtual machine’s software architecture. The x86 operating systems and applications are built on the MIPS Linux system through a virtual machine monitor. Layers: Microsoft Windows on a system-level x86 virtual machine, Linux applications on x86 on a process-level x86 virtual machine, and Linux applications on MIPS, all on top of Linux on MIPS running on the enhanced MIPS core.]
    • 12. x86 EFlag support
      • Most x86 fixed-point arithmetic instructions generate EFlag bits
      • The direction of a conditional branch is determined by the EFlag bits
      • MIPS doesn’t have a flag register! Translated code therefore has to check each result and set or clear bits in a virtual EFlag register at run time
      • That’s very costly
    • 13. x86 EFlag support: Solution
      • Add new instructions to handle EFlag
        • Generate EFlag
        • Branch on EFlag
    • 14. Number of instructions
      No.    Instruction                       Comment
      0      SUB ECX EDX
      1      JE X86_target
      (a)
      0.00   SUBU Result Recx Redx
      0.01   SRL Rsf Result 31                 /*SF=Result[31]*/
      0.02   BEQ Result R0 L1
      0.03   ADD Rzf R0 R0                     /*ZF=0*/
      0.04   B L2
      0.05   NOP
      0.06   L1: ADDI Rzf R0 1                 /*ZF=1*/
      . . .
      0.35   B L8
      0.36   NOP
      0.37   L7: ADDI Rcf R0 1                 /*CF=1*/
      0.38   L8: ADD Recx Result R0
      1.00   BNE Rzf R0 MIPS_target
      1.01   NOP
      (b)
      0.0    SUBU Result Recx Redx             /*Generating Sub result*/
      0.1    SETFLAG
      0.2    SUBU Reflag Recx Redx             /*Generating EFLAGS*/
      1.0    X86JE Reflag MIPS_target          /*Branch on EFLAGS*/
      (c)
      0.0    SUB Result Recx Redx              /*Generating Sub result*/
      0.1    X86SUB Reflag Recx Redx           /*Generating EFLAGS*/
      1.0    X86JE Reflag MIPS_target          /*Branch on EFLAGS*/
      (d)
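To make the cost of listing (b) concrete, here is a minimal Python sketch of the flag bookkeeping a software translator has to perform after a 32-bit x86 SUB when the host has no flag register. The names `sub32_flags` and `je_taken` and the dictionary layout are illustrative assumptions, not taken from QEMU or the Godson papers, and only ZF, SF, CF and OF are modelled.

```python
MASK32 = 0xFFFFFFFF


def sub32_flags(a, b):
    """Emulate a 32-bit x86 SUB (a - b) and return (result, flags).

    Hypothetical helper: each flag below costs the translated MIPS code
    a compare/branch/set sequence when done purely in software.
    """
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = {
        "ZF": int(result == 0),        # zero flag
        "SF": (result >> 31) & 1,      # sign flag = bit 31 of the result
        "CF": int(a < b),              # carry flag = unsigned borrow
        # overflow: operands differ in sign and the result's sign differs from a
        "OF": (((a ^ b) & (a ^ result)) >> 31) & 1,
    }
    return result, flags


def je_taken(flags):
    """x86 JE branches when ZF is set."""
    return flags["ZF"] == 1
```

For example, `sub32_flags(5, 5)` yields a zero result with ZF set, so `je_taken` returns True. The hardware approach in listings (c) and (d) replaces all of this bookkeeping with a flag-generating instruction and a branch on the flag register.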
    • 15. x87 support
      • Register stack:
        • Maintaining the TOP pointer is costly
        • Calculating the absolute register number from the relative register number is costly
        • Emulating the x87 tag to detect stack overflow/underflow is costly
      • 80-bit floating point: MIPS only has 64-bit floating point!
    • 16. x87 support: Solution
      • The TOP value is calculated in the decode stage, using register renaming; a new flag in the FP control register points to TOP => saves 10+ instructions for each x87 instruction
      • New instruction to simulate the x87 tag, and a new exception to detect stack overflow/underflow
      • New instructions for 80-bit floating point:
        • 80-bit FP number in two 64-bit registers => 64-bit FP number in one 64-bit register
        • 64-bit FP number in one 64-bit register => 80-bit FP number in two 64-bit registers
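The register-stack bookkeeping described on the two x87 slides can be sketched in Python as follows. This is a simplified model under stated assumptions: the class name `X87Stack` is hypothetical, the tag is reduced to a valid/empty bit per register, and Python exceptions stand in for the stack overflow/underflow exception.

```python
class X87Stack:
    """Simplified software model of the x87 register stack (assumed layout)."""

    def __init__(self):
        self.regs = [0.0] * 8
        self.valid = [False] * 8  # reduced tag: True means the slot holds a value
        self.top = 0              # TOP pointer, always in 0..7

    def phys(self, i):
        """Map the relative register ST(i) to an absolute register number."""
        return (self.top + i) & 7

    def push(self, value):
        """FLD-style push: decrement TOP, fail if the new slot is occupied."""
        new_top = (self.top - 1) & 7
        if self.valid[new_top]:
            raise OverflowError("x87 stack overflow")
        self.top = new_top
        self.regs[new_top] = value
        self.valid[new_top] = True

    def pop(self):
        """FSTP-style pop: read ST(0), mark it empty, increment TOP."""
        if not self.valid[self.top]:
            raise IndexError("x87 stack underflow")
        value = self.regs[self.top]
        self.valid[self.top] = False
        self.top = (self.top + 1) & 7
        return value

    def st(self, i):
        """Read ST(i) with the tag check a software emulator has to perform."""
        p = self.phys(i)
        if not self.valid[p]:
            raise IndexError("x87 stack underflow")
        return self.regs[p]
```

Every x87 instruction in a software-only translator pays for the `phys` mapping and the tag checks; the slides describe moving the TOP arithmetic into the decode stage and the tag check into a dedicated instruction and exception.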
    • 17. Number of instructions
      No.    Instruction                       Comment
      0      FLD *%R10
      1      FMUL *16(%R10)
      2      FSTP *%R10
      (a)
      0.00   LD Rtmp1 12(R8)                   /*convert 1st operand*/
      0.01   LD Rtmp2 4(R8)
      0.02   ANDI Rsign Rtmp1                  /*get sign bit and sign bit of exp*/
      0.03   DSLL32 Rsign Rsign 16             /*get biased exponent*/
      . . .
      0.23   DMTC1 F8 Rfp2
      1.00   MUL.d F9 F7 F8                    /*64-bit multiply*/
      2.00   DMFC1 Rres F9
      2.01   DSRL32 Rsign Rres 31              /*get sign bit*/
      . . .
      2.12   SD Rres1 12(R8)                   /*write back result*/
      2.13   SD Rres2 4(R8)
      (b)
      0.0    GSLQC1 F4 4(R8)                   /*128-bit load to F4 and F5*/
      0.1    CVT.d.ld F7 F4 F5                 /*80-bit to 64-bit convert*/
      0.2    GSLQC1 F2 20(R8)                  /*128-bit load to F2 and F3*/
      0.3    CVT.d.ld F8 F2 F3                 /*80-bit to 64-bit convert*/
      1.0    MUL.d F9 F7 F8                    /*64-bit multiplication*/
      2.0    CVT.ud.d F7 F9                    /*64-bit to high part of 80-bit*/
      2.1    CVT.ld.d F8 F9                    /*64-bit to low part of 80-bit*/
      2.2    GSSQC1 F7 4(R8)                   /*128-bit store*/
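The conversions that listing (b) performs with a long run of integer instructions, and that the hardware-supported listing performs with single convert instructions, amount to repacking the x87 80-bit extended format as an IEEE 754 double and back. Below is a rough Python sketch. The function names are hypothetical, the 80-bit value is held in one Python integer rather than two 64-bit registers, and rounding follows Python's int-to-float conversion, which is an assumption about behaviour rather than a description of the Godson instructions.

```python
import math

EXT_BIAS = 16383  # exponent bias of the x87 80-bit extended format


def x87_to_double(raw):
    """Decode an 80-bit extended value (as an int) into a Python float."""
    sign = (raw >> 79) & 1
    exponent = (raw >> 64) & 0x7FFF
    mantissa = raw & 0xFFFFFFFFFFFFFFFF  # includes the explicit integer bit
    if exponent == 0x7FFF:
        fraction = mantissa & 0x7FFFFFFFFFFFFFFF
        value = math.inf if fraction == 0 else math.nan
    else:
        try:
            # value = mantissa * 2^(exponent - bias - 63)
            value = math.ldexp(float(mantissa), exponent - EXT_BIAS - 63)
        except OverflowError:
            value = math.inf  # magnitude too large for a 64-bit double
    return -value if sign else value


def double_to_x87(x):
    """Encode a finite Python float as an 80-bit extended value (as an int)."""
    if not math.isfinite(x):
        raise ValueError("only finite values are handled in this sketch")
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    if x == 0:
        return sign << 79
    m, e = math.frexp(abs(x))          # abs(x) = m * 2^e with 0.5 <= m < 1
    mantissa = int(math.ldexp(m, 64))  # integer bit lands in bit 63
    exponent = e - 1 + EXT_BIAS
    return (sign << 79) | (exponent << 64) | mantissa
```

Widening a double to the extended format is exact, so `x87_to_double(double_to_x87(x)) == x` for every finite double; narrowing in the other direction can lose precision, which is one reason a 64-bit host cannot reproduce x87 results bit for bit.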
    • 18. Multimedia instructions
      • x86 has MMX, SSE, SSE2...
      • MIPS has an extension instruction set called MDMX, but it is very different from the x86 multimedia instructions
      • An original SIMD instruction set, similar to SSE2, was added
    • 19. New addressing mode
      • MIPS only supports “(base) + disp” for fixed-point and floating-point accesses, and “(base) + (index)” for floating-point accesses
      • x86 has more flexible addressing modes, e.g. “(base) + (index) x scale + disp”
      • A “(base) + (index) + disp8” addressing mode was added to help translate them
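As an illustration of how the added “(base) + (index) + disp8” mode helps, the following Python sketch computes an x86 effective address and then shows one possible way a translator could lower it onto that mode. The lowering strategy, the function names, and the extra-operation count are assumptions made for this example; the slides do not describe the translator's actual code generation.

```python
MASK64 = (1 << 64) - 1


def x86_effective_address(base, index, scale, disp):
    """Compute (base) + (index) * scale + disp with 64-bit wraparound."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK64


def lower_to_base_index_disp8(base, index, scale, disp):
    """Hypothetical lowering onto a (base) + (index) + disp8 host access.

    Returns (base, index, disp8, extra_ops), where extra_ops counts the
    host instructions needed before the memory access itself.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4 or 8")
    extra_ops = 0
    if scale != 1:
        index = (index * scale) & MASK64  # one shift on the host
        extra_ops += 1
    if not -128 <= disp <= 127:
        base = (base + disp) & MASK64     # displacement too wide: fold into base
        disp = 0
        extra_ops += 1
    return base, index, disp, extra_ops
```

With only “(base) + disp” available, the same access would always need separate instructions to combine base and index before the load or store; the combined mode removes that step whenever the displacement fits in eight bits.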
    • 20. Bounded load and store
      • x86 has segmented addressing
      • Bounded load/store instructions were added to handle this; they read a bound register as the memory-access boundary
      • An address exception is raised if the memory access exceeds the boundary
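The behaviour of a bounded load can be modelled in a few lines of Python. This is a sketch under assumptions: the names `bounded_load` and `AddressError` are invented for the example, the bound is treated as an exclusive upper limit on the last byte accessed, and the exact comparison used by the real instructions is not stated in the slides.

```python
class AddressError(Exception):
    """Stands in for the address exception raised by the hardware."""


def bounded_load(memory, address, size, bound):
    """Load `size` bytes (little-endian) from `memory` at `address`.

    `bound` plays the role of the bound register: the access must lie
    entirely below it, otherwise an address exception is raised.
    """
    if address < 0 or address + size > bound:
        raise AddressError(f"access at {address:#x} exceeds bound {bound:#x}")
    return int.from_bytes(memory[address:address + size], "little")
```

In a software-only translator the comparison against the segment limit would be extra instructions before every memory access; folding it into the load or store makes the check free on the common path.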
    • 21. Fixed-point multiplication and division
      • MIPS fixed-point multiplication/division instructions use the special Hi/Lo registers as the destination; additional operations are needed to move data from Hi/Lo to general-purpose registers
      • Fixed-point multiplication/division instructions that use a general-purpose register as the destination were added
    • 22. Byte insertion and extraction
      • x86 supports 8, 16, 32 and 64-bit operations
      • MIPS only supports 32 and 64-bit operations
      • Flexible byte insertion instructions were added that can insert 8, 16 or 32 bits from any location of one register into any location of another register
      • Flexible byte extraction instructions were added as well
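What such insertion and extraction instructions compute can be written down directly. The Python sketch below uses hypothetical function names and bit positions counted from the least significant bit; it models the operation on 64-bit register values, not the encoding of the actual Godson instructions.

```python
MASK64 = (1 << 64) - 1


def extract_bits(src, pos, width):
    """Return `width` bits of `src` starting at bit `pos`."""
    return (src >> pos) & ((1 << width) - 1)


def insert_bits(dst, dst_pos, src, src_pos, width):
    """Copy `width` bits from `src` at `src_pos` into `dst` at `dst_pos`."""
    mask = (1 << width) - 1
    field = (src >> src_pos) & mask
    cleared = dst & ~(mask << dst_pos)          # clear the destination field
    return (cleared | (field << dst_pos)) & MASK64
```

A typical use is an x86 write to a sub-register such as AH, which occupies bits 8 to 15 of EAX: `insert_bits(0x11223344, 8, 0xAB, 0, 8)` replaces only that byte and leaves the rest of the register unchanged. Without a dedicated instruction the translator needs a mask, shift and OR sequence for each such write.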
    • 23. CAM
      • Translating indirect branches is costly, because the translator must look up the branch target dynamically
      • This requires an <x86 branch target:MIPS branch target> hash table to keep the mapping information
      • A 64-entry CAM (content-addressable memory) was added to speed up the lookup
      • CAM entry format: PID, Address, Data
    • 24. Number of instructions
      No.    Instruction                       Comment
      0      MOV %RAX %R11
      1      JMPQ %*R11
      (a)
      0      MOVE Rr11 Rrax
      1.0    CAMPV Rtmp Rr11                   /* Look up the first level indirect jump address */
      1.1    CAMPV Rtgt Rtmp                   /* Look up the final jump address */
      1.2    JR Rtgt
      (b)
      Figure 5. Example of indirect branch target translation: The original x86 program (a), and the program translated with Godson-3 content-associated memory (CAM) instructions (b). The boldface text indicates new instructions for x86 emulation.
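The role of the CAM in indirect-branch translation can be modelled as a small, fixed-size lookup table in front of the translator's hash table. The Python sketch below is an assumption-laden illustration: the class and function names are invented, the replacement policy (evict the oldest entry) is a guess since the slides do not state one, and a plain dictionary stands in for the software hash table.

```python
from collections import OrderedDict


class BranchTargetCam:
    """Toy model of a CAM whose entries hold (PID, Address) -> Data."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = OrderedDict()  # (pid, x86_addr) -> mips_addr

    def insert(self, pid, x86_addr, mips_addr):
        key = (pid, x86_addr)
        if key not in self.table and len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict the oldest entry (assumed policy)
        self.table[key] = mips_addr

    def lookup(self, pid, x86_addr):
        """Return the translated target, or None on a miss."""
        return self.table.get((pid, x86_addr))


def resolve_indirect(cam, slow_table, pid, x86_addr):
    """Resolve an indirect branch: try the CAM first, then the slow table."""
    target = cam.lookup(pid, x86_addr)
    if target is None:
        target = slow_table[x86_addr]       # software hash-table lookup
        cam.insert(pid, x86_addr, target)   # refill the CAM for next time
    return target
```

Including the PID in the key lets entries from different emulated processes coexist without flushing, which matches the entry format listed on the CAM slide.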
    • 25. Context Switch Optimization
      • The binary translator writes translated code through the data cache, so executing it requires flushing it from the data cache and loading it into the instruction cache
      • Coherence between the data and instruction caches, as well as L2, is kept by hardware
      • The binary translator switches context between the translator and the translated code, which requires saving and restoring the target machine’s registers, which are simulated in general-purpose registers
      • To reduce this cost, 128-bit load and store instructions were added
      • These save/restore up to four x86 registers at a time
    • 26. Benchmark results [Figure: bar chart of performance (percent, 0 to 100) per benchmark and on average, comparing “No hardware support” with “Hardware support”]
    • 27. Godson SPEC ratio versus Pentium SPEC ratio
      Benchmark      Godson      Godson      Godson      PIII        PIV
                     2E-750MHz   2F-800MHz   3A-800MHz   800MHz      1.4GHz
      164.gzip       209         251         324         344         397
      175.vpr        237         239         391         261         246
      176.gcc        282         329         369         241         350
      181.mcf        271         232         421         229         255
      186.crafty     356         362         415         352         386
      197.parser     202         152         225         231         331
      252.eon        289         441         526         90.7        125
      253.perlbmk    235         321         330         397         547
      254.gap        238         243         229         260         441
      255.vortex     236         274         297         383         478
      256.bzip2      247         241         268         249         314
      300.twolf      313         331         486         269         287
      SPECint2000    256         275         345         260         326
      168.wupwise    307         308         325         248         474
      171.swim       247         273         336         218         244
      172.mgrid      156         155         184         99.2        320
      173.applu      188         268         200         154         333
      177.mesa       373         438         400         265         265
      178.galgel     -           345         583         -           -
      179.art        349         693         1254        115         109
      183.equake     250         303         278         190         493
      187.facerec    -           111         177         -           -
      188.ammp       277         283         364         174         200
      189.lucas      -           284         251         -           -
      191.fma3d      -           108         128         -           -
      200.sixtrack   131         217         184         137         224
      301.apsi       172         197         225         190         199
      SPECfp2000     232         254         289         171         263
    • 28. Conclusion
      • GS464 added 200+ instructions and a number of optimizations for x86 emulation
      • As a result, binary translation runs 2x ~ 3x faster than the original QEMU
      • That’s nearly 70% of the performance of MIPS native binaries
      • CPU performance itself is poor, though
      • The paper doesn’t give us enough information to know the actual performance of the emulation on a real chip...
      • Anyway, Loongson-3 looks like a good and interesting attempt!
    • 29. Papers & Slides
      • “Godson-3: A Scalable Multicore RISC Processor with x86 Emulation”
      • “Micro-architecture of Godson-3 Multi-Core Processor”
      • “Efficient Binary Translation System with Low Hardware Cost”
      • “Godson-3 Multicore RISC Processor”
