November 2004 J. E. Smith Virtual Machines: An Architecture Perspective
    1. 1. November 2004 J. E. Smith Virtual Machines: An Architecture Perspective
    2. 2. Introduction <ul><li>Why are virtual machines interesting? </li></ul><ul><li>They involve computer architecture in a pure sense </li></ul><ul><li>They allow transcending of interfaces </li></ul><ul><li>(which often seem to be an obstacle to innovation) </li></ul><ul><li>They enable innovation in flexible, adaptive hardware, security, fault-tolerance, support for network computing (and others) </li></ul>
    3. 3. Performance Isn’t Everything <ul><li>The BIG ideas are all at least 20 years old </li></ul><ul><ul><li>and they have been very thoroughly explored </li></ul></ul><ul><li>Focus research on other important areas </li></ul><ul><ul><li>Power efficiency </li></ul></ul><ul><ul><li>Performance efficiency </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Ease of design </li></ul></ul><ul><ul><li>Software compatibility / interoperability </li></ul></ul><ul><li>Virtual Machines can be important enablers for all the above </li></ul>
    4. 4. Outline <ul><li>Virtualization </li></ul><ul><li>The Family of Virtual Machines </li></ul><ul><li>Process VMs and Code Caching </li></ul><ul><li>High Level Language VMs </li></ul><ul><li>Co-Designed VMs </li></ul><ul><li>Research in Co-Designed VMs </li></ul>
    5. 5. Abstraction <ul><li>Computer systems are built on levels of abstraction </li></ul><ul><li>Instruction Set Architecture </li></ul><ul><ul><li>Major division between hardware and software </li></ul></ul>[Figure: layers of a computer system. Hardware: execution hardware, memory translation, system interconnect (bus), controllers, I/O devices and networking, main memory. Software: drivers, memory manager, scheduler (operating system), libraries, application programs. The interfaces between components are numbered 1 to 14.] <ul><li>Application Binary Interface </li></ul><ul><ul><li>Observed by user processes </li></ul></ul><ul><ul><li>User ISA + OS calls </li></ul></ul><ul><li>Higher levels of abstraction hide details at lower levels </li></ul><ul><li>Example: files are an abstraction of a disk </li></ul>[Figure: files as an abstraction of a disk]
    6. 6. Virtualization <ul><li>An isomorphism from guest to host </li></ul><ul><ul><li>Map guest state to host state </li></ul></ul><ul><ul><li>Implement “equivalent” functions </li></ul></ul>[Figure: guest states S_i and S_j map to host states S_i' = V(S_i) and S_j' = V(S_j); the guest operation e(S_i) = S_j corresponds to the host operation e'(S_i') = S_j']
    7. 7. Virtualization <ul><li>Similar to abstraction </li></ul><ul><ul><li>Except </li></ul></ul><ul><ul><li>Details not necessarily hidden </li></ul></ul><ul><li>Construct Virtual Disks </li></ul><ul><ul><li>As files on a larger disk </li></ul></ul><ul><ul><li>Map state </li></ul></ul><ul><ul><li>Implement functions </li></ul></ul><ul><li>Now do the same thing with the whole “machine” </li></ul>[Figure: files on a disk used as virtual disks]
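The virtual disk bullets ("map state", "implement functions") can be illustrated with a small sketch. It is not taken from the talk: the class name `VirtualDisk`, the 512-byte sector size, and the track/sector geometry are assumptions chosen for illustration. Guest disk state (track, sector) is mapped to an offset in a host file, and the guest read/write functions are implemented with host file operations.

```python
import io

SECTOR_SIZE = 512  # assumed sector size for this sketch


class VirtualDisk:
    """A guest disk whose state lives in a region of a larger host file."""

    def __init__(self, host_file, base_offset, tracks, sectors_per_track):
        self.host_file = host_file
        self.base_offset = base_offset
        self.tracks = tracks
        self.sectors_per_track = sectors_per_track

    def _host_offset(self, track, sector):
        # State mapping: guest (track, sector) -> byte offset in the host file.
        if not (0 <= track < self.tracks and 0 <= sector < self.sectors_per_track):
            raise ValueError("sector address outside the virtual disk")
        index = track * self.sectors_per_track + sector
        return self.base_offset + index * SECTOR_SIZE

    def write_sector(self, track, sector, data):
        # Guest "write sector" implemented as a host seek + write.
        if len(data) != SECTOR_SIZE:
            raise ValueError("a sector write must supply exactly one sector")
        self.host_file.seek(self._host_offset(track, sector))
        self.host_file.write(data)

    def read_sector(self, track, sector):
        # Guest "read sector" implemented as a host seek + read.
        self.host_file.seek(self._host_offset(track, sector))
        return self.host_file.read(SECTOR_SIZE)


# Two virtual disks (4 tracks x 8 sectors each) carved out of one host "file".
disk_bytes = 4 * 8 * SECTOR_SIZE
host = io.BytesIO(bytes(2 * disk_bytes))
disk_a = VirtualDisk(host, 0, tracks=4, sectors_per_track=8)
disk_b = VirtualDisk(host, disk_bytes, tracks=4, sectors_per_track=8)
disk_a.write_sector(1, 2, b"A" * SECTOR_SIZE)
disk_b.write_sector(1, 2, b"B" * SECTOR_SIZE)
```

Both guests use the same (track, sector) address, yet each sees only its own data, because the mapping sends them to disjoint regions of the host file.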
    8. 8. The Family of Virtual Machines <ul><li>Lots of things are called “virtual machines” </li></ul><ul><ul><li>IBM VM/370 </li></ul></ul><ul><ul><li>Java </li></ul></ul><ul><ul><li>VMware </li></ul></ul><ul><li>Some things not called “virtual machines” are virtual machines </li></ul><ul><ul><li>IA-32 EL </li></ul></ul><ul><ul><li>Dynamo </li></ul></ul><ul><ul><li>Transmeta Crusoe </li></ul></ul>
    9. 9. System Virtual Machines <ul><li>Provide a system environment </li></ul><ul><li>Constructed at ISA level </li></ul><ul><li>Persistent </li></ul><ul><li>Examples: IBM VM/370, VMware, Transmeta Crusoe </li></ul>[Figure: guest processes running on guest OSes (Guest OS, Guest OS2), each on top of a VMM, all on one host platform, with virtual network communication between the guest systems]
    10. 10. System Virtual Machines <ul><li>Native VM System </li></ul><ul><ul><li>VMM privileged mode </li></ul></ul><ul><ul><li>Guest OS user mode </li></ul></ul><ul><ul><li>Example: classic IBM VMs </li></ul></ul><ul><li>User-mode Hosted VM </li></ul><ul><ul><li>VMM runs as user application </li></ul></ul><ul><li>Dual-mode Hosted VM </li></ul><ul><ul><li>Parts of VMM privileged, parts non-privileged </li></ul></ul><ul><ul><li>Example: VMware </li></ul></ul>[Figure: the three organizations side by side, each showing the virtual machine, the VMM, the hardware and (for the hosted cases) the host OS, with privileged and non-privileged modes marked]
    11. 11. Process Virtual Machines <ul><li>Constructed at ABI level </li></ul><ul><li>Runtime manages guest process </li></ul><ul><li>Guest processes may intermingle with host processes </li></ul><ul><li>Not persistent </li></ul><ul><li>As a practical matter, guest and host OSes are often the same </li></ul><ul><li>Dynamic optimizers are a special case </li></ul><ul><li>Examples: IA-32 EL, FX!32, Dynamo </li></ul>[Figure: guest processes, each with its own runtime, running alongside host processes on the host OS; processes interact through disk file sharing, network communication and process creation]
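    12. 12. The Virtual Machine Space <ul><li>Process VMs </li></ul><ul><ul><li>Same ISA: Multiprogrammed Systems, Dynamic Binary Optimizers </li></ul></ul><ul><ul><li>Different ISA: Dynamic Translators, HLL VMs </li></ul></ul><ul><li>System VMs </li></ul><ul><ul><li>Same ISA: Classic OS VMs, Hosted VMs </li></ul></ul><ul><ul><li>Different ISA: Whole System VMs, Co-Designed VMs </li></ul></ul>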
    13. 13. Architecture Issues: System VMs <ul><li>Why System VMs are of interest today </li></ul><ul><ul><li>Security & Fault Tolerance (isolation) </li></ul></ul><ul><ul><li>Platform Consolidation </li></ul></ul><ul><ul><li>Application/Environment portability </li></ul></ul><ul><li>“Efficiently Virtualizable” Instruction Sets </li></ul><ul><ul><li>Popek and Goldberg (1974) should still be required reading </li></ul></ul><ul><ul><li>(An architecture paper with theorems and proofs!) </li></ul></ul><ul><li>Virtual Machine Assists </li></ul><ul><ul><li>Compensate for inefficiencies due to privilege level “compression” </li></ul></ul><ul><ul><li>Fast emulation of system functions </li></ul></ul><ul><ul><li>Many developed for IBM mainframe VMs </li></ul></ul>
    14. 14. System Virtualization <ul><li>Traps and interrupts (& sys calls) </li></ul><ul><ul><li>Transfer to VMM </li></ul></ul><ul><ul><li>VMM determines appropriate Guest OS </li></ul></ul><ul><ul><li>VMM transfers to Guest OS </li></ul></ul><ul><li>Guest performs privileged operation </li></ul><ul><ul><li>Trap to VMM </li></ul></ul><ul><ul><li>VMM reads/modifies guest state </li></ul></ul><ul><ul><li>May modify shadow state </li></ul></ul><ul><ul><li>Returns to Guest </li></ul></ul><ul><li>Guest OS “return” to user app. </li></ul><ul><ul><li>Transfer to VMM </li></ul></ul><ul><ul><li>VMM bounces return back to Guest app. </li></ul></ul>[Figure: control flow among Application, Guest OS and VMM. A system call or trap goes through the vector location to the VMM and on to the Guest OS's virtual vector location; a privileged operation in the Guest OS traps to the VMM, which checks privileges, performs the operation and returns to the next instruction]
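The trap-and-emulate flow on this slide can be sketched as a toy model. Everything here is hypothetical and chosen for illustration: the operation names (`cli`, `sti`, `return_to_user`), the `GuestState` fields, and the log strings. The guest always runs deprivileged, so a privileged operation traps; the VMM then either emulates it against the guest's virtual state or, if the guest was in its virtual user mode, hands the trap to the guest OS.

```python
PRIVILEGED_OPS = {"cli", "sti", "return_to_user"}  # hypothetical operation names


class PrivilegedOpTrap(Exception):
    def __init__(self, op):
        super().__init__(op)
        self.op = op


class GuestState:
    """State the VMM keeps on behalf of one guest."""

    def __init__(self):
        self.virtual_mode = "privileged"  # the mode the guest believes it is in
        self.interrupts_enabled = True


def run_on_hardware(op):
    # The real hardware runs all guest code in user mode,
    # so every privileged operation traps.
    if op in PRIVILEGED_OPS:
        raise PrivilegedOpTrap(op)


def vmm_run(guest, program):
    log = []
    for op in program:
        try:
            run_on_hardware(op)
            log.append((op, "native"))
        except PrivilegedOpTrap as trap:
            if guest.virtual_mode != "privileged":
                # A guest application tried a privileged op: the guest OS must
                # handle it (the switch back to the guest OS is not modeled).
                log.append((trap.op, "reflected to guest OS"))
                continue
            # The guest OS issued the op: emulate it on the guest's state.
            if trap.op == "cli":
                guest.interrupts_enabled = False
            elif trap.op == "sti":
                guest.interrupts_enabled = True
            elif trap.op == "return_to_user":
                guest.virtual_mode = "user"
            log.append((trap.op, "emulated by VMM"))
    return log


guest = GuestState()
log = vmm_run(guest, ["add", "cli", "return_to_user", "cli", "add"])
```

Only the privileged operations pay the trap cost; ordinary instructions such as `add` run natively, which is why an ISA where every sensitive instruction traps can be virtualized efficiently.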
    15. 15. Popek and Goldberg (in brief) <ul><li>Control Sensitive instructions </li></ul><ul><ul><li>All instructions that change hardware resource allocation (or mapping) </li></ul></ul><ul><ul><li>Example: write TLB </li></ul></ul><ul><li>Behavior Sensitive instructions </li></ul><ul><ul><li>All instructions whose outcome depends on hardware resource allocation </li></ul></ul><ul><ul><li>Example: read processor mode </li></ul></ul><ul><li>Theorem (paraphrase) </li></ul><ul><ul><li>Efficiently virtualizable if all sensitive instructions trap in user mode </li></ul></ul>
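The paraphrased theorem can be read as a simple check over an instruction set: every sensitive instruction must trap when executed in user mode. The sketch below encodes that check for a made-up three-instruction ISA; the `Instr` record and the entries in `TOY_ISA` are assumptions for illustration (the names echo the slide's examples), not a description of any real processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    control_sensitive: bool   # changes hardware resource allocation or mapping
    behavior_sensitive: bool  # outcome depends on hardware resource allocation
    traps_in_user_mode: bool


def offending_instructions(isa):
    """Return the sensitive instructions that do NOT trap in user mode."""
    return [
        i.name
        for i in isa
        if (i.control_sensitive or i.behavior_sensitive)
        and not i.traps_in_user_mode
    ]


def efficiently_virtualizable(isa):
    return not offending_instructions(isa)


# Hypothetical ISA: "read_mode" is behavior sensitive but executes silently
# in user mode, so this ISA fails the condition.
TOY_ISA = [
    Instr("add", False, False, False),
    Instr("write_tlb", True, False, True),
    Instr("read_mode", False, True, False),
]

offenders = offending_instructions(TOY_ISA)
```

An instruction like the hypothetical `read_mode` is the troublesome kind: a deprivileged guest OS would read the real mode instead of its virtual one, and the VMM never gets a chance to intervene.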
    16. 16. System VM Research <ul><li>Architecture Challenge: </li></ul><ul><ul><li>Make IA-32 efficiently virtualizable </li></ul></ul><ul><li>Virtual Machine Assists </li></ul><ul><ul><li>Compensate for inefficiencies due to privilege level “compression” </li></ul></ul><ul><ul><li>Fast emulation of system functions </li></ul></ul><ul><ul><li>Many developed for IBM mainframe VMs </li></ul></ul><ul><li>Applications to Chip Multiprocessors </li></ul><ul><ul><li>Technology changes often require innovation and “re-invention” </li></ul></ul>
    17. 17. The Virtual Machine Space [Figure: the VM taxonomy of slide 12, repeated]
    18. 18. Architecture Issues: Process VMs <ul><li>Generally to allow application migration </li></ul><ul><ul><li>Or to run popular software on a less popular platform </li></ul></ul><ul><ul><li>Goal is generally to minimize performance loss </li></ul></ul><ul><li>Same-ISA dynamic optimizers are special case </li></ul><ul><ul><li>HP Dynamo </li></ul></ul><ul><li>Architecture problems </li></ul><ul><ul><li>Efficient code-caching </li></ul></ul><ul><ul><li>Indirect jump problem </li></ul></ul><ul><ul><li>Protecting runtime from guest process </li></ul></ul>
    19. 19. Staged Emulation with Code Caching <ul><li>An important part of many VM implementations </li></ul><ul><li>Translate, optimize & cache frequent code sequences </li></ul><ul><li>Start interpreting </li></ul><ul><li>Profile to find “hot” code regions </li></ul>[Figure: the runtime, containing the interpreter and the translator/optimizer, works from the binary memory image, collects profile data and fills the code cache]
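A minimal sketch of staged emulation follows, assuming a toy guest with only `add` and `mul` operations on one accumulator. The threshold value, the class name `StagedEmulator`, and the use of generated Python source as a stand-in for host machine code are all assumptions for illustration. Blocks are interpreted and profiled until they become hot, then translated once and served from the code cache afterwards.

```python
HOT_THRESHOLD = 3  # assumed: interpretations before a block counts as "hot"


class StagedEmulator:
    def __init__(self, blocks):
        self.blocks = blocks    # source PC -> list of (op, constant)
        self.profile = {}       # source PC -> interpretation count
        self.code_cache = {}    # source PC -> translated function
        self.stats = {"interpreted": 0, "cached": 0, "translations": 0}

    def _interpret(self, spc, state):
        # Slow path: decode and execute one operation at a time.
        for op, arg in self.blocks[spc]:
            if op == "add":
                state["acc"] += arg
            elif op == "mul":
                state["acc"] *= arg
            else:
                raise ValueError(f"unknown op {op!r}")

    def _translate(self, spc):
        # Stand-in for binary translation: emit straight-line host code.
        symbol = {"add": "+", "mul": "*"}
        lines = ["def translated(state):", "    acc = state['acc']"]
        for op, arg in self.blocks[spc]:
            lines.append(f"    acc {symbol[op]}= {arg!r}")
        lines.append("    state['acc'] = acc")
        namespace = {}
        exec("\n".join(lines), namespace)
        return namespace["translated"]

    def run_block(self, spc, state):
        if spc in self.code_cache:
            self.code_cache[spc](state)
            self.stats["cached"] += 1
            return
        self._interpret(spc, state)
        self.stats["interpreted"] += 1
        self.profile[spc] = self.profile.get(spc, 0) + 1
        if self.profile[spc] >= HOT_THRESHOLD:
            self.code_cache[spc] = self._translate(spc)
            self.stats["translations"] += 1


emu = StagedEmulator({0x4FD0: [("add", 2), ("mul", 3)]})
state = {"acc": 0}
for _ in range(5):
    emu.run_block(0x4FD0, state)
```

The staging keeps translation cost off code that runs only a few times, which is the usual argument for starting with interpretation.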
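    20. 20. Superblocks <ul><li>Based on “hot” paths </li></ul><ul><li>One entry, multiple exits </li></ul><ul><li>May contain redundant blocks (tail duplication) </li></ul>[Figure: a control flow graph with blocks A to G, shown before and after superblock formation; block G appears in duplicated copies afterwards]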
    21. 21. Binary Translation Example
<pre>
x86 Binary
4FD0: addl %edx,(%eax)    ;load and accumulate sum
      movl (%eax),%edx    ;store to memory
      sub  %ebx,1         ;decrement loop count
      jz   51C8           ;branch if at loop end
4FDC: add  %eax,4         ;increment %eax
      jmp  4FD0           ;jump to loop top
51C8: movl (%ecx),%edx    ;store last value of %edx
      xorl %edx,%edx      ;clear %edx
      jmp  6200           ;jump elsewhere

PowerPC Translation
9AC0: lwz   r16,0(r4)     ;load value from memory
      add   r7,r7,r16     ;accumulate sum
      stw   0(r5),r7      ;store to memory
      subi. r5,r5,1       ;decrement loop count, set cr0
      bez   cr0,pc+12     ;branch if loop exit
      bl    F000          ;branch & link to EM
      4FDC                ;save source PC in link register
9AE4: bl    F000          ;branch & link to EM
      51C8                ;save source PC in link register
9C08: stw   0(r6),r7      ;store last value of %edx
      subi  r7,r7,r7      ;clear %edx
      bl    F000          ;branch & link to EM
      6200                ;save source PC in link register
</pre>
    22. 22. Code Caches <ul><li>Contain </li></ul><ul><ul><li>Basic blocks </li></ul></ul><ul><ul><li>Superblocks (one entrance, multiple exits) </li></ul></ul><ul><ul><li>Optimized Superblocks </li></ul></ul><ul><li>A base technology for many VMs </li></ul><ul><ul><li>Dynamic binary translators: Intel IA-32 EL, Compaq FX!32 </li></ul></ul><ul><ul><li>Dynamic binary optimizers: Dynamo family </li></ul></ul><ul><ul><li>Co-designed virtual machines: Transmeta, IBM DAISY </li></ul></ul><ul><ul><li>High performance Java virtual machines </li></ul></ul><ul><ul><li>System VMs with “inefficiently virtualizable” ISAs </li></ul></ul><ul><ul><li>“Sandboxing” secure VMs (x86 DynamoRIO) </li></ul></ul>
    23. 23. Indirect Jumps <ul><li>Translated code cache PC (TPC) differs from Source binary PC (SPC) </li></ul><ul><ul><li>Need branch/jump target address translation </li></ul></ul><ul><ul><li>(Direct) branches are easier; target address is fixed </li></ul></ul><ul><ul><ul><li>Chaining can be used </li></ul></ul></ul>[Figure: without chaining, each superblock exits through the dispatch table lookup code to reach the next superblock; with chaining, superblocks branch directly to one another]
    24. 24. The Indirect Jump Problem <ul><li>Target addresses (SPCs) can change </li></ul><ul><ul><li>SPC needs to be translated at run-time, not translation time </li></ul></ul><ul><li>Conventional solution: superblock construction-time software prediction (aka inline caching) </li></ul><ul><ul><ul><li>If Rx == #addr_1 goto #target_1 </li></ul></ul></ul><ul><ul><ul><li>Else if Rx == #addr_2 goto #target_2 </li></ul></ul></ul><ul><ul><ul><li>Else dispatch_table_lookup(Rx); do it the slow way </li></ul></ul></ul><ul><li>The biggest overhead in code caches </li></ul><ul><ul><li>Compare-and-branch: 6 instructions </li></ul></ul><ul><ul><li>Hash table lookup: 15 instructions in Dynamo x86 </li></ul></ul>
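The compare-and-branch chain above can be written out as a small sketch. The class name `IndirectJumpSite` and the counters are assumptions for illustration; the first two SPC/TPC pairs reuse addresses from the binary translation example on slide 21, while the third (`0x6200` mapping to `0xA000`) is made up. Each translated indirect jump carries a few predictions fixed at superblock construction time and falls back to the full dispatch table otherwise.

```python
class IndirectJumpSite:
    """One translated indirect jump with inline (software) target prediction."""

    def __init__(self, dispatch_table, predictions):
        self.dispatch_table = dispatch_table  # full SPC -> TPC map
        self.predictions = predictions        # [(spc, tpc)] fixed at translation time
        self.fast_hits = 0
        self.slow_lookups = 0

    def resolve(self, spc):
        # "If Rx == #addr_1 goto #target_1, else if Rx == #addr_2 ..."
        for predicted_spc, tpc in self.predictions:
            if spc == predicted_spc:
                self.fast_hits += 1
                return tpc
        # "Else dispatch_table_lookup(Rx)": the slow way.
        self.slow_lookups += 1
        return self.dispatch_table[spc]


dispatch_table = {0x4FD0: 0x9AC0, 0x51C8: 0x9C08, 0x6200: 0xA000}
site = IndirectJumpSite(
    dispatch_table,
    predictions=[(0x4FD0, 0x9AC0), (0x51C8, 0x9C08)],
)
targets = [site.resolve(spc) for spc in (0x4FD0, 0x51C8, 0x6200, 0x4FD0)]
```

The predictions only help when the jump keeps returning to the targets seen at construction time; every other target pays for the full lookup, which is the overhead the slide quantifies.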
    25. 25. Protecting the Runtime <ul><li>The runtime shares process memory space with application </li></ul><ul><ul><li>Must protect runtime from application </li></ul></ul><ul><ul><li>Expensive memory protection changes on switches between runtime and code cache </li></ul></ul><ul><ul><li>If guest registers are mapped to host memory </li></ul></ul><ul><ul><ul><li>How are memory mapped registers protected? </li></ul></ul></ul>[Figure: memory protection settings (N, R, R/W, Ex) for Runtime Data, Runtime Code, Code Cache, Guest Data and Guest Code, shown once for runtime mode and once for emulation mode]
    26. 26. Process VM Research <ul><li>Same-ISA dynamic binary optimizers are probably not a winning proposition </li></ul><ul><ul><li>Indirect jumps lead to performance losses on modern processors </li></ul></ul><ul><ul><ul><li>(optimizers with patching are better) </li></ul></ul></ul><ul><ul><li>Complete (intrinsic) compatibility is extremely difficult </li></ul></ul><ul><ul><ul><li>May have to rely on extrinsic assurances </li></ul></ul></ul><ul><ul><ul><li>Topic of architecture research similar to Popek and Goldberg </li></ul></ul></ul><ul><li>For general process VMs some primitive support in ISA will be useful / necessary </li></ul><ul><ul><li>Indirect jumps (more later) </li></ul></ul><ul><ul><li>Code caching </li></ul></ul><ul><ul><li>Protection </li></ul></ul>
    27. 27. Computer Architecture Innovation HLL VMs – software people invent ISA to solve SW problems Co-Designed VMs – hardware people invent ISA to solve HW problems These two are the most interesting VMs from an architecture perspective and provide the biggest opportunities.
    28. 28. The Virtual Machine Space [Figure: the VM taxonomy of slide 12, repeated]
    29. 29. High Level Language Virtual Machines <ul><li>Raise the “ABI” level of abstraction </li></ul><ul><ul><li>Use higher level virtual ISA </li></ul></ul><ul><ul><li>OS abstracted as standard libraries </li></ul></ul><ul><li>A form of process VM </li></ul>[Figure: two toolchains compared. Traditional: HLL program, compiler front-end, intermediate code, compiler back-end, object code (ISA), loader, memory image. HLL VM: HLL program, compiler, portable code (virtual ISA), VM loader, virtual memory image, VM interpreter/translator, host instructions.]
    30. 30. Architecture Issues: High Level VMs <ul><li>Examples: </li></ul><ul><ul><li>Sun Java </li></ul></ul><ul><ul><li>Microsoft .NET Framework and MSIL </li></ul></ul><ul><li>Why are HLL VMs important? </li></ul><ul><ul><li>Microsoft says so. </li></ul></ul><ul><ul><li>It’s a good idea. </li></ul></ul><ul><ul><ul><li>Combines object oriented programming and network computing </li></ul></ul></ul>
    31. 31. HLL VMs: Architecture Perspective <ul><li>Here, architects were deprived (or let themselves be deprived) of some interesting architecture work </li></ul><ul><li>Don’t look at it bottom-up, i.e. </li></ul><ul><ul><li>Take existing software for supporting HLL VMs, </li></ul></ul><ul><ul><li>Generate traces for standard ISAs, </li></ul></ul><ul><ul><li>Analyze traces </li></ul></ul><ul><ul><li>Conclude it’s “just like C”… problem solved! </li></ul></ul><ul><li>Look top-down – start with features of MSIL and look for computer architecture opportunities </li></ul><ul><ul><li>Will require a mix of hardware and software innovation </li></ul></ul><ul><ul><li>(else just continue to ignore real architecture in favor of implementation) </li></ul></ul>
    32. 32. HLL VM Research <ul><li>Metadata – an interesting concept </li></ul><ul><ul><li>Data Set Architecture </li></ul></ul><ul><ul><li>Don’t have to discover data structures </li></ul></ul><ul><ul><ul><li>– compare with C programs. </li></ul></ul></ul>[Figure: a machine-independent program file containing metadata and code is read by the loader of the virtual machine implementation, which builds internal data structures; the code is executed by an interpreter or turned into native code by a translator]
    33. 33. HLL VM Research <ul><li>Precise trap model </li></ul><ul><ul><li>Problems in conventional processors: </li></ul></ul><ul><ul><ul><li>All state precise </li></ul></ul></ul><ul><ul><ul><li>Many instructions can trap </li></ul></ul></ul><ul><ul><ul><li>Enable/disable “remote” and at any time </li></ul></ul></ul><ul><ul><li>HLL VMs </li></ul></ul><ul><ul><ul><li>Not all state must be precise </li></ul></ul></ul><ul><ul><ul><li>PC not needed </li></ul></ul></ul><ul><ul><ul><li>operand stack never </li></ul></ul></ul><ul><ul><ul><li>local variables only if trap is handled locally </li></ul></ul></ul><ul><ul><ul><li>Trap enable explicit and locally specified </li></ul></ul></ul>
    34. 34. HLL VM Research <ul><li>Stack tracking </li></ul><ul><ul><li>At any given point, operand stack must have same number of elements and types regardless of control flow path </li></ul></ul><ul><ul><li>This property could simplify exploitation of control independence </li></ul></ul>
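The stack-tracking property can be checked with a small data-flow pass, in the spirit of a bytecode verifier. The sketch below is a simplification under stated assumptions: the six-opcode bytecode is made up, and only the number of stack elements is tracked, not their types. A program is rejected when two control flow paths reach the same instruction with different stack depths.

```python
# (pops, pushes) for each opcode of a hypothetical stack bytecode.
EFFECT = {
    "push": (0, 1),
    "pop": (1, 0),
    "add": (2, 1),
    "br_if": (1, 0),  # pops a condition; jumps to arg or falls through
    "jmp": (0, 0),
    "ret": (0, 0),
}


def check_stack_depths(code):
    """Return {pc: depth on entry}; raise ValueError if paths disagree."""
    depth_at = {0: 0}
    worklist = [0]
    while worklist:
        pc = worklist.pop()
        op, arg = code[pc]
        pops, pushes = EFFECT[op]
        if depth_at[pc] < pops:
            raise ValueError(f"stack underflow at {pc}")
        out_depth = depth_at[pc] - pops + pushes
        if op == "ret":
            successors = []
        elif op == "jmp":
            successors = [arg]
        elif op == "br_if":
            successors = [arg, pc + 1]
        else:
            successors = [pc + 1]
        for succ in successors:
            if succ >= len(code):
                raise ValueError(f"control falls off the end after {pc}")
            if succ not in depth_at:
                depth_at[succ] = out_depth
                worklist.append(succ)
            elif depth_at[succ] != out_depth:
                raise ValueError(f"inconsistent stack depth at {succ}")
    return depth_at


# Both arms of the branch push one value, so the join at pc 5 is consistent.
good = [("push", 1), ("br_if", 4), ("push", 10), ("jmp", 5), ("push", 20), ("ret", None)]
# Only the fall-through path pushes a value, so pc 3 sees depth 0 and depth 1.
bad = [("push", 1), ("br_if", 3), ("push", 10), ("ret", None)]

good_depths = check_stack_depths(good)
```

Because the depth at every join point is known statically, code after a join can be analyzed without knowing which path was taken, which is the link to control independence that the slide points to.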
    35. 35. HLL VMs Summary <ul><li>Claim: Slow-downs due to OO programming, probably not dynamic compilation </li></ul><ul><li>– and not stack-based ISA </li></ul><ul><li>Research opportunities abound </li></ul><ul><ul><li>For VM implementation </li></ul></ul><ul><ul><li>For speeding up OO programs (look beyond C/C++) </li></ul></ul><ul><ul><li>Use co-designed HW/SW </li></ul></ul><ul><ul><ul><li>Base design on MSIL/Java and implement conventional ISA as the uncommon case </li></ul></ul></ul>
    36. 36. The Virtual Machine Space [Figure: the VM taxonomy of slide 12, repeated]
    37. 37. Co-Designed Virtual Machines <ul><li>Separate the hardware/software interface from the ISA level of abstraction </li></ul><ul><li>Restore the ISA to its “natural” place </li></ul><ul><ul><li>as an Implementation ISA that reflects actual hardware </li></ul></ul><ul><li>Support existing ISAs </li></ul><ul><ul><li>as a Virtual ISA </li></ul></ul><ul><li>Let processor designers use both hardware and software </li></ul><ul><li>A form of system VM </li></ul>[Figure: a conventional stack (user applications, OS and libs. above the ISA, hardware below) next to the co-designed stack (user applications, OS and libs. above the V-ISA, software between the V-ISA and the I-ISA, hardware below the I-ISA)]
    38. 38. Co-Designed VMs <ul><li>Should be of interest to both architects and micro-architects </li></ul><ul><ul><li>Offers opportunities for performance, power saving, fault tolerance and other implementation-dependent features </li></ul></ul><ul><ul><li>Allows transcending conventional ISAs </li></ul></ul><ul><ul><li>Don’t confuse them with VLIW! </li></ul></ul>
    39. 39. Architecture Issues: Concealed Memory <ul><li>VM software resides in memory concealed from all conventional software </li></ul>[Figure: conventional memory holds the source ISA code and data; concealed memory holds the VM code, VM data and code cache; the processor core reaches memory through the ICache and DCache hierarchies]
    40. 40. Another Way of Doing Things [Figure: two processor organizations compared. Conventional: main memory, cache hierarchy, hardware translation unit (forms uops), processor pipeline, functional units. Dynamic translation: main memory, software translator, code cache, processor pipeline, functional units.]
    41. 41. Jump Target-address Lookup Table <ul><li>A hardware cache of dispatch table entries </li></ul><ul><li>Similar to software-managed TLB in virtual memory </li></ul>[Figure: the BTB, indexed by the jump instruction's TPC, predicts the next fetch TPC; the jump's register identifier selects the jump target SPC from the register file, and the JTLT maps that SPC to the jump target TPC. JTLT hit and target matches the prediction: BTB prediction correct. JTLT hit and no match: BTB misprediction, redirect fetch to the jump target TPC from the JTLT. JTLT miss: redirect fetch to the dispatch code.]
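The three outcomes of a JTLT access can be modeled in a few lines. This is a behavioral sketch only: the capacity, the LRU replacement policy, the refill-on-miss step and the outcome strings are assumptions, and the addresses reuse the SPC/TPC pairs from the binary translation example on slide 21.

```python
from collections import OrderedDict


class JTLT:
    """Small cache of SPC -> TPC dispatch table entries (LRU is assumed)."""

    def __init__(self, dispatch_table, capacity=4):
        self.dispatch_table = dispatch_table
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, spc):
        if spc in self.entries:
            self.entries.move_to_end(spc)
            return self.entries[spc], True
        # Miss: the software dispatch code consults the full table and
        # refills the JTLT, much like a software-managed TLB refill.
        tpc = self.dispatch_table[spc]
        self.entries[spc] = tpc
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return tpc, False


def resolve_indirect_jump(jtlt, btb_predicted_tpc, target_spc):
    tpc, hit = jtlt.lookup(target_spc)
    if not hit:
        return tpc, "JTLT miss: run dispatch code"
    if tpc == btb_predicted_tpc:
        return tpc, "BTB prediction correct"
    return tpc, "BTB misprediction: redirect to JTLT target"


jtlt = JTLT({0x4FD0: 0x9AC0, 0x51C8: 0x9C08})
first = resolve_indirect_jump(jtlt, btb_predicted_tpc=0x9AC0, target_spc=0x4FD0)
second = resolve_indirect_jump(jtlt, btb_predicted_tpc=0x9AC0, target_spc=0x4FD0)
third = resolve_indirect_jump(jtlt, btb_predicted_tpc=0x9C08, target_spc=0x4FD0)
```

Only the miss case runs software; the two hit cases are resolved by hardware, which is where the savings over the compare-and-branch chain would come from.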
    42. 42. Dual-address RAS <ul><li>Problem: function call instruction saves return SPC not TPC </li></ul><ul><ul><li>Conventional software-based chaining cannot utilize a RAS </li></ul></ul><ul><li>Solution: save both SPC and TPC </li></ul>[Figure: a push-dual-address-RAS instruction places an (SPC, TPC) pair on the dual-address RAS, shown alongside the JTLT's SPC-to-TPC entries]
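A behavioral sketch of the dual-address return address stack follows. The class and method names, the stack depth, the overflow policy and the addresses are all assumptions for illustration. Each call pushes the return SPC together with the matching return TPC; on a return, the prediction is used only if the SPC the guest code actually returns to matches the saved SPC.

```python
class DualAddressRAS:
    """Return address stack whose entries hold (return SPC, return TPC)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def push(self, return_spc, return_tpc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: drop the oldest entry
        self.stack.append((return_spc, return_tpc))

    def predict_return(self, actual_return_spc):
        """Pop one entry; return its TPC if the SPC matches, else None."""
        if not self.stack:
            return None
        spc, tpc = self.stack.pop()
        return tpc if spc == actual_return_spc else None


ras = DualAddressRAS()
ras.push(0x1004, 0x9B00)  # outer call (hypothetical addresses)
ras.push(0x2008, 0x9C40)  # nested call
inner = ras.predict_return(0x2008)     # matches the nested call
mismatch = ras.predict_return(0x7777)  # guest returned somewhere unexpected
empty = ras.predict_return(0x1004)     # stack is now empty
```

Keeping the SPC in the entry is what makes the prediction checkable: the guest's architected return address is still an SPC, so the hardware can compare it before trusting the TPC.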
    43. 43. IPC performance <ul><li>“Translate” Alpha to Alpha; start with highly optimized code </li></ul><ul><li>Conventional method (à la Dynamo) results in 14% IPC loss </li></ul><ul><li>Dual-address RAS provides the most benefit </li></ul><ul><li>Using both JTLT & RAS, 7.7% IPC improvement </li></ul><ul><ul><li>Due to superblock re-layout </li></ul></ul>[Figure: bar chart of IPC (axis 0.8 to 2.4) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf and the harmonic mean; series: original, sw_pred.sw_pred, sw_pred.sw_pred (private dispatch), sw_pred.ras, jtlt.ras]
    44. 44. Research: Efficient Microarchitectures <ul><li>Wide pipelines are at odds with fast pipelines </li></ul><ul><ul><li>Fast pipeline => low complexity per stage </li></ul></ul><ul><ul><li>More instructions per stage => high complexity per stage </li></ul></ul><ul><li>Process larger atomic units in pipeline stages </li></ul><ul><ul><li>Narrower “effective” width </li></ul></ul><ul><li>Reduce decoding stages </li></ul><ul><ul><li>Do more in software </li></ul></ul><ul><li>Pipeline the issue stage </li></ul>
    45. 45. Fused Instruction Set <ul><li>Co-designed VM x86 implementation </li></ul><ul><ul><li>Shorten and simplify pipeline front-end </li></ul></ul><ul><li>Combine pairs of dependent instructions </li></ul><ul><ul><li>Form a single “unit” for pipeline processing </li></ul></ul><ul><li>Use VM software to </li></ul><ul><ul><li>“Crack” x86 instructions into RISC-ops </li></ul></ul><ul><ul><li>Re-order RISC-ops </li></ul></ul><ul><ul><li>Reassemble into (new) fused pairs </li></ul></ul><ul><li>Related: Pentium-M fuses in front-end </li></ul><ul><ul><li>Using original x86 instructions </li></ul></ul>
    46. 46. Conventional Issue Logic <ul><li>Select and issue instructions free of data dependences </li></ul><ul><li>Based on the selection, clear dependences </li></ul><ul><ul><li>And “wake-up” newly independent instructions </li></ul></ul><ul><li>Single cycle select-wakeup important for good performance </li></ul>[Figure: an issue buffer holding entries such as “OP R1 Imm. R2” and “OP R6 R7 R1”, with the select logic and the fanout/wakeup path]
    47. 47. Pipelined Issue Logic <ul><li>Fuse dependent instructions into single slot </li></ul><ul><li>Fused instructions traverse entire pipeline </li></ul><ul><li>Make single issue decision for the pair </li></ul>
    48. 48. Instruction Set
<pre>
Example instructions:
  call 0x080af30e               (21bit disp)
  jcc  0x080115a0
  jmp  0x080C0988
  LIMM.lo Redx, LO(0x0810a7de)
  LIMM.hi Redx, HI(0x0810a7de)
  CMP.cc  Reax, 0x4000
  LD   Reax, mem[Resp + F8]
  ST   Reax, mem[Rebp + 4C]
  ADD  Reax, Rebx, 4c
  ADD  Reax, Redx, Rebx
  Fmac Facc, Fmp1, Fmp2
  LD   Reax, mem[Rebx + Rebp]

x86 instruction -> translated instruction:
  mov esp, ebp    ->  MOV Resp, Rebp
  mov eax,[esp]   ->  LD  Reax, mem[Resp]
  add eax, edx    ->  ADD Reax, Redx
  sub ecx, 4      ->  SUB Recx, 4
  shr esi, 2      ->  SHR Resi, 2
  inc ecx         ->  INC Recx, 1
  jcc 3e              e.g. jnz 3e

Instruction formats (field order within a format is not recoverable from the extraction):
  F | 10b opcode | 21-bit Immediate/Displacement
  F | 10b opcode | 5b Rds | 16-bit immediate / Disp
  F | 10b opcode | 5b Rds | 5b Rsr | 11b Immd/Disp
  F | 16-bit opcode | 5b Rds | 5b Rsr | 5b Rsr
  F | 7b op | 4b Rd | 4b Rs
  F | 7b op | 4b Rd | 4b I
  F | 7b op | 8b Immd/Disp
</pre>
    49. 49. Translation Algorithm <ul><li>Two Pass Algorithm: </li></ul><ul><li>1. Form superblocks using Dynamo MRET method </li></ul><ul><li>2. Crack x86 instructions into RISC-like micro-ops </li></ul><ul><li>3. Attempt to fuse ALU ops only </li></ul><ul><li>4. Fuse LD/ST instructions as tails and ALU ops as heads </li></ul>
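Steps 3 and 4 can be sketched as two greedy pairing passes over already-cracked micro-ops. This is a simplified illustration, not the algorithm from the talk: the `MicroOp` record, the look-back window of 4, and the "nearest producer" rule are assumptions, and the sketch only records which ops would pair up. It does not reorder code or check the hazards a real translator must respect before making a head and tail adjacent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicroOp:
    kind: str                      # "ALU", "LD", "ST" or "BR"
    dst: Optional[str]             # destination register, if any
    srcs: Tuple[str, ...]          # source registers
    partner: Optional[int] = None  # index of the op this one is fused with


def fuse_pass(ops, tail_kinds, window=4):
    """Pair each candidate tail with the nearest earlier producer of one of
    its sources, provided that producer is an unfused ALU op."""
    for t, tail in enumerate(ops):
        if tail.kind not in tail_kinds or tail.partner is not None:
            continue
        for h in range(t - 1, max(t - window, 0) - 1, -1):
            head = ops[h]
            if head.dst is not None and head.dst in tail.srcs:
                if head.kind == "ALU" and head.partner is None:
                    head.partner, tail.partner = t, h
                break  # only the nearest producer is considered


def fuse(ops):
    fuse_pass(ops, {"ALU"})        # pass 1: ALU heads with ALU tails
    fuse_pass(ops, {"LD", "ST"})   # pass 2: ALU heads with LD/ST tails
    return sorted(
        (i, op.partner)
        for i, op in enumerate(ops)
        if op.partner is not None and i < op.partner
    )


ops = [
    MicroOp("ALU", "eax", ("eax", "edx")),  # ADD Reax, Redx
    MicroOp("ALU", "ecx", ("ecx",)),        # SUB Recx, 4
    MicroOp("LD", "ebx", ("eax",)),         # LD  Rebx, mem[Reax]
    MicroOp("ALU", "esi", ("ecx",)),        # SHR Resi, Recx
]
pairs = fuse(ops)
```

Running the ALU-only pass first gives ALU-ALU pairs priority, and the second pass then lets loads and stores pick up the ALU heads that remain.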
    50. 50. Fusing Profile <ul><li>About 50% of operations are fused </li></ul><ul><li>Only 5-10% of non-fused are single-cycle ALU ops </li></ul>[Figure: stacked bars showing the percentage of dynamic instructions (0% to 100%) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf and the average; categories: ALU, FP or NOPs, BR, ST, LD, Fused]
    51. 51. Distance Between Fused Operations <ul><li>Most fused operations close together </li></ul><ul><ul><li>70% of fused ops from different x86 instructions </li></ul></ul><ul><ul><li>60% contain two ALU operations </li></ul></ul>[Figure: percentage of fused macro-ops (0% to 100%) for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2 and 300.twolf, broken down by distance 1 to 7]
    52. 52. Performance (Normalized IPC) <ul><li>Baseline: generic superscalar </li></ul><ul><li>Macro-op: Fused macro-ops with pipelined issue logic </li></ul><ul><li>Baseline Pipelined: superscalar with pipelined issue logic </li></ul>[Figure: relative IPC performance (0.6 to 1.3) versus issue window size (16 to 64) for 4-wide Macro-op, 4-wide Baseline, 4-wide Baseline Pipelined and 2-wide Macro-op]
    53. 53. VM Research <ul><li>Architecture Support for VMs </li></ul><ul><ul><li>Enable spectrum of VMs (process, system, HLL, co-designed) </li></ul></ul><ul><ul><li>Support for dynamic translation and optimization </li></ul></ul><ul><ul><li>Primitives: code caches & indirect jumps; concealed memory </li></ul></ul><ul><ul><li>Pays for itself – helps get rid of obsolete ISA baggage </li></ul></ul><ul><li>VM applications </li></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Fault Tolerance </li></ul></ul><ul><li>Co-Designed VMs </li></ul><ul><ul><li>Efficient microarchitecture </li></ul></ul><ul><ul><li>Adaptive microarchitecture </li></ul></ul><ul><ul><ul><li>For power efficiency </li></ul></ul></ul><ul><ul><ul><li>For performance </li></ul></ul></ul><ul><li>New ISAs </li></ul><ul><ul><li>Application-area specific ISAs </li></ul></ul><ul><ul><li>Support for Java/MSIL </li></ul></ul><ul><ul><li>“Convergence” architectures </li></ul></ul><ul><li>Computer Architects can do Computer Architecture! </li></ul>