eMIPS Project Overview
i.e...
What if your next computer was an FPGA?
Alessandro Forin
Microsoft Research, Redmond
April 3, 2008
Outline
• How FPGAs took over the world (architecture)
• Easy pickings and novelty items (eMIPS)
• Simulation, debugging, profiling, … (tools)
Field Programmable Gate Arrays are..
..essentially two superimposed memory planes:
• The routing plane → which signal goes where
• The logic plane → what logic function a node computes
Just like RAM, they are
• slower (roughly 1/10)
• power hungry
Nonetheless, they took over the world.
One way is..
• UltraSparc CPU
  - 2.0 GHz, 4-issue, 15 stages
• Memory Hierarchy
  - 16KB/1MB L1/L2
  - 32 Byte cacheline
  - 1/15 cycle L1/L2 latency
  - 160 cycle memory latency
• Reconfigurable HW
  - Mapped to Xilinx Virtex-4 to obtain clock frequency
Courtesy: K. Compton, U. Wisc.
Another way is..
The FPGA advantage
• No code fetches
• Fine-grained, spatially parallel
• Data access: static prediction, reordering, variable width
eMIPS
Dynamically Extensible Processors
• Using an FPGA, we have realized a (MIPS) processor that
  extends itself at runtime, using Extensions that are safe for
  multi-user operating systems
• Applications:
  - Speed up execution using an Application-Specific CPU (M2V)
  - Unobtrusively monitor (real-time) software (P2V)
  - Loadable software debugging support (eBug)
  - Load/Unload peripherals at runtime, minimizing chip area
  - Load/Unload processor cores on demand
• First release now available for non-commercial use
The eMIPS “Workstation”
• Motherboard: Xilinx ML401 evaluation board for the Virtex4 FPGA
• eMIPS is on the FPGA, just add keyboard, mouse and disk!
Binaries with Hardware Acceleration
Extended instructions are inserted into the binaries. If
the HW Extension is loaded, the instruction executes and
skips the basic block. Otherwise, eMIPS interprets the
instruction as a NOP and executes the block.
[Chart: Application speedups (worst case). Categories: Video Games, Real-Time, Spec2000]
Other code…
  Op78  sp,ra,10     (New Instruction)
  Lw    ra,10(sp)    (Original Basic Block, first instruction)
  Jr    ra
  Addiu sp,sp,18     (Original Basic Block, last instruction)
Other code…
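The fallback behavior on this slide can be sketched in software. This is a toy Python model, not the eMIPS implementation: the `op78` mnemonic comes from the slide, while the `run` function, the tuple encoding, and the `block_len` field are assumptions made for illustration. Control flow inside the block (the `jr`) is deliberately ignored.

```python
# Toy model: an extended instruction either replaces the basic block that
# follows it (extension loaded) or behaves as a NOP (extension absent).
# The (mnemonic, block_len) encoding is hypothetical, not the eMIPS format.

def run(program, extension_loaded):
    """Return the list of mnemonics that actually execute, in order."""
    executed = []
    pc = 0
    while pc < len(program):
        mnemonic, block_len = program[pc]
        if mnemonic == "op78":
            if extension_loaded:
                # The extension does the work of the whole block, then skips it.
                executed.append("op78")
                pc += 1 + block_len
            else:
                # Treated as a NOP: fall through into the original block.
                pc += 1
            continue
        executed.append(mnemonic)
        pc += 1
    return executed


# block_len counts the original instructions covered by the extension.
PROGRAM = [
    ("op78", 3),
    ("lw", 0),
    ("jr", 0),
    ("addiu", 0),
]

if __name__ == "__main__":
    print(run(PROGRAM, extension_loaded=True))
    print(run(PROGRAM, extension_loaded=False))
```

The same binary runs in both cases, which is the point of the scheme: a machine without the Extension still executes the original block unchanged.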
Assertion Based Verification with P2V
• Use the IEEE-standard hardware Property Specification Language
  (PSL) to verify C (real-time) programs
• Implement it using a simulator, or in reconfigurable hardware
• PSL-to-Verilog compiler: creates Extensions from PSL code
• Zero instrumentation code and zero overhead!
Roles of eMIPS, P2V and GCC
[Diagram labels: C, PSL, GCC, ELF image, debug info, P2V, Bitfile, Monitor Unit (MU), Core Datapath]
PSL property:
always(REQ→eventually(ACK==1))
C program:
int foo(void){
REQ:
    device->CONTROL = 1;
    while(1) {
        ACK = device->STATUS;
        ....
    }
}
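To make the property always(REQ→eventually(ACK==1)) concrete, here is a minimal Python sketch that checks it over a finite recorded trace. It is a software illustration only, not what P2V generates. The `check_req_ack` name and the trace format are assumptions, and so is the choice to treat a request still unanswered at the end of the trace as a violation.

```python
def check_req_ack(trace):
    """Check always(REQ -> eventually(ACK == 1)) on a finite trace.

    trace: list of (req, ack) pairs, one per observed step.
    Returns the index of the first request that is never followed by
    ack == 1, or None if every request is eventually acknowledged.
    """
    pending = None  # index of the oldest unanswered request, if any
    for i, (req, ack) in enumerate(trace):
        if req and pending is None:
            pending = i
        if ack == 1:
            # An ack answers every request raised so far, including one
            # raised in this same step.
            pending = None
    return pending


if __name__ == "__main__":
    good = [(1, 0), (0, 0), (0, 1)]
    bad = [(1, 0), (0, 1), (1, 0), (0, 0)]
    print(check_req_ack(good))
    print(check_req_ack(bad))
```

A hardware monitor would track the same single bit of state (is a request pending?) while watching the datapath, which is why no instrumentation code is needed in the program itself.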
Extensible Peripherals
• Use the eMIPS extension slot for I/O peripherals
• Safely load/unload peripherals on demand
• Saves area, forward-compatible, bug fixes, …
• Flexible interface solves perf. and atomicity issues
[Diagram: Peripheral Configuration State Machine. States: Absent, Not Configured, Run, Suspended (Power Mgt.)]
Tools
eBug: the extensible debugger
• Safe, in-process, JTAG-style software debugger
• Extensible in hw (watchpoints)
• Extensible in sw (communication protocols)
• Use P2V as a trigger
Processor Debugging & Verification
[Diagram labels: ModelSim: eMIPS CPU; Giano: Simulated Board; Giano: Oracle]
Optimizing the ISA with M2V
[Diagram labels: Software Developers; Hardware Designers; Compiled Code; Profiling; Top one or two Basic Blocks; Basic Blocks Implemented as Hardware Extensions; Original Binaries Modified to utilize Hardware Acceleration]
Same speed, half the area of hand-generated Verilog code
M2V Role
The BBTools
• BBFIND finds the basic blocks in MIPS, PPC and ARM images
  (ELF and PE++)
• A simulator (Giano) uses the BB info to generate profiling
  information. BBSORT+BBDUMP print it
• BBMATCH applies the new instructions to the original
  executables
• The simulator generates the new profile data
[Chart: Execution Counts of Individual Basic Blocks in XQuake, on the Xbox360]
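As a rough illustration of the first two steps (finding basic blocks, then ranking them by execution count), here is a small Python sketch. It uses the textbook "leader" method on a toy instruction list. It is not how BBFIND or BBSORT work internally, the instruction format is hypothetical, and MIPS branch delay slots and indirect jumps are ignored.

```python
def find_basic_blocks(instrs):
    """Split a toy instruction list into basic blocks.

    instrs: list of (mnemonic, target), where target is the index of a
    branch destination, or None for non-branch instructions.
    Returns (start, end) index pairs, with end exclusive.
    """
    leaders = {0} if instrs else set()
    for i, (_, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)          # a branch destination starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return list(zip(starts, ends))


def hottest(blocks, counts, n=2):
    """Return the n blocks with the highest execution counts.

    counts maps a block's start index to how often it was executed,
    as a profiling simulator might report it.
    """
    return sorted(blocks, key=lambda b: counts.get(b[0], 0), reverse=True)[:n]


if __name__ == "__main__":
    code = [("addiu", None), ("beq", 3), ("lw", None), ("sw", None), ("bne", 0)]
    blocks = find_basic_blocks(code)
    print(blocks)
    print(hottest(blocks, {0: 10, 2: 500, 3: 40}))
```

Picking only the top one or two blocks mirrors the M2V flow, where a small number of hot blocks are turned into hardware Extensions.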
Real-Time Simulation: Giano
• Definition of “Real-Time Simulation”:
  Realize a software system that matches the temporal behavior
  of the hardware+software system being simulated, using the
  same time-ordered sequence of inputs
• Applicable to hybrid hw+sw simulators too
• Requires:
  - Clock adaptation
  - I/O adaptation
[Diagram labels: ModelSim; V-models; C-models; CPU; MEM; I/O; ARM: At91m63200; Xilinx: Spartan3; Optional: Icarus Verilog Interpreter (FPGA: vvp.dll); NamedPipe “GIANO”; PLI plug-in (VPI); NamedPipe client; Optional: external devices (LabView); PLI plug-in (VPI)]
Start = TheCounter->Value;
...compute...
End = TheCounter->Value;

module test;
always @(posedge clock)
    counter = counter + 1;
User Interface: Visio Graphs
[Figure: Atmel EB63 Evaluation Board]
Clock Adaptation
Problem: Output a character every second
• Timer too slow/fast? → Incorrect
• Host load changes? → Erratic
Solution: Rate-limit the clock using introspection and adaptation
Rate-limiting the Clock
1. Every M (10**3) clock ticks, spin idle for D microseconds
2. Every N (10**6) clock ticks, check the actual frequency
   against the target frequency and adjust the delay D
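The two steps above can be sketched as a small Python class. This is a hedged illustration, not Giano's code: the class and attribute names are invented, `time.sleep` stands in for the idle spin, and the proportional correction of D is an assumption, since the slide does not say how the delay is adjusted.

```python
import time


class RateLimitedClock:
    """Slow a simulated clock down to a target tick frequency."""

    def __init__(self, target_hz, m=10**3, n=10**6,
                 now=time.perf_counter, sleep=time.sleep):
        self.target_hz = target_hz
        self.m, self.n = m, n              # delay period, check period (ticks)
        self.now, self.sleep = now, sleep  # injectable, so tests can fake time
        self.delay_us = 0.0                # D, in microseconds
        self.ticks = 0
        self.last_check = now()
        self.measured_hz = None

    def tick(self):
        self.ticks += 1
        # Step 1: every M ticks, idle for D microseconds.
        if self.ticks % self.m == 0 and self.delay_us > 0:
            self.sleep(self.delay_us / 1e6)
        # Step 2: every N ticks, compare actual and target frequency.
        if self.ticks % self.n == 0:
            t = self.now()
            elapsed = t - self.last_check
            self.last_check = t
            if elapsed > 0:
                self.measured_hz = self.n / elapsed
            # Assumed rule: spread the timing error over the N/M delays
            # that occur in one check period; never go below zero.
            error_s = self.n / self.target_hz - elapsed
            self.delay_us = max(
                0.0, self.delay_us + error_s * 1e6 * self.m / self.n)
```

A simulator loop would call `tick()` once per simulated clock cycle. If the host runs faster than the target, D grows until the measured frequency matches; if the host is slower, D shrinks to zero and nothing more can be done in software.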
Adjusting the Delay factor
[Chart: Delay Calibration. x-axis: time (sec); y-axis: ns, from -100 to 600. Series: DelayFactor, TgtIpsTime, CurIpsTime, DelayAdjust]
Effect on IPS
[Chart: Effective IPS. x-axis: time (sec); y-axis: IPS, from 0 to 20,000,000. Series: Target, Current]
I/O adaptation
Problem: 9600 baud serial line
• From disk trace? → too fast
• From user? → too slow
• From real serial line? → depends
Solution:
• Link to rate-limited clock
• Adapt using events and notifications
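For the "too fast" case (input replayed from a disk trace), linking I/O to the rate-limited clock can be sketched as follows. This is an illustrative Python model, not Giano's mechanism: the class name, the `poll` interface, and the 10-bits-per-character framing (one start bit, eight data bits, one stop bit) are assumptions.

```python
class PacedSerialInput:
    """Release buffered bytes at the line rate, in simulated clock cycles."""

    def __init__(self, data, clock_hz, baud=9600, bits_per_char=10):
        self.data = list(data)
        # Simulated cycles that must pass between two characters.
        self.cycles_per_char = clock_hz * bits_per_char // baud
        self.next_due = self.cycles_per_char

    def poll(self, cycle):
        """Return the bytes that have arrived by simulated cycle `cycle`."""
        out = []
        while self.data and cycle >= self.next_due:
            out.append(self.data.pop(0))
            self.next_due += self.cycles_per_char
        return bytes(out)


if __name__ == "__main__":
    src = PacedSerialInput(b"hello", clock_hz=9_600_000)
    print(src.poll(9_999))
    print(src.poll(10_000))
    print(src.poll(35_000))
```

Because delivery is keyed to the simulated cycle count rather than to host time, the serial line stays in step with the clock however the clock itself is being rate-limited.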
Giano: Key points
• Giano is the first Real-Time Simulation Framework for
  hardware-software co-development
• Uses Microsoft Visio as the graphing and execution UI
• Configurations are Platform XML files
• Nodes in the graph are separate, user-defined DLLs
• Lots of functionality pre-built into the base framework
• 60+ working modules, 20+ systems, 4 years of internal use
• Release V2 available, free for academic use
Credits
Full-timers
Neil Pittman, Alessandro Forin
Interns
Nathaniel Lynch, Behnam Neekzad, Ping Hang Cheung,
Bharat Sukhwani, Lu Hong, Karl Meier, Giovanni Busonera
http://research.microsoft.com/research/EmbeddedSystems
