The Power Architecture and Power.org word marks and the Power ...
 

  • The virtual hardware in Simics is independent of the host hardware. This also shows our breadth of processor support, as an indirect point.
  • We have seen three main themes in the application of virtualized software development at our customers. At the most basic level, virtual hardware is used to replace physical machines. This saves capital costs and also increases the availability of systems for developers, reducing capacity bottlenecks in the organization. Next, virtual hardware can be used to accelerate product development. Here, virtual platforms are developed alongside the hardware and made available months, quarters, or even years ahead of the real hardware. This provides a head start on software development and partner ecosystem development, ultimately shortening the time to market for the product. The product can be as small as a new SoC processor chip, or as complex as a complete new server product. In the optimize phase, we use the full potential of virtual hardware to transform the system development process. Software and hardware development are decoupled, and the strong debug and analysis features of virtual platforms are used to reduce software development time and risk. Project schedules can be tighter, software development can ramp faster, and the system development process for combined hardware and software is shorter and results in better quality products. This reduces the system maintenance costs and the risk of expensive problems found in the field.
  • What did we say in the earlier slide: In the optimize phase, we use the full potential of virtual hardware to transform the system development process. Software and hardware development are decoupled, and the strong debug and analysis features of virtual platforms are used to reduce software development time and risk. Project schedules can be tighter, software development can ramp faster, and the system development process for combined hardware and software is shorter and results in better quality products. This reduces the system maintenance costs and the risk of expensive problems found in the field.
  • The point of this slide is to emphasize that we simulate more than processors & memory. We want to emphasize that we can simulate complete boards, racks of boards, networks of systems. Very powerful for users developing complex systems.
  • The idea here is to show how Simics Accelerator does two things. First, it brings simulation speed back to where it was for a single-core host and single-board target. Second, it increases the total simulation power by a factor of four on a four-core host. This is achieved by using multiple threads inside a single Simics process.
  • Light green: discarded. Dark green: kept. Red: unique data kept for each machine, where we know it. Zero pages might be a large contributor in the 8572E case. Within the single machine, we assume that a significant saving comes from U-Boot copying itself to RAM. Within the eight Ebony machines, some saving comes from internal redundancy and some from redundancy between machines. We have not analyzed the sharing of data in the heterogeneous case, nor how large the local data is for each machine. That case runs 6 different OS images on four different types of machines (multi-mix-freescale.simics).
  • The sharp increase in the middle is probably due to cache locality; the diminishing returns come from reducing switch overhead.
  • The 300¤ number for Central is a wild estimate; the point is that distribution is less efficient due to

Presentation Transcript

  • Simics Accelerator Virtualizing Large Systems Dr. Mikael Bergqvist, Senior Application Engineer 2008-05-30
  • Topic
    • Speeding up the simulation of large target systems
      • Bring virtualized software development to the big stuff
    • Outline
      • Virtualized software development
        • Apologies if you attended the morning presentation on multicore debug, some parts will be repeated. But with two tracks we cannot be sure that you all saw that.
      • Target and host system trends
      • Multithreading virtual hardware models
      • Leveraging redundant information with Page Sharing
      • Results
    • Virtualization for Software Developers
  • What is Virtual Hardware?
    • A piece of software
    • Running on a regular PC, server, or workstation
    • Functionally identical to a particular piece of hardware
    • Runs the same software as the physical hardware system
    Virtual HW
  • Virtutech Core Technology
    • Model any electronic system on a PC or workstation
      • Simics is a software program, no hardware required
    • Run the exact same software as the physical target (complete binary)
    • Run it fast (100s of MIPS)
    • Model any target system
      • Networks, SoCs, boards, ASICs, ... no limits
      • Here is where accelerator comes in
    • For the benefit of software developers and hardware providers
    • Enables process change in software development
    Figure labels (software stack, shown on two hosts): User application code; Middleware and libraries; Target operating system(s); Virtual target hardware; Simics; Host operating system; Host hardware
  • Why do we use Virtual Hardware?
    • Business Reasons
      • “It hits the bottom line”
    • Develop software before hardware becomes available
      • Shorten time-to-market
    • Decouple hardware and software development
    • Reduce software risk
    • Increase quality
    • Availability & Flexibility
    • Engineering Reasons
      • “It is cool”
    • Checkpoint & restore
    • Virtual time
      • Precisely synchronized
      • Stopped at any point
    • Repeatability
    • Reverse execution
    • Configurable
    • Control
      • Change anything
    • Inspection power
      • See anything
    • No debug bandwidth limit
  • Value Proposition
    • Phases: Replace, Accelerate, Optimize
    • Figure labels: Test and configuration; Early Software Development; Capital Expenditure Reduction; Time to Market; Enhance System Debug; Cost of Recall and System Maintenance
  • Replace
    • Availability
    • Virtual system is software
      • Trivial to copy
      • Trivial to distribute
    • Cheaper than custom HW
    • Each engineer can have a custom hardware system at their desk
    • Scalability
    • No physical supply limit
      • Any number of any board
      • Any type of system in ”infinite” supply at no cost
      • Old systems or new
    • A virtual system can be big or small by simple software (re)configuration
  • Accelerate
    • Virtual hardware created from the system specification
      • Model available much earlier than prototype hardware
      • Software development starts much earlier
    • Software available when hardware starts shipping
      • Shorter sales cycles, less product risk, shorter time-to-market
    Figure labels (project timeline): Board design; Board prototype production; Application software development; Hardware/Software Integration and Test; Hardware-dependent software development; Virtual model “production”
  • Optimize
    • Take advantage of the full power of virtualized software development and virtual hardware
    • Factor it into the project plan for a system
    • Observed effects:
      • Software not blocked by hardware availability
      • Development schedules that start earlier and end earlier
      • Shorter development time for equivalent functionality
      • Shorter time to find and fix the really hard bugs
      • Fewer show-stoppers
      • More tested software
      • Improved hardware and hardware documentation quality
      • Very short time before software runs on first hardware
  • Optimized Debugging Power
    • Virtual hardware has very nice debugging and testing abilities
    Example script:
      break -x 0x0000 0x1F00
      break-io uart0
      break-exception int13
      ...
      con0.wait-for-string "=>"
      con0.input "bootm "
      con0.wait-for-string "login:"
      con0.input "root "
      ...
    • Synchronous stop for entire system
    • Determinism and repeatability
    • Reverse execution
    • Unlimited and powerful breakpoints
    • Trace anything
    • Powerful scripting
  • The Disk Corruption Example Bug
    • Distributed fault-tolerant file system got corrupted
      • Rack-based system with many boards
      • Intermittent error
      • Error seen as a composite state across multiple disks: they suddenly and intermittently became inconsistent
      • Months spent chasing it on physical hardware
    • Simics solution:
      • Reproduce corruption in Simics model of target
      • Pin-point time when it happens, by interval halving
      • Around the critical time, take periodic snapshots of disks
      • Check consistency of disk states in offline scripts
    • Result:
      • Found the precise instruction causing the problem
      • Captured the network traffic pattern causing the issue
      • Communicated the complete setup and reproduction instructions to development, greatly facilitating fixing the bug
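The "interval halving" step above is an ordinary bisection over virtual time, which works because the simulation is deterministic and can be re-run to any point. A minimal sketch follows, assuming the disks are consistent at the start, inconsistent at the end, and stay inconsistent once the fault has occurred. The names `first_bad_time`, `check`, and `FAULT_AT` are hypothetical and are not part of Simics; `check` stands in for "re-run to time t and verify disk consistency offline".

```python
from typing import Callable, List


def first_bad_time(lo: int, hi: int, is_corrupted: Callable[[int], bool]) -> int:
    """Return the earliest time in (lo, hi] at which is_corrupted() is True.

    Assumes the state is clean at lo, corrupted at hi, and that corruption
    persists once it has happened.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_corrupted(mid):
            hi = mid  # the fault happened at or before mid
        else:
            lo = mid  # the fault happened after mid
    return hi


# Hypothetical stand-in for a deterministic re-run plus consistency check.
FAULT_AT = 734_501
probes: List[int] = []


def check(t: int) -> bool:
    probes.append(t)
    return t >= FAULT_AT


found = first_bad_time(0, 1_000_000, check)
print(found, len(probes))
```

Each probe halves the remaining interval, so a window of a million time units needs at most 20 re-runs.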
  • What Types of Systems Can Be Virtualized? Complete Systems & Networks
    • Satellite constellation, telecom network
    Racks of Boards & Backplanes
    • Telecom rack, avionics bay, blade server
    Complete Boards
    • MPC8572DS board, ebony board, custom
    Devices & Buses
    • PCIe, RapidIO, I2C, Custom FPGA
    SoC Devices
    • MPC8572E, PPC440GX, or CSSP ASIC
    Processor & Memory
    • e300, e500, 440, 970, 7450, Power6, ...
    Examples This is where performance becomes an issue
    • Technology Trends and Simics Accelerator
  • Trends
    • Target systems are getting more complex
      • Multiple boards
      • Multiple processors
      • Multicore SoCs
      • More and larger memories
    • Reduces perceived simulation performance as more work is needed per target time unit
    • Host hardware is parallel
      • Multicore processors
      • Multiple processors
      • Clusters of PCs
    • Multicore standard for desktop
      • 600 EUR for a 2-core PC
      • 3000 EUR gets 8-core server
    • Increases processing power for software which is parallel
    • NB: Memory size is not increasing as quickly as #cores
  • Simics Accelerator
    • Launched with Simics 4.0 in April 2008
    • Contains a set of technologies for speeding up execution of large target systems in Simics
      • Tackle more complex target systems
      • Using multiple host processor cores
      • Taking advantage of redundancy in target system
    • Without impacting Simics determinism, control, synchronization, insight, and reverse execution
  • The Target Systems
    • Large, complex targets
      • Multiple boards
      • Multiple networks
      • 20-100 processors
      • Heterogeneous processors
      • Many gigabytes of memory
    • Almost overwhelming – but not with Accelerator!
      • Brings a whole new level of systems into the bracket of ”conveniently fast”
    • Typical target markets:
      • Telecom network equipment (racks and clusters)
      • Military/aerospace racks
      • Datacenter blade enclosures
      • Distributed systems
      • Networked systems
    • Multithreading Simics
  • Not Trivial to do Right
    • Simics accelerator
    • Any target machine
      • SMP, AMP, distributed, clustered
    • SMP host
    • Maintain control, insight, determinism, repeatability like single-threaded execution
    • Independent of host machine behavior
      • # of cores, speed of cores, type of cores
      • Same results regardless of host
    • VMware, CECSim
    • Run parallel threads without coordination
    • No tightly coordinated stop, repeatability
    • SMP target on SMP host only
    • Depend on host hardware for behavior
    • Current simulation and virtualization tools
      • Simics pre-4.0
      • SystemC
      • etc.
    • Repeatability and controllability “easy”
      • Round-robin time sharing
    • Independent of host hardware behavior
      • Same results regardless of host
    • Shared-memory and local-memory targets on any host
    Column headings for the three groups above, in order: Controlled parallel execution; Uncoordinated parallel execution; Single-threaded execution
  • Multithreading Simics: Overview
    • Simple system: target simulation speed 100%, total simulator work 1.0
    • Complex system: target simulation speed 25%, total simulator work 1.0
    • Complex system with Simics Accelerator: target simulation speed 100%, total simulator work 4.0
  • Multithreading Simics: Details
    • Simics 4.0 can utilize multiple host processors for simulation
    • The simulation is divided up into cells
      • The cells can run concurrently in different threads
    • Objects in different cells can only communicate with each other through message passing (Simics links)
      • Processors that share memory or devices have to be in the same cell (currently)
      • Boards or machines that communicate over Ethernet and other networks can be in separate cells
      • Typically, one or a few boards/machines in a cell
      • Links connecting machines require some smarts
    • Orthogonal to other Simics features
    • Reuses target structure from earlier Simics versions
  • Hierarchical Synchronization
    • Deterministic semantics
      • Regardless of host # cores
    • Periodic synchronization between different cells and target machines
      • Puts a minimum latency on communication propagation
      • Synch interval determines simulation results, not number of execution threads in Simics
    • Latency within a cell:
      • 1000-10000 cycles
      • Works well for SMP OS
    • Latency between cells:
      • 10 to 1000 ms
      • Works well for latency-tolerant networks
    • Builds on current Simics experience in temporally decoupled simulation
      • This works well in practice
    Figure labels: Synchronize shared memory machine tightly; Short latency between machines with tight network coupling, inside a single cell; Longer latency on network (links) between cells
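The claim that the synchronization interval, and not the number of threads, determines the result can be illustrated with a toy model. This is a sketch under stated assumptions and not Simics code: cells run one quantum at a time, see only messages delivered at the previous synchronization point, and messages are delivered in a fixed order. The function `run` and its `schedule_seed` parameter are hypothetical; the seed shuffles the execution order inside a quantum to imitate arbitrary host thread scheduling.

```python
import random
from typing import List, Tuple


def run(num_cells: int, quanta: int, schedule_seed: int) -> List[int]:
    """Simulate cells that exchange messages only at quantum boundaries."""
    rng = random.Random(schedule_seed)
    state = [i + 1 for i in range(num_cells)]
    inbox: List[List[Tuple[int, int]]] = [[] for _ in range(num_cells)]
    for q in range(quanta):
        outbox: List[Tuple[int, int, int]] = []  # (destination, sender, payload)
        order = list(range(num_cells))
        rng.shuffle(order)  # stands in for host-dependent scheduling
        for c in order:
            # A cell sees only what was delivered at the previous sync point.
            for sender, payload in inbox[c]:
                state[c] = (state[c] * 31 + payload + sender) % 1_000_003
            state[c] = (state[c] + q) % 1_000_003
            outbox.append(((c + 1) % num_cells, c, state[c]))
            outbox.append(((c + 2) % num_cells, c, state[c]))
        # Synchronization point: deliver in a fixed (destination, sender)
        # order, never in arrival order.
        inbox = [[] for _ in range(num_cells)]
        for dest, sender, payload in sorted(outbox):
            inbox[dest].append((sender, payload))
    return state


results = [run(4, 50, seed) for seed in (1, 2, 99)]
print(results[0] == results[1] == results[2])
```

Because no cell reads another cell's state inside a quantum, the execution order within a quantum cannot influence the outcome; only the quantum length and the delivery rule do.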
  • Scaling Out
    • Multithreading and distribution of the simulation can be combined to simulate extremely large systems
      • Make more cores and more host memory available
      • Takes Simics into the hundreds of nodes domain
      • Distribute at network links, just like cell boundaries
    Figure: three Host Workstations, each running Simics, connected by links through a Switch
    • Leveraging Target Redundancy
  • Redundancy in Target Systems
    • Large systems are not built from all-unique components
    • Software repeats
      • Machines use the same OS, middleware, applications
    • Data repeats
      • Redundant databases
      • Data packets passed around in a cluster
    • Copies within machine
      • Code and data copied from disk to memory to be used
    • Simulator sees the whole system, leverage repetition to reduce memory footprint
    Figure labels: Linux, RTOS, DB, App A, Dataset, and Packet, each repeated across several machines
  • Data Page Sharing Implementation
    • Simics memory images used for all data stores (flash, ram, rom, disks, etc.)
      • Standard Simics feature
    • Identical pages in different memory images stored in a single copy
      • Within machines
      • Between machines
      • Regardless of type of memory in the target
      • Copy-on-write semantics for safety (obviously)
    • Reduces memory footprint, increases data locality, helps maintain performance
    Figure: Simics containing several target machines, each built from cpu, RAM, flash, and dev (device) blocks
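The idea of storing identical pages once, with copy-on-write, can be sketched as a content-addressed page store. This is an illustration only and does not show how Simics implements its memory images: the classes `PageStore` and `MemoryImage`, the 4096-byte page size, and the use of SHA-256 as the page key are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size for this sketch


class PageStore:
    """Keeps a single copy of each distinct page content."""

    def __init__(self) -> None:
        self.pages: Dict[str, bytes] = {}
        self.refs: Dict[str, int] = {}

    def intern(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.pages:
            self.pages[key] = data
            self.refs[key] = 0
        self.refs[key] += 1
        return key

    def release(self, key: str) -> None:
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.pages[key]
            del self.refs[key]


class MemoryImage:
    """A RAM, flash, or disk image whose pages live in a shared PageStore."""

    def __init__(self, store: PageStore, num_pages: int) -> None:
        self.store = store
        zero = bytes(PAGE_SIZE)
        self.table: List[str] = [store.intern(zero) for _ in range(num_pages)]

    def read(self, page: int) -> bytes:
        return self.store.pages[self.table[page]]

    def write(self, page: int, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses page boundary")
        # Copy-on-write: modify a private copy, then share it again by content.
        old = self.table[page]
        buf = bytearray(self.store.pages[old])
        buf[offset:offset + len(data)] = data
        self.table[page] = self.store.intern(bytes(buf))
        self.store.release(old)


store = PageStore()
ram_a = MemoryImage(store, 8)
ram_b = MemoryImage(store, 8)
ram_a.write(0, 0, b"boot")   # ram_a gets its own page; ram_b is untouched
ram_b.write(0, 0, b"boot")   # identical content is shared again
print(len(store.pages))
```

Sixteen target pages across two images end up as two stored pages: the shared zero page and the shared "boot" page.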
    • Simics Accelerator: Results
  • Accelerator Scaling
    • Many times better scalability for virtual hardware
      • Brings virtualization to larger system setups
      • More boards and larger memories handled with same host
      • (No real effect on single-machine setups)
    • Better use of host hardware
      • Use all the cores in a workstation
      • Do not waste workstation memory
      • Same semantics everywhere: start on a small machine, move to a larger one for large simulations if needed
    • Overall, removes target system size as an obstacle for using virtualized software development
  • Single Point of Control Eight machines simulated by two threads, inspect any part of any machine from single interface
  • Multithreading Performance Results
    • Performance effect of multithreading depends on
      • Target system characteristics
      • Software latency requirements
      • Target system load balance
      • Target system communications pattern
    • Synthetic experiments and lab experience
      • Single-thread performance not affected
        • Simics works just as well as before on a single core
        • No impact on idle loop simulation
      • Up to 10x Simics 3.2 performance
        • 8-core host, 64 target machines, no communication
      • Up to 6x scaling on 8-core host
        • Pretty respectable
  • Page Sharing Results
    • All results are for networks of machines booted to prompt, but no applications loaded
    • Chart panels: Eight PPC440GP/Linux machines; Mixed network (4 machines, 6 OS); Single PPC440GP/Linux machine; Three 8572e/Linux machines
    • Chart labels, in transcript order:
      • Local unique data: 4%; Shared data across machines: 96%; Total data savings: 65%
      • Total data savings: 20%; Data repeated within the machine: 20%
      • Local unique data: 1%; Shared data across and within machines: 98%; Total data savings: 89%
      • Zero pages: 90%; Total data savings: 91%; Other shared: 1%
    • Questions?
  • Munich
    • Spares
  • Simulation Speed
    • Detail level determines speed
    • The more detail, the slower the simulation
      • You can run lots of software with low detail level
      • or not very much software with high detail level
      • But not lots of software with high detail level
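    • Typical figures by simulation detail level:
      • Virtual prototypes: typical slowdown 5, approximately 400 MIPS, 5 minutes to simulate one real-world minute
      • Cycle-approximate simulation: typical slowdown 500, approximately 4 MIPS, 8 hours to simulate one real-world minute
      • Computer architecture: typical slowdown 10000, approximately 0.2 MIPS, 7 days to simulate one real-world minute
      • Gate-level simulation: typical slowdown 1000000, approximately 0.002 MIPS, 2 years to simulate one real-world minute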
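The figures on this slide follow from simple arithmetic: one real-world minute costs "slowdown" host minutes, and in every row the slowdown multiplied by the MIPS figure gives the same 2000. The short sketch below reproduces both observations; the helper name `time_for_one_minute` is hypothetical, and the slide rounds its times more coarsely than the code does.

```python
def time_for_one_minute(slowdown: float) -> str:
    """Host time needed to simulate one real-world minute at a given slowdown."""
    minutes = slowdown  # one target minute costs `slowdown` host minutes
    if minutes < 60:
        return f"{minutes:.0f} minutes"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hours"
    days = hours / 24
    if days < 365:
        return f"{days:.1f} days"
    return f"{days / 365:.1f} years"


# (detail level, typical slowdown, approximate MIPS) as given on the slide
rows = [
    ("Virtual prototypes", 5, 400),
    ("Cycle-approximate simulation", 500, 4),
    ("Computer architecture", 10_000, 0.2),
    ("Gate-level simulation", 1_000_000, 0.002),
]

for name, slowdown, mips in rows:
    print(f"{name}: {time_for_one_minute(slowdown)}, "
          f"slowdown x MIPS = {slowdown * mips:.0f}")
# Virtual prototypes: 5 minutes, slowdown x MIPS = 2000
# Cycle-approximate simulation: 8.3 hours, slowdown x MIPS = 2000
# Computer architecture: 6.9 days, slowdown x MIPS = 2000
# Gate-level simulation: 1.9 years, slowdown x MIPS = 2000
```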
  • Workload Sizes (size in number of instructions)
    • Booting Linux 2.4 on a simple StrongARM machine: 50 M
    • Booting a real-time operating system on a PPC440GP: 100 M
    • Booting Linux 2.6 on a single-core MPC8548 SoC: 1000 M
    • Booting Linux 2.6 on a dual-core MPC8641D SoC: 3600 M
    • Running 10 million Dhrystone iterations on an UltraSPARC core: 4000 M
    • One second in a 10-processor 1GHz rack system: 10000 M
  • Temporal Decoupling Speed Impact
    • Experimental data
      • 4 virtual PPC440 boards
      • Booting Linux
        • Which is a particularly hard workload, lots of device accesses
      • Execution quanta of 1, 10, 100, ... 1,000,000 cycles
    • Notable points:
      • 10x performance increase from a quantum of 10 to a quantum of 1000
      • +30% from a quantum of 1000 to a quantum of 1,000,000
  • Simics 4.0 & Accelerator Performance
    • Running a single machine in a single thread is equal in performance with Simics 3.2
      • Setups with many machines are often faster than with 3.2
    • Multithreading makes it much easier to utilize multicore and multiprocessor host machines
      • Linear scaling seen for simple cases such as compute-intense workloads or boot with little communication
      • Variability of the workload limits performance (see next slide)
      • Performance reduced if low-latency communication is required
    • Page sharing is not yet optimized for performance
      • Current implementation saves memory without affecting performance (neither better nor worse than without page sharing)
      • Has potential to improve performance
  • Multithreading Performance in Practice
    • Multiple boards in a single target system
    • Virtual time progress, with time quanta
    Figure: target machines A, B, and C in Simics, each advancing through time quanta (A1 to A3, B1 to B3, C1 to C3) along the virtual time axis
  • Execution on Single-Threaded Simics
    Figure: virtual time progress, with quanta, above serialized execution on single-threaded Simics, where the quanta of A, B, and C are laid out one after another along the real time axis
    • The simulation of the three target machines is interleaved on a single processor
    • The real time it takes to execute each time quantum tends to vary with target hardware and software characteristics
  • Execution on Multi-threaded Simics
    Figure: serialized execution on single-threaded Simics compared with parallel execution on multi-threaded Simics, with Stall periods marked on the real time axis
    • Each time quantum has to be finished on all machines before progressing to the next quantum
    • Best case: all target time quanta take the same time to simulate
  • Execution on Multi-threaded Simics
    Figure: the same comparison of serialized and parallel execution, with Stall periods marked
    • Speed-up over single-threaded Simics will vary over time, and is limited by load balance
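The load-balance limit can be made concrete with a small calculation. The sketch assumes one host thread per target machine, a barrier after every quantum, and no synchronization overhead; the function name `speedup` and the cost figures are made up for illustration and are not measurements.

```python
from typing import List


def speedup(quantum_costs: List[List[float]]) -> float:
    """quantum_costs[q][m] is the host time to simulate quantum q of machine m."""
    # Single-threaded: every quantum of every machine runs back to back.
    serial = sum(sum(row) for row in quantum_costs)
    # Multi-threaded: each quantum lasts as long as its slowest machine,
    # and the other threads stall until the barrier.
    parallel = sum(max(row) for row in quantum_costs)
    return serial / parallel


balanced = [[2, 2, 2], [2, 2, 2]]  # best case: all quanta cost the same
skewed = [[1, 4, 1], [3, 1, 2]]    # same total work, uneven per quantum

print(speedup(balanced))           # 3.0
print(round(speedup(skewed), 2))   # 1.71
```

With three machines the balanced case reaches the full factor of three, while the skewed case loses almost half of it to stalls even though the total work is identical.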
  • Simics Accelerator vs Simics Central
    • Accelerator advantages:
      • Easier to set up, control, and coordinate the simulation
      • Potentially more efficient use of host machine resources
    Figure: complex system with Simics Accelerator compared with complex system with Simics Central
    • Accelerator uses multiple threads inside a single Simics instance.
    • Simics Central coordinates a set of separate Simics processes.
  • Are Cores for Free?