1. Parallel computation is needed to achieve high performance as modern processors have limitations despite features like caches, buses, and pipelines. Parallel computers use multiple CPUs working together to solve problems faster.
2. Flynn's classification categorizes computer architectures based on their instruction and data flows as single instruction stream single data stream (SISD), single instruction stream multiple data stream (SIMD), or multiple instruction stream multiple data stream (MIMD).
3. Important metrics for measuring parallel performance include speedup, which measures improvement over sequential execution, and efficiency, which relates speedup to number of processors used. According to Amdahl's law, even small amounts of sequential code limit maximum speedup attainable.
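As a concrete illustration of the Amdahl's law point above, here is a small calculation sketch (the 5% serial fraction and processor counts are example values, not taken from the summarized document):

```python
def amdahl_speedup(serial_fraction, processors):
    """Amdahl's law: S(p) = 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Even with only 5% sequential code, speedup is capped near 1/0.05 = 20x.
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))   # 3.48, 9.14, 15.42, 19.64
```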
The document summarizes a presentation on building automation systems (BAS) and their role in managing energy usage and demand in green, intelligent buildings. It discusses how BAS can integrate with the smart grid to support distributed energy and demand response. It also outlines the agenda, benefits of intelligent buildings, services that building systems can provide, and the vision of the Continental Automated Buildings Association (CABA) to advance integrated technology in buildings.
1) Intelligent building systems were introduced in the United States in the early 1980s as several major technology trends were underway including the emergence of personal computers and innovations in telecommunications.
2) A smart building involves the installation and use of advanced integrated building technology systems to provide actionable information about the building to allow owners and occupants to efficiently manage it.
3) Building automation systems are an example of distributed control systems that use sensors, controllers and networks to automatically control a building's mechanical, lighting and other systems to improve efficiency and occupant experience.
The document discusses parallel algorithms and their analysis. It introduces a simple parallel algorithm for adding n numbers using log n steps. Parallel algorithms are analyzed based on their time complexity, processor complexity, and work complexity. For adding n numbers in parallel, the time complexity is O(log n), processor complexity is O(n), and work complexity is O(n log n). The document also discusses models of parallel computation like PRAM and designs of parallel architectures like meshes and hypercubes.
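To make the log n-step parallel sum mentioned above concrete, here is a minimal simulation sketch (my own illustration, not code from the document). Each loop iteration models one parallel step in which disjoint pairs are added simultaneously, so 16 inputs finish in 4 steps; with one virtual processor per element, the processors × time product reflects the O(n log n) work bound:

```python
def parallel_sum(values):
    """Simulate pairwise-reduction addition: O(log n) steps with O(n) processors."""
    data = list(values)
    steps = 0
    while len(data) > 1:
        # One parallel step: element 2i is added to element 2i+1 for all pairs at once.
        data = [data[i] + data[i + 1] if i + 1 < len(data) else data[i]
                for i in range(0, len(data), 2)]
        steps += 1
    return data[0], steps

print(parallel_sum(range(1, 17)))   # (136, 4): sum of 1..16 in log2(16) = 4 steps
```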
This document provides an overview of a data structures revision tutorial. It discusses why data structures are needed, as computers take on more complex tasks and software implementation is difficult without an organized conceptual framework. The tutorial will cover common data structures, how to implement and analyze their efficiency, and how to use them to solve practical problems. It requires programming experience in C/C++ and some Java experience. Topics will include arrays, stacks, queues, trees, hashing, sorting, and graphs. The problem solving process involves defining the problem, designing algorithms, analyzing algorithms, implementing solutions, testing, and maintaining code.
Lightweight DNN Processor Design (based on NVDLA), by Shien-Chun Luo
https://sites.google.com/view/itri-icl-dla/
(Public Information Share) This is our lightweight DNN inference processor presentation, covering a system solution (from Caffe prototxt to hardware control files), hardware features, and an example of object detection (Tiny YOLO) RTL simulation results. We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this acceleration system.
This document discusses real-time operating systems (RTOS), focusing on principles and a case study of EMERALDS, an RTOS designed for small memory embedded systems. It describes the requirements and classifications of RTOSs, approaches to minimizing code size and execution overhead in EMERALDS, and how EMERALDS implements efficient real-time scheduling, semaphores, message passing and memory protection for constrained embedded devices. The document also discusses how EMERALDS was adapted to the OSEK standard for automotive systems.
This document outlines Jerry Meyerhoff's presentation on why EMC products fail and how to fix them. It discusses EMC fundamentals like sources, paths and receptors. It also presents several case studies to illustrate EMC troubleshooting techniques, including dividing problems, finding dominant mechanisms, optimizing solutions, and considering both technical and human factors. The case studies cover issues like emissions, immunity, conducted and radiated emissions and immunity. They demonstrate using measurements, modeling, best practices and an iterative process to diagnose and resolve EMC problems.
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.), by Heiko Joerg Schick
The document describes the QPACE supercomputer project which aims to build a supercomputer optimized for lattice QCD simulations using IBM PowerXCell 8i processors. Key aspects summarized are:
1) QPACE uses 256 node cards per rack, each with a PowerXCell 8i processor, to achieve 26 TFLOPS and 1 TB memory per rack.
2) Custom networks include a 3D torus for nearest neighbor communication and an interrupt tree for global operations.
3) The node card design features the PowerXCell processor, FPGA network processor, memory, and networking interfaces.
4) Early results found the hardware design worked well but network processor implementation and software deployment took longer than planned.
This document discusses instruction level power analysis (ILPA) for estimating processor power consumption. It describes how ILPA works by associating an energy cost with each instruction based on its operations and accounting for inter-instruction effects. Initially used for RISC processors, ILPA methods were modified for VLIW/EPIC processors by considering independent energy dissipation across execution slots and clustering similar instructions. ILPA does not provide insight into core power consumption causes but was expanded to microarchitecture-aware models accounting for individual pipeline stages. Register files and caches can also be modeled based on access patterns and state transitions between cycles.
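As a rough sketch of the ILPA idea described above (per-instruction base costs plus inter-instruction effects), the following toy estimator uses entirely made-up energy numbers and a single flat circuit-state overhead; a real model would be characterized per processor:

```python
# Hypothetical per-instruction base energies in nJ (illustrative only).
BASE_ENERGY = {"add": 1.0, "mul": 2.3, "load": 3.1, "store": 2.8}
INTER_INSTR_OVERHEAD = 0.4  # extra nJ when consecutive opcodes differ (circuit-state effect)

def ilpa_estimate(trace):
    """Sum base costs over an instruction trace and add inter-instruction overheads."""
    energy = sum(BASE_ENERGY[op] for op in trace)
    energy += sum(INTER_INSTR_OVERHEAD for a, b in zip(trace, trace[1:]) if a != b)
    return energy

print(ilpa_estimate(["load", "add", "add", "mul", "store"]))  # 11.4 nJ for this toy trace
```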
This document discusses internal memory organization and technologies. It begins with an overview of memory cell operation for DRAM and SRAM. It then covers various memory types including DRAM, SRAM, ROM, PROM, EPROM, and EEPROM. Advanced DRAM technologies like synchronous DRAM, Rambus DRAM, and double data rate SDRAM are also summarized. The document concludes with sections on error correction techniques and cache DRAM.
1. The document discusses research activities related to reducing energy consumption by at least 30% through the development of core source technologies for universal operating systems.
2. It describes four papers being presented, including ones on system and device latency modeling, power management frameworks for embedded systems, and automatic selection of power policies for operating systems.
3. It also summarizes four research topics from the National University, including performance evaluation of parallel applications using a power-aware paging method on next-generation memory architectures.
The document discusses LinkedIn's infrastructure and technical challenges in building a distributed system to support their professional social network. Key points include:
- LinkedIn needed a way to efficiently perform graph computations across billions of connections stored in a database
- Storing the data in memory provided better performance but they needed a way to keep the in-memory data synchronized with the database
- They explored various options for a "databus" system to replicate database changes to multiple graph engines in real-time without missing any events
- The solution they developed uses Oracle's row-level change tracking to capture all transaction changes and replay them asynchronously to keep the graph engines synchronized with the database.
This document discusses memory interfacing in microprocessors. It describes the different types of memory like RAM and ROM. RAM is further divided into SRAM and DRAM. It explains the basic design of a memory chip including memory registers, address decoding, and read/write operations. The document provides block diagrams to illustrate memory interfacing concepts.
This document discusses challenges in low-power design and verification. It addresses why low-power is now a priority given trends in mobile applications. Key challenges include increased leakage due to process scaling, accounting for active leakage, and handling process variations. The document also discusses low-power design methodologies, including multiple power domains, voltage scaling, and clock gating. Verification challenges are presented, such as needing good test patterns and coordination across design domains. Overall power analysis is more complex than timing analysis due to its pattern dependence and need to optimize for performance per watt.
Here are the key challenges faced in low power design without a common power format:
1. Domain definitions, level shifters, isolation cells, and other low power techniques are specified differently in each tool using tool-specific commands files and languages. This makes cross-tool consistency and validation difficult.
2. Power functionality cannot be easily verified at the RTL level without changing the RTL code, since power domains and low power techniques are not represented. This limits verification coverage.
3. Iteration between design creation and verification is difficult, since changes to the low power implementation require updates to multiple tool-specific specification files rather than a single cross-tool definition. This impacts design schedule and risks inconsistencies.
4.
Ben Prusinski is presenting on Oracle R12 E-Business Suite performance tuning. He will cover methodology, best practices, and techniques from basic to advanced. The presentation includes tuning at the infrastructure, application, and database levels with a focus on a holistic approach. Specific areas that will be discussed are concurrent manager tuning including queue size, sleep cycle, cache size, and number of processes.
This document describes SmartBalance, a sensing-driven Linux load balancer for heterogeneous multi-processor systems-on-chips (MPSoCs) that aims to improve energy efficiency. It does this through a predictive approach that balances tasks among cores based on ongoing performance and power measurements, with the goal of jointly addressing workload variability and hardware heterogeneity. The key aspects are on-chip sensing to monitor performance and power, online prediction of these metrics, and a simulated annealing-based allocation algorithm to optimize task distribution across cores in each scheduling epoch.
Extreme Availability using Oracle 12c Features: Your very last system shutdown? by Toronto-Oracle-Users-Group
This document discusses various Oracle 12c features that can be used to achieve high availability and keep systems available even during planned and unplanned outages. It compares options for handling planned changes like hardware, OS, database upgrades including RAC, RAC One Node, and Data Guard. It also discusses disaster recovery options like storage mirroring, RAC extended clusters, Data Guard, and GoldenGate replication. New features in Oracle 12c like Far Sync instances and cascading standbys are also covered. The document provides a guide to deciphering the necessary components for high availability.
1. Different tools use different descriptions for power management, making it difficult to verify configurations and keep definitions consistent across the design flow.
2. There is no automation for verifying power management definitions, requiring designers to manually verify thousands of statements.
3. The design hierarchy and syntax varies between tools and between RTL and gate representations, complicating cross-checking.
4. It is challenging to verify power functionality without changing RTL code since power and ground nets are not explicitly captured or simulated.
1. Power functionality cannot be easily verified without changing RTL code since power domains and constraints are not represented consistently across the design flow.
2. Tools from different vendors use different specifications for power management, making end-to-end verification and signoff difficult.
3. Constraints and intent for low power techniques like multi-VT, power gating, and DVFS cannot be validated without a common representation.
Crash course on data streaming (with examples using Apache Flink), by Vincenzo Gulisano
These are the slides I used for a crash course (4 hours) on data streaming. It contains both theory / research aspects as well as examples based on Apache Flink (DataStream API)
The venerable Servlet Container still has some performance tricks up its sleeve - this talk will demonstrate Apache Tomcat's stability under high load, describe some do's (and some don'ts!), explain how to performance test a Servlet-based application, troubleshoot and tune the container and your application and compare the performance characteristics of the different Tomcat connectors. The presenters will share their combined experience supporting real Tomcat applications for over 20 years and show how a few small changes can make a big, big difference.
20121017 OpenStack CERN Accelerating Science, by Tim Bell
CERN operates the Large Hadron Collider (LHC) and uses OpenStack to manage its large computing infrastructure. CERN aims to deploy OpenStack on 15,000 hypervisors hosting 100,000-300,000 VMs by 2015 to support fundamental physics research from the LHC experiments. OpenStack provides an open source alternative to CERN's traditional computing model and will help scale to meet growing data and computing needs.
This document provides information about CERN and its use of OpenStack. CERN is the European Organization for Nuclear Research located on the French-Swiss border. It operates the Large Hadron Collider and related experiments that generate large amounts of data. CERN uses OpenStack to manage its large computing infrastructure and virtualize servers to efficiently process data from the LHC experiments across a global grid. OpenStack allows CERN to scale its infrastructure to meet growing computing needs while reducing costs and improving operational efficiency.
CERN operates the Large Hadron Collider (LHC) and uses OpenStack to manage its large computing infrastructure. CERN aims to deploy OpenStack on 15,000 hypervisors hosting 100,000-300,000 VMs by 2015 to manage its growing computing needs for physics experiments. CERN contributes to and benefits from the OpenStack community through code, documentation, and helping solve problems.
Using Many-Core Processors to Improve the Performance of Space Computing Plat..., by Fisnik Kraja
This document summarizes a presentation on introducing many-core and commercial off-the-shelf (COTS) technologies into satellite computing architectures. It discusses constraints of onboard computers like power, size, heat, and high reliability requirements. It proposes a new architecture using many-core processors and COTS products to improve processing abilities while maintaining reliability through hardware and software fault tolerance techniques. It evaluates the architecture on synthetic aperture radar benchmarking and finds performance scales linearly with cores. Combining COTS with radiation-hardened components could provide over 100x speedup and improved reliability, flexibility, and portability.
The document discusses enabling energy efficient data centers through the E3S Center. The E3S Center aims to develop electronic systems that are self-sensing and regulating, and optimized for energy efficiency at various performance levels. It works with industry and academia. Over a 5-year period, its goals include determining data center inefficiencies at all levels, developing predictive models and control algorithms for synergistic management of computing and cooling, and improving airflow and thermal management through testing and validation. The Center is led by researchers from Binghamton University, University of Texas at Arlington, and Villanova University, and partners with many industry collaborators.
1. Energy Conservation and Thermal
Management in High-Performance
Server Architectures
Adam Lewis
The Center for Advanced Computer Studies
The University of Louisiana at Lafayette
Tuesday, March 15, 2011
2. Agenda
• Background and Related Work
• System Modeling
• Effective Prediction
• Initial Evaluation and Results
• Thermally-Aware Scheduling
• Status, Plans, and Summary
3. What does this picture tell us?
(c) The New York Times, June 14, 2006
• A 20% projected increase in data center emissions over the next 5 years (Source: McKinsey & Company 2008)
• Only ~50% of power consumed comes from IT equipment (Source: EPA 2008)
4. Current Practice
Completely Fair Scheduler (Linux):
• Domain-based load balancing
• Power-state aware
• Run-queue scheduling
Solaris 11:
• Domain-based load balancing
• Power-state aware
• Run-queue scheduling
Interface w/ power manager?
5. Thread Scheduling & Power Management
DVFS (P = CV^2·f, e.g. SpeedStep):
• Performance issues [LLBL 2007]: lack of slack; high load = no gain
• Reliability issues [Bircher 2008]: under-clocking & MTBF
• Reactive rather than proactive
Multi-core / Many-core:
• Cache affinity
• Load balancing
• Opportunity to turn off the lights?
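To make the P = CV^2·f relation above concrete, a tiny worked example with invented capacitance and operating points (not figures from the presentation):

```python
def dynamic_power(c, v, f):
    """Dynamic CMOS power: P = C * V^2 * f."""
    return c * v ** 2 * f

# Hypothetical DVFS step: scale 1.2 V / 2.4 GHz down to 1.0 V / 1.8 GHz.
p_high = dynamic_power(c=1.0e-9, v=1.2, f=2.4e9)   # ~3.46 W
p_low = dynamic_power(c=1.0e-9, v=1.0, f=1.8e9)    # ~1.80 W
print(f"{1 - p_low / p_high:.0%} dynamic-power reduction")   # ~48%
```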
6. Proactively Avoid Thermal Emergencies
A Full-System Energy Model + Effective Prediction → Thermally-Aware Scheduling
• Possible approaches:
  • Heat-and-Run and related approaches [Gomaa2004] [Coskun2009] [Zhou2010]
  • Memory-resource focused approaches [Merkel2010]
  • Control-theoretic techniques
8. Model: Inputs & Components
E_system = E_proc + E_mem + E_hdd + E_board + E_em
E_dc = E_system
• Components: processor; memory; hard disk & storage devices; motherboard & peripherals; electrical & electromechanical components
• Three DC voltage domains: 12 Vdc, 5.5 Vdc, 3.3 Vdc
• 5.5 V and 3.3 V domains limited to 20% of rated voltage
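A minimal sketch of how the E_system decomposition above could be carried in code; the component values in the example are placeholders, not measurements from the presentation:

```python
from dataclasses import dataclass

@dataclass
class SystemEnergy:
    """E_system = E_proc + E_mem + E_hdd + E_board + E_em (all in joules)."""
    e_proc: float    # processor
    e_mem: float     # memory
    e_hdd: float     # hard disk & storage devices
    e_board: float   # motherboard & peripherals
    e_em: float      # electrical & electromechanical components (e.g. fans)

    def total(self) -> float:
        return self.e_proc + self.e_mem + self.e_hdd + self.e_board + self.e_em

# Made-up numbers for one measurement interval on a hypothetical server.
print(SystemEnergy(850.0, 120.0, 70.0, 60.0, 90.0).total())   # 1190.0 J
```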
9. Model: Processor
E_proc = \int_{t_1}^{t_2} P_proc(t) dt
[Figure: board-level power consumers for AMD Opteron and Intel Nehalem systems - cores, caches, system request interface / crossbar, integrated memory controllers, HyperTransport and other interconnect links, and SouthBridge-attached I/O (USB, VGA, HDD, Ethernet, DVD, graphics, PCI Express)]
• Bus transactions: reflect the amount of data processed
• Die temperature
• Computation per core
• Processor system metrics
• Processor treated as a black box: power = f(workload)
• Power manifests as heat
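A small numerical sketch of the E_proc integral above, assuming processor power samples are available at a fixed interval (the sample values are invented):

```python
def integrate_power(samples, dt):
    """Approximate E = \int P(t) dt with the trapezoidal rule over evenly spaced samples."""
    return sum((samples[i] + samples[i + 1]) / 2.0 * dt for i in range(len(samples) - 1))

# Hypothetical 1 Hz processor power samples (watts) over five seconds.
p_proc = [92.0, 95.5, 101.2, 98.7, 94.3, 93.0]
print(integrate_power(p_proc, dt=1.0))   # ≈ 482.2 J over the interval
```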
10. Model: Memory
E_mem = \int_{t_1}^{t_2} [ ( \sum_{i=1}^{N} CM_i(t) + DB(t) ) × P_DR + P_ab ] dt
• DRAM read/write power and background power are known quantities
• Performance counters exist for measuring the count of highest-level cache misses and bus transactions
• Combine these to compute the energy consumed
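A discrete-time sketch of the memory model above, driven by per-interval performance-counter readings; the counter values and the power terms P_DR and P_ab are placeholders with schematic units, not calibrated constants from the presentation:

```python
def memory_energy(cache_misses, bus_txns, p_dr, p_ab, dt):
    """Per interval: E += ((last-level cache misses + bus transactions) * P_DR + P_ab) * dt."""
    return sum(((cm + db) * p_dr + p_ab) * dt for cm, db in zip(cache_misses, bus_txns))

# Invented counter readings for three 1-second intervals.
cm = [1.2e6, 0.8e6, 1.5e6]   # last-level cache misses
db = [2.0e6, 1.6e6, 2.4e6]   # bus transactions
print(memory_energy(cm, db, p_dr=5e-7, p_ab=1.5, dt=1.0))   # joules over 3 s
```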
11. Model: Storage
E_hdd = P_spin-up × T_su + P_read × N_r × T_r + P_write × N_w × T_w + P_idle × T_id

Parameter                   | Value
Interface                   | Serial ATA
Capacity                    | 250 GB
Rotational speed            | 7200 rpm
Power (spin up)             | 5.25 W (max)
Power (random read, write)  | 9.4 W (typical)
Power (silent read, write)  | 7 W (typical)
Power (idle)                | 5 W (typical)
Power (low-RPM idle)        | 2.3 W (typical for 4500 RPM)
Power (standby)             | 0.8 W (typical)
Power (sleep)               | 0.6 W (typical)
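A small worked example of the E_hdd formula using the power figures from the table above; the read/write counts and durations are invented for illustration:

```python
# Power figures from the disk table above; activity figures are hypothetical.
P_SPINUP, T_SU = 5.25, 10.0           # one spin-up lasting 10 s
P_READ, N_R, T_R = 9.4, 2000, 0.005   # 2000 random reads, 5 ms each
P_WRITE, N_W, T_W = 9.4, 500, 0.005   # 500 random writes, 5 ms each
P_IDLE, T_ID = 5.0, 300.0             # 5 minutes idle

e_hdd = P_SPINUP * T_SU + P_READ * N_R * T_R + P_WRITE * N_W * T_W + P_IDLE * T_ID
print(f"E_hdd = {e_hdd:.1f} J")       # 1670.0 J for this activity mix
```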
12. Model: Board
E_board = V_power-line × I_power-line × t_interval
• System components that support the operation of the machine
• Typically in the 5.5 Vdc and 3.3 Vdc power domains
• Measured by current probe
13. Model: Electromechanical
E_em = \int_0^{T_p} [ V(t) · I(t) + \sum_{i=1}^{N} P_fan,i(t) ] dt
• Need to account for energy required to cool
• No performance counters
• Can measure power drawn by the fans
• Derived from log data collected by the OS
15. Linear AR Time Series - A good idea?
• Linear regression: easy, simple
• Odd mis-predictions
• Corrective methods required

Table III. Model errors for CAP, AR(1), and MARS on an AMD Opteron server (linear AR columns shown)
Benchmark | Avg Err % | Max Err % | RMSE
astar     | 3.1%      | 8.9%      | 2.26
games     | 2.2%      | 9.3%      | 2.06
gobmk     | 1.7%      | 9.0%      | 2.30
zeusmp    | 2.8%      | 8.1%      | 2.14

Table IV. Model errors for AR on an Intel Nehalem server
Benchmark | Avg Err % | Max Err % | RMSE
astar     | 5.9%      | 28.5%     | 4.94
games     | 5.6%      | 44.3%     | 5.54
gobmk     | 5.3%      | 27.8%     | 4.83
zeusmp    | 7.7%      | 31.8%     | 7.24
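For context on the AR results above, a minimal sketch of a linear AR(1) predictor fit by least squares; the power trace here is synthetic, and this is a generic illustration rather than the authors' implementation:

```python
import numpy as np

def fit_ar1(series):
    """Fit x[t] = a + b * x[t-1] by ordinary least squares."""
    b, a = np.polyfit(series[:-1], series[1:], 1)
    return a, b

rng = np.random.default_rng(0)
power = 90 + np.cumsum(rng.normal(0, 0.5, size=200))   # synthetic power trace (W)
a, b = fit_ar1(power)
next_pred = a + b * power[-1]                           # one-step-ahead prediction
print(round(next_pred, 2))
```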
16. Prediction w/ Chaotic Time Series

Table 4. Indications of chaotic behavior in power time series (AMD, Intel)
Benchmark  | Hurst Parameter (H) | Average Lyapunov Exponent
bzip2      | (0.96, 0.93)        | (0.28, 0.35)
cactusADM  | (0.95, 0.97)        | (0.01, 0.04)
gromacs    | (0.94, 0.95)        | (0.02, 0.03)
leslie3d   | (0.93, 0.94)        | (0.05, 0.11)
omnetpp    | (0.96, 0.97)        | (0.05, 0.06)
perlbench  | (0.98, 0.95)        | (0.06, 0.04)

Chaotic Time Series
• Time-delay reconstructed state space
• Uses Takens' Embedding Theorem: a time-delayed partition of observations builds a function that preserves the topological and dynamical properties of the original chaotic system
• Find nearest neighbors on the attractor to our observations
• Perform a least-squares curve fit to find a polynomial that approximates the attractor

The Lyapunov exponent can be calculated using
λ = \lim_{N→∞} (1/N) \sum_{n=0}^{N-1} ln |f'(X_n)|.
We found a positive Lyapunov exponent when performing this calculation on our data set, ranging from 0.01 to 0.28 (or 0.03 to 0.35) on the AMD (or Intel) test server, as listed in Table 4, where each pair indicates (AMD, Intel) values.
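To illustrate the λ formula above, here is a toy estimate of the largest Lyapunov exponent for the logistic map, a standard chaotic system (a didactic example, not the authors' power-trace analysis):

```python
import math

def lyapunov_logistic(r=4.0, x0=0.2, n=100_000, burn_in=1_000):
    """Average ln|f'(x_n)| along an orbit of f(x) = r*x*(1-x)."""
    x = x0
    for _ in range(burn_in):                     # discard the transient
        x = r * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += math.log(abs(r * (1 - 2 * x)))    # |f'(x)| = |r * (1 - 2x)|
        x = r * x * (1 - x)
    return acc / n

print(lyapunov_logistic())   # ≈ ln 2 ≈ 0.693 > 0, indicating chaotic behavior
```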
18. Forward prediction
• Start with a Taylor series expansion:
  \hat{f}(X) = f(x) + f'(x)^T (X − x)
• Find the coefficients of the polynomial by solving the linear least-squares problem for a and b, minimizing
  \sum_{t=p+1}^{n+p} [ X_t − a − b^T (X_{t−1} − x) ]^2 × K_β(X_{t−1} − x)
• Explicit solution for our linear least-squares problem:
  \hat{f}(x) = (1/n) \sum_{t=p+1}^{n+p} (s_2 − s_1 (x − X_{t−1}))^2 × K_β((x − X_{t−1}) / β)
  where s_i = (1/n) \sum_{t=p+1}^{n+p} (x − X_{t−1})^i × K_β((x − X_{t−1}) / β)
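A rough sketch of a kernel-weighted local linear predictor in the spirit of the least-squares formulation above; this is a generic construction (Gaussian kernel, one-dimensional lag), not the authors' CAP code, and the trace is synthetic:

```python
import numpy as np

def local_linear_predict(history, query, beta=1.0):
    """Predict the next value from a kernel-weighted linear fit of x[t] on x[t-1],
    centered at `query` (typically the most recent observation)."""
    x_prev = np.asarray(history[:-1], dtype=float)
    x_next = np.asarray(history[1:], dtype=float)
    d = x_prev - query
    w = np.exp(-0.5 * (d / beta) ** 2)            # kernel weights K_beta(x_prev - query)
    A = np.column_stack([np.ones_like(d), d])     # model: a + b * (x_prev - query)
    sw = np.sqrt(w)
    (a, b), *_ = np.linalg.lstsq(A * sw[:, None], x_next * sw, rcond=None)
    return a                                      # fitted value at the query point

trace = 90 + 10 * np.sin(np.linspace(0, 20, 200))   # synthetic power trace (W)
print(round(local_linear_predict(trace, query=trace[-1]), 2))
```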
19. Time Complexity
n future observations, p past observations
• Creating a CAP: O(n^2)
• Predicting with a CAP: O(p)
21. Initial Evaluation and Results

Table 7. Test hardware configuration
               | Sun Fire 2200      | Dell PowerEdge R610
CPU            | 2x AMD Opteron     | 2x Intel Xeon (Nehalem) 5500
CPU L2 cache   | 2x2 MB             | 4 MB
Memory         | 8 GB               | 9 GB
Internal disk  | 2x60 GB            | 500 GB
Network        | 2x1000 Mbps        | 1x1000 Mbps
Video          | On-board           | NVIDIA Quadro FX4600
Height         | 1 rack unit        | 1 rack unit

Tables 5 and 6. SPEC CPU2006 benchmarks used for model calibration (training) and for evaluation
• Integer: bzip2 (C, Compression), mcf (C, Combinatorial Optimization), and omnetpp (C++, Discrete Event Simulation) for training; astar (C++, Path Finding) and gobmk (C, Artificial Intelligence: Go) for evaluation
• Floating point: calculix (C++/F90, Structural Mechanics), gromacs (C/F90, Biochemistry/Molecular Dynamics), zeusmp (F90, Computational Fluid Dynamics), cactusADM (C/F90, Physics/General Relativity), leslie3d (F90, Fluid Dynamics), lbm (C, Fluid Dynamics)

Benchmarks were selected using two criteria: sufficient coverage of the functional units in the processor and reasonable applicability to the problem space. Components of the processor affect the thermal envelope in different ways [40]; this issue is addressed by balancing the selection between integer and floating-point benchmarks in the SPEC CPU2006 suite.

A set of experiments was carried out to evaluate the performance of power models built using CAP techniques to approximate a solution for dynamic systems following Eq. (12). The purpose of the first experiment was to confirm the time complexity of CAP. The behavior of CAP was simulated using MATLAB on the hardware described in Table 7 with varying values of n future observations and p past observations. Fig. 6 illustrates the behavior of CAP as the value of n is varied and confirms the O(n^2) behavior of the predictor in this case; the behavior of CAP as p is varied is shown in Fig. 7 and supports the claim of linear behavior.
22. Results: AMD Opteron f10h
[Fig. 8: Actual power results versus predicted results for AMD Opteron - (a) astar/CAP, (b) astar/AR(1), (c) zeusmp/CAP, (d) zeusmp/AR(1)]
23. Results: Intel Nehalem
[Fig. 9: Actual power results versus predicted results for an Intel Nehalem server - (a) astar/CAP, (b) astar/AR(1), (c) zeusmp/CAP, (d) zeusmp/AR(1)]
25. Observations and Analysis
• Where does maximum error occur?
• Choice of performance counters
• Difference in behavior between
processors?
• The right set of performance counters
• Benchmark selection
27. Problem nature
• Scheduling...
• in time: who runs next
• in space: who runs where
• Optimization problem
• Who runs next: least use of energy with best performance / quality of service
• Who runs where: best utilization of
resources with least increase in
processor and/or ambient temperature
28. Thermal Extensions to System Model
1. Applications have a length L(A, D_A, t) and generate workload
   U(A, D_A, t) = \lim_{n→k_e} n × W(p_i, d_i, t) × L_n(A_n, D_A_n, t), 1 ≤ i ≤ p
2. For which we define the Thermal Equivalent of Application
   Θ_A(A, D_A, T, t) = U(A, D_A, t) × \lim_{T→T_th} J_e × (T − T_nominal)
3. Which is used to generate the Thermal Efficiency to Completion
   η(A, D_A, T, t) = Θ_A(A, D_A, T, t) / Θ_A(A_e, D_A_e, T_me, L_e)
4. That is used to compute the Cost of Performance per Unit Power
   C_θ(A, D_A, T, t) = Θ_A(A, D_A, T, t) / E_sys(A, D_A, t)
29. Extending CAP for Thermal Prediction
• Thermal Chaotic Attractor Predictor
(TCAP)
• Extends CAP to thermal domain
• Created and used in similar manner to
CAP
• Matching TCAP for each thermal metric
30. Reducing Processor Temperatures
• Premise: Processor die temperature can be
managed by controlling what threads
execute over time
• Predict the next thread to run on a logical CPU using TCAP for processor die temperature
31. Reducing Ambient Temperature
• Premise: Control system ambient
temperature by managing load on logical
CPUs so that overheated resources have
time to recover
• Partition resources into categories based
on predicted change in temperature
• Move workload from “HOT” resources
towards “COLD” resources
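A toy sketch of the HOT/COLD partitioning policy described above; the threshold, the predicted temperature deltas, and the one-thread-per-epoch migration rule are all invented for illustration (the actual scheduler prototype is not shown in the slides):

```python
def partition_by_predicted_delta(predicted_delta, hot_threshold=2.0):
    """Split cores into HOT and COLD based on predicted temperature change (deg C)."""
    hot = [c for c, dT in predicted_delta.items() if dT >= hot_threshold]
    cold = [c for c, dT in predicted_delta.items() if dT < hot_threshold]
    return hot, cold

def rebalance(run_queues, hot, cold):
    """Move one runnable thread from each HOT core to the least-loaded COLD core."""
    for core in hot:
        if run_queues[core] and cold:
            target = min(cold, key=lambda c: len(run_queues[c]))
            run_queues[target].append(run_queues[core].pop())
    return run_queues

# Predicted per-core temperature deltas (e.g. from a TCAP-style predictor) and run queues.
predicted_delta = {"cpu0": 3.1, "cpu1": 0.4, "cpu2": 2.6, "cpu3": -0.2}
run_queues = {"cpu0": ["t1", "t2", "t3"], "cpu1": ["t4"], "cpu2": ["t5", "t6"], "cpu3": []}
hot, cold = partition_by_predicted_delta(predicted_delta)
print(rebalance(run_queues, hot, cold))
```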
33. Current Status
A Full-System Energy Model + Effective Prediction → Thermally-Aware Scheduling
Energy model and prediction:
• Development complete
• Evaluation complete
• Intel + AMD processors
• Peer-reviewed: Conference/Workshop: 3; Journal: 1
Thermally-aware scheduling:
• Design complete
• Prototype under development
• OpenSolaris (Solaris 11)
34. Plan for Completion
ID Task
1 Respond to review comments for [Lewis 2011]
2 Implement scheduler prototype in FreeBSD
3 Evaluate scheduler performance using parallel benchmarks
4 Document results and submit to archival journal
5 Create dissertation from Prospectus + output from previous task
6 Defend dissertation
7 Respond to comments from committee and Graduate School editor
8 Submit final version of document
35. Future Directions
• Extend beyond a single blade
• Cluster, Grid, and Cloud Scheduling
• MPI, OpenMP, and other environments
• Impact of operating system virtualization
• Extension of the thermal model in terms of
the thermodynamics of computation
38. This work was supported in part by the U.S.
Department of Energy and by the Louisiana
Board of Regents
39. Publications List
Lewis, A., Ghosh, S., and Tzeng, N.-F. 2008. Run-time energy consumption estimation based on workload in server systems. Proceedings of the 2008 Conference on Power Aware Computing and Systems.
Lewis, A., Simon, J., and Tzeng, N.-F. 2010. Chaotic attractor prediction for server run-time energy consumption. Proc. of the 2010 Workshop on Power Aware Computing and Systems (HotPower'10).
Lewis, A., Tzeng, N.-F., and Ghosh, S. 2011. Time series approximation of run-time energy consumption based on server workload. Under review for publication in ACM Transactions on Architecture and Code Optimization.