Thermal Control Overview
 


This presentation gives an overview of the early results on thermal control obtained by the Multitherman ERC project


    Thermal Control Overview Presentation Transcript

    • Thermal Control Activities, Andrea Bartolini (a.bartolini@unibo.it)
    • Outline
      • Thermal Management
      • Thermal Model Learning
      • Model Predictive Controller
      • Computational Sprinting
      • Reliability Borrowing
    • Thermal Management
      • Pressures shown on the slide: technology scaling, software, system integration, high performance requirements, costs
      • Spatial and temporal workload variation, limited dissipation capabilities, high power densities
      • NON UNIFORM: power, temperature, performance
      • Effects: leakage current; reliability loss and aging; hot spots, thermal gradients and cycles
      • Dynamic approach: on-line tuning of system performance and temperature through closed-loop control
    • DRM - General Architecture
      • System
      • Sensors
        – Performance counters (PMU)
        – Core temperature
      • Actuators (knobs)
        – ACPI states
        – P-State -> DVFS
        – C-State -> PGATING (power gating)
        – Task allocation
      • Controller
        – Reactive
        – Threshold/heuristic
        – Control theory
        – Predictive
      • [Diagram: software layer (App.1 ... App.N, threads 1 ... N, O.S. with the CONTROLLER) on top of the hardware layer (CPU1 ... CPUN, each with its own L1 and f,v setting; shared L2; network; DRAM). Simulation snapshot traces: TCPU, #L1MISS, #BUSACCESS, CYCLEACTIVE, ....]
    • Thermal Controller
      • Threshold-based controller
        – T > Tmax -> low frequency
        – T < Tmin -> high frequency
        – Cannot prevent overshoot
        – Thermal cycles
      • Classical feedback controller
        – PID controllers
        – Better than the threshold-based approach
        – Cannot prevent overshoot
      • Model Predictive Controller (MPC)
        – Internal prediction: avoids overshoot
        – Optimization: maximizes performance
        – Centralized: aware of the thermal influence of neighbor cores
        – All at once: MIMO controller
        – Complexity!
      • Reference shown on the slide: [Intel®, ISSCC 2007]
      • [Diagram: MPC loop with target frequency, past input & output, thermal model, future output, future error, optimizer (cost function, constraint), future input]
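The threshold-based policy above (T > Tmax -> low frequency, T < Tmin -> high frequency) can be tried on a toy model to see why it overshoots and cycles. The sketch below is only an illustration: the one-node RC thermal model, the cubic power model and every constant are assumptions, not values from the presentation.

```python
def simulate_threshold_controller(steps=2000, dt=0.01):
    """Run a threshold (hysteresis) DVFS policy on a one-node RC thermal model.

    Returns the peak temperature and the number of frequency switches.
    All parameters below are hypothetical and chosen only for illustration.
    """
    t_amb, r_th, c_th = 40.0, 1.0, 0.5   # ambient [C], resistance [C/W], capacitance [J/C]
    t_max, t_min = 80.0, 75.0            # thresholds [C]
    f_low, f_high = 1.0, 3.0             # frequency levels [GHz]

    def power(freq):
        # Assumed power model P = g(f): static part plus a cubic dynamic part.
        return 5.0 + 2.0 * freq ** 3

    temp, freq = t_amb, f_high
    peak, switches = temp, 0
    for _ in range(steps):
        # The controller reacts only after a threshold has been crossed.
        if temp > t_max and freq != f_low:
            freq, switches = f_low, switches + 1
        elif temp < t_min and freq != f_high:
            freq, switches = f_high, switches + 1
        # Forward-Euler step of C * dT/dt = P - (T - T_amb) / R.
        temp += dt / c_th * (power(freq) - (temp - t_amb) / r_th)
        peak = max(peak, temp)
    return peak, switches
```

In this toy setup the temperature always crosses Tmax before the controller reacts, and the frequency keeps toggling between the two levels, which matches the "cannot prevent overshoot" and "thermal cycles" remarks on the slide.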
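    • Background – Thermal Modeling
      • [Diagram: each task j goes through a power model, P = g(task, f), giving the power Pj, which feeds a thermal model giving the temperature Tj. The same chain is repeated for the other tasks, with signals labeled Pn,j and Tn,j.]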
    • MPC Robustness: System Identification
      • MPC needs a thermal model
        – Accurate, with low complexity
        – Must be known a priori
        – Depends on user configuration
        – Changes with system ageing
      • “In field” self-calibration
        – Force test workloads
        – Measure core temperatures
        – System identification
      • [Diagram: the MPC loop (target frequency, past input & output, thermal model, future output, future error, optimizer with cost function and constraint, future input) uses an identified state-space thermal model. During multicore workload execution, training tasks are run on Core1 ... Corei ... CoreN, and per-core power and temperature are recorded over time.]
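The self-calibration steps above (force test workloads, measure temperatures, identify the model) can be sketched for a single core with ordinary least squares. The first-order model x[k+1] = a*x[k] + b*p[k], with x the temperature rise over ambient, and all function names and numbers are assumptions made for this example; the presentation itself refers to an identified state-space model without giving its form.

```python
def simulate_training(a, b, t_amb, powers):
    """Generate a temperature trace from a known first-order model (stand-in for measurements)."""
    temps = [t_amb]
    for p in powers:
        temps.append(t_amb + a * (temps[-1] - t_amb) + b * p)
    return temps


def identify_first_order(temps, powers, t_amb):
    """Least-squares fit of x[k+1] = a*x[k] + b*p[k], where x = T - t_amb."""
    x = [t - t_amb for t in temps]
    # Accumulate the 2x2 normal equations (Phi^T Phi) theta = Phi^T y.
    sxx = sxp = spp = sxy = spy = 0.0
    for k in range(len(x) - 1):
        sxx += x[k] * x[k]
        sxp += x[k] * powers[k]
        spp += powers[k] * powers[k]
        sxy += x[k] * x[k + 1]
        spy += powers[k] * x[k + 1]
    # Solve the 2x2 system with Cramer's rule.
    det = sxx * spp - sxp * sxp
    a = (sxy * spp - spy * sxp) / det
    b = (sxx * spy - sxp * sxy) / det
    return a, b


# Forced test workload: a square wave between a high-power and a low-power task.
test_powers = [30.0 if (k // 20) % 2 == 0 else 5.0 for k in range(200)]
trace = simulate_training(0.95, 0.05, 40.0, test_powers)
a_hat, b_hat = identify_first_order(trace, test_powers, 40.0)
```

With noise-free data the fit recovers the parameters used to generate the trace; real sensor data adds quantization and measurement noise, which is what the following slides address.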
    • Centralized Thermal Modelling
      • i7 server platform, 4 cores
      • Ts = 1 ms, quantization noise
      • [Figure: step response of a black-box model compared with our approach, LS + physical constraints]
    • Distributed Thermal Modelling
      • Single-chip Cloud Computer (SCC), 48 cores
      • Ts = 100 ms, measurement noise
      • Standard ARX
        – Designed only for process noise!
      • Bias-compensated ARX
        – Iteratively estimates the noise variance and compensates for it in the LS
      • [Figure: residual correlation at lags 0..8]
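To illustrate why measurement noise matters for ARX identification, the sketch below fits a first-order ARX model twice: with plain least squares, and with the noise variance subtracted from the output autocorrelation term. It is a simplification, not the method from the slides: the noise variance is assumed to be known here, whereas the slide describes estimating it iteratively. The model, the input signal and all constants are hypothetical.

```python
import random


def arx_least_squares(y, u, noise_var=0.0):
    """Fit y[k] = a*y[k-1] + b*u[k-1] from sample moments.

    If noise_var > 0, it is subtracted from the y-y moment, which removes the
    bias caused by white measurement noise of that variance on y.
    """
    n = len(y) - 1
    ryy = sum(y[k - 1] * y[k - 1] for k in range(1, len(y))) / n
    ryu = sum(y[k - 1] * u[k - 1] for k in range(1, len(y))) / n
    ruu = sum(u[k - 1] * u[k - 1] for k in range(1, len(y))) / n
    r_y = sum(y[k - 1] * y[k] for k in range(1, len(y))) / n
    r_u = sum(u[k - 1] * y[k] for k in range(1, len(y))) / n
    ryy -= noise_var  # bias compensation (no-op when noise_var == 0)
    det = ryy * ruu - ryu * ryu
    a = (r_y * ruu - r_u * ryu) / det
    b = (ryy * r_u - ryu * r_y) / det
    return a, b


rng = random.Random(0)
a_true, b_true, noise_var = 0.9, 0.5, 1.0        # hypothetical values
n_samples = 50000
u = [rng.choice((-1.0, 1.0)) for _ in range(n_samples)]
y_clean = [0.0]
for k in range(1, n_samples):
    y_clean.append(a_true * y_clean[k - 1] + b_true * u[k - 1])
# Only the measured output is corrupted; the underlying dynamics are noise-free.
y_meas = [v + rng.gauss(0.0, noise_var ** 0.5) for v in y_clean]

a_ls, b_ls = arx_least_squares(y_meas, u)             # biased towards zero
a_bc, b_bc = arx_least_squares(y_meas, u, noise_var)  # bias compensated
```

In this setup the plain estimate of `a` is pulled well below its true value, because the noise inflates the regressor variance, while the compensated estimate stays close to it.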
    • Distributed Thermal Controller
      • One MPC controller per core (core 1 shown)
      • Nonlinear part (frequency to power): g(·) and its inverse g-1(·)
      • Linear part (power to temperature): linear model, 2 states per core
      • QP optimizer, implicit formulation, subject to (s.t.) the model
      • Classic Luenberger state observer
      • [Diagram signals: CPI, f1,TC, P1,TC, P1,EC, f1,EC, TENV, T1, x1]
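The structure above (a nonlinear frequency-to-power map followed by a linear power-to-temperature model, used to predict ahead) can be sketched for one core. This is a deliberately simplified stand-in: it replaces the QP optimizer with an exhaustive search over a few frequency levels, uses one state instead of two, and omits the observer. The model coefficients, the power map and the function names are assumptions for illustration only.

```python
def power(freq):
    """Assumed nonlinear frequency-to-power map g(f), in watts."""
    return 5.0 + 2.0 * freq ** 3


def step(temp, freq, a=0.98, b=0.02, t_amb=40.0):
    """Assumed linear power-to-temperature model, one sample ahead."""
    return t_amb + a * (temp - t_amb) + b * power(freq)


def predictive_frequency(temp, f_target, levels, t_max, horizon=10):
    """Highest level <= f_target whose predicted temperature stays <= t_max."""
    for freq in sorted(levels, reverse=True):
        if freq > f_target:
            continue
        predicted, safe = temp, True
        for _ in range(horizon):
            predicted = step(predicted, freq)
            if predicted > t_max:
                safe = False
                break
        if safe:
            return freq
    return min(levels)  # fall back to the lowest level if nothing is safe


def run_closed_loop(steps=500):
    """Apply the controller in a receding-horizon fashion; return peak temperature and last frequency."""
    levels = (1.0, 1.5, 2.0, 2.5, 3.0)
    temp, peak, freq = 40.0, 40.0, None
    for _ in range(steps):
        freq = predictive_frequency(temp, 3.0, levels, t_max=80.0)
        temp = step(temp, freq)
        peak = max(peak, temp)
    return peak, freq
```

Because the controller checks the predicted trajectory before committing to a frequency, the temperature in this toy loop never exceeds the limit, in contrast with a purely reactive threshold policy.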
    • Outline
      • Thermal Management
      • Thermal Model Learning
      • Model Predictive Controller
      • Computational Sprinting
      • Reliability Borrowing
    • Computational Sprinting
      • TDP is statically defined on worst-case power
        – Considers only the thermal resistance -> the package is optimized to minimize it
        – Does not allow powering on all the cores
      • The application requires maximum performance in short bursts
        – Thermal capacitance = “heat buffer”
        – Use the heat buffer to run all cores at maximum performance for a short time window
        – Triggered on demand by peak parallel workload phases and user interaction
        – Extend the sprint duration with Phase Change Material (PCM)
          • Heat tanks -> need restore phases!
          • Sprint phases must be allocated to the most QoS-critical tasks!
      • [Figure: thermal stack with TAMB, PCM and TCORE; TCHIP trace against the safe limit; PCHIP trace against the TDP; core grids showing the active cores in sequential and parallel phases]
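The "heat buffer" idea can be made concrete with a lumped RC model: above the TDP the steady-state temperature would exceed the limit, but the thermal capacitance delays the rise. The sketch below computes how long a sprint can last under that assumption. The single-node model, the function name and the example numbers are hypothetical; PCM behavior (latent heat) is not modeled.

```python
import math


def sprint_duration(p_sprint, t_start, t_max, t_amb, r_th, c_th):
    """Seconds a lumped RC node can dissipate p_sprint before reaching t_max.

    Model: C * dT/dt = P - (T - T_amb) / R, so T(t) relaxes exponentially
    towards T_amb + P * R with time constant R * C.
    """
    t_steady = t_amb + p_sprint * r_th
    if t_steady <= t_max:
        return math.inf          # sustainable: the power is within the TDP
    if t_start >= t_max:
        return 0.0               # no thermal headroom left
    return r_th * c_th * math.log((t_steady - t_start) / (t_steady - t_max))


# Hypothetical package: R = 2 C/W and a 40 C rise budget give TDP = 20 W.
duration = sprint_duration(p_sprint=60.0, t_start=50.0, t_max=80.0,
                           t_amb=40.0, r_th=2.0, c_th=5.0)
```

In this model the sprint duration scales linearly with the thermal capacitance, and a sprint started from a hotter chip is shorter, which is why restore phases are needed before sprinting again.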
    • Re-Sprinting Controller
      • Two-level hierarchical controller
      • Upper level: PCM Model Predictive Controller (signals shown: PTARGET,0 ... PTARGET,i ... PTARGET,16; TPCM; Ub(●); TAMB; TCORE,0 ... TCORE,i ... TCORE,N; P*CORE,i ... P*CORE,16)
      • Lower level: one Thermal MPC per core (signals shown: P*CORE,i; TCORE,0 ... TCORE,N; TNEIGH,0 ... TNEIGH,N; PCORE,0 ... PCORE,i ... PCORE,16)
    • Guaranteed re-sprinting
      • One RC cell per core, $P_{TOT} = \sum_i P_{CORE,i}$
      • Time-varying internal energy bound UB
      • M ~ 1-10 s => silicon at steady state
      • N: sprint duration; M: re-sprint rate
      • [Figure: guaranteed tasks on a timeline with marks ti, ti+M, ti+M+N; internal energy U(t) against the bound Ub(t), with levels UN and UMAX; total power PTOT with levels PREST and PMAX; TAMB]
    • Outline
      • Thermal Management
      • Thermal Model Learning
      • Model Predictive Controller
      • Computational Sprinting
      • Reliability Borrowing
    • Dynamic Reliability Management
      • Idea: tune VDD and temperature at run time to reach a target lifetime while minimizing the workload degradation
      • State of the art
        – [Flowchart labels: expected workload (… to lifetime); performance goal; expected reliability @ target lifetime > reliability target?; No; Yes; give performance; cut performance -> VDD, T]
      • Issues
        1) Workload changes in seconds, reliability in years
        2) Different workloads ⇒ same performance cut (VDD, T) ⇒ different final user QoS
    • Our Solution – Key innovations
      1. Two-level controller @ two different time scales
        – Long intervals: predefined intervals of time for reliability control
        – Short intervals: task scheduling periods
      2. Reliability speculation / borrowing
        – Flag each task as:
          • Hard: high QoS (real-time, latency-constrained task)
          • Soft: low QoS (background process)
        – Reliability target updated each long interval => average bound within the long interval
        – Speculation:
          • Hard task: always runs at the maximum performance
          • Soft task: performance constrained by the reliability target
        – The soft task pays the reliability loss induced by the hard task
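The borrowing rule above can be sketched as budget accounting over one long interval: hard tasks always run at full performance, and soft tasks are slowed down so that the total reliability loss stays within the interval's budget. Everything specific in the sketch is an assumption made for illustration: the quadratic `damage` model, the normalized performance scale, the `perf_min` floor and the function name are not from the presentation.

```python
def schedule_long_interval(task_kinds, budget, perf_min=0.2):
    """Assign a performance level in [perf_min, 1] to each scheduling period.

    task_kinds: sequence of "hard" / "soft", one entry per short interval.
    budget: reliability loss allowed over the whole long interval
            (arbitrary units, hypothetical).
    Returns the per-period performance plan and the unused budget.
    """
    def damage(perf):
        # Hypothetical wear model: loss grows with the square of performance.
        return perf ** 2

    remaining = budget
    plan = []
    for i, kind in enumerate(task_kinds):
        periods_left = len(task_kinds) - i
        if kind == "hard":
            perf = 1.0                     # hard tasks are never slowed down
        else:
            # Spread what is left of the budget evenly over the remaining periods.
            allowance = max(remaining, 0.0) / periods_left
            perf = min(1.0, max(perf_min, allowance ** 0.5))
        remaining -= damage(perf)
        plan.append(perf)
    return plan, remaining


plan, leftover = schedule_long_interval(["soft", "hard", "soft", "soft"], budget=2.0)
```

In this example the soft tasks scheduled after the hard task run slower than the first soft task, since they absorb the extra loss the hard task caused, while the loss over the whole interval still matches the budget.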
    • Single Core Controller Architecture
      • The complete architecture is composed of two controllers:
        – Long Term Controller (LTC): monitors the core reliability and computes a voltage value that acts as a soft constraint for the Short Term Controller. It can be entirely model based or sensor aided.
        – Short Term Controller (STC): based on the soft constraint coming from the LTC, it assigns the operating voltage and frequency according to the workload requirements.
      • [Diagram label: Reliability Sensor]
    • Task Mapping?
      • Problem: how can task mapping take advantage of these techniques?
      • Model Predictive Controller:
        – Schedule the tasks to minimize the thermal controller activations
      • Sprinting:
        – How should the tasks be flagged?
        – Sprint rooms are finite buffers in time: how should they be used?
      • Reliability Borrowing:
        – The slow-down of soft tasks depends on the hard task rate: can we optimize it?