From FPGA-based Reconfigurable Systems to Autonomic Heterogeneous Computing Systems
1. Politecnico di Milano
From FPGA-based Reconfigurable Systems to
Autonomic Heterogeneous Computing Systems
an enabling technologies perspective and more…
PEoPLe@DEIB
CE4 @ 13 March 2018
Marco D. Santambrogio
<marco.santambrogio@polimi.it>
Politecnico di Milano
6. Heterogeneous Complex Systems
• Ryft ONE
– Big Data infrastructure built on an FPGA-accelerated architecture
– http://www.ryft.com/
• IBM Power8
– Introducing the Coherent Accelerator Processor Interface (CAPI) port that is
layered on top of PCI Express 3.0
– http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html
• Microsoft Catapult
– Stratix V (Arria 10 FPGA)
– http://research.microsoft.com/en-us/projects/catapult/
• Amazon EC2 F1 Instances
– Xilinx UltraScale Plus FPGA
– https://aws.amazon.com/about-aws/whats-new/2017/04/amazon-ec2-f1-instances-customizable-fpgas-for-hardware-acceleration-are-now-generally-available/
• OpenPower Foundation
– http://openpowerfoundation.org/
20. Starting Scenario @ 2009
[Figure: people demanding functionalities, a set of available functionalities, and an FPGA with three reconfigurable regions (RR1, RR2, RR3) hosting functionalities A to F. Labels: RFU implementations, RC implementations.]
Legend (Fi: Area/Time): A 2/1, B 1/2, C 2/2, D 1/1, E 1/1, F 2/2
22. The Problem
A possible scenario
[Figure: area-versus-time schedule of functionalities A to F on the three reconfigurable regions (RR1, RR2, RR3), including the reconfiguration steps Rec. F, Rec. E, Rec. C and Rec. D. Labels: RFU implementations, RC implementations. Legend: Fi Area/Time.]
29. Starting Scenario @ 2009
[Figure: people demanding functionalities, a set of available functionalities, and an FPGA with three reconfigurable regions (RR1, RR2, RR3) hosting functionalities A to F. Labels: RFU implementations, RC implementations.]
Legend (Fi: Area/Time): A 2/1, B 1/2, C 2/2, D 1/1, E 1/1, F 2/2
51. Hints on the problem…
[Figure; see [*]]
[*] Vipin, K. and Fahmy, S. A.: Architecture-aware reconfiguration-centric floorplanning for partial reconfiguration. In ARC, pages 13-25, 2012.
52. Heuristic-Optimal Floorplanner
• Inputs to the MILP model
– Reconfigurable regions + resource requirements
– FPGA
– Heuristic solution
– User-defined linear objective function
• MILP model → MILP solver (Gurobi, Cplex, GLPK, …) → improved heuristic solution
• Geometrical constraints
– Non-overlapping guaranteed by the geometrical constraints
53. Objective function
• Cost function can be defined starting from the
variables and parameters of the MILP model
• Implemented metrics:
– Global wirelength, measured as half-perimeter wirelength (HPWL)
– Regions' perimeter
– Wasted resources
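As a rough illustration of the three metrics listed above, the sketch below evaluates a candidate floorplan in Python. The uniform tile grid, the `Region` type, the single `required` tile count (instead of separate CLB, BRAM and DSP counts) and the weights are all assumptions made for this example; they are not the variables or parameters of the MILP model from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Rectangular reconfigurable region on a tile grid (hypothetical model)."""
    name: str
    x: int         # left column
    y: int         # bottom row
    w: int         # width in tiles
    h: int         # height in tiles
    required: int  # tiles actually needed by the hosted module

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def hpwl(net):
    """Half-perimeter wirelength of one net: half the perimeter of the
    bounding box that encloses the centers of the connected regions."""
    xs = [r.center[0] for r in net]
    ys = [r.center[1] for r in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_perimeter(regions):
    """Sum of the perimeters of all regions."""
    return sum(2 * (r.w + r.h) for r in regions)


def wasted_resources(regions):
    """Tiles covered by a region but not needed by its module."""
    return sum(r.w * r.h - r.required for r in regions)


def overlaps(a, b):
    """True if two regions share at least one tile (must be False in a legal floorplan)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def cost(regions, nets, weights=(1.0, 1.0, 1.0)):
    """Linear objective: weighted sum of wirelength, perimeter and waste."""
    w_wl, w_per, w_waste = weights
    wirelength = sum(hpwl(net) for net in nets)
    return (w_wl * wirelength
            + w_per * total_perimeter(regions)
            + w_waste * wasted_resources(regions))
```

Because every term is a sum, the objective stays linear in the metrics, which is what lets a user-defined combination of them be handed to a MILP solver.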
54. Hints on the problem…
[Figure; see [*]]
[*] Vipin, K. and Fahmy, S. A.: Architecture-aware reconfiguration-centric floorplanning for partial reconfiguration. In ARC, pages 13-25, 2012.
55. Hints on the problem…
• Optimal solution in 29 s
• 34% reduction in wasted frames
– No DSP or CLB wasted by the Video Decoder RR
– No BRAM wasted by the Signal Decoder RR
• Approximately the same wirelength
[*] Vipin, K. and Fahmy, S. A.: Architecture-aware reconfiguration-centric floorplanning for partial reconfiguration. In ARC, pages 13-25, 2012.
58. Runtime Reconfiguration Management
• Reconfigurable architecture
– Static Area: used to control the reconfiguration process
– Reconfigurable Area: used to swap different cores at runtime
– Reconfiguration-oriented communication infrastructure
• Runtime reconfiguration managed via SW
– Standalone, Operating System
• Increased portability of user applications
• Inherited multitasking capabilities
• Simplified software development process
• Bitstream relocation technique to
– speed up the overall system execution
– achieve preemptive execution of cores
– assign bitstream placement at runtime
– reduce the amount of memory used to store partial bitstreams
60. OS-based management of dynamic
reconfiguration
• Provide software support for dynamic partial reconfiguration on Systems-on-Chip running an operating system (i.e., Linux)
– OS customization for specific architectures
– Reconfigurable Functional Unit (RFU) caching policies to improve performance
– Management of the partial reconfiguration process from the OS
– Addition and removal of reconfigurable components
– Automatic loading and unloading of the specific drivers for the IP-Cores upon component configuration and/or deconfiguration
• Hardware-independent interface for software developers, based on GNU/Linux
• Easier programming interface for specific drivers
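One way to picture the RFU caching policies mentioned above is a least-recently-used (LRU) cache over the reconfigurable regions: a module that is still configured is reused, and a miss triggers a reconfiguration that may evict another module. The sketch below is only an illustration in Python. The `RFUCache` class, the LRU choice and the `configure` callback are assumptions for this example, not the policy implemented in the presented system.

```python
from collections import OrderedDict


class RFUCache:
    """LRU cache of modules configured in reconfigurable regions (hypothetical)."""

    def __init__(self, num_regions, configure):
        self.num_regions = num_regions
        # Hypothetical callback that performs the partial reconfiguration:
        # configure(module_name, region_index) -> None
        self.configure = configure
        # module name -> region index, least recently used first
        self._resident = OrderedDict()

    def request(self, module):
        """Return (region, hit) for the requested module."""
        if module in self._resident:
            # Cache hit: the module is already configured, no reconfiguration.
            self._resident.move_to_end(module)
            return self._resident[module], True
        if len(self._resident) < self.num_regions:
            # A region is still free: use the lowest free index.
            used = set(self._resident.values())
            region = next(r for r in range(self.num_regions) if r not in used)
        else:
            # All regions busy: evict the least recently used module.
            _, region = self._resident.popitem(last=False)
        self.configure(module, region)
        self._resident[module] = region
        return region, False
```

A hit avoids the reconfiguration time entirely, which is where the performance improvement would come from under these assumptions.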
61. IP-Core Devices Access
• Interaction with configured IP-Cores implemented by means of the standard Linux device access
– open, close, read, write, ioctl operations
[Figure: in userspace, Software Application 1 and Software Application 2 access /dev/device_1, /dev/device_2a and /dev/device_2b; the device drivers device_1.o and device_2.o connect them to IP-Core_1, IP-Core_2a and IP-Core_2b on the FPGA. IP-Core_2a and IP-Core_2b are multiple instances of the same hardware module.]
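From the application side, the device-file access described above could look like the following Python sketch. The helper names and the idea that a core accepts a payload with `write` and returns a result with `read` are assumptions; the real semantics depend on each IP-Core's driver.

```python
import os


def send_to_core(dev_path, payload):
    """Open the device file, write the payload, and close it.

    Returns the number of bytes accepted by the driver.
    """
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)


def read_from_core(dev_path, max_bytes):
    """Open the device file, read up to max_bytes, and close it."""
    fd = os.open(dev_path, os.O_RDONLY)
    try:
        return os.read(fd, max_bytes)
    finally:
        os.close(fd)
```

For example, `send_to_core("/dev/device_1", data)` followed by `read_from_core("/dev/device_1", 64)` would exercise open, write, read and close on the first core. Control requests would go through `fcntl.ioctl`, with request codes defined by the specific driver.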
62. Reconfigurable Process Control Block
• Reconfigurable process: an RFU object code in execution
• Each reconfigurable process is represented in the system by a Reconfigurable Process Control Block (RPCB)
• An RPCB contains all the information associated with a specific reconfigurable process
– State: the current state of the reconfigurable process
– Position: the placement position on the device
– Object Code Accounting Information:
• Object Code
• Configuration Priority
• Resources
• Position
[Figure: state diagram of a reconfigurable process, with the labels Positioning, Configuring, Configured, Addressing Space Assignment, Ready, Executing, Computing, Waiting, Preempted, Cached, Removed]
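The RPCB fields listed above map naturally onto a small record type. The Python sketch below is a hypothetical rendering: the field types, the `pid` identifier and the class names are assumptions, and no state transitions are encoded because the edges of the state diagram are not recoverable from the slide text.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional, Tuple


class RPState(Enum):
    """State labels that appear in the slide's state diagram."""
    POSITIONING = auto()
    CONFIGURING = auto()
    CONFIGURED = auto()
    ADDRESSING_SPACE_ASSIGNMENT = auto()
    READY = auto()
    EXECUTING = auto()
    COMPUTING = auto()
    WAITING = auto()
    PREEMPTED = auto()
    CACHED = auto()
    REMOVED = auto()


@dataclass
class ObjectCodeInfo:
    """Object Code Accounting Information (field types are assumed)."""
    object_code: bytes                # the partial bitstream
    configuration_priority: int
    resources: Dict[str, int]         # e.g. {"CLB": 120, "BRAM": 4}
    position: Optional[Tuple[int, int]] = None


@dataclass
class RPCB:
    """Reconfigurable Process Control Block (hypothetical layout)."""
    pid: int                          # assumed identifier, not in the slides
    state: RPState
    position: Optional[Tuple[int, int]]
    accounting: ObjectCodeInfo
```

Keeping the state as an enumeration makes it easy for a manager to filter processes, for instance to find every module that is cached and can be reused without a new configuration.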
63. The Centralized Manager
• Userspace applications are not allowed to explicitly
request a bitstream
– They request high-level functionalities
• Userspace requests are collected and served by a
centralized manager (Linux Reconfiguration Manager)
– The OS chooses the configuration code
– A new reconfigurable process is created
• Only the LRM can ask for a bitstream to be configured
on the FPGA
– Centralized knowledge of the device status
– Area management and module caching
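The division of responsibilities described above can be sketched as follows: applications name a functionality, and only the manager picks a bitstream and tracks the device area. This Python sketch is an assumption-laden illustration. The class name, the library format, the scalar area model and the "smallest implementation that fits" policy are hypothetical and are not taken from the actual Linux Reconfiguration Manager.

```python
class ReconfigurationManager:
    """Toy centralized manager (hypothetical stand-in for the LRM)."""

    def __init__(self, library, free_area):
        # library: functionality name -> list of (bitstream_name, area)
        self.library = library
        self.free_area = free_area      # centralized knowledge of device status
        self.processes = {}             # process id -> (functionality, bitstream)
        self._next_id = 0

    def request(self, functionality):
        """Serve a userspace request for a high-level functionality.

        Returns (process_id, bitstream_name), or None if nothing fits.
        """
        candidates = self.library.get(functionality, [])
        fitting = [c for c in candidates if c[1] <= self.free_area]
        if not fitting:
            return None
        # Assumed policy: choose the smallest implementation that fits.
        bitstream, area = min(fitting, key=lambda c: c[1])
        self.free_area -= area
        pid = self._next_id
        self._next_id += 1
        # A new reconfigurable process is created for the request.
        self.processes[pid] = (functionality, bitstream)
        return pid, bitstream
```

Userspace code never sees bitstream names as inputs, only as a result it could ignore, which mirrors the rule that applications request functionalities rather than bitstreams.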
72. Trying to raise the bar
• Towards the design and implementation of self-adaptive and autonomic systems
• We need to include 2 more features
– Goal-oriented
• Tell the system what you want
• System’s job to figure out how to get there
– Approximate
• Does not expend any more effort than necessary to meet goals
84. Autonomic Operating System
• The AcOS project aims at
– designing and prototyping a patch for commodity operating systems (e.g., Linux, FreeBSD)
– making them capable of observing their own execution and optimizing, in a self-aware manner, their behavior with respect to the external environment, user needs and application demands
[integrated/used in different research projects]
85. The research effort
• K42
– http://researcher.watson.ibm.com/researcher/view_project.php?id=2078
• The SElf-awarE Computing (SEEC) Model
– http://groups.csail.mit.edu/carbon/seec/
• Angstrom
– http://projects.csail.mit.edu/angstrom/
• The Swarm project and Tessellation OS
– http://tessellation.cs.berkeley.edu
91. AcOS: via an intelligent ODA loop
[Figure: Observe-Decide-Act (ODA) loop connecting applications and system, monitoring infrastructures (measurements), goals, and adaptation policies]
92. The Heart Rate Monitor
• Heartbeats[1] signal either progress or availability
– video encoder: 1 heartbeat = 1 frame
– web server: 1 heartbeat = 1 request
– database server: 1 heartbeat = 1 transaction
• Heart rate as a performance measure and goal
– High-level, application-specific performance measurements and goals (e.g., video encoder: 30 heartbeats/s = 30 frames/s)
• Compact API, user/kernel-space partitioned implementation[2]
– User-mode fast path for issuing heartbeats
– Low-latency heart rate access from user and kernel mode
[1] Hoffmann et al., Application Heartbeats for Software Performance and Health, Proc. of the 15th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2010
[2] Sironi et al., Metronome: Operating System Level Performance Management via Self-Adaptive Computing, Proc. of the 49th Annual Design Automation Conference, 2012
98. The Heart Rate Monitor
• Set performance goal
– e.g., min: 25 hb/s, max: 35 hb/s
• Run the app and update progress
– Issue a heartbeat
– Statistics automatically updated
• Check heart rate and, if necessary, react
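The three steps above (set a goal, issue heartbeats, check the heart rate) can be mimicked with a few lines of Python. This is a minimal sketch, not the Application Heartbeats or AcOS API: the class, its method names, the sliding-window size and the returned labels are all assumptions.

```python
from collections import deque


class HeartRateMonitor:
    """Sliding-window heart rate monitor (hypothetical, user-space only)."""

    def __init__(self, min_rate, max_rate, window=20):
        self.min_rate = min_rate        # goal, in heartbeats per second
        self.max_rate = max_rate
        self._stamps = deque(maxlen=window)

    def heartbeat(self, timestamp):
        """Issue a heartbeat; statistics are updated as a side effect."""
        self._stamps.append(timestamp)

    def heart_rate(self):
        """Heartbeats per second over the current window."""
        if len(self._stamps) < 2:
            return 0.0
        elapsed = self._stamps[-1] - self._stamps[0]
        if elapsed <= 0:
            return 0.0
        return (len(self._stamps) - 1) / elapsed

    def check(self):
        """Compare the measured heart rate with the goal."""
        rate = self.heart_rate()
        if rate < self.min_rate:
            return "too_slow"
        if rate > self.max_rate:
            return "too_fast"
        return "on_target"
```

A video encoder would call `heartbeat(time.monotonic())` once per frame, with `time` imported from the standard library, while an adaptation policy polls `check()` and reacts only when the result is not `"on_target"`.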
99. A first evaluation
• Hardware
– Quad-core: Intel Core i7-870 @ 2.97 GHz with 8 MB of LLC
– Hyper-Threading, Turbo Boost, etc. disabled
– Memory: 4 GB of DDR3-1333
• Software
– Debian GNU/Linux 2.6.35
– PARSEC Benchmark Suite 2.1
102. Two controlled x86, different goals
[Figure: two x264 instances, a core allocator, and a consensus object]
103. Autonomic Scheduling
• In a scenario where applications
– are competing for the same set of resources
– require predictable performance, expressed through high-level, application-specific metrics
• The scheduler has to become performance-aware to automatically allocate resources to match performance goals
– With Metronome, we introduce performance-awareness by means of a non-invasive modification to the Completely Fair Scheduler
106. Metronome
Sironi et al., Metronome: Operating System Level Performance Management via Self-Adaptive Computing, Proc. of the 49th Annual Design Automation Conference, 2012
107. From Metronome to Metronome++
• Metronome demonstrates how a simple heuristic can be enough to enable goal-oriented resource allocation by exploiting runtime performance feedback
• Metronome++ has been designed with a more advanced adaptation policy to dynamically allocate CPUs to SLO-bound applications
– E.g., through task migration among run queues
[Figure: (1) vanilla Linux kernel; (2) Linux kernel enhanced with Metronome++]
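To make the idea of a feedback-driven heuristic concrete, the sketch below adjusts an application's CPU share from its measured heart rate. It is a generic proportional rule written for illustration; it is not the heuristic used by Metronome or Metronome++, and the function name, the share bounds and the choice of the goal-window midpoint as target are assumptions.

```python
def adjust_share(share, heart_rate, min_rate, max_rate, lo=0.05, hi=1.0):
    """Return a new CPU share in [lo, hi] given the measured heart rate.

    Inside the goal window nothing changes; outside it, the share is
    scaled by target/measured, assuming performance grows roughly
    linearly with the share (a simplifying assumption).
    """
    if min_rate <= heart_rate <= max_rate:
        return share                    # goal met: spend no extra effort
    if heart_rate <= 0:
        return hi                       # no progress observed: give the maximum
    target = (min_rate + max_rate) / 2
    return max(lo, min(hi, share * target / heart_rate))
```

With a goal of 25 to 35 heartbeats/s, an application measured at 60 heartbeats/s on a 0.5 share would be scaled back to 0.25, freeing resources for competing applications.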
109. Trying to raise the bar (again)
• AcOS took into consideration performance…
• Any other HOT topic?
– What about temperature control/management?
110. Temperature Control/Management: starting scenario
[Figure: temperature increase (°C, 0 to 20) versus time (s, 0 to 600) for swaptions @ 2.80 GHz and ab @ 2.80 GHz]
111. Temperature Control/Management: set a temperature cap
[Figure: temperature increase (°C, 0 to 20) versus time (s, 0 to 600) for swaptions @ 2.80 GHz and ab @ 2.80 GHz]
113. Temperature Control/Management: back to the starting scenario
[Figure: temperature increase (°C, 0 to 20) versus time (s, 0 to 600) for swaptions @ 2.80 GHz and ab @ 2.80 GHz]
114. Temperature Control/Management: set the same temperature cap
[Figure: temperature increase (°C, 0 to 20) versus time (s, 0 to 600) for swaptions @ 2.80 GHz and ab @ 2.80 GHz]
120. A global property
• Adaptability is not a domain-specific property
– Operating systems, computer architecture, etc.
– HPC
– Exascale computing systems
– Embedded systems
– IoT
embedded/mobile devices and HPC/distributed/exascale computing
121. The CHANGE project
A self-adaptive, context-aware operating system to enhance users' experience via smart devices and environments
Enabling Technologies for Self-Aware Adaptive Computing Systems
123. The mobile context
• What about
– Taking into consideration users’ habits and needs
– Considering very constrained and fluctuating scenarios
• Going Mobile!
– Constrained in terms of available resources
• Multi-core, reduced CPU performance, little memory
• Battery constrained
– Internal and external conditions are fluctuating
• Resource availability varies over time
• External conditions are unpredictable
124. Mobile context: some examples
• Tegra 2 (pictured: Motorola Photon 4G)
– Dual-Core ARM Cortex-A9
– ULP GeForce, 8 cores
• A6
– Dual-Core based on ARMv7
– Triple-core PowerVR SGX 543MP3 GPU
• Tegra 3 (pictured: HTC One X)
– Quad-Core
– ULP GeForce, 12 cores
125. The morphone approach
• To react to different situations, no information from the different components should be disregarded
• Our approach uses all the information it can retrieve to create a model of the context
– Environment
– Self
– User behavior
126. The Mpower example
Gain back your Android battery life!
Publicly available for free (September 2013)
mpower.necst.it
https://www.facebook.com/morphone.os
[Figure: Configuration A, Configuration B]
127. The concept of configuration
• Domain-specific feature: hardware modules currently used
• We defined the concept of configuration:
– “Given the controllable hardware modules on a device, a configuration is a combination of their internal state”
• We compute a single ARX model for each configuration
[Figure: Configuration A, Configuration B, Configuration C]
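An ARX (AutoRegressive with eXogenous input) model predicts the next output from past outputs and past inputs. As a minimal sketch of "one model per configuration", the Python code below fits the first-order form y[t] = a*y[t-1] + b*u[t-1] by least squares for each configuration separately. The model order, the meaning of `y` and `u` (for instance a measured power-related quantity and an exogenous input) and the function names are assumptions, not details taken from the slides.

```python
def fit_arx11(y, u):
    """Least-squares fit of y[t] = a*y[t-1] + b*u[t-1].

    Solves the 2x2 normal equations with Cramer's rule.
    """
    syy = syu = suu = sy = su = 0.0
    for t in range(1, len(y)):
        y_prev, u_prev, y_now = y[t - 1], u[t - 1], y[t]
        syy += y_prev * y_prev
        syu += y_prev * u_prev
        suu += u_prev * u_prev
        sy += y_prev * y_now
        su += u_prev * y_now
    det = syy * suu - syu * syu
    if abs(det) < 1e-12:
        raise ValueError("data do not excite both regressors")
    a = (sy * suu - su * syu) / det
    b = (su * syy - sy * syu) / det
    return a, b


def fit_per_configuration(traces):
    """traces: configuration name -> (y samples, u samples).

    Returns configuration name -> (a, b), one ARX model each.
    """
    return {cfg: fit_arx11(y, u) for cfg, (y, u) in traces.items()}
```

Fitting each configuration on its own samples is what turns a globally piecewise-linear behavior into a set of simple linear models.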
132. Modeling approach: “divide et impera”
We observed a piecewise-linear behavior and tried to attribute it to domain-specific features
[Figure: Configuration A, Configuration B and Configuration C; legend: transitions between configurations = actions on controllable variables; exogenous (uncontrollable) input]
133. Actions on controllable variables
• They are determined by the user’s behavior
• We model the evolution of the smartphone’s configuration as a Markov Decision Process
• A state for every configuration
• Transition weights represent the probability of moving from one configuration to another
[Figure: Configuration A, Configuration B, Configuration C]
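The transition weights described above can be estimated from a log of the configurations a device went through. The Python sketch below does this by simple frequency counting. It models a plain Markov chain over configurations (states and transition probabilities only, without the actions or rewards of a full decision process), and the function name and trace format are assumptions.

```python
from collections import Counter, defaultdict


def estimate_transitions(trace):
    """Estimate transition probabilities from an observed configuration trace.

    trace: sequence of configuration names in the order they occurred.
    Returns: source -> {destination: probability}.
    """
    counts = defaultdict(Counter)
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }
```

For the trace A, B, A, C, A, B the estimate says that from configuration A the device moves to B two times out of three and to C one time out of three.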
134. Move Decision Phase on Device
• To perform the computation needed for the decision phase on the device
– We need low-energy computing systems
• The battery is always the bottleneck
– We need fast computing systems
• The system has to be responsive
– We need reconfigurable computing systems
• The analysis to be performed on the input data changes with respect to
145. Politecnico di Milano
From FPGA-based Reconfigurable Systems to
Autonomic Heterogeneous Computing Systems
an enabling technologies perspective and more…
International Center for Theoretical Physics
Trieste @ 24 November 2017
Marco D. Santambrogio
<marco.santambrogio@polimi.it>
Politecnico di Milano
Editor's Notes
The framework has been designed to ensure usability, so that users with little FPGA expertise can use it; interactivity during the design phases, to guide the user through the optimization process; and modularity, so that users can upload their custom modules.
[ADD NOTES ON THE MARKOV PROPERTY HERE]