feedback_optimizations_v2

Optimizations for the Orbit Feedback System
Ani Sridhar
AES Division / Controls Group Intern
Email: asridhar@anl.gov
asridha1@andrew.cmu.edu
Controls Group Meeting
7/27/16

Overview of the Orbit Feedback System
Inverse Response Matrix x BPM Error Data = Corrector Error Values
Corrector Error Values -> Regulator -> Corrector deltas
660 BPMs
160 correctors
Goal: 44 usec
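The data flow above can be sketched in a few lines of plain Python. This is only an illustration on a toy 2 x 3 matrix: the function names are hypothetical, and the regulator is stood in for by a single proportional gain, which is an assumption and not the PID and filter chain described later in the deck.

```python
def corrector_errors(irm, bpm_errors):
    """Multiply the inverse response matrix (C x N) by the BPM error vector (N)."""
    return [sum(a * b for a, b in zip(row, bpm_errors)) for row in irm]


def regulate(errors, gain):
    """Placeholder regulator (assumption): one proportional gain for all correctors."""
    return [gain * e for e in errors]


def feedback_step(irm, bpm_errors, gain=0.5):
    """One feedback iteration: BPM errors -> corrector errors -> corrector deltas."""
    return regulate(corrector_errors(irm, bpm_errors), gain)


# Toy sizes (2 correctors, 3 BPMs) instead of the real 160 x 660 system.
irm = [[1.0, 0.0, 2.0],
       [0.0, 1.0, -1.0]]
bpm = [0.5, -0.25, 1.0]
deltas = feedback_step(irm, bpm)
print(deltas)
```

In the real system both stages together have to fit inside the 44 usec goal, which is why the rest of the deck benchmarks them separately.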

Board Structure for correctors
DSP 1 (8 cores): X plane computations
DSP 2 (8 cores): Y plane computations

Architecture Tradeoffs
• Feasible?
• Chip utilization?
• Time? Memory? Money?
20 board architecture (8 correctors per board)
Single board architecture (160 correctors per board)

MATRIX MULTIPLICATION BENCHMARKS

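Matrix Multiplication setup
$$
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,660} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,660} \\
\vdots & \vdots & \ddots & \vdots \\
a_{C,1} & a_{C,2} & \cdots & a_{C,660}
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{660} \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_C \end{bmatrix}
$$
Inverse Response Matrix (C x 660) x BPM errors (660 x 1) = Corrector Errors (C x 1)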

Option 1: Library Optimized Matrix Multiplication
Inverse Response Matrix x BPM Error = Corrector Error
Extra zero padding on the BPM Error input (required by library function)

Option 2: Dot Product multiplication
Inverse Response Matrix x BPM error = Corrector error
• Less time
• Less memory
• Better for parallel cores
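To make the difference between the two options concrete, here is a small Python sketch. It assumes the library route treats the BPM error vector as a matrix with extra all-zero columns. The padding width `PAD_WIDTH` is a hypothetical value, and the multiply-accumulate (MAC) counts are naive counts for a textbook matrix product, not measurements of any DSP library.

```python
def matmul(a, b):
    """Textbook matrix-matrix product: rows of a against columns of b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def pad_vector(v, width):
    """Turn a vector into an N x width matrix whose extra columns are all zero."""
    return [[x] + [0.0] * (width - 1) for x in v]


def matvec_dot(a, v):
    """Option 2: one dot product per matrix row, no padding."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


irm = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
bpm = [1.0, 0.5, -1.0]
PAD_WIDTH = 4  # hypothetical padding width, for illustration only

option1 = [row[0] for row in matmul(irm, pad_vector(bpm, PAD_WIDTH))]
option2 = matvec_dot(irm, bpm)

macs_option1 = len(irm) * len(bpm) * PAD_WIDTH  # padded columns still cost work
macs_option2 = len(irm) * len(bpm)
print(option1, option2, macs_option1, macs_option2)
```

Both routes give the same corrector errors. The padded version simply carries extra zero columns through the multiplication and needs more memory for them, which is in line with the "less time, less memory" bullets.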

Measured and Calculated times
IRM Matrix Dimensions    Theoretical Limits*    Method 2: Dot Product
8 x 660                  2.4 usec               2.64 usec
160 x 660                48 usec                98.4 usec
Does not take into account memory constraints!
*computed by clock cycle formula

Memory constraints for DSP C6678

Measured and Calculated times:
32 KB L2 Cache Enabled (MSM)
IRM Matrix Dimensions    Theoretical Limits*    Method 1: Library fn    Method 2: Dot Product
8 x 660                  2.4 usec               4.424 usec              2.64 usec
160 x 660                48 usec                121.4064 usec           98.4 usec
*computed by clock cycle formula
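A quick way to read these numbers is to compare the measured times against the theoretical limits. The sketch below only re-does arithmetic on the values listed on this slide. The clock cycle formula itself is not given in the deck, so the per-MAC figure is simply the theoretical time divided by rows x 660, and the dictionary layout is hypothetical.

```python
# Times in microseconds, copied from this slide.
timings = {
    "8 x 660": {"rows": 8, "theory": 2.4, "library": 4.424, "dot": 2.64},
    "160 x 660": {"rows": 160, "theory": 48.0, "library": 121.4064, "dot": 98.4},
}
COLS = 660  # number of BPMs


def summarize(entry):
    """Derive simple ratios from one row of the timing table."""
    macs = entry["rows"] * COLS
    return {
        "ns_per_mac_theory": 1000.0 * entry["theory"] / macs,
        "dot_vs_theory": entry["dot"] / entry["theory"],
        "library_vs_dot": entry["library"] / entry["dot"],
    }


summary = {name: summarize(entry) for name, entry in timings.items()}
for name, s in summary.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

Both theoretical limits correspond to the same cost per multiply-accumulate, so the limit scales linearly with matrix size. The measured dot product time is about 1.1 times the limit for 8 x 660 but about twice the limit for 160 x 660.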

Matrix Multiplication on Multiple Cores
[Figure: the IRM multiplication divided between DSP CORE 0 and DSP CORE 1]
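One simple way to divide the work is to give each core a contiguous block of IRM rows, since every output element is an independent dot product. The sketch below assumes that scheme. The slide shows the matrix being divided between cores but does not spell out the exact partitioning, and `split_rows` is a hypothetical helper.

```python
def split_rows(n_rows, n_cores):
    """Assign contiguous row ranges [start, stop) to each core, as evenly as possible."""
    base, extra = divmod(n_rows, n_cores)
    ranges = []
    start = 0
    for core in range(n_cores):
        count = base + (1 if core < extra else 0)  # first `extra` cores take one more row
        ranges.append((start, start + count))
        start += count
    return ranges


# 160 corrector rows spread over 6 compute cores.
assignment = split_rows(160, 6)
for core, (start, stop) in enumerate(assignment):
    print(f"core {core}: rows {start}..{stop - 1} ({stop - start} rows)")
```

With 160 rows and 6 cores the busiest core handles 27 rows, so under this scheme the multiplication cannot finish faster than 27/160 of the single-core time, before any synchronization cost.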

Parallel Matrix Multiplication Benchmarks
MSM, L2 Cache Enabled (32K)
Matrix Dimensions    Time with 1 core    Time with 4 cores    Time with 6 cores    Time with 8 cores*
8 x 660              2.6 usec            3.7 usec             4.9 usec             7.0 usec
160 x 660            98.4 usec           27.5 usec            21.0 usec            19.0 usec
*Currently, we want to exclusively use one core to receive data and another to send data. 8 core data is just there for benchmarks.
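The benchmark numbers can be turned into speedup and parallel efficiency figures with a few lines of Python. Only the timings come from this slide. The helper name and the definition of efficiency (speedup divided by core count) are the usual textbook ones, used here as an assumption about how to read the data.

```python
# Times in microseconds, copied from this slide.
single_core = {"8 x 660": 2.6, "160 x 660": 98.4}
multi_core = {
    "8 x 660": {4: 3.7, 6: 4.9, 8: 7.0},
    "160 x 660": {4: 27.5, 6: 21.0, 8: 19.0},
}


def speedup_table(t1, by_cores):
    """Map core count -> (speedup over one core, parallel efficiency)."""
    return {n: (t1 / t, t1 / t / n) for n, t in sorted(by_cores.items())}


tables = {name: speedup_table(single_core[name], multi_core[name]) for name in single_core}
for name, table in tables.items():
    for cores, (speedup, efficiency) in table.items():
        print(f"{name}, {cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

For 160 x 660 the speedup keeps growing with more cores while the efficiency drops. For 8 x 660 every multi-core configuration is slower than a single core, which suggests the synchronization overhead outweighs the small amount of work per core.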

Summary of Benchmarks
• 8 x 660 matrix multiplication can be finished in 2.6 usecs (with 1 core)
• 160 x 660 matrix multiplication can be finished in 21.0 usecs (with 6 cores, current architecture), 19.0 usecs with 8 cores
• Hard theoretical limit for 160 x 660 multiplication is 48 usecs with 1 core, roughly 10 usecs with 6 cores and 6 usecs with 8 cores, but these limits do not take into account memory constraints

Regulator System and Current Specs
Corrector Error -> Corrector Deltas
Same for all correctors in the same plane
Number of correctors    Time (usecs)
8                       3.5712
160                     72.3432
Optimization idea: filter and PID coefficients ONLY depend on the plane!

Optimized PID Performance:
Library weighted vector sum
PID constants are the same across correctors in the same plane
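Because the constants are shared within a plane, the PID update for all correctors can be written as a weighted sum of whole vectors instead of a per-corrector loop with per-corrector coefficients. The sketch below assumes a textbook discrete PID form (proportional, running-sum integral, first-difference derivative). The deck does not state the exact form, and `pid_step` and its arguments are hypothetical names.

```python
def pid_step(errors, integrals, prev_errors, kp, ki, kd):
    """One PID update for every corrector in a plane, sharing kp, ki and kd.

    Assumed form: delta = kp*e[n] + ki*sum(e) + kd*(e[n] - e[n-1]).
    """
    new_integrals = [s + e for s, e in zip(integrals, errors)]
    deltas = [
        kp * e + ki * s + kd * (e - p)
        for e, s, p in zip(errors, new_integrals, prev_errors)
    ]
    return deltas, new_integrals


# Two correctors in the same plane, so a single set of constants is enough.
errors = [1.0, -2.0]
deltas, integrals = pid_step(errors, integrals=[0.0, 0.0], prev_errors=[0.5, -1.0],
                             kp=2.0, ki=0.5, kd=1.0)
print(deltas, integrals)
```

On a DSP, each of the three terms is a scalar times a vector, which is the kind of operation a library weighted vector sum routine is built for.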

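Optimized Filter Implementation: Direct Form II
[Figure: Direct Form II block diagram with input r[n], internal state w[n], output t[n], a z^-1 delay and coefficient taps h[0], h[1]]
Advantage: minimal number of array accesses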

Optimized Filter Implementation: Direct Form II
Minimal array accesses create SMALL matrices for FAST computation!
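As an illustration of why Direct Form II keeps array accesses low, here is a minimal sketch of a first-order section applied to several correctors at once. The filter order and the coefficient names `b0`, `b1`, `a1` are assumptions made for the example, since the deck only shows the block diagram. The point is that each corrector needs just one stored state value, w[n-1].

```python
def df2_step(inputs, states, b0, b1, a1):
    """Advance a first-order Direct Form II filter by one sample per corrector.

    w[n] = r[n] - a1 * w[n-1]
    t[n] = b0 * w[n] + b1 * w[n-1]
    Coefficients are shared by all correctors in the plane.
    """
    outputs = []
    new_states = []
    for r, w_prev in zip(inputs, states):
        w = r - a1 * w_prev
        outputs.append(b0 * w + b1 * w_prev)
        new_states.append(w)  # the only value carried to the next sample
    return outputs, new_states


# Impulse response for two correctors (second input is twice the first).
states = [0.0, 0.0]
history = []
for sample in ([1.0, 2.0], [0.0, 0.0], [0.0, 0.0]):
    outputs, states = df2_step(sample, states, b0=0.5, b1=0.25, a1=-0.5)
    history.append(outputs)
print(history)
```

A Direct Form I version of the same filter would have to keep both the previous input and the previous output per corrector, so the state array would be twice as large.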

Measured Regulator Results
# of correctors    Original time (usecs)    Optimized time (usecs)
8                  3.5712                   0.628 (628 ns)
160                72.3432                  5.9015
Speedup by a factor of 12 for 160 correctors (about 5.7 for 8 correctors)!

Feedback System Timings:
8 correctors / 8 x 660 IRM
[Figure: bar chart of total time in usec (axis 0 to 10) per core count; legend: Matrix Multiplication, Latency, Regulator]
1 Core     3.2 usec
4 Cores    4.3 usec
6 Cores    5.6 usec
8 Cores    7.7 usec
Hard Limit: 44 usec
Note: additional time needed to receive/send data

Feedback System Timings:
160 correctors / 160 x 660 IRM
[Figure: bar chart of total time in usec (axis 0 to 150) per core count; legend: Latency, Regulator]
1 core     104.2 usec
4 cores    33.4 usec
6 cores    26.9 usec
8 cores    24.9 usec
Hard Limit: 44 usec
Note: additional time needed to receive/send data
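The totals on this chart can be cross-checked against the 44 usec limit by adding the optimized regulator time to the matrix multiplication benchmarks from the earlier slides. The sketch assumes the total is simply the sum of those two measurements, with any synchronization latency already included in the multi-core multiplication times. The variable names are hypothetical.

```python
HARD_LIMIT_USEC = 44.0
REGULATOR_USEC = 5.9015  # optimized regulator, 160 correctors

# 160 x 660 matrix multiplication times in usec, keyed by number of cores.
matmul_usec = {1: 98.4, 4: 27.5, 6: 21.0, 8: 19.0}

totals = {cores: t + REGULATOR_USEC for cores, t in matmul_usec.items()}
within_budget = [cores for cores, total in sorted(totals.items())
                 if total < HARD_LIMIT_USEC]

for cores, total in sorted(totals.items()):
    margin = HARD_LIMIT_USEC - total
    print(f"{cores} core(s): {total:.1f} usec total, {margin:.1f} usec margin")
```

Under this assumption the 4, 6 and 8 core configurations fit inside the limit and the single-core one does not. The remaining margin is what is left for receiving and sending data.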

The 8 x 660 Picture
• 20 boards
• Each board takes 3.2 usec for main operations
• Much less than 44 usec limit

The 160 x 660 Picture
• 1 board
• 26.9 usecs for main operations
• Under the 44 usecs limit

Tradeoffs to consider
• Time: how long do we need the system to settle after sending corrector deltas?
• Money: how much of a budget do we have to spend on boards?
• Chip Utilization: when does it make sense to use multiple cores? At what point are there too many boards?

Summary of Results
• For 160 x 660 IRM,
  • Regulator runs in 5.9 usecs (faster by factor of 12 from original)
  • Matrix Multiplication runs in 21.0 usecs (with 30% speedup from optimized library implementation)
  • Total time 26.9 usecs < 44 usecs
• Orbit Feedback Controller Design options
  • Option 1: Using 20 boards for 8 x 660 computations – will finish main computations in 3.2 usecs
  • Option 2: Using 1 board for 160 x 660 computations – will finish main computations in 26.9 usecs – now a feasible idea!
• Additional Data
  • Have generated benchmarks for other matrix sizes for multiplication and regulator – gives a more detailed picture of different tradeoffs and timings.

Further Optimizations
• Matrix Multiplication
  • Sequential: investigate further memory optimizations – measured times are nearly double the theoretical hard limit (which does not take into account memory constraints)
  • Meeting hard theoretical limit would allow chip to finish 160 x 660 computations in ~ 8 usecs, with total time ~ 14 usecs.
  • Parallel: minimize latency of synchronizing cores after computations are done: ~ 3 usecs saved
• Regulator
  • Currently sequential. Parallel estimated time for 6 cores is 1.2 usecs, not taking into account latency

Questions/Comments?
Contact Information:
Ani Sridhar
Email:
asridhar@anl.gov
asridha1@andrew.cmu.edu
I have a spreadsheet with more detailed results and data as well as a full report on these results. Email me if you want access!
