1. Optimizations for the Orbit Feedback System
Ani Sridhar
AES Division / Controls Group Intern
Email: asridhar@anl.gov
asridha1@andrew.cmu.edu
Controls Group Meeting
7/27/16
2. Overview of the Orbit Feedback System
(Inverse Response Matrix) x (BPM Error Data) = Corrector Error Values
Corrector Error Values -> Regulator -> Corrector deltas
660 BPMs
160 correctors
Goal: 44 usec
3. Board Structure for correctors
DSP 1 (8 cores): X plane computations
DSP 2 (8 cores): Y plane computations
9. Option 2: Dot Product multiplication
(Inverse Response Matrix) x (BPM error) = Corrector error
• Less time
• Less memory
• Better for parallel cores
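As a minimal sketch of the dot product view (not the DSP implementation, which the slides do not show): each corrector error is the dot product of one Inverse Response Matrix row with the BPM error vector. Because the rows are independent, they can be divided among cores. The function names and the row-splitting helper below are illustrative assumptions.

```python
def dot(row, vec):
    """Dot product of one IRM row with the BPM error vector."""
    total = 0.0
    for a, b in zip(row, vec):
        total += a * b
    return total


def corrector_errors(irm, bpm_error):
    """One dot product per IRM row gives one corrector error per corrector."""
    return [dot(row, bpm_error) for row in irm]


def split_rows(n_rows, n_workers):
    """Contiguous, near-equal (start, stop) row ranges, one per worker.

    Hypothetical helper: shows how independent rows could be shared
    among cores; the real core assignment is not given in the slides.
    """
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def corrector_errors_by_chunks(irm, bpm_error, n_workers):
    """Same result as corrector_errors, computed chunk by chunk."""
    result = []
    for start, stop in split_rows(len(irm), n_workers):
        result.extend(corrector_errors(irm[start:stop], bpm_error))
    return result
```

With 160 rows and 6 workers, `split_rows` hands out four chunks of 27 rows and two of 26, so no core waits long on another in this idealized picture.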
10. Measured and Calculated times
IRM Matrix Dimensions   Theoretical Limits*   Method 2: Dot Product
8 x 660                 2.4 usec              2.64 usec
160 x 660               48 usec               98.4 usec
Does not take into account memory constraints!
*computed by clock cycle formula
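The clock cycle formula itself is not shown on the slide. As a rough cross-check, assuming a matrix-vector product costs one multiply-accumulate (MAC) per matrix element, the table figures can be turned into an implied MAC rate. The one-MAC-per-element cost model and the function names are assumptions made for this sketch.

```python
# Times copied from the table, in microseconds.
THEORETICAL_USEC = {(8, 660): 2.4, (160, 660): 48.0}
MEASURED_USEC = {(8, 660): 2.64, (160, 660): 98.4}


def mac_count(rows, cols):
    """Assumed cost model: one multiply-accumulate per matrix element."""
    return rows * cols


def macs_per_usec(rows, cols, time_usec):
    """Implied throughput for a rows x cols matrix-vector product."""
    return mac_count(rows, cols) / time_usec


def measured_over_theoretical(dims):
    """How many times slower the measurement is than the stated limit."""
    return MEASURED_USEC[dims] / THEORETICAL_USEC[dims]


if __name__ == "__main__":
    for dims, limit in THEORETICAL_USEC.items():
        rate = macs_per_usec(dims[0], dims[1], limit)
        ratio = measured_over_theoretical(dims)
        print(dims, round(rate), "MACs/usec at the limit;",
              "measured is", round(ratio, 2), "x the limit")
```

Under this cost model both theoretical rows imply the same rate, about 2200 MACs per usec. The 8 x 660 measurement sits about 10% above its limit, while the 160 x 660 measurement is roughly twice its limit.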
14. Parallel Matrix Multiplication Benchmarks
MSM, L2 Cache Enabled (32K)
Matrix Dimensions   Time with 1 core   Time with 4 cores   Time with 6 cores   Time with 8 cores*
8 x 660             2.6 usec           3.7 usec            4.9 usec            7.0 usec
160 x 660           98.4 usec          27.5 usec           21.0 usec           19.0 usec
*Currently, we want to exclusively use one core to receive data and another to send data. The 8-core data is there only for benchmarks.
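To make the scaling in these benchmarks easier to read, the following sketch computes speedup (1-core time divided by N-core time) and parallel efficiency (speedup divided by N) from the table values. The dictionary layout and function names are illustrative, and no figures beyond those in the table are used.

```python
# Benchmark times from the table, in microseconds, keyed by core count.
TIMES_USEC = {
    "8 x 660": {1: 2.6, 4: 3.7, 6: 4.9, 8: 7.0},
    "160 x 660": {1: 98.4, 4: 27.5, 6: 21.0, 8: 19.0},
}


def speedup(times, cores):
    """Single-core time divided by the time on `cores` cores."""
    return times[1] / times[cores]


def efficiency(times, cores):
    """Speedup per core; 1.0 would be perfect linear scaling."""
    return speedup(times, cores) / cores


if __name__ == "__main__":
    for dims, times in TIMES_USEC.items():
        for cores in sorted(times):
            print(dims, cores, "cores:",
                  "speedup", round(speedup(times, cores), 2),
                  "efficiency", round(efficiency(times, cores), 2))
```

For the 160 x 660 case the speedup keeps growing with core count but stays below linear. For the 8 x 660 case the speedup falls below 1, meaning the extra cores cost more than they save at that problem size.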
15. Summary of Benchmarks
• 8 x 660 matrix multiplication can be finished in 2.6 usecs (with 1 core)
• 160 x 660 matrix multiplication can be finished in 21.0 usecs (with 6 cores, current architecture), 19.0 usecs with 8 cores
• Hard theoretical limit for 160 x 660 multiplication is 48 usecs with 1 core, roughly 10 usecs with 6 cores and 6 usecs with 8 cores, but these limits do not take into account memory constraints
17. Regulator System and Current Specs
Corrector Error -> Regulator -> Corrector Deltas
Same for all correctors in the same plane
Number of correctors   Time
8                      3.5712 usec
160                    72.3432 usec
Optimization idea: filter and PID coefficients ONLY depend on the plane!
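The slides do not show the regulator code, so the following is only a sketch of one way this observation could be exploited: hold a single set of coefficients per plane and read it once, outside the per-corrector loop, instead of handling coefficients separately for every corrector. The single-pole low-pass filter, the discrete PID form, and all class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaneCoefficients:
    """Hypothetical per-plane settings shared by every corrector in a plane."""
    kp: float
    ki: float
    kd: float
    filter_alpha: float  # smoothing factor of an assumed single-pole filter


@dataclass
class CorrectorState:
    """Per-corrector history; only this part differs between correctors."""
    filtered: float = 0.0
    integral: float = 0.0
    prev_filtered: float = 0.0


def regulate_plane(errors, states, coeffs):
    """Turn corrector errors into corrector deltas for one plane."""
    # Shared coefficients are read once, not once per corrector.
    kp, ki, kd, alpha = coeffs.kp, coeffs.ki, coeffs.kd, coeffs.filter_alpha
    deltas = []
    for err, st in zip(errors, states):
        st.filtered += alpha * (err - st.filtered)   # low-pass filter step
        st.integral += st.filtered                   # integral term history
        derivative = st.filtered - st.prev_filtered  # difference of samples
        st.prev_filtered = st.filtered
        deltas.append(kp * st.filtered + ki * st.integral + kd * derivative)
    return deltas
```

A call would pass one `PlaneCoefficients` for the X plane and another for the Y plane, each with its own list of `CorrectorState` objects.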
21. Measured Regulator Results
# of correctors   Original time   Optimized time
8                 3.5712 usec     0.628 usec (628 ns)
160               72.3432 usec    5.9015 usec
Speedup by a factor of about 12 for 160 correctors (about 5.7 for 8 correctors)!
22. Feedback System Timings:
8 correctors / 8 x 660 IRM
[Bar chart: total time in usec by core count, axis 0 to 10; each bar stacks Matrix Multiplication, Matrix Multiplication Latency, and Regulator]
1 Core: 3.2 usec
4 Cores: 4.3 usec
6 Cores: 5.6 usec
8 Cores: 7.7 usec
Hard Limit: 44 usec
Note: additional time needed to receive/send data
23. Feedback System Timings:
160 correctors / 160 x 660 IRM
[Bar chart: total time in usec by core count, axis 0 to 150; each bar stacks Matrix Multiplication, Matrix Multiplication Latency, and Regulator]
1 core: 104.2 usec
4 cores: 33.4 usec
6 cores: 26.9 usec
8 cores: 24.9 usec
Hard Limit: 44 usec
Note: additional time needed to receive/send data
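The totals for the 160 x 660 case can be approximately reproduced by adding the benchmarked matrix multiplication time to the optimized regulator time and comparing the sum with the 44 usec limit. The sketch below does that arithmetic using rounded figures from earlier slides. The names are illustrative, and time to receive and send data is not included.

```python
HARD_LIMIT_USEC = 44.0
REGULATOR_USEC = 5.9  # optimized regulator, 160 correctors (rounded)
# 160 x 660 matrix multiplication benchmarks, keyed by core count.
MATMUL_USEC = {1: 98.4, 4: 27.5, 6: 21.0, 8: 19.0}


def total_usec(cores):
    """Matrix multiplication plus regulator; excludes receive/send time."""
    return MATMUL_USEC[cores] + REGULATOR_USEC


def fits_budget(cores):
    """True when the main operations finish under the hard limit."""
    return total_usec(cores) < HARD_LIMIT_USEC


if __name__ == "__main__":
    for cores in sorted(MATMUL_USEC):
        verdict = "within" if fits_budget(cores) else "over"
        print(cores, "cores:", round(total_usec(cores), 1), "usec,",
              verdict, "the limit")
```

Because the inputs are rounded, a sum may differ from the charted total in the last digit. Only the single-core configuration exceeds the limit.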
24. The 8 x 660 Picture
• 20 boards
• Each board takes 3.2 usec for main operations
• Much less than 44 usec limit
25. The 160 x 660 Picture
• 1 board
• 26.9 usecs for main operations
• Under the 44 usecs limit
26. Tradeoffs to consider
• Time: how long do we need the system to settle after sending corrector deltas?
• Money: how much of a budget do we have to spend on boards?
• Chip Utilization: when does it make sense to use multiple cores? At what point are there too many boards?
27. Summary of Results
• For 160 x 660 IRM:
  • Regulator runs in 5.9 usecs (faster by a factor of 12 from original)
  • Matrix Multiplication runs in 21.0 usecs (with 30% speedup from optimized library implementation)
  • Total time 26.9 usecs < 44 usecs
• Orbit Feedback Controller Design options
  • Option 1: Using 20 boards for 8 x 660 computations – will finish main computations in 3.2 usecs
  • Option 2: Using 1 board for 160 x 660 computations – will finish main computations in 26.9 usecs – now a feasible idea!
• Additional Data
  • Have generated benchmarks for other matrix sizes for multiplication and regulator – gives a more detailed picture of different tradeoffs and timings.
28. Further Optimizations
• Matrix Multiplication
  • Sequential: investigate further memory optimizations – measured times are nearly double the theoretical hard limit (which does not take into account memory constraints)
  • Meeting hard theoretical limit would allow chip to finish 160 x 660 computations in ~8 usecs, with total time ~14 usecs.
  • Parallel: minimize latency of synchronizing cores after computations are done: ~3 usecs saved
• Regulator
  • Currently sequential. Parallel estimated time for 6 cores is 1.2 usecs, not taking into account latency
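The ~8 usec and ~14 usec projections can be checked with simple arithmetic, assuming the 48 usec single-core theoretical limit divides evenly over 6 cores and the regulator stays at its current sequential 5.9 usecs. This is a sketch of that estimate only. The even split ignores memory and synchronization costs, and the function names are illustrative.

```python
THEORETICAL_1CORE_USEC = 48.0  # 160 x 660 theoretical limit, 1 core
REGULATOR_USEC = 5.9           # current sequential regulator, 160 correctors


def ideal_parallel_usec(single_core_usec, cores):
    """Even split across cores; ignores memory and synchronization costs."""
    return single_core_usec / cores


def projected_total_usec(cores, regulator_usec=REGULATOR_USEC):
    """Idealized matrix multiplication time plus the regulator time."""
    matmul = ideal_parallel_usec(THEORETICAL_1CORE_USEC, cores)
    return matmul + regulator_usec


if __name__ == "__main__":
    print("matmul:", ideal_parallel_usec(THEORETICAL_1CORE_USEC, 6), "usec")
    print("total:", round(projected_total_usec(6), 1), "usec")
```

Passing the 1.2 usec parallel regulator estimate as `regulator_usec` gives a correspondingly lower figure, again without accounting for latency.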