Optimizations for the Orbit Feedback System
Ani Sridhar
AES Division / Controls Group Intern
Email: asridhar@anl.gov
asridha1@andrew.cmu.edu
Controls Group Meeting
7/27/16
Overview of the Orbit Feedback System
[Block diagram:] Inverse Response Matrix X BPM Error Data = Corrector Error Values -> Regulator -> Corrector Deltas
• 660 BPMs
• 160 correctors
• Goal: complete each feedback cycle in 44 usec
Board Structure for correctors
Each board carries two DSPs, each with 8 cores: DSP 1 handles the X plane computations and DSP 2 handles the Y plane computations.
Architecture Tradeoffs
• Feasible?
• Chip utilization?
• Time? Memory? Money?
20 board architecture
(8 correctors per board)
Single board architecture
(160 correctors per board)
MATRIX MULTIPLICATION BENCHMARKS
Matrix Multiplication setup
The inverse response matrix (C x 660, where C is the number of correctors handled by the board) multiplies the BPM error vector (660 x 1) to produce the corrector error vector (C x 1):

r_i = a_i,1 * b_1 + a_i,2 * b_2 + … + a_i,660 * b_660, for i = 1, …, C
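The row-by-row computation above can be sketched in C (a minimal illustration, not the production DSP code; the row-major layout and float types are assumptions):

```c
#include <stddef.h>

/* Multiply the C x 660 inverse response matrix by the BPM error vector:
   r[i] = sum_j A[i][j] * b[j]. A is stored row-major, so each dot
   product walks sequentially through memory. */
void irm_multiply(const float *A, const float *b, float *r,
                  size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        float acc = 0.0f;
        const float *row = A + i * cols;   /* row i starts here */
        for (size_t j = 0; j < cols; j++)
            acc += row[j] * b[j];
        r[i] = acc;
    }
}
```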
Option 1: Library Optimized Matrix Multiplication
The library function requires the BPM error vector to be padded with extra zeros before the multiply; the product is the corrector error vector.
Option 2: Dot Product multiplication
Compute each corrector error directly as the dot product of one row of the inverse response matrix with the BPM error vector.
• Less time
• Less memory
• Better for parallel cores
Measured and Calculated times

IRM Matrix Dimensions | Theoretical Limit* | Method 2: Dot Product
8 x 660               | 2.4 usec           | 2.64 usec
160 x 660             | 48 usec            | 98.4 usec

The theoretical limits do not take memory constraints into account!
*computed by clock cycle formula
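As a hedged illustration of a clock-cycle-style bound: both table entries are consistent with one fixed per-core MAC throughput (5280 MACs / 2.4 usec = 2200 MACs per usec). The helper below is hypothetical, and the 2200 figure is inferred from the table, not taken from the actual formula:

```c
/* Hypothetical sketch: with a fixed multiply-accumulate throughput per
   core, the lower bound on a rows x cols matrix-vector multiply is
   (rows * cols) / throughput, ignoring all memory effects. */
double theoretical_limit_usec(int rows, int cols, double macs_per_usec)
{
    return (double)rows * (double)cols / macs_per_usec;
}
```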
Memory constraints for DSP C6678
Measured and Calculated times:
32 KB L2 Cache Enabled (MSM)

IRM Matrix Dimensions | Theoretical Limit* | Method 1: Library fn | Method 2: Dot Product
8 x 660               | 2.4 usec           | 4.424 usec           | 2.64 usec
160 x 660             | 48 usec            | 121.4064 usec        | 98.4 usec

*computed by clock cycle formula
Matrix Multiplication on Multiple Cores
The rows of the IRM are partitioned among the DSP cores; each core multiplies its own block of rows by the BPM error vector and writes its own slice of the corrector error vector.
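Row partitioning needs no inter-core communication during the computation, because the row ranges are disjoint; the only cost is the final synchronization. A sketch of the per-core work function (on the C6678 each core would run this concurrently on its own range; the signature is an assumption for illustration):

```c
#include <stddef.h>

/* Core `core` of `ncores` computes rows [lo, hi) of r = A * b.
   The ranges are disjoint, so cores never touch each other's output
   and no locking is needed mid-computation. */
void multiply_rows(const float *A, const float *b, float *r,
                   size_t rows, size_t cols, size_t core, size_t ncores)
{
    size_t chunk = (rows + ncores - 1) / ncores;     /* ceiling division */
    size_t lo = core * chunk;
    size_t hi = (lo + chunk < rows) ? lo + chunk : rows;
    for (size_t i = lo; i < hi; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < cols; j++)
            acc += A[i * cols + j] * b[j];
        r[i] = acc;
    }
}
```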
Parallel Matrix Multiplication Benchmarks
MSM, L2 Cache Enabled (32K)

Matrix Dimensions | 1 core    | 4 cores   | 6 cores   | 8 cores*
8 x 660           | 2.6 usec  | 3.7 usec  | 4.9 usec  | 7.0 usec
160 x 660         | 98.4 usec | 27.5 usec | 21.0 usec | 19.0 usec

*Currently we want to reserve one core exclusively for receiving data and another for sending data; the 8-core numbers are included only as benchmarks.
Summary of Benchmarks
• The 8 x 660 matrix multiplication finishes in 2.6 usec (with 1 core)
• The 160 x 660 matrix multiplication finishes in 21.0 usec (with 6 cores, the current architecture) or 19.0 usec with 8 cores
• The hard theoretical limit for the 160 x 660 multiplication is 48 usec with 1 core, roughly 10 usec with 6 cores, and 6 usec with 8 cores, but these limits do not take memory constraints into account
REGULATOR OPTIMIZATIONS
Regulator System and Current Specs
The regulator maps corrector errors to corrector deltas. The filter and PID coefficients are the same for all correctors in the same plane.

Number of correctors | Time
8                    | 3.5712 usec
160                  | 72.3432 usec

Optimization idea: the filter and PID coefficients depend ONLY on the plane!
Optimized PID Performance:
Library weighted vector sum
PID constants are the same across correctors in the same plane
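Because the PID constants are scalars shared by the whole plane, the PID update over all corrector errors reduces to a weighted sum of the error, integral, and derivative vectors, which is the shape a library weighted-vector-sum routine expects. A minimal sketch (state layout and gain names are assumptions, not the production code):

```c
#include <stddef.h>

/* Per-plane PID over a whole vector of corrector errors. kp, ki, kd
   are scalars shared by every corrector in the plane, so the update is
   a weighted sum of three vectors: error, running integral, and
   finite-difference derivative. */
void pid_step(const float *err, float *integ, float *prev, float *delta,
              size_t n, float kp, float ki, float kd)
{
    for (size_t i = 0; i < n; i++) {
        integ[i] += err[i];                 /* accumulate the integral */
        float deriv = err[i] - prev[i];     /* derivative estimate */
        delta[i] = kp * err[i] + ki * integ[i] + kd * deriv;
        prev[i] = err[i];                   /* save for the next cycle */
    }
}
```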
Optimized Filter Implementation: Direct Form II
[Signal-flow diagram: the input r[n] feeds an internal state w[n] through a chain of z^-1 delays; the delayed states are weighted by the coefficients h[0], h[1], … to form the output t[n].]
Advantage: a minimal number of array accesses, which keeps the working matrices SMALL for FAST computation.
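Direct Form II shares one delay line w[n] between the feedback and feedforward halves of the filter, so each sample touches the state array only once per tap. A sketch for a second-order section (the filter order and coefficient names are assumptions; the slide's h[0], h[1] would correspond to the feedforward taps):

```c
/* Direct Form II, second-order IIR section:
     w[n] = r[n] - a1*w[n-1] - a2*w[n-2]      (feedback half)
     t[n] = b0*w[n] + b1*w[n-1] + b2*w[n-2]   (feedforward half)
   Only the two delayed states w1, w2 are read and written per sample. */
typedef struct {
    float a1, a2;       /* feedback coefficients */
    float b0, b1, b2;   /* feedforward coefficients */
    float w1, w2;       /* delayed internal states w[n-1], w[n-2] */
} df2_filter;

float df2_step(df2_filter *f, float r)
{
    float w = r - f->a1 * f->w1 - f->a2 * f->w2;
    float t = f->b0 * w + f->b1 * f->w1 + f->b2 * f->w2;
    f->w2 = f->w1;      /* shift the shared delay line */
    f->w1 = w;
    return t;
}
```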
Measured Regulator Results

# of correctors | Original Time | Optimized Time
8               | 3.5712 usec   | 0.628 usec (628 ns)
160             | 72.3432 usec  | 5.9015 usec

Speedup by a factor of 12!
Feedback System Timings: 8 correctors / 8 x 660 IRM
[Stacked bar chart of matrix multiplication, multiplication latency, and regulator times:]
• 1 core: 3.2 usec
• 4 cores: 4.3 usec
• 6 cores: 5.6 usec
• 8 cores: 7.7 usec
All configurations are well under the hard limit of 44 usec.
Note: additional time is needed to receive/send data.
Feedback System Timings: 160 correctors / 160 x 660 IRM
[Stacked bar chart of matrix multiplication, multiplication latency, and regulator times:]
• 1 core: 104.2 usec
• 4 cores: 33.4 usec
• 6 cores: 26.9 usec
• 8 cores: 24.9 usec
The hard limit of 44 usec is met with 4 or more cores.
Note: additional time is needed to receive/send data.
The 8 x 660 Picture
• 20 boards
• Each board takes 3.2 usec for
main operations
• Much less than 44 usec limit
The 160 x 660 Picture
• 1 board
• 26.9 usec for main operations
• Under the 44 usec limit
Tradeoffs to consider
• Time: how long do we need the system to settle after sending corrector deltas?
• Money: how much of a budget do we have to spend on boards?
• Chip Utilization: when does it make sense to use multiple cores? At what point are
there too many boards?
Summary of Results
• For the 160 x 660 IRM:
  • The regulator runs in 5.9 usec (faster than the original by a factor of 12)
  • Matrix multiplication runs in 21.0 usec (a 30% speedup from the optimized library implementation)
  • Total time: 26.9 usec < 44 usec
• Orbit Feedback Controller design options:
  • Option 1: 20 boards doing 8 x 660 computations – finishes the main computations in 3.2 usec
  • Option 2: 1 board doing the 160 x 660 computation – finishes the main computations in 26.9 usec – now a feasible idea!
• Additional data: benchmarks for other matrix sizes (multiplication and regulator) give a more detailed picture of the different tradeoffs and timings.
Further Optimizations
• Matrix Multiplication
• Sequential: investigate further memory optimizations – measured times are nearly double
the theoretical hard limit (which does not take into account memory constraints)
• Meeting hard theoretical limit would allow chip to finish 160 x 660 computations in ~ 8
usecs, with total time ~ 14 usecs.
• Parallel: minimize latency of synchronizing cores after computations are done: ~ 3 usecs
saved
• Regulator
• Currently sequential. Parallel estimated time for 6 cores is 1.2 usecs, not taking into
account latency
Questions/Comments?
Contact Information:
Ani Sridhar
Email:
asridhar@anl.gov
asridha1@andrew.cmu.edu
I have a spreadsheet with more detailed results and data, as well as a full report on these results. Email me if you want access!
