• Save
Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures

on

  • 534 views

Presented on July 20, 2011 in Gold Room (Regular session) at 11:40 - 12:00pm ...

Presented on July 20, 2011 in Gold Room (Regular session) at 11:40 - 12:00pm

http://ersaconf.org/ersa11/program/wed.php

The International Conference on
Engineering of Reconfigurable Systems and Algorithms (ERSA'11) at Las Vegas Nevada USA

Statistics

Views

Total Views
534
Views on SlideShare
532
Embed Views
2

Actions

Likes
0
Downloads
0
Comments
0

2 Embeds 2

http://www.linkedin.com 1
http://www.slashdocs.com 1

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures Presentation Transcript

  • 1. ERSA, Las Vegas, Nevada, July 2011Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures Vivek Venugopal (vivek@vivekvenugopal.net) National Solar Observatory, Sunspot, New Mexico
  • 2. Advanced Technology Solar Telescope 2
  • 3. Adaptive Optics system Uncorrected Tip/Tilt light Mirror Deformable Mirror (DM) Tilt drive signal DM drive signal CorrectedProcessors Beamsplitter light Shack-Hartmann Lenslet Array CCD Camera 3
  • 4. Sub-apertures 4
  • 5. HOAO Real-time system Actuator gains Offscale Recon- Dark Reference slope Slope struction Actuator Flat field image field tolerance offsets matrix offsets Deformable mirror Cross- Offscale WFS correlation Matrix ActuatorCamera X slope slope detection X multiply servos Servo computation parameters Average Tip/Tilt slope servos Tip/Tilt mirror Data Zernike collection offload process • 1750 sub-apertures and 1900 actuators 5
  • 6. Camera data formatcamera data half camera data half960 x 480 pixels 960 x 480 pixels • Camera data consists of two halves of 960x480 pixels • Each half of camera data sent to FPGA using 12 channels 6
  • 7. Scenario 1:FPGA-DSP 96 DSPs Camera FPGA 1 data half 12 optical 12 fiber channels channels Camera FPGA 2 data half 12 optical 12 fiber channels channels• Pixel unpacking task - FPGA• Processing - DSPs 7
  • 8. Scenario 2:FPGA-DSP 48 DSPs Camera FPGA 1 data half 12 optical 12 fiber channels channels Camera FPGA 2 data half 12 12 optical fiber channels channels• Pixel unpacking, dark and flat correction- FPGA• Cross-correlation and reconstruction matrix processing - DSPs 8
  • 9. Dark and flat correction pixel0 10 • Dark pixel and flat pixel stored in - 10 RAMdark_pixel 8 8 x 18 flat_product0 • Flat corrected product is flat_pixel 8 accumulator 8 concatenated and written to flat_acc1 pixel 1 10 FIFO - 10 • Flat accumulated value can be used to update the referencedark_pixel 8 flat_pixel 8 x 8 18 flat_product1 image 8 accumulator flat_acc1pixel16 10 - 10dark_pixel 8 flat_pixel 8 x 8 18 flat_product16 8 accumulator flat_acc16 9
  • 10. Pixel unpacking & Dark and flat correction Synchronizer/ counters dark and flat reference image value RAM RAM 206.8 ns 20 ns 256 channel 1 128 Data 160 Dark-flat correction/ Receiver FIFO unpack accumulator 16 160 288 channel 2 PCIe system bus 128 Data 160 Dark-flat correction/12 channels Receiver FIFO1/2 camera unpack accumulator 16 160 288 channel 12 128 Data 160 Dark-flat correction/ Receiver FIFO unpack accumulator 16 160 288 clock period = 9.42 ns clock period = 5 ns clock rate = 106.15 MHz clock rate = 200 MHz 10
  • 11. Scenario 3:FPGA-GPU or FPGA-CPU Camera FPGA 1 data half 12 optical fiber PCI-e bus channels GPU/CPU Camera FPGA 2 data half 12 optical fiber channels• Pixel unpacking, dark and flat correction - FPGA• Cross-correlation and reconstruction matrix processing - GPU or CPU 11
  • 12. Nvidia Tesla C2050 GPU Multiprocessor 14 • Nvidia Tesla C2050: 14 streaming multi-processors Multiprocessor 2 with 32 cores each (SIMD) Multiprocessor 1 Instruction Cache clocked at 1.15 GHz Warp Scheduler Warp Scheduler • 3 GB on-board RAM Dispatch Unit Dispatch Unit • Kernel-based execution Register File • 1.288 TFLOPS single Core 1 Core 2 Core 1 Core 2 Load/ Store 1 SFU 1 precision Load/ SFU 2 Core 3 Core 4 Core 3 Core 4 Store 2 • 515.2 GFLOPS double SFU 3 Load/ precision Core 15 Core 16 Core 15 Core 16 SFU 4 Store 16 Interconnection Network 64 KB Shared Memory/ L1 cache Uniform CacheReference: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf 12
  • 13. Process mapping and partitioning Raw Flat Reference pixels pixels pixels 20x20 20x20 20x20 FPGA GPU Dark find x and y dark flat 2D cross-correlationpixels maximum interpolation correction correction20x20 13
  • 14. Correlation routines 1. FFT correlation 2. 7x7 correlation flat referencecorrected image image precomputed original reference Region 1 reference FFT FFT image 26x26 pixels (20x20 pixels) precomputed Region 2 reference Complex conjugate (20x20 pixels) Multiplication IFFT precomputed Region 49 reference (20x20 pixels) Precomputed Reference pixels 20x20 (49 regions) 14
  • 15. find_max and interpolation routines• Find the maximum value and itʼs index• Find x and y shifts using the interpolation equations num x = max value − out(shif ted y index, (shif ted x index − 1) den x = 2 ∗ max value − out(shif ted y index, (shif ted x index − 1)) −out(shif ted y index, (shif ted x index + 1)) num x x = (shif ted x index − 0.5) + den x num y = max value − out((shif ted y index − 1), shif ted x index) den y = 2 ∗ max value − out((shif ted y index − 1), shif ted x index) −out((shif ted y index + 1), shif ted x index)) num y y = (shif ted y index − 0.5) + den y 15
  • 16. GPU results Tesla C1060 FFT correlation Tesla C2050 7x7 correlation 2200 400 1889 313 307 301 1619 278 279 281 1650 1510 300Time in us Time in us 1188 1100 200 550 100 0 0 1 50 1 50 584 No. of images No. of imagesNote: Least time indicates better performance 16
  • 17. Reconstruction routine 1900 Tesla C1060 x y Tesla C20501750 1750 x DSP CPU x and y shifts for 1750 sub-aperture images 3500 100000 46769 reconstruction matrix 1900x3500 10000 964 956 Time in us 1900 1000 229 accumulated values for 1900 actuators 100 10 • 1750 sub-aperture x and y shifts • 3500 x 1900 reconstruction matrix 1 Devices 17
  • 18. Scenario 4:FPGA-GPU or FPGA-CPU Camera FPGA 1 data half 12 optical fiber PCI-e bus channels GPU/CPU Camera FPGA 2 data half 12 optical fiber channels• Pixel unpacking, dark and flat correction, cross-correlation - FPGA• Reconstruction matrix processing - GPU or CPU 18
  • 19. Cross-correlation 18 • Configure 400x392 (49x8 bits/ flat_product0 pixel) RAM bank (RAM0-RAM19) 18 8 x 26 xcorr_product0 with pre-computed referenceflatcorr_value pixels ref_pixel0 392 • Multiply each pixel with 18 ref_pixel corresponding reference pixel flat_product0 8 x 26 xcorr_product1 1274 xcorr_value_per pixel ref_pixel1 18 flat_product0 8 x 26 xcorr_product48 ref_pixel48 19
  • 20. Sub-aperture formatChannel # Channel # 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 1 0 1 1 0 1 3 8 3 8 3 8 3 8 2 7 2 7 2 7 2 7 1 6 1 6 1 6 1 6 0 5 0 5 0 4 0 4 • Sub-aperture regions in 480 columns x 2 2 3 2 2 3 13 18 13 18 13 18 13 18 12 17 12 17 12 16 12 16 11 15 11 15 10 15 10 15 9 14 9 14 9 14 9 14 1 row per channel 4 4 23 23 22 22 21 21 21 21 20 20 20 20 19 19 19 19 0 0 4 4 4 4 3 3 2 2 1 1 1 1 0 0 0 0 • Accumulate pixels per sub-aperture in 3 4 1 2 3 4 1 2 9 13 9 13 8 13 8 13 7 12 7 12 7 12 7 12 6 11 6 11 6 11 6 11 5 10 5 10 5 10 5 10 each channel 3 3 18 18 18 18 17 17 17 17 16 16 16 16 15 15 14 14 1274 1715 4 4 23 23 23 23 22 22 22 22 21 21 20 20 19 19 19 19 xcorr_pixel0 subap0_acc 1274 1715 0 0 4 4 4 4 3 3 3 3 2 2 2 2 1 1 0 0 xcorr_pixel1 subap1_acc subap_accumulator 5 1 5 1 9 9 9 9 8 8 8 8 7 7 6 6 5 5 5 5 channel #1,#2,#7,#8 6 2 6 2 14 14 14 14 13 13 12 12 11 11 11 11 10 10 10 10 3 3 1274 1715 19 19 18 18 17 17 17 17 16 16 16 16 15 15 15 15 xcorr_pixel15 subap23_acc 4 4 23 23 23 23 22 22 22 22 21 21 21 21 20 20 20 20 0 0 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 0 1274 1715 7 1 7 1 8 8 8 8 7 7 7 7 6 6 6 6 5 5 4 4 xcorr_pixel0 subap0_acc 8 2 8 2 13 13 13 13 12 12 12 12 11 11 10 10 9 9 9 9 1274 1715 3 3 18 18 18 18 17 17 16 16 15 15 15 15 14 14 14 14 xcorr_pixel1 subap1_acc subap_accumulator 4 4 23 23 22 22 21 21 21 21 20 20 20 20 19 19 19 19 channel #3,#4,#9,#10 0 0 4 4 4 4 3 3 2 2 1 1 1 1 0 0 0 0 1274 1715 xcorr_pixel15 subap23_acc 9 1 9 1 9 9 8 8 7 7 7 7 6 6 6 6 5 5 5 5 10 2 10 2 13 13 13 13 12 12 12 12 11 11 11 11 10 10 10 10 3 3 18 18 18 18 17 17 17 17 16 16 16 16 15 15 14 14 4 4 23 23 23 23 22 22 22 22 21 21 20 20 19 19 19 19 1274 1715 xcorr_pixel0 subap0_acc 0 0 4 4 4 4 3 3 3 3 2 2 2 2 1 1 0 0 1274 1715 11 1 11 1 9 9 9 9 8 8 8 8 7 7 6 6 5 5 5 5 xcorr_pixel1 subap1_acc subap_accumulator 12 2 12 2 14 14 14 14 13 13 12 12 11 11 11 11 10 10 10 10 channel #5,#6,#11,#12 3 3 19 19 18 18 17 17 17 17 16 16 16 16 15 15 15 15 4 4 23 23 23 23 22 22 22 22 21 21 21 21 20 20 20 20 1274 1715 xcorr_pixel15 subap23_acc 20
  • 21. Timing Rxdata from transceiver unpacked data 123.73 ns written to FIFO 40 ns unpacked data read 95 ns from FIFO 15 ns dark-flat output 40 ns input to xcorr_pixel module 20 ns output from xcorr_pixel 16 ns output from sub-aperture accumulator per channel 91 ns• Each data packet is available from the FIFO after 95 ns• 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO 21
  • 22. Timing with find_max & interpolationsub-aperture 2*max_valueaccumulated max_value 36 value 35 subap value 1715 find_max max_index max_index - 1 x_shift1 x 1715 6 max_index + 1 x_shift2 35 32 index index x and y calculation max_index - 7 decoder y_shift1 35 shift calc y max_index + 7 y_shift2 35 32 35 shifted_x - 0.5 shifted index 32 calculation shifted_y - 0.5 32 5.84 us output from sub- aperture accumulationacross 6 channels 0.060 us 0.060 us xy shifts 22
  • 23. GPU vs FPGA vs DSP100 us 225 us 300.93 usCamerareadout Data transfer through PCIe x16 C2050 GPU 1 C2050 GPU 2 C2050 GPU 3 C2050 GPU throughput = 525.93 us FPGA FPGA throughput = 280 us DSP 96 DSPs throughput = 495 us Camera readout Data transfer through PCIe x16 C2050 GPU 1 23
  • 24. Conclusions GPU FPGA• DSP: excellent performance but not cost-effective• GPU: fast SIMD architectures - suitable for certain tasks• FPGA: MIMD architectures, custom I/O, meets latency and throughput constraintsSlide idea: David Pellerin, Impulse Accelerated Technology 24
  • 25. Future work Virtex-6 Virtex-7 Resources XC6VLX550T XC7V2000T Slice logic resources 549,888 1,954,560 I/O pins 840 850 GTX transceivers 36 36• Investigate performance improvement after partitioning 3 channels between FPGAs. Total of 5 FPGAs for processing each half of camera data• Throughput sustained even if the processes are partitioned over multiple FPGAs• Promising because of increased logic density in Virtex-7 FPGAs 25
  • 26. Discussion QuestionsEmail: vivek@vivekvenugopal.net 26
  • 27. Top level design channel_cycle_count 288 288 160 subap_row_count refim_fetch_addr_d RAM bank (RAM0- FCFPGA dark_flat_acc_top Flatcorr xcorr_pixel_channel ch1278_subap_accumulator ecoder RAM19) _FIFO addr_decoder_ce subap_acc_out (1715 bits) x24 address decoder data unpack xcorr_pixel refim_in (1274 bits) x16xcorr_sm xcorr_pixel_ce (392 bits) x16 subap_acc_ce channel1_top subap_acc_12ch_cexcorr state flat_fifo_rd machine subap_acc_out 24subap_12ch_ (1715 bits) x24 accumulator 288 288 160 FCFPGA dark_flat_acc_top Flatcorr xcorr_pixel_channel ch561112_subap_accumulator _FIFO subap_acc_out xcorr_pixel (1715 bits) x24 data unpack refim_in (1274 bits) x16 (392 bits) x16 channel12_top 27
  • 28. Synthesis estimates for Virtex-6 FPGA• Implement dark, flat correction only : resources used 288 out of 687,360 (1%)• Implement the correlation for single channel up to the sub-aperture accumulator within the channel (without the final 12 channel accumulation) : resources used 2,578 out of 687,360 (1%) Device utilization summary: Slice Logic Utilization: Number of Slice Registers: 992448 out of 687360 144% (*) Number of Slice LUTs: 1126081 out of 343680 327% (*) Number used as Logic: 1125853 out of 343680 327% (*) Number used as Memory: 228 out of 99200 Number used as SRL: 37 28
  • 29. Virtex-6 FPGA Board /-5 #78$")*+,-.+ 01Y57 _/]^d^ 01Y57] 90:0& 01Y570 /01 #2. % !"#$%&&($")*+,-.+ %&)-L !"#$%&&($")*+,-.+ %&)-L !"#$%&&($")*+,-.+ 23 %$V%$$V%$$$ 1@S /.++0/1.&$23445*-+6 0OJA@ /.++0/1.&$23445*-+6 0OJA@ /.++0/1.&$23445*-+6 /014 Ded&*7 9&*7-`c? ) #2. !" !" !" !" !" !" 01Y575 /565788 01Y57F 9:;<=>;? # # _/]^dD #2. #2. 01Y57. !"#$ !"#$ !"#$ ;;<=$>?;@!! # #2. # #2. ;;<=$>?;@!! 23#A$!)6 ! D C 23#A$!)6 *) "$ 9@2AB? %+$ %%#X&)*7-`c %+$ %&()*+, %&()*+, %&()*+, >?@A&B<!"#$ &# -./0123-.4,52 *) -./0123-.4,52 "$ -./0123-.4,52 -`c 03;RM;H>S 6.4752 6.4752 6.4752 /SHB@;A=c;3 Y$ &$ #$ -.551236.0852 -.551236.0852 -.551236.0852 9/=*+&"? 9&e`c 9!!785:; 9!!785:; 9!!785:; B27$$7-`c? &$ #$ %%#X&)*7-`c #$ #2. ) # #2. #2. &# -`c 03;RM;H>S !"#$<= #$ #$ /SHB@;A=c;3 Y% %&()*+5 #$ %& %" 9/=*+&"? 9&e`c >?@A&B<!"#$ #$ #$ #$ #( # !* !* &$ &$ + + #2. #2. B2$$7-`c? #$ %& ,-. %%#X&)*7-`c ">F) # #2. # #2. &# ;;<=$>?;@!! -`c 03;RM;H>S !"#$ !"#$ !"#$ ;;<=$>?;@!! 23#A$!)6 /SHB@;A=c;3 Y& # #2. # #2. > E $ 23#A$!)6 9/=*+&"? 9&e`c %&$ %$$ %+$ %+$ B27$$7-`c? %&()*+, %&()*+, %&()*+, -./0123-.4,52 -./0123-.4,52 -./0123-.4,52 %&$ %$$ %&*7-`c 6.4752 6.4752 6.4752 %*$7-`c -.551236.0852 &$ -.551236.0852 #$ -.551236.0852 W/ &*$7-`c 9!!785:; 9!!785:; 9!!785:; +%&X*7-`c -Y67O2>U &$ #$ Y$ #$ ) ) #2. -117.MA #2. ">F) W/ #$ 9#)@7; 0<+<GH@)I #78$")*+,-.+ %&)-L #78$")*+,-.+ %&)-L 90:0& 0D5/` 90:0& 0OJA@%$V%$$V%$$$ %$V%$$V%$$$ ^Y-88 LJA;6 1@S ^b#* JHK)GG<J%8L/11 "# BCD!$)$E3 ;;<C C7DEF/7G@;H7IJ=3;:K7LMB7>JH7L;73MH ^/&+& 01_ 01_ 777A=HNO;P;H:;:7JB73;:M>;:7Q3;RM;H>S F-59#[? + %&)-L_/.7&X$ 1_ 1_ _/. /18 9+[? .22B70D5/` /565 18; #2. PPT71J>U;B78VW763JHA>;=<;3A #POJH;A ) &*"-L 77777"X*7YLVA7I;37>@JHH;O /565788 ^6 18;79Y],%? ,5,F70D5/` 77777L=:=3;>B=2HJO79%+7YLVA7ZJ[? 9@2AB? .22B &[ 187]a1^]// DiniFPGA board CM%,!,">F) !"#$%&()*+), -./.0 29