Real-time processing for ATST

http://www.dur.ac.uk/cfai/adaptiveoptics/rtc2011/agenda/abstracts/#VV1

Vivek Venugopal (National Solar Observatory): Real-time control for the Advanced Technology Solar Telescope (20 minutes)

Real-time processing for adaptive optics (AO) systems is challenging because the motion vectors must be computed, and the mirrors actuated, before the wavefront information becomes obsolete. The four-meter Advanced Technology Solar Telescope (ATST) will provide unprecedented resolution for solar observation owing to its larger aperture. The ATST AO system, with a 2 kHz frame-rate camera, 1750 sub-apertures and 1900 actuators, requires massively parallel processing, and this computational demand is beyond what conventional processors can manage. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) are better equipped to meet the parallel processing requirements of such a system. We investigate the implementation of the data processing architecture for Shack-Hartmann correlation and wavefront reconstruction using FPGAs and GPUs. We benchmark the AO algorithm implemented on FPGAs and GPUs and compare it with the legacy FPGA and Digital Signal Processing (DSP) based hardware system used at the 76 cm Dunn Solar Telescope (DST).

Real-time processing for ATST: Presentation Transcript

  • RTC Workshop, Durham, UK, April 2011. Real-time processing for the Advanced Technology Solar Telescope. Vivek Venugopal (vivekv@nso.edu), National Solar Observatory, Sunspot, New Mexico, USA. Wednesday, April 13, 2011
  • Advanced Technology Solar Telescope
  • Adaptive Optics system
    [Block diagram. Labeled elements: uncorrected light, tip/tilt mirror, deformable mirror (DM), beamsplitter, corrected light, Shack-Hartmann lenslet array, CCD camera, processors, tilt drive signal, DM drive signal.]
  • HOAO Real-time system
    [Block diagram, partitioned between FPGA and GPU. Processing blocks: WFS camera, cross-correlation, slope computation, offscale slope detection, matrix multiply, actuator servos, deformable mirror, average slope, tip/tilt servos, tip/tilt mirror, data collection, Zernike offload process. Inputs: dark field, flat field, reference image, slope offsets, offscale slope tolerance, reconstruction matrix, actuator offsets, actuator gains, servo parameters.]
  • Camera format
    [Figure: channel map of the camera, drawn as two halves of 480 columns x 960 rows, with 12 channels per FPGA for each half.]
    • 12 channels processed per FPGA
    • 5 packets to receive a complete row
  • Pixel unpacking
    [Figure: bit map of one packet (bytes 0 to 19, bits 0 to 159) showing where the bits of pixels 0 to 15 are placed.]
    • FPGA receives camera data using the fiber channel interface through 12 transceivers, with a 9.42 ns clock period
    • Pixel unpacking implemented using an FSM with 2 modes (10 states/mode)
    • 16 pixels (10 bits/pixel) written to FIFO
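The unpacking step on the slide above turns one 20-byte packet into 16 pixels of 10 bits each. The sketch below illustrates that idea in Python under a simplifying assumption: pixel 0 sits in the least significant 10 bits and the pixels follow contiguously. The real camera interleaves the bits as shown in the slide's bit map, which this sketch does not reproduce, and the names `pack_pixels` and `unpack_pixels` are hypothetical.

```python
def pack_pixels(pixels, bits_per_pixel=10):
    """Pack pixel values into bytes, pixel 0 in the least significant bits.

    Assumed contiguous layout, NOT the camera's interleaved layout.
    """
    mask = (1 << bits_per_pixel) - 1
    word = 0
    for index, value in enumerate(pixels):
        word |= (value & mask) << (index * bits_per_pixel)
    total_bits = len(pixels) * bits_per_pixel
    return word.to_bytes((total_bits + 7) // 8, "little")


def unpack_pixels(packet, pixel_count=16, bits_per_pixel=10):
    """Recover pixel_count fixed-width pixels from a packed packet."""
    expected = (pixel_count * bits_per_pixel + 7) // 8
    if len(packet) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(packet)))
    mask = (1 << bits_per_pixel) - 1
    word = int.from_bytes(packet, "little")
    return [(word >> (i * bits_per_pixel)) & mask for i in range(pixel_count)]


# Round trip: 16 sample values, each below 2**10.
sample = [i * 64 + 3 for i in range(16)]
packet = pack_pixels(sample)
print(len(packet), unpack_pixels(packet) == sample)
```

In hardware the same extraction is done by the state machine a few bits at a time; the big-integer shift here only shows the bookkeeping (16 x 10 bits = 160 bits = 20 bytes).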
  • Dark and flat correction
    [Diagram, repeated per pixel lane: pixel (10 bits) minus dark_pixel (8 bits) gives a 10-bit difference; the difference times flat_pixel (8 bits) gives flat_product (18 bits); an accumulator produces flat_acc.]
    • Dark pixel and flat pixel stored in RAM
    • Flat corrected product is concatenated and written to FIFO
    • Flat accumulated value can be used to update the reference image
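A minimal sketch of the per-pixel arithmetic on the dark and flat correction slide, using the bit widths shown there (10-bit pixel, 8-bit dark and flat values, 18-bit product). Clamping negative differences to zero and the lane ordering in the concatenated word are assumptions made for illustration; the function names are hypothetical.

```python
PIXEL_BITS, DARK_BITS, FLAT_BITS, PRODUCT_BITS = 10, 8, 8, 18


def dark_flat_correct(pixel, dark_pixel, flat_pixel):
    """Return (pixel - dark_pixel) * flat_pixel as an 18-bit value."""
    if not 0 <= pixel < (1 << PIXEL_BITS):
        raise ValueError("pixel out of 10-bit range")
    if not 0 <= dark_pixel < (1 << DARK_BITS):
        raise ValueError("dark_pixel out of 8-bit range")
    if not 0 <= flat_pixel < (1 << FLAT_BITS):
        raise ValueError("flat_pixel out of 8-bit range")
    difference = max(pixel - dark_pixel, 0)  # assumption: clamp at zero
    return difference * flat_pixel           # 10-bit x 8-bit fits in 18 bits


def correct_packet(pixels, darks, flats, accumulators):
    """Correct one packet of pixels and update the per-lane accumulators.

    Returns the products concatenated into one word, lane 0 in the low bits
    (the lane order is an assumption).
    """
    word = 0
    for lane, (p, d, f) in enumerate(zip(pixels, darks, flats)):
        product = dark_flat_correct(p, d, f)
        accumulators[lane] += product
        word |= product << (lane * PRODUCT_BITS)
    return word


accumulators = [0] * 16
word = correct_packet([1023] * 16, [0] * 16, [255] * 16, accumulators)
print(word.bit_length())  # 16 lanes x 18 bits
```

With 16 lanes of 18 bits the concatenated word is 288 bits wide, which matches the 288-bit bus width that appears in the deck's block diagrams.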
  • Pixel unpacking & Dark and flat correction
    [Block diagram: each of the 12 channels (1/2 camera) has a receiver, FIFO, data unpack stage and dark-flat correction/accumulator stage, with bus widths labeled 128, 160 and 288 bits, feeding the PCIe system bus. A synchronizer/counters block and RAMs holding the dark and flat reference image values serve all channels. Timing labels: 206.8 ns and 20 ns.]
    Clocks: clock period = 9.42 ns (clock rate = 106.15 MHz) and clock period = 5 ns (clock rate = 200 MHz)
  • Nvidia Tesla C2050 GPU
    [Diagram: multiprocessors 1 to 14, each with an instruction cache, two warp schedulers with dispatch units, a register file, cores, load/store units 1 to 16, SFUs 1 to 4, an interconnection network, 64 KB shared memory/L1 cache and a uniform cache.]
    • Nvidia Tesla C2050: 14 streaming multiprocessors with 32 cores each (SIMD), clocked at 1.15 GHz
    • 3 GB on-board RAM
    • Kernel-based execution
    • 1.288 TFLOPS single precision
    • 515.2 GFLOPS double precision
    Reference: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
  • Process mapping and partitioning
    [Diagram: processing chain partitioned between FPGA and GPU: dark correction, flat correction, 2D cross-correlation, find x and y maximum, interpolation. Inputs: raw pixels 20x20, dark pixels 20x20, flat pixels 20x20, reference pixels 20x20.]
  • Correlation routines
    1. FFT correlation: the flat corrected image and the reference image each go through an FFT, followed by complex conjugate multiplication and an IFFT.
    2. 7x7 correlation: from the original reference image (26x26 pixels), 49 reference regions of 20x20 pixels (Region 1 to Region 49) are precomputed.
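The second routine above, the 7x7 correlation, can be sketched directly: a 20x20 image is multiplied element by element with each of the 49 shifted 20x20 regions of the 26x26 reference, and each region's products are summed. This is a plain-Python illustration with a hypothetical function name, not the GPU or FPGA implementation from the talk.

```python
def correlate_7x7(image, reference, search=7):
    """Return a search x search grid of sum-of-products correlation values.

    image:     size x size list of lists (20x20 in the talk)
    reference: (size + search - 1) square list of lists (26x26 in the talk)
    """
    size = len(image)
    if len(reference) != size + search - 1:
        raise ValueError("reference must be size + search - 1 pixels wide")
    corr = [[0] * search for _ in range(search)]
    for dy in range(search):
        for dx in range(search):
            total = 0
            for r in range(size):
                image_row = image[r]
                reference_row = reference[dy + r]
                for c in range(size):
                    total += image_row[c] * reference_row[dx + c]
            corr[dy][dx] = total
    return corr


# Example: a single bright pixel in the reference, and an image cut out of
# the reference at offset (dy=2, dx=5). The peak lands at that offset.
reference = [[0] * 26 for _ in range(26)]
reference[13][13] = 9
image = [row[5:25] for row in reference[2:22]]
corr = correlate_7x7(image, reference)
peak = max((value, dy, dx) for dy, row in enumerate(corr) for dx, value in enumerate(row))
print(peak)  # (81, 2, 5)
```

Each 20x20 sub-aperture costs 49 x 400 = 19,600 multiply-accumulates this way, which is small enough to compete with an FFT of the same image.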
  • find_max and interpolation routines
    • Find the maximum value and its index
    • Find x and y shifts using the interpolation equations, where $m$ is the max value and $(y_s, x_s)$ are the shifted y and x indices of the maximum in the correlation output $\mathrm{out}$:
      $\mathrm{num}_x = m - \mathrm{out}(y_s, x_s - 1)$
      $\mathrm{den}_x = 2m - \mathrm{out}(y_s, x_s - 1) - \mathrm{out}(y_s, x_s + 1)$
      $x = (x_s - 0.5) + \mathrm{num}_x / \mathrm{den}_x$
      $\mathrm{num}_y = m - \mathrm{out}(y_s - 1, x_s)$
      $\mathrm{den}_y = 2m - \mathrm{out}(y_s - 1, x_s) - \mathrm{out}(y_s + 1, x_s)$
      $y = (y_s - 0.5) + \mathrm{num}_y / \mathrm{den}_y$
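The interpolation equations on the slide above amount to fitting a parabola through the peak and its two neighbours along each axis. A minimal Python sketch follows; it assumes the peak is not on the border of the correlation output, returns the integer index when the denominator is zero, and uses hypothetical function names.

```python
def find_max(out):
    """Return (max_value, y_index, x_index) of a 2D list of numbers."""
    best_value, best_y, best_x = out[0][0], 0, 0
    for y, row in enumerate(out):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_y, best_x = value, y, x
    return best_value, best_y, best_x


def interpolate_shift(out):
    """Return the sub-pixel (x, y) peak position using the slide's equations."""
    max_value, y, x = find_max(out)
    if not (0 < y < len(out) - 1 and 0 < x < len(out[0]) - 1):
        raise ValueError("peak on the border; neighbours are missing")
    num_x = max_value - out[y][x - 1]
    den_x = 2 * max_value - out[y][x - 1] - out[y][x + 1]
    num_y = max_value - out[y - 1][x]
    den_y = 2 * max_value - out[y - 1][x] - out[y + 1][x]
    # Assumption: a zero denominator (flat plateau) falls back to the index.
    x_shift = (x - 0.5) + num_x / den_x if den_x else float(x)
    y_shift = (y - 0.5) + num_y / den_y if den_y else float(y)
    return x_shift, y_shift


out = [
    [0, 5, 0],
    [4, 10, 8],
    [0, 5, 0],
]
print(interpolate_shift(out))  # (1.25, 1.0)
```

In the example the right-hand neighbour (8) is larger than the left-hand one (4), so the x estimate moves from 1 toward 2, while the symmetric vertical neighbours leave y at exactly 1.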
  • GPU results
    [Two bar charts comparing Tesla C1060 and Tesla C2050; y-axis: time in us; x-axis: no. of images.
    FFT correlation (1 and 50 images), bar labels in transcript order: 1889, 1619, 1510, 1188.
    7x7 correlation (1, 50 and 584 images), bar labels in transcript order: 313, 307, 301, 278, 279, 281.]
    Note: lower time indicates better performance
  • Reconstruction routine
    [Diagram: the x and y shifts for the 1750 sub-aperture images (a vector of 3500 values) are multiplied by the 1900x3500 reconstruction matrix to give accumulated values for the 1900 actuators.]
    [Bar chart, log scale; y-axis: time in us; x-axis: devices (Tesla C1060, Tesla C2050, DSP, CPU); bar labels in transcript order: 46769, 964, 956, 229.]
    • 1750 sub-aperture x and y shifts
    • 1900 x 3500 reconstruction matrix
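The reconstruction step above is a matrix-vector product: each actuator value is the dot product of one row of the reconstruction matrix with the slope vector. The sketch below shows this in plain Python on a toy matrix; the function name is hypothetical and the row-per-actuator layout is an assumption.

```python
ACTUATORS = 1900
SLOPES = 2 * 1750  # x and y shift for each of the 1750 sub-apertures

# Multiply-accumulate operations needed per frame for the full-size problem.
MACS_PER_FRAME = ACTUATORS * SLOPES


def reconstruct(recon_matrix, slopes):
    """Return one value per actuator: recon_matrix (rows = actuators) times slopes."""
    for row in recon_matrix:
        if len(row) != len(slopes):
            raise ValueError("matrix row length must match the slope vector")
    return [sum(w * s for w, s in zip(row, slopes)) for row in recon_matrix]


# Toy example: 2 actuators, 4 slopes.
toy_matrix = [
    [1, 0, 2, 0],
    [0, 1, 0, -1],
]
toy_slopes = [1, 2, 3, 4]
print(reconstruct(toy_matrix, toy_slopes), MACS_PER_FRAME)
```

At full size this is 6,650,000 multiply-accumulates per frame, or about 13.3 billion per second at the 2 kHz frame rate quoted in the abstract, and every row can be computed independently, which is why the step maps well onto parallel hardware.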
  • Xilinx design flow
    [Flow chart. Main path: design entry, design synthesis, design implementation (optimization, mapping, placement, routing), bitstream generation, download to Xilinx FPGA. Design verification alongside: functional simulation, static timing analysis, back annotation, timing simulation, in-circuit verification.]
  • Cross-correlation
    [Diagram: flat_product0 (18 bits) is multiplied by ref_pixel0 to ref_pixel48 (8 bits each, 392 bits of ref_pixel in total) to give xcorr_product0 to xcorr_product48 (26 bits each), which together form the 1274-bit xcorr_value_per pixel.]
    • Configure 400x392 (49x8 bits/pixel) RAM bank (RAM0-RAM19) with pre-computed reference pixels
    • Multiply each pixel with the corresponding reference pixel
  • Cross-correlation
    [Diagram: the same multiplier array repeated for flat_product0 to flat_product15. Each flat_product (18 bits) is multiplied by ref_pixel0 to ref_pixel48 (8 bits each, 392 bits in total) to give xcorr_product0 to xcorr_product48 (26 bits each), forming a 1274-bit xcorr_value_per pixel for each of the 16 pixels.]
    • Configure 400x392 (49x8 bits/pixel) RAM bank (RAM0-RAM19) with pre-computed reference pixels
    • Multiply each pixel with the corresponding reference pixel
  • Sub-aperture format
    [Figure: map of sub-aperture numbers (0 to 23) against channel number and pixel position within a row. Three subap_accumulator blocks, for channels #1, #2, #7, #8, channels #3, #4, #9, #10 and channels #5, #6, #11, #12, each take xcorr_pixel0 to xcorr_pixel15 (1274 bits each) and produce subap0_acc to subap23_acc (1715 bits each).]
    • Sub-aperture regions in 480 columns x 1 row per channel
    • Accumulate pixels per sub-aperture in each channel
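The cross-correlation and sub-aperture slides above describe a streaming form of the 7x7 correlation: each incoming corrected pixel is multiplied by its 49 pre-computed reference pixels, and the 49 products are added to 49 running sums for the sub-aperture. The Python sketch below illustrates that bookkeeping. The reading that 1274 bits means 49 x 26 and 1715 bits means 49 x 35 is an inference from the arithmetic, not something the slides state, and the function name is hypothetical.

```python
REGIONS = 49                  # 7x7 search positions
PRODUCT_BITS = 18 + 8         # 18-bit flat_product times 8-bit ref_pixel
PIXELS_PER_SUBAPERTURE = 400  # 20x20 pixels


def accumulate_subaperture(flat_products, ref_pixels):
    """Stream pixels through 49 multiply-accumulate lanes.

    flat_products: one corrected value per pixel of the sub-aperture
    ref_pixels:    for each pixel, the 49 reference values it is paired with
    """
    accumulators = [0] * REGIONS
    for flat_product, refs in zip(flat_products, ref_pixels):
        if len(refs) != REGIONS:
            raise ValueError("each pixel needs 49 reference values")
        for region in range(REGIONS):
            accumulators[region] += flat_product * refs[region]
    return accumulators


# Worst case: every pixel and every reference value at its maximum.
worst_products = [(1 << 18) - 1] * PIXELS_PER_SUBAPERTURE
worst_refs = [[255] * REGIONS] * PIXELS_PER_SUBAPERTURE
worst = accumulate_subaperture(worst_products, worst_refs)
print(REGIONS * PRODUCT_BITS, max(worst).bit_length(), REGIONS * 35)
```

The worst-case sum needs 35 bits (26 bits per product plus 9 bits of growth for 400 terms), and 49 accumulators of 35 bits give 1715 bits, which agrees with the width shown for subap_acc on the slide.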
  • Top level design
    [Block diagram. Per-channel blocks (channel1_top to channel12_top): FCFPGA, data unpack, Flatcorr_FIFO, dark_flat_acc_top, xcorr_pixel_channel. Shared blocks: address decoder (refim_fetch_addr_decoder), RAM bank (RAM0-RAM19), xcorr state machine (xcorr_sm), ch1278_subap_accumulator, ch561112_subap_accumulator, 24subap_12ch_accumulator. Signals: channel_cycle_count, subap_row_count, addr_decoder_ce, xcorr_pixel_ce, subap_acc_ce, subap_acc_12ch_ce, flat_fifo_rd. Data paths: bus widths 288, 288 and 160 bits; refim_in (392 bits) x16; xcorr_pixel (1274 bits) x16; subap_acc_out (1715 bits) x24.]
  • Synthesis estimates for Virtex-6 FPGA
    • Implement dark, flat correction only: resources used 288 out of 687,360 (1%)
    • Implement the correlation for a single channel up to the sub-aperture accumulator within the channel (without the final 12-channel accumulation): resources used 2,578 out of 687,360 (1%)
    Device utilization summary:
    Slice Logic Utilization:
      Number of Slice Registers: 992448 out of 687360 144% (*)
      Number of Slice LUTs: 1126081 out of 343680 327% (*)
        Number used as Logic: 1125853 out of 343680 327% (*)
        Number used as Memory: 228 out of 99200
          Number used as SRL: 37
  • FPGA timing
    [Timing diagram. Stages: Rxdata from transceiver; unpacked data written to FIFO; unpacked data read from FIFO; dark-flat output; input to xcorr_pixel module; output from xcorr_pixel; output from sub-aperture accumulator per channel. Interval labels in transcript order: 123.73 ns, 40 ns, 95 ns, 15 ns, 40 ns, 20 ns, 16 ns, 91 ns.]
    • Each data packet is available from the FIFO after 95 ns
    • 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO
    • Total latency for computing the 960 rows x 480 columns = 4.75 us * (960/20) = 228 us
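The latency arithmetic on the FPGA timing slide can be checked in a few lines. The constant names below are hypothetical; the slide does not explain the factors of 10 rows and 960/20, so they are carried over as given.

```python
PACKET_NS = 95        # time for one data packet to become available from the FIFO
PACKETS_PER_ROW = 5   # packets needed for a complete row
ROWS_PER_READ = 10    # factor used on the slide
TOTAL_ROWS = 960
ROW_DIVISOR = 20      # factor used on the slide

# Work in integer nanoseconds to avoid rounding.
read_ns = PACKET_NS * PACKETS_PER_ROW * ROWS_PER_READ
total_ns = read_ns * (TOTAL_ROWS // ROW_DIVISOR)

print(read_ns / 1000, "us per read;", total_ns / 1000, "us in total")
```

This reproduces the 4.75 us and 228 us figures quoted on the slide.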
  • GPU vs FPGA vs DSP
    [Timeline diagram. GPU path: camera readout, data transfer through PCIe x16, then C2050 GPU 1, C2050 GPU 2 and C2050 GPU 3; interval labels 100 us, 225 us and 300.93 us.]
    • C2050 GPU throughput = 525.93 us
    • FPGA throughput = 250 us
    • DSP (96 DSPs) throughput = 495 us
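One way to read the three figures above is against the frame period of the 2 kHz camera mentioned in the abstract. The sketch below makes that comparison under two assumptions that the slide does not state: that each "throughput" figure is the time to process one frame, and that a single processing path has to finish within one frame period. The names are hypothetical.

```python
FRAME_RATE_HZ = 2000
FRAME_PERIOD_US = 1000000 // FRAME_RATE_HZ  # 500 us between frames

# Per-frame processing times quoted on the slide, in microseconds.
processing_time_us = {
    "GPU": 525.93,
    "FPGA": 250.0,
    "DSP": 495.0,
}

# Assumption: no pipelining across devices; the slide shows three GPUs in
# sequence, which could change this picture for the GPU case.
fits_in_frame = {
    device: time_us <= FRAME_PERIOD_US
    for device, time_us in processing_time_us.items()
}

for device, time_us in processing_time_us.items():
    margin = FRAME_PERIOD_US - time_us
    print(f"{device}: {time_us} us, margin {margin:+.2f} us")
```

Under these assumptions the FPGA leaves about half the frame period spare, the DSP system fits with 5 us of margin, and a single GPU path overruns the period by about 26 us.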
  • Conclusions
    • DSP: excellent performance but not cost-effective
    • GPU: fast SIMD architectures, suitable for certain tasks
    • FPGA: MIMD architectures, custom I/O, meets latency and throughput constraints
    Slide idea: David Pellerin, Impulse Accelerated Technology
  • Future work
    Resources               Virtex-6 XC6VLX550T   Virtex-7 XC7V2000T
    Slice logic resources   549,888               1,954,560
    I/O pins                840                   850
    GTX transceivers        36                    36
    • Investigate performance improvement after mapping the find_max, interpolation and reconstruction matrix calculation routines on the FPGA
    • Promising because of increased logic density in Virtex-7 FPGAs
    • Throughput sustained even if the processes are partitioned over multiple FPGAs
  • Discussion Questions
  • Backup
    Device utilization summary:
    Selected Device: 6vlx550tff1759-2
    Slice Logic Utilization:
      Number of Slice Registers: 992448 out of 687360 144% (*)
      Number of Slice LUTs: 1126081 out of 343680 327% (*)
        Number used as Logic: 1125853 out of 343680 327% (*)
        Number used as Memory: 228 out of 99200 0%
          Number used as SRL: 228
    Slice Logic Distribution:
      Number of LUT Flip Flop pairs used: 1509605
        Number with an unused Flip Flop: 517157 out of 1509605 34%
        Number with an unused LUT: 383524 out of 1509605 25%
        Number of fully used LUT-FF pairs: 608924 out of 1509605 40%
        Number of unique control sets: 221
    IO Utilization:
      Number of IOs: 88
      Number of bonded IOBs: 80 out of 840 9%
        IOB Flip Flops/Latches: 25
    Specific Feature Utilization:
      Number of BUFG/BUFGCTRLs: 36 out of 32 112% (*)
    WARNING:Xst:1336 - (*) More than 100% of Device resources are used
  • Pre-computed reference