
Atc On An Simd Cots System Wmpp05



Published in: Technology, Education

  1. 1. Air Traffic Control on a SIMD COTS System Stewart Reddaway World Scape Inc, Marlton, NJ Will Meilander (retired) Johnnie Baker Justin Kidman Kent State University, Dept Computer Science, OH IPDPS Denver, April 2005
  2. 2. Abstract <ul><li>Air Traffic Control is a demanding real-time application </li></ul><ul><li>Current systems have been: </li></ul><ul><ul><li>expensive </li></ul></ul><ul><ul><li>late </li></ul></ul><ul><ul><li>over-budget </li></ul></ul><ul><ul><li>not up to specification </li></ul></ul><ul><ul><li>complex in both algorithms and software </li></ul></ul><ul><li>Modest-sized SIMD COTS systems will enable: </li></ul><ul><ul><li>guaranteed real-time performance </li></ul></ul><ul><ul><li>simpler algorithms </li></ul></ul><ul><ul><li>substantially simpler and cheaper hardware </li></ul></ul><ul><li>We cover: </li></ul><ul><ul><li>system & application approach </li></ul></ul><ul><ul><li>some solution details </li></ul></ul>
  3. 3. Introduction
  4. 4. A previous analysis at Kent State (1) <ul><li>Two tables from Meilander, Jin and Baker [2002] </li></ul><ul><li>ATC – Worst-Case Environment </li></ul><ul><li>Reports per second 12,000 </li></ul><ul><li>IFR flights 4,000 </li></ul><ul><li>VFR/backup flights 10,000 </li></ul><ul><li>Controllers 600 </li></ul><ul><li>There are thus up to </li></ul><ul><li>14000 flights in all </li></ul><ul><li>4000 flights controlled by this system </li></ul><ul><li>12000 radar reports/sec (6000 dealt with in each 0.5 sec) </li></ul>Static scheduling is key to guaranteed real-time database operation
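  5. 5. A previous analysis at Kent State (2) Statically Scheduled Solution Time
     <table>
     <tr><th>Task</th><th>p</th><th>j</th><th>c</th><th>d</th><th>Proc time</th></tr>
     <tr><td>1. Report Correlation & Tracking</td><td>.5</td><td>15</td><td>.09</td><td>.10</td><td>1.44</td></tr>
     <tr><td>2. Cockpit Display (750/sec)</td><td>1.0</td><td>120</td><td>.09</td><td>.20</td><td>.72</td></tr>
     <tr><td>3. Controller Display Update (7500/sec)</td><td>1.0</td><td>12</td><td>.09</td><td>.30</td><td>.72</td></tr>
     <tr><td>4. Aperiodic Requests (200/sec)</td><td>1.0</td><td>250</td><td>.05</td><td>.36</td><td>.4</td></tr>
     <tr><td>5. Automatic Voice Advisory (600/sec)</td><td>4.0</td><td>75</td><td>.18</td><td>.78</td><td>.36</td></tr>
     <tr><td>6. Terrain Avoidance</td><td>8.0</td><td>40</td><td>.32</td><td>2.93</td><td>.32</td></tr>
     <tr><td>7. Conflict Detection & Resolution</td><td>8.0</td><td>60</td><td>.36</td><td>3.97</td><td>.36</td></tr>
     <tr><td>8. Final Approach (100 runways)</td><td>8.0</td><td>33</td><td>.2</td><td>6.81</td><td>.2</td></tr>
     <tr><td>Summation of Tasks in a period P</td><td></td><td></td><td></td><td></td><td>4.52</td></tr>
     </table>
     <ul><li>The system period P (in which all tasks must be completed) is 8 seconds </li></ul><ul><li>p is the task period time; it determines the next task release time r<sub>i+1</sub> = r<sub>i</sub> + p </li></ul><ul><li>j is the execution time, in microseconds, for each jobset in a task </li></ul><ul><li>c is the cost for each task for the worst-case set of jobsets </li></ul><ul><li>d is the deadline time for each task, r<sub>i</sub> + c + .01 (includes 10 ms interrupt proc. per task) </li></ul>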
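The "Proc time" column on this slide can be cross-checked with a few lines of Python. The sketch below assumes (this is an inference from the numbers, not something the slide states) that each task's processing time within the 8-second system period is its cost c multiplied by the number of releases P/p. The tuple layout and the helper name `proc_time` are illustrative only.

```python
# Rows transcribed from the slide: (task, period p [s], cost c [s], listed proc time [s]).
P = 8.0  # system period in seconds, as stated on the slide

tasks = [
    ("Report Correlation & Tracking", 0.5, 0.09, 1.44),
    ("Cockpit Display", 1.0, 0.09, 0.72),
    ("Controller Display Update", 1.0, 0.09, 0.72),
    ("Aperiodic Requests", 1.0, 0.05, 0.40),
    ("Automatic Voice Advisory", 4.0, 0.18, 0.36),
    ("Terrain Avoidance", 8.0, 0.32, 0.32),
    ("Conflict Detection & Resolution", 8.0, 0.36, 0.36),
    ("Final Approach", 8.0, 0.20, 0.20),
]


def proc_time(p, c, system_period=P):
    """Assumed relation: cost c per release, system_period / p releases per period."""
    return c * system_period / p


# Every row should reproduce the listed "Proc time" value.
for name, p, c, listed in tasks:
    assert abs(proc_time(p, c) - listed) < 1e-9, name

total = sum(proc_time(p, c) for _, p, c, _ in tasks)
print(f"total processing time: {total:.2f} s in a {P:.0f} s period")
```

Under that assumption every row matches, and the total agrees with the 4.52 s summation shown in the table.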
  6. 6. Modern SIMD Chips <ul><li>Chips considered: </li></ul><ul><li>200 MHz CS301 </li></ul><ul><li>250 MHz CSX600 due in Q2 2005. </li></ul><ul><li>These SIMD chips have: </li></ul><ul><li>powerful PEs (Processing Elements) </li></ul><ul><li>64 – 96 PEs per chip </li></ul><ul><li>floating and fixed point in every PE </li></ul><ul><li>4 – 6 kB of “poly” RAM in each PE </li></ul><ul><li>96 GB/s load/store between poly RAM and register files </li></ul><ul><li>fast I/O (up to 11 GB/s) </li></ul><ul><li>each PE can specify its own address for I/O to external “mono” RAM </li></ul>
  7. 7. Generic core of the SIMD chips In multi-chip systems, each chip runs its own program. Global SIMD achieved by same code in each chip, and software synchronization Cycle estimates are for highly optimized assembly code. (Less optimized code can still comfortably achieve real-time)
  8. 8. SIMD boards <ul><li>CS301 </li></ul><ul><li>CS301 boards contain 2 CS301 chips & 1 GB of “mono” DRAM. </li></ul><ul><li>Proprietary ClearConnect bus runs from one CS301, across the other CS301 and (via FPGA) to DRAM </li></ul><ul><li>PCI interface connects to host computer such as a PC </li></ul><ul><li>Kent State University will use this COTS board </li></ul><ul><li>CSX600 </li></ul><ul><li>CSX600 board has 2 SIMD chips, each with on-chip DRAM interface </li></ul><ul><li>ClearConnect bus connects chips and (via FPGA) board 64-bit PCI-X interface </li></ul>
  9. 9. Analysis for a modern SIMD system <ul><li>Modern chip fast enough to process many tracks per PE </li></ul><ul><li>Previous analysis had a PE for every track </li></ul><ul><li>Both algorithms are SIMD </li></ul><ul><li>~100 tracks per PE possible, so 14000 require ~140 PEs </li></ul><ul><ul><li>Either three 64-PE chips or two 96-PE chips </li></ul></ul>
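The chip count on this slide follows from two ceiling divisions. A minimal sketch, where the function name `chips_needed` is hypothetical:

```python
import math


def chips_needed(tracks, tracks_per_pe, pes_per_chip):
    """Return (PEs required, chips required) for a given track load."""
    pes = math.ceil(tracks / tracks_per_pe)
    chips = math.ceil(pes / pes_per_chip)
    return pes, chips


# 14000 tracks at ~100 tracks per PE
print(chips_needed(14000, 100, 64))  # 64-PE chips
print(chips_needed(14000, 100, 96))  # 96-PE chips
```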
  10. 10. Reduction and broadcast operations (1) <ul><li>SIMD for ATC requires efficient global Reduction ops </li></ul><ul><li>Global synchronization required </li></ul><ul><li>This is the “difficult part” of the application </li></ul><ul><li>ATC uses global tests and PickOne </li></ul><ul><li>Other operations included for completeness. </li></ul>
  11. 11. Reduction and broadcast operations (2) <ul><li>Global test (AND or OR) </li></ul><ul><li>Is a Boolean condition true anywhere? </li></ul><ul><li>Each PE first reduced to a single Boolean </li></ul><ul><li>Hardware reduces Boolean/PE to scalar (~15 cycles) </li></ul><ul><li>Multi-chip systems must (without hardware help): </li></ul><ul><ul><li>check that all chips have finished </li></ul></ul><ul><ul><li>combine the results. </li></ul></ul><ul><li>Each chip posts on DRAM that it has finished and its result, and then checks all chips have finished, reads results and computes global result </li></ul><ul><li>This across-chip work is ~100 cycles </li></ul><ul><ul><li>Thus within-chip tests take ~15 cycles, and across-chip ~115 cycles </li></ul></ul>
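The two-level global OR described on this slide can be modelled in plain Python. This is only a sequential simulation under stated assumptions: the `mailbox` list stands in for the shared DRAM area, and the names `chip_or` and `global_or` are illustrative. Real chips would run concurrently and poll until every "finished" flag is set.

```python
WITHIN_CHIP_CYCLES = 15    # hardware reduction, figure from the slide
ACROSS_CHIP_CYCLES = 100   # software combine via DRAM, figure from the slide


def chip_or(pe_flags):
    """Within-chip step: one Boolean per PE reduced to a single scalar."""
    return any(pe_flags)


def global_or(chips):
    """Across-chip step: each chip posts (finished, result), then all results are combined."""
    mailbox = [None] * len(chips)  # stand-in for the shared DRAM area
    for i, pe_flags in enumerate(chips):
        mailbox[i] = (True, chip_or(pe_flags))
    # Each chip would check that every other chip has posted before reading results.
    assert all(entry is not None and entry[0] for entry in mailbox)
    return any(result for _, result in mailbox)


def global_test_cycles(n_chips):
    """Cycle estimate: within-chip only for one chip, plus across-chip work otherwise."""
    return WITHIN_CHIP_CYCLES + (ACROSS_CHIP_CYCLES if n_chips > 1 else 0)


chips = [[False] * 96, [False] * 95 + [True]]
print(global_or(chips), global_test_cycles(len(chips)))
```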
  12. 12. Reduction and broadcast operations (3) <ul><li>PickOne (no special hardware for this non-trivial work) </li></ul><ul><li>PickOne picks the first False element in a Boolean array in 3 stages: </li></ul><ul><ul><li>each PE finds whether it has any F </li></ul></ul><ul><ul><li>find the first PE (if any) with an F </li></ul></ul><ul><ul><li>find the first F within that PE </li></ul></ul><ul><li>A “binary chop” does across-PE work in ~120 cycles: </li></ul><ul><ul><li>a global test looks for an F in the first half of PEs </li></ul></ul><ul><ul><li>if not, the second half is chosen </li></ul></ul><ul><ul><li>the chosen half is tested to find the quarter </li></ul></ul><ul><ul><li>with 64 PEs, 6 such tests find the first F </li></ul></ul><ul><li>With multiple chips, each chip posts its result on DRAM. After synchronization, chips read results and compute which is selected </li></ul>
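The three PickOne stages and the binary chop can be sketched as follows. This is a single-chip illustration, not the chip code: `global_test` stands in for the hardware global test, the PE count is assumed to be a power of two, and all function names are hypothetical.

```python
def global_test(flags, lo, hi):
    """Stand-in for the hardware global test: is any flag set among PEs lo..hi-1?"""
    return any(flags[lo:hi])


def pick_first_pe(pe_has_f):
    """Binary chop over PEs; the caller must already know at least one flag is set."""
    n = len(pe_has_f)
    assert n > 0 and n & (n - 1) == 0, "power-of-two PE count assumed"
    lo, hi, tests = 0, n, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if global_test(pe_has_f, lo, mid):
            hi = mid   # an F exists in the first half
        else:
            lo = mid   # otherwise choose the second half
    return lo, tests


def pick_one(poly):
    """poly[pe] is that PE's slice of the Boolean array. Returns (pe, index, tests) or None."""
    pe_has_f = [False in row for row in poly]     # stage 1: within-PE check
    if not any(pe_has_f):
        return None
    pe, tests = pick_first_pe(pe_has_f)           # stage 2: first PE with an F
    return pe, poly[pe].index(False), tests       # stage 3: first F in that PE


poly = [[True] * 4 for _ in range(64)]
poly[37][2] = False
poly[50][0] = False
print(pick_one(poly))
```

With 64 PEs the loop runs six times, which matches the six global tests quoted on the slide.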
  13. 13. Reduction and broadcast operations (4) <ul><li>Max and Min </li></ul><ul><li>Each PE finds its own max, followed by a single across-PE stage </li></ul><ul><li>Bits are worked through starting with the MS, progressively eliminating PEs that cannot be biggest and retaining at least one PE in the “competition” </li></ul><ul><li>Test the next bit of all remaining PEs. If any bit is T, PEs with F are eliminated. </li></ul><ul><li>A record of the global test results gives the scalar max </li></ul><ul><li>Algorithm is constant time and takes ~20 cycles/bit. </li></ul><ul><li>For multiple chips, results are posted in mono RAM, each chip reads them and computes the global max. This adds about 100 cycles </li></ul>
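The bit-serial Max can be illustrated with a short sketch. It assumes unsigned integers of a known width, uses `any` in place of the hardware global test, and the name `bit_serial_max` is hypothetical.

```python
def bit_serial_max(pe_values, bits):
    """pe_values[i] is PE i's own maximum (non-negative int below 2**bits)."""
    alive = [True] * len(pe_values)   # PEs still in the "competition"
    result = 0
    for b in range(bits - 1, -1, -1):             # most significant bit first
        bit_set = [alive[i] and bool((pe_values[i] >> b) & 1)
                   for i in range(len(pe_values))]
        if any(bit_set):                          # global test on this bit
            alive = bit_set                       # PEs with F in this bit drop out
            result |= 1 << b                      # record of global test results
        # if no remaining PE has the bit set, every remaining PE stays in
    return result


values = [13, 200, 7, 199, 200, 0]
print(bit_serial_max(values, bits=8))
```

The number of iterations depends only on the word width, not on the data, which is why the slide can quote a fixed cost per bit.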
  14. 14. Reduction and broadcast operations (5) <ul><li>Sum </li></ul><ul><li>After within-PE sums, across-PE work uses a "log(n)" approach (within-chip Sum takes ~200 cycles) </li></ul><ul><li>Across-chip adds about 100 cycles. </li></ul><ul><li>Broadcast </li></ul><ul><li>Broadcasting to all PEs is part of the chip instruction set </li></ul><ul><li>For multi-chip systems, each chip accesses the same mono RAM </li></ul><ul><li>If the mono data is stable (e.g. when correlating a sequence of radar reports) no validity check is needed, but other cases may need validity to be semaphored </li></ul>
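One common form of a "log(n)" across-PE sum is a pairwise tree reduction, sketched below. The slide does not say which variant the authors use, so treat this as an assumption; the PE count is taken to be a power of two and `tree_sum` is a hypothetical name.

```python
def tree_sum(pe_sums):
    """Pairwise tree reduction over per-PE partial sums. Returns (total, steps)."""
    vals = list(pe_sums)
    n = len(vals)
    assert n > 0 and n & (n - 1) == 0, "power-of-two PE count assumed"
    stride, steps = 1, 0
    while stride < n:
        # In SIMD terms, all of these additions happen in one step.
        for i in range(0, n, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        steps += 1
    return vals[0], steps


print(tree_sum(range(64)))
```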
  15. 15. Report Correlation & Tracking (1) <ul><li>An ATC system has many radars reporting object positions </li></ul><ul><li>~6000 radar reports assembled in 0.5 sec are trial correlated against all tracks </li></ul><ul><li>The track database is in mono DRAM. Correlation starts by loading 3 position coordinates/track, plus uncertainties, into poly RAM </li></ul><ul><li>14000 tracks and 192 PEs mean 73 tracks per PE </li></ul><ul><li>6 values (x, y, h plus uncertainties) broadcast for each report. Boxes of uncertainty around both track and report positions are tested for intersection </li></ul><ul><li>3 possibilities: </li></ul><ul><ul><li>report intersected by a unique track. Report data stored for updating that track, and track marked not to correlate again </li></ul></ul><ul><ul><li>two or more tracks intersect - marked as multiple hits </li></ul></ul><ul><ul><li>uncorrelated report earmarked for wider tolerance correlation rounds </li></ul></ul><ul><li>6 comparisons and 5 Booleans per track per report </li></ul>
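The box intersection test on this slide needs two comparisons per dimension, joined by five ANDs. The sketch below is an illustration under assumptions: the `Box` class, its field names and the three outcome labels are hypothetical, and uncertainties are treated as half-widths of an axis-aligned box.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float
    y: float
    h: float
    dx: float  # uncertainty (half-width) in x
    dy: float  # uncertainty (half-width) in y
    dh: float  # uncertainty (half-width) in h


def boxes_intersect(t, r):
    """6 comparisons and 5 Boolean ANDs: the boxes overlap in every dimension."""
    return (t.x - t.dx <= r.x + r.dx and r.x - r.dx <= t.x + t.dx and
            t.y - t.dy <= r.y + r.dy and r.y - r.dy <= t.y + t.dy and
            t.h - t.dh <= r.h + r.dh and r.h - r.dh <= t.h + t.dh)


def classify(report, tracks):
    """Map a report to one of the three possibilities listed on the slide."""
    hits = [i for i, t in enumerate(tracks) if boxes_intersect(t, report)]
    if not hits:
        return "uncorrelated", hits
    if len(hits) == 1:
        return "unique", hits
    return "multiple", hits


tracks = [Box(0, 0, 10, 1, 1, 1), Box(1.5, 0, 10, 1, 1, 1), Box(50, 50, 30, 1, 1, 1)]
print(classify(Box(50.5, 49.5, 30, 0.5, 0.5, 0.5), tracks))
```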
  16. 16. Report Correlation & Tracking (2) (omit?) <ul><li>For each report, correlations in each PE are counted, and the within-PE track number of first correlated track found. Count is 0, 1, many. </li></ul><ul><li>A global OR finds if there are any hits. </li></ul><ul><ul><li>If no hits, mark report for the next correlation round. </li></ul></ul><ul><ul><li>If hit(s), PickOne finds first nonzero PE, and its count is decremented </li></ul></ul><ul><li>A global test checks for multiple hits. Any tracks involved are marked </li></ul><ul><li>If correlation unique, report data is copied to unique track </li></ul><ul><li>Correlation repeated for unmatched reports, with wider tolerances </li></ul><ul><li>Remaining unmatched reports start new tracks </li></ul>
  17. 17. Report Correlation & Tracking (3) (omit?) <ul><li>Starting new tracks </li></ul><ul><li>Before Correlation starts: </li></ul><ul><ul><li>identify empty track locations </li></ul></ul><ul><ul><li>count within-PE empty tracks </li></ul></ul><ul><ul><li>form within-PE list of empty track locations </li></ul></ul><ul><ul><li>global “scan sum” to give each PE its first “empty number” location </li></ul></ul><ul><li>This work is done only once and takes negligible time </li></ul><ul><li>For each unmatched report: </li></ul><ul><ul><li>increment a mono count of new tracks </li></ul></ul><ul><ul><li>compare this count with each PE’s empty track numbers </li></ul></ul><ul><ul><ul><li>(e.g. if this is the 57th new track, only one PE will have this empty track number in its range) </li></ul></ul></ul><ul><ul><li>within-PE address used to initiate a track with the report data </li></ul></ul>
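The "scan sum" step gives each PE the global number of its first empty slot, so a mono counter of new tracks selects exactly one PE. A minimal sketch, assuming 0-based numbering; the function names and the list-of-lists layout are hypothetical. `itertools.accumulate` with `initial=` needs Python 3.8 or later.

```python
from itertools import accumulate


def first_empty_numbers(empty_counts):
    """Exclusive scan sum: global number of each PE's first empty track slot."""
    return list(accumulate(empty_counts, initial=0))[:-1]


def place_new_track(new_track_number, empty_lists, firsts):
    """Find the one PE whose range of empty numbers contains new_track_number."""
    for pe, (first, slots) in enumerate(zip(firsts, empty_lists)):
        if first <= new_track_number < first + len(slots):
            return pe, slots[new_track_number - first]   # (PE, within-PE address)
    return None                                          # no empty slot left


# Within-PE lists of empty track locations, built once before correlation starts.
empty_lists = [[3, 7], [], [1, 2, 5], [4]]
firsts = first_empty_numbers([len(s) for s in empty_lists])
print(firsts, place_new_track(2, empty_lists, firsts))
```

In the SIMD version every PE performs the range comparison at once; the Python loop only models that comparison.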
  18. 18. Report Correlation & Tracking (4) (omit?) <ul><li>Higher Quality Correlation </li></ul><ul><li>Two error components in a radar report: </li></ul><ul><ul><li>along the radar radius (radar response time error) </li></ul></ul><ul><ul><li>across the radius (azimuth angle error) </li></ul></ul><ul><li>Other than for short range, errors are much bigger in azimuth than range </li></ul><ul><li>An eccentric ellipse is ideal, but an elongated report box is quite good </li></ul><ul><li>However, in system coordinates the rectangle is not usually aligned with the axes. Computing efficiency requires a box that is aligned </li></ul><ul><li>To include all possible good correlations makes an aligned box much bigger, but that will also include dubious correlations </li></ul><ul><li>“Radar coordinates”, with axes along and perpendicular to the radar radius, align the original radar box to the axes. </li></ul>
  19. 19. Report Correlation & Tracking (5) <ul><li>Estimating cycles </li></ul><ul><li>In worst case, 6000 reports are estimated to need ~7300 correlation passes in total: </li></ul><ul><ul><li>all 6000 are correlated on the first round </li></ul></ul><ul><ul><li>1000 fail to correlate on first round and are retried </li></ul></ul><ul><ul><li>300 fail on second round and are retried again </li></ul></ul><ul><li>Per track work (for 73 tracks/PE) is: </li></ul><ul><ul><li>6 compares </li></ul></ul><ul><ul><li>5 Boolean ops </li></ul></ul><ul><ul><li>one step of count hits and find the first address </li></ul></ul><ul><li>~40 cycles inner loop gives ~21 M cycles per 0.5 sec </li></ul><ul><li>Each report needs ~1 PickOne and ~2.2 global tests. Reduction ~2.2 M cycles </li></ul><ul><li>Total ~23.7 M cycles, a 19.7% load at 250 MHz in a 0.5 sec period. </li></ul>
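The two main cycle figures on this slide can be reproduced approximately. The sketch assumes that the ~7300 figure counts correlation passes in total (6000 + 1000 + 300), and that the reduction cost uses the ~120-cycle PickOne and ~115-cycle across-chip global test quoted on the earlier reduction slides. That second assumption is a reconstruction, not something this slide states.

```python
first_round = 6000     # all reports
second_round = 1000    # reports that failed the first round
third_round = 300      # reports that failed the second round
passes = first_round + second_round + third_round

tracks_per_pe = 73
inner_loop_cycles = 40
core_cycles = passes * tracks_per_pe * inner_loop_cycles

pick_one_cycles = 120       # assumed: across-PE binary chop
global_test_cycles = 115    # assumed: across-chip global test
reduction_cycles = first_round * (1 * pick_one_cycles + 2.2 * global_test_cycles)

print(f"passes: {passes}")
print(f"core correlation: {core_cycles / 1e6:.1f} M cycles")
print(f"reductions:       {reduction_cycles / 1e6:.1f} M cycles")
```

Both terms land close to the slide's ~21 M and ~2.2 M. The quoted ~23.7 M total is slightly higher than their sum.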
  20. 20. Report Correlation & Tracking (6) <ul><li>Storage </li></ul><ul><li>Tracks stored in mono RAM, with data brought into poly RAM as needed </li></ul><ul><li>Six 32-bit numbers/track, plus status byte and byte for empty track list (26 bytes/track) </li></ul><ul><li>Up to ~210 tracks/PE can be stored in the 6 kB/PE poly RAM of CSX600 </li></ul><ul><li>Only ~1 msec needed for initial loading of all tracks </li></ul>
  21. 21. Report Correlation & Tracking (7) <ul><li>Faster Reductions in bulk </li></ul><ul><li>Big speedup can be achieved by processing several (e.g. 16) reports before doing across-PE reductions. </li></ul><ul><li>Detail in future paper </li></ul>
  22. 22. Conflict Detection & Resolution (1) <ul><li>Conflict detection </li></ul><ul><li>Tracks projected 20 mins and each IFR track checked for conflict with any other track </li></ul><ul><li>For each dimension: </li></ul><ul><ul><li>Compute min and max closing velocity </li></ul></ul><ul><ul><li>Compute min and max current track separation </li></ul></ul><ul><ul><li>Division gives min and max tolerance on the time for that dimension to coincide </li></ul></ul>
  23. 23. Conflict Detection & Resolution (2)
  24. 24. Conflict Detection & Resolution (3) <ul><li>Potential conflict if, across the 3 dimensions, biggest min time is smaller than smallest max time. (Ken Batcher algorithm.) </li></ul><ul><li>Conflict declared after two potential conflicts </li></ul><ul><li>Conflict resolution </li></ul><ul><ul><li>Track heading or altitude adjusted, and algorithm run again. Continues until conflict resolved and no new ones created </li></ul></ul>
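The potential-conflict test is an interval intersection across the three dimensions. The sketch below covers only a simplified case and is not the authors' code: it assumes that in each dimension the separation is non-negative and the closing velocity is strictly positive, so dividing the separation bounds by the closing-velocity bounds gives a window of coincidence times. The 20-minute horizon check and all names are assumptions.

```python
def time_window(sep_min, sep_max, close_min, close_max):
    """Earliest and latest time at which two tracks could coincide in one dimension."""
    if sep_min < 0 or close_min <= 0:
        raise ValueError("sketch only handles non-negative separation, positive closing")
    return sep_min / close_max, sep_max / close_min


def potential_conflict(windows, horizon=20 * 60):
    """True if the biggest min time is smaller than the smallest max time."""
    biggest_min = max(w[0] for w in windows)
    smallest_max = min(w[1] for w in windows)
    return biggest_min < smallest_max and biggest_min <= horizon


# Units are illustrative: nautical miles and nautical miles per second.
wx = time_window(10.0, 12.0, 0.10, 0.12)
wy = time_window(5.0, 6.0, 0.05, 0.06)
wh = time_window(0.9, 1.1, 0.010, 0.012)
print(potential_conflict([wx, wy, wh]))
```

If the three windows share a common instant, the tracks could occupy the same volume at that time, hence a potential conflict.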
  25. 25. Conflict Detection & Resolution (4) <ul><li>Implementation </li></ul><ul><li>21 of 73 tracks/PE are IFR </li></ul><ul><ul><li>49 bytes/track. 73 tracks take ~3.6 kB out of 6 kB poly RAM/PE. </li></ul></ul><ul><li>Active IFR track broadcast and conflict detection performed on all tracks: </li></ul><ul><ul><li>6 subtracts to get min and max closing velocities </li></ul></ul><ul><ul><li>6 subtracts to get min and max distances </li></ul></ul><ul><ul><li>6 divides to get min and max times </li></ul></ul><ul><ul><li>4 compares to find max min and min max </li></ul></ul><ul><ul><li>Subtract max min from min max. Sign is result </li></ul></ul><ul><ul><li>one step of within-PE OR </li></ul></ul><ul><li>Global OR checks for conflict, and PickOne finds the conflicting track </li></ul><ul><li>~330 cycles/track/PE. 4000 IFR tracks ~86 M cycles. Processing load < 5% </li></ul>
  26. 26. Cockpit Display <ul><li>x, y, h positions of all tracks transferred to poly RAM </li></ul><ul><li>Batch of 750 IFR flights, processed one at a time </li></ul><ul><li>IFR flight’s x, y, h and velocity broadcast to all PEs </li></ul><ul><li>All track coordinates transformed so they are centered on broadcast track and rotated to coordinates in which the broadcast track is heading “North” </li></ul><ul><li>Tracks selected in 10 mile x 10 mile box elongated to include a 30 sec projection of the IFR flight </li></ul><ul><li>12 hit tracks is assumed to be worst case average </li></ul><ul><li>~5.9 M cycles </li></ul>
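The coordinate transform and box selection can be sketched in two dimensions. Several details here are assumptions rather than facts from the slide: the box is taken to be centred on the broadcast flight and stretched forward only, altitude is ignored, units are miles and miles per second, and the flight is assumed to have non-zero speed. Function names are hypothetical.

```python
import math


def to_flight_frame(track_xy, flight_xy, flight_vel):
    """Centre on the flight and rotate so the flight's heading is 'North' (+ahead)."""
    dx = track_xy[0] - flight_xy[0]
    dy = track_xy[1] - flight_xy[1]
    speed = math.hypot(flight_vel[0], flight_vel[1])   # assumed non-zero
    ux, uy = flight_vel[0] / speed, flight_vel[1] / speed
    ahead = dx * ux + dy * uy    # component along the heading
    right = dx * uy - dy * ux    # component to the right of the heading
    return right, ahead


def in_display_box(track_xy, flight_xy, flight_vel, half=5.0, project_s=30.0):
    """10 x 10 mile box, elongated ahead by a 30 s projection of the flight."""
    right, ahead = to_flight_frame(track_xy, flight_xy, flight_vel)
    speed = math.hypot(flight_vel[0], flight_vel[1])
    return abs(right) <= half and -half <= ahead <= half + speed * project_s


# Flight at the origin heading east at 0.1 miles/s.
print(in_display_box((7.0, 0.0), (0.0, 0.0), (0.1, 0.0)))
```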
  27. 27. Controller Display Update <ul><li>~7500 out of 14000 tracks in the controlled area are selected </li></ul><ul><li>Information on them sent to all control stations </li></ul><ul><li>Each station selects the tracks in its local area </li></ul><ul><li>The information is track position (x, y, h), heading and speed, all 16-bit </li></ul><ul><li>Speed is in knots using BCD (3 or 4 decimals) </li></ul><ul><li>Each track is made into a fixed size message of 16B including flight ID </li></ul><ul><li>The task is fast as there is no “all against all” component </li></ul>
  28. 28. Terrain Avoidance (1) <ul><li>Warn ~7500 flights (~40/PE) within local control boundaries if they risk running into ground terrain </li></ul><ul><li>Every 8 secs flights projected 1 min and tested against the terrain map. </li></ul><ul><li>The terrain map is an irregular 3D mesh surface with thousands of triangles. Both natural topography and buildings/masts etc </li></ul><ul><li>Flights loaded to poly RAM and triangles broadcast one at a time </li></ul><ul><li>Terrain overhangs not included. Only base of flight projection box needed </li></ul><ul><li>Algorithm </li></ul><ul><li>Compute intersection of base surface and plane of a triangle. Collision if part of line lies in both the triangle and the base surface </li></ul><ul><li>~30 arithmetic, comparison or Boolean operations </li></ul>
  29. 29. Terrain Avoidance (2) <ul><li>Performance </li></ul><ul><li>Proportional to number of triangles. </li></ul><ul><li>x and y for all 14k flights input to poly RAM, and used to select flights </li></ul><ul><li>Pack tracks and construct mono addresses to load rest of track data </li></ul><ul><li>The bases of the flight projections are computed </li></ul><ul><li>For each triangle: </li></ul><ul><ul><li>broadcast 40 bytes, ~50 cycles </li></ul></ul><ul><ul><li>compute intersections, 40 x ~70 = ~2.8k cycles </li></ul></ul><ul><ul><li>within-PE OR of hits, 40 x ~2 = ~80 cycles </li></ul></ul><ul><ul><li>single-chip global test for any hits, ~15 cycles </li></ul></ul><ul><li>With 20k triangles, ~60 M cycles </li></ul><ul><li>Information on intersections is extracted and output to controller affected </li></ul><ul><li>Load ~3.8%. 20k triangles is manageable </li></ul>
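The per-triangle figures on this slide add up as follows. The sketch only sums the itemized per-triangle costs; it does not include the one-off setup steps or the output of intersection information, so it covers just the "~60 M cycles" line. Variable names are illustrative.

```python
flights_per_pe = 40
broadcast = 50                            # broadcast 40 bytes
intersections = flights_per_pe * 70       # ~70 cycles per flight
within_pe_or = flights_per_pe * 2         # ~2 cycles per flight
global_test = 15                          # single-chip global test

per_triangle = broadcast + intersections + within_pe_or + global_test
triangles = 20_000
total_cycles = per_triangle * triangles

print(f"{per_triangle} cycles per triangle, {total_cycles / 1e6:.1f} M cycles in all")
```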
  30. 30. Sporadic (Aperiodic) requests <ul><li>This task: </li></ul><ul><li>inputs to various database tables in mono RAM </li></ul><ul><li>responds to queries to the database. </li></ul><ul><li>Once per second the host writes buffered messages to the database </li></ul><ul><li>With no simultaneous tasks, there are no synchronization or scheduling issues </li></ul><ul><li>Queries extract data from tables and transmit it to requesters </li></ul><ul><li>Most messages are small, but some, such as wind table update, are quite large </li></ul><ul><li>Estimated maximum of 200 messages per second. </li></ul><ul><li>~1 M cycles </li></ul>
  31. 31. Automatic Voice Advisory (AVA) <ul><li>AVA advises uncontrolled (VFR) flights of other aircraft and terrain </li></ul><ul><li>Gives a near equivalent to Cockpit Display </li></ul><ul><li>Computing similar to Cockpit Display, but simpler </li></ul><ul><li>~60% of the load of Cockpit Display </li></ul>
  32. 32. Final Approach (Runways) <ul><li>Each flight plan specifies: </li></ul><ul><ul><li>departure terminal and planned departure time </li></ul></ul><ul><ul><li>destination terminal and planned arrival time </li></ul></ul><ul><li>Every 8 secs information for each of ~100 runways is gathered, a queue organized and any consequential modifications inserted in each flight plan </li></ul><ul><li>The relevant controller is informed of recommended flight changes </li></ul><ul><li>Stacking is minimized by delaying tracks in flight. (In emergencies, flights get stack detail from the controller) </li></ul><ul><li>~870k cycles, a load of < 0.1% </li></ul>
  33. 33. Speedup with Sorting (1) <ul><li>Most tasks include “all against all” matching involving proximity. Sorting can greatly speed these up. </li></ul><ul><li>This is still SIMD computing. The sorting has data-independent deterministic speed. The rest has some data dependence, but big speedup even in worst cases </li></ul><ul><li>The dramatic speed gains can be used for: </li></ul><ul><ul><li>less optimized coding </li></ul></ul><ul><ul><li>more complex or bigger requirements </li></ul></ul><ul><ul><li>less hardware </li></ul></ul><ul><li>The disadvantage is more complex algorithms </li></ul>
  34. 34. Speedup with Sorting (2) <ul><li>Correlation </li></ul><ul><li>Input x and sort (including track numbers) with PE LS </li></ul><ul><li>Fetch track data using addresses constructed from track numbers </li></ul><ul><li>~200 bins for x. Min poly track address for each bin in mono array </li></ul><ul><li>Use min and max x for each report to TLU track addresses needed </li></ul><ul><li>Worst case average number of tracks/PE ~20x fewer than without sorting </li></ul><ul><li>Core correlation cycles ~1.2 M; 4 other contributions: </li></ul><ul><ul><li>sort time </li></ul></ul><ul><ul><li>reduction operations </li></ul></ul><ul><ul><li>finding the range of poly addresses </li></ul></ul><ul><ul><li>final output of update data </li></ul></ul><ul><li>Cycles reduce ~12x, from ~24M to ~1.92 M (with bulk Reduction) </li></ul>
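The bin table look-up can be sketched on a single flat sorted list. This ignores the slide's interleaving of sorted tracks across PEs and assumes every track's x lies in the half-open range covered by the bins. Function names and the uniform bin layout are assumptions.

```python
import bisect


def build_bin_starts(sorted_xs, x_lo, x_hi, n_bins=200):
    """starts[b] = index of the first sorted track whose x is at or beyond bin b's lower edge."""
    width = (x_hi - x_lo) / n_bins
    starts = [bisect.bisect_left(sorted_xs, x_lo + b * width) for b in range(n_bins)]
    starts.append(len(sorted_xs))   # sentinel for the end of the last bin
    return starts, width


def candidate_range(starts, width, x_lo, rx_min, rx_max):
    """Index range [lo, hi) of sorted tracks that could overlap a report's x extent."""
    n_bins = len(starts) - 1
    b0 = min(max(int((rx_min - x_lo) // width), 0), n_bins - 1)
    b1 = min(max(int((rx_max - x_lo) // width), 0), n_bins - 1)
    return starts[b0], starts[b1 + 1]


# 400 tracks spread over x in [0, 200), already sorted on x.
xs = [0.25 + 0.5 * i for i in range(400)]
starts, width = build_bin_starts(xs, 0.0, 200.0)
lo, hi = candidate_range(starts, width, 0.0, 10.3, 11.6)
print(lo, hi, xs[lo:hi])
```

Only the tracks in the returned range need the full box test, which is where the reduction in tracks examined per report comes from.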
  35. 35. Speedup with Sorting (3) <ul><li>Terrain Avoidance </li></ul><ul><li>The 20k triangles permanently sorted on min value of x </li></ul><ul><li>The sorted triangles have PE LS and RAM address MS </li></ul><ul><li>Worst case speedup ~9x </li></ul><ul><li>Conflict Detection </li></ul><ul><li>Sorting by x velocity and x position gives speedup of ~3x </li></ul><ul><li>Cockpit Display and Automatic Voice Advisory </li></ul><ul><li>Sorting on x gives speedup of ~9x for both tasks </li></ul>
  36. 36. Summary Table Two CSX600 chips will do all the processing
  37. 37. A staged plan of work <ul><li>1. Establish requirements and algorithms </li></ul><ul><li>2. Code in a high level language (Cn) for one CS301 </li></ul><ul><li>3. Speed and space optimization. Real-time speed for ~5000 tracks </li></ul><ul><li>4. Extend reduction codes to 2 SIMD chips (~10000 tracks) </li></ul><ul><li>5. Move application to CSX600 board (~20k tracks) </li></ul><ul><li>6. Demonstrations, reports and presentations. </li></ul><ul><li>7. Using sorting, develop much faster codes </li></ul><ul><li>Kent State well placed with CS301 board </li></ul><ul><li>Steps 1 through 6 will take ~12 months </li></ul>
  38. 38. References <ul><li>[1] W. Meilander, M. Jin, J. Baker. Tractable Real-Time Air Traffic Control Automation. Proceedings of the 14th IASTED International Conference Parallel and Distributed Computing and Systems, Cambridge, USA, 2002, pp. 483-488 </li></ul><ul><li>[2] </li></ul><ul><li>[3] To be published. </li></ul>
  39. 39. Flight Plan/Track Conformance