Real-Time Football Cup 2011
Project report - Team 1

Hoo Chin Hau, Lee Hui Hui Evon, Lee Wang Wei, Lo Yat Piu, Ng Zhong Qin, Teo Sing Ying Alex

I. INTRODUCTION

The objective of this project is to develop a soccer system. The project involves three FPGA boards: two Spartan 3E boards and one Spartan 6 board. Of the two Spartan 3Es, one plays the role of the server, while the other is the client. The Spartan 6 acts as a High Definition display controller, as an additional feature.

II. HARDWARE DESIGN AND IMPLEMENTATION

A. Server

The server is configured with 2 Microblaze cores, each with a 2KB instruction cache and an 8KB data cache. Microblaze 0 (MB0) is designated as the graphics core, and is hence connected to a DMA controller. The DMA controller copies bitmap data into the TFT frame buffer without CPU intervention, thereby allowing the processor to perform other tasks in parallel. In addition, the DMA controller attempts to optimize the speed of the data transfer by initiating burst transactions instead of single beat transfers whenever possible. Therefore, DMA can draw a complete screen much faster than the Microblaze. Unfortunately, data transfer using DMA is still not fast enough to meet the strict deadline required to refresh the screen at 60 Hz during runtime, and thus it was used only for pre-loading of full screen images.

The second Microblaze (MB1) is tasked with communications and physics calculations. Information about the game state and the player and ball positions is relayed to MB0 through a hardware mailbox. In addition, the same information is relayed to a Spartan 6 FPGA for high definition display, through an Ethernet connection.

Information on the current game state, ball and player positions is also relayed to the client boards via RS232 connections at 115200 baud.

B. Client

A single Microblaze drives the Client board. It is responsible for communicating with the server, as well as implementing the strategy after considering the position of the ball and players. DIP switches and push buttons are used to indicate the start of the game and the side the team is playing on. Moreover, a hardware co-processor is developed to aid in the complex calculations required for the strategy implemented.

C. Artificial Intelligence Co-processor

An AI co-processor was implemented in order to offload computationally intensive calculations used in the client AI system to custom hardware. It is implemented as a Xilinx EDK custom IP project that is designed to be imported into the client XPS project. The AI co-processor provides registers for the Microblaze processor to write to and read from through the slave PLB interface of the AI co-processor. The Microblaze processor writes the current state data (the packets received) into input registers for the co-processor to work on, and the co-processor writes the results into result registers for the Microblaze processor to read from. A configuration register allows the processor to issue instructions. An interrupt is issued to indicate that the co-processor has completed its calculations and the result register is ready to be read.

Five functions were determined to be computationally intensive and were implemented in custom hardware.

• In Range - The function determines whether a player is in kicking range of the ball so that the player can execute a kick command.
• Seek - The function calculates the optimal speed and direction of the player given the player and ball state information so that the player will reach the ball in the shortest time possible. The algorithm also takes ball bouncing into account to predict future ball positions.
• Best Supporting Position - The function calculates the best supporting position where a player should move/pass the ball to. Scores are assigned to various points of the field in which goal scoring potential, passing potential and optimal distance from the ball are considered. The position on the field with the highest score is deemed to be the best supporting position.
• Move To Target - The function calculates the optimal speed and direction of the player given a target position so that the player approaches the target in the shortest time possible.
• Check Goal - The function determines whether a goal can be scored based on the position of the ball, taking into account whether there are players blocking the goal scoring shot, and returns the best direction for goal scoring.
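To make two of the simpler co-processor functions concrete, the following is a minimal software sketch of In Range and Move To Target. It is written in Python for illustration and is not the hardware implementation. The kick radius, the top speed, the coordinate units and the heading convention (degrees, counter-clockwise from the positive x axis) are all assumptions, as is the rule of capping the speed at the remaining distance.

```python
import math

KICK_RANGE = 10   # hypothetical kicking radius, in field units
MAX_SPEED = 15    # hypothetical top speed, in field units per cycle


def in_range(player, ball, kick_range=KICK_RANGE):
    """Return True when the ball lies within kicking distance of the player."""
    dx = ball[0] - player[0]
    dy = ball[1] - player[1]
    # Comparing squared distances avoids a square root, which suits hardware.
    return dx * dx + dy * dy <= kick_range * kick_range


def move_to_target(player, target, max_speed=MAX_SPEED):
    """Return (speed, heading_degrees) for a straight run at the target."""
    dx = target[0] - player[0]
    dy = target[1] - player[1]
    distance = math.hypot(dx, dy)
    if distance == 0:
        return 0.0, 0.0  # already on the target: stand still
    heading = math.degrees(math.atan2(dy, dx)) % 360.0
    # Assumed rule: run at top speed, but never overshoot the target.
    speed = min(max_speed, distance)
    return speed, heading
```

For example, `in_range((0, 0), (6, 8))` is true with the assumed radius of 10, because the ball is exactly 10 units away.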
D. High Definition Display

An advanced version of the field display is created using the Atlys Spartan 6 board, which has an HDMI output port. Since the VGA output provided by the xps_tft controller uses a signaling protocol that is very different from the Transition Minimized Differential Signaling (TMDS) used by HDMI, a custom hardware core is created to utilize the HDMI port on the Spartan 6 board. The hardware core is based on the reference design files that came with Xilinx's Application Note 495 (XAPP495), which implements the required logic to serialize RGB data using the advanced IO logic and clocking resources on the Spartan 6 board. However, Xilinx's design procedurally generates a SMPTE color bars image instead of reading RGB data from a frame buffer, which is inadequate to render a dynamically changing football field. Therefore, a controller is coded in Verilog to utilize the Video Frame Buffer Controller (VFBC) Personality Interface Module (PIM) of the multi-port memory controller. VFBC allows 2D video data to be read from a frame buffer using a simple command based interface. During the horizontal blanking period, a read command is sent to the VFBC to allow video data to be fetched from the DDR RAM. The data is then pushed into a FIFO before being popped during the active video period. The FIFO is crucial in bridging between the different clock domains of the memory controller and the HDMI controller. Due to the limited DDR bandwidth and speed of the IO logic of the board, a 720p HDMI output was designed instead of 1080p.

The controller has 2 user accessible registers: the frame buffer address register and the stride register. The first register tells the controller where to fetch video data from, while the second register indicates the number of bytes to increment after fetching one line of video data. The combination of the two registers allows for interesting hardware accelerated effects, such as panning the screen in such a way that the ball is always in the center.

III. SOFTWARE IMPLEMENTATION DETAILS

A. Server

Microblaze 1 on the server runs two main threads, namely communication and simulation. In addition, 3 interrupt service routines are set up to handle interrupts from the hardware timer, as well as the UART hardware.

1) Priority Levels: The most important constraint for Microblaze 1 is to send and receive updates to and from clients at 25 Hz. The communication thread, which performs this exchange, also handles the passing of the game state to the other Microblaze processor via a hardware mailbox so that the game can be drawn on the screen. To accomplish this, we assigned the communication thread the higher priority, thus ensuring that no other thread can preempt it while it is running. As this thread is event driven, it waits on semaphores when idle, thus preventing it from starving the simulation thread.

Running with a lower priority is the simulation thread. As calculations may be rather complex depending on the situation, there may be times when it fails to meet its deadlines. However, as the thread runs asynchronously to the communications thread, a missed deadline is not catastrophic, and the correct data will be available on the next cycle.

2) Interrupts: Timer interrupts are triggered 25 times per second. Semaphores are posted with each interrupt, thus ensuring the communication and simulation threads run at 25 Hz.

UART interrupts are triggered when a receive or send is complete. Upon receiving incoming data, a semaphore will be posted by the receive ISR, allowing the communications thread to immediately copy data from the UART receive buffer into a software circular buffer. The circular buffer is ideal in this case as we are only interested in the most recent data. We also tried using the system message queue but abandoned that approach for performance reasons.

The send interrupt is used for flow control, to ensure that data is written into the send buffer only when the previous entries have been sent out. Every time a timer interrupt is triggered, a semaphore is posted and the communications thread will pack the data to be sent into the send buffer. It will then check a flag to ensure that the previous batch of data has already been sent before it calls the send command. When the send is complete, the designated interrupt service routine is called and the flag bit is reset to indicate that it is clear for the next batch of data to be sent.

The use of interrupts for communications is crucial in ensuring that data is read off the receive buffers of the UART as soon as possible. This is because the buffers are only 16 entries deep, and will overflow in just 1.11 ms at 115200 baud (16 × 8 bits at 115200 bit/s, counting data bits only). Should polling be used, context switching would have to be done every 1 ms, which is not practical given the overhead involved.

3) Synchronization: The communication and simulation threads access the shared game state by locking the shared memory region with a mutex. Because the communications thread has the higher priority, it gets to receive and send the data on each 25 Hz cycle before the simulation thread can access the data, ensuring that the actions are processed as soon as the data is received. The simulation thread also tries to reduce the time it holds the lock by copying data in and out of its own data structure and then unlocking the shared resource.

4) Graphics: Microblaze 0 runs 2 threads, one to read data from the hardware mailbox, and the second to render the graphics. Priority scheduling is implemented.

Data is received from Microblaze 1 through a 512 byte deep hardware mailbox at 25 Hz, with each packet containing information such as ball and player coordinates, as well as the state of the game. The reading thread has the higher priority, and waits on a semaphore triggered by the mailbox interrupt.
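The interrupt-plus-semaphore pattern used by the reading thread here, and by the communication thread with its circular buffer, can be imitated on a PC. The sketch below is an illustration in Python, not the Microblaze code. `fake_isr`, `run_demo` and `LatestBuffer` are hypothetical names, and a plain deque stands in for the hardware FIFO. The "ISR" only posts a semaphore; the waiting thread wakes, drains one entry and stores it in a fixed-size buffer that silently discards the oldest data.

```python
import threading
from collections import deque


class LatestBuffer:
    """Fixed-size circular buffer that keeps only the newest entries."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)  # oldest entry drops off when full
        self._lock = threading.Lock()         # plays the role of the mutex

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def latest(self):
        with self._lock:
            return self._items[-1] if self._items else None

    def snapshot(self):
        with self._lock:
            return list(self._items)


def run_demo(packets, capacity=4):
    """Feed packets through a simulated ISR and return the filled buffer."""
    hw_fifo = deque()                    # stand-in for the mailbox/UART FIFO
    data_ready = threading.Semaphore(0)  # posted once per simulated interrupt
    state = LatestBuffer(capacity)

    def reader_thread():
        for _ in packets:
            data_ready.acquire()           # block until the "ISR" posts
            state.push(hw_fifo.popleft())  # drain one entry right away

    def fake_isr(packet):
        hw_fifo.append(packet)  # the hardware has delivered one packet
        data_ready.release()    # post the semaphore and return quickly

    reader = threading.Thread(target=reader_thread)
    reader.start()
    for packet in packets:
        fake_isr(packet)
    reader.join()
    return state
```

Because the reader sleeps on the semaphore while idle, it consumes no processor time between interrupts, which is the property that keeps a high-priority thread from starving lower-priority ones.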
In order to achieve smooth graphical transitions, double-buffering is implemented. A region is allocated in the DDR memory to be used as video memory frame buffers. The region is large enough for three frames, one for each alternate frame, and one as a reference. The graphics thread draws onto a frame buffer which is not displayed. Upon completion, the thread waits for a v-sync interrupt, which posts a semaphore, signaling the precise moment to switch to the newly drawn frame buffer. Switching is done by changing the frame pointer of the controller to the new region in the DDR memory. The thread then draws onto the now undisplayed buffer, and the cycle repeats. As the v-sync interrupts occur at 60 Hz, it is important to ensure that the drawing process is completed within a time frame of roughly 16.7 ms.

As the rendering thread runs at a higher frequency than the reading thread, calculations have to be performed to determine the coordinates of objects in between each key frame. Various optimizations are performed to ensure the drawing can be done fast enough. Firstly, instead of erasing the entire ball and player regions each time the screen is refreshed, the intersection between the old and the new region is not erased because it will be overwritten by the new data anyway. Erasing in this context means replacing a pixel in the frame buffer with the corresponding original pixel color in the reference frame buffer. In addition, the C program is built with the -O3 optimization flag enabled.

5) High Definition Graphics: The game state is sent from the Spartan 3E board to the Spartan 6 board via Ethernet at 25 Hz before being rendered with the same technique mentioned above. However, in order to keep up with the frame rate at a much higher resolution, further optimization is needed. Firstly, the data and code sections (except the bitmaps and frame buffers) are placed in the local memory to eliminate the bottleneck of fetching data from DDR RAM. Moreover, coordinate interpolation calculations are performed in integer arithmetic instead of floating point because the latter takes more clock cycles and is not pipelined. To ensure that accuracy is maintained when performing integer arithmetic, the remainder of an integer operation is stored and the quotient is incremented accordingly when the accumulated remainder is more than or equal to the divisor.

6) Physics and rules check: Physics calculations and rules checks are performed on a separate thread on Microblaze 1, with a lower priority than the communications thread. This is done to ensure that the communications thread will not be pre-empted by the calculations thread, as the calculations may get complex depending on the situation.

The calculations thread maintains its own set of object coordinates and other attributes in floating point for finer granularity. Each calculation cycle is triggered by a 25 Hz timer interrupt. At the start of each cycle, the thread updates object attributes with information received by the communications thread. After performing calculations, it converts the final values into fixed point and writes them back to the shared game state. As mentioned earlier, all accesses to shared memory locations are protected by mutex locks, thus preventing data corruption due to simultaneous access.

B. Client

The client runs two threads. The first thread handles the receiving of data from the server board, while the second thread processes the information and lets the AI implement its strategy before sending it back to the server. The receive thread waits for a semaphore from the receive interrupt handler. Once posted, the receive thread will run and pass the data to a global variable which has a defined structure. The AI thread then waits for a semaphore posted by the timer interrupt and accesses the same global variable. Similar to the server, mutex locks are implemented to prevent data corruption due to simultaneous access of a shared memory location.

Fig. 1. Strategies Finite State Machine: (a) Global Finite State Machine, (b) Player Finite State Machine

1) Strategy: There are three states in the global FSM, namely Attacking, Defending and Passing (see Fig 1a). Player roles depend on the global state, as can be seen in Fig 1b.

In the Defending state, the player closest to the ball will be assigned to chase the ball, while the rest of the team will mark opponents. Once the chaser is within range of the ball, his state will turn into possess, and the global game state will go into Attacking.
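The Defending-state rule just described can be sketched as a small function. This is an illustrative Python sketch, not the client code. The role labels follow the report's wording, while `assign_defending_roles`, the player names and the kick radius are assumptions made for the example.

```python
import math

KICK_RANGE = 10  # hypothetical kicking radius, in field units


def assign_defending_roles(players, ball, kick_range=KICK_RANGE):
    """Assign roles for one cycle of the Defending state.

    players maps a player name to its (x, y) position.
    Returns (roles, next_global_state).
    """
    def distance_to_ball(name):
        return math.dist(players[name], ball)

    # The player closest to the ball chases it; everyone else marks.
    chaser = min(players, key=distance_to_ball)
    roles = {name: "Marker" for name in players}

    if distance_to_ball(chaser) <= kick_range:
        # The chaser has reached the ball: it takes possession and the
        # global state machine moves on to Attacking.
        roles[chaser] = "Possessor"
        return roles, "Attacking"

    roles[chaser] = "Chaser"
    return roles, "Defending"
```

Calling it once per 25 Hz cycle with fresh positions re-evaluates the chaser each time, so the role can move to another player if the ball is deflected.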
In Attacking mode, a Best Support Position (BSP) will be calculated every cycle. With the help of the hardware co-processor, the algorithm takes into consideration the position of the ball as well as all player positions. The closest player to the BSP will be assigned the role of Supporter, and will have to move to the BSP as fast as possible. Meanwhile, the Possessor also tries to dribble to the BSP, while other players maintain their roles as Markers.

Once the Supporter is within range of the BSP, the game state goes into Passing mode, where the Possessor kicks the ball in the direction of the BSP. In this state, the Supporter chases the ball, while other players maintain their Marker roles. The Possessor will maintain its heading and speed, as a backup in case the pass is not successful. A countdown is also initialized at the start of the state, and should the Supporter fail to get in range of the ball before the countdown runs out, we assume that the pass has failed and the global state returns to Defending mode.

At all points in time, the Possessor will attempt to shoot at the goal should it be in range and have a clear line of sight. This criterion is also calculated with the help of the co-processor.

2) Communication with co-processor: Driver functions are written for the co-processor so that the client can communicate with it. The functions write the received packets into the input registers, write the correct instruction word into the configuration register and unpack the result from the result register. To run a certain function on the co-processor, one calls the execution function, and waits for the completion interrupt to occur using a semaphore. The unpack function is then called to obtain the results from the result register.

IV. TESTING AND VERIFICATION

A. Java simulator

In order to be sure that the server met the requirements specified, a separate program was written to process the output data on a PC. Features incorporated in the program include the ability to:

• Set the initial positions of the players
• Control player movements and kicks
• Monitor the server output data by receiving and decoding the packets using the protocol specifications defined in the module wiki page
• Monitor the rate of server to player packets by displaying the following parameters:
  • Total packets sent
  • Number of packets sent in a single second
  • Average rate of sending packets (packets per second)
  • Refresh rate (packets per second/11)
• Store the output log in a text file with the values stored as hex strings

Fig. 2. Java simulator block diagram

Fig. 3. Screenshot of Java Simulator

The program itself incorporates elements of a real-time system (Fig 2), and enabled us to simulate the game without the need for a client board, hence allowing the team to develop the server and client in parallel. The values shown in the screenshot (see Fig 3) indicate that the hex values sent out by our server are correct. As illustrated, the refresh rate of our server is indeed 25 Hz.

B. Python simulation for AI co-processor

A Python program is written to assist in the debugging of the BSP calculation. The program visually displays the positions on the field to which the ball can be passed and determines whether a goal scoring opportunity is available. An example of the visualization is shown in Fig 4.

In Fig 4, the blue dots represent positions to which a pass can be made, and the pink lines indicate that goal shots are possible from that position. Using this visualization, one can determine whether the BSP calculated in the co-processor is correct.

As can be seen in the summary report, the co-processor meets the timing constraints of the Microblaze clock (< 20 ns minimum period). Approximately 109120 clock cycles are required in the worst case scenario for the most complex operation (BSP calculation), which would result in a delay of roughly 2 ms. This is still much faster than if it were implemented on the Microblaze.
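The "roughly 2 ms" figure follows directly from the cycle count. The short Python check below reproduces it under two assumptions: that the co-processor is clocked at the 20 ns period implied by the timing constraint quoted above, and, for comparison, at the 17.247 ns minimum period listed in the synthesis summary report. It also expresses the delay as a share of one 25 Hz update cycle. The function names are made up for this illustration.

```python
WORST_CASE_CYCLES = 109120  # BSP calculation, worst case, from the report
UPDATE_RATE_HZ = 25         # game update rate used throughout the system


def latency_ms(cycles, period_ns):
    """Convert a cycle count at a given clock period into milliseconds."""
    return cycles * period_ns / 1e6


def share_of_update_cycle(delay_ms, rate_hz=UPDATE_RATE_HZ):
    """Fraction of one update period that a given delay occupies."""
    return delay_ms / (1000.0 / rate_hz)


at_constraint = latency_ms(WORST_CASE_CYCLES, 20.0)     # assumed 50 MHz clock
at_min_period = latency_ms(WORST_CASE_CYCLES, 17.247)   # fastest reported clock

print(f"{at_constraint:.2f} ms at 20 ns per cycle")
print(f"{at_min_period:.2f} ms at 17.247 ns per cycle")
print(f"{share_of_update_cycle(at_constraint):.1%} of a 25 Hz cycle")
```

Under the 20 ns assumption the worst case comes to about 2.18 ms, a little over 5% of the 40 ms available in each update cycle.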
Fig. 4. BSP Visualization in Python

Number of Slices:              3372 out of 14752   22%
Number of Slice Flip Flops:    2053 out of 29504    6%
Number of 4 input LUTs:        6348 out of 29504   21%
Number of IOs:                 138
Number of bonded IOBs:         138 out of 250      55%
Number of MULT18X18SIOs:       29 out of 36        80%
Number of GCLKs:               1 out of 24          4%
Minimum period: 17.247ns (Maximum Frequency: 57.981MHz)
Minimum input arrival time before clock: 13.248ns
Maximum output required time after clock: 10.152ns
Maximum combinational path delay: 17.399ns

V. POSSIBLE IMPROVEMENTS

A. Communication issues

The standard protocol assumes that not a single byte of data is lost throughout the entire match, which is a dangerous assumption to make. In our experience, a single byte loss would result in corruption of all subsequent data received, and the only resolution would be to restart the entire match. Such an implementation would be unacceptable for any firm real-time system, as it lacks robustness and error-detection/recovery. To make things worse, Xilinx has published that the UartLite serial controller has an 8% error rate, which increases with the baud rate used.

Hence we propose to improve the communications protocol with the addition of sentinel flags at the beginning and end of each update packet. This would at least provide a way for clients and servers to discover and recover from data loss.

The most common cause of data loss is buffer overflow on the receive buffers. While we have already implemented interrupt service routines to discover incoming data, and have the receive thread running at top priority, the problem can still occur. This issue has been identified to be caused by slow execution of the communication thread, as its code is placed in the DDR section of the memory. As DDR arbitration is based on a Round-Robin algorithm, the rate at which the thread can execute is variable. We have since learnt to enable a larger instruction cache on Microblaze 1 of the server, as well as on the client Microblaze, and the issue has been resolved. Unfortunately, the realization came after the project presentation, which was a step too late.

VI. LESSONS LEARNT

One major mistake we made was the failure to test the system under full load. During the testing of the communication threads, we did not send data at the full rate specified, and hence did not foresee the problem of data loss due to buffer overflow. The issue was discovered only at a much later date, leaving us with hardly any time left for debugging.

As communication is a crucial part of the system, the lack of a stable link also held back the debugging of the AI. Despite the capability of the hardware co-processor, the software strategy implemented was primitive and untested, which was a huge disappointment.

In general, we placed too much focus on developing extra features, most notably the high definition display. This left us with little time and manpower to ensure that the basic requirements were fulfilled.

VII. CONCLUSION

Despite the setbacks faced, we have gained invaluable knowledge of real-time operating systems from this project. Not only did we learn to optimize the code to meet stringent deadlines, we also learnt how to configure the hardware to deliver maximum performance. This includes the use of instruction and data caches, as well as the hardware co-processor and the custom controller for the high definition display. We have also realized the difficulties in debugging a real-time system, and the importance of rigorous tests to ensure the reliability and robustness of the system.

In terms of project management, we have learnt the importance of including buffer periods in our development schedule, in case of unforeseen technical complexities. It is also more important to meet the basic requirements flawlessly than to have extra features.
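As a closing illustration of the sentinel-flag proposal in Section V-A, the sketch below shows how a receiver could resynchronise after a lost byte. It is a Python sketch under stated assumptions, not the competition protocol. The sentinel values `START` and `END`, the fixed payload length and the function names are all hypothetical. Because payload bytes may themselves equal a sentinel value, the receiver accepts a frame only when both sentinels sit at the expected offsets, and otherwise slides forward one byte.

```python
START = 0x7E      # hypothetical start-of-packet sentinel
END = 0x7F        # hypothetical end-of-packet sentinel
PAYLOAD_LEN = 4   # hypothetical fixed payload size, in bytes


def frame(payload):
    """Wrap a fixed-length payload between the two sentinel bytes."""
    if len(payload) != PAYLOAD_LEN:
        raise ValueError("payload must be exactly PAYLOAD_LEN bytes")
    return bytes([START]) + bytes(payload) + bytes([END])


def extract_packets(stream):
    """Return every well-framed payload found in a received byte stream."""
    frame_len = PAYLOAD_LEN + 2
    packets = []
    i = 0
    while i + frame_len <= len(stream):
        if stream[i] == START and stream[i + frame_len - 1] == END:
            packets.append(bytes(stream[i + 1:i + 1 + PAYLOAD_LEN]))
            i += frame_len  # jump to the start of the next frame
        else:
            i += 1          # out of sync: slide one byte and try again
    return packets


# Three packets are sent, and one payload byte of the second is lost in transit.
sent = frame(b"\x01\x02\x03\x04") + frame(b"\x05\x06\x07\x08") + frame(b"\x09\x0a\x0b\x0c")
received = sent[:8] + sent[9:]
recovered = extract_packets(received)
```

In this example only the damaged second packet is dropped and the third is decoded normally, whereas without framing every byte after the loss would be misaligned.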