A Parallel Hardware Architecture for Real-Time Object Detection with Support Vector Machines
 


A Parallel Hardware Architecture for Real-Time Object Detection with Support Vector Machines

Christos Kyrkou, Student Member, IEEE, and Theocharis Theocharides, Senior Member, IEEE

Abstract—Object detection applications are often associated with real-time performance constraints that stem from the embedded environment in which they are often deployed. Consequently, researchers have proposed dedicated hardware architectures, utilizing a variety of classification algorithms targeting object detection. Support Vector Machines (SVMs) are among the most popular classification algorithms used in object detection, yielding high accuracy rates. However, existing SVM hardware implementations attempting to speed up SVM classification have either targeted only simple applications, or SVM training. As such, there are limited proposed hardware architectures that are generic enough to be used in a variety of object detection applications. Hence, this paper presents a parallel array architecture for SVM-based object detection, in an attempt to show the advantages and performance benefits that stem from a dedicated hardware solution. The proposed hardware architecture provides parallel processing, resource sharing among the processing units, and efficient memory management. Furthermore, the size of the array is scalable to the hardware demands, and can also handle a variety of applications such as multiclass classification problems. A prototype of the proposed architecture was implemented on an FPGA platform and evaluated using three popular detection applications, demonstrating real-time performance (40-122 fps for a variety of applications).

Index Terms—Field programmable gate array (FPGA), support vector machines, object detection, parallel architecture.

1 INTRODUCTION

Support vector machines (SVMs) have been widely adopted since their introduction by Cortes and Vapnik [1]. They are considered one of the most powerful classification engines due to their mathematical background that is based on statistical learning theory [2]. SVMs have exhibited high classification accuracy rates, which in many cases outperform well-established classification algorithms such as neural networks [4]. Consequently, there has been a growing interest in utilizing SVMs in embedded object detection and other image processing applications [5], [6], [7], [8], [9]. One of the main challenges in efficiently utilizing SVMs in real-time object detection systems is the amount of data that needs to be processed per input image in order to classify it; as such, general-purpose processors and specialized architectures such as Digital Signal Processors (DSPs) do not provide the flexibility required to achieve real-time performance. Recently, hierarchical SVMs have been proposed as a means to discard nonpromising regions very fast in the detection process; however, the performance reaches only 4 frames per second on a conventional general-purpose processor [39]. Hence, while software implementations of SVMs yield high accuracy rates, they cannot efficiently meet hard real-time constraints that are imposed in embedded environments, while also addressing power and performance trade-offs. Consequently, dedicated SVM hardware architectures have emerged as a potential solution to bridge the gap between real-time performance and high detection accuracy [16], [17], [18], [19], [20], [21], both significant constraints in embedded real-time applications. The majority of these emerging hardware solutions can mostly be divided into two categories: 1) application-specific architectures that are tailored toward very specific problems [17], [18], [19], and 2) optimizations that aim at reducing the hardware complexity [22], [23], [24], [25], [26], [27]. Only recently have a few works looked into more generic architectures that are not dependent on the vector dimensionality or the number of support vectors [20], [21]. While there has been a considerable amount of work done in accelerating support vector machines with dedicated hardware, there is a lack of work on hardware architectures for embedded object detection. In particular, such detection architectures need to adapt to a variety of environments, be scalable to the available hardware demands, and provide high frame rates for real-time processing.

In this work, we propose an optimized hardware architecture that performs object detection using support vector machines. The architecture is based on an array of processing elements, and is the completion of our initial idea presented in [10]. Our previous work in [10] presented the implementation of a Systolic Chain of Processing Elements (SCoPE) that computed the SVM feed-forward phase for one search window (a search window is defined as an image region searched for objects of interest). We finalize the initial architecture by combining multiple such chains in an array structure, thus permitting higher throughput and a more scalable and flexible design. Furthermore, we also demonstrate how the proposed architecture can be configured to operate on multiclass classification problems, such as face recognition, and how it can adapt to various embedded application scenarios.

The authors are with the University of Cyprus, 75 Kallipoleos Str., Nicosia, Cyprus 1678. E-mail: {kyrkou.christos, ttheocharides}@ucy.ac.cy.
Manuscript received 20 Jan. 2011; revised 22 Apr. 2011; accepted 01 June 2011; published online 14 June 2011. Recommended for acceptance by W. Najjar.
For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-2011-01-0041. Digital Object Identifier no. 10.1109/TC.2011.113.
0018-9340/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

The proposed architecture is integrated into an object detection system that is implemented on a Virtex 5 FPGA
    • 832 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 6, JUNE 2012and is evaluated using three object detection applications: scalar operations. The vector and scalar operations dependface, pedestrian, and car side view detection. The three on the choice of kernel (standard kernel functions areapplications serve as a representative group of the demands shown in (2-4)). For example, the vector operation forof object detection applications and as such give indicative kernels (2) and (3) involves a dot product computation,measures for generic object detection. Results indicate high whereas for (4), it involves the computation of the squaredperformance in terms of frame rate (40-122 fps for a variety norm of the difference of two vectors. Common kernelof applications) and detection accuracy (76-78 percent) for functions are shown below in (2-4)the three benchmark applications. Additionally, when Linear: Kðx; zÞ ¼ x  z; ð2Þconsidering hierarchical SVMs for face detection theperformance almost doubles, resulting in similar perfor-mance to that of hardware implementations of the Viola- Polynomial: Kðx; zÞ ¼ ððx  zÞ þ 1Þd ; d > 0; ð3ÞJones detection algorithm. The paper provides a brief overview of SVMs and object RBF: Kðx; zÞ ¼ expðÀ kx À zk2 =ð22 ÞÞ: ð4Þdetection in Section 2. An overview of related SVMimplementations present in the literature is given in 2.2 Fundamentals of Object DetectionSection 3. The proposed array architecture and its main The process of image object detection deals with determin-components are presented in Section 4, along with a ing whether an object of interest is present in an image/discussion on the scalability of the array, and extension to video frame or not. An image object detection systemhandle multiclass problems. The FPGA array prototype receives an input image/video frame, which will subse-and its evaluation are presented in Section 5, along with quently search to find possible objects of interest. Thisperformance results for the three benchmark applications, search is done by extracting smaller regions from the frame,and a comparison with other works. Finally, Section 6 called search windows, of m x n pixels (m can be equal to n),concludes the paper, with some possible future directions. which go through some form of preprocessing (histogram equalization, feature extraction), and are then processed by a classification algorithm to determine if they contain an2 SUPPORT VECTOR MACHINE BACKGROUND object of interest or not. However, the object of interest may2.1 Overview of Support Vector Machines have a larger size than that of the search window, and given that the classification algorithm is trained for a specificA Support Vector Machine (SVM) is a supervised learning search window size, the object detection system must havealgorithm based on statistical learning theory [2]. Given a a mechanism to handle larger objects. To account for this, anlabeled data set (training set), D ¼ fðx; yÞjx ! data sample; object detection system can increase the size of the searchy ! class labelg, an SVM tries to compute a mapping window and rescan the image, which implies that differentfunction f such that fðxÞ ¼ y for all samples in the data set. classifiers are used for each window size. Alternatively, theThis mapping function describes the relationship between size of the input image can be decreased (downscaling);the data samples and their respective class labels; and is consequently, the size of the object of interest will beused to classify new unknown data. 
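As a concrete reference for the feed-forward phase that the rest of the paper accelerates, the sketch below evaluates the decision function D(z) = sign(sum_i a_i*y_i*K(z, s_i) + b) with the linear, polynomial, and RBF kernels given in (2)-(4). It is a plain NumPy software model with illustrative names (kernel, svm_decision), not the hardware datapath.

```python
import numpy as np

def kernel(x, z, kind="poly", d=2, sigma=1.0):
    """Kernel functions (2)-(4): linear, polynomial ((x.z)+1)^d, and RBF."""
    if kind == "linear":
        return float(np.dot(x, z))
    if kind == "poly":
        return float(np.dot(x, z) + 1.0) ** d
    if kind == "rbf":
        diff = x - z
        return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
    raise ValueError("unknown kernel")

def svm_decision(z, support_vectors, alphas, labels, bias, **kernel_args):
    """Feed-forward phase (1): sign( sum_i a_i * y_i * K(z, s_i) + b )."""
    acc = bias
    for s, a, y in zip(support_vectors, alphas, labels):
        acc += a * y * kernel(z, s, **kernel_args)
    return 1 if acc >= 0 else -1
```

A search window is flattened into the input vector z before the call. The hardware of Section 4 evaluates exactly this sum, but splits each K(z, s_i) into a vector part (dot product or squared norm, computed by the Vector Units) and a scalar part (computed by the Scalar Units).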
The mapping function reduced so it can be “enclosed” within the search window.in SVMs corresponds to a hyperplane that separates the All subsequent downscaled versions of the input image, aredata samples of the two classes. The hyperplane is desired therefore reexamined using the same search window size.to have the maximum distance (margin) from the two The downscaling process is often preferred as it isclasses. SVMs formulate this problem using lagrangian computationally less expensive [42]. Downscaling is doneoptimization theory and try to find the data samples that in steps to account for various object sizes, down to the sizeinfluence the shape of that hyperplane. The data samples of the search window. Hence, many downscaled images arethat constrain the margin from becoming larger are those produced from a single input image/video frame, each inlying on the boundary of each class. These samples are turn producing a number of search windows, whichcalled support vectors [4] and correspond to nonzero alpha increases the amount of data that must be processed bycoefficients (a) in the lagrangian optimization problem. Further- the classification algorithm. Search windows are extractedmore, SVMs utilize a technique called the kernel trick that every few pixels, and the number of pixels that are skippeduses a kernel function Kðx; zÞ to project the data into a is called the window overlap.higher dimensional space, making it easier to find aseparating hyperplane. Classification in the context ofSVMs is done using the following classification decision 3 RELATED WORK ON SUPPORT VECTORfunction (a process called the feed-forward phase) MACHINES ! XN Since their introduction support vector machines have DðzÞ ¼ sign ai yi K ðz; si Þ þ b ; ð1Þ been used in many applications as part of larger software- i¼1 based detection systems for pedestrian [5], [6] and car sidein which ai are the alpha coefficients, yi are the class labels of view detection [7], and as the main classification engine forthe support vectors, si are the support vectors, z is the input face detection in [8] and [9]. Although the above softwarevector, Kðz; si Þ is the chosen kernel function, and b is the bias. applications demonstrated high accuracy rates, they The majority of the processing time for the calculation offered limited performance.of (1) goes to the kernel computation. This computation There exists a fair amount of work on accelerating bothhappens between the input and support vectors. The kernel the SVM training and classification for general-purposecomputation is split into two parts, the vector operation and processors and DSPs, aiming to provide higher performance
    • KYRKOU AND THEOCHARIDES: A PARALLEL HARDWARE ARCHITECTURE FOR REAL-TIME OBJECT... 833on such platforms. The work in [11] presents an evaluation system can be potentially offset by the cost of converting toof SVM implementation on embedded processor architec- and from the logarithmic domain, something that dependstures, and proposes architectural modifications in order to on the application and requires a case-by-case study. Aimprove their performance. An analysis was performed in different approach was followed by Anguita et al. [25] and[12] where critical parts of the SVM algorithm were mapped [26] which propose a new kernel function that does notbetween hardware and software, demonstrating how hard- require multiplication. However, it may not offer the sameware can be used to accelerate SVM computations. An generalization capabilities compared to commonly usedattempt to implement SVMs on a microcontroller was SVM kernels (2-4). Furthermore, both Anguita et al. [25] andpresented in [13], dealing with issues such as limited [26] propose modifications on the SVM training algorithmsmemory and hardware. Recently, Graphics Processing so that its parameters are integers rather than floating pointUnits (GPUs) have been utilized in the implementation of numbers. Finally, in [27], Irick et al. propose hardwareSVMs [14] due to their parallel nature when compared with optimizations for the RBF kernel.general-purpose processors showing significant speedups. From the related works present in the literature, onlyHowever, GPUs still need very careful and efficient Genov and Cauwenberghs [15], Reyna et al. [19], and Robertoprogramming in order to provide high performance, et al. [20] deal with object detection in images, and only theprimarily because of the fixed hardware (especially the latter two are digital architectures. The architecture pre-interconnect) that may not suit the computation and data sented in [19] is dependent on the vector dimensionality inflow of some applications. terms of computational resources, and thus, suffers from Hardware implementations of SVMs have gained notice- scalability issues when dealing with applications with highable interest in recent years, primarily because of the dimensional vectors. The implementation in [20] is tightlypotential real-time performance benefits they offer in terms coupled with a dedicated embedded processor and is unclearof both training and classification. Significant work has been from the paper whether it could operate as a standalonemade in the implementation of SVMs on custom hardware, SVM object detection processor. Both works do not discussmostly on FPGAs. A mixed-signal SVM processor was important object detection issues such as image downscalingpresented in [15], utilizing analog computation for accuracy and search window size, and omit important performanceand digital output for VLSI integration. Anguita et al. [16] metrics such as frames per second.propose a hardware architecture for a modified SVM Considering the limitations of previous works for SVM-training algorithm showing comparable results with respect based object detection, we propose a generic full customto the widely used Sequential Minimal Optimization (SMO) hardware architecture for the SVM feed-forward phase. Thetraining algorithm [36]. 
Small scale SVM implementations array-based architecture has many benefits compared towere proposed in [17], [18], and [19] with applications that previous works, as it provides a configurable platform forhad either a few support vectors or low vector dimension- parallel processing of many input and support vectors, it isality. Furthermore, the proposed architectures were devel- modular and regular thus allowing for a scalable designoped for specific problems and thus are not easily extendable and reduced complexity, and demanding hardware unitsto other scenarios. Finally, Roberto et al. [20] and Peter et al. are shared among the most common simpler units. Also the[21] focus on the implementation of custom hardware to array-based approach allows for efficient memory manage-accelerate the vector operations of SVM. The former utilizes a ment and data flow. Additionally, the proposed architec-vector coprocessor of 64 multiprecision units (8 and 16 bit), ture can be configured to handle multiclass problems,comprised of ALUs and multipliers. The latter utilizes arrays something that has not been considered in previous works.of vector processing elements and also compares theproposed hardware implementation with GPU and CPU 4 SVM ARRAY PROCESSING ARCHITECTURESVM implementations, demonstrating both higher perfor- The proposed architecture has three main regions. Themance and lower power consumption. This emphasizes the memory region comprised of a chain of memory unitsimportance of custom hardware architectures for embedded where the training data are stored, the vector processingapplications that require both performance and low power. region which is responsible for the vector processing, and isBoth Roberto et al. [20] and Peter et al. [21] demonstrate how the largest region in the array, and the scalar region thata parallel vector processing architecture can be used to speed processes the results produced from the vector operations.up the SVM feed-forward phase. The array is comprised of two types of processing elements The main vector operations in SVM require multiplica- that serve different purposes: the Vector Unit (VU) is usedtions that can be expensive in terms of area, power, and for all vector computations, and the Scalar Unit (SU)performance. Thus, there has been extensive research in operates on the scalar values produced by the VUs. Inimplementing SVMs without the use of multiplication. To addition to these processing elements, the architecturethat end Khan et al. [22], [23] and Boni and Zorat [24] contains dedicated memory units that feed the array withproposed that all the operations of SVMs be done in the training data, and a finite state machine (FSM) control unitlogarithmic number system. The main advantage that stems that synchronizes the array operation. The structure of thefrom this approach is that the costly operation of multi- array is given in Fig. 1. The main regions of the array areplication is replaced by additions and subtractions resulting detailed next, followed by a description of the array datain reduced hardware resources and faster designs, as well flow and the issues associated with the scalability of theas reduced power consumption. It must be noted, however, array, as well as details on extending its operation tothat the benefits from operating in the logarithmic number address multiclass classification problems.
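To keep the structure of Section 4 in view while reading the details that follow, here is a small structural model of the array: a grid of Vector Units, one Scalar Unit per row, and one support-vector memory bank per column. The class and field names are illustrative and not taken from the authors' design files.

```python
from dataclasses import dataclass

@dataclass
class SvmArrayConfig:
    rows: int     # input vectors (search windows) processed in parallel
    columns: int  # support vectors processed in parallel

    @property
    def vector_units(self) -> int:   # one VU per grid position
        return self.rows * self.columns

    @property
    def scalar_units(self) -> int:   # one SU terminating each row
        return self.rows

    @property
    def memory_banks(self) -> int:   # one support-vector memory feeding each column
        return self.columns

# The FPGA prototype described in Section 5 uses a 4 x 80 arrangement:
cfg = SvmArrayConfig(rows=4, columns=80)
assert (cfg.vector_units, cfg.scalar_units, cfg.memory_banks) == (320, 4, 80)
```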
    • 834 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 6, JUNE 2012 Fig. 2. Vector Unit: Responsible for the vector processing. It computes either the dot product of two vectors, or the square norm of their difference (depending on the choice of kernel). VU is involved with the computation of the vector operation, and simultaneously transfers data/control values to its neighboring VUs. Support vector components are trans-Fig. 1. Architecture of the SVM array processing engine illustrating the ferred vertically and input vector components are trans-different data flows in the array, and the connectivity between units. ferred horizontally. During the TRANSFERRING state, theSearch window pixel values enter from the left-most VUs, while supportvectors enter through the top row VUs. scalar value computed by each VU is transferred toward the SUs. Transfer data switching is done through a 2-1 multi-4.1 Array Architecture plexer and data propagation through registers. Moreover, a vector operation signal determines the input to the multiplier4.1.1 Vector Processing Region and consequently the resulting vector operation. Lastly, theThe majority of components in the array are VUs. These VUs simply remain idle during the IDLE state.units (shown in Fig. 2) are used in the processing of the It is possible to allow for multiple components of theinput and support vectors, and produce the scalar values same vector to be processed in parallel. However, doing sorequired for the latter computations. Multiple VUs are increases the resources required per VU and thus, depend-interconnected to form a systolic array, allowing for rapid ing on the hardware budget, this approach to increasingparallel processing of vectors. parallelism may reduce the number of VUs that can be used The operation between vectors is kernel dependent as in the array. This decision involves a trade-off betweendescribed in Section 2.1. Most kernel functions ((2) and (3)) vector-level parallelism (process more vectors in parallel)require the computation of the dot product between the two and component-level parallelism (process vector compo-vectors, while for (4) the main operation is the squared nents in parallel). Increasing the component-level paralle-norm of the difference of two vectors. In [21], the dot lism requires the following changes to the VU architecture.product variant of the RBF kernel is used resulting in a First, for each vector component that is to be processed amore uniform architecture that computes just dot products. dedicated subtractor and multiplier is needed. Second, theHowever, the variant is not as numerically stable and also products produced by each multiplier must be summed up.requires that squares of vectors are precomputed. Thus, we This can be done sequentially using a cascade of adders, orconsider the initial RBF kernel that only requires minimal in parallel using a tree of adders. Notice that the additionaladditional hardware in the form of a subtractor, compared adders also increase the hardware utilization per VU.to the dot product-based kernel. Hence, the VUs are Processing i additional vector components increases thecomprised of a subtractor, a multiplier, and an accumulator hardware overhead by i additional subtractors and multi-to satisfy the need for both vector operations. The subtractor pliers, and i-1 adders for the partial sums, per VU.is used to calculate the difference between the two vectors(used for kernel (4)). 
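A behavioral sketch of a single Vector Unit under the bit widths discussed above: 8-bit vector components, 16-bit products, and an accumulator sized as ceil(log2(c * (2^16 - 1))) bits for c-dimensional vectors. It processes one component pair per cycle and is a software model, not RTL; the example window size in the comment is hypothetical.

```python
import math

def vu_accumulator_bits(c: int) -> int:
    """Accumulator width needed for c accumulations of 16-bit products."""
    return math.ceil(math.log2(c * (2 ** 16 - 1)))

class VectorUnit:
    """One pair of 8-bit components per cycle.

    mode 'dot'  : accumulate x * s        (linear and polynomial kernels)
    mode 'norm' : accumulate (x - s)^2    (RBF kernel)
    """
    def __init__(self, mode: str = "dot"):
        self.mode = mode
        self.acc = 0

    def cycle(self, x: int, s: int) -> None:
        if self.mode == "norm":
            d = x - s            # subtractor stage
            self.acc += d * d    # multiplier squares the difference
        else:
            self.acc += x * s    # 8 x 8-bit multiply, 16-bit product

    def result(self) -> int:
        """Value handed toward the Scalar Unit in the TRANSFERRING state."""
        out, self.acc = self.acc, 0
        return out

# e.g., a hypothetical 24 x 24-pixel window has c = 576 components, so the
# accumulator needs ceil(log2(576 * 65535)) = 26 bits.
assert vu_accumulator_bits(576) == 26
```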
4.1.2 Scalar Processing Region Object detection is usually performed in grayscale images The Scalar Units (SUs) are involved in the latter stages of the(8 bit per pixel), therefore we allocate 8-bits to encode the computation and process the scalar values produced by theresult of the subtractor as discussed in [12]. The preceding VUs, after the vector operations have been completed. Eachmultiplier either computes the product of the two vector SU receives the scalar values of the VUs in its row, one percomponents, or squares the difference of the two compo- cycle via the right-most VU in the array. The SUs arenents. An 8  8-bit multiplier would suffice for the majority comprised of two major components, the kernel scalarof object detection applications most of which operate on module (KSM) and a multiply accumulate unit (MAC). Thegrayscale images (8-bits per pixel). The result of the 8  8-bit kernel scalar module performs the scalar operation of eachmultiplier (a 16-bit value) is passed to an accumulator to kernel, as described in Section 2.1. The MAC unit multipliescomplete the dot product computation. The bit width of the each scalar value with its respective alpha coefficient andaccumulations is proportional to the vector dimensionality, accumulates the outcome; finally it adds the bias to thewhich we denote as c. The accumulator therefore, performs accumulated result once the processing of all supportc accumulations of 16-bit operands, thus the accumulator vectors is complete. The MAC’s bit width (precision) isbit-width requirements are given by log2 ðc  ð216 À 1ÞÞ. determined by the choice of kernel (i.e., the kernel’s output Each VU has three operational states: PROCESSING, bit width), the number of support vectors (determines theIDLE, and TRANSFERRING. In the PROCESSING state the number of accumulations and consequently the precision of
    • KYRKOU AND THEOCHARIDES: A PARALLEL HARDWARE ARCHITECTURE FOR REAL-TIME OBJECT... 835 having to process large amount of training data. Thus, providing parallel access to the training data is critical in exploiting the inherit parallelism of SVMs. Under these considerations we developed an efficient parallel memory structure that feeds the array of VUs with training data. The memory structure consists of banks of memories (equal to the number of array columns) that supply the array with support vector data through the VUs in the top row of the array (Fig. 1). The support vectors are distributed among the memory banks to allow for parallel access and processing.Fig. 3. Scalar Unit: Handles the scalar operations of the SVM processing The memories are arranged in a pipelined structure thatflow. The KSM unit can be configured either as a multiplier for the facilitates address data movement in the same manner as insquaring operation in the second degree polynomial kernel (the most the processing array. This helps maintain temporal consis-frequent used kernel for image-based object detection) or as a LUT toimplement any other kernel. The rest of the units handle the tency as well as provide parallel access to the memories sinceaccumulation of the result. the address data are moved in a pipelined fashion, from memory to memory, avoiding the use of dedicated wires perthe accumulator), and the chosen precision for the alpha memory that would have increased the hardware complexity.coefficients. The architecture of the SU is shown in Fig. 3. 4.2 Flow of Operation The KSM computes the kernel outcome for each scalarvalue it receives. The operation it performs depends on the Processing of the input vectors (search windows) happens in steps. During each step, an amount of support vectors,kernel function. The implementation of the kernel function equal to the number of columns in the array, is processed.is an important issue. The most common function used in The number of steps required is determined by theliterature for image object detection is the second degree maximum number of support vectors in each memorypolynomial kernel [8], [19], [20]. As such we hardwire the unit. For example, if there are 80 columns and memorynecessary components for that specific kernel, into a single units, and 120 support vectors, then 40 memory banks willmodule called the poly module, an adder and a multiplier. hold two support vectors and the other 40 will have oneTo implement the rest of the kernels we included a LUT in support vector each. Processing an input vector will requirethe KSM which can be initialized to the values correspond- two steps, in the first step the first 80 support vectors willing to a specific kernel, let that be an RBF kernel or a third be processed while the remaining 40 will be processed indegree polynomial (kernel (3) with d ¼ 3). A selection signal the second. At each step, the VUs which do not process anyis used to determine whether the poly module or the LUT support vectors simply propagate data to the SUs. Thewill provide the input to the latter stages. This KSM arrangement of support vectors in memories is applicationconfiguration allows the SU to implement a variety of specific and depends on the available hardware resourceskernel functions; however, the trade-off involved is that the as well. 
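The kernel scalar module can be mimicked in software to explore the LUT precision trade-off noted above: the second-degree polynomial is computed directly by the hardwired poly module (an adder followed by a multiplier), while other kernels are read from a lookup table indexed by the VU output. The table size, index quantization, and sigma value below are illustrative assumptions, not the values used in the paper.

```python
import math

def poly2_scalar(dot_product: float) -> float:
    """Poly module: (dot + 1)^2, i.e., one addition followed by one multiplication."""
    t = dot_product + 1.0
    return t * t

def build_rbf_lut(max_sq_norm: float, entries: int, sigma: float):
    """Precomputed table for exp(-u / (2*sigma^2)), u being the squared norm from a VU."""
    step = max_sq_norm / entries
    table = [math.exp(-(i * step) / (2.0 * sigma ** 2)) for i in range(entries)]
    return table, step

def ksm_lut_lookup(sq_norm: float, table, step: float) -> float:
    """LUT path of the KSM; index truncation models the finite table resolution."""
    idx = min(int(sq_norm / step), len(table) - 1)
    return table[idx]

# A coarser table costs less memory but adds quantization error to the kernel
# value, which is exactly the accuracy/memory trade-off of the LUT-based KSM.
rbf_table, rbf_step = build_rbf_lut(max_sq_norm=float(2 ** 26), entries=1024, sigma=512.0)
```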
If the hardware budget allows it, the array can beprecision of the LUT in the KSM impacts the classification made as parallel as possible; otherwise, it can be adapted toaccuracy and memory demands. The implementation the available hardware.requirements of the kernel functions result in processing Before the computation of the feed-forward phase beginsunits that may require a lot of resources (high bit-width the array must first be initialized with the SVM parameters.resources especially), and may also reduce the operating These include the bias value, the support vectors, the alphafrequency. As such, the SUs are pipelined to reduce long coefficients, and the kernel function. The initialization can bepaths and increase clock frequency. done at runtime by an I/O controller that interfaces with the One of the key benefits from arranging the processing array. The memory region must first be initialized with theunits in an array structure, in contrast to existing works, is training data (support vectors and alpha coefficients). Thethat the resource-hungry components in the SUs are shared vector operation is selected in each VU via a control signalamong multiple VUs and thus are used in computing the that is propagated in systolic manner through the array.scalar value of many support vectors, instead of having Another control bit is used to select between the LUT anddedicated units for each vector operation. This is possible as the poly module in the SUs. The LUT must first be initializedthe vector operations and the produced scalar values have with the appropriate data. Initialization is done through theno dependencies between them. Furthermore, due to the top row SU, which transmits data values through the samesystolic nature of the array architecture the scalar values, in pipeline that is used for the alpha coefficients.the same row, are produced sequentially and since all scalar After the array is initialized with the SVM parameters andvalues will be subject to the same processing operations, it training data, the classification procedure can be initiated. Itis efficient to process them with the same unit using begins with the all the array processing elements in the IDLEresource sharing; thus, reducing hardware demands and state. The top-left-most VU is the first one to be enabled aftercomplexity without negatively impacting performance. it receives the first components of the input and support vectors, thus entering the PROCESSING state. The neighbor-4.1.3 On-Chip Data Memory Management ing VUs follow next and continue to propagate the vectorSupport Vector Machines, as many other machine learning values and control signals leading more VUs to thealgorithms, exhibit characteristics such as predictable mem- PROCESSING state. After ðrow  column À 1Þ cycles, allory access patterns and independent operations, while the VUs in the array will be in the PROCESSING state, at
    • 836 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 6, JUNE 2012which point the array will reach its full processing potential.Input vector values and control signals are propagated row-wise, while the incoming support vector values arepropagated column-wise by each VU. When the scalar value is computed in all VUs, after c(number of components in vector) cycles, they all enter theTRANSFERING state simultaneously, each propagating thecomputed scalar values toward the right-most VUs, whichin turn propagate them on to the SUs. At this point the SUsare enabled and begin processing each scalar value that theyreceive from the right-most VUs. Along with the SUs, thealpha coefficients memory, which provides the SUs with thealpha coefficients, is also enabled. Each alpha value istransferred through a pipeline downward to each SU. Thisis necessary to maintain temporal consistency, as the alpha Fig. 4. Extension of the array implementation to handle multiclass problems, such as face recognition, and different training sets.coefficients must be multiplied with the scalar value of theirrespective support vectors. The following cycle after the example of a multiclass classification problem is faceVUs have entered the TRANSFERRING state they are resetin systolic manner starting from the top-left-most VU; each recognition, where the input window must be classified inVU will again enter the PROCESSING state to begin a new one of the possible candidates in a face database [40]. Tovector operation the following cycle after it has been reset. handle such problems the rows in the array must beWhen all scalar values have been processed, the bias is decoupled so that they can work independently towardadded to the accumulated result to obtain the classification different classification problems. Each row must be suppliedoutcome. The transfer from one state to another is facilitated with its own set of support vectors and alpha coefficients.by dedicated control signals that flow in systolic manner The following modifications must take place to allow thethrough the array avoiding the use of global control signals. architecture to handle multiclass classification problems From the above analysis the total number of cycles (also shown in Fig. 4).required for the classification of input vectors equal to thenumber of rows is given by 1. Each VU requires a multiplexer to select between the support vector from its above VU, or a support ½m=columnsŠ  ððrow þ columns À 1Þ þ c vector memory unit. ð5Þ þ reset cycle þ transfer cycleÞ: 2. Multiple controllers are required, one per row. Each controlling the operation of a row under differentThe array processes in parallel as many support vectors classification parameters. However, only one isas the number of columns in the array. Hence, if we necessary when the system operates as an array.assume m support vectors in a training set, we need 3. The left-most VUs will also require multiplexers to½m=ðnumber of columnsފ repetitions in total for all sup- select between different control signals. Whenport vectors. operating as an array the control signals will come from one central control unit, while when each row4.3 Array Scalability and Implementation Issues operates independently the control signals will comeAn array consisting of i columns and j rows, will have i x j from the dedicated control unitsVUs, j SUs, and it can process j parallel input vectors and 4. Each SU will also require a multiplexer to selecti parallel support vectors. 
The hardware resources are between the alpha coefficients from the previous SUdetermined by the number of SUs and VUs, as well as when operating as an array, or from one of the alphaminor overheads for wiring and control of the array. coefficient memories.Increasing the number of rows requires additional VUs The main advantage that stems from enhancing the arrayequal to the number of columns, and a single SU. On the with the additional multiplexers is that the array can operateother hand, increasing the number of columns requires in various configurations depending on the applicationadditional VUs equal to the number of rows in the array. demands. First, it can operate as a fully parallel systolic arrayIncreasing the array size in either way increases parallelism; to speed up a single object detection problem, using one ofthe hardware overheads to the array rows and columns the available training set memories. Second, each row canincrease linearly as well. The array rows are however, work independently on one classification problem (faceconstrained by the memory I/O bandwidth and the array recognition). Finally, any combination of the above twocolumns are constrained by the number of support vectors configurations is possible, such as multiple systolic arrays,and support vector memory bandwidth and capacity. or a single systolic array with multiple independent rows.4.4 Multiclass Classification Support This flexibility allows the enhanced array processing engineThe proposed array architecture is not only suitable for to adapt to a variety of object detection scenarios and specificbinary classification problems (traditional object detection application demands, making it suitable for embeddedproblems), but can also be extended to handle multiclass environments which exhibit a high degree of variabilityclassification as well, with minimal hardware overhead. An between them.
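Restating the cycle count of (5) as code: classifying one batch of as many input windows as the array has rows takes ceil(m/columns) passes over the support-vector memories, and each pass costs (rows + columns - 1) cycles to fill the systolic array, c cycles for the vector operation, plus a reset and a transfer cycle. The helper below is a back-of-the-envelope model of (5), not a cycle-accurate simulator; the numeric example is hypothetical.

```python
import math

def cycles_per_batch(m: int, rows: int, columns: int, c: int,
                     reset_cycles: int = 1, transfer_cycles: int = 1) -> int:
    """Equation (5): cycles to classify `rows` input windows against m support vectors."""
    passes = math.ceil(m / columns)  # support vectors are consumed `columns` at a time
    per_pass = (rows + columns - 1) + c + reset_cycles + transfer_cycles
    return passes * per_pass

def windows_per_second(m: int, rows: int, columns: int, c: int, f_clk_hz: float) -> float:
    """Throughput estimate: `rows` windows complete every cycles_per_batch cycles."""
    return rows * f_clk_hz / cycles_per_batch(m, rows, columns, c)

# Hypothetical example on a 4 x 80 array: m = 160 support vectors and c = 576
# components give 2 passes of 661 cycles each, i.e., 1322 cycles for 4 windows.
assert cycles_per_batch(m=160, rows=4, columns=80, c=576) == 1322
```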
    • KYRKOU AND THEOCHARIDES: A PARALLEL HARDWARE ARCHITECTURE FOR REAL-TIME OBJECT... 8375 EXPERIMENTAL EVALUATION AND RESULTS Another factor that limits the performance of object detection systems is the memory access mechanism and I/OWe designed and implemented a prototype of the array capabilities. Memory access for both the input and supportarchitecture on an FPGA platform, and evaluated it using vectors is of great importance as it limits the capabilities of athree popular detection applications. Prior to detailing the system for parallel processing. It is important to haveFPGA implementation, we first discuss the factors that affect parallel access to the training set and at the same time fetchthe performance of an object detection system, and then multiple search windows, in order to take full advantage ofproceed with detailing the training methodology for the the capabilities of a fully parallel architecture such as athree applications. We then provide the performance results, systolic array. Additionally, it is important to considerand discuss how it compares with existing detection systems. where the data will be located, off-chip or on-chip. Input5.1 Performance and Constraints data are usually stored off-chip as they arrive from an external image/video acquisition source. The training dataThere are several factors that impact the performance of an on the other hand, can be stored on-chip if the memory isobject detection system, but first, it is important to consider available; otherwise, they are also stored off-chip as well.the metrics used to measure performance. An image object The latter may decrease performance if the number ofdetection system is characterized by how accurately it can support vectors is large, as off-chip communication willclassify data as well as how many image frames it can become the bottleneck to the system performance.process per second. Thus, the two commonly used perfor- Finally, the operating frequency of the object detectionmance metrics are the detection accuracy, and frames per second system greatly impacts the performance. For FPGA-based(FPS) or frame rate. A minimum performance of 30 FPS is designs this is a limiting factor as fixed routing and LUTrequired in order for an object detection system to be capable placement may not allow for a design to operate at its fullfor real-time video processing. potential. High frequencies can be achieved by regular and A factor that affects both the detection accuracy as well modular designs with small critical paths; such as theas the frame rate is the number of support vectors. The proposed hardware architecture.fewer the support vectors the better the performance, sinceless operations are required per input vector. However, 5.2 SVM Training Set and Parametersthe detection accuracy may be reduced when fewer support We use three object detection problems as benchmarks forvectors are used. To satisfy both goals a combined frame- evaluating the proposed architecture: pedestrian detectionwork for training support vector machines with respect to [5], [6], [30], [32], car side view detection [7], and facehardware constraints could potentially be explored, but is detection [8], [9]. All three detection problems are interest-left as future work. Also, on a lesser extent the detection ing for the proposed implementation, as they can be appliedaccuracy is also affected by the bit width used to represent in intelligent embedded environments for surveillance andthe training data in hardware. 
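The per-frame workload factors discussed in Sections 2.2 and 5.1 (search window size, the pixel step between windows, and the number of downscaled image versions) can be combined into a quick estimate of how many windows a frame generates. The scale factor and parameter values below are illustrative; the actual per-application counts used in the evaluation are those reported in Table 1.

```python
def windows_at_one_scale(width: int, height: int, win_w: int, win_h: int, step: int) -> int:
    """Search windows extracted from one (width x height) image, stepping `step` pixels."""
    if width < win_w or height < win_h:
        return 0
    return ((width - win_w) // step + 1) * ((height - win_h) // step + 1)

def windows_per_frame(width: int, height: int, win_w: int, win_h: int,
                      step: int, scale: float = 0.8) -> int:
    """Total windows over all downscaled versions, down to the search window size."""
    total, w, h = 0, width, height
    while w >= win_w and h >= win_h:
        total += windows_at_one_scale(w, h, win_w, win_h, step)
        w, h = int(w * scale), int(h * scale)  # one downscaling step
    return total

# Hypothetical example: a 320 x 240 frame, a 20 x 20 window, and a 2-pixel step.
print(windows_per_frame(320, 240, 20, 20, step=2))
```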
However, if the bit width is security purposes, as well as traffic and street monitoring.chosen appropriately with respect to the targeted applica- Furthermore, the three detection problems concern different objects; consequently, each one has different detectiontion domain, there can be minimal to zero accuracy loss. parameters such as search window size, window overlap,Performance is also affected by several other factors, and number of downscaled images. As a result we cantypically encountered in object detection applications. evaluate how the architecture fairs for different application The first is the number of search windows that need to parameters and analyze its suitability for generic objectbe processed per frame. This is determined both by the size detection. Details of the parameters and characteristics ofof the object of interest, and the search window overlap. each application are shown in Table 1. All detectionWhen the object of interest has a relatively small size, problems concern grayscale images corresponding to 8 bitand consequently a small search window is used, more pixel values. Training of the SVM models was done usingwindows will be generated per input frame increasing the the SMO algorithm implementation provided in MATLABprocessing time per frame. Conversely, if the targeted object [41], with publicly available data sets from [28], [29], andof interest is large, fewer windows will be generated. The [31], [34], each consisting of training and test sets for bothwindow overlap between successive windows also deter- the negative and positive classes. To further test andmines the number of generated windows, and is deter- evaluate the generalization capabilities of the trained SVMmined by the size of the object of interest. Choosing the models we obtained full test frames from [29], [31], and [33]appropriate window overlap involves a trade-off between for face, car side view, and pedestrian detection, respec-the granularity of the window search and as such the tively. These frames were rescaled to 320 Â 240 pixel images and used to evaluate the performance of the hardwaredetection accuracy, and the resulting FPS. implementation in terms of detection accuracy and frame The input image size is also equally important to the rate. We experimented with both second degree polynomialperformance of an object detection system. A larger input and RBF kernels. The former was found to be efficient forimage will generate more search windows increasing the time object detection applications [20], while the latter is alsoneeded to process the whole frame. At the same time the widely used in many applications with very good detectionnumber of downscaled versions of the input image will results [3]. We selected the best SVM model based on theincrease to account for objects of different sizes. The input following two criteria. First, the memory required to storeimage size and number of downscaled frames have a greater the support vectors must not exceed the available memoryimpact on performance when the search window size is small of the experimental platform. Second, the selected modeland as a result a large number of windows will be generated. must maintain good accuracy rate on the full frame images
    • 838 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 6, JUNE 2012 TABLE 1 stronger training set can increase the detection accuracies. SVM Parameters and Application Characteristics However, since this is not the main focus of this work we did not attempt to further optimize the training sets. The SVM model parameters selected for each detection problem are given in Table 1. 5.3 FPGA Implementation and Evaluation We developed an array prototype to perform the SVM feed forward phase for the three applications targeting the ML505 evaluation platform. The selected platform is equipped with a Virtex 5 LX110T FPGA, an external DDR2 DRAM with a capacity of 256MB, a DVI output, a compact flash card reader, and 64 DSP units (embedded multipliers and accumulators), which makes it suitable for evaluating object detection algorithms that require large amounts of memory as well as visual verification of results. We developed a prototype of the array on the FPGA, which interfaced with an embedded Xilinx Microblaze soft- processor [35] for I/O purposes. The overall system is illustrated in Fig. 5a, while resource utilization for both the Microblaze system and the implemented array are given in Table 2. The FPGA prototype is illustrated in Fig. 5b, with performance results (frame rate and detection accuracy) show in Fig. 5c. Finally, some examples of test imagefor each application. We selected full-frame test images detection results are show in Fig. 5d.from the data sets [29], [31], and [33], and used them toevaluate the accuracy of each SVM model (for each frame 5.3.1 I/O System Based on Microblazethe number of windows that must be processed perapplication is shown on Table 1). The true positive (TP) Microblaze handles tasks such as system supervision,and false positive (FP) rates are given in Table 1. True control and data transferring to and from the array, whilepositives correspond to rectangles on the resulting output the array is responsible for the SVM classification. Micro-image which correctly contained an object of interest; all blaze communicates with external components via dedi-other image regions that are marked as containing an object cated interfaces for data transfer and monitoring purposes.of interest, but do not in fact contain such an object, are Input image frames were initially stored in a Compact Flashcategorized as false positives. The reported accuracy rates card (acting as the image acquisition source) and loaded toof course depend on the training data, and obviously a the external DDR2 DRAM prior to the detection phase.Fig. 5. (a) Block diagram of the FPGA prototype system. (b) FPGA prototyping platform. (c) Performance of the FPGA hardware system in terms offrame rates and accuracy. (d) Detection results for selected images used in the evaluation.
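For readers who want to reproduce a comparable model in software, the sketch below trains a second-degree polynomial SVM and extracts the quantities the array needs (support vectors, alpha-times-label coefficients, bias). It is only a rough stand-in: the authors trained with MATLAB's SMO implementation on their own data sets, whereas this uses scikit-learn and hypothetical file names.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: flattened grayscale search windows and +/-1 labels.
X_train = np.load("train_windows.npy")
y_train = np.load("train_labels.npy")

# Second-degree polynomial kernel ((x.z) + 1)^2, matching kernel (3) with d = 2.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X_train, y_train)

# Parameters that would be loaded into the array's memories:
support_vectors = clf.support_vectors_       # s_i
alpha_times_y   = clf.dual_coef_.ravel()     # a_i * y_i
bias            = clf.intercept_[0]          # b

# A simple check against the memory-budget selection criterion described above.
print(len(support_vectors), "support vectors,",
      support_vectors.astype(np.uint8).nbytes, "bytes at 8 bits per component")
```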
    • KYRKOU AND THEOCHARIDES: A PARALLEL HARDWARE ARCHITECTURE FOR REAL-TIME OBJECT... 839 TABLE 2 TABLE 3 Array Processing Engine FPGA Synthesis Results Hardware Evaluation Information other hand, were more demanding (because of the alpha coefficient multiplier which is the critical path of the design) and thus were mapped on the dedicated DSP units of the FPGA. The entire SU was pipelined in an effort to maintain high frequency. It must be noted that the number of DSPsMicroblaze retrieves pixel data from the external DRAM on the FPGA does not limit the number of SUs that canand sends it to the array via the Fast-Simplex-Link (FSL) be instantiated, as any additional SU can be instantiated onbus interface. Evaluation and verification as well as overall the FPGA custom logic. Table 3 summarizes the hardwaresystem monitoring were done using a serial communication parameters of the implemented array. The clock frequencyinterface, and a DVI interface to output the image frames of the design was set at 100 MHz, which is the system clockwith the detected objects on a monitor. frequency of the FPGA. Higher clock frequencies can be achieved by further optimizing the design, especially the5.3.2 Array Memory Hierarchy SU that is the system bottleneck in terms of the operatingObject detection applications exhibit a high degree of data frequency. However, for prototyping purposes we did notreuse, as a large amount of the currently processed window attempt to further improve the overall system frequency.will also be used for the next window. To reduce the Depending on the application I/O demands and inter-external I/O memory accesses we developed a suitable face, the structure can be chosen accordingly to provide thememory hierarchy. At the first level a memory block is necessary trade-off between performance and hardwareutilized to store an image region (the whole image if area. The memory units were initialized according to eachpossible), while at the second level a window buffer is used application’s training data; we briefly describe this alloca-to store the active window region. tion next as it impacts the performance of the prototype. If The window buffer structure unit is illustrated in Fig. 5a the number of support vectors necessary for each applica-and is comprised of a number of buffers equal to the tion (as derived by the training set) is larger than thenumber of rows in the array plus one. Each buffer has a columns of the array, the computation will have to besize of (window rows x window overlap) and can be either repeated (using time-division multiplexing) until all sup-active (feed the rows with data) or passive (incoming data port vectors are processed for each input vector.are stored). Only one buffer is passive at a time and is usedto store the region that will be processed next by the array. 5.4 Performance Results and DiscussionWhen the current window data have been processed by the Typically performance in object detection systems isarray the additional buffer will become active while one of measured in frames per second. The processing of a framethe active ones, the one which contains data that is not also includes processing all downscaled versions. The timerequired for processing in the following window, will needed for the proposed architecture to process a singlebecome passive. In this way, fetching of new data happens frame can be calculated using (5) from Section 4.2, and itconcurrently with the window processing. 
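A small functional model of the window-buffer rotation described above: rows + 1 strip buffers, of which one is passive (being filled with the next image strip) while the others feed the array; after each window the roles rotate so that prefetching overlaps with processing. It only captures the bookkeeping, not the BRAM organization, and the method names are illustrative.

```python
class WindowBufferSet:
    """rows + 1 strip buffers: `rows` active (feeding the array), one passive (filling)."""

    def __init__(self, rows: int):
        self.buffers = [[] for _ in range(rows + 1)]
        self.passive = rows  # index of the buffer currently being filled

    def fill_passive(self, strip) -> None:
        """Prefetch the next strip of pixels while the active buffers feed the array."""
        self.buffers[self.passive] = list(strip)

    def rotate(self, retired: int) -> None:
        """The freshly filled buffer becomes active; the retired buffer becomes passive."""
        assert retired != self.passive
        self.passive = retired  # its contents are no longer needed and will be overwritten

    def active(self):
        return [b for i, b in enumerate(self.buffers) if i != self.passive]
```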
depends on the total number of windows that must be processed for all scaled image versions, the number of5.3.3 SVM Array Processing Core Prototype windows that are processed in parallel, the input rate, andA 4-row by 80-column array was implemented based on the the operating frequency of the array. Also the structure ofproposed array architecture and synthesized targeting an the array (number of units and size) is important to theFPGA prototype. The structure was chosen to correspond to resulting performance. To evaluate the performance of thethe 32-bit input from the FSL bus, which accounts for 4 pixels architecture on the FPGA we considered images of 320 Âper cycle. A total of 320 VUs, 80 memory units and four SUs 240 pixels for the three benchmark applications, selectedwere implemented in the prototype. Each memory unit was from publicly available data sets from [29], [31], and [33].allocated a capacity of 4 KB, thus, a total of 320 KB of FPGA These images contain a varying number of objects inblock RAM was allocated for the training set. With the rest of various sizes. The number of generated windows per imagethe memory available on the FPGA we can store the whole frame (including the downscaled versions) for eachinput image on chip to reduce off-chip memory accesses. application was then computed (given in Table 1). The VUs were mapped on the FPGA custom logic as they The implemented array prototype can process fourdid not consume many FPGA resources. The SUs, on the windows in parallel; while the operating frequency of the
    • 840 IEEE TRANSACTIONS ON COMPUTERS, VOL. 61, NO. 6, JUNE 2012FPGA is 100 MHz, and the input rate to the array is four resources, manages to offer adequate real-time perfor-pixels every cycle (the cycles required for the array to read mance for all three benchmark applications.form the FSL bus). Using this information, the resulting Detection accuracy is also an important performanceframe rates achieved by the proposed architecture are 40, measure in the context of object detection. The accuracy46, and 122, for face detection, pedestrian detection, and car when dealing with SVMs is determined by the supportside view detection, respectively, which are sufficient for vectors and alpha coefficients that are derived duringreal time object detection. It must be noted, that the frame training, and their representation in hardware. We runrate of the proposed architecture depends only on the simulations in software to determine the appropriate bit-number of search windows, and architecture-specific details width representation so no accuracy loss is observed for the test(such as number of units). As such, computing the set of each application. The sufficient number of bits to satisfyperformance is not affected by variable parameters such all three benchmark applications was found to be 24 bits. Byas number of objects or their size, since all search windows using this bit-width representation the hardware implemen-will go through the whole classification procedure. The tation of the proposed architecture maintains the sameresulting frame rates suggest that the system is capable of accuracy rates as the equivalent software SVM models inprocessing larger input images. The performance of the MATLAB (Software accuracy shown in Table 1, hardwareproposed architecture depends linearly on the image size accuracy shown in Fig. 5c).(i.e., number of windows in the image). As such, whenthe input image size increases above a certain size it is 5.5 Comparison with Related Worksexpected that the frame rate will decrease below the Comparing architectures evaluated with different applica-adequate real-time performance levels (30 fps) since the tions other than video object detection is not practical, asamount of data that needs to be processed (number of factors that possibly affect performance (input image size,windows and downscaled image versions) increases as number of downscaled images) were not considered inwell. This is also true for most hardware implementations related works. Consequently, we attempt a comparisonfound in the literature. To handle higher resolution images based on the provided information. Related works that havewe can use hierarchical SVMs [39] to speed up the proposed the hardware implementation of the SVM feed-classification procedure of a single window, or even explore forward phase include [17], [18], [19], [20], and [21]. Froman ASIC implementation of the architecture that will enable these works, Biasi et al. [17], Pina-Ramirez et al. [18] andthe architecture to operate in higher frequencies. Further- Reyna et al. [19] propose architectures which depend on themore, with such high frame rates additional applications vector dimensionality (i.e., target a specific application) andcan be integrated to the system, such as object recognition as such would not be applicable for different object detectionand tracking. 
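The software bit-width search mentioned above can be reproduced along the following lines: quantize the trained parameters to w fractional bits, rerun the test windows, and keep the smallest w whose accuracy matches the floating-point model (the paper reports 24 bits as sufficient for all three applications). The plain fixed-point rounding used here is an assumption; the paper does not spell out its exact number format.

```python
import numpy as np

def quantize(values: np.ndarray, frac_bits: int) -> np.ndarray:
    """Round to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(values * scale) / scale

def accuracy_with_bitwidth(decide, svs, alpha_y, bias, X_test, y_test, w):
    """`decide(x, svs, alpha_y, bias)` is any feed-forward implementation returning +/-1."""
    q_svs, q_ay = quantize(svs, w), quantize(alpha_y, w)
    q_b = float(quantize(np.array([bias]), w)[0])
    preds = np.array([decide(x, q_svs, q_ay, q_b) for x in X_test])
    return float(np.mean(preds == y_test))

def smallest_lossless_bitwidth(decide, svs, alpha_y, bias, X_test, y_test, ref_acc):
    for w in range(8, 33):  # sweep candidate bit widths
        if accuracy_with_bitwidth(decide, svs, alpha_y, bias, X_test, y_test, w) >= ref_acc:
            return w
    return None
```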
We use a single array to evaluate the applications.performance of all three benchmark applications to illus-trate that a single structure can be used in a variety of In [20], matrix bar-code detection in images is performedapplications. As such the bit widths of the processing units using a search window size of 16 Â 16 pixels (256 vectorwere chosen to cover the most demanding application. components), and 88 support vectors. The implementedHowever, if only a single application was to be considered, vector coprocessor aims at parallelizing the processing of athe array could be optimized for that application in terms of single vector and as such it processes input vectorsbit width for each processing unit, thus permitting for more sequentially, and each requires 352 cycles to be processed.units to be implemented, leading to higher frame rates. By utilizing a similar configuration to the one used for the The car side view detection application has the highest benchmark applications (4 rows and 80 columns), ourframe rate, an order of magnitude higher than the other proposed architecture can process four input vectors intwo applications. This is primarily due to the fact that it 443 cycles.must process an order of magnitude less support vectors The vector coprocessor implementation in [21] primarilyand generated windows compared to the other two targets SVM training, and thus no clear results are given inapplications. This shows the impact of the search window terms of classification performance, which the authorssize with respect to the input image size, the primary measure in GMACS. The presented coprocessor consists ofreason for the small number of generated windows. On the 100 vector processing elements clocked at 115 MHz provid-other hand, the face detection application produces the ing a sustained performance of about 10 GMACS. Using alargest number of generated windows, when compared to similar configuration, with 100 VUs clocked at 100 MHz, thethe other two applications, as it has the smallest search array architecture we propose in this paper can also providewindow size, resulting in the lowest frame rate. The 10 GMACS. However, given that in our implementations wepedestrian detection frame rate is higher than that of face were able to use three times more VUs, the resulting computedetection since it has four times less generated windows to performance is much greater.process, even though it requires processing more support In both cases the architecture we propose in this papervectors. In addition the frame rate suffers for both the shows that due to its parallel systolic nature and modularpedestrian and face detection applications because the design, it can outperform existing works.number of columns is less than the number of supportvectors for each application. As a result the input vectors 5.6 Comparison with Viola-Jones Detectionmust pass through the array multiple times until they are Algorithmcomputed across all support vectors. Overall, for a variety The Viola-Jones algorithm [42], which is characterized by itsof window sizes and amount of training data, the targeted ability to discard nonpromising regions really fast, isarchitecture under the limitation of the given FPGA’s currently considered the state of the art in object detection.
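The hierarchical SVM approach [39] referred to above, and detailed in Section 5.6, can be expressed as a two-stage cascade: because a linear SVM collapses to a single weight vector w = sum_i a_i*y_i*s_i, the first stage costs one dot product per window, and only the surviving windows reach the second-degree polynomial SVM. The rejection threshold and function names are illustrative.

```python
import numpy as np

def linear_stage_score(z, w, b) -> float:
    """Stage 1: a linear SVM reduces to one dot product plus a bias per window."""
    return float(np.dot(w, z) + b)

def cascade_detect(windows, w_lin, b_lin, poly_decision, threshold=0.0):
    """Only windows passing the cheap linear stage reach the full polynomial SVM."""
    detections = []
    for idx, z in enumerate(windows):
        if linear_stage_score(z, w_lin, b_lin) < threshold:
            continue                    # discarded early, as in the hierarchical SVM [39]
        if poly_decision(z) > 0:        # stage 2: full feed-forward phase
            detections.append(idx)
    return detections
```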
5.5 Comparison with Related Works

Comparing against architectures evaluated with applications other than video object detection is not practical, as factors that possibly affect performance (input image size, number of downscaled images) were not considered in related works. Consequently, we attempt a comparison based on the provided information. Related works that have proposed hardware implementations of the SVM feed-forward phase include [17], [18], [19], [20], and [21]. Of these works, Biasi et al. [17], Pina-Ramirez et al. [18], and Reyna et al. [19] propose architectures which depend on the vector dimensionality (i.e., target a specific application) and as such would not be applicable to different object detection applications.

In [20], matrix bar-code detection in images is performed using a search window size of 16 × 16 pixels (256 vector components) and 88 support vectors. The implemented vector coprocessor aims at parallelizing the processing of a single vector and, as such, processes input vectors sequentially; each vector requires 352 cycles to be processed. By utilizing a configuration similar to the one used for the benchmark applications (4 rows and 80 columns), our proposed architecture can process four input vectors in 443 cycles.

The vector coprocessor implementation in [21] primarily targets SVM training, and thus no clear results are given in terms of classification performance, which the authors measure in GMACS. The presented coprocessor consists of 100 vector processing elements clocked at 115 MHz, providing a sustained performance of about 10 GMACS. Using a similar configuration, with 100 VUs clocked at 100 MHz, the array architecture we propose in this paper can also provide 10 GMACS. However, given that in our implementations we were able to use three times more VUs, the resulting compute performance is much greater.
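The two comparisons above come down to simple throughput arithmetic, reproduced in the sketch below. The cycle and unit counts are the ones quoted in the text; the one-multiply-accumulate-per-unit-per-cycle assumption used to turn unit counts into GMACS is ours.

    # Cycles per classified input vector.
    cycles_per_vector_ref = 352        # [20]: one 256-component vector every 352 cycles
    cycles_per_vector_arr = 443 / 4    # proposed 4 x 80 array: four vectors every 443 cycles
    print(cycles_per_vector_ref / cycles_per_vector_arr)   # roughly 3.2x fewer cycles per vector

    # Sustained multiply-accumulate rate, assuming one MAC per vector unit per cycle.
    def gmacs(units, clock_hz):
        return units * clock_hz / 1e9

    print(gmacs(100, 100e6))   # 10.0 GMACS for 100 VUs at 100 MHz, as in the comparison with [21]
    print(gmacs(300, 100e6))   # 30.0 GMACS when three times as many VUs fit on the device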
In both cases, the architecture we propose in this paper, owing to its parallel systolic nature and modular design, outperforms the existing works.

5.6 Comparison with the Viola-Jones Detection Algorithm

The Viola-Jones algorithm [42], which is characterized by its ability to discard nonpromising regions very quickly, is currently considered the state of the art in object detection. Hardware implementations of this algorithm have so far targeted only face detection, and so we focus on that application for comparison purposes. A relevant implementation is presented in [37], [38], which targets face detection on 320 × 240 images and achieves 64 frames per second. In comparison, the SVM approach suffers from the fact that many operations must take place to classify input windows. However, an approach similar to that of Viola-Jones can be used to discard nonpromising regions early in the context of SVMs. This approach, called the hierarchical SVM, was proposed in [39]. The idea is to arrange SVMs in stages, with the final stage being a nonlinear SVM while all others are linear. The advantage of linear SVMs is that each input window must only be processed against one vector. This can easily be integrated into our architecture, as it is optimized for vector processing. We used a single linear stage (SVM with kernel (2)) prior to the second-degree polynomial SVM already used. The sensitivity of the linear SVM is important, since face regions must not be discarded; thus the trained linear SVM allows on average about 10 percent (a logical ratio of faces versus nonfaces in an image) of the total number of windows to pass through to the next stage, and only those are processed by the polynomial SVM. Consequently, the resulting performance increases from 40 frames per second to approximately 70, slightly higher than that of the work in [38]. Finally, when considering multiclass classification problems, architectures for the Viola-Jones algorithm must process the training set for each class sequentially, and as such their frame rates drop significantly. On the other hand, the proposed SVM architecture does not face the same challenges, as it can process the different data sets in parallel. In future work, we intend to further optimize the proposed array architecture to process hierarchical SVMs more efficiently, as this approach has shown promising results.
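A minimal software sketch of this two-stage arrangement is given below. It captures only the control flow: the stage-one weight vector, the thresholds, and the data are randomly generated placeholders, and the polynomial stage reuses the same assumed kernel form as the earlier sketch rather than the trained models used in the evaluation.

    import numpy as np

    def linear_stage(x, w, b):
        # Stage 1: a linear SVM costs a single dot product per window.
        return float(x @ w + b)

    def polynomial_stage(x, svs, alphas, b, degree=2):
        # Stage 2: full second-degree polynomial SVM, run only on surviving windows.
        return float(alphas @ ((svs @ x + 1.0) ** degree) + b)

    def hierarchical_classify(windows, w, b1, svs, alphas, b2):
        labels = []
        for x in windows:
            if linear_stage(x, w, b1) < 0:   # early reject: no kernel evaluations spent
                labels.append(False)
            else:
                labels.append(polynomial_stage(x, svs, alphas, b2) > 0)
        return labels

    rng = np.random.default_rng(1)
    windows = rng.standard_normal((1000, 361))                 # placeholder search windows
    w, b1 = rng.standard_normal(361), -25.0                    # hypothetical stage-1 model
    svs, alphas, b2 = rng.standard_normal((100, 361)), rng.standard_normal(100), -0.5
    detections = hierarchical_classify(windows, w, b1, svs, alphas, b2)
    print(sum(detections), "windows accepted by both stages")

Since the linear stage spends one dot product per window while the polynomial stage spends one kernel evaluation per support vector, letting only about a tenth of the windows through removes most of the kernel workload; the measured gain (40 to roughly 70 fps) is smaller than the ideal because window generation and the linear stage itself still cost a fixed amount per frame.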
6 CONCLUSION

This paper presented an array processing engine for object detection with SVMs that can achieve real-time performance (40-122 fps for a variety of applications) while maintaining high detection accuracies (76-78 percent for a variety of applications). The architecture scales linearly with the hardware budget, taking full advantage of its modular and regular design, while providing true parallel processing for both input and support vectors. We have also addressed the demanding kernel implementation by sharing it among many vector processing units. Furthermore, the same array structure can be used for different applications regardless of the window size, number of support vectors, and image size. Additionally, the enhanced version of the proposed architecture can be configured to operate in a variety of modes and is able to adapt to different application demands (such as multiclass applications). As an immediate follow-up to this work, we aim to further optimize the array architecture for processing hierarchical SVMs. Furthermore, we also aim to further explore the scalability of the system in terms of various parameters, such as the input image size, by porting it to an ASIC. Overall, the proposed architecture is capable of real-time SVM-based object detection while providing a configurable detection platform that can operate in a variety of embedded object detection scenarios and adapt to specific application and designer demands.
REFERENCES

[1] C. Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[2] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[3] L. Hamel, Knowledge Discovery with Support Vector Machines. Wiley-Interscience, 2009.
[4] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[5] C. Papageorgiou and T. Poggio, "Trainable Pedestrian Detection System," Int'l J. Computer Vision, vol. 38, pp. 15-33, 2000.
[6] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian Detection Using Wavelet Templates," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 193-199, 1997.
[7] S. Agarwal and D. Roth, "Learning a Sparse Representation for Object Detection," Proc. Seventh European Conf. Computer Vision (ECCV '02), pp. 113-130, 2002.
[8] E. Osuna, R. Freund, and F. Girosi, "Training Support Vector Machines: An Application to Face Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 130-136, 1997.
[9] H. Sahbi, D. Geman, and N. Boujemaa, "Face Detection Using Coarse-to-Fine Support Vector Classifiers," Proc. Int'l Conf. Image Processing, pp. 925-928, 2002.
[10] C. Kyrkou and T. Theocharides, "SCoPE: Towards a Systolic Array for SVM Object Detection," IEEE Embedded Systems Letters, vol. 1, no. 2, pp. 46-49, Aug. 2009.
[11] S. Dey, M. Kedia, N. Agarwal, and A. Basu, "Embedded Support Vector Machine: Architectural Enhancements and Evaluation," Proc. 20th Int'l Conf. Very Large-Scale Integration (VLSI) Design, pp. 685-690, 2007.
[12] R. Pedersen and M. Schoeberl, "An Embedded Support Vector Machine," Proc. Fourth Workshop Intelligent Solutions in Embedded Systems, pp. 1-11, 2006.
[13] A. Boni, F. Pianegiani, and D. Petri, "Low-Power and Low-Cost Implementation of SVMs for Smart Sensors," IEEE Trans. Instrumentation and Measurement, vol. 56, no. 1, pp. 39-44, Feb. 2007.
[14] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast Support Vector Machine Training and Classification on Graphics Processors," Proc. 25th Int'l Conf. Machine Learning, pp. 104-111, 2008.
[15] R. Genov and G. Cauwenberghs, "Kerneltron: Support Vector 'Machine' in Silicon," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1426-1434, Sept. 2003.
[16] D. Anguita, A. Boni, and S. Ridella, "A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 993-1009, Sept. 2003.
[17] I. Biasi, A. Boni, and A. Zorat, "A Reconfigurable Parallel Architecture for SVM Classification," Proc. IEEE Int'l Joint Conf. Neural Networks, vol. 5, pp. 2867-2872, 2005.
[18] O. Pina-Ramirez, R. Valdes-Cristerna, and O. Yanez-Suarez, "An FPGA Implementation of Linear Kernel Support Vector Machines," Proc. IEEE Int'l Conf. Reconfigurable Computing and FPGAs, pp. 1-6, 2006.
[19] R.A. Reyna, D. Esteve, D. Houzet, and M.-F. Albenge, "Implementation of the SVM Neural Network Generalization Function for Image Processing," Proc. IEEE Fifth Int'l Workshop Computer Architectures for Machine Perception, pp. 147-151, 2000.
[20] R. Roberto, H. Dominique, D. Daniela, C. Florent, and O. Salim, "Object Recognition System-on-Chip Using the Support Vector Machines," EURASIP J. Advances in Signal Processing, vol. 2005, pp. 993-1004, 2005.
[21] H. Peter, G. Srihari, C. Durdanovic, V. Jakkula, M. Sankardadass, E. Cosatto, and S. Chakradhar, "A Massively Parallel Digital Learning Processor," Proc. 22nd Ann. Conf. Neural Information Processing Systems (NIPS), pp. 529-536, 2008.
[22] F.M. Khan, M.G. Arnold, and W.M. Pottenger, "Finite Precision Analysis of Support Vector Machine Classification in Logarithmic Number Systems," Proc. Euromicro Symp. Digital System Design, pp. 254-261, 2004.
[23] F. Khan, M. Arnold, and W. Pottenger, "Hardware-Based Support Vector Machine Classification in Logarithmic Number Systems," Proc. IEEE Int'l Symp. Circuits and Systems, pp. 51-54, May 2005.
[24] A. Boni and A. Zorat, "FPGA Implementation of Support Vector Machines with Pseudo-Logarithmic Number Representation," Proc. Int'l Joint Conf. Neural Networks, pp. 618-624, 2006.
[25] D. Anguita, A. Ghio, and S. Pischiutta, "A Learning Machine for Resource-Limited Adaptive Hardware," Proc. Second NASA/ESA Conf. Adaptive Hardware and Systems, pp. 571-576, Aug. 2007.
[26] D. Anguita, S. Pischiutta, S. Ridella, and D. Sterpi, "Feed-Forward Support Vector Machine without Multipliers," IEEE Trans. Neural Networks, vol. 17, no. 5, pp. 1328-1331, Sept. 2006.
[27] K. Irick, M. DeBole, V. Narayanan, and A. Gayasen, "A Hardware Efficient Support Vector Machine Architecture for FPGA," Proc. 16th Int'l Symp. Field-Programmable Custom Computing Machines, pp. 304-305, 2008.
[28] "CBCL Face Database #1," MIT Center for Biological and Computation Learning, http://cbcl.mit.edu/software-datasets/FaceData2.html, Jan. 2010.
[29] "CMU and MIT Face Database," http://vasc.ri.cmu.edu/idb/html/face/frontal_images/, Jan. 2010.
[30] D. Anguita, A. Ghio, S. Pischiutta, and S. Ridella, "A Hardware-Friendly Support Vector Machine for Embedded Automotive Applications," Proc. Int'l Joint Conf. Neural Networks, pp. 1360-1364, Aug. 2007.
[31] "UIUC Image Database for Car Detection," http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/, Jan. 2010.
[32] S. Munder and D.M. Gavrila, "An Experimental Study on Pedestrian Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863-1868, Nov. 2006.
[33] "USC Pedestrian Set C," B. Wu and R. Nevatia, "Cluster Boosted Tree Classifier for Multi-View, Multi-Pose Object Detection," Proc. IEEE Int'l Conf. Computer Vision (ICCV), 2007.
[34] MIT Center for Biological and Computation Learning, "CBCL Pedestrian Database #1," http://cbcl.mit.edu/projects/cbcl/softwaredatasets/PeopleData1Readme.html, Jan. 2010.
[35] "Microblaze Soft Processor," Xilinx, San Jose, CA, http://www.xilinx.com/tools/microblaze.htm, 2011.
[36] J. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, 1999.
[37] T. Theocharides, N. Vijaykrishnan, and M.J. Irwin, "A Parallel Architecture for Hardware Face Detection," Proc. IEEE CS Ann. Symp. Emerging Very Large-Scale Integration (VLSI) Technologies and Architectures, pp. 452-453, Mar. 2006.
[38] C. Kyrkou and T. Theocharides, "A Flexible Parallel Hardware Architecture for AdaBoost-Based Real-Time Object Detection," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 19, no. 6, pp. 1034-1047, June 2011.
[39] B. Heisele, T. Serre, S. Prentice, and T. Poggio, "Hierarchical Classification and Feature Reduction for Fast Face Detection with Support Vector Machines," Pattern Recognition, vol. 36, no. 9, pp. 2007-2017, 2003.
[40] B. Heisele, P. Ho, J. Wu, and T. Poggio, "Face Recognition: Comparing Component-Based and Global Approaches," Computer Vision and Image Understanding, vol. 91, nos. 1/2, pp. 6-21, 2003.
[41] MathWorks, MATLAB Online Documentation, SMO Algorithm, http://www.mathworks.com/help/toolbox/bioinfo/ref/svmtrain.html, 2011.
[42] P. Viola and M. Jones, "Real-Time Object Detection," Int'l J. Computer Vision, vol. 57, no. 2, pp. 137-154, May 2004.

Christos Kyrkou (S'09) received the BSc and MSc degrees in computer engineering from the University of Cyprus, Nicosia, in 2008 and 2010, respectively, where he is currently working toward the PhD degree in computer engineering. His research interests focus on digital hardware architectures for pattern recognition and machine vision algorithms. He is a student member of the IEEE.

Theocharis Theocharides received the PhD degree in computer science and engineering from Pennsylvania State University, University Park. He is currently a lecturer at the Department of Electrical and Computer Engineering at the University of Cyprus. His research focuses on the broad area of intelligent embedded systems design, with emphasis on the design of reliable and low-power embedded and application-specific processors, media processors, and real-time digital artificial intelligence applications. He is a senior member of the IEEE.