# CVGIP 2010 Part 3

CVGIP 2010: The 23rd IPPR Conference on Computer Vision, Graphics, and Image Processing


relationship between points on two planes:

    s c_t = H c_s ,    (1)

where s is a scalar factor and c_s and c_t are a pair of corresponding points in the source and target patches, respectively. If there are at least four correspondences, of which no three in either patch are collinear, we can estimate H through a least-squares approach. We regard c_s as points of the 3-D environment model and c_t as points of the 2-D image, and then calculate the matrix H to map points from the 3-D model to the images. In the reverse order, we can also map points from the images to the 3-D model.

B. Planar patch modeling

Precise camera calibration is not an easy job [13]. In the virtual projector methods [4], [7], the texture image will be misaligned with the model if the camera calibration or the 3-D model reconstruction has large error. Alternatively, we develop a method that approximates the 3-D environment model through multiple individual planar patches and then renders the image content of every patch to generate a synthesized and integrated view of the monitored scene. In this way we can easily construct a surveillance system with a 3-D view of the environment.

Mostly we can model the environment with two basic building components: horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually surrounded by doors and walls, which are modeled as the vertical planes. Both kinds of planes are further divided into several patches according to the geometry of the scenes (Figure 3). If the scene consists of simple structures, a few large patches can represent the scene well at low rendering cost. On the other hand, more and smaller patches are required to accurately render a complex environment, at the expense of more computation.

In the proposed system, the 3-D rendering platform is developed on OpenGL and each patch is divided into triangles before rendering. Since linear interpolation, which OpenGL uses to fill triangles with texture, is not suitable for perspective projection, distortion will appear in the rendering result. Although one can use many triangles to reduce this kind of distortion, as shown in Figure 4, doing so enlarges the computational burden and is therefore not feasible for real-time surveillance systems.

Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for rendering (left). More patches make the rendering much better (right).

To make a compromise between visualization accuracy and rendering cost, we propose a procedure that automatically divides each patch into smaller ones and decides suitable sizes of patches for accurate rendering (Figure 4). We use the following mean-squared error to estimate the amount of distortion when rendering image patches:

    MSE = (1 / (m × n)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} (I_ij − Ĩ_ij)² ,    (2)

where I_ij is the intensity of the point obtained from the homography transformation, Ĩ_ij is the intensity of the point obtained from texture mapping, i and j are the row and column coordinates in the image, respectively, and m × n is the dimension of the patch in the 2-D image. In order to have a reference scale to quantify the amount of distortion, a peak signal-to-noise ratio is calculated by

    PSNR = 10 log₁₀ ( MAX_I² / MSE ) ,    (3)

where MAX_I is the maximum pixel value of the image. Typical values for the PSNR are between 30 and 50 dB; an acceptable value is considered to be about 20 dB to 25 dB in this work. We set a threshold T to determine the quality of texture mapping by

    PSNR ≥ T .    (4)

If the PSNR of a patch is lower than T, the procedure divides it into smaller patches and repeats the process until the PSNR values of all patches are greater than the given threshold T.

III. ON-LINE MONITORING

The proposed system displays the videos on the 3-D model. However, 3-D foreground objects such as pedestrians are projected to the image frame and become 2-D objects. They will appear flattened on the floor or wall since the system displays them on planar patches. Furthermore, there might be ghosting effects when 3-D objects are in the overlapping areas of different camera views. We need to tackle this problem by separating and rendering 3-D foreground objects in addition to the background environment.
Fig. 6. A flowchart to illustrate the whole method. The purple part is pixel-based.

we propose another method to find T_θ(x, y, t) faster. The number of samples classified as shadow or background at time t by FSMS is A^{Tr}_{b,s}(x, y, t). We define a ratio R(T_r) = A^{Tr}_{b,s} / A_{b,s,f}, where A_{b,s,f} is the number of all samples at position (x, y), and b, s, f represent the background, shadow, and foreground, respectively. The threshold T_θ(x, y, t) can be updated to T′_θ(x, y, t) by R(T_r). The number of samples whose cos(θ(x, y)) values are larger than T′_θ(x, y, t) is equal to A_{b,s}, and it is required that

    R(T′_θ(x, y, t)) = R(T_r) .    (10)

Besides, we add a perturbation δT_θ to T′_θ(x, y, t). Since FSMS only finds a single threshold, if the initial threshold T_θ(x, y, 0) is set larger than the true threshold, the best updated threshold is never smaller than T_θ; therefore the true angle threshold would never be found over time. To solve this problem, a perturbation is subtracted from the updated threshold:

    T_θ(x, y, t) = T′_θ(x, y, t) − δT_θ .    (11)

Since the new threshold T_θ(x, y, t) has a smaller value and covers more samples, it can approach the true threshold over time. This perturbation also makes the method more adaptable to changes of the environment. The flowchart in Figure 6 illustrates the whole method.

E. Axis-aligned billboarding

In visualization, axis-aligned billboarding [14] constructs billboards in the 3-D model for moving objects, such as pedestrians, and a billboard always faces the viewpoint of the user. A billboard has three properties: location, height, and direction. By assuming that all the foreground objects move on the floor, the billboards can be aligned perpendicular to the floor in the 3-D model. The 3-D location of a billboard is estimated by mapping the bottom-middle point of the foreground bounding box in the 2-D image through the lookup tables. The ratio between the height of the bounding box and the 3-D model determines the height of the billboard in the 3-D model. The relationship between the direction of a billboard and the viewpoint is defined as shown in Figure 7.

Fig. 7. Orientation determination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle.

The following equations are used to calculate the rotation angle of the billboard:

    Y = n × v ,    (12)
    φ = cos⁻¹(v · n) ,    (13)

where v is the vector from the location of the billboard, L, to the location E projected vertically from the viewpoint to the floor, n is the normal vector of the billboard, Y is the rotation axis, and φ is the estimated rotation angle. After the rotation, the normal vector of the billboard is parallel to the vector v, so the billboard is always facing toward the viewpoint of the operator.

F. Video content integration

If the fields of view of the cameras overlap, objects in these overlapping areas are seen by multiple cameras. In this case, there might be ghosting effects when we simultaneously display videos from these cameras. To deal with this problem, we use the 3-D locations of moving objects to identify the correspondence of objects across different views. When the operator chooses a viewpoint, the rotation angles of the corresponding billboards are estimated by the method presented above, and the system only renders the billboard whose rotation angle is the smallest among all of the corresponding billboards, as shown in Figure 8.
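Equations (12) and (13) can be sketched in a few lines. This is an illustrative reading of the figure-7 geometry, not the paper's code; the function name and the normalization of the inputs are assumptions.

```python
import numpy as np

def billboard_rotation(L, E, n):
    """Rotation axis Y (Eq. 12) and angle phi (Eq. 13) for an
    axis-aligned billboard.
    L: billboard location, E: viewpoint projected onto the floor,
    n: current billboard normal."""
    v = np.asarray(E, dtype=float) - np.asarray(L, dtype=float)
    v /= np.linalg.norm(v)                     # unit vector from L to E
    n = np.asarray(n, dtype=float)
    n /= np.linalg.norm(n)
    Y = np.cross(n, v)                         # Eq. (12): rotation axis
    phi = np.arccos(np.clip(np.dot(v, n), -1.0, 1.0))  # Eq. (13)
    return Y, phi
```

For a billboard at the origin with normal (0, 0, 1) and the viewpoint projected to (1, 0, 0), the axis is (0, 1, 0) and the angle is 90°, i.e., the billboard rotates about the vertical axis to face the operator.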
Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause the ghosting effect (bottom-left). Static background images without foreground objects are used to fill the area of the foreground objects (top). Ghosting effects are removed, and the static background images can be updated by background modeling.

Fig. 9. Determination of viewpoint switch. We divide the floor area depending on the fields of view of the cameras and associate each area to one viewpoint close to a camera. The viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects.

G. Automatic change of viewpoint

The proposed surveillance system provides a target tracking feature by determining and automatically switching the viewpoints. Before rendering, several viewpoints are specified in advance to be close to the locations of the cameras. During the switch from one viewpoint to another, the parameters of the viewpoints are gradually changed from the starting point to the destination point for a smooth view transition. The switching criterion is defined as the number of blobs found in specific areas. First, we divide the floor area into several parts and associate them with each camera, as shown in Figure 9. When people move in the scene, the viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects. We also make the billboard transparent by setting the alpha value of its texture, so the foreground objects appear with fitting shapes, as shown in Figure 10.

IV. EXPERIMENT RESULTS

We developed the proposed surveillance system on a PC with an Intel Core Quad Q9550 processor, 2 GB RAM, and an nVidia GeForce 9800GT graphics card. Three IP cameras with 352 × 240 pixel resolution are connected to the PC through the Internet. The frame rate of the system is about 25 frames per second.

In the monitored area, automated doors and elevators are specified as background objects, albeit their images do change when the doors open or close. These areas will be modeled in background construction and not visualized by billboards; the system uses a ground mask to indicate the region of interest. Only the moving objects located in the indicated areas are considered moving foreground objects, as shown in Figure 11.

The experimental results shown in Figure 12 demonstrate that the viewpoint can be chosen arbitrarily in the system and operators can track targets with a closer view or any viewing direction by moving the virtual camera. Moreover, the moving objects always face the virtual camera by billboarding, and the operators can easily perceive the spatial information of the foreground objects from any viewpoint.

V. CONCLUSIONS

In this work we have developed an integrated video surveillance system that provides a single comprehensive view of the monitored areas to facilitate tracking moving targets through its interactive control and immersive visualization. We utilize planar patches for 3-D environment model construction. The scenes from cameras are divided into several patches according to their structures, and the numbers and sizes of patches are automatically determined to compromise between rendering effects and efficiency. To integrate video contents, homography transformations are estimated for the relationships between image regions of the video contents and the corresponding areas of the 3-D model. Moreover, the proposed method to remove moving cast shadows can automatically decide thresholds by on-line learning, so that manual setting can be avoided. Compared with frame-based work, our method increases the accuracy of shadow removal. In visualization, the foreground objects are segmented accurately and displayed on billboards.

REFERENCES

[1] R. Sizemore, "Internet protocol/networked video surveillance market: Equipment, technology and semiconductors," Tech. Rep., 2008.
[2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, "Contextualized videos: Combining videos with environment models to support situational understanding," IEEE Transactions on Visualization and Computer Graphics, 2007.
Fig. 10. Automatic switching of the viewpoint for tracking targets. People walk in the lobby and the viewpoint of the operator automatically switches to keep track of the targets.

Fig. 11. Dynamic background removal by ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top-right). A mask covering the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right).

Fig. 12. Immersive monitoring at an arbitrary viewpoint. We can zoom out the viewpoint to monitor the whole surveillance area or zoom in to focus on a particular place.

[3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, "Accurate planar image registration for an integrated video surveillance system," Computational Intelligence for Visual Intelligence, 2009.
[4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna, "Video flashlights: real time rendering of multiple videos for immersive model visualization," in 13th Eurographics Workshop on Rendering, 2002.
[5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, "Augmented virtual environments (AVE): dynamic fusion of imagery and 3-D models," IEEE Virtual Reality, 2003.
[6] S. You, J. Hu, U. Neumann, and P. Fox, "Urban site modeling from lidar," Lecture Notes in Computer Science, 2003.
[7] I. Sebe, J. Hu, S. You, and U. Neumann, "3-D video surveillance with augmented virtual environments," in International Multimedia Conference, 2003.
[8] T. Horprasert, D. Harwood, and L. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," IEEE ICCV, 1999.
[9] K. Chung, Y. Lin, and Y. Huang, "Efficient shadow detection of color aerial images based on successive thresholding scheme," IEEE Transactions on Geoscience and Remote Sensing, 2009.
[10] J. Kim and H. Kim, "Efficient region-based motion segmentation for a video monitoring system," Pattern Recognition Letters, 2003.
[11] E. J. Carmona, J. Martínez-Cantos, and J. Mira, "A new video segmentation method of moving objects based on blob-level knowledge," Pattern Recognition Letters, 2008.
[12] N. Martel-Brisson and A. Zaccarin, "Learning and removing cast shadows through a multidistribution approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master, "Calibrated, registered images of an extended urban area," International Journal of Computer Vision, 2003.
[14] A. Fernandes, "Billboarding tutorial," 2005.
    A = { p(x, y, z) | p nᵀ − vᵢ nᵀ = 0, i ∈ {1, 2, 3}, p ∈ Bin }
    Bin = { p(x, y, z) | f₍ᵢ,ⱼ₎(p) × f₍ᵢ,ⱼ₎(v) > 0 }
    f₍ᵢ,ⱼ₎(p) = r × a − b + s
    r = (bⱼ − bᵢ) / (aⱼ − aᵢ) ,  s = bᵢ − r × aᵢ
    i, j = 1, 2, 3 ;  a, b = x, y, z ;  i < j ;  a < b

The experiments use object files in the Wavefront file format (.obj) from the NTU 3D Model Database ver.1 of National Taiwan University. The process of transforming a triangle mesh into a point-based model is shown in Figure 1. Some areas of the point set are incomplete, as shown in the red rectangles of Figure 1. The planar dilation process is employed to refine these failed areas.

The planar dilation process uses 26-connected planars to refine the spots left in the area. The first half of Figure 2 shows the 26 positions of connected planars. The condition is that a planar and its 26 neighbor positions are object planars; the main purpose of estimating the object planar is to verify that this condition is true. The result in the second half of Figure 2 reveals the efficiency of the planar dilation process.

Figure 1. The process of transforming a triangle mesh into a point-based model.

Figure 2. Planar dilation process.

III. POINT-BASED MORPHING FOR MODEL CREATING

Greater flexibility in combining objects is one of the properties of point-based models. No matter the shape or category of the objects, the method of this study can put them into the morphing process to create new objects.

The morphing process includes three steps. Step one is to equalize the objects. Step two is to calculate each normal point of the objects in the morphing process. Step three is to estimate each point of the target object by using the same normal point of the two objects with the formula described below:

    o_t = p_{r1} o₁ + p_{r2} o₂ + ⋯ + (1 − Σ_{i=1}^{n−1} p_{ri}) o_n
    0 ≤ p_{r1}, p_{r2}, ⋯, p_{r(n−1)} ≤ 1 ,  Σ_{i=1}^{n} p_{ri} = 1

o_t represents each target object point of the morphing, and o_i is an object in the morphing process. p_{ri} denotes the object's effect weight in the morphing process, and i indicates the number of the object.

The new model appearance generated from morphing depends on which objects were chosen and on the value of each object's weight. The experiments in this research use two objects; therefore i = 1 or 2 and n = 2. The results are shown in Figure 3. The first row is the morphing of a simple flat board and a character. The second row shows the freedom of object selection in point-based modeling, because two totally different objects can be put into morphing and produce satisfactory results. The models created by morphing objects with different weights can be seen in Figure 4.

IV. POINT-BASED TEXTURE MAPPING

Texture mapping is very plain in this research method. It uses a texture matrix to map the 3-D model to the 2-D image pixels, using the concept of transforming a 2-D image into 3-D. Assume the 3-D space is divided into α × β blocks, where α is the number of rows and β is the number of columns, and the length, width, and height of the 3-D space is h × h × h; (X, Y) and (x, y, z) denote the image coordinates and the 3-D model coordinates, respectively. The texture of each block is assigned by a texture cube, and it
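The weighted point-blending formula above can be sketched as follows. This is a minimal illustration under the paper's assumption that the models have already been equalized (corresponding points matched); the function name is ours.

```python
import numpy as np

def morph(points_list, weights):
    """Blend corresponding points of several equalized point-based
    models: o_t = sum_i p_ri * o_i, with the weights summing to 1
    (the last weight is implied as 1 minus the others in the paper)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    pts = [np.asarray(p, dtype=float) for p in points_list]
    out = np.zeros_like(pts[0])
    for w, p in zip(weights, pts):
        out += w * p               # each object contributes by its weight
    return out
```

With two objects (n = 2, as in the experiments), sweeping the first weight from 0 to 1 interpolates between the two shapes.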
is made by a 2-D image, as shown in the middle image of the first row of Figure 5. The process can be expressed by the formula below:

    A tᵀ = cᵀ
    t = [ x mod (h/α),  y mod (h/β),  z mod (h/β) ] ,  c = [ X, Y ]
    A = ⎡ α      0       0 ⎤
        ⎣ 0  β(h − z)/y  0 ⎦

A denotes the texture transforming matrix, t denotes the current position in the 3-D model, and c denotes the image pixel content at the current position.

The experiment results are shown in the second rows of Figures 5 and 6. The results with α = β = 2 are shown in the second row of Figure 5. The images created with α = β = 4 are shown in the first row of Figure 6. The last row of images in Figure 6 indicates that the proposed texture mapping method can be applied to any point-based model.

V. CONCLUSION

In sum, this research focuses on point-based modeling applications implemented in C++ instead of convenient facilities or other computer graphics software. The methodologies developed for point-based models feature simple data structures and less complex computing. Moreover, the methods can be combined with two applications: morphing and texture mapping. The experiment results have confirmed the scalability and flexibility of the proposed methodologies.

REFERENCES

[1] M. Pauly, "Point-Based Multiscale Surface Representation," ACM Transactions on Graphics, Vol. 25, No. 2, pp. 177–193, April 2006.
[2] M. Müller, R. Keiser, A. Nealen, M. Pauly, M. Gross, and M. Alexa, "Point Based Animation of Elastic, Plastic and Melting Objects," Eurographics/ACM SIGGRAPH Symposium on Computer Animation, pp. 141–151, 2004.
[3] T. Athanasiadis, I. Fudos, and C. Nikou, "Feature-based 3D Morphing based on Geometrically Constrained Sphere Mapping Optimization," SAC '10, Sierre, Switzerland, pp. 1258–1265, March 22–26, 2010.
[4] Y. Zhao, H.-Y. Ong, T.-S. Tan, and Y. Xiao, "Interactive Control of Component-based Morphing," Eurographics/SIGGRAPH Symposium on Computer Animation, pp. 340–385, 2003.
[5] K. Kaneko, Y. Okada, and K. Niijima, "3D Model Generation by Morphing," IEEE Computer Graphics, Imaging and Visualisation, 2006.
[6] B. Springborn, P. Schröder, and U. Pinkall, "Conformal Equivalence of Triangle Meshes," ACM Transactions on Graphics, Vol. 27, No. 3, Article 77, August 2008.
[7] N. A. Carr and J. C. Hart, "Meshed Atlases for Real-Time Procedural Solid Texturing," ACM Transactions on Graphics, Vol. 21, No. 2, pp. 106–131, April 2002.

Figure 3. The results of point-based modeling using different objects morphing.
Figure 4. The models created by objects morphing with different weights.

Figure 5. The process of 3-D model texturing with a 2-D image, shown in the first row, and the results, shown in the second row.
Figure 6. The results of point-based texture mapping with α = β = 4 and different objects.
LAYERED LAYOUTS OF DIRECTED GRAPHS USING A GENETIC ALGORITHM

Chun-Cheng Lin¹,∗, Yi-Ting Lin², Hsu-Chun Yen²,†, Chia-Chen Yu³
¹ Dept. of Computer Science, Taipei Municipal University of Education, Taipei, Taiwan 100, ROC
² Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan 106, ROC
³ Emerging Smart Technology Institute, Institute for Information Industry, Taipei, Taiwan, ROC

ABSTRACT

By layered layouts of graphs (in which nodes are distributed over several layers and all edges are directed downward as much as possible), users can easily understand the hierarchical relation of directed graphs. The well-known method for generating layered layouts proposed by Sugiyama includes four steps, each of which is associated with an NP-hard optimization problem. The four optimization problems are not independent, in the sense that the respective aesthetic criteria may contradict each other. That is, it is impossible to obtain an optimal solution satisfying all aesthetic criteria at the same time. Hence, the choice for each criterion becomes a very important problem. In this paper, we propose a genetic algorithm to model the first three steps of Sugiyama's algorithm, in the hope of simultaneously considering the first three aesthetic criteria. Our experimental results show that the proposed algorithm can make layered layouts that satisfy humans' aesthetic viewpoint.

Keywords: Visualization, genetic algorithm, graph drawing.

1. INTRODUCTION

Drawings of directed graphs have many applications in our daily lives, including manuals, flowcharts, maps, posters, schedules, UML diagrams, etc. It is important that a graph be drawn clearly, such that users can understand and get information from the graph easily. This paper focuses on layered layouts of directed graphs, in which nodes are distributed on several layers and in general edges should point downward, as shown in Figure 1(b). With this layout, users can easily trace each edge from top to bottom and understand the priority or order information of the nodes clearly.

Figure 1: The layered layout of a directed graph.

Specifically, we use the following criteria to estimate the quality of a directed graph layout: to minimize the total length of all edges; to minimize the number of edge crossings; to minimize the number of edges pointing upward; and to draw edges as straight as possible. Sugiyama [9] proposed a classical algorithm for producing layered layouts of directed graphs, consisting of four steps — cycle removal, layer assignment, crossing reduction, and assignment of horizontal coordinates — each of which addresses the problem of achieving one of the above criteria, respectively. Unfortunately, the first three problems have been proven to be NP-hard when the width of the layout is restricted. There has been

∗ Research supported in part by National Science Council under grant NSC 98-2218-E-151-004-MY3
† Research supported in part by National Science Council under grant NSC 97-2221-E-002-094-MY3
a great deal of work with respect to each step of Sugiyama's algorithm in the literature.

Drawing layered layouts by four independent steps can be executed efficiently, but it may not always yield nice layouts, because preceding steps may restrain the results of subsequent steps. For example, four nodes assigned to two levels after the layer assignment step lead to an edge crossing in Figure 2(a), so that the edge crossing cannot be removed during the subsequent crossing reduction step, which only moves each node's relative position on each layer; but in fact the edge crossing can be removed, as drawn in Figure 2(b). Namely, the crossing reduction step is restricted by the layer assignment step. Such a negative effect exists not exclusively for these two particular steps but for every other preceding/subsequent step pair.

Figure 2: Different layouts of the same graph.

Even if one could obtain the optimal solution for each step, those "optimal solutions" may not constitute the real optimal solution, because the locally optimal solutions are restricted by their respective preceding steps. Since we cannot obtain an optimal solution satisfying all criteria at the same time, we have to make a choice in a trade-off among all criteria.

For the above reasons, the basic idea of our method for drawing layered layouts is to combine the first three steps together to avoid the restrictions due to criterion trade-offs. We then use a genetic algorithm to implement this idea. In the literature, there has been some work on producing layered layouts of directed graphs using genetic algorithms, e.g., using a genetic algorithm to reduce edge crossings in bipartite graphs [7] or entire acyclic layered layouts [6], modifying nodes in a subgraph of the original graph on a layered graph layout [2], drawing common layouts of directed or undirected graphs [3] [11], and drawing layered layouts of acyclic directed graphs [10].

Note that the algorithm for drawing layered layouts of acyclic directed graphs in [10] also combined three steps of Sugiyama's algorithm, but drawing layered layouts of acyclic and cyclic directed graphs is quite different. For acyclic graphs, one does not need to solve the cycle removal problem. If the algorithm does not restrict the layers to a fixed width, one also does not need to solve the limited layer assignment problem. Note that unlimited-width layer assignment is not an NP-hard problem, because the layers of nodes can be assigned by a topological ordering. The algorithm in [10] only focuses on minimizing the number of edge crossings and making the edges as straight as possible. Although it also combined three steps of Sugiyama's algorithm, it contained only one NP-hard problem. In contrast, our algorithm combines three NP-hard problems: cycle removal, limited-width layer assignment, and crossing reduction.

In addition, our algorithm has the following advantages. More customized restrictions on layered layouts can be added in our algorithm; for example, some nodes should be placed to the left of some other nodes, the maximal layer number should be less than or equal to a certain number, etc. Moreover, the weighting ratio of each optimization criterion can be adjusted for different applications. According to our experimental results, our genetic algorithm can effectively adjust the ratio between the number of edge crossings and the total edge length. That is, our algorithm can make layered layouts more appealing to humans' aesthetic viewpoint.

2. PRELIMINARIES

The frameworks of three different algorithms for layered layouts of directed graphs (i.e., Sugiyama's algorithm, the cyclic leveling algorithm, and our algorithm) are illustrated in Figure 3(a)–3(c), respectively. Sugiyama's algorithm consists of four steps, as mentioned previously; the other two algorithms are based on Sugiyama's algorithm, in which the cyclic leveling algorithm combines the first two steps, while our genetic algorithm combines the first three steps. Furthermore, a barycenter algorithm is applied to the crossing reduction step of the cyclic leveling and our genetic algorithms, and the priority layout method is applied to the x-coordinate assignment step.
Figure 3: Comparison among different algorithms. (a) Sugiyama: Cycle Removal → Layer Assignment → Crossing Reduction → x-Coordinate Assignment. (b) Cyclic Leveling: Cyclic Leveling → Crossing Reduction (Barycenter Algorithm) → x-Coordinate Assignment. (c) Ours: Genetic Algorithm → x-Coordinate Assignment (Priority Layout Method).

Figure 4: Two kinds of crossings. (a) An edge crossing. (b) An edge-node crossing.

2.1. Basic Definitions

A directed graph is denoted by G(V, E), where V is the set of nodes and E is the set of edges. An edge e is denoted by e = (v1, v2) ∈ E, where v1, v2 ∈ V; edge e is directed from v1 to v2. A so-called layered layout is defined by the following conditions: (1) Let the number of layers in the layout be denoted by n, where n ∈ N and n ≥ 2; the n-layer layout is denoted by G(V, E, n). (2) V is partitioned into n subsets: V = V1 ∪ V2 ∪ V3 ∪ ··· ∪ Vn, where Vi ∩ Vj = ∅ for all i ≠ j; the nodes in Vk are assigned to layer k, 1 ≤ k ≤ n. (3) A sequence ordering σi of Vi is given for each i (σi = v1 v2 v3 ··· v|Vi| with x(v1) < x(v2) < ··· < x(v|Vi|)). The n-layer layout is then denoted by G(V, E, n, σ), where σ = (σ1, σ2, ···, σn) with y(σ1) < y(σ2) < ··· < y(σn).

An n-layer layout is called "proper" when it further satisfies the following condition: E is partitioned into n − 1 subsets E = E1 ∪ E2 ∪ E3 ∪ ··· ∪ En−1, where Ei ∩ Ej = ∅ for all i ≠ j, and Ek ⊂ Vk × Vk+1, 1 ≤ k ≤ n − 1.

An edge crossing (assuming that the layout is proper) is defined as follows. Consider two edges e1 = (v11, v12) and e2 = (v21, v22) in Ei, in which v11 and v21 are the j1-th and j2-th nodes in σi, respectively, and v12 and v22 are the k1-th and k2-th nodes in σi+1, respectively. If either j1 < j2 and k1 > k2, or j1 > j2 and k1 < k2, there is an edge crossing between e1 and e2 (see Figure 4(a)).

An edge-node crossing is defined as follows. Consider an edge e = (v1, v2), where v1, v2 ∈ Vi; v1 and v2 are the j-th and k-th nodes in σi, respectively. W.l.o.g., assuming that j > k, there are (j − k − 1) edge-node crossings (see Figure 4(b)).

2.2. Sugiyama's Algorithm

Sugiyama's algorithm [9] consists of four steps. (1) Cycle removal: if the input directed graph is cyclic, we reverse as few edges as possible such that the input graph becomes acyclic. This problem can be stated as the maximum acyclic subgraph problem, which is NP-hard. (2) Layer assignment: each node is assigned to a layer so that the total vertical length of all edges is minimized. If an edge spans at least two layers, dummy nodes are introduced on each crossed layer. If the maximum width is bounded by a value greater than or equal to three, the problem of finding a layered layout with minimum height is NP-complete. (3) Crossing reduction: the relative positions of the nodes on each layer are reordered to reduce edge crossings. Even when restricted to bipartite (two-layer) graphs, this problem is NP-hard. (4) x-coordinate assignment: the x-coordinates of nodes and dummy nodes are modified such that all edges of the original graph structure are drawn as straight as possible. This step involves two objective functions: to make all edges as close to vertical lines as possible, and to make all edge paths as straight as possible.

2.3. Cyclic Leveling Algorithm

The cyclic leveling algorithm (CLA) [1] combines the first two steps of Sugiyama's algorithm, i.e., it simultaneously minimizes the number of edges pointing upward and the total vertical length of all edges. It introduces a number called span that captures both quantities at the same time.

The span is defined as follows. Consider a directed graph G = (V, E). Given k ∈ N, define a layer assignment function ϕ : V → {1, 2, ···, k}. Let span(u, v) = ϕ(v) − ϕ(u) if ϕ(u) < ϕ(v), and span(u, v) = ϕ(v) − ϕ(u) + k otherwise. For each edge e = (u, v) ∈ E, denote span(e) = span(u, v) and span(G) = Σ_{e∈E} span(e). In brief, span(G) is the sum of the vertical lengths of all edges plus a penalty for edges pointing upward or horizontally, provided the maximum height k of the layout is given.

The main idea of the CLA is that if a node causes a high increase in span, then the layer position of
the node is determined later. In the algorithm, a distance function is defined to decide which nodes should be assigned first. There are four such functions, but only one can be chosen and applied to all the nodes: (1) Minimum Increase in Span = min_{ϕ(v)∈{1,···,k}} span(E(v, V′)); (2) Minimum Average Increase in Span (MST MIN AVG) = min_{ϕ(v)∈{1,···,k}} span(E(v, V′))/|E(v, V′)|; (3) Maximum Increase in Span = 1/δ_MIN(v); (4) Maximum Average Increase in Span = 1/δ_MIN_AVG(v). From the experimental results in [1], using "MST MIN AVG" as the distance function yields the best result. Therefore, our algorithm will be compared with the CLA using MST MIN AVG in the experimental section.

2.4. Barycenter Algorithm

The barycenter algorithm is a heuristic for solving the edge crossing problem between two layers. The main idea is to order the nodes on each layer by their barycentric ordering. Assuming that node u is located on layer i (u ∈ Vi), the barycentric value of node u is defined as bary(u) = (1/|N(u)|) Σ_{v∈N(u)} π(v), where N(u) is the set consisting of u's connected nodes on the layer below or above u (Vi−1 or Vi+1), and π(v) is the order of v in σi−1 or σi+1. The algorithm reorders the relative positions of all nodes by barycentric values, in the order layer 2 to layer n and then layer n − 1 to layer 1.

2.5. Priority Layout Method

The priority layout method solves the x-coordinate assignment problem. Its idea is similar to that of the barycenter algorithm: it assigns the x-coordinate position of each node, layer by layer, according to the priority value of each node.

At first, the x-coordinate positions of the nodes in each layer are given by x_k = x0 + k, where x0 is a given integer and x_k is the x-coordinate of the k-th element of σi. Next, the nodes' x-coordinate positions are adjusted in the order layer 2 to layer n, layer n − 1 to layer 1, and layer n/2 to layer n. The adjustments of node positions from layer 2 to layer n are called down procedures, while those from layer n − 1 to layer 1 are called up procedures.

Based on the above, the priority value of the k-th node v on layer p is defined as follows: if node v is a dummy node, then priority(v) = B − |k − m/2|, in which B is a big given number and m is the number of nodes on the layer; otherwise, for down procedures (resp., up procedures), priority(v) is the number of nodes connected to v on layer p − 1 (resp., p + 1).

Moreover, the x-coordinate position of each node v is defined as the average x-coordinate position of the nodes connected to v on layer k − 1 (resp., k + 1) for down procedures (resp., up procedures).

2.6. Genetic Algorithm

The genetic algorithm (GA) [5] is a stochastic global search method that has proved successful for many kinds of optimization problems. GA is categorized as a global search heuristic. It works with a population of candidate solutions and tries to optimize the answer by using three basic principles: selection, crossover, and mutation. For more details on GA, readers are referred to [5].

3. OUR METHOD

The major issue in drawing layered layouts of directed graphs is that, in the first three steps of Sugiyama's algorithm, the result of the preceding step may restrict that of the subsequent step. To address this, we design a GA that combines the first three steps of Sugiyama's algorithm. Figure 5 shows the flow chart of our GA; that is, our method consists of a GA and an x-coordinate assignment step. Note that the barycenter algorithm and the priority layout method are also used in our method: the former is used in our GA to reduce edge crossings, while the latter is applied in the x-coordinate assignment step.

Figure 5: The flow chart of our genetic algorithm. (Initialization → Assign dummy nodes → Barycenter → Selection → Crossover → Mutation → Fine tune → Remove dummy nodes; on termination, draw the best chromosome.)
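To make the quantities used throughout Sections 2–3 concrete, the following sketch computes the span of a leveling (Section 2.3), the crossing count between two adjacent ordered layers (Section 2.1), and one barycenter reordering pass (Section 2.4). This is our illustrative reconstruction, not the authors' code; all function names are ours.

```python
def span(edges, phi, k):
    """span(G) for a leveling phi with maximum height k: downward edges
    count their vertical length; upward or horizontal edges are
    penalised by wrapping around the k layers."""
    total = 0
    for u, v in edges:
        d = phi[v] - phi[u]
        total += d if phi[u] < phi[v] else d + k
    return total

def crossings(upper, lower, edges):
    """Count edge crossings between two adjacent layers, given the
    left-to-right node orders `upper` and `lower`."""
    pos_u = {v: i for i, v in enumerate(upper)}
    pos_l = {v: i for i, v in enumerate(lower)}
    ends = [(pos_u[u], pos_l[v]) for u, v in edges]
    count = 0
    for a in range(len(ends)):
        for b in range(a + 1, len(ends)):
            (j1, k1), (j2, k2) = ends[a], ends[b]
            if (j1 < j2 and k1 > k2) or (j1 > j2 and k1 < k2):
                count += 1
    return count

def barycenter_pass(layer, other_layer, neighbours):
    """Reorder `layer` by the average position (barycentric value) of
    each node's neighbours in the adjacent layer."""
    pi = {v: i for i, v in enumerate(other_layer)}
    def bary(u):
        ns = [pi[v] for v in neighbours[u] if v in pi]
        return sum(ns) / len(ns) if ns else 0.0
    return sorted(layer, key=bary)
```

For instance, a 3-cycle levelled onto layers 1, 2, 3 with k = 3 has span 3: two downward edges of length 1 plus one upward edge penalised to length 1.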
3.1. Definitions

For arranging nodes on layers, once the relative horizontal positions of the nodes are determined, their exact x-coordinate positions are also determined according to the priority layout method. Hence, in the following we only consider the relative horizontal positions of nodes, and each node is arranged on a grid. We use a GA to model the layered layout problem, so we define some basic elements:

Population: A population (generation) includes many chromosomes; the number of chromosomes depends on the initial population size setting.

Chromosome: One chromosome represents one graph layout, in which the absolute position of each (dummy) node on the grid is recorded. Since the adjacencies of nodes and the directions of edges are not altered by our GA, we do not need to record this information on the chromosomes. On the grid, one row represents one layer; a column represents the order of nodes on the same layer, and the nodes on the same layer are always placed successively. The best-chromosome window reserves the best several chromosomes over all antecedent generations; the best-chromosome window size ratio is the ratio of the best-chromosome window size to the population size.

Fitness Function: The 'fitness' value in our definition is, by abuse of terminology, the penalty for the bad quality of a chromosome; that is, a larger 'fitness' value implies a worse chromosome. Hence, our GA aims to find the chromosome with minimal 'fitness' value. Some aesthetic criteria to determine the quality of chromosomes (layouts), taken from [8] and [9], are combined as fitness value = Σ_{i=1}^{7} Ci × Fi, where the Ci, 1 ≤ i ≤ 7, are constants; F1 is the total edge vertical length; F2 is the number of edges pointing upward; F3 is the number of edges pointing horizontally; F4 is the number of edge crossings; F5 is the number of edge-node crossings; F6 is the degree to which the layout height exceeds the limited height; and F7 is the degree to which the layout width exceeds the limited width.

In order to experimentally compare our GA with the CLA in [1], the fitness function of our GA is tailored to match the CLA as follows: fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7, where we adjust the weight of the edge crossing number in our experiments to represent the major issue we want to discuss.

4. MAIN COMPONENTS OF OUR GA

Initialization: For each chromosome, we randomly assign the nodes to a ⌈√|V|⌉ × ⌈√|V|⌉ grid.

Selection: To evaluate the fitness value of each chromosome, we have to compute the number of edge crossings, which cannot yet be computed at this step because the routing of each edge is not determined. Hence, dummy nodes are introduced to determine the routing of the edges. Ideally, these dummy nodes would be placed at the relative positions giving the optimal edge crossings between two adjacent layers; nevertheless, permuting the nodes on each layer for the fewest edge crossings is an NP-hard problem [4]. Hence, the barycenter algorithm (which is also used by the CLA) is applied to reduce edge crossings on each chromosome before selection. Next, the selection step is implemented by truncation selection, which duplicates the best (selection rate × population size) chromosomes (1/selection rate) times to fill the entire population. In addition, we use a best-chromosome window to reserve some of the best chromosomes of the previous generations, as shown in Figure 6.

Figure 6: The selection process of our GA.

Crossover: The main steps of our crossover process are as follows. (1) Two ordered parent chromosomes are called the 1st and 2nd parent chromosomes. W.l.o.g., we only describe how to generate the first child chromosome from the two parents; the other child is generated similarly. (2) Remove all dummy nodes from the two parent chromosomes. (3) Choose half of the nodes from each layer of the 1st parent chromosome and place them on the same relative layers of the child chromosome in the same horizontal ordering. (4) The relative positions of the remaining nodes all depend on the 2nd parent chromosome. Specifically, we repeatedly choose a node adjacent to the smallest number of unplaced nodes until all nodes are placed; if there are several candidate nodes, we randomly choose one. The layer of the chosen node is equal to its base layer plus its relative layer, where the base layer is the average of the layers of its placed connected nodes in the child chromosome and the relative layer is the relative layer position with respect to its placed connected nodes in the 2nd parent chromosome. (5) The layers of the new child chromosome are shifted so that the layers start from layer 1.

Mutation: In the mutated chromosome, a node is chosen randomly, and the position of the chosen node is then altered randomly.

Termination: If the difference of the average fitness values between successive generations over the latest ten generations is at most 1% of the average fitness value of these ten generations, our GA stops. Then the best chromosome of the latest population is chosen, and its corresponding graph layout (including dummy nodes at barycenter positions) is drawn.

Fine Tune: Before the selection step or after the termination step, we may tune better chromosomes according to the fitness function. For example, we remove all layers that contain only dummy nodes but no normal nodes, called dummy layers. Such a process does not necessarily worsen the edge crossing number, but it improves the span number. In addition, unnecessary dummy nodes on each edge can also be removed after the termination step, where a so-called unnecessary dummy node is a dummy node that can be removed without causing new edge crossings or worsening the fitness value.

5. EXPERIMENTAL RESULTS

To evaluate the performance of our algorithm, it is experimentally compared with the CLA (combining the first two steps of Sugiyama's algorithm) using MST MIN AVG as the distance function [1], as mentioned in the previous sections. For convenience, the CLA using the MST MIN AVG distance function is called the L_M algorithm (Leveling with MST MIN AVG). The L_M algorithm (for steps 1 + 2) together with the barycenter algorithm (for step 3) can replace the first three steps of Sugiyama's algorithm. In order to be comparable with our GA (for steps 1 + 2 + 3), we consider the algorithm combining the L_M algorithm and the barycenter algorithm, called the LM_B algorithm throughout the rest of this paper.

Note that the x-coordinate assignment problem (step 4) is solved by the priority layout method in our experiments. In fact, this step affects neither the span number nor the edge crossing number. In addition, the second step of Sugiyama's algorithm (layer assignment) is an NP-hard problem when the width of the layered layout is restricted. Hence, we investigate the cases of limited and unlimited layout width separately.

5.1. Experimental Environment

All experiments run on a 2.0 GHz dual-core laptop with 2 GB memory under the Java 6.0 platform from Sun Microsystems, Inc. The parameters of our GA are given as follows: population size: 100; max generation: 100; selection rate: 0.7; best-chromosome window size ratio: 0.2; mutation probability: 0.2; C6: 500; C7: 500; fitness value = span + weight × edge crossing + C6 × F6 + C7 × F7.

5.2. Unlimited Layout Width

Because it is necessary to limit the layout width and height for the L_M algorithm, we set both limits to 30. This implies that there are at most 30 nodes (dummy nodes excluded) on each layer and at most 30 layers in each layout. If we let the maximal node number be 30 in our experiment, then the range for node distribution is equivalently unlimited. In our experiments, we consider a graph with 30 nodes under three different densities (2%, 5%, 10%), in which the density is the ratio of the edge number to the number of all possible edges, i.e., density = edge number/(|V|(|V| − 1)/2). Let the weight ratio of edge crossing to span be denoted by α. In our experiments, we consider five different α values: 1, 3, 5, 7, 9. The statistics for the experimental results are given in Table 1.

Consider an example of a 30-node graph with 5% density. The layered layouts by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 7, Figure 8(a), and Figure 8(b), respectively. Our algorithm clearly performs better than LM_B.

5.3. Limited Layout Width

The input graph used in this subsection is the same as in the previous subsection (i.e., a 30-node graph). The limited width is set to be 5, which is smaller
than the square root of the node number (30), because we hope the results under limited and unlimited conditions show an obvious difference. The statistics for the experimental results under the same settings as in the previous subsection are given in Table 2.

Table 1: Results after redrawing random graphs with 30 nodes and unlimited layout width.

  method   measure             density=2%   density=5%   density=10%
  LM_B     span                   30.00       226.70        798.64
           crossing                4.45        57.90        367.00
           running time           61.2 ms     151.4 ms      376.8 ms
  our GA   span (α = 1)           30.27       253.93        977.56
           crossing (α = 1)        0.65        38.96        301.75
           span (α = 3)           31.05       277.65       1338.84
           crossing (α = 3)        0.67        32.00        272.80
           span (α = 5)           30.78       305.62       1280.51
           crossing (α = 5)        0.67        29.89        218.45
           span (α = 7)           32.24       329.82       1359.46
           crossing (α = 7)        0.75        26.18        202.53
           span (α = 9)           31.65       351.36       1444.27
           crossing (α = 9)        0.53        24.89        200.62
           running time            3.73 s      17.32 s      108.04 s

Table 2: Results after redrawing random graphs with 30 nodes and limited layout width 5.

  method   measure             density=2%   density=5%   density=10%
  LM_B     span                   28.82       271.55        808.36
           crossing                5.64        59.09        383.82
           running time           73.0 ms     147.6 ms      456.2 ms
  our GA   span (α = 1)           32.29       271.45       1019.56
           crossing (α = 1)        0.96        39.36        292.69
           span (α = 3)           31.76       294.09       1153.60
           crossing (α = 3)        0.80        33.16        232.76
           span (α = 5)           31.82       322.69       1282.24
           crossing (α = 5)        0.82        30.62        202.31
           span (α = 7)           32.20       351.00       1369.73
           crossing (α = 7)        0.69        27.16        198.20
           span (α = 9)           33.55       380.20       1420.31
           crossing (α = 9)        0.89        24.95        189.25
           running time            3.731 s     3.71 s       18.07 s

Figure 7: Layered layout by LM_B (span: 262, crossing: 38).

Figure 8: Layered layouts by our GA. (a) α = 1 (span: 188, crossing: 30); (b) α = 9 (span: 238, crossing: 14).

Consider an example of a 30-node graph with 5% density. The layered layouts for this graph by the LM_B algorithm and by our algorithm under α = 1 and α = 9 are shown in Figure 9, Figure 10(a), and Figure 10(b), respectively. Again, our algorithm clearly performs better than LM_B.

5.4. Discussion

Due to the page limitation, only the case of 30-node graphs is included in this paper; in fact, we conducted many experiments on various graphs. Those tables and figures show that under all conditions (node number, edge density, and limited width or not) the crossing number produced by our GA is smaller than that of LM_B. However, the span number produced by our GA is not necessarily larger than that of LM_B: when the layout width is limited and the node number is sufficiently small (about 20 from our experimental evaluation), our GA may simultaneously produce both a smaller span and a smaller edge crossing number than LM_B.

Moreover, we observed that under all conditions the edge crossing number gets smaller and the span number gets larger as the weight of edge crossing increases. This implies that we can effectively adjust the trade-off between edge crossings and span; that is, we can reduce the edge crossings at the cost of increasing the span number.

Under the limited width condition, because the results of L_M are restricted, its span number should be larger than that under the unlimited condition. However, there are some unusual situations in our GA: although the results of our GA are also restricted under the limited width condition, its span number is smaller than that under the unlimited width condition. Our explanation is that the limited width condition may reduce the possible dimension. In
this problem, the dimension represents the positions at which nodes can be placed; furthermore, if the dimension is smaller, our GA can converge to a better result more easily.

Figure 9: Layered layout by the LM_B algorithm (span: 288, crossing: 29) with limited layout width = 5.

Figure 10: Layered layouts by our GA with limited layout width = 5. (a) α = 1 (span: 252, crossing: 29); (b) α = 9 (span: 295, crossing: 14).

6. CONCLUSIONS

This paper has proposed an approach for producing layered layouts of directed graphs, which uses a GA to simultaneously consider the first three steps of the classical Sugiyama algorithm (consisting of four steps) and applies the priority layout method for the fourth step. Our experimental results revealed that our GA can efficiently adjust the weighting ratios among all aesthetic criteria.

ACKNOWLEDGEMENT

This study is conducted under the "Next Generation Telematics System and Innovative Applications/Services Technologies Project" of the Institute for Information Industry, which is subsidized by the Ministry of Economic Affairs of the Republic of China.

REFERENCES

[1] C. Bachmaier, F. Brandenburg, W. Brunner, and G. Lovász. Cyclic leveling of directed graphs. In Proc. of GD 2008, volume 5417 of LNCS, pages 348–359, 2008.

[2] H. do Nascimento and P. Eades. A focus and constraint-based genetic algorithm for interactive directed graph drawing. Technical Report 533, University of Sydney, 2002.

[3] T. Eloranta and E. Mäkinen. TimGA: A genetic algorithm for drawing undirected graphs. Divulgaciones Matematicas, 9(2):155–171, 2001.

[4] M. R. Garey and D. S. Johnson. Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 4(3):312–316, 1983.

[5] J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.

[6] P. Kuntz, B. Pinaud, and R. Lehn. Minimizing crossings in hierarchical digraphs with a hybridized genetic algorithm. Journal of Heuristics, 12(1-2):23–36, 2006.

[7] E. Mäkinen and M. Sieranta. Genetic algorithms for drawing bipartite graphs. International Journal of Computer Mathematics, 53:157–166, 1994.

[8] H. Purchase. Metrics for graph drawing aesthetics. Journal of Visual Languages and Computing, 13(5):501–516, 2002.

[9] K. Sugiyama, S. Tagawa, and M. Toda. Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2):109–125, 1981.

[10] J. Utech, J. Branke, H. Schmeck, and P. Eades. An evolutionary algorithm for drawing directed graphs. In Proc. of CISST'98, pages 154–160. CSREA Press, 1998.

[11] Q.-G. Zhang, H.-Y. Liu, W. Zhang, and Y.-J. Guo. Drawing undirected graphs with genetic algorithms. In Proc. of ICNC 2005, volume 3612 of LNCS, pages 28–36, 2005.
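As a closing illustration for this paper, the tailored fitness function of Section 5.1 (span + weight × crossings, plus width/height penalties) and the truncation selection of Section 4 can be sketched as follows. This is our reconstruction under stated assumptions — a `layout` object exposing `span`, `crossings`, `height`, `width` and their limits is ours — not the authors' implementation.

```python
def fitness(layout, weight, c6=500, c7=500):
    """Tailored fitness of Section 5.1 (lower is better):
    span + weight * crossings + C6*F6 + C7*F7, where F6/F7 measure
    how far the layout exceeds the height/width limits."""
    f6 = max(0, layout.height - layout.max_height)
    f7 = max(0, layout.width - layout.max_width)
    return layout.span + weight * layout.crossings + c6 * f6 + c7 * f7

def truncation_selection(population, fitnesses, rate=0.7):
    """Keep the best rate*|population| chromosomes and duplicate them
    1/rate times to refill the population (Section 4, Selection)."""
    ranked = [p for _, p in sorted(zip(fitnesses, population),
                                   key=lambda t: t[0])]
    k = max(1, int(rate * len(population)))
    best = ranked[:k]
    out = []
    while len(out) < len(population):
        out.extend(best)
    return out[:len(population)]
```

With the paper's parameters (selection rate 0.7, C6 = C7 = 500), a population of 100 would keep the best 70 layouts and duplicate them to refill the 100 slots.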
The decimal form of the resulting 8-bit LBP code can be expressed as follows:

LBP(x, y) = Σ_{i=0}^{7} wi · bi(x, y),

where wi = 2^i, and bi(x, y) = 1 if Haar_i(x, y) > T, and 0 otherwise.

Figure 1. Illustration of LBP and Haar. (a) The basic LBP operator; (b) four types of Haar features.

B. Haar Feature

A simple rectangular Haar feature can be defined as the difference of the accumulated sums of pixels of the areas inside a rectangle, which can be at any position and scale within the given image. Oren et al. [10] first used 2-rectangle features in pedestrian classification. Viola and Jones [11] extended them to 3-rectangle and 4-rectangle features in the Viola-Jones object detection framework for faces and pedestrians. The difference values indicate certain characteristics of a particular area of the image. A Haar feature encodes low-frequency information, and each feature type can indicate the existence of certain characteristics in the image, such as vertical or horizontal edges or changes in texture. Haar features can be computed quickly using the integral image [11], an intermediate representation of the image with which all rectangular two-dimensional image features can be computed rapidly. Each element of the integral image contains the sum of all pixels located in the upper-left region of the original image. Given the integral image, any rectangular sum of pixel values aligned with the coordinate axes can be computed with four array references.

C. A New Sight into LBP, Haar, and Gradient

As Figure 2 shows, each component of LBP is actually a binary 2-rectangle Haar feature with rectangle size 1 × 1. Even the gradient can be seen as a combination of Haar features. For example,

Ix = Haar0 + Haar4,    Iy = Haar2 + Haar6,

where Ix and Iy are the gradients along the x axis and y axis with filters [1, −2, 1] and [1, −2, 1]^T, respectively.

Figure 2. LBP can be seen as a weighted combination of binary Haar features.

III. STRUCTURED LOCAL BINARY HAAR PATTERN

A. SLBHP

In this paper, based on an idea similar to that of multi-block local binary pattern features [12, 13], a descriptor called Structured Local Binary Haar Pattern (SLBHP) is derived from LBP with Haar features. The proposed SLBHP adopts four types of Haar features, which capture the changes of gray values along the horizontal direction, the vertical direction, and the two diagonals, as shown in Figure 3(a). However, only the polarity of each Haar feature is kept in SLBHP, while the magnitude is discarded. Note that the number of encoding patterns is thereby reduced from 256 for LBP to 16 for SLBHP. Moreover, SLBHP encodes the spatial structure of two adjacent rectangular regions in four directions. Thus, compared to LBP, SLBHP has more compact encoding patterns and incorporates more semantic structure information.

Figure 3. An example of SLBHP. (a) Four Haar features; (b) corresponding Haar features with overlapping; (c) an example of computing SLBHP values.

Let ai, i = 0, 1, ..., 8, denote the gray values of a 3×3 window with a0 at the center pixel (x, y), as shown in Figure 3(a). The value of the SLBHP code of a pixel (x, y) is given by the following equation:

SLBHP(x, y) = Σ_{p=1}^{4} B(Hp ⊗ N(x, y)) × 2^p,

where

N(x, y) = [a1 a2 a3; a8 a0 a4; a7 a6 a5],
H1 = [1 1 0; 1 0 −1; 0 −1 −1],
H2 = [0 1 1; −1 0 1; −1 −1 0],
H3 = [1 1 1; 0 0 0; −1 −1 −1],
H4 = [−1 0 1; −1 0 1; −1 0 1],

and B(x) = 1 if |x| > T, and 0 otherwise, with T a threshold (15 in our experiments). By this binarization, the feature becomes more robust to global lighting changes. Note that Hp denotes a Haar-like basis function, and Hp ⊗ N(x, y) denotes the difference between the accumulated gray values of the black and red rectangles shown in Figure 3(c). Unlike traditional Haar features, here the rectangles overlap by one pixel. Inspired by LBP and by the fact that a single binary Haar feature might not have enough discriminative power, we combine these binary features just like LBP. Figure 3(c) shows an example of the SLBHP feature. SLBHP inherits the merits of both Haar features and LBP, and it encodes the most common structure information of graphics. Moreover, SLBHP has dimension 16, smaller than the dimension 256 of LBP, while being more immune to noise, since each Haar feature uses more pixels at a time.

B. SLBHP for Graphics Retrieval

After the SLBHP value is computed, the histogram of SLBHP for a region R is computed by the following equation:

H(i) = Σ_{(x,y)∈R} I{SLBHP(x, y) = i},

where I{A} = 1 if A is true and 0 if A is false. The histogram H contains information about the distribution of the local patterns, such as edges, spots, and flat areas, over the image region R. In order to make SLBHP robust to slight translation, a graphics photo is divided into several small spatial regions ("blocks"); for each block an SLBHP histogram is computed, and the histograms are then concatenated to form the representation of the graphics, as shown in Figure 4. For better invariance to illumination, it is useful to contrast-normalize the local responses in each block before using them; experimental results showed that L2 normalization gives better results than L1 and L1-sqrt. Similar to other popular local-feature-based object detection methods, the detection window is tiled with a dense (overlapping) grid of SLBHP descriptors. The overlap size is half of the whole block.

Figure 4. An example of SLBHP histograms for graphics retrieval.

IV. EXPERIMENTAL RESULTS

479 electronic files of graphics were collected to construct the database for the retrieval experiments. The test images comprise 479 graphics photos taken by a digital camera and then corrupted by noise to obtain noisy test images. The performance of graphics retrieval is measured by the retrieval accuracy, computed as the ratio of the number of graphics correctly retrieved to the number of total queries. Moreover, not only the retrieval accuracy with respect to the first rank but also to the second and third ranks is considered in our experiments. The retrieval accuracies for the different approaches are listed in Tables I through IV, with block sizes ranging from 8×8 to 32×32; the retrieval accuracy for the non-overlapping case is listed in Table IV. By comparing Tables I and IV, we found that overlapping results in higher retrieval accuracy. Note that the proposed method and the approaches using EP [6] and LBP all adopt histogram-based matching. For the Haar feature, however, the four computed Haar values of each block are normalized and then concatenated to form the representation; the chi-square distance is also adopted as the similarity measure for the whole Haar feature.

Figure 5. Some query results for the graphics database. (a) Query graphics; (b) a list of the three most similar graphics ordered by similarity value. The one with the red rectangle is the ground-truth match.
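As an illustration of the descriptor pipeline described above (SLBHP codes, per-block histograms with L2 normalization, and chi-square matching), a minimal sketch follows. The masks H1–H4 and the threshold T = 15 are taken from the text; the function names and the rest of the scaffolding are ours, not the authors' implementation.

```python
import numpy as np

# The four 3x3 Haar-like masks H1..H4 from the paper (two diagonal,
# one horizontal, and one vertical polarity pattern).
H = [
    np.array([[ 1,  1,  0], [ 1,  0, -1], [ 0, -1, -1]]),
    np.array([[ 0,  1,  1], [-1,  0,  1], [-1, -1,  0]]),
    np.array([[ 1,  1,  1], [ 0,  0,  0], [-1, -1, -1]]),
    np.array([[-1,  0,  1], [-1,  0,  1], [-1,  0,  1]]),
]

def slbhp_code(img, T=15):
    """Per-pixel SLBHP code: sum_p B(H_p (x) N(x,y)) * 2^p, p = 1..4.
    Codes are sums of subsets of {2, 4, 8, 16}: 16 possible patterns."""
    img = img.astype(np.int64)
    h, w = img.shape
    code = np.zeros((h - 2, w - 2), dtype=np.int64)
    for p, mask in enumerate(H, start=1):
        # Correlate each 3x3 neighbourhood N(x,y) with the mask.
        resp = sum(mask[i, j] * img[i:i + h - 2, j:j + w - 2]
                   for i in range(3) for j in range(3))
        code += (np.abs(resp) > T) * (2 ** p)  # B(.) keeps polarity only
    return code

def block_histogram(code, n_bins=32):
    """Histogram H(i) = #{(x,y) in R : SLBHP(x,y) = i} for one block R,
    L2-normalised as the paper recommends."""
    hist = np.bincount(code.ravel(), minlength=n_bins).astype(np.float64)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance used for histogram-based matching."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

A full descriptor would concatenate the block histograms over a half-overlapping grid and rank database entries by ascending chi-square distance.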
In our experiments, we found that chi-square is a better similarity measure for histogram-based matching than the Euclidean distance. Some retrieval results are shown in Figure 5.

TABLE I. RETRIEVAL ACCURACIES (%) OF EDGE POINTS (EP), LBP, HAAR, AND SLBHP WITH HALF-OVERLAPPING BLOCKS.

                 1-best                    2-best                    3-best
  block     EP   LBP  Haar SLBHP     EP   LBP  Haar SLBHP     EP   LBP  Haar SLBHP
  32x32   85.2  70.4  83.7  88.3   91.6  79.5  90.6  95.6   93.3  82.5  92.5  96.5
  32x16   83.3  62.3  68.9  88.5   91.4  74.9  76.0  94.6   93.5  78.3  78.5  95.7
  16x32   86.8  66.6  60.8  90.2   92.9  76.0  68.7  95.6   94.2  80.0  72.2  96.7
  16x16   85.0  58.2  62.4  89.4   92.3  66.8  70.1  94.4   94.4  69.3  73.3  95.8
  16x8    81.2  42.0  37.4  86.6   89.8  51.8  43.4  91.9   91.2  55.5  45.9  93.7
  8x16    83.3  45.3  29.0  86.6   90.6  55.1  36.5  92.5   92.9  57.8  40.7  94.8
  8x8     79.3  30.5  29.2  82.7   86.8  39.5  34.9  89.3   89.8  44.5  39.2  91.2

TABLE II. RETRIEVAL ACCURACIES (%) UNDER GAUSSIAN NOISE WITH VARIANCE 50 AND PERTURBATION 1%.

                 1-best                      2-best                      3-best
  block     EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP
  32x32   63.88 71.19 83.09 82.46    74.53 78.91 90.40 90.81    77.87 84.13 92.48 93.95
  32x16   71.61 65.76 68.48 85.18    79.54 75.16 75.78 93.53    84.76 79.54 78.71 94.57
  16x32   72.44 67.22 60.96 87.06    79.54 76.41 68.27 93.53    83.72 81.21 72.03 94.99
  16x16   78.08 59.92 62.42 88.31    85.39 68.27 69.52 93.74    89.14 72.65 73.70 94.99
  16x8    79.96 43.63 37.58 86.01    87.89 52.61 43.42 92.07    89.98 55.74 45.72 93.95
  8x16    79.12 47.60 29.02 86.85    88.31 54.91 36.33 93.11    91.44 59.08 41.34 94.99
  8x8     81.00 31.52 29.23 83.72    87.68 40.08 34.24 90.40    90.61 44.89 38.62 92.48

TABLE III. RETRIEVAL ACCURACIES (%) UNDER SALT AND PEPPER NOISE WITH PERTURBATION 0.5%.

                 1-best                      2-best                      3-best
  block     EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP
  32x32   15.24 70.77 83.51 84.76    19.83 79.33 91.02 92.48    25.05 82.88 92.48 94.78
  32x16   20.46 64.93 68.48 86.01    27.97 75.57 75.79 94.15    39.25 79.33 78.50 95.62
  16x32   22.55 67.43 60.96 88.10    27.35 76.20 68.48 94.15    34.66 80.17 72.44 95.62
  16x16   37.37 59.71 61.59 88.52    46.97 67.85 68.89 93.95    61.38 71.19 73.28 95.41
  16x8    55.95 42.80 36.74 86.22    67.22 52.40 43.01 92.28    73.90 55.95 45.30 93.32
  8x16    54.28 47.39 28.60 87.27    67.43 54.90 36.12 93.32    78.08 58.87 40.71 94.57
  8x8     70.35 31.11 29.02 83.51    81.00 40.29 34.86 89.77    84.55 45.09 39.25 92.28

TABLE IV. RETRIEVAL ACCURACIES (%) WITH NON-OVERLAPPING BLOCKS.

                 1-best                      2-best                      3-best
  block     EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP    EP    LBP   Haar  SLBHP
  32x32   82.04 70.98 69.73 87.68    90.40 77.66 78.29 94.57    92.49 81.42 82.88 95.82
  32x16   79.75 64.09 61.80 87.27    88.10 72.65 68.89 93.53    89.98 75.57 71.61 94.78
  16x32   82.25 66.18 53.44 89.14    89.77 74.95 61.17 94.78    90.81 78.50 64.30 96.24
  16x16   81.00 57.20 57.83 88.94    88.94 66.18 65.76 93.11    91.23 69.31 69.10 94.36
  16x8    78.50 41.34 29.23 84.97    87.06 51.57 36.33 91.23    89.35 54.90 40.08 92.48
  8x16    79.96 43.01 23.38 86.01    88.52 51.77 29.44 91.23    91.65 55.11 32.57 93.53
  8x8     75.99 29.22 27.97 83.30    84.97 39.25 33.83 88.31    88.31 42.17 36.12 90.61

V. CONCLUSION

A novel local feature, SLBHP, combining the merits of Haar and LBP, is proposed in this paper. The effectiveness of SLBHP has been demonstrated by various experimental results. Moreover, compared to the other approaches using EP, Haar, and LBP descriptors, SLBHP is superior even under noisy conditions. Further research can be directed toward extending the proposed graphics retrieval to slide retrieval or e-learning video retrieval using graphics as query keywords.

ACKNOWLEDGMENT

This work was partially supported by the National Science Council of Taiwan under Grant NSC 99-2221-E-155-072, the National Nature Science Foundation of China under Grant 60873179, the Shenzhen Technology Fundamental Research Project under Grant JC200903180630A, and the Doctoral Program Foundation of Institutions of Higher Education of China under Grant 20090121110032.

REFERENCES
[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: ideas, influences, and trends of the new age," ACM Computing Surveys, vol. 40, no. 2, Article 5, pp. 1–60, 2008.

[2] J. Deng, W. Dong, R. Socher, et al., "ImageNet: a large-scale hierarchical image database," in Proceedings of Computer Vision and Pattern Recognition, 2009.

[3] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: a large dataset for non-parametric object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.

[4] B. Huet and E. R. Hancock, "Line pattern retrieval using relational histograms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1363–1370, 1999.

[5] Y. Chi and M. K. H. Leung, "ALSBIR: A local-structure-based image retrieval," Pattern Recognition, vol. 40, pp. 244–261, 2007.

[6] A. Chalechale, G. Naghdy, and A. Mertins, "Sketch-based image matching using angular partitioning," IEEE Transactions on Systems, Man, and Cybernetics — Part A: Systems and Humans, vol. 35, no. 1, pp. 28–41, 2005.

[7] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51–59, 1996.

[8] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.

[9] X. Wang, T. X. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in Proceedings of the International Conference on Computer Vision, 2009.

[10] M. Oren, C. Papageorgiou, P. Sinha, et al., "Pedestrian detection using wavelet templates," in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 1997.

[11] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[12] L. Zhang, R. Chu, S. Xiang, S. Liao, and S. Z. Li, "Face detection based on multi-block LBP representation," in Proceedings of the International Conference on Biometrics, 2007.

[13] S. Yan, S. Shan, X. Chen, and W. Gao, "Locally assembled binary (LAB) feature with feature-centric cascade for fast and accurate face detection," in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2008.
IMAGE-BASED INTELLIGENT ATTENDANCE LOGGING SYSTEM

Hary Oktavianto (1), Gee-Sern Hsu (2), Sheng-Luen Chung (1)
(1) Department of Electrical Engineering, (2) Department of Mechanical Engineering
National Taiwan University of Science and Technology, Taipei, Taiwan
E-mail: hary35@yahoo.com

Abstract— This paper proposes an extension of the surveillance camera's function: an intelligent attendance logging system. The system works like a time recorder. Based on sitting and standing-up events, the system is designed with a learning phase and a monitoring phase. The learning phase learns the environment to locate the sitting areas. After a defined time, the system switches to the monitoring phase, which monitors the incoming occupants. When an occupant sits at the same location as a sitting area found by the learning phase, the monitoring phase generates a sitting-time report. A leaving-time report is also generated when an occupant stands up from his/her seat. The system employs one static camera, placed 6.2 meters away and 2.6 meters high, facing down 21° from the horizontal. The camera's view is perpendicular to the working location. The experimental results show that the system achieves good performance.

Keywords— Activity map; attendance; logging system; learning phase; monitoring phase; surveillance camera

I. INTRODUCTION

Intelligent buildings have recently become an active research topic [1], [2], [3]. Many buildings are equipped with surveillance cameras for security reasons. This paper extends the function of existing surveillance cameras to act as an intelligent attendance logging system whose purpose is to report the occupants' attendance. The system works like a time recorder (time clock), a mechanical or electronic timepiece used to assist in tracking the hours an employee of a company has worked [4]. Instead of spending additional budget on such timepieces, the surveillance camera can be used to perform the same function. The system is called intelligent because it automatically learns from the given environment to build a map, which consists of the occupants' sitting areas. A sitting area is the spatial information describing where an occupant's working desk is located, so there is no need to select the occupants' working areas manually.

Fig. 1. Occupant's working room (left) and a map consisting of occupants' sitting areas (right).

Fig. 1 shows an example scenario. Naturally, an occupant enters the room and sits down to start working. Afterward, the occupant stands up from his/her seat and leaves the room. The sitting and standing-up events are used by the system to decide where the working area of each occupant is and when the occupant works.

The flow diagram of the proposed system is shown in Fig. 2. The system consists of an object segmentation unit, a tracking unit, a learning phase, and a monitoring phase. A fixed static camera is placed inside the occupants' working room. The images taken by the camera are pre-processed by the object segmentation unit to extract the foreground objects. A connected foreground region is called a blob. These blobs are processed further in the tracking unit. Once the system detects a blob as an occupant, it keeps tracking the occupant in the scene using the centroid, ground position, color, and size of the occupant as simple tracking features. The learning phase is responsible for learning the environment and constructs a map as its output. The monitoring phase uses the map to monitor whether the occupants are present at their working desks or not. The report on the presence or absence of the occupants is the final output of the system for further analysis. The system is implemented by taking advantage of existing open-source libraries for computer vision, OpenCV [5] and cvBlob [6].

The contributions of this paper are:
(1) A learning mechanism that locates seats in an unknown environment.
(2) A monitoring mechanism that detects the entering and leaving events of occupants.
(3) An integrated system with real-time performance up to 16 fps, ready for context-aware applications.

This paper is organized as follows. The problem definition and previous research are reviewed in Section II. Section III gives a technical overview of the proposed solution. Section IV explains the tracking that keeps track of the occupants during their appearance in the scene, based on information from the previous frame. The learning phase and the monitoring phase are explained in Section V. Section VI describes the experimental setup, results, and discussion. Finally, conclusions are summarized in Section VII.

II. PROBLEM DEFINITION AND RELATED WORK

This section describes the problem definition and previous work related to the intelligent attendance logging system.

A. Problem Definition

The goal of this paper is to design an image-based intelligent attendance logging system. Given a fixed static camera as the input device inside an unknown working environment with a number of fixed seats, each belonging to a particular user or occupant, and occupants who do not necessarily enter and leave at the same time, we are to design a camera-equipped intelligent attendance logging system that can report, in real time, each occupant's entering and leaving events to and from his/her particular seat.

Fig. 2. Flow diagram of the system.

The system is designed under two assumptions. The first is that the environment is unknown, in that the number of seats and their locations are not known before the system starts monitoring. The second is that each occupant has his/her own seat; as such, detecting the presence/absence at a particular seat amounts to answering the presence/absence of the corresponding occupant.

There are two performance criteria, corresponding to the two main functions of the system: finding the sitting areas and reporting the monitoring results. The first criterion is that the system should find the sitting areas given by the ground truth. The second is that the system should be able to monitor the occupants during their appearance in the scene and generate an accurate report.

B. Related Work

Intelligent buildings have been developed over the past decades. Zhou et al. [3] developed video-based human indoor activity monitoring aimed at assisting the elderly. Demirdjian et al. [7] presented a method for automatically estimating activity zones from observed user behavior in an office room using 3-D person tracking, with simple position, motion, and shape features used for tracking. These activity zones are used at run time to contextualize user preferences, e.g., allowing "location-sticky" settings for messaging, environmental controls, and/or media delivery. Girgensohn, Shipman, and Wilcox [8] observed that retail establishments want to know about traffic flow in order to better arrange goods and staff placement; they visualized the results as heat maps showing activity, object counts, and average velocities overlaid on a map of the space. Morris and Trivedi [9] extracted human activity, presenting an adaptive framework for live video analysis based on trajectory learning. A surveillance scene is described by a map, learned in unsupervised fashion, that indicates interesting image regions and the way objects move between these places. These descriptors provide the vocabulary to categorize past and present activity, predict future behavior, and detect abnormalities.

The studies above detect occupants and build a map of the locations those people mostly occupy. This paper extends the use of surveillance cameras to monitoring the occupants' presence. A static camera is used, as in [2], [8]. Morris and Trivedi applied an omnidirectional camera [9] in their system, while other researchers [1], [3], [7] used stereo cameras to reduce the effects of lighting changes and occlusion. The system in this paper is intended to work in real time and to learn the environment automatically from observed behavior.

III. TECHNICAL OVERVIEW

As shown in Fig. 2, and in more detail in Fig. 3, the input images acquired from the camera are fed into the object segmentation unit to extract the foreground objects, i.e., the moving objects in the scene. A foreground object is obtained by subtracting the background image from the current image. To model the background image, a Gaussian Mixture Model (GMM) is used. A GMM represents the variation of each background pixel with a set of weighted Gaussian distributions [10], [11], [12], [13]. The first frame is used to initialize the means. A pixel is decided to be background if it falls within a deviation around the mean of any of the Gaussians that model it. The update process, performed on the current frame, increases the weight of the Gaussian that matches the pixel. Taking the difference between the current image and the background image yields the foreground object.

After that, the foreground image is converted from RGB to gray scale [13]. The edges of the objects in the gray image are extracted by an edge detector based on a moving-frame algorithm, which has four steps. In step one, the gray image I is shifted in eight directions by a fixed pixel distance (dx and dy), resulting in eight images with offsets to the right, left, up, down, up-right, up-left, down-right, and down-left, respectively. These eight shifted images are called moving-frame images (Fi):

    Fi(x, y) = I(x − dxi, y − dyi)    (1)

In step two, each moving-frame image is updated (Fi*) by subtracting it from the image frame I to obtain the extended edges:

    Fi*(x, y) = I(x, y) − Fi(x, y)    (2)

In step three, each moving-frame image is converted to binary by applying a threshold value TF:

    FiT(x, y) = fT(Fi*(x, y)),  where fT = 1 if Fi*(x, y) ≥ TF, and 0 otherwise    (3)

Finally, all moving-frame images are added together, which yields the edge image E:

    E(x, y) = Σi FiT(x, y)    (4)

Fig. 3. The details of the object segmentation unit and the tracking unit.

The edge detector extracts the object while removing weak shadows at the same time, since weak shadows have no edges. However, strong shadows can occur and create some edges. Strong edges appearing between the legs can be tolerated, since the system does not consider the occupant's contour.

The result of the edge detection process is refined using morphology filters [13]. A dilation filter is applied twice to join the edges, and an erosion filter is applied once to remove noise. The last step in the object segmentation unit is connected-component labeling, which detects connected regions; a connected region is called a blob. In the object segmentation unit, the GMM, the gray-scale conversion, the edge detector, and the morphology filters are implemented with the OpenCV library, while the connected-component labeling is implemented with the cvBlob library.

A blob representing a foreground object may be broken, due to occlusion by furniture or to having the same color as the background image. Rules are therefore provided to group broken blobs, based on three conditions: the intersection distance of the blobs (BI), the nearest vertical distance of the blobs (Bdy), and the angle of the blobs (BA) from their centroids. Bdy and BA are calculated using (5), while BI is explained in [14]:

    Bdy = min(Bi.y, Bj.y),  BA = ∠(ci, cj)    (5)

where Bi.y and Bj.y are the y-coordinates of blob i and blob j, respectively, and ci and cj are the centroids of the blobs. If the three conditions in (6) are satisfied, the broken blobs are grouped:

    G = 1 if (BI ≤ TC) ∧ (Bdy ≤ TD) ∧ (BA ≤ TA), and 0 otherwise    (6)

where TC, TD, and TA are the threshold values for the intersection distance, the nearest vertical distance, and the angle of the blobs, respectively. In the experiments, TC is 0 pixels, TD is 50 pixels, and TA is 30°.

After the broken blobs are grouped into one, a motion detector tests whether the blob is an occupant. A blob is an occupant if its size looks like that of a human and it has movement. The minimum size of a human is an approximation relative to the image size. X-axis displacement and optical flow [13] are used to detect the movement of the blob. If a blob is detected as an occupant, the tracking unit gives it a unique identification (ID) number and a track. A track is an indicator that a blob is an occupant, and it is represented by a bounding box. Tracking rules are implemented as states to handle each event. There are five basic states: entering, person, sitting, standing-up, and leaving. Since occlusions may happen during tracking, two more states are added: the merge state and the split state. In the tracking unit, the optical flow is implemented with the OpenCV library, while the tracking rules employ the cvBlob library.

The learning phase is activated if the map has not been constructed yet. The sitting state in the tracking unit triggers the learning phase to locate the occupant's sitting area. After a defined time, the learning phase finishes its job and the monitoring phase is activated. In this phase, the sitting state and the standing-up state in the tracking unit trigger the monitoring phase to generate reports, which tell when each occupant sat down and left.

The system is evaluated by testing it with video clips covering two scenarios. Five occupants are asked to enter the scene; they sit, stand up, leave the scene, and sometimes cross each other.
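The moving-frame edge detector of (1)-(4) can be sketched directly in NumPy. This is an illustrative sketch, not the paper's implementation (which used OpenCV from C++); the shift distance d and the threshold t_f are assumed values, since the paper does not report TF or the pixel offsets, and np.roll wraps at the borders, a simplification an actual implementation would pad instead.

```python
import numpy as np

def moving_frame_edges(gray, d=1, t_f=15):
    """Moving-frame edge detector: shift (1), subtract (2), threshold (3), sum (4)."""
    img = gray.astype(np.int16)  # signed type so subtraction can go negative
    # Eight shift directions: right, left, up, down, and the four diagonals.
    offsets = [(dy, dx) for dy in (-d, 0, d) for dx in (-d, 0, d) if (dy, dx) != (0, 0)]
    edges = np.zeros(gray.shape, dtype=np.uint8)
    for dy, dx in offsets:
        shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)  # F_i, eq. (1)
        diff = img - shifted                                     # F_i*, eq. (2)
        edges += (diff >= t_f).astype(np.uint8)                  # F_i^T, eq. (3), summed per eq. (4)
    return edges
```

On a synthetic frame with a bright square on a dark background, the result is nonzero along the square's boundary and zero in uniform regions, which matches the claim that flat (weak-shadow) areas produce no edges.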
IV. TRACKING

This section describes the tracking rules in the tracking unit (Fig. 3). The tracking rules keep track of the occupants during their appearance in the scene based on the information (features) from the previous frame. The tracking rules are represented by states; the basic tracking states are shown in Fig. 4. There are five states:

Fig. 4. Basic tracking states.

- Entering state (ES): an incoming blob that appears in the scene for the first time is marked as entering. This state also receives information from the motion detector to decide whether the incoming blob is an occupant or noise. If the incoming blob is considered noise (for instance, its size is too small because of shadows) and remains for more than 100 frames, the system deletes it. To erase the noise from the scene, the system re-initializes the Gaussian model in the noise region so that the noise is absorbed into the background image. An incoming blob is classified as an occupant if it has motion for at least 20 consecutive frames and its height is more than 60 pixels.

- Person state (PS): if the incoming blob is detected as an occupant, a unique identification (ID) number and a bounding box are attached to it. A blob detected as an occupant is called a track, and the system adds this track to the tracking list.

- Sitting state (IS): detects whether the occupant is sitting. An occupant can be assumed to be sitting if there is no movement for a defined time. In the experiments, an occupant is sitting when the x-axis displacement is zero for 20 frames and the velocity vectors from the optical flow are zero for 100 consecutive frames.

- Standing-up state (US): detects when a sitting occupant starts to move to leave his/her desk. In the experiments, a standing-up occupant is detected when the sitting occupant produces movements, the height increases above 75%, and the size changes to 80%-140% of the current bounding box.

- Leaving state (LS): deletes the occupant from the list. A leaving occupant is detected when the occupant moves to the edge of the scene and the occupant's track loses its blob for 5 frames.

A. Tracking Features

The system tries to match every detected occupant in the scene from frame to frame by matching the occupant's features. Four features (centroid, ground position, color, and size) are used for tracking. Fig. 5 illustrates the blob (the connected region of the occupant object in the current frame), the track (a connected blob considered an individual occupant, surrounded by a bounding box), the size (the number of blob pixels, or area density), the centroid (center of mass), and the ground position (the foot position of the occupant).

Fig. 5. An occupant in the scene and the features.

The first feature is the centroid. The centroid is used to associate an object's location in the 2-D image between two consecutive frames by measuring the distance between centroids. Fig. 6 shows two objects being associated: one object is already defined as a track in the previous frame (t−1), and another appears in the current frame (t) as a blob. Each object has a centroid (c). The two objects are measured [14] in the following way. If one centroid lies inside the other object (the boundary of each object is defined as a rectangle), the returned distance is zero. If the centroids lie outside each other's boundaries, the returned distance is from the nearest centroid to the opponent's boundary. A threshold value (TC) is set. A distance below TC means the two objects are the same object, and the track position is updated to the blob position. If the distance test is not satisfied, the two objects are not correlated with each other; it could be that the previous track lost its object while a new object appeared at the same time. A track that misses the tracking is handled in the leaving state (LS), and a new object that appears in the scene is handled in the blob state (BS).

Fig. 6. Centroid feature to check the distance in 2D.

The second feature is the ground position. It is possible that two objects are not the same object, yet their centroids lie inside each other's boundaries. Fig. 7 shows this problem: there are two occupants in the scene, one sitting while the other walks through behind. In the 2-D image (left), the two objects overlap each other; however, it is clear that the walking occupant should not be confused with the sitting occupant. To solve this problem, the ground position is used to associate the object's location in 3-D between two consecutive frames. The ground-position feature eliminates the error of one object being updated with another even though they overlap. The occupant's foot location is used as the ground position. A fixed uniform ellipse boundary (25 pixels and 20 pixels for the major and minor axes, respectively) around the ground position indicates the maximum allowable range for the same person to move. In the real scene, this pixel area corresponds to 40 square centimeters for the object nearest the camera, up to 85 square centimeters for the farthest object. This wide range is caused by using a uniform ellipse distance for all locations in the image.

Fig. 7. Ground position feature to check the distance in 3D. Blob and track in the processing stage (left). View in the real image (right).

The third feature is color. The color feature captures the color information of the occupant's clothing and helps to separate objects under occlusion. A three-dimensional RGB color histogram is used. Let b be the bin that counts the number of pixels falling into the same category, and let n be the total number of bins; the histogram H^{R,G,B} of occupant i satisfies:

    Hi^{R,G,B} = Σ (k = 1 to n) bk    (7)

The histograms H^{R,G,B} are calculated on the masked image and then normalized. The masked image, shown in Fig. 8, is obtained by AND-ing the occupant's object with the blob. The occupant's histograms are matched with the correlation method. In the experiments, 10 bins per color channel are chosen, and the histogram matching uses a threshold value of 0.8 to indicate that the compared histograms match sufficiently.

Fig. 8. Color feature is calculated on the masked image.

The fourth feature is size. The size feature matches an object between two consecutive frames based on pixel density; the pixel density is the blob itself, as shown in Fig. 9. The allowable size change in the next frame is set to ±20% of the previous size. Let p(x′, y′) be a pixel location of an occupant in the binary image. The size feature of object i is calculated as follows:

    si = Σ (x′, y′) p(x′, y′)    (8)

Fig. 9. Size feature of occupant.

B. Merge-Split Problem

A challenging situation may happen: while the occupants are walking in the scene, they cross each other and create occlusions. Since the system keeps tracking each occupant in the scene, it is necessary to extend the tracking states of Fig. 4. Two states are added for this purpose: the merge state (MS) and the split state (SS). Fig. 10 shows the extended tracking states. Merges and splits can be detected using a proximity matrix [14]. Objects are merged when multiple tracks (in the previous frame) are associated with one blob (in the current frame). Objects are split when multiple blobs (in the current frame) are created from one track (in the previous frame). In the merge condition, only the centroid feature is used to track the next possible position, since the other three features are not useful when the objects merge. After a group of occupants splits, their colors are matched to their colors just before they merged.

Fig. 10. Extended tracking states.

In the experiments, when more than two occupants split, sometimes an occupant remains occluded and splits off later. When the occluded occupant splits, the system will re-identify each occupant and correct
their previous ID numbers, restoring the IDs from just before they merged. Fig. 11 shows the algorithm that handles the occlusion problem.

Fig. 11. Merge-split algorithm with occlusion handling.

V. LEARNING AND MONITORING PHASES

This section introduces how the learning phase and the monitoring phase work. Both phases are derived from the tracking unit, namely from the sitting state and the standing-up state in the tracking rules. At the beginning, the system activates the learning phase. Triggered by sitting events, the learning phase starts to construct the map. When the given time interval has passed, the learning phase is stopped and the map has been constructed. The system then switches to the monitoring phase to report the occupants' attendance based on when they sit down at and stand up from their seats.

A. Learning Phase

The learning phase is derived from the sitting state in the tracking rules. The output of the learning phase is a map consisting of the occupants' sitting areas. From Fig. 4, the information about when an occupant sits is extracted from the sitting state (IS). When an occupant is detected as sitting, the system starts counting. After a certain counting period, the location where the occupant sits is determined to be a sitting area. The counting period is used as a delay that makes sure the occupant has been sitting long enough; in the experiments, the delay is set to 200 frames. Ideally, the learning phase is considered finished after all sitting areas have been found. In this paper, to show that the learning phase does its job, the occupants enter the scene and sit one by one without creating occlusions. The scenario for this demonstration is arranged so that after 10 minutes the map is expected to be completely constructed; the learning phase has then finished its job, and the system is switched to the monitoring phase. In a real situation, the delay and the duration of the learning phase can be adjusted.

B. Monitoring Phase

The monitoring phase is derived from the sitting state and the standing-up state in the tracking rules. It generates the reports of the occupants' attendance, using the map constructed by the learning phase. From Fig. 4, the sitting state (IS) and the standing-up state (US) trigger the monitoring phase. When an occupant sits, the system tries to match the occupant's current sitting location with a sitting area in the map. If the positions are the same, the system generates a sitting-time stamp for that particular sitting area. A leaving-time stamp is also generated when the occupant moves out of the sitting area. Fig. 12 shows an example of the report.

    Sitting area number   Event     Time stamp
    1                     Sitting   09:02:09 Wed 2 June 2010
    2                     Sitting   09:07:54 Wed 2 June 2010
    3                     Sitting   09:12:16 Wed 2 June 2010
    2                     Leaving   10:46:38 Wed 2 June 2010
    2                     Sitting   10:49:54 Wed 2 June 2010
    3                     Leaving   12:46:38 Wed 2 June 2010

Fig. 12. A report example.

VI. APPLICATION TO INTELLIGENT ATTENDANCE LOGGING SYSTEM

This paper demonstrates the use of a surveillance camera as an intelligent attendance logging system. As mentioned earlier, the system works like a time recorder: it assists in tracking the hours of occupant attendance. Using this system, occupants do not need to carry a special tag or badge. This section describes the environment setup, the results, and the discussion.

A. Environment Setup

A static network camera, an HLC-83M produced by Hunt Electronic, is used to capture images of the scene. The image size taken from the camera is 320 × 240 pixels. The test room is in our laboratory. The camera is placed about 6.24 meters away and 2.63 meters high, facing down 21° from the horizontal line. The occupants' desks and the camera view are orthogonal, to get the best view. There are 5 desks as the ground truth. The room has indoor lighting from fluorescent lamps, and the windows are covered so that sunlight cannot enter the room during the tests.

B. Result and Discussion

A Visual C++ and OpenCV platform on an Intel® Core™2 Quad CPU at 2.33 GHz with 4 GB of RAM is used to implement the system. Both offline and online methods are allowed. In a scene without any detected objects, the system runs at 16 frames per second (fps). When the number
of incoming objects increases, the lowest speed achieved is 8 fps.

The algorithm was tested with two types of scenario. The first scenario is sitting occupants with no occlusion (Fig. 13); it demonstrates the working of the learning phase. The second scenario is the same as the first, but the occupants are allowed to cross each other and create occlusions (Fig. 14); it demonstrates the merge-split handling.

Table 1 shows the test results of scenario type 1. There are 5 desks as ground truth (Fig. 1). Five occupants enter the scene; they sit, stand up, and leave the scene one by one without creating any occlusion. The order in which the occupants enter and leave is arranged: the occupants occupy the desks starting from desk number 5 (the rightmost desk) through desk number 1 (the leftmost desk), and when leaving, they start from desk number 1 through desk number 5. This order makes sure that no occupant walks behind a sitting occupant. The scenario was repeated 10 times. The results show no problems for desks number 2, 3, and 4; however, there are some cases where the system failed to locate the occupants' sitting areas. In the case of desk number 1, the occupant's blob sometimes merges with a neighboring occupant's, so the system cannot detect or track the occupant who sits at desk number 1. In the case of desk number 5, the occupant's color was similar to the color of the background image, which caused the occupant to produce a small blob; the system cannot track the occupant because the blob becomes too small.

Table 1. Test results of scene type 1. The number of times each seat was detected by the system over 10 experiments.

    Sitting area   #1   #2   #3   #4   #5
    Detected        7   10   10   10    8
    Missed          3    0    0    0    2

Table 2 shows the test results of scene type 2. The system monitored the occupants based on the map that had been found. The experiments were done 10 times without occlusion. There are some cases where the system failed to recognize a sitting occupant, for the same reason as in the previous discussion: the system lost track of the occupant because the occupant's color was similar to the background image, so the occupant suddenly had a small blob. The system also failed to recognize the leaving event from desk number 1. The system detects a leaving occupant when the occupant splits from his/her seat; since desk number 1 does not leave enough space for the system to detect the split, the system kept reporting desk number 1 as occupied even after the corresponding occupant had left.

Table 2. Test results of scene type 2. The success rate of monitoring without occlusion over 10 experiments.

    Occupant   #1   #2   #3   #4   #5
    Sitting     9   10   10   10    9
    Leaving     0    9   10   10    9

Table 3 shows the test results of scene type 2 with occlusion. The experiments were done 10 times, and the system should be able to keep tracking the occupants. To test the system, three occupants enter the scene to create the scenarios shown in Table 3: some occupants walk behind a sitting occupant, or the occupants simply walk and cross each other. In most cases, the system can detect which is which after they split. The errors happened because of the occupants' colors and because of sitting occupants. If the occupants have similar colors, the system may get confused in differentiating them. In another case, a sitting occupant makes a movement and creates a blob, but the system does not yet have enough evidence to change the status of the sitting occupant to standing up; another occupant walks closer and merges with this blob. After they split, the system is confused because the blob has no previous information. As a result, the system miscounts the tracks that were merged, and the occupants' ID numbers are restored incorrectly.

Table 3. Test results of scene type 2. The number of occupants mistakenly assigned in merge-split cases over 10 merges.

    Number of   Sitting    Walking              Split
    occupants   occupants  occupants  Merges  Succeeded  Failed
    2           0          2          10      9          1
    2           1          1          10      9          1
    3           0          3          10      8          2
    3           1          2          10      9          1
    3           2          1          10      9          1

VII. CONCLUSIONS

We have designed an intelligent attendance logging system by integrating open-source libraries with additional algorithms. The system works in two phases, a learning phase and a monitoring phase, and achieves real-time performance up to 16 fps. We also demonstrate that the system can handle occlusions of up to three occupants, considering that the scene becomes too crowded with more than three. While a regular time recorder only reports the time stamps of the beginning and end of an occupant's working hours, this system provides more detailed timing information. Some unexpected behaviors may cause errors, for instance when the occupant's color is similar to the background, because of the desk position, or when the occupant moves while
sitting.

In the future, the events generated by this system can be used to deliver messages to other systems. It is possible to control the environment automatically, such as adjusting the lighting, playing relaxing music, or setting the air conditioner when an occupant enters or leaves the room. The summary report of the occupants' attendance can also be used for activity analysis. The current system does not include recognition capability, since it only detects whether a working desk is occupied or not. If occupant recognition is needed, there are two ways to add it: after the map of sitting areas has been found, the user may label each sitting area manually, or a recognition system can be added.

Fig. 13. Scenario type-1. It shows how the system builds a map: the current images (left) and the map shown as filled rectangles (right).

Fig. 14. Scenario type-2. The map of 3 desks has been completed. The occupants cross each other, and the system can handle this situation.

REFERENCES

[1] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafers, "EasyLiving: Technologies for intelligent environments," Lecture Notes in Computer Science, vol. 1927, pp. 97-119, 2000.
[2] S.-L. Chung and W.-Y. Chen, "MyHome: A residential server for smart homes," Lecture Notes in Computer Science, vol. 4693 LNAI (Part 2), pp. 664-670, 2007.
[3] Z. Zhou, X. Chen, Y.-C. Chung, Z. He, T. X. Man, and J. M. Keller, "Activity analysis, summarization, and visualization for indoor human activity monitoring," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1489-1498, 2008.
[4] Wikipedia, "Time clock," http://en.wikipedia.org/wiki/Time_clock (June 24, 2010).
[5] OpenCV. Available: http://sourceforge.net/projects/opencvlibrary/
[6] cvBlob. Available: http://code.google.com/p/cvblob/
[7] D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell, "Activity maps for location-aware computing," Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision (WACV), pp. 70-75, 2002.
[8] A. Girgensohn, F. Shipman, and L. Wilcox, "Determining activity patterns in retail spaces through video analysis," MM'08: Proceedings of the 2008 ACM International Conference on Multimedia, pp. 889-892, 2008.
[9] B. Morris and M. Trivedi, "An adaptive scene description for activity analysis in surveillance video," 19th International Conference on Pattern Recognition (ICPR 2008), 2008.
[10] A. Bayona, J. C. SanMiguel, and J. M. Martínez, "Comparative evaluation of stationary foreground object detection algorithms based on background subtraction techniques," 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2009), pp. 25-30, 2009.
[11] S. Herrero and J. Bescós, "Background subtraction techniques: Systematic evaluation and comparative analysis," Lecture Notes in Computer Science, vol. 5807 LNCS, pp. 33-42, 2009.
[12] P. KaewTraKulPong and R. Bowden, "An improved adaptive background mixture model for real-time tracking with shadow detection," Proc. 2nd European Workshop on Advanced Video Based Surveillance Systems (AVBS01), 2001.
[13] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, Sebastopol, CA: O'Reilly Media, 2008.
[14] A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, and R. Bolle, "Appearance models for occlusion handling," Image and Vision Computing, vol. 24, no. 11, pp. 1233-1243, 2006.
i-m-Walk: Interactive Multimedia Walking-Aware System

1 Meng-Chieh Yu (余孟杰), 2 Cheng-Chih Tsai (蔡承志), 1 Ying-Chieh Tseng (曾映傑), 1 Hao-Tien Chiang (姜昊天), 1 Shih-Ta Liu (劉士達), 1 Wei-Ting Chen (陳威廷), 1 Wan-Wei Teo (張菀薇), 2 Mike Y. Chen (陳彥仰), 1,2 Ming-Sui Lee (李明穗), and 1,2 Yi-Ping Hung (洪一平)

1 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
2 Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Abstract

i-m-Walk is a mobile application that uses pressure sensors in shoes to visualize the phases of footsteps on a mobile device, in order to raise users' awareness of their walking behaviour and to help them improve it. As an example application in slow technology, we used i-m-Walk to help beginners learn "walking meditation," a type of meditation where users aim to take each pace as slowly as possible and to land every footstep with the toes first. In our experiment, we asked 30 participants to learn walking meditation over a period of 5 days; the experimental group used i-m-Walk from day 2 to day 4, and the control group did not use it at all. The results showed that i-m-Walk effectively assisted beginners in slowing down their pace and decreasing the error rate of paces during walking meditation. In conclusion, this study may be of importance in providing a mechanism to help users better understand their pace and improve their walking habits. In the future, i-m-Walk could be used in other applications, such as walking rehabilitation.

Keywords: Smart Shoes, Walking Meditation, Visual Feedback, Slow Technology

1. INTRODUCTION

Walking is an integral part of our daily lives in terms of transportation as well as exercise, and it is a basic exercise that can be done everywhere. In recent years, with the rapid growth of smartphones, many research projects have studied walking-related human-computer interfaces on mobile phones. For example, there is research evaluating walking user interfaces for mobile devices [9] and proposing minimal-attention user interfaces to support ecologists in the field [21]. In addition, several walking-related systems have been developed to help people walk and run. Nike+ used footstep sensors attached to users' shoes to adjust the playback speed of music while running and to track running-related statistics such as time, distance, pace, and calories burned [16]. adidas used an accelerometer to detect the footsteps of the runner and report running information audibly [31]. Wii Fit used balance boards to detect users' center of gravity and offered several games, such as yoga, gymnastics, aerobics, and balancing [18].

In addition, walking is an important factor in our health. For example, it is one of the earliest rehabilitation exercises and an essential exercise for elders [5]. Improper foot pressure distribution can also contribute to various types of foot injuries. In recent years, ambient light and biofeedback have been widely used in rehabilitation and healing, and the concept of "slow technology" was proposed. Slow technology aims to use slowness in learning, understanding, and presence to give people time to think and reflect [30]. Meditation is one example of slow technology, and "walking meditation" is an important form of meditation. Although many research projects have focused on meditation, showing benefits such as enhancing the synchronization of neuronal excitation [11] and increasing the concentration of antibodies in blood after vaccination [3], most have focused on meditation while sitting.

In order to better understand how users walk in a portable way, we have designed i-m-Walk, which uses multiple force sensitive resistor sensors embedded in the soles of shoes to monitor users' pressure distribution while walking. The sensor data are wirelessly transmitted over ZigBee, and then relayed over Bluetooth to be analyzed in real time on smartphones. Interactive visual feedback can then be provided via the smartphones (see Figure 1).

In this paper, in order to develop a system that can help users improve their walking habits, we use the training of walking meditation as an example application to evaluate the effectiveness of i-m-Walk. Traditional training of walking meditation demands one-on-one instruction, and there is no standardized evaluation after training. It is challenging for beginners to self-learn walking meditation without feedback from the trainers.
Figure 1. A participant is using i-m-Walk during walking meditation.

We have designed experiments to test the effect of training with i-m-Walk during walking meditation. Participants were asked to do a 15-minute practice of walking meditation for five consecutive days. During the experiment, participants using i-m-Walk were shown real-time pace information on the screen. We would like to test whether this could help participants raise awareness of their walking behaviour and improve it. We proposed two hypotheses: (a) i-m-Walk could help users walk more slowly during meditation; (b) i-m-Walk could help users walk correctly according to the method of walking meditation.

This paper is structured as follows. The first section introduces the walking system. The second section reviews walking detection and multimedia-assisted walking applications. This is followed by an introduction to walking meditation. The fourth section describes the system design, after which the experimental design is presented. The results of the various analyses follow each of these descriptive sections. Finally, the discussion and conclusion are presented and suggestions are made for further research.

2. RELATED WORKS

2.1 Methods of Walking Detection

In the past decade, there has been much research on intelligent shoes. The first concept of wearable computing and smart clothing systems included intelligent clothes, glasses, and intelligent shoes; the intelligent shoes could detect the walking condition [12]. Later, one study used pressure sensors and gyro sensors to detect foot postures such as heel-off, swing, and heel-strike [22], and another embedded pressure sensors in the shoes to detect the walking cycle, with a vibrator equipped to assist walking [26]. Besides these, there are many other methods of walking detection, such as using bend sensors [15], accelerometers [2], ultrasound [29], and computer vision [24] to analyze footsteps.

2.2 Multimedia-Assisted Walking Application

There are several studies using multimedia feedback and walking detection techniques to help people in monitoring or training applications in daily life. In dance training, an intelligent shoe could detect the timing of footsteps and play music to help beginners learn ballroom dancing; if it detected missed footsteps while dancing, it showed warning messages to the user. The device emphasizes the acoustic element of the music to help the dancing couple stay in sync with it [4]. Another dance application detected dancers' paces and applied them in interactive music for dance performance [20]. For musical tempo and rhythm training for children, there was a system that wrote out the music on a timeline along the ground, and each footstep activated the next note in the song [13]. In addition, visual information has been used to adjust foot trajectory during the swing phase of a step when stepping onto a stationary target [23].

On the psychological side, there are some experiments related to walking-perception systems. In an application assisting the walking of stroke patients, lighted targets were placed onto the left and right sides of a walkway, and stroke patients could follow the lighted targets to carry on their steps; the results pointed out that stroke patients might effectively be helped by using vision and hearing as guidance [14]. An fMRI study of multimedia-assisted walking showed increased activation during visually guided self-generated ankle movements, and argued that multimedia-assisted walking has a profound effect on people [1]. In entertainment-related walking applications, Personal Trainer: Walking [17] detects users' footsteps through an accelerometer and encourages users to walk through interesting, interactive games. In healthcare, one system applied the concept of intelligent shoes to detect the walking stability of the elderly and thus prevent falls [19]; it monitored walking behaviours and used a fall-risk estimation model to predict the future risk of a fall. Another application used an electromyography biofeedback system for stroke and rehabilitation patients, and the results showed recovery of foot-drop in the swing phase after training [8].

3. WALKING MEDITATION

The practice of meditation has several different forms and postures, such as meditation while standing, sitting, walking, or lying down on the back. Compared to sitting meditation, people tend to feel less dull, tense, or easily distracted in walking meditation. In this paper, we focus on meditation while walking, which is also named walking meditation. Walking meditation is a way to align the feelings inside and outside of the body, and it helps people focus and concentrate on mind and body. Furthermore, it can also deepen the investigation of our knowledge and wisdom.
Figure 2. Six phases of each footstep in walking meditation [25].

The method of walking meditation aims to take each pace as slowly as possible and to land each pace with the toes first. The participants can focus on the movement of walking, from raising, lifting, pushing, lowering, stepping, to pressing (Figure 2). The participants should also be aware of the movement of the feet in each stage; it is important to stay aware of the sensation of the feet. As a result, continued practice of walking meditation is an effective way to develop concentration and maintain tranquillity in participants' daily lives. Furthermore, it can help participants become calmer, so that their minds can be still and peaceful. With long-term practice, walking meditation benefits people by increasing patience, enhancing attention, overcoming drowsiness, and leading to a healthy body [6]. In order to help beginners learn the walking method of walking meditation, the i-m-Walk system was developed.

4. SYSTEM DESIGN

i-m-Walk includes a pair of intelligent shoes for detecting pace, a ZigBee-to-Bluetooth relay, and a smartphone for walking analysis and visual feedback. Three force sensitive resistor sensors are fixed underneath each shoe insole and send pressure data through the relay. We implemented the analysis and visual feedback application on an HTC HD2 smartphone running Windows Mobile 6.5, which has a 4.3-inch LCD screen. The overview of the system is shown in Figure 3.

Figure 3. System structure of i-m-Walk: the force sensors and a microcontroller in each shoe transmit over XBee to the relay, which forwards the data over Bluetooth to the smartphone for footstep detection, stability analysis, and visual feedback.

4.1 i-m-Walk Architecture

The shoe module is based on Atmel's high-performance, low-power 8-bit AVR ATMega328 microcontroller, and transmits sensing values wirelessly through a 2.4 GHz XBee 1 mW chip-antenna module. The module size is 3.9 cm x 5.3 cm x 0.8 cm, with an overall weight of 185 g including an 1800 mAh lithium battery (Figure 4). With a fully charged battery, the pressure sensing module can run continuously for 24 hours, and a power button can switch the module off when it is not being used. We kept the hardware small and lightweight in order not to affect users while walking.

We use force sensitive resistor sensors to detect the pressure distribution of the feet while walking. The sensing area of each sensor is 0.5 inch in diameter, and the sensor changes its resistance depending on how much pressure is applied to the sensing area. In our system, the intelligent shoes detect the walking speed and the walking method in walking meditation. Following the recommendations of orthopaedic surgery, we fixed three force sensitive resistor sensors underneath the shoe insole, at the three main weight-bearing areas: the structural bunion, the Tailor's bunion, and the heel (see Figure 4). The shoe module is put on the outside of the shoes (see Figure 5).

Figure 4. Sensing module: the micro-controller and wireless module (right), and one of the insoles with three force sensitive resistor sensors (left).

Figure 5. Sensing shoes: the sensing module attached to the shoes.
4.2 Walking detection

There are many methods of walking detection, differing according to the application. In our system, we use three pressure sensors in each shoe, giving six sensing values in total at a sample rate of 30 samples per second. In order to detect whether the user lands each pace with toes first or heel first, we divide each shoe into two parts: a toe part and a heel part. The sensing value of the toe part is the average of the two force sensors underneath the structural bunion and the Tailor's bunion; the sensing value of the heel part is that of the force sensor underneath the heel. The system therefore divides the sensing area into a toe part and a heel part per shoe, four parts in total per person. We then use a threshold method: at the moment a part's sensing value drops below the threshold, that part is activated. We define the beginning of each gait cycle as the moment the heel part lifts, and the end of the gait cycle as the moment the other foot's heel part rises; the previous cycle stops on one foot and the other foot begins a new gait cycle. Figure 6 shows an example. In this case, when the heel of the left foot rose at 5 seconds, the sensing value fell below the threshold, and our system detected the left-foot rise at that moment; meanwhile, the user's right foot was stepping down. Conversely, when the heel of the right foot rose at 10.7 seconds, the sensing value fell below the threshold, and our system detected the right-foot rise at that moment.

Figure 6. Signal processing of walking signals. The blue line indicates the sensed weight (kg) of the heel and the green line indicates the sensed weight of the toe. The red line marks the threshold for detecting the landing event. The gray block indicates which foot is landing.

4.3 User interface

Multimedia feedback can be effectively applied in preventive medicine [7], and it can also effectively assist rehabilitation patients in walking [5, 27]. i-m-Walk is developed to assist users in learning the walking method during walking meditation. The user interface of i-m-Walk includes three components: warning message, pace awareness, and walking speed (see Figure 7). In this section, we describe the user interface and the design principles of our system.

4.3.1 Pace awareness

The function of pace awareness is to help the user become aware of his walking phases and of whether he uses correct footsteps during walking meditation. A feet pattern is shown on the smartphone, and a color block shows where the foot's center of gravity is and how much force is on the foot in real time. The transparency of the block decreases while the user lands his foot and increases while the user raises it. The color block moves top-down when the participant lands with the toes first, and bottom-up when the participant lands with the heel first; if the front of the foot lands first, the colour block moves forward to indicate the landing position. In addition, if the user lands a pace with the toes first, the system decides that he is using the correct walking method of walking meditation, and the colour block is displayed in green. Conversely, when the user lands a pace with the heel first, the system recognizes that he is using the wrong walking method, and the colour block changes from green to red.

4.3.2 Walking Speed and Warning Message

During walking meditation, the user should stabilize his walking pace at a low speed. The user interface should therefore provide the walking speed in real time and remind the user when the walking speed is too fast. Walking speed and wrong paces can be measured after processing the walking signals. The walking speed is visualized as a speedometer whose indicator points to the current value; for example, if the indicator points to the value "30", the user is walking thirty paces per three minutes. The speedometer thus serves to remind the user when he is walking too fast. Based on a pilot study, we defined the limit of the walking speed as 40 paces per three minutes. When the walking speed exceeds this limit, the indicator points to the red area and the screen shows a warning message "too fast" at the top; the warning message disappears once the walking speed falls below 40 paces per three minutes.

Figure 7. User interface of i-m-Walk, showing three components: the warning message, footstep awareness, and walking speed.
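The toe/heel threshold rule of Sec. 4.2 and the speed limit of Sec. 4.3.2 can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the sample layout, the threshold value, and all function names are assumptions.

```python
# Sketch of the threshold rule from Sec. 4.2: each shoe is split into a
# toe part (mean of the two bunion sensors) and a heel part; a part
# "rises" when its value drops below the threshold, and "lands" when it
# goes back above it. Assumed layout, rows arriving at 30 Hz:
# [L_bunion, L_tailor, L_heel, R_bunion, R_tailor, R_heel]
def detect_events(samples, threshold=2.0, rate_hz=30):
    events = []  # (time_in_seconds, part_name, 'rise' or 'land')
    prev = None
    for i, s in enumerate(samples):
        parts = {
            'L_toe': (s[0] + s[1]) / 2.0, 'L_heel': s[2],
            'R_toe': (s[3] + s[4]) / 2.0, 'R_heel': s[5],
        }
        cur = {name: value > threshold for name, value in parts.items()}
        if prev is not None:
            for name in cur:
                if prev[name] and not cur[name]:
                    events.append((i / rate_hz, name, 'rise'))
                elif not prev[name] and cur[name]:
                    events.append((i / rate_hz, name, 'land'))
        prev = cur
    return events

# The "too fast" warning of Sec. 4.3.2: more than 40 paces per three minutes.
def too_fast(paces_per_three_minutes, limit=40):
    return paces_per_three_minutes > limit
```

A gait cycle then runs from one foot's heel-rise event to the other foot's heel-rise, and counting those cycles over a three-minute window gives the speedometer value.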
6. DISCUSSION

The aim of this section is to summarize, analyze and discuss the results of this study and to give guidelines for the future development of applications.

6.1 User Interface

The user interface of i-m-Walk provides pace information, including walking speed, wrong paces, and the center of the feet. The results on walking speed showed that i-m-Walk could significantly help beginners decrease their walking speed during walking meditation. One comment from the experimental group:

User E6 on day 3: "I always walked fast before, but when I saw the dashboard and the warning message 'too fast,' it was helpful to remind me to slow down the walking speed."

We list two design principles of the user interface: (a) we used the form of a dashboard to represent the walking speed; the value is easy to watch, and the user can notice the change of walking speed while slowing down or speeding up; (b) i-m-Walk provided an additional alarm mechanism, the warning message "too fast", which can remind the user when distracted. The results on wrong paces showed that i-m-Walk could effectively reduce wrong paces for beginners during walking meditation. One of the participants from the experimental group said:

User E1 on day 2: "When I saw the color of the block on the screen change from green to red, I knew that I had made a wrong pace. Then I would deliberately concentrate on my pace for the next footstep."

6.2 Human Perception

Human beings receive messages through five modalities: vision, sound, smell, taste and touch. The most used in the field of human-computer interaction are the visual and auditory modalities. There was a comment from an experimental participant:

User E3 on day 2: "If I could listen to my pace during walking meditation, I would not need to hold the smartphone."

In cross-modal research, the visual modality is generally considered superior to the auditory modality in the spatial domain. In our case, we need to show the footstep phases accurately, and also to show walking speed and wrong paces at the same time; therefore, we selected visual feedback as the user interface. The advantage was that users could decide whether or not to watch the information, but the shortcoming was that users failed to receive the information when they did not look at it. It is therefore possible to provide additional interaction methods, such as tactile and acoustic feedback, to remind users.

On the other hand, the mechanisms of multimedia feedback might attract the user's attention in some cases, and too many inappropriate or redundant events might disturb users. In our system, we provided visual feedback throughout walking meditation because we did not know whether the user needed the guidance or not, but we informed participants that they could decide not to look at the visual feedback once they were well aware of their pace. In this way, interference during use could be minimized. The questionnaire showed that the participants felt no interference while using i-m-Walk and found the system helpful.

6.3 Beginners vs. Masters

In recent years, the concept of "slow technology" has been applied in many mediated systems. Its design philosophy is that we should use slowness in learning, understanding and presence to give people time to think and reflect. In our case, walking meditation is one embodiment of slow technology. There are two main parts in walking meditation: the inside condition and the outside condition. The inside condition refers to the meditation of the mind, and the outside condition refers to the meditation of the walking posture. All participants in our experiment were beginners, because we focused on the training of the outside condition, the walking posture. The difference between beginners and masters in walking meditation is that beginners are not familiar with walking meditation and need to pay more attention to the control of pace, whereas masters are familiar with it and can focus on the meditation of aligning the inside and outside of the body. Walking meditation is a way to align the feelings inside and outside of the body, and the beginner should become familiar with the walking posture before spiritual development. In this paper, the goal of our experiment is to evaluate the learning effects of the i-m-Walk system. The experimental results showed that the participants of the experimental group could slow down their walking speed and decrease their wrong paces after five days of training. Six participants in the experimental group felt that the session on day four was shorter than on the first day, although the session length was the same; there was no such comment from the participants in the control group. The results showed that i-m-Walk could help users train the walking posture of walking meditation.

6.4 Reaction Time

Reaction time is an important issue in human-computer interaction design. If the reaction time is too long, users cannot control the system well and cannot easily perceive the interaction. According to our observations, the delay time of i-m-Walk is 0.2 second. However, the delay does not affect users, because the application in this experiment does not need a fast reaction time: the average pace duration was 10.9 seconds in the experimental group on day five. The results of the questionnaires also showed that participants felt the visual feedback could reflect the walking status immediately. Nevertheless, the somatosensory sense of one's feet is the most intuitive, and i-m-Walk should only provide assistance for beginners when they need it.
Object of Interest Detection Using Edge Contrast Analysis

Ding-Horng Chen and FangDe Yao
Department of Computer Science and Information Engineering, Southern Taiwan University, Yong Kang City, Tainan County
chendh@mail.stut.edu.tw, m97g0102@webmail.stut.edu.tw

Abstract—This study presents a novel method to detect the focused object of interest (OOI) in a defocused, low depth-of-field (DOF) image. The proposed method divides into three steps. First, we utilize three different operators (saturation contrast, morphological functions, and color gradient) to compute the objects' edges. Second, hill-climbing color segmentation is used to search the color distribution of the image. Finally, we combine the edge detection and color segmentation to detect the object of interest in the image. The proposed method utilizes both edge analysis and color segmentation, taking advantage of the two feature spaces. The experimental results show that our method works satisfactorily on many challenging image data.

Keywords: Object of Interest (OOI); Depth of Field (DOF); Object Detection; Edge Detection; Blur Detection.

I. INTRODUCTION

The market for digital single-lens reflex (DSLR) cameras has expanded tremendously as prices have become more acceptable. For a professional photographer, the DSLR has the advantages of excellent image quality, interchangeable lenses, and an accurate, large, and bright optical viewfinder. The DSLR camera has a bigger sensor unit that can create more obvious depth-of-field (DOF) effects, which is the most significant feature of DSLRs. According to market reports [1][2][3], the DSLR market share will grow very fast in the near future. Table 1 shows the growth rate of the digital camera market.

Table 1. Market Estimate of the Digital Cameras (Unit: Million US$)
  Year          2006   2011   Growth Rate
  World Market  81     82.2   108%
  DSLR          4.8    8.3    173%
  DSC           76.8   79.9   104%

The extraction of a local region of interest in an image is one of the most important research topics in computer vision and image processing [4][5]. The detection of the object of interest (OOI) in low-DOF images can be applied in many fields, such as content-based image retrieval. Measuring the sharpness or blurriness of edges in an image is also important for many image processing applications: for instance, checking the focus of a camera lens, identifying shadows (whose edges are often less sharp than object edges), separating variations in illumination from the reflectance of objects (also known as intrinsic image extraction), and separating in-focus (foreground) from out-of-focus (background) areas in an image.

The DOF is the portion of a scene that appears acceptably sharp in the image. Although a lens can precisely focus at only one specific distance, the sharpness decreases gradually on each side of the focused distance. A low (small) DOF can be very effective at emphasizing the photo subject. The OOI is thus obtained via the photographic technique of using a low DOF to separate the interesting object in a photo. Fig. 1 shows a typical OOI image with low DOF.

Figure 1. A typical OOI image

The OOI detection problem can be viewed as an extension of the blur detection problem. In Chung's method [6], the x- and y-direction derivatives and a gradient map are computed to measure the blur level, obtaining the edge points via a weighted average of the standard deviation of the magnitude profile around each edge point.

Renting Liu et al. [7] proposed a method that can determine the blur type of an image: using pre-defined blur features, the method trains a blur classifier to discriminate different regions. This classifier is based on features such as local power spectrum slope, gradient histogram span, and maximum saturation. They then detected the blurry regions, which are measured by local autocorrelation congruency to recognize the blur types.

The above methods determine the blur level and regions, but they still cannot extract the OOI from an image. If the background is complex or the edges are blurred, the described methods are unable to find the OOI [6][7]. N. Santh and K. Ramar have proposed two approaches, i.e. the edge-based
and region-based approaches, to segment low-DOF images [8]. They transformed the low-DOF pixels into an appropriate feature space called a higher-order statistics (HOS) map. The OOI is then extracted from the low-DOF image by region merging and thresholding as the final decision.

But if the object's shape is complex or its edges are not fully connected, it is still hard to find the object. The OOI may not be a compact region with a perfectly sharp boundary, so one cannot simply use edge detection to find a complete object in a low-DOF image. In some cases, such as macro or close-up photography, the depth of field is very low and some parts of the subject may be out of focus, causing a partial blur on the subject. To acquire a satisfactory result in OOI detection, not only the blurred part but also the sharp part needs to be taken into consideration. Finding a good OOI subject in the image is the challenge in this issue.

II. THE PROPOSED METHOD

In this paper, we propose a novel method to extract the OOI from a low-DOF image. The proposed algorithm contains three steps. First, we find the object boundaries based on computing the sharpness of edges. Second, hill-climbing color segmentation is used to find the color distribution and its edges. Finally, we integrate the above results to get the OOI location.

The first step is divided into three parts and is illustrated in Fig. 2. We calculate the feature parameters, including the maximum saturation, the color gradient, and the local range image. The image is converted into the CIE Lab color space and edge detection is performed. For noise reduction, we use a median filter to reduce fragmentary values. Then all the feature images are multiplied together to extract the exact position of the OOI.

Figure 2. Edge detection flowchart

A. Saturation Edge Power Mean

Fig. 3 shows the original image in which we want to detect the OOI. The background is out of focus and thus smoother than the object we want to detect; the color saturation and edge sharpness are the major differences between the object and the background. Color information is very important in blur detection. It is observed that blurred pixels tend to have less vivid colors than un-blurred pixels because of the smoothing effect of the blurring process: focused (un-blurred) objects are likely to have more vivid colors than blurred parts, and the maximum saturation value in blurred regions is expected to be smaller than in un-blurred regions. Based on this observation, we use the following equation to compute pixel saturation:

  S_p = 1 − 3 · min(R, G, B) / (R + G + B),    (1)

where S_p is the saturation of the pixel. Equation (1) transforms the original image into the saturation feature space to find the parts of the image with higher saturation.

In low-DOF images, the saturation does not change dramatically in the background, which is smoother; on the contrary, the color saturation changes sharply along the edges. Therefore, we define the edge contrast C_A, computed in a 3x3 window, as follows:

  C_A = (1/N) · Σ_{n ∈ M, n ≠ A} (n − A)²,    (2)

where M is the 3x3 window, A is the saturation value at the window center, n ranges over the neighboring values in the window, and N is the number of neighbors.

Equation (2) calculates the saturation intensity. Here we show the result images to demonstrate the processing steps: Fig. 4 is the resulting saturation image, and Fig. 5 shows the result after performing the edge contrast computation.

Figure 3. Original image

Figure 4. Saturation image
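Equations (1) and (2) translate directly into array code. The sketch below is illustrative only: it assumes an H×W×3 RGB array, and for brevity it lets the 3x3 window wrap around image borders.

```python
import numpy as np

def saturation(img_rgb):
    """Eq. (1): S_p = 1 - 3*min(R,G,B)/(R+G+B), computed per pixel."""
    rgb = np.asarray(img_rgb, dtype=np.float64)
    total = rgb.sum(axis=2)
    return 1.0 - 3.0 * rgb.min(axis=2) / np.maximum(total, 1e-9)

def edge_contrast(sat):
    """Eq. (2): mean squared difference between each pixel and its
    eight neighbours in a 3x3 window (borders wrap, a simplification)."""
    out = np.zeros_like(sat)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == dx == 0:
                continue
            neighbour = np.roll(np.roll(sat, dy, axis=0), dx, axis=1)
            out += (neighbour - sat) ** 2
    return out / 8.0
```

A neutral gray pixel has saturation 0 and a pure primary color has saturation 1, so the edge contrast is largest where vivid (focused) regions meet washed-out (blurred) ones.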
B. Color Gradient

The gradient of a scalar field is a vector field that points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is that greatest rate of change. It is very useful in typical edge detection problems. To calculate the gradient of the color intensity, we first use the Sobel operator to separate vertical and horizontal edges:

    Gx = [ -1  0  1 ]           Gy = [ -1 -2 -1 ]
         [ -2  0  2 ] * A,           [  0  0  0 ] * A,
         [ -1  0  1 ]                [  1  2  1 ]

    G = sqrt(Gx^2 + Gy^2),                                       (3)
    theta = arctan(Gy / Gx).                                     (4)

Equations (3) and (4) show the traditional way to compute the gradient. Here theta is the edge angle, which is 0 for a vertical edge that is darker on its left side. We modify the above equations to be more accurate in our case, with the following equations:

    Gx  = Rx^2 + Gx^2 + Bx^2,                                    (5)
    Gy  = Ry^2 + Gy^2 + By^2,                                    (6)
    Gxy = Rx*Ry + Gx*Gy + Bx*By,                                 (7)
    A   = 0.5 * arctan( 2*Gxy / (Gx - Gy) ),                     (8)
    G1  = 0.5 * [ (Gx + Gy) + (Gx - Gy)*cos(2A) + 2*Gxy*sin(2A) ],   (9)

where Rx, Gx, Bx are the RGB layers filtered by the horizontal Sobel operator, and Ry, Gy, By are the RGB layers filtered by the vertical Sobel operator. A is the gradient angle, and G1 is the color gradient of the image along the angle A. The definition of G2 is quite similar to G1, but the term A is replaced by

    A + pi/2,                                                    (10)

so G2 is computed by

    G2 = 0.5 * [ (Gx + Gy) + (Gx - Gy)*cos(2(A + pi/2)) + 2*Gxy*sin(2(A + pi/2)) ].   (11)

The value of the color gradient CG is obtained by choosing the maximum of G1 and G2, i.e.,

    CG = max(G1, G2).                                            (12)

This CG value measures the color intensity change along the edge gradient; CG increases when the color at an edge point changes dramatically. Fig. 6 shows the result after the color gradient computation.

Figure 5. Saturation edge image
Figure 6. Color vector image

C. Local Range Image

In this study, we adopt the morphological functions DILATION and EROSION to find the local maximum and minimum values in a specified neighborhood. First, we convert the original image from the RGB color space to the CIE Lab color space. Because the luminance of an object is not always flat, we compute the local range value for the A and B layers without the L (luminance) component; suppressing the luminance diversification on the object prevents this situation. The dilation, erosion, and local range computation are defined as the following equations:

    Dilation:  A (+) B = { z | (B^)z ∩ A != Ø },                 (13)
    Erosion:   A (-) B = { z | (B)z ⊆ A },                       (14)
    Local Range Image = (A (+) B) - (A (-) B).                   (15)

Fig. 7 shows the result after the local range operation.

Figure 7. A local range image
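The modified colour gradient of Eqs. (5)-(12) in subsection B (a Di Zenzo-style multi-channel gradient) can be sketched as follows; the helper names, the zero-padded borders, and the use of arctan2 to avoid division by zero are assumptions of this sketch:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def filter3(channel, kernel):
    # Plain 3x3 correlation, zero-padded back to the input size (assumption).
    H, W = channel.shape
    out = np.zeros((H, W))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            out[i, j] = (channel[i - 1:i + 2, j - 1:j + 2] * kernel).sum()
    return out

def color_gradient(img):
    """Eqs. (5)-(12): CG = max(G1, G2) of the multi-channel gradient."""
    chans = [img[..., c].astype(float) for c in range(3)]
    dx = [filter3(c, SOBEL_X) for c in chans]
    dy = [filter3(c, SOBEL_Y) for c in chans]
    gx = sum(d * d for d in dx)                  # Eq. (5)
    gy = sum(d * d for d in dy)                  # Eq. (6)
    gxy = sum(a * b for a, b in zip(dx, dy))     # Eq. (7)
    A = 0.5 * np.arctan2(2 * gxy, gx - gy)       # Eq. (8)
    def g(theta):                                # Eqs. (9) and (11)
        return 0.5 * ((gx + gy) + (gx - gy) * np.cos(2 * theta)
                      + 2 * gxy * np.sin(2 * theta))
    return np.maximum(g(A), g(A + np.pi / 2))    # Eq. (12)
```

On a step edge shared by all three channels the response reduces to the squared Sobel magnitude summed over channels, while flat regions give exactly zero.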
D. Median Filter

The median filter is a nonlinear digital filtering technique that is often used to remove noise; such noise reduction is a typical pre-processing step to improve the results of later processing. The edge detection process leaves some fragmentary values, and if the values are low or the fragmentary edges are not connected, they can be regarded as noise. Therefore, we adopt the median filter to remove the fragmentary pixels.

E. Hill Climbing Color Segmentation

Edge detection can find most edges of the OOI, but the boundaries are usually not completely closed, and the morphological operators cannot link all the disconnected edges into a complete boundary. Most OOI edges can be detected after the previous procedures, but some edges are still unconnected. To make the OOI boundary a regular closure, we adopt color segmentation to connect the isolated edges.

The color segmentation method is illustrated in Fig. 8. This method is based on T. Ohashi et al. [10] and R. Achanta et al. [11]. The hill-climbing algorithm detects local maxima of clusters in the global three-dimensional color histogram of an image. Then the algorithm associates the pixels of the image with the detected local maxima; as a result, several visually coherent segments are generated.

Figure 8. Color segmentation and edge detection flow chart

The detailed algorithm is described as follows:
1. Convert the image to the CIE Lab color space.
2. Build the CIE Lab color histogram.
3. Search the color histogram for local maximum values.
4. Use the local-maximum colors as the initial centroids of k-means classification.
5. Re-train the classifier until the cluster centers are stable.
6. Apply k-means clustering and remap the original pixels to each cluster.

Fig. 9 shows the result of color segmentation.

Figure 9. A color segmentation result

F. Edge Combination

The OOI edges are obtained by two methods. First, we use the morphological close operation, which is a dilation followed by an erosion, to connect the isolated points. The close operation makes the gaps between unconnected edges smaller and the outer edges smoother. Second, we apply edge detection to the color segmentation map to find the color distribution, and merge it with the previous edge detection result.

After the above procedures, we have most of the edge clues, and we integrate these clues into a complete OOI boundary. Let the result of the boundary detection be IE and the result of the color segmentation be IC. The edge is extended by counting the pixels in IC and the neighboring points of IE. To determine whether a pixel at the end of IE is to be extended or not, we assign an "edge extension" value P at point (i, j) as follows:

    P(i, j) = sum over n = -1..1, m = -1..1 of IE(i + n, j + m),    (16)

where the sum slides over a 3x3 window and IE is the pre-edge-detection image value of the neighborhood in this window. Equation (16) removes the unnecessary pixels and closes the OOI mask by extending the boundaries. The value is shown in Fig. 10. The image that merges the edge extension result with the color segmentation edge is shown in Fig. 11.

Figure 10. (a) The result before the edge extension (b) The result after the edge extension
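The peak-finding part of the hill-climbing procedure in subsection E (steps 2-3 of the listed algorithm) might look like the sketch below; it quantises RGB rather than CIE Lab for brevity, the bin count and all names are hypothetical, and real images would need the k-means refinement of steps 4-6 afterwards:

```python
import numpy as np

def histogram_peaks(img, bins=8):
    """Steps 2-3 above: build a quantised 3-D colour histogram and keep
    the bins that are local maxima over their 3x3x3 neighbourhood."""
    idx = (img.reshape(-1, 3).astype(int) * bins) // 256
    hist = np.zeros((bins, bins, bins))
    for a, b, c in idx:
        hist[a, b, c] += 1
    peaks = []
    for a, b, c in zip(*np.nonzero(hist)):
        neigh = hist[max(a - 1, 0):a + 2, max(b - 1, 0):b + 2,
                     max(c - 1, 0):c + 2]
        if hist[a, b, c] == neigh.max():     # local maximum of the histogram
            peaks.append((int(a), int(b), int(c)))
    return peaks
```

Each returned peak is a dominant colour mode and would seed one k-means cluster in step 4.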
We integrate the above edge pieces into a complete OOI mask: if the boundaries are closed, we add the enclosed region to the final OOI mask. The edge combination of the final OOI mask is shown in Fig. 12.

Figure 11. The result image that merges the edge extension image and the color segmentation image
Figure 12. Edge combination result

III. THE EXPERIMENTAL RESULTS

The aperture stop of a photographic lens, together with the shutter speed, adjusts the amount of light reaching the film or image sensor. In this study, we use a Pentax *ist DL digital camera and a prime lens "Helios M44-2 60mm F2.0" to perform the experiment. We choose a prime lens as our test lens in order to reduce the unstable parameters. To ensure that all of the exposures are the same, we control the shutter speed and aperture parameters manually.

To test the proposed method, we randomly select 5 test photos from a 50-photo album. They are all prepared under the same conditions and camera parameters. Fig. 13 shows the proposed OOI detection results for different aperture values.

Figure 13. Five examples with different aperture values

The DOF is smaller as the aperture value gets lower, and the OOI is blurred as well. A higher aperture value increases the edge sharpness, which makes it harder to separate the background from the OOI. Figs. 14 to 17 show the OOI detection results. By experiment, the object boundaries become irregular as the aperture value gets higher; in our experiments, the proper aperture value for the best segmentation results is about f/2.8 to f/5.6.

Figure 14. The experimental results (sample 1)
Figure 15. The experimental results (sample 2)
Figure 16. The experimental results (sample 3)
Figure 17. The experimental results (sample 4)

A convincing definition of a "good OOI" is hard to give; it depends on human cognition. In this paper, we follow N. Santh and K. Ramar's experiment [8] to verify the proposed method. First, five user-defined OOI boundaries are drawn; then we compare them with the boundaries detected by the proposed method. Equation (17) computes the overlapped region between the reference and the detected OOI boundaries, i.e.,

    Accuracy = 1 - [ sum over (x,y) of | I_est(x,y) - I_ref(x,y) | ] / [ sum over (x,y) of I_ref(x,y) ],   (17)

where I_est is the OOI mask from the proposed method and I_ref is the mask drawn by the user as the ground truth. Fig. 18(a) shows the user-drawn OOI boundaries and (b) shows the detected OOI boundaries.

Figure 18. Comparison results: (a) User-drawn OOI boundary (b) The proposed method result

The detection accuracy decreases when the OOI has complex texture such as shirts, cloth, or artificial structures, and is higher when the background is simple. However, even if the image is not correctly focused on the target, the proposed method can still find a complete object. The correctness becomes lower if there is more than one OOI in an image, as shown by sample 2 in Fig. 18. Table 2 shows the accuracy computed by Equation (17).

Table 2. The comparison result between the reference images and the proposed method

    Sample     1      2      3      4     5
    Accuracy   98.2%  94.6%  96.1%  98%   91%
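Equation (17) on binary masks is a one-liner; representing the masks as flat 0/1 lists is an assumption of this sketch:

```python
def boundary_accuracy(est, ref):
    """Eq. (17): 1 - sum |I_est - I_ref| / sum I_ref for binary masks."""
    diff = sum(abs(e - r) for e, r in zip(est, ref))
    return 1.0 - diff / sum(ref)
```

A perfect match scores 1.0, and each mislabelled pixel (whether a miss or a false alarm) subtracts 1/|I_ref| from the score, so the measure penalises over-segmentation as well as under-segmentation.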
IV. CONCLUSION

In this paper we propose a method to extract OOI objects from a low-DOF image based on edge and color information. The method needs no user-defined parameters such as the shapes and positions of objects, nor extra scene information. We integrate the color saturation, morphological functions, and color gradient to detect a rough OOI; finally, we utilize color segmentation to make the OOI boundaries closed and compact. Our method takes advantage of both edge detection and color segmentation.

The experiments show that our method works satisfactorily on many different kinds of image data. The method can be applied as a pre-processing step in image processing or computer vision tasks such as object indexing or content-based image retrieval.

REFERENCES
[1] InfoTrends, "The Consumer Digital SLR Marketplace: Identifying & Profiling Emerging Segments," Digital Photography Trends, September 2008. http://www.capv.com/public/Content/Multiclients/DSLR.html
[2] Dudubird, "Chinese Photographic Equipment Industry Market Research Report," December 2009. http://www.cnmarketdata.com/Article_84/2009127175051902-1.html
[3] September 2007. https://www.fuji-keizai.co.jp/market/06074.html
[4] K. Idrissi, G. Lavoué, J. Ricard, and A. Baskurt, "Object of interest-based visual navigation, retrieval, and semantic content identification system," Computer Vision and Image Understanding, vol. 94, 2004, pp. 271-294.
[5] J. Z. Wang, J. Li, R. M. Gray, and G. Wiederhold, "Unsupervised Multiresolution Segmentation for Images with Low Depth of Field," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, January 2001, pp. 85-90.
[6] Y.-C. Chung, J.-M. Wang, R. R. Bailey, and S.-W. Chen, "A Non-Parametric Blur Measure Based on Edge Analysis for Image Processing Applications," IEEE Conference on Cybernetics and Intelligent Systems, Singapore, 1-3 December 2004.
[7] R. Liu, Z. Li, and J. Jia, "Image Partial Blur Detection and Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1-8.
[8] N. Santh and K. Ramar, "Image Segmentation Using Morphological Filters and Region Merging," Asian Journal of Information Technology, vol. 6(3), 2007, pp. 274-279.
[9] D. Kornack and P. Rakic, "Cell Proliferation without Neurogenesis in Adult Primate Neocortex," Science, vol. 294, Dec. 2001, pp. 2127-2130.
[10] T. Ohashi, Z. Aghbari, and A. Makinouchi, "Hill-climbing Algorithm for Efficient Color-based Image Segmentation," IASTED International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA 2003), June 2003, p. 200.
[11] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, "Salient Region Detection and Segmentation," International Conference on Computer Vision Systems (ICVS 2008), May 2008, pp. 66-75.
[12] M. Rufli, D. Scaramuzza, and R. Siegwart, "Automatic Detection of Checkerboards on Blurred and Distorted Images," International Conference on Intelligent Robots and Systems (IROS 2008), September 2008, pp. 22-26.
[13] H. Tong, M. Li, H. Zhang, and C. Zhang, "Blur Detection for Digital Images Using Wavelet Transform," International Conference on Multimedia and Expo 2004, pp. 17-20.
[14] G. Cao, Y. Zhao, and R. Ni, "Edge-based Blur Metric for Tamper Detection," Journal of Information Hiding and Multimedia Signal Processing, vol. 1, no. 1, January 2009, pp. 20-27.
[15] R.-B. Gan and J.-G. Wang, "Minimum Total Variation Autofocus Algorithm for SAR Imaging," Journal of Electronics & Information Technology, vol. 29, no. 1, January 2007, pp. 12-14.
[16] R.-H. Xiang and R.-S. Wang, "A Range Image Segmentation Algorithm Based on Gaussian Mixture Model," Journal of Software, vol. 14, no. 7, 2003, pp. 1250-1257.
Efficient Multi-Layer Background Model on Complex Environment for Foreground Object Detection

1 Wen-Kai Tsai (蔡文凱), 2 Chung-Chi Lin (林正基), 1 Ming-Hwa Sheu (許明華), 1 Siang-Min Siao (蕭翔民), 1 Kai-Min Lin (林凱名)
1 Graduate School of Engineering Science and Technology, National Yunlin University of Science & Technology
2 Department of Computer Science, Tung Hai University
E-mail: g9610804@yuntech.edu.tw

Abstract: This paper proposes the establishment of a multi-layer background model, which can be used in complex environment scenes. In general, a surveillance system focuses on detecting the moving object, but real scenes contain many kinds of moving background, such as dynamic leaves, falling rain, etc. In order to detect objects in such moving-background environments, we use an exponential distribution function to update the background model, and combine background subtraction with homogeneous region analysis to find the foreground object. The system uses the TI TMS320DM6446 DaVinci development platform, and it achieves 20 frames per second for benchmark images of size 160x120. From the experimental results, our approach has better performance in terms of detection accuracy and similarity measure when compared with other modeling methods.

Keywords: background modeling; object detection

I. INTRODUCTION

Foreground object detection is a very important technology in image surveillance systems, since system performance highly depends on whether the foreground object detection is right or not. Furthermore, the foreground object must be detected accurately and quickly, so that follow-up work such as tracking and identification can be performed correctly and reliably. Conceptually, foreground object detection is mostly based on background subtraction. This approach seems simple and has low computational cost; however, it is difficult to obtain good results without a reliable background model. To manage complex background scenarios, how to construct a suitable background model has become the most crucial skill.

Generally speaking, most algorithms regard only non-moving objects as background, but in real environments many moving objects may also belong to a part of the background; we call these the moving background, such as waving trees. However, it is a difficult task to construct the moving background model. The general practice is to use algorithms to conduct the learning and establishment of the background model; after building up the model, the system starts to carry out foreground object detection. Therefore, in recent years a number of background models have been proposed. The most popular approach is the Mixture of Gaussians Model (MoG) [1-2]. Although MoG has the advantage of updating its model parameters automatically, it takes a very long period of time to learn the background model, and it also faces strenuous limitations on memory space and processing speed in embedded systems. Next, the Codebook background model [3] establishes a rational and adaptive capability which improves the detection accuracy under moving background and lighting changes; however, the Codebook model still requires high computational cost and large memory space for saving background data. Subsequently, a Gaussian model [4] was presented that updates the threshold value for each pixel, but its disadvantages include a large amount of computation and a lot of memory space used to record the background model. In order to reduce the memory usage, [5] and [6] calculate a weight value for each pixel to establish the background model; according to the weight value, the updating mechanism determines whether the pixel is replaced or not, so a smaller amount of memory suffices to model the moving background.

The above works all use a multi-layer background model to store background information, but this is still inadequate to deal with moving background issues: they need to take into account the dependency between adjacent pixels to inspect whether the neighboring region possesses homogeneous characteristics or not. This paper proposes an efficient 4-layer background model and homogeneous region analysis to characterize the background pixels.

II. BUILDING MULTI-LAYER BACKGROUND MODELS

First, the input image pixel x_{i,j}(t) consists of R, G, and B elements, as shown in Eq. (1):

    x_{i,j}(t) = ( x^R_{i,j}(t), x^G_{i,j}(t), x^B_{i,j}(t) ).   (1)

The pixels of the moving background inevitably appear in some regions repeatedly, so we have to learn these appearance behaviors when constructing the multi-layer background model. The first-layer background model (BGM1) stores the first input frame. For the 2nd frame, we record the difference between the 1st and 2nd frames in the second-layer background model (BGM2). Similarly, the difference of the consecutive frames is saved for the third layer (BGM3), and so on. We use the first 4 frames and their differences as the initial background model. Besides, Eq. (2) records the number of occurrences of each pixel over the learning frames:

    MATCH^u_{i,j}(t) = { MATCH^u_{i,j}(t-1),      if |x_{i,j}(t) - BGM^u_{i,j}(t)| > th
                       { MATCH^u_{i,j}(t-1) + 1,  else,                                  (2)

where u = 1...4 and th is the threshold value for comparing similarity. From the 5th learning frame on, we calculate all the pixel repetition numbers of occurrence in each layer of the background model, and Eq. (3) gives the frequency of occurrence:

    λ^u_{i,j} = MATCH^u_{i,j}(t) / N,                            (3)

where N is the total number of learning frames. A larger λ^u_{i,j} indicates that the corresponding pixel had a higher occurrence during the learning period and must be preserved within the 4 layers; conversely, a pixel with lower occurrence will be removed.

III. BACKGROUND UPDATE

After building up the multi-layer background model, we must update the content of BGM_{i,j} over time to replace inadequate background information, so the background update mechanism is very important for the subsequent object detection. The proposed background update method uses the exponential distribution model to calculate a weight value for each pixel, as shown in Eq. (4); it captures the repetition condition of occurrence of each pixel in the background model. A lower weight expresses that the corresponding pixel has not appeared for a long time and should be replaced by a higher-weight input pixel:

    weight^u_{i,j}(t) = λ^u_{i,j} * e^(-λ^u_{i,j} * t),  t > 0,  (4)

where t is the number of non-match frames. Fig. 1 shows the distribution of the weight values. If a pixel in the background model is not matched for a period of time, its weight decreases exponentially; if the weight value falls below a threshold, the background pixel is replaced based on Eq. (5):

    BGM^u_{i,j}(t) = { remove,                                         if weight^u_{i,j}(t) < Te
                     { α * BGM^u_{i,j}(t) + (1 - α) * BGM^u_{i,j}(t-1), else,                    (5)

where Te is a threshold for the weight and α is a constant with α < 1.

Figure 1. Exponential distribution of weight

Based on the above approach, Fig. 2 demonstrates a 4-layer background constructed after learning 100 frames.

Figure 2. Multi-layer Background Model: (a) BGM1 (b) BGM2 (c) BGM3 (d) BGM4

IV. OBJECT DETECTION

After establishing the accurate background model, background subtraction can be used to obtain the foreground object. From practical observation, the moving background has the homogeneous characteristic.
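Eqs. (2)-(5) can be sketched for a single pixel location, using scalar intensities instead of the RGB triples of Eq. (1); all function names are hypothetical:

```python
import math

def learn_match_frequency(pixels, layers, th=10):
    """Eqs. (2)-(3): count how often the input value stays within th of
    each stored layer value, normalised by the number of frames N."""
    match = [0] * len(layers)
    for x in pixels:
        for u, bgm in enumerate(layers):
            if abs(x - bgm) <= th:
                match[u] += 1
    return [m / len(pixels) for m in match]       # one lambda per layer

def layer_weight(lam, t):
    """Eq. (4): weight = lambda * exp(-lambda * t), t = frames since the
    layer last matched; Eq. (5) removes the layer once this weight drops
    below the threshold Te."""
    return lam * math.exp(-lam * t)
```

Because the weight decays monotonically in t, the layers that matched rarely during learning (small lambda) fall below Te first and are the first to be replaced.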
Therefore, the object detection method carries out the subtraction against both the 4-layer background and its homogeneous regions. As shown in Fig. 2, the information stored in the background model is the scene of the moving background, which has the important feature of homogeneity. In Eqs. (6) and (7), TI(t) is the total matching index between the input pixel and the homogeneous regions of the 4-layer background, and D^u_{i+k,j+p} is an individual matching index between the input pixel and one background datum BGM^u_{i+k,j+p}. The homogeneous region is defined as (2r+1) x (2r+1) around the background data at location (i, j):

    TI(t) = Σ_{u=1}^{4} Σ_{k=-r}^{r} Σ_{p=-r}^{r} D^u_{i+k,j+p}(t),                      (6)

    D^u_{i+k,j+p}(t) = { 1, if |x_{i,j}(t) - BGM^u_{i+k,j+p}(t)| <= th                   (7)
                       { 0, else,

where th is a threshold value that determines whether two values are similar. If TI(t) is greater than a threshold τ, the input x_{i,j}(t) is similar to much of the background information and is not an object pixel. Eq. (8) is used to find the foreground object (FO):

    FO_{i,j}(t) = { 0, if TI(t) >= τ                                                     (8)
                  { 1, else.

When FO_{i,j}(t) = 1, the input pixel belongs to a foreground object; when FO_{i,j}(t) = 0, it belongs to the background.

V. EXPERIMENTAL RESULTS OF PROTOTYPING SYSTEM

Based on our proposed approach, the object detection is implemented on the TI TMS320DM6446 DaVinci platform, as shown in Fig. 3. The input image resolution is 160x120 per frame. On average, our approach processes 20 frames per second when performing object detection on the prototyping platform.

Figure 3. TI TMS320DM6446 DaVinci development kit

Next, the binary-valued foreground object results of the presented method are displayed in Fig. 4. The ground truth, in which the objects are segmented manually from the original image frame, is regarded as the perfect result. It can be seen that our result gives the better object detection. In order to make a fair comparison, we adopt the similarity and total-error-pixels measures of [7] to assess the results of the algorithms. Eq. (9) gives the total number of error pixels and Eq. (10) evaluates the similarity value:

    total error pixels = fn + fp,                                                        (9)

    Similarity = tp / (tp + fn + fp),                                                    (10)

where fp is the total number of false positives, fn is the total number of false negatives, and tp is the total number of true positives. Fig. 5 depicts the number of error pixels for a video sequence; the numbers of error pixels produced by our proposal are lower than those of the other algorithms. Fig. 6 shows the similarity over the video sequence.

Figure 4. Foreground Object Detection Result
Figure 5. Error pixels by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method, frames 240-280)
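A toy sketch of Eqs. (6)-(8) and the similarity measure of Eq. (10), again with scalar background values and hypothetical names; the grids are indexed at their centre for brevity:

```python
def total_matching_index(x, layers, r=1, th=10):
    """Eqs. (6)-(7): count, over all layers, the background entries in
    the (2r+1) x (2r+1) homogeneous region lying within th of x."""
    ti = 0
    for bgm in layers:
        ci, cj = len(bgm) // 2, len(bgm[0]) // 2
        for k in range(-r, r + 1):
            for p in range(-r, r + 1):
                if abs(x - bgm[ci + k][cj + p]) <= th:
                    ti += 1
    return ti

def is_foreground(x, layers, tau=3, r=1, th=10):
    """Eq. (8): FO = 0 (background) if TI >= tau, else 1 (foreground)."""
    return 0 if total_matching_index(x, layers, r, th) >= tau else 1

def similarity(tp, fn, fp):
    """Eq. (10): similarity = tp / (tp + fn + fp)."""
    return tp / (tp + fn + fp)
```

A pixel that resembles any part of the homogeneous neighbourhood in enough layers accumulates a large TI and is suppressed as (moving) background; only pixels that match almost nothing are flagged as foreground.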
Our proposed approach achieves the highest similarity value, i.e., our results are closest to the ground truth.

Figure 6. Similarity by different methods (Wu [2], Chien [5], Tsai [6], and our proposed method, frames 242-260)

VI. CONCLUSION

In this paper, we propose an effective and robust multi-layer background modeling algorithm. Foreground object detection must cope with the problem of moving background, caused by outdoor scenes of fluttering leaves and rain, and indoor scenes of fans, etc. Therefore, we build the moving background into a multi-layer background model by calculating weight values and analyzing the characteristics of regional homogeneity. In this way, our approach is suitable for a variety of scenes. Finally, we evaluate the foreground detection results with the similarity and total-error-pixels measures, and show the benefit of our algorithm through explicit data and graphs.

REFERENCES
[1] C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, 2000.
[2] H. H. P. Wu, J. H. Chang, P. K. Weng, and Y. Y. Wu, "Improved Moving Object Segmentation by Multi-Resolution and Variable Thresholding," Optical Engineering, vol. 45, no. 11, 117003, 2006.
[3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. S. Davis, "Real-Time Foreground-Background Segmentation using Codebook Model," Real-Time Imaging, pp. 172-185, 2005.
[4] H. Wang and D. Suter, "A Consensus-Based Method for Tracking, Modelling Background Scenario and Foreground Appearance," Pattern Recognition, pp. 1091-1105, 2006.
[5] W.-K. Chan and S.-Y. Chien, "Real-Time Memory-Efficient Video Object Segmentation in Dynamic Background with Multi-Background Registration Technique," International Workshop on Multimedia Signal Processing, pp. 219-222, 2002.
[6] W.-K. Tsai, M.-H. Sheu, C.-L. Su, J.-J. Lin, and S.-Y. Tseng, "Image Object Detection and Tracking Implementation for Outdoor Scenes on an Embedded SoC Platform," International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 386-389, September 2009.
[7] L. Maddalena and A. Petrosino, "A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications," IEEE Trans. on Image Processing, vol. 17, no. 7, July 2008.
CLEARER 3D ENVIRONMENT CONSTRUCTION USING IMPROVED DM BASED ON GAZE TECHNOLOGY APPLIED TO AUTONOMOUS LAND VEHICLES

1 Kuei-Chang Yang (楊桂彰), 2 Rong-Chin Lo (駱榮欽)
1,2 Dept. of Electronic Engineering & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei
E-mail: t7418002@ntut.edu.tw

ABSTRACT

In this paper, we propose a gaze approach that sets the binocular cameras at different baseline distances to obtain a better-resolved three-dimensional (3D) environment construction. The method is capable of obtaining more accurate object distances and a clearer environment construction that can be applied to Autonomous Land Vehicle (ALV) navigation. In the study, the ALV is equipped with parallel binocular cameras that simulate human eyes to provide binocular stereo vision. Using the information of binocular stereo vision to build a disparity map (DM), the 3D environment can be reconstructed. Owing to the baseline of the binocular cameras usually being fixed, the DM, shown as an image, only has a good resolution within a specific distance range; that is, only a partial, specific region of the reconstructed 3D environment is clear, and it cannot provide a complete navigation environment. Therefore, the study proposes multiple baselines to obtain clearer DMs for the near, middle, and far distances of the environment. Several experimental results showing the feasibility of the proposed approach are also included.

Keywords: binocular stereo vision; disparity map

1. INTRODUCTION

In recent years, machine vision has been the most important sensing system for intelligent robots. The vision image captured from a camera carries a large amount of object information, including shape, color, shading, shadow, etc. Other sensors can obtain only one kind of measurement information, such as ultrasonic sensors [1], infrared sensors [2], laser sensors [3], etc. In other words, the visual sensor can acquire a lot of environmental information, but this information is mixed together; therefore, various image processing techniques are necessary for separating it piece by piece to obtain meaningful information. In many countries, a lot of manpower and resources are devoted to binocular stereo vision [4] research. As applied to robots and ALVs, the advantage of binocular stereo vision is to obtain the depth of the environment, and this depth can be used for obstacle avoidance, environment learning, and path planning. In such applications, the disparity is used in the vision system based on image recognition and image-signal analysis. Besides, the two cameras need to be set in parallel and fixed accurately, and this disparity method still requires a high-speed computer to store and analyze images. However, setting the binocular cameras of an ALV with a fixed baseline can only obtain a good DM of the environment images in one specific region.

In this paper, we propose an approach that sets the binocular cameras at different baselines to obtain the depths of DMs corresponding to different measuring distances. In the future, this method can obtain the environment image from near to far range, such that it will help the ALV in path planning.

2. STEREO VISION

In recent years, because the computing speed of computers has become much faster and their hardware performance has also improved, a lot of research relating to computer vision has been proposed for image processing. A computer vision system with depth-sensing ability is called a stereo vision system, and stereo vision is at the core of computer vision technologies. However, one camera can only obtain two-dimensional (2D) information of the environment image, which is unable to reconstruct the 3D coordinates. To overcome the shortage of one camera, two cameras are used in this study to calculate 3D coordinates. The details are described in the following sub-sections.
2.1. Projective Transform

The projective transform model of one camera projects the real objects or scene onto the image plane. As shown in Fig. 1, assume that the coordinate of object P in the real world is (X, Y, Z) relative to the origin (0, 0, 0) at the camera center. After the transform, the coordinate of P' (the projection of P on the image plane) is (x, y, f) relative to the image origin (0, 0, f), where f is the distance from the camera center to the image plane. Using similar-triangle geometry to find the relationship between the actual object P and its projected point P' on the image plane, the relationship between the two points is as follows:

    X = x * Z / f,                                               (1)
    Y = y * Z / f.                                               (2)

Therefore, even if P'(x, y, f) captured from the image plane is known, we still cannot calculate the depth Z of point P and determine its coordinate P(X, Y, Z) according to (1) and (2), unless we know one of X, Y (height), or Z (depth).

Figure 1. Perspective projection of one camera.

2.2. Image Depth

From the previous discussion, we know that it is impossible to accurately calculate the depth or height of an object or scene from the information of one camera, even with many known conditions in advance. Therefore, several studies use the overlapping views of two [5] or more cameras to calculate the depth or height of the object or scene, as shown in Fig. 2.

Figure 2. The relationship between depth and disparity for two cameras.

Nowadays the cost of cameras has become very low; therefore, in this study, we chose two cameras fixed in parallel to solve the problem of depth and height. The use of parallel cameras reduces the complexity of the correspondence problem. In Fig. 3, we easily derive Xl and Xr by using similar triangles:

    Xl = Z * xl / f,                                             (3)
    Xr = Z * xr / f.                                             (4)

Assume that the optical axes of the two cameras are parallel to each other, where b is the distance between the two camera centers and b = Xl - Xr. C and G are the projections of P onto the left and right image planes, respectively. The disparity d is defined as d = xl - xr. From (3) and (4), we have

    b = Xl - Xr = (Z / f) * (xl - xr) = (Z / f) * d,             (5)

so the image depth Z is given by

    Z = f * b / d.                                               (6)

Figure 3. Projection transform of two cameras and disparity.

As shown in Fig. 4, the height of an object can be derived from the height of the object image, based on the assumption of a pinhole camera and the image-forming geometry:

    Y = y * Z / f.                                               (7)

Figure 4. Image-forming geometry.
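Eqs. (6) and (7) in code, assuming f in pixels, b and Z in metres, and d and y in pixels (function names are illustrative):

```python
def depth_from_disparity(f, b, d):
    """Eq. (6): Z = f * b / d for parallel cameras with baseline b."""
    return f * b / d

def height_from_image(f, y, Z):
    """Eq. (7): Y = y * Z / f under the pinhole camera assumption."""
    return y * Z / f
```

With the paper's calibrated f = 874 pixels and a 20 cm baseline, an object at Z = 5 m gives d = 874 * 0.2 / 5, about 35 pixels, which is exactly the Table 1 entry for b = 20 cm and Z = 5 m.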
Because the correspondence between the two cameras is rapid, the method computes the depth and height of objects efficiently and is suitable for ALV navigation applications. The method finds the disparity d from two corresponding points (for instance, C and G shown in Fig. 3) in the left and right images, so the accuracy of the correspondences is very important. Regarding the disparity d as an image intensity in gray values (0 to 255), the disparities over the whole image form an image called the disparity map (DM). The DM construction proposed by Birchfield and Tomasi [6] is employed in this paper. The advantage of this construction method is that it quickly obtains all depths, including those of discontinuous, occluded, and mismatched points; the disadvantage is the limited accuracy of the obtained disparity map. Fig. 5 shows a disparity map generated from the left and right images.

Figure 5. The disparity map: (a) left and right images; (b) disparity map.

3. PROPOSED METHOD

From (6) [7], we know that the farther an object is from the two cameras, the smaller its disparity value becomes, and vice versa; Fig. 6 shows the clearly nonlinear relationship between these two terms. The disadvantage of the DM is that a larger distance between the objects and the two cameras yields a smaller disparity value, which makes separating the object from the background difficult. Therefore, we must find a suitable baseline b that gives a clearer DM for each depth region of the two cameras. The processing steps are described in the following subsections.

Figure 6. The relationship between distance and disparity.

Region segmentation

We partition the scene into three levels by depth (near, middle, and far) and obtain the best DM for each region. In this paper, we define the near region as the distance from 0 m to 5 m, the middle region from 5 m to 10 m, and the far region beyond 10 m.

Acquisition of the best baseline b

Acquiring the best baseline b means finding the appropriate camera baseline for the different depth regions. Table I and Table II show the relationship between the depth Z and the two-camera baseline b. We set d_th = 30 as the threshold value and regard regions with d less than d_th as background. Therefore, when the depth Z is known, the disparity d for each baseline can be read from Table I and Table II, and the most appropriate b is the one that makes d closest to, or greater than, d_th. For example, 20 cm is the best b for the short-range region (0 m to 5 m), and 40 cm for the medium-range region (5 m to 10 m).

Calculation of the depth and height

The cameras are calibrated [8] in advance, giving the focal length f = 874 pixels. Substituting the obtained disparity d of an object into (6), we find the distance Z between the camera and the object; Z is then substituted into (7) to calculate the object height Y [9], which can usually be used to decide whether the object is an obstacle.

TABLE I. DISPARITY VALUES d (PIXELS) VS. DEPTH Z = 1 M TO 5 M AND BASELINE b = 10 CM TO 150 CM.

  b(cm)\Z(m)     1     2     3     4     5
      10        87    44    29    22    17
      20       175    87    58    44    35*
      30       262   131    87    66    52
      40       350   175   117    87    70
      50       437   219   146   109    87
      60       524   262   175   131   105
      70       612   306   204   153   122
      80       699   350   233   175   140
      90       787   393   262   197   157
     100       874   437   291   219   175
     110       961   481   320   240   192
     120      1049   524   350   262   210
     130      1136   568   379   284   227
     140      1224   612   408   306   245
     150      1311   656   437   328   262

  *: The best disparity for the short-range region.
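The starred "best baseline" entries of Tables I and II follow directly from d = f·b/Z with the calibrated focal length f = 874 pixels; the short sketch below reproduces them at the far limit of each region, where the disparity is smallest and must still exceed the threshold d_th = 30.

```python
# Predicted disparity in pixels for baseline b and depth Z (from Eq. 5),
# using the calibrated focal length f = 874 px from the paper.

def disparity_px(f_px, b_m, Z_m):
    return f_px * b_m / Z_m

d_short = round(disparity_px(874, 0.20, 5.0))    # Table I:  b = 20 cm, Z = 5 m
d_medium = round(disparity_px(874, 0.40, 10.0))  # Table II: b = 40 cm, Z = 10 m
# Both come out at about 35 px, just above d_th = 30, which is why these
# baselines are starred as the best choices for their regions.
```

The same function regenerates any other table entry, so the tables need not be stored if f is known.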
TABLE II. DISPARITY VALUES d (PIXELS) VS. DEPTH Z = 6 M TO 10 M AND BASELINE b = 10 CM TO 150 CM.

  b(cm)\Z(m)     6     7     8     9    10
      10        15    12    11    10     9
      20        29    25    22    19    17
      30        44    37    33    29    26
      40        58    50    44    39    35*
      50        73    62    55    49    44
      60        87    75    66    58    52
      70       102    87    76    68    61
      80       117   100    87    78    70
      90       131   112    98    87    79
     100       146   125   109    97    87
     110       160   137   120   107    96
     120       175   150   131   117   105
     130       189   162   142   126   114
     140       204   175   153   136   122
     150       219   187   164   146   131

  *: The best disparity for the medium-range region.

4. EXPERIMENTAL RESULTS

The proposed methods have been implemented and tested on a 2.8 GHz Pentium IV PC. Fig. 7 shows the two cameras fixed on a sliding rail, which can be pulled apart to change the baseline distance. From Section 3, the best b is 20 cm for the short-range region (0 m to 5 m) and 40 cm for the medium-range region (5 m to 10 m). We therefore set two persons standing at distances of 4 m and 8 m from the two cameras, with baseline b = 20 cm, as shown in Fig. 8. Because the person standing at 4 m is in the short-range region, he can be seen clearly; however, the person standing at 8 m is in the medium-range region, and it is difficult to separate him from the background.

Figure 7. Experiment platform of stereo vision.

Figure 8. The disparity map: (a) left and right images; (b) disparity map (Z = 400 cm and 800 cm, b = 20 cm).

Figure 9. The disparity map: (a) left and right images; (b) disparity map (Z = 800 cm, b = 20 cm).

Comparing Fig. 9 and Fig. 10, where the distance from the person to the baseline is 8 m (medium-range region) and the baseline is changed from b = 20 cm to b = 40 cm, the results show that with b = 40 cm the person (object) becomes clearer, as shown in Fig. 10.
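The per-pixel values that make up a DM come from a correspondence search along matching scanlines. The paper uses the Birchfield-Tomasi matcher [6]; the toy sketch below substitutes a plain sum-of-absolute-differences block match on one synthetic scanline, only to show where the disparity values originate.

```python
import numpy as np

# Toy 1-D stand-in for disparity-map construction: for each pixel on the
# left scanline, find the shift d that minimizes the SAD matching cost
# against the right scanline.

def row_disparity(left, right, max_d=8, win=2):
    """Per-pixel disparity along one scanline via SAD block matching."""
    n = len(left)
    disp = np.zeros(n, dtype=int)
    for x in range(win, n - win):
        patch = left[x - win:x + win + 1]
        best_cost, best_d = None, 0
        for d in range(0, min(max_d, x - win) + 1):
            cand = right[x - d - win:x - d + win + 1]
            cost = float(np.abs(patch - cand).sum())
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disp[x] = best_d
    return disp

# Synthetic scanlines: the right view equals the left shifted by 3 pixels
# (right[x] = left[x + 3]), so interior pixels should recover d = 3.
rng = np.random.default_rng(0)
left = rng.random(40)
right = np.roll(left, -3)
disp = row_disparity(left, right)
```

A real matcher adds cost aggregation and occlusion handling, which is exactly what the Birchfield-Tomasi method provides beyond this sketch.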
Figure 10. The disparity map: (a) left and right images; (b) disparity map (Z = 800 cm, b = 40 cm).

5. CONCLUSION

From the experimental results, we have found that a suitable two-camera baseline helps to obtain a better disparity map. However, if an object is far from the two cameras, its disparity value becomes small and close to that of the background, so the object is not easily detected. Using the proposed method of changing the baseline of the two cameras, the object becomes clearer and easier to detect, and more 3D object information is obtained. The results can be applied in many areas, for example, ALV navigation. In the future, we plan to remove the horizontal-stripe noise inside the DM so that the DM can be displayed better.

REFERENCES

[1] A. Elfes, "Using occupancy grids for mobile robot perception and navigation," Computer Magazine, pp. 46-57, June 1989.
[2] J. Hancock, M. Hebert, and C. Thorpe, "Laser intensity-based obstacle detection," 1998 IEEE/RSJ International Conference on Intelligent Robotic Systems, Vol. 3, pp. 1541-1546, 1998.
[3] E. Elkonyaly, F. Areed, Y. Enab, and F. Zada, "Range sensory-based navigation in unknown terrains," in Proc. SPIE, Vol. 2591, pp. 76-85.
[4] Y.-C. Chen, "Detection of roads and obstacles using 3D visual information applied to artificial-intelligence strategies for outdoor autonomous land vehicle navigation," Master's thesis, Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei, 2003.
[5] Y.-C. Chang, "A study of outdoor autonomous land vehicle navigation using binocular stereo computer vision with artificial-intelligence strategies," Master's thesis, Institute of Automation Technology, National Taipei University of Technology, Taipei, 2003.
[6] S. Birchfield and C. Tomasi, "Depth discontinuities by pixel-to-pixel stereo," International Journal of Computer Vision, pp. 269-293, Aug. 1999.
[7] G. Bradski and A. Kaehler, Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Press, 2008.
[8] Camera Calibration Toolbox for Matlab, http://www.vision.caltech.edu/bouguetj/calib_doc/
[9] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Trans. Intelligent Transportation Systems, Vol. 3, No. 3, pp. 148-154, Sep. 2000.
A MULTI-LAYER GMM BASED ON COLOR-TEXTURE COMBINATION FEATURE FOR MOVING OBJECT DETECTION

Tai-Hwei Hwang (黃泰惠), Chuang-Hsien Huang (黃鐘賢), Wen-Hao Wang (王文豪)
Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute, Chutung, HsinChu, Taiwan ROC 310
E-mail: {hthwei, DavidCHHuang, devin}@itri.org.tw

ABSTRACT

Foreground detection generally plays an important role in intelligent video surveillance systems. The detection is based on the characteristic similarity of pixels between the input image and the background scene. To improve the characteristic representation of pixels, a color and texture combination scheme for background scene modeling is proposed in this paper. The color-texture feature is applied in a four-layer structured GMM, which classifies a pixel into one of the states background, moving foreground, static foreground, or shadow. The proposed method is evaluated on three indoor videos, and the performance is verified by pixel detection accuracy and false positive and false negative rates based on ground-truth data. The experimental results demonstrate that it can eliminate shadows significantly without leaving many apertures in the foreground objects.

1. INTRODUCTION

Wide-range deployment of video surveillance systems is becoming more and more important to security maintenance in modern cities, as crime is a strong public concern today. However, conventional video surveillance systems need heavy human monitoring and attention: the more cameras deployed, the more inspection personnel employed. In addition, the attention of inspection personnel decreases over time, resulting in lower effectiveness at recognizing events while monitoring real-time surveillance videos. To minimize the manpower involved, research in the field of intelligent video surveillance has bloomed in recent years.

Among these studies, background subtraction is a fundamental element and is commonly used for moving object detection or human behavior analysis in intelligent visual surveillance systems. The basic idea behind background subtraction is to build a representation of the background scene so that moving objects in the monitored scene can be detected by a distance comparison between the input image and the background scene. The background scene contains the images of static or quasi-periodically dynamic objects, for instance, sea tides, a fountain, or an escalator. The representation of the background scene is basically a collection of statistics of pixel-wise features such as color intensities or spatial textures. The color feature can be the RGB components or other features derived from RGB, such as the HSI or YUV expression. The texture accounts for the intensity variation in a small region centered at the input pixel, which can be computed by conventional edge or gradient extraction algorithms, the local binary pattern [1], etc. The statistical background models of pixel colors and textures are respectively effective when the moving objects have colors different from the background objects and when either the background or the moving foreground objects are rich in texture. For example, it is hard to detect a walking man dressed in green in front of a green bush using the color feature only; in this case, since the bush's texture differs from that of the green cloth, the man can easily be detected by background subtraction with the texture feature. However, this is not the case when using only texture differences to detect a man walking in front of a flat white wall, because both the cloth and the wall lack texture. Therefore, some studies combine the color and texture information into a single pixel representation for background scene modeling [2][3][4]. In addition to the different modeling abilities of color and texture, the texture feature is much more robust than color under illumination change and is less sensitive to the slight cast shadow of a moving object.

Though the combination of color and texture can provide better modeling ability and robustness for the background scene under illumination change, it is not enough to eliminate a slightly dark cast shadow or to keep the scene invariant under stronger illumination change or the automatic white balance of the camera. To improve the robustness of background modeling further,
such as waving leaves, ocean waves, or traffic lights. Details of the feature extraction stage and of the background and shadow layers are described in the following subsections.

2.1. Feature extraction stage

The color-texture feature is a vector combining the RGB components with a local difference pattern (LDP) as the texture of a local region. The LDP is an edge-like feature consisting of intensity differences between predefined pixel pairs. Each component of the LDP is computed by

  LDP_n(C) = I(P_n) − I(C),   (1)

where C and P_n represent the pixel and its n-th neighbor pixel, respectively, and I(C) represents the gray-level intensity of pixel C, which can be computed as the average of the RGB components. Four types of patterns defining the neighbor pixels are depicted in Fig. 2 and are adopted separately in order to compare their moving object detection performance experimentally.

Fig. 2. Four types of patterns defining the neighbor pixels for the computation of the LDP.

2.2. Background Layer

The GMM background subtraction approach presented by Stauffer and Grimson [8] is widely used for extracting moving objects. Basically, it uses a set of Gaussian distributions to model the reasonable variation of the background pixels; an unclassified pixel is then considered foreground if its variation is larger than a threshold. We consider non-correlated feature components and model the background distribution of each pixel of the input image with a mixture of M Gaussian distributions:

  p(x) = Σ_{m=1..M} π_m^B N_m(x; μ_m^B, I σ_m^B),   (2)

where x represents the feature vector of a pixel, μ_m^B is the estimated mean, σ_m^B is the variance, and I represents the identity matrix, which keeps the covariance matrix isotropic for computational efficiency. The estimated mixing weights, denoted by π_m^B, are non-negative and add up to one.

The background model is first initialized with a set of training data; for example, the training data can be collected from the first L frames of the testing video. After that, each pixel at frame t is tested for a match with the m-th Gaussian N_m by checking the following inequality over all components {x_C,i, x_T,i} ∈ x:

  (λ / d_C) Σ_{i=1..d_C} (x_C,i − μ_C,i,m)² / (k (σ_C,i,m^B)²) + ((1 − λ) / d_T) Σ_{i=1..d_T} (x_T,i − μ_T,i,m)² / (k (σ_T,i,m^B)²) < 1,   (3)

where d_C and d_T denote the vector dimensions of the color and texture features, respectively, λ is the color-texture combination weight, and k is a threshold factor, which we set to three according to the three-sigma rule (a.k.a. the 68-95-99.7 rule) of the normal distribution. The weights of the Gaussian distributions are sorted in decreasing order; if the pixel matches one of the first n_B distributions, where n_B is obtained by Eq. (4), it is then classified as background [13]:

  n_B = arg min_b { Σ_{m=1..b} π_m > 1 − p_f },   (4)

where p_f is a measure of the maximum proportion of the data that may belong to foreground objects without influencing the background model.

When a pixel fits the background model, the model is updated in order to adapt it to progressive image variations. The update for each pixel is as follows:

  π_m^B ← π_m^B + α(o_m − π_m^B) − α c_L,   (5)
  μ_m^B ← μ_m^B + o_m (α / π_m^B)(x − μ_m^B),   (6)
  (σ_m^B)² ← (σ_m^B)² + o_m (α / π_m^B)((x − μ_m^B)ᵀ(x − μ_m^B) − (σ_m^B)²),   (7)

where α = 1/L is a learning rate and c_L is a constant value (set to 0.01 herein [14]). The ownership o_m is set to 1 for the matched Gaussian and to 0 for the others.

2.3. Shadow Layer

The problem of color space selection for shadow detection has been discussed in [6][12]. The experimental results there revealed that performing cast-shadow detection in CIE L*u*v*, YUV, or HSV is more efficient than in RGB color space. Considering that the RGB-to-CIE L*u*v* transform is nonlinear and that hue in HSV is a circular quantity, the YUV color space shows more computational efficiency owing to the linearity of its transform from RGB. In addition, YUV is also used for interfacing with analog and digital television and photographic equipment. As a result, YUV
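The match test and the background-layer update of Eqs. (3)-(7) can be sketched for a single pixel; the fragment below is a minimal single-feature reduction (one scalar component instead of separate color and texture vectors), with illustrative values for alpha, c_L, and the mixture state, and it assumes the updated weight is used in the mean/variance steps.

```python
import numpy as np

# Single-pixel, scalar-feature sketch of the GMM background test/update.

def matches(x, mu, var, k=3.0):
    """Eq. (3) reduced to one feature component: distance under k*sigma^2."""
    return (x - mu) ** 2 / (k * var) < 1.0

def update(pi, mu, var, x, m_hit, alpha=0.01, c_L=0.01):
    """Apply Eqs. (5)-(7); m_hit is the matched Gaussian's index or None."""
    o = np.zeros_like(pi)
    if m_hit is not None:
        o[m_hit] = 1.0
    pi = pi + alpha * (o - pi) - alpha * c_L                       # Eq. (5)
    if m_hit is not None:
        m = m_hit
        delta = x - mu[m]
        mu[m] = mu[m] + (alpha / pi[m]) * delta                    # Eq. (6)
        var[m] = var[m] + (alpha / pi[m]) * (delta ** 2 - var[m])  # Eq. (7)
    return pi, mu, var

pi = np.array([0.5, 0.5])
mu = np.array([0.0, 10.0])
var = np.array([1.0, 1.0])
x = 0.2
hit = 0 if matches(x, mu[0], var[0]) else None  # (0.2)^2 / 3 < 1, so matched
pi, mu, var = update(pi, mu, var, x, hit)
```

Only the matched Gaussian's mean and variance move toward the sample, while all weights decay slightly via the −α·c_L term, which is what lets rarely matched components be pruned over time.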
[13] M. Izadi and S. Parvaneh, "Robust region-based background subtraction and shadow removing using color and gradient information," in Proceedings of the International Conference on Pattern Recognition, pp. 1-5 (2008).
[14] Z. Zivkovic and F. van der Heijden, "Recursive unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 7, pp. 773-780 (2006).

Fig. 1. Flowchart of the proposed multi-layer scene model. For each pixel, the color-texture representation is matched in turn against the background, shadow, static-foreground, and moving-foreground models; matched models are updated, counters Count_SF and Count_MF with thresholds T1 and T2 control the transfer of static and moving foreground models into the background model, and the pixel is finally labeled as background, shadow, static foreground, or moving foreground.
Fig. 3. Results of background subtraction controlled by the combination weight of color and texture.

Fig. 4. Foreground detection results of video 2.
Fig. 5. Foreground detection results of video 3, the intelligentroom_raw.

Fig. 6. Foreground detection results of video 4, the Laboratory_raw.
Figure 2. Background image construction.

Figure 3. Convex hulls of (a) a non-occluded vehicle and (b) occluded vehicles.

collect various types of vehicle masks. The classification is implemented using the vehicle size information obtained from the traffic information analysis. After the system has run for a period of time, there are enough vehicle masks to establish the implicit shape model. We detail the procedures of our proposed system as follows.

B. Background Model Construction

A series of traffic surveillance frames is utilized to construct the background image of the traffic scene captured by a static roadside camera, so that the moving vehicles can be detected by background subtraction. Let B^i_{x,y} be the pixel at (x, y) of the background image; the background updating function is given by

  B^{i+1}_{x,y} = (1 − α M^b_{x,y}) B^i_{x,y} + α M^b_{x,y} F^i_{x,y},   (1)

in which F^i_{x,y} is the pixel at (x, y) in frame i, α is a small learning rate, and M^b_{x,y} is the binary mask of the current frame. If the pixel at (x, y) belongs to the background part, M^b_{x,y} = 1 to turn on the update; otherwise, M^b_{x,y} is set to 0 to avoid updating the background with the moving objects. An example of a scene with its constructed background is demonstrated in Fig. 2.

C. Occlusion Vehicle Detection

It has been observed that the shape of a non-occluded vehicle is close to its convex hull, while the shape of occluded vehicles shows a certain concavity, as illustrated in Fig. 3. This characteristic can be used to roughly extract non-occluded vehicles. In our implementation, the compactness Γ is used to evaluate how close the vehicle's shape and its convex hull are:

  Γ = V_s / V_c,   (2)

where V_s and V_c represent the vehicle area from the background subtraction and the vehicle convex-hull area, respectively. When the value of Γ is close to one, the vehicle area is similar to its convex-hull area, indicating that occlusion has probably not happened. In the training process, our system tries to extract non-occluded vehicle patterns, so we set a high threshold to ensure that most of the extracted vehicle patterns contain single vehicles.

D. Traffic Information Analysis

As mentioned before, we require that our system be executed in a more automatic manner to reduce the human effort of tuning parameters. Our scheme obtains the direction of the traffic appearing in the scene and the common vehicle size information from statistics of the surveillance videos in the training phase. For analyzing the direction of traffic, the vehicle movements must be attained first. SIFT is employed to identify features on the vehicles. After the vehicle segmentation, the vehicles are transformed into SIFT feature descriptors. The features of successive frames are compared and the positions of the movements are recorded. After a period of time, the main direction of traffic in the surveillance scene can be observed from the resulting movement histogram. In addition, a Region of Interest (ROI) can be identified to facilitate the subsequent processing. The ROI is located in the area of the detected traffic flow, near the bottom of the captured traffic scene, where vehicles appear larger and can offer more information.

After determining the ROI, we can collect the vehicle patterns or masks that appear in it. In the training phase, vehicle patterns that are determined to contain single vehicles based on the convex-hull analysis are archived. We can then check the size histogram of the archived vehicles to set up the criterion for roughly classifying them. In our test videos, the most common vehicles are motorcycles, sedan cars, and buses. When we examine the histogram of the
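The selective update of Eq. (1) can be sketched element-wise; in this reading, the mask M is 1 where the current frame may be blended into the background and 0 at foreground pixels, so moving vehicles are kept out of the background image. The values of alpha and the arrays below are illustrative.

```python
import numpy as np

# Sketch of the selective background update of Eq. (1).

def update_background(B, F, M, alpha=0.05):
    """B_{i+1} = (1 - alpha*M) * B_i + alpha*M * F_i, element-wise."""
    M = M.astype(float)
    return (1.0 - alpha * M) * B + alpha * M * F

B = np.zeros((2, 2))               # current background estimate
F = np.full((2, 2), 100.0)         # current frame
M = np.array([[1, 0], [0, 1]])     # update only the diagonal pixels
B1 = update_background(B, F, M)
```

With a small alpha, the background drifts toward the frame only where M permits, which is what keeps slow illumination changes in the model while excluding vehicles.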
Figure 5. (a) Multi-type vehicle error detection and (b) the result after the refining procedure.

Figure 6. (a) Multiple hypotheses detected on one vehicle and (b) the result after the refining procedure.

function returns a value of one; otherwise, it returns zero. For the 3D voting space, we use a spherical kernel whose radius is the bandwidth b(s_c), which is adaptive to the local maximum scale s_c. As the object scale increases, the kernel bandwidth should also increase for an accurate estimation. Therefore, we sum up all the weighting values that are inside the kernel and divide them by the volume V(s_c) to obtain an average weight density, which is called the score. After the score is derived, we define a threshold θ for determining whether the object exists. When the score is above θ, the hypothesized object center is preserved. Finally, we back-project the votes that support this hypothesized object center to obtain an approximate shape of the object.

F. Occlusion Resolving

After detecting the existence of certain occluded vehicles in the image, we need to classify them into different types. In our scheme, we construct the codebooks of the different types of vehicles. Each vehicle-type codebook is established automatically once enough vehicle patterns have been collected by the process of vehicle extraction. However, as shown in Fig. 5, the recognition performance is not as good as expected, since many errors happen on the bus image. Owing to the fact that the area of buses is much larger than that of sedan cars and that the two types share many similar local appearances, errors of this kind occur quite often. We provide the following refining procedure.

All hypotheses are supported by the contributing votes that are cast by the matched features. Theoretically, every extracted feature should support only one hypothesis, since it is not possible for one feature to belong to two vehicles. Thus, we modify these hypotheses after executing multiple recognition procedures. We first store all the hypotheses whose scores are over a threshold. Then all the hypotheses are refined by checking each contributing vote that appears in two hypotheses at the same time: the hypothesis with the higher score retains the vote, while the vote is eliminated from the others. Next, the scores of these hypotheses are recalculated; when the new score is above the threshold, the hypothesis is preserved. After this refining procedure, the number of error detections can be reduced.

There exists another problem in vehicle recognition using the ISM. As shown in Fig. 6, three bounding boxes appear on the same vehicle: the recognition result includes error detections in which the ISM has produced multiple hypotheses for one vehicle. Since this multiple-detection problem comes from the fact that the ISM searches for local maxima in scale-space, as shown in Fig. 7, the scheme may find several local maxima at different scale levels but at similar locations. In fact, these local maxima are generated by the same vehicle center, so the unnecessary hypotheses should be eliminated. We deal with the problem by computing the overlapped area between the two bounding boxes: when the overlapped area between two bounding boxes is very large, we can claim that the bounding box with the weaker score is an error detection. For efficient computation, the rate of overlap is estimated by finding the distance between the two bounding boxes' central points, using the longer diagonal of the larger bounding box as the criterion; the shorter the distance, the larger the overlap. In other words, for every two bounding boxes we need to check whether

  distance(B1, B2) < (1/3) D,   (7)

where B1 and B2 denote the central points of the two bounding boxes and D is the diagonal of the larger one. In our implementation, when the distance is smaller than D/3, the overlapped area of the bounding boxes is above 50%, and we thus remove the bounding box that has the lower score. The error detections from the ISM can thereby be reduced.

IV. EXPERIMENTAL RESULTS

We have tested the proposed self-training mechanism on two different surveillance videos, whose scenes are displayed in Fig. 8. Scene 1, shown in Fig. 8(a), is a 15-minute video, while Scene 2, shown in Fig. 8(b), is a 17-minute video. The experimental results are demonstrated in three parts: the traffic information analysis, the vehicle pattern extraction/classification, and the occlusion resolving.

A. Traffic Information Analysis

The directions of the traffic flows in the two scenes are illustrated in Fig. 9. The red points represent forward
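The duplicate-hypothesis removal built around Eq. (7) can be sketched as a greedy pass over score-sorted boxes; the box format (cx, cy, w, h, score) and the example values are assumptions for illustration.

```python
import math

# Greedy duplicate suppression: a box is dropped when its center lies
# within one third of the larger box's diagonal of a stronger surviving
# box (the Eq. (7) criterion).

def suppress_duplicates(boxes):
    """boxes: list of (cx, cy, w, h, score); returns surviving boxes."""
    keep = []
    for box in sorted(boxes, key=lambda b: -b[4]):      # strongest first
        cx, cy, w, h, _ = box
        dup = False
        for kx, ky, kw, kh, _ in keep:
            D = max(math.hypot(w, h), math.hypot(kw, kh))
            if math.hypot(cx - kx, cy - ky) < D / 3.0:  # Eq. (7)
                dup = True
                break
        if not dup:
            keep.append(box)
    return keep

hyps = [(50, 50, 30, 40, 0.9),    # strong detection
        (55, 52, 30, 40, 0.5),    # near-duplicate of the first
        (200, 80, 30, 40, 0.7)]   # a separate vehicle
kept = suppress_duplicates(hyps)
```

The center-distance test is cheaper than computing the actual intersection area, which is the efficiency argument the paper makes for Eq. (7).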
Figure 7. If the distance between two bounding boxes' centers is smaller, the overlap area is larger, so the distance is employed to remove the duplicated detections.

Figure 8. The views of the two surveillance videos: (a) Scene 1; (b) Scene 2.

moving vehicles and the blue points are backward moving vehicles. We can see that the directions of the traffic flows are successfully obtained after training on the video for a while. It should be noted that the heavier the traffic volume, the less time is needed. The vehicle size statistics for Scene 1 and Scene 2 are exhibited in Fig. 10. There exist two peaks in each scene: the left peak, with the smaller vehicle size, represents motorcycles, while the right one, with the larger vehicle size, stands for sedan cars. In Scene 1, according to Fig. 10, we assign a lower bound of 700 pixels and an upper bound of 1000 pixels for the motorcycle size; the upper and lower bounds of the sedan-car size are 1700 and 3300 pixels, respectively. In Scene 2, the motorcycle size is assigned the bounds 1400 and 2100 pixels, while the sedan-car size is assigned the lower bound 4000 pixels and the upper bound 8500 pixels. We can see that the vehicle size information, i.e. for motorcycles and sedan cars, can be obtained successfully from the statistics of the surveillance video.

Figure 9. The directions of the traffic flows for (a) Scene 1 and (b) Scene 2.

Figure 10. The vehicle size statistics for (a) Scene 1 and (b) Scene 2 (occurrences per minute vs. vehicle size in units of 100 pixels).

B. Vehicle Pattern Extraction and Classification

The various extracted vehicle patterns are demonstrated after they pass the occlusion detection process, which ensures that they have no occlusion problem. In our experiment, we give Eq. (2) a threshold of 0.9 for extracting sedan cars/buses and 0.8 for motorcycles. We apply the shape analysis to sedan cars and buses but not to motorcycles, since motorcycles cannot be approximated by a convex hull. The performance of vehicle extraction is summarized in Table I; these vehicle patterns are then employed for training. It should be noted that the errors usually come from unstable environmental conditions, which affect the construction of the background image. The vehicle classification result is summarized in Table II. Some extracted patterns from Scene 1 are illustrated in Figs. 11-13. We can see that the vehicle patterns are effectively extracted, and they are helpful in training more accurate codebooks and models.

Table I. VEHICLE PATTERN EXTRACTION

            Total   Error   Correct rate
  Scene 1     940      15       98.4%
  Scene 2    1251      31       97.5%

Table II. VEHICLE PATTERN CLASSIFICATION

            Motorcycle                      Sedan car
            Total  Error  Correct rate     Total  Error  Correct rate
  Scene 1    135     3       97.8%          765     34      95.6%
  Scene 2    159     2       98.7%          826     46      94.4%

C. Occlusion Resolving

Table III and Figs. 14-16 demonstrate the results of occlusion resolving. We use the extracted vehicle patterns to train the ISM codebooks for the two different scenes. Table III gives the performance of resolving occlusion for sedan cars: the occlusion part denotes sedan cars that actually occlude with other vehicles, while the non-occlusion part stands for sedan cars that are not occluded with other vehicles but pass the occlusion detection. As shown in Figs. 14 and 15, several sedan cars are partially occluded; we use the trained ISM to resolve the occlusions. The red points and bounding boxes
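The size-based rough classification described above amounts to a lookup against the histogram-derived bounds; the sketch below uses the Scene 1 bounds from the text (areas in pixels), leaving anything outside them unlabeled (the paper also handles buses via a separate codebook).

```python
# Rough vehicle classification by blob area, using the Scene 1 bounds
# read off the size histogram: motorcycles 700-1000 px, sedans 1700-3300 px.

SCENE1_BOUNDS = {"motorcycle": (700, 1000), "sedan": (1700, 3300)}

def classify_by_size(area_px, bounds=SCENE1_BOUNDS):
    for label, (lo, hi) in bounds.items():
        if lo <= area_px <= hi:
            return label
    return "unknown"
```

Because the bounds are scene-dependent (Scene 2 uses 1400-2100 and 4000-8500 px), they have to be re-estimated from each camera's own training statistics.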
represent the vehicle's central coordinate and its position as detected by the ISM. In Fig. 16, we resolve the problem of occlusion between the two types of vehicles, i.e. bus and sedan car. By combining the ISM and the proposed self-training mechanism, these occlusion problems can be reasonably resolved.

Table III. SEDAN CAR OCCLUSION RESOLVING RATE

                           Total   Miss   False alarm
  Scene 1  occlusion        177     35        46
           non-occlusion     88      1         2
  Scene 2  occlusion         92     16        21
           non-occlusion    130      2        12

                           Recall   Precision
  Scene 1  occlusion        80.2%     75.5%
           non-occlusion    98.9%     97.8%
  Scene 2  occlusion        82.6%     78.2%
           non-occlusion    98.4%     99.2%

Figure 11. The extracted motorcycle patterns from Scene 1.
Figure 12. The extracted sedan car patterns from Scene 1.
Figure 13. The extracted bus patterns from Scene 1.
Figure 14. Occlusion resolving of sedan cars in Scene 1.
Figure 15. Occlusion resolving of sedan cars in Scene 2.
Figure 16. Resolving the partial occlusion of a sedan car and a bus.

V. CONCLUSION

We have proposed a framework for analyzing the traffic information in surveillance videos captured by the
static roadside cameras. The traffic and vehicle information is collected from the videos for training the related models automatically. For vehicles without occlusion, we use the scene model to record and classify. If an occlusion happens, the implicit shape model is employed. The experimental results demonstrate this potential solution for solving occlusion problems in traffic surveillance videos. Future work will further improve the accuracy and the speed of execution.

REFERENCES

[1] O. Javed, S. Ali, and M. Shah, "Online detection and classification of moving objects using progressively improving detectors," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 696-701.
[2] J. Hsieh, S. Yu, Y. Chen, and W. Hu, "Automatic traffic surveillance system for vehicle tracking and classification," IEEE Transactions on Intelligent Transportation Systems, vol. 7, no. 2, pp. 175-187, 2006.
[3] B. Wu and R. Nevatia, "Improving part based object detection by unsupervised, online boosting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007, pp. 1-8.
[4] J. Zhou, D. Gao, and D. Zhang, "Moving vehicle detection for automatic traffic monitoring," IEEE Transactions on Vehicular Technology, vol. 56, no. 1, pp. 51-59, 2007.
[5] H. Celik, A. Hanjalic, E. Hendriks, and S. Boughorbel, "Online training of object detectors from unlabeled surveillance video," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'08), 2008, pp. 1-7.
[6] H. Celik, A. Hanjalic, and E. Hendriks, "Unsupervised and simultaneous training of multiple object detectors from unlabeled surveillance video," Computer Vision and Image Understanding, vol. 113, no. 10, pp. 1076-1094, 2009.
[7] V. Nair and J. Clark, "An unsupervised, online learning framework for moving object detection," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 317-324.
[8] C. Pang, W. Lam, and N. Yung, "A novel method for resolving vehicle occlusion in a monocular traffic-image sequence," IEEE Transactions on Intelligent Transportation Systems, vol. 5, pp. 129-141, 2004.
[9] C. Pang, W. Lam, and N. Yung, "A method for vehicle count in the presence of multiple-vehicle occlusions in traffic images," IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 3, pp. 441-459, 2007.
[10] A. Yoneyama, C. Yeh, and C. Kuo, "Robust vehicle and traffic information extraction for highway surveillance," EURASIP Journal on Applied Signal Processing, vol. 2005, p. 2321, 2005.
[11] X. Song and R. Nevatia, "A model-based vehicle segmentation method for tracking," in Tenth IEEE International Conference on Computer Vision (ICCV 2005), 2005, pp. 1124-1131.
[12] J. Lou, T. Tan, W. Hu, H. Yang, and S. Maybank, "3-D model-based vehicle tracking," IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1561-1569, 2005.
[13] N. Kanhere, S. Birchfield, and W. Sarasua, "Vehicle segmentation and tracking in the presence of occlusions," Transportation Research Record: Journal of the Transportation Research Board, vol. 1944, pp. 89-97, 2006.
[14] W. Zhang, Q. Wu, X. Yang, and X. Fang, "Multilevel framework to detect and handle vehicle occlusion," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 161-174, 2008.
[15] L. Tsai, J. Hsieh, and K. Fan, "Vehicle detection using normalized color and edge map," IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 850-864, 2007.
[16] C. Wang and J. Lien, "Automatic vehicle detection using local features - a statistical approach," IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 83-96, 2008.
[17] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," International Journal of Computer Vision, vol. 77, no. 1, pp. 259-289, 2008.
[18] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 8, pp. 790-799, 1995.
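The recall and precision figures reported in Table III follow directly from the listed counts: hits = total - misses, recall = hits / total, precision = hits / (hits + false alarms). A quick check against the Scene 1 rows:

```python
def recall_precision(total, miss, false_alarm):
    """Recall and precision from detection counts.
    hits = ground-truth vehicles correctly resolved (total minus misses)."""
    hits = total - miss
    recall = hits / total
    precision = hits / (hits + false_alarm)
    return recall, precision

# Scene 1, occluded sedan cars (Table III: 177 total, 35 misses, 46 false alarms)
r, p = recall_precision(177, 35, 46)
print(f"recall={r:.1%} precision={p:.1%}")  # recall=80.2% precision=75.5%
```

The computed values match the table's 80.2% recall and 75.5% precision for Scene 1 occlusion, and 98.9%/97.8% for the non-occlusion row.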
...or buildings as printed on a brochure or guide book, and use them for user interactions.

The paper is organized as follows. Section 2 describes the design of the system; Section 3 describes the different steps in the implementation of the system; Section 4 discusses the operational navigation system; and Section 5 provides the conclusion and discusses possible future research directions.

II. SYSTEM DESIGN

This section describes how the system is designed and implemented. Issues that arose during the implementation of the system, as well as the approaches taken to resolve them, are also discussed. To achieve the goals and ideas set out in the previous section, the system is designed with the following considerations.

- Minimum direct contact: the need for a user to come into direct contact with hardware devices such as a keyboard, a mouse, or a touch screen should be minimized.
- User friendliness: the system should be easy and intuitive to use, with a simple interface and concise instructions.
- Adaptability: the system should be able to handle other different but similar operations with minimal modifications.
- Cost effectiveness: we wish to implement the system using readily available hardware, to demonstrate that the integration of simple hardware can achieve fascinating performance.
- Simple and robust setup: our goal is to have the system installed at various locations throughout the school, or other public facilities. A simple and robust setup reduces the chance of a system failure.

In accordance with the considerations listed above, the system is designed to have the input and output interfaces shown in Fig. 1.

Figure 1. Diagram for the navigation interface.

The system obtains input via a camera located above and overlooking the pamphlet. The camera captures images of the user's hand and the pamphlet. The images are processed and analyzed to extract the motion and the location of the fingertip. The extracted information is used to determine the multimedia data, including text, 2D pictures, 3D models, sound files, and/or movie clips, to be displayed for the selected location on the pamphlet. Fig. 2 shows the concept of the proposed navigation system.

Figure 2. Concept of the navigation system.

III. SYSTEM IMPLEMENTATION

The main steps in our system are discussed in the following.

A. Build the system using ARToolKit

We have selected ARToolKit to develop our system, since it has many readily available high-level functions that can be used for our purpose. It can also be easily integrated with other libraries to provide more advanced functions and implement many creative applications.

B. Create markers

The system associates 2D markers on the pamphlet with 3D objects stored in the database, as well as with actions to manipulate the objects. This is achieved by first scanning the marker patterns, storing them in the system, and letting the program learn to recognize the patterns. In the program, each marker is associated with a particular 3D model or action, such that when the marker is selected by the user, the associated data or action will be displayed or executed. Fig. 3 shows examples of markers used by the system. Each marker is surrounded by a black border to facilitate recognition. The object markers, as indicated by the blue arrows, are designed to match the objects to be displayed. The bottom right shows a row of markers, enclosed by the red oval, used to perform actions on the displayed 3D objects.
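The marker-to-content association described in Section B is essentially a lookup table from recognized marker patterns to either a 3D model or an action on the currently displayed model. A minimal sketch of that idea follows; all names and the numeric zoom/rotation factors are illustrative assumptions, not the system's actual code:

```python
# Sketch of the marker association: each trained marker maps either to a
# 3D object to display, or to an action that manipulates the displayed
# object. Marker names and action factors are hypothetical.
from dataclasses import dataclass

@dataclass
class Model3D:
    name: str
    scale: float = 1.0
    rotation_deg: float = 0.0

def zoom_in(m):  m.scale *= 1.25                       # "+" marker
def zoom_out(m): m.scale *= 0.8                        # "-" marker
def rotate(m):   m.rotation_deg = (m.rotation_deg + 15) % 360
def reset(m):    m.scale, m.rotation_deg = 1.0, 0.0

markers = {
    "marker_vase":  Model3D("vase"),   # object marker: show this model
    "marker_plus":  zoom_in,           # action markers: manipulate it
    "marker_minus": zoom_out,
    "marker_rot":   rotate,
    "marker_reset": reset,
}

def on_marker_selected(marker_id, state):
    """Display a model, or apply an action to the current model."""
    entry = markers[marker_id]
    if isinstance(entry, Model3D):
        state["current"] = entry
    elif state.get("current") is not None:
        entry(state["current"])

state = {"current": None}
on_marker_selected("marker_vase", state)
on_marker_selected("marker_plus", state)
print(state["current"].scale)  # 1.25
```

In the real system the dispatch would be driven by ARToolKit's pattern recognition rather than string IDs, but the object-marker vs. action-marker split is the same.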
Figure 3. Markers used by the system.

C. Create 3D models

The 3D models associated with the markers are created using OpenGL or the VRML format. These models can be displayed on top of the live-feed video, such that the user can interact with the 3D models in real time. The models are texture-mapped to provide realistic appearances, and were created in collaboration with the Kaohsiung Museum of History [9]. Fig. 4 shows examples of the 3D models used in the navigation system. The models are completely 3D with texture mapping, and can be viewed from any angle by the user.

Figure 4. Examples of the 3D models used by the system.

D. Implement interactive functions

In addition to displaying the 3D models when the user selects a marker, the system also provides a set of actions that the user can use to manipulate the displayed 3D model in real time. For example, we have designed "+/-" markers for the user to magnify or shrink the displayed 3D model. The user simply places his/her finger on the markers and the 3D model changes size accordingly. There are also markers for the user to rotate the 3D model, as well as to reset the model to its original size and position. Figs. 5 to 7 show the system with its implemented actions in operation. In the figures, the user simply puts a finger over a marker, and the selected action is performed on the displayed 3D model. Note that the actions can be applied to any 3D model that can be displayed by the system.

Figure 5. The user selects the zoom-in function to magnify the displayed 3D model.
Figure 6. The user selects the zoom-out function to shrink the displayed 3D model.
Figure 7. The user uses the rotation marker to rotate the 3D object.

E. Determine selection

A USB camera is used to capture continuous images of the scene. The program automatically scans the field of view in real time for recognized markers. Once a marker is found to be partially obstructed by the hand, it is considered to be selected. The program then matches the selected marker with the associated 3D model or action in the database. Figs. 5 to 7 show the user selecting markers by pointing to them with a finger. From the figures, it can be seen that the selected 3D model is shown within the video window in real time. Also notice that the models are placed on top of the corresponding marker's position in the video window.

IV. NAVIGATION SYSTEM

The proposed navigation system has been designed and implemented according to the descriptions provided in the previous sections. The system does not have high memory requirements and runs effectively on usual PCs or laptops.

V. CONCLUSION

A multimedia, augmented-reality interactive navigation system has been designed and implemented in this work. In particular, the system is implemented for providing museum guidance.

The implemented system does not require the user to operate hardware devices such as a keyboard, mouse, or touch screen. Instead, computer vision approaches are used to obtain input information from the user via an overhead camera. As the user points to certain locations on the pamphlet with a finger, the selected markers are identified by the system, and relevant data are shown or played, including a texture-mapped 3D model of the object and textual, audio, or other multimedia information. Actions to manipulate the displayed 3D model can also be selected in a similar manner. Hence, the user is able to operate the system without contacting any hardware device except for the printout of the pamphlet. The implementation of the system is hoped to reduce the
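The selection rule of Section E (a marker counts as selected once it is partially obstructed by the hand) can be sketched as a small visibility tracker over the per-frame detection results. The frame-count thresholds below are illustrative assumptions, not values from the paper:

```python
# Hedged sketch: a marker that the detector saw stably in recent frames
# but suddenly fails to detect is assumed to be covered by the user's
# finger, i.e. "selected". Thresholds are made-up tuning parameters.
class SelectionDetector:
    def __init__(self, seen_frames=5, miss_frames=3):
        self.seen_frames = seen_frames  # frames a marker must be visible first
        self.miss_frames = miss_frames  # consecutive misses meaning "covered"
        self.visible = {}               # marker id -> consecutive visible count
        self.missing = {}               # marker id -> consecutive missing count

    def update(self, detected_ids):
        """Feed the marker IDs detected in the current frame.
        Returns the set of markers considered selected on this frame."""
        selected = set()
        for mid in detected_ids:
            self.visible[mid] = self.visible.get(mid, 0) + 1
            self.missing[mid] = 0
        for mid in list(self.visible):
            if mid not in detected_ids:
                self.missing[mid] = self.missing.get(mid, 0) + 1
                if (self.visible[mid] >= self.seen_frames and
                        self.missing[mid] == self.miss_frames):
                    selected.add(mid)   # was stable, now occluded -> selected
        return selected

det = SelectionDetector()
for _ in range(5):
    det.update({"zoom_in"})        # marker visible on the pamphlet
hits = set()
for _ in range(3):
    hits |= det.update(set())      # finger covers it for 3 frames
print(hits)  # {'zoom_in'}
```

Requiring several consecutive misses keeps momentary detection dropouts (motion blur, lighting) from being mistaken for a deliberate selection.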
cost of providing and maintaining peripheral hardware devices at information terminals, and at the same time to eliminate health risks associated with contamination by contact in public areas. Work to enhance the system is ongoing and it is hoped that the system will be used widely in the future.

The system requires no expensive hardware; a USB camera is sufficient to provide the required input. It is also quite easy to set up and customize for various objects and applications. The system can be placed at various points in the museum on separate terminals to enable visitors to access additional museum information in an interactive manner.

Fig. 8 shows a screen shot of the system in operation. The left window is the live-feed video, with the selected 3D model shown on top of the corresponding marker's position in the video window. The window on the right-hand side shows the multimedia information displayed along with the 3D model to provide more information about the object. For example, when the 3D object is displayed, the window on the right might show additional textual information about the object, as well as audio files that describe the object or provide suitable background music.

Figure 8. The interface showing the 3D model and other multimedia information.

ACKNOWLEDGMENT

This research is supported by the National Science Council (NSC98-2815-C-390-026-E). We would also like to thank the Kaohsiung Museum of History for providing cultural artifacts and kind assistance.

REFERENCES

[1] J.-Z. Jiang, "Why Can Wii Win?", Awareness Publishing, 2007.
[2] D.-Y. Lai and M. Liou, "Digital Image Processing Technical Manual", Kings Information Co., Ltd., 2007.
[3] R. Jain, R. Kasturi, and B. G. Schunck, Machine Vision, McGraw-Hill, 1995.
[4] R. Klette, K. Schluns, and K. Koschan, Computer Vision: Three-Dimensional Data from Images, Springer, 1998.
[5] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd edition, Prentice Hall, 2002.
[6] HitLabNZ, http://www.hitlabnz.org/wiki/Home, 2008.
[7] R. T. Azuma, "A survey of augmented reality," Presence: Teleoperators and Virtual Environments, vol. 6, pp. 355-385, 1997.
[8] Augmented Reality Network, http://augmentedreality.ning.com, 2008.
[9] H.-J. Chien, C.-Y. Chen, and C.-F. Chen, "Reconstruction of cultural artifact using structured lighting with densified stereo correspondence," ARTSIT, 2009.
[10] C.-H. Liu, Hand Posture Recognition, Master thesis, Dept. of Computer Science & Eng., Yuan Ze University, Taiwan, 2006.
[11] C.-Y. Chen, Virtual Mouse: Vision-Based Gesture Recognition, Master thesis, Dept. of Computer Science & Eng., National Sun Yat-sen University, Taiwan, 2003.
[12] J. C. Lai, Research and Development of Interactive Physical Games Based on Computer Vision, Master thesis, Department of Information Communication, Yuan Ze University, Taiwan, 2005.
[13] H.-C. Yeh, An Investigation of Web Interface Modal on Interaction Design - Based on the Project of Burg Ziesar in Germany and the Web of National Palace Museum in Taiwan, Master thesis, Dept. of Industrial Design, Graduate Institute of Innovation & Design, National Taipei University of Technology, Taiwan, 2007.
[14] T. Brown and R. C. Thomas, "Finger tracking for the digital desk," in First Australasian User Interface Conference, vol. 22, no. 5, pp. 11-16, 2000.
[15] P. Wellner, "Interacting with paper on the DigitalDesk," Communications of the ACM, pp. 28-35, 1993.
Facial Expression Recognition Based on Local Binary Pattern and Support Vector Machine

Ting-Wei Lee (李亭緯), Yu-shann Wu (吳玉善), Heng-Sung Liu (柳恆崧) and Shiao-Peng Huang (黃少鵬)
Chunghwa Telecommunication Laboratories, 12, Lane 551, Min-Tsu Road Sec. 5, Yang-Mei, Taoyuan, Taiwan 32601, R.O.C.
TEL: 886-3-424-5095, FAX: 886-3-424-4742
Email: finas@cht.com.tw, yushanwu@cht.com.tw, lhs306@cht.com.tw, pone@cht.com.tw

Abstract - Facial expression recognition has long been an important and challenging problem. In this paper, we propose a method for facial expression recognition. First, we apply face detection to locate the face. Then the Local Binary Pattern (LBP) operator extracts the facial features; when calculating the LBP features, we use an NxN window as a statistical region and move this window by a certain number of pixels. Finally, we adopt the Support Vector Machine (SVM) as the classifier to recognize the facial expression. In the experiments, we use the JAFFE database and recognize seven kinds of expressions; the average correct rate reaches 93.24%. The experimental results show that the proposed method has high accuracy.

Keywords: facial expression, face detection, LBP, SVM

I. INTRODUCTION

Analyzing facial expressions can provide much interesting information and can be used in several applications. Taking an electronic billboard as an example, facial expression recognition can tell whether a commercial attracts customers or not. In recent years, much research has been devoted to this human-computer interaction technique.

The basic step of any such image processing is to extract facial features from the original images. Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2] are two widely used methods. PCA computes a set of eigenvalues and eigenvectors; by selecting the most significant eigenvectors, it produces projection axes that minimize the reconstruction error of the projected images. The goal of LDA is to find a linear transformation that minimizes the within-class variance and maximizes the between-class variance. In other words, PCA is suitable for data analysis and reconstruction, while LDA is suitable for classification. But the dimension of an image is usually high, so the computation required for feature extraction is significant.

Besides PCA and LDA, Gabor filter methods [3] are also used in facial feature extraction. Gabor filters offer both multi-scale and multi-orientation selectivity, which can effectively represent some local features of facial expressions. However, the Gabor filter method suffers the same problems as PCA and LDA: it costs too much computation and yields a high-dimensional feature space.

In this paper, we use the Local Binary Pattern (LBP) [4][5] as the facial feature extraction method. LBP has low computation cost and efficiently encodes the micro-pattern texture features of the face image. In the first step, we detect the face area to remove the background. We extract Haar-like [6] features and use the Adaboost [7] classifier for face detection; the face detection module can be found in the Open Source Computer Vision Library (OpenCV). After obtaining the face area, we calculate its LBP features. Finally, the Support Vector Machine (SVM) classifies the LBP features and recognizes the facial expression. Experimental results demonstrate the effective performance of the proposed method.

The rest of this paper is organized as follows. In Section II, we introduce our system flow chart and the face detection. In Section III, we explain the facial LBP representation and the SVM classifier. In Section IV, experimental results are presented. Finally, we give a brief discussion and conclusion in Section V.

II. THE PROPOSED METHOD

The flow chart of the proposed facial expression recognition method is shown in Fig. 1: face detection, LBP feature extraction, and SVM classification produce the recognition result from the original image. In the first step, face detection is performed on the original image to locate the face area. In order to exclude hair and background regions, we take a smaller area from the detected face area. In the second step, the LBP method extracts the facial expression features. When calculating the histogram of LBP features, we use an NxN window as a statistical
region and move this window by a certain number of pixels. In the last step, the SVM classifier is used for the facial expression recognition.

Figure 1. The flow chart of the proposed method.

A. The Face Detection

Viola and Jones [9] used Haar-like features for face detection; some Haar-like feature samples are shown in Fig. 2. Haar-like features highlight the differences between the black region and the white region. Each portion of the facial area has a different property; for example, the eye region is darker than the nose region. Hence, the Haar-like features can extract rich information to discriminate different regions.

Figure 2. Haar-like features: the first row is for the edge features and the second row is for the line features.

The cascade of classifiers trained by the Adaboost technique is an optimal way to reduce the time spent searching for the face area. In this cascade algorithm, the boosted classifier combines several weak classifiers into a strong classifier. Different Haar-like features are selected and processed by the different cascaded weak classifiers. Fig. 3 shows the decision process of this algorithm. If the feature set passes through all of the weak classifiers, it is acknowledged as a face area. On the other hand, if the feature set is denied by any weak classifier, it is rejected.

Figure 3. The decision process of cascade Adaboost.

The face detection module can be found in the Open Source Computer Vision Library (OpenCV) [10]. But if we use the original detection region, it may include some unnecessary areas such as hair or background. To avoid this situation, we cut a smaller area from the detection region, reducing the unnecessary areas while keeping the important features. This area's width is 126 pixels and its height is 147 pixels. Fig. 4 shows the final result of the face area.

Figure 4. The first column is the original images; the second column is the final face areas.
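The cascade decision of Fig. 3 can be sketched in a few lines: a window is accepted as a face only if it passes every stage, and any stage may reject it early, which is what makes the cascade fast on the many non-face windows. The stage scores and thresholds below are illustrative stand-ins for the boosted sums of Haar-like responses:

```python
# Sketch of the cascade Adaboost decision process (Fig. 3). Each entry in
# stage_scores stands in for one boosted stage's weighted sum of Haar-like
# feature responses; thresholds are illustrative, not trained values.
def cascade_is_face(stage_scores, thresholds=(0.3, 0.5, 0.7)):
    """Return True iff the window passes every cascade stage."""
    for score, t in zip(stage_scores, thresholds):
        if score < t:
            return False   # denied by a weak classifier: reject immediately
    return True            # passed all stages: acknowledged as a face area

print(cascade_is_face([0.9, 0.8, 0.75]))  # True  (passes all three stages)
print(cascade_is_face([0.9, 0.4, 0.9]))   # False (rejected at stage 2)
```

The early exit is the point of the design: most scanned windows are rejected by the first cheap stages, so the expensive later stages run only on promising candidates.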
III. THE LBP METHOD AND SVM CLASSIFIER

B. Local Binary Patterns

LBP was originally used in texture analysis. The approach is a gray-level invariant measurement derived from the texture in a local neighborhood, and it has been applied to many different fields, including face recognition.

Considering a 3x3 neighborhood, the operator assigns a label to every pixel of an image: each neighbor is thresholded against the center pixel value, and the resulting bits are read as a binary number. The histogram of the labels can then be used as a texture descriptor. See Fig. 5 for an illustration of the basic LBP operator.

Figure 5. The basic idea of the LBP operator.

An extension of the original LBP is the uniform patterns [11]. A local binary pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa. For example, 00011110 and 10000011 are uniform patterns.

We utilize LBP with uniform patterns in our facial expression representation. We compute the uniform patterns using the (8, 2) neighborhood, shown in Fig. 6; (8, 2) stands for eight neighbors on a circle of radius two. The black rectangle in the center marks the threshold pixel, and the circle points around it mark the neighbors. Four of the neighbors are not located at pixel centers, so their values are calculated by interpolation. After that, a sliding window of size 18x21 is used to collect the uniform-pattern statistics, shifting by 6 pixels in width and 8 pixels in height. Fig. 7 illustrates the statistics collection along the width.

Figure 6. LBP representation using the (8, 2) neighborhood.

Figure 7. The statistics window along the width (18x21 window, shifted by 6 pixels).

C. Support Vector Machine

The SVM is a learning machine founded on statistical learning theory, and it has been widely applied in pattern recognition. The basic scheme of SVM is to create an optimal hyperplane as the decision plane, maximizing the margin between the closest points of the two classes. The points on the margin are called support vectors; these support vectors determine the hyperplane.

Assume we have a set of sample points from two classes,

    $\{x_i, y_i\},\; i = 1, \ldots, m,\quad x_i \in \mathbb{R}^N,\; y_i \in \{-1, 1\}.$    (1)

The discrimination function is defined as

    $f(x) = \sum_{i=1}^{m} y_i a_i k(x, x_i) + b,$    (2)

where $f(x)$ indicates the membership of $x$, $a_i$ and $b$ are real constants, $k(x, x_i) = \langle \phi(x), \phi(x_i) \rangle$ is a kernel function, and $\phi(x)$ is the nonlinear map from the original space to the high-dimensional space. The kernel function can take various forms: the linear kernel is $k(x, x_i) = x \cdot x_i$, the radial basis function (RBF) kernel is $k(x, x_i) = \exp\left(-\frac{1}{2\sigma^2}\|x - x_i\|^2\right)$, and the polynomial kernel is $k(x, x_i) = (x \cdot x_i + 1)^n$. SVM can be designed for either two-class or multi-class classification. In this paper, we use the multi-class SVM with a polynomial kernel [12].

IV. EXPERIMENTAL RESULTS

In this paper, we use the JAFFE facial expression database [13]; examples from this database are shown in Table I. The database is composed of 213 gray-scale images of 10 Japanese females. Each person has 7 kinds of expressions, and every expression includes 3 or 4 copies. The 7 expressions are Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.
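The basic LBP labeling and the uniform-pattern test described in Section III can be sketched as follows, assuming the common convention that a neighbor's bit is 1 when its value is greater than or equal to the center pixel (the full method additionally uses the interpolated (8, 2) circular neighborhood and an 18x21 statistics window, which are omitted here):

```python
# Sketch of the basic 3x3 LBP operator: threshold each neighbor against
# the center pixel and read the bits, clockwise from the top-left, as one
# 8-bit label per pixel.
def lbp_3x3(patch):
    """patch: 3x3 list of gray values; returns the 8-bit LBP label."""
    c = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    bits = ''.join('1' if n >= c else '0' for n in neighbors)
    return int(bits, 2)

def is_uniform(label, bits=8):
    """A pattern is uniform if it has at most two 0/1 transitions when the
    bit string is read circularly, e.g. 00011110 and 10000011."""
    s = format(label, f'0{bits}b')
    transitions = sum(s[i] != s[(i + 1) % bits] for i in range(bits))
    return transitions <= 2

print(is_uniform(0b00011110), is_uniform(0b10000011))  # True True
print(is_uniform(0b01010101))                          # False
```

A histogram of these labels (with all non-uniform patterns merged into one bin) over each statistics window gives the texture descriptor that is fed to the SVM.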
Table I. Examples of the JAFFE database (sample images of the Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise expressions).

The size of each image is 256x256 pixels. Two images of each expression for each person are used as training samples and the rest are testing samples; hence the total number of training samples is 140, and the number of testing samples is 73.

Table II shows the recognition rate of each facial expression obtained by the proposed method. The last row is the average recognition rate over the 7 expressions, which is 93.24%. The recognition time for each face image is 0.105 seconds.

Table II. The recognition rate of the proposed method

    Anger       90%
    Disgust     88.89%
    Fear        92.3%
    Happiness   100%
    Neutral     100%
    Sadness     81.8%
    Surprise    100%
    Average     93.24%

We also compare our experimental results with some references. In reference [14], the authors used Gabor features and an NN fusion method. In reference [15], the authors divided the face image into three parts and used the 2DPCA method. The training and test images are the same as for the proposed method. Table III shows the comparison; the average recognition rate of reference [14] is 92.57% and that of reference [15] is 91.6%.

Table III. The comparison results

                 Reference [14]   Reference [15]   Proposed method
    Anger             95%             95.2%            90%
    Disgust           88%             95.2%            88.89%
    Fear             100%             85.7%            92.3%
    Happiness        100%             84.9%            100%
    Neutral           75%             100%             100%
    Sadness           90%             90.4%            81.8%
    Surprise         100%             89.8%            100%
    Average          92.57%           91.6%            93.24%

According to Table III, the proposed method clearly performs better overall than the other two references. Even though some per-expression recognition rates are not as good as those of the reference methods, the proposed method still achieves the highest average recognition rate.

V. CONCLUSIONS

In this paper, we proposed a facial expression recognition method using LBP features. To decrease the computational effort, we detect the face region before applying the LBP method. After extracting the facial features from the detected area, the SVM classifier recognizes the facial expression. Using JAFFE as the experimental database, the proposed method achieves a 93.24% correct rate, better than the two reference methods.

For future work, several aspects remain to be studied. The experiments discussed above share the same property: the training and testing samples come from the same persons. In other words, to recognize someone's expression, we must previously have images of his or her various expressions in the database. This property is not suitable for real applications, and we want to overcome this problem in the future; perhaps we can model the variations between different expressions and use that model for recognition. Other long-standing problems in facial recognition, such as lighting variation and pose changes, also remain to be dealt with. We will try to find a better algorithm to enhance our method.

REFERENCES

[1] L. I. Smith, "A Tutorial on Principal Components Analysis", 2002.
[2] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data with application to face recognition", Pattern Recognition, vol. 34, no. 10, pp. 2067-2070, 2001.
[3] H. Deng, L. Jin, L. Zhen, et al., "A new facial expression recognition method based on local Gabor filter bank and PCA plus LDA", International Journal of Information Technology, vol. 11, no. 11, pp. 86-96, 2005.
[4] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: application to face recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037-2041, 2006.
[5] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face recognition with local binary patterns", Springer-Verlag Berlin Heidelberg, pp. 469-481, 2004.
[6] V. Pavlovic and A. Garg, "Efficient detection of objects and attributes using boosting", IEEE Conf. Computer Vision and Pattern Recognition, 2001.
[7] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", The Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.
[8] C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 955-974, 1998.
[9] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2001, pp. I-511-I-518.
[10] Intel, "Open Source Computer Vision Library", http://sourceforge.net/projects/opencvlibrary/, 2001.
[11] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, July 2002.
[12] D. Simian, "A model for a complex polynomial SVM kernel", Mathematics and Computers in Science and Engineering, pp. 164-169, 2008.
[13] M. Lyons, S. Akamatsu, et al., "Coding facial expressions with Gabor wavelets", Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, pp. 200-205, 1998.
[14] W. Liu and Z. Wang, "Facial expression recognition based on fusion of multiple Gabor features", International Conference on Pattern Recognition, 2006.
[15] B. Hua and T. Liu, "Facial expression recognition based on FB2DPCA and multi-classifier fusion", International Conference on Information Technology and Computer Science, 2009.
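The SVM decision function of Eq. (2) in Section III, f(x) = sum_i y_i a_i k(x, x_i) + b with the polynomial kernel k(x, x_i) = (x . x_i + 1)^n, can be exercised directly. The support vectors, coefficients, and bias below are toy values chosen for illustration, not trained ones:

```python
# Sketch of the kernel SVM decision function from Section III.C.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, xi, n=2):
    """Polynomial kernel k(x, xi) = (x . xi + 1)^n, as used in the paper."""
    return (dot(x, xi) + 1) ** n

def svm_decision(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """Evaluate f(x) = sum_i y_i * a_i * k(x, x_i) + b; sign gives class."""
    f = sum(y * a * kernel(x, xi)
            for xi, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if f >= 0 else -1

# Toy two-class example: one support vector per class, equal weights.
svs    = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.5, 0.5]
print(svm_decision((0.9, 1.1), svs, labels, alphas, b=0.0))   # 1
print(svm_decision((-1.0, -0.8), svs, labels, alphas, b=0.0)) # -1
```

Multi-class recognition (the seven expressions) is then built on top of such binary decisions, e.g. by one-vs-one voting.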
MILLION-SCALE IMAGE OBJECT RETRIEVAL

Yin-Hsi Kuo (郭盈希)(1) and Winston H. Hsu (徐宏民)(1,2)
(1) Dept. of Computer Science and Information Engineering, National Taiwan University, Taipei
(2) Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei

ABSTRACT

In this paper, we present a real-time system that addresses three essential issues of large-scale image object retrieval: 1) image object retrieval: facilitating pseudo-objects in inverted indexing and a novel object-level pseudo-relevance feedback for retrieval accuracy; 2) time efficiency: boosting the time efficiency and memory usage of object-level image retrieval by a novel inverted indexing structure and efficient query evaluation; 3) recall rate improvement: mining semantically relevant auxiliary visual features through visual and textual clusters in an unsupervised and scalable (i.e., MapReduce) manner. We are able to search over a one-million image collection in response to a user query in 121 ms, with significantly better accuracy (+99%) than the traditional bag-of-words model.

Keywords: image object retrieval; inverted file; visual words; query expansion

Figure 1: With the proposed auxiliary visual feature discovery, more accurate and diverse results of image object retrieval can be obtained, and the search quality is greatly improved. Regarding efficiency, because the auxiliary visual words are discovered offline on a MapReduce platform, the proposed system takes less than one second to search over a million-scale image collection in response to a user query.

1. INTRODUCTION

Different from traditional content-based image retrieval (CBIR) techniques, the target images to match might only cover a small region of the database images. This need raises the challenging problem of image object retrieval, which aims at finding images that contain a specific query object rather than images that are globally similar to the query (cf. Figure 1). To improve the accuracy of image object retrieval and ensure retrieval efficiency, in this paper we consider several issues of image object retrieval and propose methods to tackle them accordingly.

State-of-the-art object retrieval systems are mostly based on the bag-of-words (BoW) [6] representation and inverted-file indexing methods. However, unlike textual queries with a few semantic keywords, image object queries are composed of hundreds (or a few thousands) of noisily quantized descriptors. Meanwhile, the target images generally have different visual appearances (lighting conditions, occlusion, etc.). To tackle these issues, we propose to mine visual features semantically relevant to the search targets (see the results in Figure 1) and augment each image with such auxiliary visual features. As illustrated in Figure 5, these features are discovered from visual and textual graphs (clusters) in an unsupervised manner by distributed computing (i.e., MapReduce [1]). Moreover, to facilitate object-level indexing and retrieval, we incorporate the idea of pseudo-objects [4] into the inverted-file paradigm and the pseudo-relevance feedback mechanism. A novel efficient
query evaluation method is also developed to remove unreliable features and further improve accuracy and efficiency.

Experiments show that the automatically discovered auxiliary visual features are complementary to conventional query expansion methods, and their performance is significantly superior to the BoW model. Moreover, the proposed object-level indexing framework is remarkably efficient and takes only 121 ms for searching over the one-million image collection.

Figure 2: The system diagram. Offline part: we extract visual and textual features from images. Textual and visual image graphs are constructed by an inverted list-based approach and clustered by an adapted affinity propagation algorithm on MapReduce (18 Hadoop servers). Based on the graphs, auxiliary visual features are mined by informative feature selection and propagation. Pseudo-objects are then generated by considering the spatial consistency of salient local features. A compact inverted structure is used over pseudo-objects for efficiency. Online part: to speed up image retrieval, we propose an efficient query evaluation approach for inverted indexing. The retrieval process is then completed by relevance scoring and object-level pseudo-relevance feedback. It takes around 121 ms to produce the final image ranking for image object retrieval over one-million image collections.

2. SYSTEM OVERVIEW

Figure 2 shows a schematic plot of the proposed system, which consists of offline and online parts. In the offline part, visual features (VWs) and textual features (tf-idf of expanded tags) are extracted from the images. We then propagate semantically relevant VWs from the textual domain to the visual domain, and remove visually irrelevant VWs in the visual domain (cf. Section 4). All these operations are performed in an unsupervised manner on the MapReduce [1] platform, which is famous for its scalability. Operations including image graph construction, clustering, and mining over million-scale images can thus be performed efficiently. To further enhance efficiency, we index the VWs by the proposed object-level inverted indexing method (cf. Section 3). We incorporate the concept of pseudo-objects and adopt compression methods to reduce memory usage.

In the online part, an efficient retrieval algorithm is employed to speed up the query process without loss of retrieval accuracy. In the end, we apply object-level pseudo-relevance feedback to refine the search result and improve the recall rate. Unlike its conventional counterpart, the proposed object-level pseudo-relevance feedback places more importance on local objects instead of the whole image.

3. OBJECT-LEVEL INVERTED INDEXING

The inverted file is a popular way to index large-scale data in the information retrieval community [8]. Because of its superior efficiency, many recent image retrieval systems adopt the concept to index visual features (i.e., VWs). The intuitive way is to record each entry with <image ID, VW frequency> in the inverted file. However, to the best of our knowledge, most systems simply adopt this conventional method in the visual domain, without considering the differences between documents and images, where the image query is composed of thousands of (noisy) VWs and the object of interest may occupy only small portions of the target images.

3.1. Pseudo-Objects

Images often contain several objects, so we cannot take whole-image features to represent each object; each object has its own distinctive VWs. Motivated by the novelty and promising retrieval accuracy in [4], we adopt the concept of a pseudo-object—a subset of proximate feature points with its own feature vector—to represent a local area. The example in Figure 4 shows that the efficiently discovered pseudo-objects can almost catch the different objects; however, advanced methods such as efficient indexing or query expansion are not considered in [4]. We therefore further propose a novel object-level inverted indexing.

3.2. Index Construction

Unlike document words, VWs have a spatial dimension. Neighboring VWs often correspond to the same object in an image, and an image consists of several objects. We adopt pseudo-objects and store the object information in the inverted file to support object-level image retrieval. Specifically, we construct an inverted list for each VW t as follows, <Image ID i, f_{t,i}, RID_1, ..., RID_f>, which indicates the ID of the image i where the VW appears, the occurrence frequency (f_{t,i}), and the associated object region IDs (RID_1 to RID_f) in each image. The addition of the object ID to the inverted file makes it possible to search for a specific object even if the object occupies only a small region of an image.
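The inverted-list layout described in Section 3.2 can be sketched in a few lines. The following is a minimal illustration with our own (hypothetical) class and method names, not the paper's implementation:

```python
from collections import defaultdict

# Minimal sketch of the object-level inverted file of Section 3.2: each
# posting stores <image ID, frequency f_{t,i}, region IDs>, so a matched
# visual word can be traced back to the pseudo-object(s) containing it.
# All names here are our own illustrative choices.

class ObjectLevelIndex:
    def __init__(self):
        # visual word -> list of (image_id, frequency, region_ids)
        self.postings = defaultdict(list)

    def add_image(self, image_id, regions):
        """regions: {region_id: [visual word, ...]} for one image."""
        occurrences = defaultdict(list)
        for region_id, words in regions.items():
            for vw in words:
                occurrences[vw].append(region_id)
        for vw, region_ids in occurrences.items():
            self.postings[vw].append((image_id, len(region_ids), region_ids))

    def lookup(self, vw):
        return self.postings.get(vw, [])
```

Because every posting carries its region IDs, a match can be credited to a specific pseudo-object rather than to the whole image, which is what enables the object-level scoring of Section 3.4.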
Figure 3: Illustration of efficient query evaluation (cf. Section 3). To achieve time efficiency, we first rank the visual words by their salience to the query and then retrieve the designated number of candidate images (e.g., 7 images, A to G). After deciding the candidate images, we skip the irrelevant images and cut the non-salient VWs.

3.3. Index Compression

Index compression is a common way to reduce memory usage in the textual domain. First, we discard the top 5% most frequent VWs as stop words, which decreases the mismatch rate and reduces the size of the inverted file. We then adopt different coding methods to compress the data based on their characteristics. Image IDs are ordinal numbers sorted in ascending order in the lists, so we store the difference between adjacent image IDs instead of the image ID itself, which is called the d-gap [8]. For region IDs, we adopt a fixed-length bit-level coding of three bits (e.g., R2 → 010). On the other hand, we use a variable-length bit-level coding to encode the frequency (e.g., 3 → 1110). Furthermore, we implement AND and SHIFT operations to efficiently decode the frequency and region IDs at query time. The memory space for indexing pseudo-objects is reduced by about 54.1%.

3.4. Object-Level Scoring Method

We use the intersection of TFIDF, which performs the best for matching, to calculate the score of each region indexed by VW t. Besides the discovered pseudo-objects, we also define a new object R0 to treat the whole image as another object. We first calculate the score of every pseudo-object (R) with respect to the query object (Q) as follows,

score(R, Q) = Σ_{t∈Q} IDF_t × min(w_{t,R}, w_{t,Q}),  (1)

where w_{t,R} and w_{t,Q} are the normalized VW frequencies in the pseudo-object and in the query, respectively. Then the pseudo-object with the highest score is regarded as the most relevant object with respect to the query, as suggested in [4]:

score(i, Q) = max{score(R, Q) | R ∈ i}.  (2)

Figure 4: Object-level retrieval results by pseudo-objects and object-level pseudo-relevance feedback. The letter below each image represents the region (pseudo-object) with the highest relevance to the query object by (2). The region information is essential for query expansion. Instead of using the whole image as the seed for retrieving other related images, we can easily identify the related objects (e.g., R0, R5, R0) and mitigate the influence of noisy features. Note that the yellow dots in the background are detected feature points.

3.5. Efficient Query Evaluation (EQE)

Conventional query evaluation in inverted indexing needs to keep track of the scores of all images in the inverted lists. In fact, it is observed that most of the scored images contain only a few matched VWs. We propose an efficient query evaluation (EQE) algorithm that explores only a small part of a large-scale database to reduce the online retrieval time. The procedures of EQE are described below and illustrated in Figure 3.

1. Query term ranking: The ranking score in (1) favors query terms with higher frequency and IDF_t; therefore, we sort the query terms according to their salience, calculated as w_{t,Q} × IDF_t for VW t. The following phases then process the VWs sequentially, ordered and weighted by their visual significance to the query.

2. Collecting phase: In the retrieval process, the user only cares about the images in the top ranks.
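As a concrete toy sketch of the object-level scoring in (1) and (2), the following computes the min-intersection TFIDF score per region and takes the best region as the image score; all weights and region names below are invented for illustration:

```python
# Toy sketch of the object-level scoring of (1) and (2): each region R of an
# image is scored by the min-intersection of TFIDF weights with the query Q,
# and the best-scoring region determines the image score. Data are invented.

def score_region(region_w, query_w, idf):
    """score(R, Q) = sum over t in Q of IDF_t * min(w_{t,R}, w_{t,Q})."""
    return sum(idf.get(t, 0.0) * min(region_w.get(t, 0.0), w_q)
               for t, w_q in query_w.items())

def score_image(regions, query_w, idf):
    """score(i, Q) = max over regions R in i; returns (score, best region)."""
    best = max(regions, key=lambda r: score_region(regions[r], query_w, idf))
    return score_region(regions[best], query_w, idf), best
```

Returning the best region alongside the score mirrors Figure 4, where the most relevant pseudo-object of each retrieved image is the one reused as the seed for object-level pseudo-relevance feedback.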
Figure 5: Image clustering results and mining of auxiliary visual words. (a) visual cluster example; (b) representative VW selection; (c) example results; (d) auxiliary VW propagation; (e) textual cluster example. (a) and (e) show sample visual and textual clusters; the former keeps visually similar images in the same cluster, while the latter favors semantic similarities. The former facilitates representative VW selection, while the latter facilitates semantic (auxiliary) VW propagation. (b) and (d) illustrate the selection and propagation operations based on the cluster histogram, as detailed in Section 4. A simple example is shown in (c).

Therefore, instead of calculating the score of each image, we score the top images of the inverted lists and add them to a set S until we have collected a sufficient number of candidate images.

3. Skipping phase: After deciding the candidate images, we skip the images that do not appear in the collecting phase. For every image i in the inverted list, we score the image i if i ∈ S and otherwise skip it. If the number of visited VWs reaches a predefined cut ratio, we go on to the next phase.

4. Cutting phase: We simply remove the remaining VWs, which usually have little influence on the results, and the process stops here.

This algorithm works remarkably well, bringing about almost the same retrieval quality at much lower computational cost. As image queries are generally composed of hundreds or thousands of (noisy) VWs, rejecting the non-salient VWs significantly improves the efficiency and slightly improves the accuracy.

3.6. Object-Level Pseudo-Relevance Feedback (OPRF)

The conventional approach of using whole images for pseudo-relevance feedback (PRF) may not perform well when only a part of the retrieved images are relevant. In such a case, many irrelevant objects would be included in PRF, resulting in too many query terms (or noise) and degrading the retrieval accuracy. To tackle this issue, a novel object-level pseudo-relevance feedback (OPRF) algorithm is proposed. Rather than using whole images, we select the most important object from each of the top-ranked images and use them for PRF. The importance of each object is estimated according to (2). By selecting the relevant object in each image (e.g., R0, R5, R0 in Figure 4), we can further remove irrelevant objects such as the toy in R4 of the second image.

4. AUXILIARY VISUAL WORD (AVW) DISCOVERY

Due to the limitations of VWs, it is difficult to retrieve images with different viewpoints, lighting conditions, occlusions, etc. To improve the recall rate, query expansion is the most adopted method; however, it is limited by the quality of the initial retrieval results. Instead, in an offline stage, we augment each image with auxiliary visual features, considering the representative (dominant) features in its visual cluster and the semantically related features in its textual graph, respectively. Such auxiliary visual features can significantly improve the recall rate, as demonstrated in Figure 1. We can deploy all the processes in a parallel way by MapReduce [1]. Besides, a by-product of auxiliary visual word discovery is a reduction of the number of indexed visual features for each image, giving better efficiency in time and memory. Moreover, it is easy to embed the auxiliary visual features in the proposed indexing framework by adding one new region for those discovered auxiliary visual features not existing in the original VW set.

4.1. Image Clustering by MapReduce

The image clustering is first based on a graph construction. The images are represented by 1M VWs and 50K text tokens expanded by Google snippets from their associated (noisy) tags. However, it is very challenging to construct image graphs for million-scale images. To tackle the scalability problem, we construct
image graphs using the MapReduce model [1], a scalable framework that simplifies distributed computations. We take advantage of the sparseness and use the cosine measure as the similarity measure. Our algorithm extends the method proposed in [2], which uses a two-phase MapReduce model—an indexing phase and a calculation phase—to calculate pairwise similarities. It takes around 42 minutes to construct a graph of 550K images on 18-node Hadoop servers. To cluster images on the image graph, we apply affinity propagation (AP), proposed in [3]. AP is a graph-based clustering algorithm. It passes and updates messages among the nodes of the graph iteratively and locally—associating with the sparse neighbors only. It takes around 20 minutes per iteration, and AP generally converges in around 20 iterations (~400 minutes) for 550K images under the MapReduce model.

The image clustering results are sampled in Figure 5(a) and (e). Note that if an image is close to the canonical image (center image), it has a higher AP score, indicating that it is more strongly associated with the cluster. Moreover, images in the same visual cluster are often visually similar to each other, whereas some of the images in the same textual cluster differ in view, lighting condition, angle, etc., and can potentially bring complementary VWs to other images in the same textual cluster.

4.2. Representative Visual Word Selection

We first propose to remove irrelevant VWs in each image to mitigate the effects of noise and quantization error, to reduce memory usage in the inverted file system, and to speed up search. We observe that images in the same visual cluster are visually similar to each other (cf. Figure 5(a)). As illustrated in Figure 5(c), the middle image can then obtain representative VWs from the visual cluster it belongs to. We accumulate the counts of each VW over the images of a cluster to form a cluster histogram. As shown in Figure 5(b), each image donates the same weight to the cluster histogram. We can then select the VWs whose occurrence frequency is above a predefined threshold (e.g., in Figure 5(b) the VWs in red rectangles are selected).

4.3. Auxiliary Visual Word Propagation

Due to varying capture conditions, some VWs that strongly characterize the query object may not appear in the query image. It is also difficult to obtain these VWs through query expansion methods such as PRF because of the difference in visual appearance between the query image and the retrieved ones. Mining semantically relevant VWs from other information sources, such as text, is therefore essential to improve the retrieval accuracy. As illustrated in Figure 5(e), we propose to augment each image with VWs propagated from the textual cluster result. This is based on the observation that images in the same textual cluster are semantically close but usually visually different; therefore, these images provide a comprehensive view of the same object. Propagating the VWs from the textual domain can thus enrich the visual descriptions of the images. As the example in Figure 5(c) shows, the bottom image can obtain auxiliary VWs for a different lighting condition of the Arc de Triomphe. The similarity score can be weighted to decide the number of VWs to be propagated. Specifically, we derive the VW histogram from the images of each cluster and then propagate VWs based on the cluster histogram weighted by its (semantic) similarity to the canonical image of the textual cluster.

4.4. Combining Selection and Propagation

The selection and propagation operations described above can be performed iteratively. The selection operation removes visually irrelevant VWs and improves memory usage and efficiency, whereas the propagation operation obtains semantically relevant VWs to improve the recall rate. Though propagation may include too many VWs and thus decrease the precision, we can perform selection after propagation to mitigate this effect. A straightforward approach is to iterate the two operations until convergence. However, we find that it is enough to perform a selection first, a propagation next, and finally a selection, for the following reasons. First, only the propagation step updates the auxiliary visual features, and the textual cluster images are fixed; each image obtains its distinctive VWs at the first propagation step, and subsequent propagation steps only modify the frequencies of the VWs. As the objective is to obtain distinctive VWs, frequency is less important here. Second, binary feature vectors perform better than, or at least comparably to, real-valued ones.

5. EXPERIMENTS

5.1. Experimental Setup

We evaluate the proposed methods using a large-scale photo retrieval benchmark—Flickr550 [7]. Besides, we randomly add Manhattan photos to Flickr550 to make it a 1-million dataset. As suggested by much of the literature (e.g., [5]), we use the Hessian-affine detector to extract feature points from images. The feature points are described by SIFT and quantized into 1 million VWs for better performance. In addition, we use the average precision to evaluate the retrieval accuracy. Since the average precision only shows the performance for a single image query, we compute the mean average precision (MAP) to represent the system performance over all the queries.

5.2. Experimental Results
Table 1: Summary of the impacts on performance and query time compared with the baseline methods. Our proposed methods achieve better retrieval accuracy and respond to a user query in 121 ms over one-million photo collections. The number in parentheses indicates the relative gain over the baseline, and the symbol '%' stands for the relative improvement over the BoW model [6].

(a) Image object retrieval
MAP — Pseudo-objects [4]: 0.251 | PRF: 0.290 (+15.5%) | OPRF: 0.324 (+29.1%)

(b) Time efficiency
Query time (ms) — Pseudo-objects [4] on Flickr550: 854 | +EQE on Flickr550: 56 | +EQE on one million: 121

(c) Recall rate improvement
MAP — BoW model [6]: 0.245 (–) | AVW: 0.352 (43.7%) | AVW+OPRF: 0.487 (98.8%)

We first evaluate the performance of object-level PRF (OPRF) in boosting the retrieval accuracy. As shown in Table 1(a), OPRF outperforms PRF by a great margin (relative improvement 29.1% vs. 15.5%). The result shows that the pseudo-object paradigm is essential for PRF-based query expansion in object-level image retrieval, since the targets of interest might only occupy a small portion of the images.

We then evaluate the query time of object-level inverted indexing augmented with efficient query evaluation (EQE). The query time is 15.2 times faster (854 → 56) after combining with the EQE method, as shown in Table 1(b). The gains are attributable to the selection of salient VWs and the skipping of insignificant inverted lists. This is essential since, unlike textual queries with 2 or 3 query terms, an image query might contain hundreds (or thousands) of VWs. As a result, we can respond to a user query in 121 ms over one-million photo collections.

Finally, to improve recall, we evaluate the performance of auxiliary visual word (AVW) discovery. As shown in Table 1(c), the combination of selection, propagation and further OPRF brings a 99% relative improvement over the BoW model and reduces the indexed feature points by one fifth. This result shows that the selection and propagation operations are effective in mining useful features and removing irrelevant ones. In addition, the relative improvement of AVW (+44%) is orthogonal and complementary to OPRF (0.352 → 0.487, +38%).

6. CONCLUSIONS

In this paper, we cover four aspects of a large-scale retrieval system: 1) image object retrieval over one-million image collections—responding to user queries in 121 ms; 2) the impact of object-level pseudo-relevance feedback—boosting retrieval accuracy; 3) time efficiency with efficient query evaluation in the inverted file paradigm—compared with the traditional inverted file structure; and 4) image object retrieval based on effective auxiliary visual feature discovery—improving the recall rate. That is to say, the efficiency and effectiveness of the proposed methods are validated over large-scale consumer photos.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," OSDI, 2004.
[2] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with MapReduce," ACL, 2008.
[3] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, 2007.
[4] K.-H. Lin, K.-T. Chen, W. H. Hsu, C.-J. Lee, and T.-H. Li, "Boosting object retrieval by estimating pseudo-objects," ICIP, 2009.
[5] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," CVPR, 2007.
[6] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," ICCV, 2003.
[7] Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W. H. Hsu, and H. Chen, "ContextSeer: context search and recommendation at query time for shared consumer photos," ACM MM, 2008.
[8] J. Zobel and A. Moffat, "Inverted files for text search engines," ACM Computing Surveys, 2006.
Sport Video Highlight Extraction Based on Kernel Support Vector Machines

Po-Yi Sung, Ruei-Yao Haung, and Chih-Hung Kuo
Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
{ n2895130 , n2697169 , chkuo }@mail.ncku.edu.tw

Abstract—This paper presents a generalized highlight extraction method based on kernel support vector machines (kernel SVM) that can be applied to various types of sport video. The proposed method extracts highlights without any predefined rules for the highlight events. The framework is composed of a training mode and an analysis mode. In the training mode, the kernel SVM is applied to train a classification plane for a specific type of sport from the shot features of selected video sequences, and the genetic algorithm (GA) is then adopted to optimize the kernel parameters and select features to improve the classification accuracy. In the analysis mode, we use the classification plane to generate the video highlights of a sport video. Accordingly, viewers can access important segments quickly without watching through the entire sport video.

Keywords—Highlight extraction; Sport analysis; Kernel support vector machines; Genetic algorithm

I. INTRODUCTION

Due to the rapid growth of multimedia storage technologies, such as the Portable Multimedia Player (PMP), HD DVD and Blu-ray DVD, large amounts of video content can be saved on a small storage device. However, people may not have sufficient time to watch all the recorded programs. They may prefer to skip the less important parts and only watch the remarkable segments, especially for sport videos. Highlight extraction is a technique that uses video content analysis to index significant events in video data, and thereby helps viewers access the desired parts of the content more efficiently. This technique can also help the processes of summarization, retrieval, and abstraction over large video databases.

In this paper, we focus on highlight extraction techniques for sport videos. Many works have been proposed that identify objects appearing frequently in sport highlights. Xiong [1] proposes a technique that extracts audio and video objects that frequently appear in highlight scenes, like applause, the baseball catcher, the soccer goalpost, and so on. Tong [2] characterized three essential aspects of sport videos: focus ranges of the camera, object types, and video production techniques. Hanjalic et al. [3]-[4] measured three factors—motion activity, density of cuts, and audio energy—with a derived function to detect highlights. In [5], Duan proposes a technique that searches shots with goalposts and excited voices to find highlights for soccer programs. To locate scenes of the goalposts in football games, the technique of Chang [6] detects white lines in the field, and then verifies touch-down shots via audio features. Wan [7] detects voices in commentaries with high volume, combined with the frequency of shot changes and other visual features, to locate goal events. Huang [8] exploited color and motion information to find logo objects in the replays of sport videos. All these techniques depend on predefined rules for a single specific type of sport video, and as a result may need a lot of human effort to analyze the video sequences and identify the proper objects for highlights in the particular type of sport.

Many other techniques have employed probabilistic models, such as Hidden Markov Models (HMM), to look for the correlations of events and the temporal dependency of features [9]-[15]. The selected scene types are represented by hidden states, and the state transition probabilities can be evaluated by the HMM. Highlights can be identified accurately by some specific transition rules. However, it is hard to include all types of highlight events in the same set of rules, and the model may fail to detect highlights if the video features differ from the original ones. Cheng [16] proposed a likelihood model to extract audio and motion features, and employed an HMM to detect the transitions of the integrated representation for the highlight segments. These methods all need to estimate the probabilities of state transitions, which have to be set up through intense human observation.

Most of the previous research has adopted rule-based methods, whereby the rules are heuristically set to describe the dynamics among objects and scenes in the highlight events of a specific sport. The rules set for one kind of sport video usually cannot be applied to other kinds. In [17], we proposed a more generalized technique based on low-level semantic features. With this approach, we can generate highlight tempo curves without defining complicated transitions among hidden states, and hence we can apply the technique to various kinds of videos.
In this paper, we extend our technique [17] and incorporate it into the framework of kernel support vector machines (kernel SVM). For each type of sport video, a small number of highlight shots are input so that some unified features can be extracted. The kernel SVM system is then applied to train the classification plane, and the trained classification plane is used to analyze other input videos of the same sport type, generating the highlight shots.

The rest of this paper is organized as follows. Section II presents an overview of the proposed system. Section III details the methods for highlight shot classification and highlight shot generation. The highlight extraction performance and experimental results are shown in Section IV. Section V concludes the paper.

II. PROPOSED HIGHLIGHT SHOT EXTRACTION SYSTEM OVERVIEW

Fig. 1 shows the four stages of the proposed scheme: (1) shot change detection, (2) visual and audio feature computation, (3) kernel SVM training and analysis, and (4) highlight shot generation. In the first stage, histogram differences are counted to detect the shot change points. In the second stage, the feature parameters of each shot are computed and taken as the input eigenvalues of the kernel SVM training and analysis system. The shot eigenvalues include shot length (L), color structure (C), shot frame difference (Ds), shot motion (Ms), keyframe difference (Dkey), keyframe motion (Mkey), Y-histogram difference (Yd), sound energy (Es), sound zero-crossing rate (Zs) and short-time sound energy (Est). They are collected as a feature set for the i-th shot

Vi = {L, C, Ds, Ms, Dkey, Mkey, Yd, Es, Zs, Est}.  (1)

In the third stage, the kernel SVM either trains the parameters or analyzes the input features, according to the mode of the system. Then, in the last stage, highlight shots are generated based on the output of the kernel SVM. We explain the first two stages in the following; the other two stages are explained in Section III.

Figure 1. The proposed highlight shot extraction system (block diagram: shot change detection on the video and audio data; visual and audio feature computation; kernel SVM training mode with GA parameter optimization and feature selection for baseball, basketball and soccer; kernel SVM analysis mode; and highlight shot generation).

A. Shot Change Detection

The task in this stage is to detect the transition point from one scene to another. Histogram differences of two consecutive frames are calculated by (2) to detect the shot changes in video sequences. A shot change is declared if the histogram difference is greater than a predefined threshold. The pixel values employed to calculate the histogram contain luminance only, since the human visual system is more sensitive to luminance (brightness) than to colors. The histogram difference is computed by the equation

D_I = ( Σ_{i=0}^{255} |H_I(i) − H_{I−1}(i)| ) / N  (2)

where N is the total number of pixels in a frame, and H_I(i) is the number of pixels at level i in the I-th frame. Finally, the video sequence is separated into several shots according to the shot change detection results.

B. Visual and Audio Feature Computation

Each shot may contain many frames. To reduce the computational complexity, we select a keyframe to represent the shot. In this work, we simply define the 10th frame of each shot as the keyframe, since it is usually more stable than the previous frames, which may contain mixed frames during the scene transition. Many of the following features are extracted from this keyframe.

1) Shot Length

We designate the number of frames in each shot as the shot length (L). Experiments show that shot lengths are shorter in non-highlight shots, such as shots with judges or scenes with special effects; a highlight shot is often longer than a non-highlight shot. For example, pitching in baseball games and shooting at the goal in soccer games usually yield longer shots. Hence, the shot length is an important feature for the highlights and is included as one of the input eigenvalues.

2) MPEG-7 Color Structure

The color structure descriptor (C) is defined in the MPEG-7 standard [18], [19] to describe the structuring property of video content. Unlike a simple histogram statistic, it counts the color histograms based on a moving window called the structuring element. The descriptor value of the corresponding bin in the color histogram is increased by one if the specified color is within the structuring element. Compared to the simple statistic of one histogram, the color structure descriptor can better reflect the grouping properties of a picture. A smaller C value means the image is more structured. For example, both of the two monochrome images in Fig. 2 have 85 black pixels, and hence their histograms are the same. The color structure descriptor C of the image in Fig. 2-(a) is 129, while the image in Fig. 2-(b) is more scattered, with a C value of 508. Fig. 3 depicts the curve of the C values in the video of a baseball game. It shows that pictures with a scattered structure usually have higher C values.
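The shot-change detector of Section II-A can be sketched directly from (2). This is a minimal illustration: frames are flat lists of 8-bit luma values, and the threshold value is our own illustrative choice, not one reported by the paper.

```python
# Sketch of the shot-change detector of Section II-A, using the normalized
# luminance-histogram difference of (2). Frames are flat lists of 8-bit luma
# values; the threshold is an illustrative assumption.

def histogram(frame, levels=256):
    h = [0] * levels
    for y in frame:
        h[y] += 1
    return h

def histogram_difference(frame_a, frame_b):
    """D_I = sum_i |H_I(i) - H_{I-1}(i)| / N, with N the pixels per frame."""
    ha, hb = histogram(frame_a), histogram(frame_b)
    n = len(frame_a)
    return sum(abs(a - b) for a, b in zip(ha, hb)) / n

def shot_boundaries(frames, threshold=0.5):
    """Indices I at which frame I starts a new shot."""
    return [i for i in range(1, len(frames))
            if histogram_difference(frames[i], frames[i - 1]) > threshold]
```

Because only the histograms are compared, the detector is insensitive to motion within a scene but responds strongly to a cut, which is exactly the behavior (2) is meant to capture.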
Figure 2. The MPEG-7 color structure: (a) a highly structured monochrome image; (b) a scattered monochrome image. Both have the same histogram.

Figure 3. MPEG-7 color structure descriptor curve in a baseball game.

In this paper, we perform edge detection before calculating the color structure descriptors. The resultant C value of each keyframe is regarded as an eigenvalue and included in the input data set of the kernel SVM.

3) Shot Frame Difference

The average shot frame difference (Ds) of each shot is defined by

Ds = (1/(L−1)) Σ_{n=1}^{L−1} (1/(W×H)) Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} |f_n(i, j) − f_{n−1}(i, j)|  (3)

where W and H are the frame width and height respectively, and f_n(i, j) is the pixel intensity at position (i, j) in the n-th frame. This feature shows the frame activity in a shot. In general, highlight shots have higher Ds values than non-highlight shots.

4) Shot Motion

To measure the motion activity, we first partition a frame into square blocks of size K-by-K pixels, and perform motion estimation to find the motion vector of each block [20]. The shot motion Ms is defined as the average magnitude of the motion vectors,

Ms = (1/((L−1)·W·H/K²)) Σ_{n=1}^{L−1} Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} sqrt( MV_{x,n}(i, j)² + MV_{y,n}(i, j)² )  (4)

where W and H are the block numbers in the horizontal and vertical directions respectively, and MV_{x,n}(i, j) and MV_{y,n}(i, j) are the motion vectors in the x and y directions, respectively, of the block at the i-th row and j-th column in the n-th frame of the shot. The motion vector of a block represents the displacement in the reference frame from the co-located block to the best-matched square, and is found by minimizing the sum of absolute errors (SAE) [21],

SAE = Σ_{i=0}^{K−1} Σ_{j=0}^{K−1} |C(i, j) − R(i, j)|  (5)

where C(i, j) is the pixel intensity of the current block at relative position (i, j), and R(i, j) is the pixel intensity of the reference block.

5) Keyframe Difference and Keyframe Motion

We calculate the frame difference and estimate the motion activity between the keyframe and its next frame. Suppose the k-th frame is a keyframe. The keyframe difference Dkey of the shot is defined by

Dkey = Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} |f_k(i, j) − f_{k+1}(i, j)|  (6)

where f_k(i, j) represents the intensity of the pixel at position (i, j) in the k-th frame. Similarly, the keyframe motion Mkey represents the average magnitude of the motion vectors inside the keyframe and is defined as

Mkey = (1/(W·H/K²)) Σ_{i=0}^{W−1} Σ_{j=0}^{H−1} sqrt( MV_x(i, j)² + MV_y(i, j)² )  (7)

where MV_x(i, j) and MV_y(i, j) denote the components of the motion vectors in the x and y directions respectively.

6) Y-Histogram Difference

The average Y-histogram difference is calculated by

Yd = (1/(L−1)) Σ_{n=1}^{L−1} ( Σ_{i=0}^{255} |H_n(i) − H_{n−1}(i)| ) / (W×H)  (8)

where H_n(i) represents the number of pixels at level i, counted in the n-th frame. In general, the value of Yd is higher in highlight shots.

7) Sound Energy

The sound energy Es is defined as

Es = ( Σ_{n=1}^{M} |S(n)·S(n)| ) / M  (9)

where S(n) is the signal strength of the n-th audio sample in a shot, and M is the total number of audio samples in the duration of the corresponding shot. In a highlight shot, the sound energy is usually higher than in non-highlight shots.
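The SAE block matching behind (4), (5) and (7) can be sketched with a toy full-search matcher. Frame layout, block size, and search range below are illustrative assumptions; a production encoder would use a much faster search [21].

```python
# Toy full-search block matcher for the SAE criterion of (5). Frames are 2-D
# lists of intensities; block size K and the +/-search window are illustrative.

def sae(cur, ref, cx, cy, rx, ry, k):
    """Sum of absolute errors between the KxK current block at (cx, cy)
    and the KxK reference block at (rx, ry)."""
    total = 0
    for i in range(k):
        for j in range(k):
            total += abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
    return total

def motion_vector(cur, ref, cx, cy, k, search=1):
    """Displacement (dx, dy) minimizing SAE within +/-search pixels."""
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx and 0 <= ry and rx + k <= len(ref[0]) and ry + k <= len(ref):
                cost = sae(cur, ref, cx, cy, rx, ry, k)
                if best is None or cost < best:
                    best, best_mv = cost, (dx, dy)
    return best_mv
```

Averaging the magnitudes of the vectors returned for every block of every frame then yields the shot motion Ms of (4), and restricting the computation to the keyframe yields Mkey of (7).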
8) Sound Zero-crossing Rate
We also adopt the zero-crossing rate (Zs) of the audio signals as one of the input features, since it is a simple indicator of the audio frequency content. Experiments indicate that the zero-crossing rate becomes higher in highlight shots. The zero-crossing rate is defined as

    Zs = (fs/(2M)) Σ_{i=1}^{M} |sign[S(i)] - sign[S(i-1)]|,    (10)

where fs is the audio sampling rate and the sign function is defined by

    sign[S(i)] = +1, if S(i) > 0; 0, if S(i) = 0; -1, otherwise.    (11)

9) Short-time Sound Energy
Since the crowd sounds usually last for only 1 or 2 seconds, the shot-level sound energy cannot represent the crowd sounds for video shots of longer length. Thus, we select the short-time sound energy (Est) as one of the input eigenvalues. The short-time sound energy is defined as

    e(p) = (1/24000) Σ_{n=1}^{24000} S_p(n)²,
    Est = max{e(1), e(2), e(3), ..., e(m)},    (12)

where S_p(n) is the signal strength of the n-th audio sample in the p-th second of the video shot, e(p) is the sound energy of the p-th second of the video shot, and m is the duration (in seconds) of the video shot.

III. HIGHLIGHT SHOT CLASSIFICATION METHOD

A. Kernel SVM Training and Analysis System
In this work, the Kernel SVM is adopted to analyze the input videos and generate the highlight shots. In the training mode, the selected shots for a specific sport type are fed into the system to train the classification hyperplanes, and we apply a genetic algorithm (GA) to select features and optimize the kernel parameters for the support vector machines. In the analysis mode, the system simply loads these pre-stored parameters and generates the highlight shots for the input sport video. We explain the process in detail in the following.

1) Support Vector Machines
SVM is a machine learning technique first suggested by Vapnik [22] and has widespread applications in classification, pattern recognition, and bioinformatics. Typical concepts of SVM are aimed at solving binary classification problems [23]. The data may be multidimensional and form several disjoint regions in the space. The feature of SVM is to find the decision functions that optimally separate the data into two classes. In this section, we briefly explain the basic idea of constructing the SVM decision functions.

a) Linear SVM
Given a training set (x1, y1), (x2, y2), ..., (xi, yi), with xn ∈ R^n and yn ∈ {+1, -1} for n = 1, ..., i, where i is the total number of training data, each training data point xn is associated with one of two classes characterized by the value yn = ±1. In linear SVM theory, the decision function is supposed to be a linear function, defined as

    f(x) = w^T x + b,    (13)

where w, x ∈ R^n and b ∈ R; w is the weighting vector of hyperplane coefficients, x is a data vector in the space, and b is the bias. The decision function lies halfway between two hyperplanes referred to as the support hyperplanes. SVM is expected to find the linear function f(x) = 0 that separates the two classes of data. Fig. 4(a) shows a decision function that separates two classes of data. For separable data, there are many possible decision functions. The basic idea is to determine the margin that separates the two support hyperplanes and to maximize this margin in order to find the optimal decision function. As shown in Fig. 4(b), the two support hyperplanes consist of the data points that satisfy w^T x + b = 1 and w^T x + b = -1, respectively. For example, the data points x1 of the positive class (yn = +1) lead to positive values, and the data points x2 of the negative class (yn = -1) lead to negative values. The perpendicular distance between the two hyperplanes is 2/||w||. In order to maximize the margin and find the optimal hyperplanes, we must minimize ||w||. Therefore, the data points have to satisfy the conditions expressed as one set of inequalities:

    y_j (w^T x_j + b) ≥ 1, for j = 1, 2, 3, ..., i.    (14)

The problem of solving for w and b can be reduced to the following optimization problem:

    Minimize (1/2)||w||²,
    subject to y_j (w^T x_j + b) ≥ 1, for j = 1, ..., i.    (15)

This is a quadratic programming (QP) problem and can be solved with the following Lagrange function [24]:

    L(w, b, α) = (1/2) w^T w - Σ_{j=1}^{i} α_j [y_j (w^T x_j + b) - 1],    (16)

where α_j denotes a Lagrange multiplier. The w, b, and α_j at the optimum that minimize (16) are obtained. Then, we follow the Karush-Kuhn-Tucker (KKT) conditions to
simplify this optimization problem. The optimization problem has to satisfy the KKT conditions defined by

    w = Σ_{j=1}^{i} α_j y_j x_j,
    Σ_{j=1}^{i} α_j y_j = 0,    (17)
    α_j ≥ 0, for j = 1, ..., i,
    α_j [y_j (w^T x_j + b) - 1] = 0, for j = 1, ..., i.

Substituting (17) into (16), the Lagrange function is transformed into the following dual problem:

    Maximize L(α) = Σ_{j=1}^{i} α_j - (1/2) Σ_{j,k=1}^{i} y_j y_k α_j α_k x_j^T x_k,    (18)
    subject to Σ_{j=1}^{i} α_j y_j = 0, α_j ≥ 0, j = 1, ..., i.

We solve this dual problem to find the Lagrange multipliers α_j, and substitute them into (19) to find the optimal w and b:

    w = Σ_{j=1}^{i} α_j y_j x_j,
    b = (1/N_sv) Σ_{s=1}^{N_sv} (1/y_s - w^T x_s),    (19)

where the x_s are the data points whose Lagrange multipliers satisfy α_j > 0 (the support vectors), y_s is the class of x_s, and N_sv is the number of such points.

b) Linear Generalized SVM
In the case where the data are not linearly separable, as shown in Fig. 4(c), the optimization problem in (15) is infeasible. The concepts of linear SVM can be extended to the linearly non-separable case. Rewrite (14) as (20) by introducing non-negative slack variables ξ_j:

    y_j (w^T x_j + b) ≥ 1 - ξ_j, for j = 1, ..., i.    (20)

The above inequality constraints are minimized through a penalized objective function. The optimization problem can then be written as

    Minimize (1/2)||w||² + C Σ_{j=1}^{i} ξ_j,    (21)
    subject to y_j (w^T x_j + b) ≥ 1 - ξ_j, for j = 1, ..., i,

where C is the penalty parameter. This optimization problem can also be solved with the Lagrange function and transformed into the dual problem:

    Maximize L(α) = Σ_{j=1}^{i} α_j - (1/2) Σ_{j,k=1}^{i} y_j y_k α_j α_k x_j^T x_k,    (22)
    subject to Σ_{j=1}^{i} α_j y_j = 0, 0 ≤ α_j ≤ C, j = 1, ..., i.

Similarly, we can solve this dual problem and find the optimal w and b.

c) Non-linear SVM
The SVM can be extended to nonlinear cases by projecting the original data sets to a higher-dimensional space, referred to as the feature space, via a mapping function φ associated with a kernel function. The nonlinear decision function is obtained by formulating the linear classification problem in the feature space. In nonlinear SVM, the inner products x_j^T x_k in (22) can be replaced by the kernel function k(x_j, x_k) = φ(x_j)^T φ(x_k). Therefore, the dual problem in (22) can be replaced by the following:

    Maximize L(α) = Σ_{j=1}^{i} α_j - (1/2) Σ_{j,k=1}^{i} y_j y_k α_j α_k k(x_j, x_k),    (23)
    subject to Σ_{j=1}^{i} α_j y_j = 0, 0 ≤ α_j ≤ C, j = 1, ..., i.

As in (19), we can solve the above dual problem and find the optimal w and b. The classification is then obtained from the sign of

    sign( Σ_{j: α_j > 0} y_j α_j k(x, x_j) + b ).    (24)

d) Types of Kernels
The most commonly used kernel functions are the multivariate Gaussian radial basis function (MGRBF), the Gaussian radial basis function (GRBF), the polynomial function, and the sigmoid function.

MGRBF:

    k(x_j, x_k) = φ(x_j)^T φ(x_k) = exp( -Σ_{m=1}^{n} (x_jm - x_km)² / (2σ_m²) ),    (25)

where σ_m ∈ R, x_jm, x_km ∈ R, and x_j, x_k ∈ R^n; x_jm is the m-th element of x_j, x_km is the m-th element of x_k, σ_m is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.
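The MGRBF kernel of Eq. (25) and the GRBF kernel of Eq. (26) differ only in using a per-dimension width σ_m versus a single width σ. A minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

def grbf_kernel(xj, xk, sigma):
    # GRBF, Eq. (26): exp(-||xj - xk||^2 / (2 sigma^2)).
    d = np.asarray(xj, dtype=np.float64) - np.asarray(xk, dtype=np.float64)
    return np.exp(-d.dot(d) / (2.0 * sigma ** 2))

def mgrbf_kernel(xj, xk, sigmas):
    # MGRBF, Eq. (25): an independent width sigma_m per dimension.
    d = np.asarray(xj, dtype=np.float64) - np.asarray(xk, dtype=np.float64)
    s = np.asarray(sigmas, dtype=np.float64)
    return np.exp(-np.sum(d ** 2 / (2.0 * s ** 2)))
```

With all σ_m set equal, the MGRBF reduces to the GRBF, so the MGRBF is strictly the more flexible family when its widths are tuned (here, by the GA).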
GRBF:

    k(x_j, x_k) = φ(x_j)^T φ(x_k) = exp( -||x_j - x_k||² / (2σ²) ),    (26)

where σ ∈ R and x_j, x_k ∈ R^n; σ is the adjustable parameter of the Gaussian kernel, and x_j, x_k are input data.

Polynomial function:

    k(x_j, x_k) = φ(x_j)^T φ(x_k) = (1 + x_j^T x_k)^d,    (27)

where d is a positive integer and x_j, x_k ∈ R^n; d is the adjustable parameter of the polynomial kernel, and x_j, x_k are input data.

2) Kernel SVM Input Data Structure
In sport videos, a highlight event usually consists of several consecutive shots. Fig. 5 shows an example of a home run in a baseball game. It includes three consecutive shots: pitching and hitting, ball flying, and base running. Unlike many other highlight extraction algorithms that have to predefine the highlight events with specific constituent shots, we simply propose to collect the feature sets of several consecutive shots together as the input eigenvalues of the Kernel SVM.

3) Kernel SVM Training Mode
In the training mode, the data are processed in two steps: a) initialization, and b) kernel parameter optimization and feature selection.

a) Initialization of the Input Data
The initialization process of the training mode is shown in Fig. 6. The video is partitioned into shots and divided into two sets: highlight shots and non-highlight shots. The eigenvalues of consecutive shots are collected as a data set. All data sets are composed into the input data vector. Then each eigenvalue is normalized into the range [0, 100], and the order of the data sets in the input data vector is randomized.

b) Kernel Parameter Optimization and Feature Selection
Since the parameters in the kernel functions are adjustable, and in order to improve the classification accuracy, these kernel parameters should be properly set. In this process, we adopt the GA-based feature selection and parameter optimization method proposed by Huang [25] to select features and optimize the kernel parameters for the support vector machines. Fig. 7 shows the flowchart of the feature selection and parameter optimization method.

As shown in Fig. 7, we apply the GA to generate kernel parameters and select features to train the hyperplanes of the Kernel SVM. The GA processes used to generate kernel parameters and select features are shown in Fig. 8. The GA start process includes generating chromosomes randomly and setting up the parameters. Each chromosome is represented in binary coding format, as shown in Fig. 9, where g_S^1 ~ g_S^{n_s}, g_C^1 ~ g_C^{n_c}, and g_f^1 ~ g_f^{n_f} are the genes for the kernel parameters, the penalty factor, and the features, respectively, and n_s, n_c, and n_f are the numbers of bits used to represent them. The parameters defined at the start of the process are the numbers of bits for parameters and features, the number of generations, the crossover and mutation rates, and the limits of the parameters. The next step is to output the parameters and features to the Kernel SVM for training. In the selection step, we keep the two chromosomes with the maximum objective value (Of), obtained by (29), for the next generation; these chromosomes will not change in the following crossover and mutation steps. Fig. 10 shows the crossover and mutation operations. As shown in Fig. 10(a), two new offspring are obtained by randomly exchanging genes between two chromosomes using one-point crossover. After the crossover operation, as shown in Fig. 10(b), the binary-coded genes are occasionally flipped from 0 to 1 or vice versa in the mutation operation. Finally, a new generation is obtained, and the parameters and features are output again. These processes terminate when the predefined number of generations is reached.

In this paper, we adopt precision and recall rates to evaluate the performance of our system. The precision (P) and recall (R) rates are defined as follows:

    P = SNc / SNe,  R = SNc / SNt,    (28)

where SNc, SNe, and SNt are the numbers of correctly extracted highlight shots, extracted highlight shots, and actual highlight shots, respectively.

In the objective function calculation step, we calculate the objective value (Of) to evaluate the kernel parameters and the selected features generated by the GA. The objective value is calculated by the following equation:

    Of = 0.5·P + 0.5·R.    (29)

These steps terminate when the predefined number of generations has been reached, and finally we select the kernel parameters and features with the maximum objective value.

4) Kernel SVM Analysis Mode
In the analysis mode, the user has to select a sport type. The Kernel SVM system directly loads the pre-trained classification function corresponding to the sport type. The classification function is defined as (30), where Cx is the class of the video shot: Cx = +1 represents shots belonging to the highlight class, and Cx = -1 represents non-highlight shots. This process can be performed very quickly, since the kernel parameters and features do not need to be trained again.

    Cx = sign( Σ_{j=1}^{i} y_j α_j k(x, x_j) + b )
       = +1, if Σ_{j=1}^{i} y_j α_j k(x, x_j) + b ≥ 0,
       = -1, if Σ_{j=1}^{i} y_j α_j k(x, x_j) + b < 0.    (30)
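As a small illustration (not the paper's code), the GA objective of Eqs. (28)-(29) and the one-point crossover of Fig. 10(a) can be written as follows; the string chromosome encoding and the function names are our assumptions:

```python
def objective_value(sn_correct, sn_extracted, sn_actual):
    # Of = 0.5 * P + 0.5 * R, Eqs. (28)-(29).
    precision = sn_correct / sn_extracted
    recall = sn_correct / sn_actual
    return 0.5 * precision + 0.5 * recall

def one_point_crossover(parent1, parent2, point):
    # One-point crossover of two binary chromosomes, Fig. 10(a):
    # swap the tails of the two parents after the crossover point.
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])
```

Crossing '01011111' and '00010010' at bit 4 reproduces the offspring pair shown in Fig. 10(a).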
Figure 4. Linear decision function separating two classes: (a) decision function separating the positive class from the negative class; (b) the margin that separates the two support hyperplanes; (c) the case of linearly non-separable data sets.

Figure 5. A home run event in a baseball game: (a) pitching and hitting; (b) ball flying; (c) base running.

Figure 6. The initialization of training data.

Figure 7. The flowchart of the feature selection and parameter optimization method.

Figure 8. Genetic algorithm to generate parameters and features.

Figure 9. Chromosome.

Figure 10. (a) Crossover operation (e.g., parents 0101 1111 and 0001 0010 yield offspring 0101 0010 and 0001 1111); (b) mutation operation (e.g., 01011111 becomes 01111111).
IV. EXPERIMENTAL RESULTS

The experimental setups for the different sport types are listed in Table I. For the baseball game, we take hits, home runs, strikeouts, steals, and replays as highlight events. For the basketball game, the highlight events are dunks, three-point shots, jump shots, bank shots, and replays. For the soccer game, we set the highlight events as goals, long shots, close-range shots, free kicks, corner kicks, breakthroughs, and replays.

In this paper, we adopt three kernel functions: the multivariate Gaussian radial basis function, the Gaussian radial basis function, and the polynomial function. We then evaluate the performance of these kernel functions for extracting the highlight shots of sport videos. Table II shows the experimental results of NYY vs. NYM, Table III shows the experimental results of the game NBA Celtics vs. Rockets, and Table IV shows the experimental results of the soccer game Arsenal vs. Hotspur. According to the experimental results, we find that the SVM with the MGRBF kernel function has the best performance among these types of sport videos.

TABLE I. THE EXPERIMENTAL SETUP FOR DIFFERENT SPORT TYPES
    Sport type   Sequence              Total length   Shot length
    Baseball     NYY vs. NYM           146 minutes    1097
    Basketball   Celtics vs. Rockets   32 minutes     180
    Soccer       Arsenal vs. Hotspur   48 minutes     280

TABLE II. THE EXPERIMENTAL RESULTS OF THE BASEBALL GAME (NYY VS. NYM)
    Kernel      MGRBF   GRBF   Polynomial
    Precision   87%     89%    77%
    Recall      99%     81%    91%

TABLE III. THE EXPERIMENTAL RESULTS OF THE BASKETBALL GAME (CELTICS VS. ROCKETS)
    Kernel      MGRBF   GRBF   Polynomial
    Precision   100%    86%    93%
    Recall      93%     100%   87%

TABLE IV. THE EXPERIMENTAL RESULTS OF THE SOCCER GAME (ARSENAL VS. HOTSPUR)
    Kernel      MGRBF   GRBF   Polynomial
    Precision   100%    76%    100%
    Recall      88%     96%    73%

V. CONCLUSION

A Kernel SVM can be trained to classify shots by exploiting the information of a unified set of basic features. Experimental results show that the SVM with the multivariate Gaussian radial basis kernel achieves an average 96% precision rate and 93% recall rate.

REFERENCES
[1] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Highlights extraction from sports video based on an audio-visual marker detection framework', Proc. IEEE ICME, July 2005, pp. 29–32.
[2] X. Tong, L. Duan, H. Lu, C. Xu, Q. Tian, and J. S. Jin, 'A mid-level visual concept generation framework for sports analysis', Proc. IEEE ICME, July 2005, pp. 646–649.
[3] A. Hanjalic, 'Multimodal approach to measuring excitement in video', Proc. IEEE ICME, July 2003, pp. 289–292.
[4] A. Hanjalic, 'Generic approach to highlights extraction from a sport video', Proc. IEEE ICIP, Sept. 2003, pp. I-1–4.
[5] L. Y. Duan, M. Xu, T. S. Chua, Q. Tian, and C. S. Xu, 'A mid-level representation framework for semantic sports video analysis', Proc. ACM Multimedia, Nov. 2003, pp. 33–44.
[6] Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, 'Integrated image and speech analysis for content-based video indexing', Proc. IEEE ICMCS, May 1996, pp. 306–313.
[7] K. Wan and C. Xu, 'Efficient multimodal features for automatic soccer highlight generation', Proc. IEEE ICPR, Aug. 2004, pp. 973–976.
[8] Q. Huang, J. Hu, W. Hu, T. Wang, H. Bai, and Y. Zhang, 'A reliable logo and replay detector for sports video', Proc. IEEE ICME, July 2007, pp. 1695–1698.
[9] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, and P. Pala, 'Soccer highlights detection and recognition using HMMs', Proc. IEEE ICME, Aug. 2002, pp. 825–828.
[10] G. Xu, Y. F. Ma, H. J. Zhang, and S. Yang, 'A HMM based semantic analysis framework for sports game event detection', Proc. IEEE ICIP, Sept. 2003, pp. I-25–8.
[11] J. Wang, C. Xu, E. Chng, and Q. Tian, 'Sports highlight detection from keyword sequences using HMM', Proc. IEEE ICME, June 2004, pp. 599–602.
[12] P. Chang, M. Han, and Y. Gong, 'Extract highlights from baseball game video with hidden Markov models', Proc. IEEE ICIP, Sept. 2002, pp. 609–612.
[13] N. H. Bach, K. Shinoda, and S. Furui, 'Robust highlight extraction using multi-stream hidden Markov models for baseball video', Proc. IEEE ICIP, Sept. 2005, pp. III-173–6.
[14] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang, 'Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework', Proc. IEEE ICME, July 2003, pp. III-401–4.
[15] B. Zhang, W. Chen, W. Dou, Y. J. Zhang, and L. Chen, 'Content-based table tennis games highlight detection utilizing audiovisual clues', Proc. IEEE ICIG, Aug. 2007, pp. 833–838.
[16] C. C. Cheng and C. T. Hsu, 'Fusion of audio and motion information on HMM-based highlight extraction for baseball games', IEEE Trans. Multimedia, pp. 585–599, June 2006.
[17] L. C. Chang, Y. S. Chen, R. W. Liou, C. H. Kuo, C. H. Yeh, and B. D. Liu, 'A real time and low cost hardware architecture for video abstraction system', Proc. IEEE ISCAS, May 2007, pp. 773–776.
[18] ISO/IEC JTC1/SC29/WG11/N6881, 'MPEG-7 Requirements Document V.18', January 2005.
[19] ISO/IEC JTC1/SC29/WG11, 'MPEG-7 Overview (version 10)', October 2004.
[20] C. H. Kuo, M. Shen, and C.-C. Jay Kuo, 'Fast motion search with efficient inter-prediction mode decision for H.264', Journal of Visual Communication and Image Representation, pp. 217–242, 2006.
[21] Iain E. G. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, 2001.
[24] I. B. Vapnyarskii, 'Lagrange multipliers', in M. Hazewinkel (ed.), Encyclopedia of Mathematics, Kluwer Academic Publishers, 2001, ISBN 978-1556080104.
[25] C. L. Huang and C. J. Wei, 'GA-based feature selection and parameters optimization for support vector machines', Expert Systems with Applications, vol. 31, 2006, pp. 231–240.
IMAGE INPAINTING USING STRUCTURE-GUIDED PRIORITY BELIEF PROPAGATION AND LABEL TRANSFORMATIONS

Heng-Feng Hsin (辛恆豐), Jin-Jang Leou (柳金章), Hsuan-Ying Chen (陳軒盈)
Department of Computer Science and Information Engineering
National Chung Cheng University
Chiayi, Taiwan 621, Republic of China
E-mail: {hhf96m, jjleou, chenhy}@cs.ccu.edu.tw

ABSTRACT

In this study, an image inpainting approach using structure-guided priority belief propagation (BP) and label transformations is proposed. The proposed approach contains five stages, namely, Markov random field (MRF) node determination, structure map generation, label set enlargement by label transformations, image inpainting by priority-BP optimization, and overlapped region composition. Based on the experimental results obtained in this study, the proposed approach provides better image inpainting results than three comparison approaches.

Keywords: Image Inpainting; Priority Belief Propagation; Label Transformation; Markov Random Field (MRF); Structure Map.

1. INTRODUCTION

Image inpainting is used to remove unwanted objects or recover damaged parts in an image, and can be employed in various applications, such as repairing aged images and multimedia editing. Image inpainting approaches can be classified into three categories, namely, statistical-based, partial differential equation (PDE) based, and exemplar-based approaches. Statistical-based approaches are usually used for texture synthesis and are suitable for highly stochastic parts of an image. However, statistical-based approaches can hardly rebuild the structural parts of an image.

PDE-based approaches try to fill the target regions of an image through a diffusion process, i.e., they diffuse available data from the source region boundary towards the interior of the target region by a PDE, which is typically nonlinear. Bertalmio et al. [1] proposed a PDE-based image inpainting approach, which finds isophote directions and propagates image Laplacians to the target region along these directions. Kim et al. [2] used genetic algorithms (GA) to solve the inpainting problem with an isophote constraint: they estimate the smoothness value given by the best chromosome of the GA and project this value in the isophote direction. Chan and Shen [3] proposed a new diffusion method, called curvature-driven diffusions (CDD), as compared with other diffusion models. PDE-based approaches are suitable for thin and elongated missing parts in an image. For large and textured missing regions, the processed results of PDE-based approaches are usually oversmoothed (i.e., blurred).

Exemplar-based approaches try to fill missing regions in an image by simply copying some available part of the image. Nie et al. [4] improved Criminisi et al.'s approach [5] by changing the filling order, and overcame the problem that the gradients of some pixels on the source region contour are zero. A major shortcoming of exemplar-based approaches is the greedy way of filling an image, resulting in visual inconsistencies. To cope with this problem, Sun et al. [6] proposed a new approach; however, in their approach, user intervention is required to specify the curves on which the most salient missing structures reside. Jia and Tang [7] used image segmentation to abstract image structures, but natural image segmentation is a difficult task. To cope with this problem, Komodakis and Tziritas [8] proposed a new exemplar-based approach, which treats image inpainting as a discrete global optimization problem.

2. PROPOSED APPROACH

The proposed approach contains five stages, namely, Markov random field (MRF) node determination, structure map generation, label set enlargement by label transformations, image inpainting by priority-BP optimization, and overlapped region composition.

2.1. MRF node determination

As shown in Fig. 1 [8], an image I0 contains a target region T and a source region S, with S = I0 - T. Image
inpainting is to fill T in a visually plausible way by simply pasting various patches from S. In this study, image inpainting is treated as a discrete optimization problem with a well-defined energy function. Here, discrete MRFs are employed.

To define the nodes of an MRF, the image lattice is used with horizontal and vertical spacings of gapx and gapy (pixels), respectively. For each lattice point, if its neighborhood of size (2gapx + 1) × (2gapy + 1) overlaps the target region, it will be an MRF node p. Each label of the label set L of an MRF consists of (2gapx + 1) × (2gapy + 1) pixels from the source region S. Based on the image lattice, each MRF node may have 2, 3, or 4 neighboring MRF nodes.

Assigning a label to an MRF node is equivalent to copying the label (patch) to the MRF node. To evaluate the goodness of a label (patch) for an MRF node, the energy (cost) function of an MRF is defined, which includes the cost of the observed region of an MRF node. We will assign a label x̂p ∈ L to each MRF node p so that the total energy F(x̂) of the MRFs is minimized. Here,

    F(x̂) = Σ_{p∈ν} Vp(x̂p) + Σ_{(p,q)∈ε} Vpq(x̂p, x̂q),    (1)

where Vp(xp) (called the label cost hereafter) denotes the single-node potential for placing label xp over MRF node p, i.e., how well the label xp agrees with the source region around p, and Vpq(xp, xq) represents the pairwise potential measuring how well node p agrees with its neighboring node q in the overlapped region between them when pasting xp at p and xq at q.

2.2. Structure map generation

In this study, the Canny edge detector [9] is used to extract the edge map of an image, which preserves the important structural properties of the source region in the image. A binary mask E(p) is used to build the structure map of the image, which is just the edge map with morphological dilation. If E(p) is non-zero, pixel p belongs to the structure part. Then, E(p) is used to formulate the structure weighting function Z(p, q):

    Z(p, q) = 1, if E(p) = 0 and E(q) = 0; w, otherwise,    (2)

where w is "the structure weighting coefficient." The label cost Vp(xp) is defined as the sum of weighted squared differences (SWSD):

    Vp(xp) = Σ_{dp ∈ [-gapx, gapx] × [-gapy, gapy]} Z(p + dp, xp + dp) M(p + dp) (I0(p + dp) - I0(xp + dp))²,    (3)

where M(p) denotes a binary mask, which is non-zero if pixel p lies inside the source region S. Thus, for an MRF node p, if its neighborhood of size (2gapx + 1) × (2gapy + 1) does not intersect S, Vp(xp) = 0. Vpq(xp, xq), for pasting labels xp and xq over p and q, respectively, can be similarly defined as:

    Vpq(xp, xq) = Σ_{dp, dq ∈ Ro} Z(xp + dp, xq + dq) (I0(xp + dp) - I0(xq + dq))²,    (4)

where Ro is the overlapped region between the two labels xp and xq.

2.3. Label set enlargement

To make full use of the label information in the original image, three types of label transformations are used to enlarge the label set. The first type of label transformation contains two different directions, vertical and horizontal flipping, which can find labels (patches) that do not exist in the original source region but have symmetric properties in the horizontal or vertical direction. The second type of label transformation contains three different rotations — left 90° rotation, right 90° rotation, and 180° rotation — which can find labels (patches) rotated by these three angles. The third type of label transformation is scaling. To keep the original horizontal and vertical spacings gapx and gapy, the original image is directly up/down scaled so that new labels (patches) can be obtained with the same horizontal and vertical spacings. Here, both the up-sampled (double-resolution, by bilinear interpolation) image and the down-sampled (half-resolution) image are used to generate extra candidate labels (patches).

2.4. Image inpainting by priority-BP optimization

Belief propagation (BP) [10] treats an optimization problem by iteratively solving a finite set of equations until the optimal solution is found. Ordinary BP is computationally expensive: in an MRF graph, each node sends "messages" to all its neighboring nodes and receives messages from all its neighboring nodes, and this process is iterated until the messages no longer change.

The set of messages sent from node p to its neighboring node q is denoted by {m_pq(xq)}_{xq∈L}. Such a message expresses the opinion of node p about assigning label xq to node q. The message formulation is defined as:

    m_pq(xq) = min_{xp∈L} { Vpq(xp, xq) + Vp(xp) + Σ_{r: r≠q, (r,p)∈ε} m_rp(xp) }.    (5)

That is, if node p wants to send message m_pq to node q, node p must traverse its own label set and find the best label to support node q when label xq is assigned to node q. Each message is based on two factors: (1) the compatibility between labels xp and xq, and (2) the likelihood of assigning label xp to node p, which itself contains two factors: (1) the label cost Vp(xp), and (2) the opinions of p's neighboring nodes about xp, measured by the third term in Eq. (5).
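A simplified sketch of the SWSD label cost of Eq. (3) (our illustration, not the authors' code: a single Boolean structure mask over the patch stands in for the pairwise Z(p, q) lookups of Eq. (2), and all names are assumptions):

```python
import numpy as np

def label_cost(node_patch, label_patch, source_mask, structure_mask, w):
    # Vp(xp), Eq. (3): squared differences weighted by Z (structure
    # pixels get weight w, all others weight 1) and masked by M
    # (pixels outside the source region contribute nothing).
    diff2 = (node_patch.astype(np.float64)
             - label_patch.astype(np.float64)) ** 2
    z = np.where(structure_mask, float(w), 1.0)  # Z of Eq. (2), simplified
    m = source_mask.astype(np.float64)           # M(p) of Eq. (3)
    return float((z * m * diff2).sum())
```

A node whose whole neighborhood lies inside the target region has source_mask all False, so its label cost is 0, matching the remark after Eq. (3).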
Messages are iteratively updated by Eq. (5) until they converge. Then, a set of beliefs, which represents the probability of assigning label xp to p, is computed for each MRF node p as:

    b_p(xp) = -Vp(xp) - Σ_{r: (r,p)∈ε} m_rp(xp).    (6)

The second term in Eq. (6) means that to calculate a node's belief, it is required to gather all messages from all its neighboring nodes. When the beliefs of all MRF nodes have been calculated, each node p is assigned the best label, i.e., the one having the maximum belief:

    x̂p = argmax_{xp∈L} b_p(xp).    (7)

To reduce the computational cost of BP, Komodakis and Tziritas [8] proposed "priority-BP" to control the message passing order of MRF nodes and "dynamic label pruning" to reduce the number of elements in the label set of each MRF node. In [8], the priority of an MRF node p is related to the confidence of node p about the label that should be assigned to it. The confidence depends on the current set of beliefs {b_p(xp)} that has been calculated by BP. Here, the priority of node p is designed as:

    priority(p) = 1 / |{xp ∈ L : b_p^rel(xp) ≥ b_conf}|,    (8)
    b_p^rel(xp) = b_p(xp) - b_p^max,    (9)

where b_p^rel is the relative belief value and b_p^max is the maximum belief among all labels in the label set of node p. Here, the confidence of an MRF node is determined by the number of candidate labels whose relative belief values exceed a certain threshold b_conf.

On the other hand, to traverse the MRF nodes, the number of candidate labels for an MRF node can be pruned dynamically. To commit a node p, all labels with relative beliefs less than a threshold b_prune for node p will not be considered as its candidate labels; the remaining labels are called "active labels" for node p. In this study, the label set of an MRF node is sorted by belief values, at least Lmin active labels are selected for the node, and a similarity measure is used to check the remaining labels. If the similarity between two remaining labels is greater than a threshold Sdiff, one of the two will be pruned. This process is iterated until the relative belief value of every remaining label is smaller than b_prune or the number of active labels reaches a user-specified parameter Lmax.

To apply priority-BP to image inpainting, the labels from the source region of the original image and the labels obtained by applying the three types of label transformations are collected so that each MRF node maintains its label set. Then, the number of priority-BP iterations, K, is set, the priorities of all MRF nodes are initialized only by their Vp(xp) values, and message passing is performed. Each priority-BP iteration consists of a forward pass and a backward pass. Message passing and dynamic label pruning are performed in the forward pass, and each MRF edge can be bidirectionally traversed. In the forward pass, all the nodes are visited in priority order: the MRF node having the highest priority passes messages to its neighboring MRF nodes having lower priorities, and the MRF node having the highest priority is marked as "committed," meaning it will not be visited again in this forward pass. For label pruning, the MRF node having the highest priority transmits its "cheap" message to all its neighboring MRF nodes that have not been committed, and the priority of each neighboring MRF node that received a new message is updated. This process is iterated until there are no uncommitted MRF nodes. The backward pass, on the other hand, is performed in the reverse order of the forward pass. Note that label pruning is not performed in the backward pass.

2.5. Overlapped region composition

When the number of iterations reaches K, each MRF node p is assigned the label having the maximum b_p value over xp ∈ L. All the MRF nodes are composed to produce the final image inpainting result, where label composition is performed in decreasing order of MRF node priorities. Depending on whether the region contains a global structure or not, two strategies are used to compose each overlapped region. If an overlapped region contains a global structure, graph cuts are used to seam it. Otherwise, each pixel value of the overlapped region is computed as a weighted sum of the two corresponding pixel values, where the weighting coefficient is proportional to the priority of the MRF node.

3. EXPERIMENTAL RESULTS

In this study, 21 test images are used to evaluate the performance of the proposed approach. Three comparison inpainting approaches, namely, the PDE-based approach [1], the exemplar-based approach [5], and the ordinary priority-BP-based approach [8], are implemented in this study. Some image inpainting results by the three comparison approaches and the proposed approach are shown in Figs. 2-6.

In Fig. 2, the image size is 256 × 170, gapx = 9, gapy = 9, bconf = -180000, bprune = -360000, Lmax = 30, Lmin = 5, and w = 10. Blurring artifacts appear in Fig. 2(c). In Fig. 2(d), because the isophote directions are too complex to guide the inpainting process, the inpainting results are not good. Compared with the ordinary priority-BP-based approach (Fig. 2(e)), the proposed approach (Fig. 2(f)) can keep the global structure in the image by guiding the message passing process with the structure map. In Fig. 3, the image size is 206 × 308, gapx = 5, gapy = 5, bconf = -40000, bprune = -80000, Lmax = 20, Lmin = 3, and w = 10. In Fig. 3(c), blurring artifacts appear in the upper part of the image. In Fig. 3(d), the stone bridge cannot be well reconstructed, because there is no suitable patch in the image. Furthermore, error
propagation appears in the lake. In Fig. 3(e), because the priority of the bridge structure is low, the bridge structure is broken. In the proposed approach, the weighting coefficient is used to raise the priority of the bridge structure, resulting in better inpainting results. In Fig. 4, the image size is 208 × 278, gap_x = 7, gap_y = 7, b_conf = -150000, b_prune = -300000, L_max = 30, L_min = 5, and w = 2. For this image, the proposed approach can reconstruct the tower structure by label transformations, whereas the results of the three comparison approaches contain error propagation due to the lack of suitable labels. In Fig. 5, the image size is 287 × 216, gap_x = 10, gap_y = 10, b_conf = -200000, b_prune = -400000, L_max = 50, L_min = 5, and w = 15. In Fig. 5(f), the proposed approach uses both the original labels and the flipped labels to reconstruct the region to be inpainted, resulting in a better inpainted image. In Fig. 6, the image size is 257 × 271, gap_x = 6, gap_y = 6, b_conf = -200000, b_prune = -400000, L_max = 50, L_min = 10, and w = 5. Because the building in the original image has the symmetry property, label transformations can be employed in this case. Blurring artifacts appear in Fig. 6(c). In Fig. 6(d), the isophote direction is too complex, so the structures interfere with each other. In Fig. 6(e), the inpainting results are poor due to the lack of valid labels. In Fig. 6(f), the window structure in the lower part of the image is partially broken because the building is not totally symmetric, so error propagation appears in some inpainted regions of the image. However, the inpainted image produced by the proposed approach is still better than those of the three comparison methods.

4. CONCLUDING REMARKS

In this study, an image inpainting approach using structure-guided priority-BP and label transformations is proposed. In the proposed approach, to reconstruct the global structures in an image, the structure map of the image is generated, which guides the inpainting process through priority-BP optimization. Furthermore, three types of label transformations are employed to obtain more usable labels (patches) for inpainting. Based on the experimental results obtained in this study, the proposed approach provides better image inpainting results than the three comparison approaches.

ACKNOWLEDGEMENT

This work was supported in part by the National Science Council, Taiwan, Republic of China under Grants NSC 96-2221-E-194-033-MY3 and NSC 98-2221-E-194-034-MY3.

Fig. 1. (a) Nodes and edges of an MRF; (b) labels of an MRF for image inpainting [8].

Fig. 2. (a) The original image, "Lantern;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (to be continued).
Fig. 3. (a) The original image, "Bungee jumping;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively (continued).

Fig. 4. (a) The original image, "Tower;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

Fig. 5. (a) The original image, "Picture frame;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.
Fig. 6. (a) The original image, "Building;" (b) the masked image; (c)-(f) the image inpainting results by the PDE-based approach [1], the exemplar-based approach [5], the ordinary priority-BP-based approach [8], and the proposed approach, respectively.

REFERENCES

[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. of ACM Int. Conf. on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[2] J. B. Kim and H. J. Kim, "Region removal and restoration using a genetic algorithm with isophote constraint," Pattern Recognition Letters, Vol. 24, pp. 1303–1316, 2003.
[3] T. Chan and J. Shen, "Non-texture inpaintings by curvature-driven diffusions," Journal of Visual Comm. and Image Rep., Vol. 12, pp. 436–449, 2001.
[4] D. Nie, L. Ma, and S. Xiao, "Similarity based image inpainting method," in Proc. of 2006 Multi-Media Modeling Conf., 2006, pp. 4–6.
[5] A. Criminisi, P. Perez, and K. Toyama, "Object removal by exemplar-based inpainting," in Proc. of IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2003, pp. 721–728.
[6] J. Sun, L. Yuan, J. Jia, and H. Y. Shum, "Image completion with structure propagation," in Proc. of ACM SIGGRAPH, 2005, pp. 861–868.
[7] J. Jia and C. K. Tang, "Image repairing: Robust image synthesis by adaptive ND tensor voting," in Proc. of 2003 IEEE Conf. on Computer Vision and Pattern Recognition, 2003, pp. 643–650.
[8] N. Komodakis and G. Tziritas, "Image completion using efficient belief propagation via priority scheduling and dynamic pruning," IEEE Trans. on Image Processing, Vol. 16, pp. 2649–2661, 2007.
[9] J. Canny, "A computational approach to edge detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 8, pp. 679–698, 1986.
[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA, 1988.
• CONTENT-BASED BUILDING IMAGE RETRIEVAL Wen-Chao Chen(陳文昭), Chi-Min Huang (黃啟銘), Shu-Kuo Sun (孫樹國), Zen Chen (陳稔) Dept. of Computer Science, National Chiao Tung University E-mail:Chaody.cs94g@nctu.edu.tw, toothbrush.cs97g@ nctu.edu.tw, sksun@csie.nctu.edu.tw, zchen@cs.nctu.edu.twAbstract—This paper addresses an image retrieval query image, the content-based image retrieval systemsystem which searches the most similar building for a extracts the most similar images from a database bycaptured building image from an image database based either spatial information, such as color, texture andon an image feature extraction and matching method. shape, or frequency domain features, e.g. wavelet-basedThe system then can provide relevant information to methods [3].users, such as text or video information regarding the Existing content-based image retrieval algorithmsquery building in augmented reality setting. However, can be categorized into (a) image classification methods,the main challenge is the inevitable geometric and and (b) object identification methods. The first approachphotometric transformations encountered when a retrieves images which belong to the same category as ahandheld camera operates at a varying viewpoint under query image. Jing et al. proposed region-based imagevarious lighting environments. To deal with these retrieval architecture [6]. An image is segmented intotransformations, the system measures the similarity regions by the JSEG method and every region isbetween the MSER features of the captured image and described with color moment. Every region is clustereddatabase images using the Zernike Moment (ZM) to form a codebook by Generalized Lloyd algorithm.information. This paper also presents algorithms based The similarity of two images is then measured by Earthon feature selection by multi-view information and the Mover’s Distance (EMD). Willamowski et al. 
presentedDBSCAN clustering method to retrieve the most generic visual categorization method by using supportrelevant image from database efficiently. The vector machine as a classifier [7]. Affine invariantexperimental results indicate that the proposed system descriptor represents an image as a vector quantization.has excellent performance in terms of the accuracy and In the second approach Wu and Yang [8] detectedprocessing time under the above inevitable imaging and recognized street landmarks from database imagesvariations. by combining salient region detection and segmentation techniques. Obdrzalek and Matas [9] developed aKeywords Image recognition and retrieval; Geometric building image recognition system based on local affineand photometric transformations; Zernike moments; features that allows retrieval of objects in images takenImage indexing; from distinct viewpoints. Discrete cosine transform 1. INTRODUCTION (DCT) is then applied to the local representations to reduce the memory usage. Zhang and Kosecka [10] also In recent years, there have been an increasing proposed a system to recognize building by anumber of applications in Location-Based Service hierarchical approach. They first index the model views(LBS). LBS is an service that can be accessed from by localized color histograms. After converting tomobile devices to provide information based on the YCbCr color space and indexing with the hue value,current geographical position, e.g. GPS information. SIFT descriptors [4, 5] are then applied to refineHowever, GPS position is only available in open spaces recognition results.since the GPS signal is often blocked by high-rise Most of related image retrieval algorithms detectbuildings or overhead bridges. Magnetic compasses are local features of a query image and then compare withalso disturbed by nearby magnetic materials. Vision- detected features of database images by featurebased localization is therefore an alternative approach to descriptors. 
However, the feature detectors such asprovide both accurate and robust navigation information. Harris corner detector and the SIFT detector, which is This paper addresses the aspects of a building based on the difference of Gaussians (DOG), utilize aimage retrieval system. The building recognition is a circular window to search for a possible location of acontent-based image retrieval technique that can be feature. The image content in the circular window is notextended to applications of object recognition and web robust to affine deformations. Furthermore, the featureimage search via a cloud service combined with points may not be reliable and may not appearconsumer-oriented augmented reality tools. Given a 1091
simultaneously across the multiple views with wide baselines. Matas et al. [13] presented a maximally stable extremal region (MSER) detector. Mikolajczyk and Schmid [3] proposed the Harris-Affine and Hessian-Affine detectors. The performances of the existing region detectors were evaluated in [14], in which the MSER detector and the Hessian-Affine detector were ranked as the two best. Chen and Sun [2] compared various popular feature descriptors, e.g. SIFT, PCA-SIFT, GLOH, and steerable filters, with the phase-based Zernike Moment (ZM) descriptor. The ZM descriptor performs significantly better than the other descriptors under geometric and photometric transformations such as blur, illumination, noise, scale, and JPEG compression. To describe a building image under geometric and photometric transformations, this paper utilizes the MSER method as the feature detector. The Zernike Moment is then applied to describe each detected feature region.

In order to index a large number of feature descriptors, the KD-tree [12] is a fundamental method that recursively partitions the space into two subspaces to construct a binary tree.

We also introduce a building image dataset, the NCTU-Bud dataset, containing high resolution images of 22 buildings located on the National Chiao Tung University campus, with a total of 190 database images. We capture at least one face of each building from 5 distinct viewing directions. Query images are captured under 12 different lighting conditions for performance evaluation.

Fig. 1 shows the overall system block diagram. Section 2 briefly describes the background of the feature detector and descriptor. Section 3 presents a feature selection method to remove unstable features and a clustering method to obtain representative features. In Section 4 the image indexing and retrieval method is described. In Section 5 experimental results on the NCTU-Bud dataset are described; the performance on the publicly available ZuBud dataset is evaluated as well. Finally, Section 6 concludes the paper.

Figure 1. System block diagram.

2. FEATURE DETECTOR AND DESCRIPTOR

2.1. MSER feature region detector

Recently, a number of local feature detectors using a local elliptical window have been investigated. The MSER detector is evaluated as one of the best region detectors [5]. The advantage of the MSER detector is its ability to resist geometric transformations. The MSER detector also performs well when images contain homogeneous regions with distinctive boundaries [1]. Because building images contain regions with boundaries, such as windows and color bricks, the MSER detector can extract these regions stably. After detecting elliptical regions by the MSER method, we have to filter out unstable regions, such as those with oversized area, large aspect ratio, duplicated regions, and high area variation, as shown in Fig. 2.

Figure 2. (a) Initial MSER results. (b) Results after removing unstable MSER feature regions.

2.2. Zernike Moment feature region descriptor

Once the feature regions are detected, every region is described as a feature vector for similarity measurement. This paper presents a method which applies the Zernike Moment (ZM) as the feature descriptor [2]. Zernike moments (ZMs) have been used in object recognition regardless of variations in position, size, and orientation. Essentially, Zernike moments are the extension of the geometric moments obtained by replacing the conventional transform kernel x^m y^n with orthogonal Zernike polynomials.

The Zernike basis function V_nm(ρ, θ) is defined over a unit circle with order n and repetition m such that (a) n - |m| is even and (b) |m| ≤ n, as given by

V_nm(ρ, θ) = R_nm(ρ) e^{jmθ}, for ρ ≤ 1, (1)

where R_nm(ρ) is a radial polynomial of the form

R_nm(ρ) = Σ_{s=0}^{(n-|m|)/2} (-1)^s (n-s)! / [ s! ((n+|m|)/2 - s)! ((n-|m|)/2 - s)! ] ρ^{n-2s}. (2)
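Eq. (2) can be checked numerically with a direct transcription. The sketch below is illustrative only; the function name and test values are assumptions, not code from the paper.

```python
from math import factorial

def radial_poly(n, m, rho):
    """R_nm(rho) of Eq. (2); requires n - |m| even and |m| <= n."""
    m = abs(m)
    assert (n - m) % 2 == 0 and m <= n
    value = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = (-1) ** s * factorial(n - s) / (
            factorial(s)
            * factorial((n + m) // 2 - s)
            * factorial((n - m) // 2 - s))
        value += coeff * rho ** (n - 2 * s)
    return value

# Known low-order polynomials: R_11(rho) = rho, R_31(rho) = 3 rho^3 - 2 rho.
assert radial_poly(1, 1, 0.5) == 0.5
assert abs(radial_poly(3, 1, 0.5) - (3 * 0.5 ** 3 - 2 * 0.5)) < 1e-12
```

Multiplying R_nm(ρ) by e^{jmθ} as in Eq. (1) then gives the complex basis value at a polar sample point of the unit disk.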
The set of basis functions {V_nm(ρ, θ)} is orthogonal, i.e.,

∫_0^{2π} ∫_0^1 V*_nm(ρ, θ) V_pq(ρ, θ) ρ dρ dθ = (π / (n+1)) δ_np δ_mq, with δ_ab = 1 if a = b and 0 otherwise. (3)

The two-dimensional ZMs of a continuous image function f(ρ, θ) are represented by

Z_nm = ((n+1)/π) ∫∫_{unit disk} f(ρ, θ) V*_nm(ρ, θ) = |Z_nm| e^{iφ_nm}. (4)

For a digital image function, the two-dimensional ZMs are given as

Z_nm = ((n+1)/π) Σ_{(ρ,θ) ∈ unit disk} f(ρ, θ) V*_nm(ρ, θ) = |Z_nm| e^{iφ_nm}. (5)

Define a region descriptor P based on the sorted ZMs as follows:

P = [ |Z_11| e^{iφ_11}, |Z_31| e^{iφ_31}, ..., |Z_{nmax,mmax}| e^{iφ_{nmax,mmax}} ]^T, (6)

where |Z_nm| is the ZM magnitude and φ_nm is the ZM phase. The Zernike moments are derived by integrating the normalized region against the Zernike basis functions. In this paper, the ZMs with m = 0 are not included, and both the maximum order n and the maximum repetition m equal 12, so the length of the feature vector is 42. In this way, two feature vectors represent a feature region: mag = [ |Z_1,1|, |Z_3,1|, ..., |Z_12,12| ]^T and phase = [ φ_1,1, φ_3,1, ..., φ_12,12 ]^T.

Figure 3. Normalization of an elliptical region.

2.3. A similarity measure

Let P_q = (mag_q, phase_q) and P_d = (mag_d, phase_d) be two ZM feature vectors. The similarity of magnitudes S_mag(P_q, P_d) is defined as the cosine between the two vectors:

S_mag(P_q, P_d) = (mag_q · mag_d) / (||mag_q|| ||mag_d||). (7)

Its value ranges between 0 and 1, and a higher value indicates that the two vectors are more similar. This is equivalent to the Euclidean distance between the two normalized unit vectors.

A similarity measure using the weighted ZM phase differences is expressed by

S_phase(P_q, P_d) = 1 - Σ_n Σ_m w_nm · min{ |Φ_nm - (m·α̂) mod 2π|, 2π - |Φ_nm - (m·α̂) mod 2π| } / π, (8)

where w_nm = (|Z^q_nm| + |Z^d_nm|) / Σ_{n,m} (|Z^q_nm| + |Z^d_nm|), and Φ_nm = (φ^q_nm - φ^d_nm) mod 2π is the actual phase difference. The rotation angle α̂ is determined by an iterative computation of α̂_m = (Φ_nm - α̂_{m-1}) mod 2π, with the initial value α̂_0 = 0, using the entire set of Zernike moments sorted by m. The value range of S_phase(P_q, P_d) is the interval [0, 1], and a higher value indicates that the two vectors are more similar.

3. EFFICIENT BUILDING IMAGE DATABASE CONSTRUCTION

In building image retrieval applications, the scale of the database is typically large, with a considerable number of visual descriptors. In order to index and search rapidly, effective approaches to storing appropriate descriptors are proposed for constructing a large-scale building image database.

3.1. Feature selection from multiple images

Modern building databases in image retrieval applications normally contain multiple views of a single building. For example, the ZuBud dataset collects five images for each building in the database. We refine the detected MSER feature regions by verifying consistency between multiple images of a building captured from distinct viewpoints. The basic idea of the selection is to keep representative feature regions and remove discrepant features as outliers. Feature region selection reduces the storage space of feature descriptors in the database. Furthermore, this method remarkably improves the efficiency and accuracy of the image retrieval process.
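The magnitude similarity of Eq. (7) is plain cosine similarity between the two ZM magnitude vectors. A minimal sketch follows (the phase measure of Eq. (8), with its iterative rotation estimate, is omitted; plain lists stand in for the 42-dimensional vectors):

```python
import math

def s_mag(mag_q, mag_d):
    """Eq. (7): cosine of the angle between two ZM magnitude vectors."""
    dot = sum(q * d for q, d in zip(mag_q, mag_d))
    norm_q = math.sqrt(sum(q * q for q in mag_q))
    norm_d = math.sqrt(sum(d * d for d in mag_d))
    return dot / (norm_q * norm_d)

# Magnitudes are non-negative, so the value lies in [0, 1]:
# parallel vectors give 1, orthogonal ones give 0.
assert abs(s_mag([1.0, 2.0, 2.0], [2.0, 4.0, 4.0]) - 1.0) < 1e-12
assert abs(s_mag([1.0, 0.0], [0.0, 1.0])) < 1e-12
```

Maximizing this cosine is equivalent to minimizing the Euclidean distance between the two unit-normalized vectors, which is the equivalence noted after Eq. (7).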
Figure 4. (a)-(c) Three different images in a group of building images before feature selection. (d)-(f) The same three images after feature selection.

The occurrence of discrepant feature regions comes from non-building areas, such as trees, bicycles, and pedestrians, as shown in Fig. 4. Feature regions in non-building areas are not stable compared with regions on the building. Therefore, excluding these feature regions from the database is necessary to ensure uniform results.

This paper presents a method to select feature regions automatically by measuring similarity between multiple images of a building. The algorithm for feature region selection is given in Fig. 5. Only similar feature regions across the views are preserved; two regions are considered similar if S_mag(P_q, P_d) > 0.7 and S_phase(P_q, P_d) > 0.7. A comparison of feature regions before and after selection is shown in Fig. 4. Unstable feature regions in Figs. 4(a)-4(c), such as trees and pedestrians, are removed by the proposed algorithm. The results of the selection are shown in Figs. 4(d)-4(f).

Input: A group of feature regions in multi-view images.
Output: Selected feature regions.
For each feature region
    If there are at least two similar regions in other views
        Preserve the feature region;
    Else
        Delete the feature region;
Figure 5. The feature region selection algorithm.

3.2. Feature clustering

After removing non-building feature regions, most of the remaining feature regions belong to the buildings. However, repeated patterns, e.g. windows and doors, are common in a building image. In order to reduce the storage space of the repeated feature descriptors in the database, clustering similar features into a representative feature descriptor is necessary.

In conventional clustering algorithms, e.g. the k-means and k-medoid algorithms, each cluster is represented by its gravity center or by one of the objects of the cluster located near its center. However, determining the number of clusters k is not straightforward. Moreover, the ability to distinguish different features is reduced because isolated feature regions are forced to merge into a nearby cluster that may have dissimilar region appearance. Consequently, the Density-Based Spatial Clustering algorithm (DBSCAN) [15] is used for clustering.

The DBSCAN algorithm relies on a density-based notion of clusters. Two input parameters, ε and MinPts, determine the clustering conditions in two steps. The first step chooses an arbitrary point from the database as a seed; the second retrieves all points reachable from the seed. The parameter ε defines the size of the neighborhood, and for each point to be included in a cluster there must be at least a minimum number (MinPts) of points in an ε-neighborhood of a cluster point.
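The ε/MinPts mechanics above can be captured in a compact, illustrative DBSCAN in the spirit of [15]: a point with at least MinPts neighbours within ε seeds a cluster, the cluster is grown by density-reachability, and points reachable from no core point stay isolated (label -1) rather than being forced into a cluster. Euclidean distance on small 2-D points stands in for the 42-dimensional ZM magnitude vectors; this is a sketch, not the paper's implementation.

```python
import math

def dbscan(points, eps, min_pts):
    n = len(points)
    labels = [None] * n                        # None = not yet visited
    def neighbours(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                     # noise / isolated point
            continue
        labels[i] = cluster                    # i is a core point: new cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster            # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:             # j is also a core point: expand
                seeds.extend(jn)
        cluster += 1
    return labels

pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),     # one dense cluster
       (10.0, 10.0)]                           # isolated point, kept as-is
assert dbscan(pts, eps=1.0, min_pts=3) == [0, 0, 0, -1]
```

After clustering, each cluster can be replaced by the mean of its member vectors, while the isolated points are preserved, mirroring the representative-descriptor step described in Section 3.2.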
Figure 6. (a)-(e) Feature regions in the same cluster. (f)-(j) Another cluster of feature regions after DBSCAN.

The input to the DBSCAN algorithm is the 42-dimensional selected ZM magnitude vectors of all images belonging to the same group or building. We calculate the mean of the feature vectors as the representative of each cluster, while preserving the isolated feature points. The elliptical regions in Figs. 6(a)-6(e) are feature vectors in the same cluster and are replaced by a single representative feature vector; Figs. 6(f)-6(j) show another feature cluster in the same group of multi-view images.

4. IMAGE INDEXING AND RETRIEVAL

4.1. Descriptor indexing with a KD-tree

After the feature selection and clustering processes described above, all extracted building regions are indexed by a KD-tree according to their ZM magnitude vectors. The goal is to build an indexing structure so that the nearest neighbors of a query vector can be searched rapidly.

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space into two parts by a hyperplane perpendicular to a coordinate axis. The binary space partition is recursively executed until each leaf node contains a single data point. The algorithm for constructing a KD-tree is given in Fig. 7; it is initialized with dim = 1 and Dataset as the set of N database points.

Input: N feature vectors in k dimensions.
Output: A KD-tree in which every leaf node contains a single feature vector.
kd_tree_build(Dataset, dim) {
    If Dataset contains only one point
        Mark a leaf node containing the point; Return;
    Else
        1. Sort all points in Dataset according to feature dimension dim;
        2. Determine the median value of feature dimension dim in Dataset, make a new node, and save the median value;
        3. Dataset_bigger = the points in Dataset with dim >= the median value;
        4. Dataset_smaller = the points in Dataset with dim < the median value;
        5. Set Dataset_bigger as the new node's right child and Dataset_smaller as the new node's left child;
        6. Call kd_tree_build(Dataset_bigger, (dim+1) % k);
        7. Call kd_tree_build(Dataset_smaller, (dim+1) % k);
}
Figure 7. The KD-tree construction algorithm.

4.2. Query by region vote counting

After establishing a KD-tree organizing the ZM magnitude feature vectors in the database, the KD-tree is descended to find the leaf node into which the query point falls, which yields the first candidate nearest neighbor. Then, based on the current minimum distance between the query point and the single database point in that leaf node, the KD-tree is revisited to search for the next available neighbor within the current minimum distance. This tree backtracking is repeated until no further reduction of the minimum distance to the query point is found. We then verify the candidate with the ZM phase feature vector to confirm whether it is qualified. In our experiments, two vectors are qualified as similar when their distance is as small as possible and their magnitude and phase similarity measures satisfy S_mag(P_q, P_d) > 0.85 in equation (7) and S_phase(P_q, P_d) > 0.85 in equation (8).

For each extracted region in the query building image, one vote is cast for the database building image that has a region claimed as the nearest neighbor of the query region. After all extracted regions of the query image have voted, we count the number of votes each database image receives. The database image with the maximum number of votes is returned as the most similar building to the query building.
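The Fig. 7 pseudocode translates almost line-for-line into Python. The sketch below is an assumption-laden toy: 2-D tuples stand in for the 42-dimensional ZM vectors, and `descend` is only the leaf-descent step of Section 4.2, without the backtracking refinement.

```python
class Node:
    """Interior node stores (dim, median); leaf stores a single point."""
    def __init__(self, dim=None, median=None, point=None):
        self.dim, self.median, self.point = dim, median, point
        self.left = self.right = None

def kd_build(dataset, dim, k):
    if len(dataset) == 1:
        return Node(point=dataset[0])            # leaf: single feature vector
    pts = sorted(dataset, key=lambda p: p[dim])  # step 1: sort on current dim
    median = pts[len(pts) // 2][dim]             # step 2: median value
    smaller = [p for p in pts if p[dim] < median]
    bigger = [p for p in pts if p[dim] >= median]
    if not smaller:                              # all equal on dim: force progress
        smaller, bigger = bigger[:1], bigger[1:]
    node = Node(dim=dim, median=median)
    node.left = kd_build(smaller, (dim + 1) % k, k)   # steps 3-7
    node.right = kd_build(bigger, (dim + 1) % k, k)
    return node

def descend(node, query):
    """First candidate neighbour: the leaf the query point falls into."""
    while node.point is None:
        node = node.left if query[node.dim] < node.median else node.right
    return node.point

tree = kd_build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], 0, 2)
assert descend(tree, (9, 6)) == (9, 6)
```

In the full retrieval system, the candidate returned by `descend` would then be refined by backtracking and verified against the ZM phase similarity before casting its vote.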
Figure 8. Examples of the database images in the NCTU-Bud dataset (views 1-5 of the EC Building and of two faces of the ED Building).

Figure 9. Examples of query images for the NCTU-Bud dataset: Classes A-C (correct, over, and under exposure) and Classes D-F (the same exposure conditions with occlusion), each captured on sunny and cloudy days.

5. EXPERIMENTAL RESULTS

In our experiments, the proposed algorithm is written in Matlab under the Windows environment and evaluated on a platform with a 2.83 GHz processor and 3 GB RAM. We test our proposed indexing and retrieval system on two sets of building images: the NCTU-Bud dataset, created by ourselves, and the publicly available ZuBud dataset [11].

5.1. The NCTU-Bud Dataset

To evaluate our proposed approach and to establish a benchmark for future work, we introduce the NCTU-Bud dataset. Our dataset contains high resolution images of 22 buildings on the NCTU campus. For each building in the database we capture at least one facet of the building from five different viewing directions. All database images are at a resolution of 1600x1200 pixels. The database contains a total of 190 building images. Some representative database images are shown in Fig. 8.

The query images are captured with a different camera at a 2352x1568 resolution in two different weather conditions: sunny and cloudy. For each weather condition, six images are collected, each with different exposure settings and different occlusion conditions. In total, 12 classes of images constitute the query dataset, as shown in Fig. 9. Furthermore, five additional camera poses, with different rotations, focal lengths, and translations, are recorded for further testing. A total of 2280 query images is gathered.

5.2. Experimental results for the NCTU-Bud dataset

Table I shows the total number of different region feature vectors collected in the database and the recognition rate for the query images captured at normal exposure on cloudy days. From this table, feature selection using multiple images does not by itself raise the query accuracy rate. However, we achieve 100% accuracy after applying both the feature selection and the DBSCAN clustering. In this case not only is the region storage space reduced, but also only the representative feature vectors are stored for the query search. Consequently, the image retrieval accuracy is raised to 100%.

The storage size (the number of nodes) is decided by the number of region feature vectors found from all images in the database. Approximately 50% space is
saved by applying the feature selection and the DBSCAN clustering method.

The time for feature region detection and description depends on the resolution and the content of an image. If the scene of an image is complex, the number of extremal regions detected by MSER increases and the processing time increases as well. Table II shows the average processing time of feature detection and descriptor computation for 92 different images at different resolutions.

With the feature selection and DBSCAN clustering method, the average time for indexing the database is 22.4 seconds, and the average query time for an image at a resolution of 2352x1568 pixels is 40 seconds. The image query time comprises the time for feature region detection (MSER), descriptor computation (ZM), and searching for the nearest neighbor in the database.

Table III shows the query accuracy rate for the 12 different classes of images. Each class consists of 190 query images. We can see that the accuracy rate on cloudy days is generally higher than that on sunny days. The reason may be that strong shadows are cast by occluding objects on sunny days. Moreover, over-exposed images are harder to recognize than images taken under the other exposure conditions. Comparing Classes D-F with Classes A-C, we can see that the proposed method also performs well under occlusion.

TABLE II. AVERAGE PROCESSING TIME OF FEATURE DETECTION AND DESCRIPTOR COMPUTATION AT DIFFERENT RESOLUTIONS.

Resolution                        2352x1568   1600x1200   640x480
Avg. / std. processing time (sec) 13.8 / 4.3  5.8 / 1.58  1.8 / 0.7

TABLE III. QUERY ACCURACY RATE OF THE NCTU-BUD DATASET UNDER DIFFERENT WEATHER CONDITIONS.

                                        Sunny day   Cloudy day
Class A: Correct exposure               93.6%       100%
Class B: Over exposure                  92.1%       92.1%
Class C: Under exposure                 93.1%       96.3%
Class D: Correct exposure with occlusion 93.6%      96.3%
Class E: Over exposure with occlusion   92.1%       94.2%
Class F: Under exposure with occlusion  92.6%       96.8%
It shows that the proposed system is able to distinguish feature regions even when buildings are partially occluded.

5.3. Experimental results for the ZuBud dataset

The ZuBud dataset contains images of 201 different buildings taken in Zurich, Switzerland, with 5 different images taken of each building. Fig. 10 shows some example images. The dataset includes 115 query images, which are taken with a different camera under different weather conditions.

In the experimental results for the ZuBud dataset, the query accuracy rate with the feature selection and DBSCAN clustering is over 95%. The average query time is 3.1 seconds with a variation of 1.16 seconds. From the experimental results, our system still performs well on this publicly available dataset.

Figure 10. Example images of the ZuBud dataset.

TABLE I. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE NCTU-BUD DATASET

                            Without feature selection   With feature selection only   With feature selection & DBSCAN
# of region feature vectors 113,194                     68,036                        56,089
Memory size of a KD-tree    22 MB                       12.9 MB                       10.6 MB
Query accuracy rate         94.7%                       94.7%                         100%

TABLE IV. TOTAL NUMBER OF REGION FEATURE VECTORS AND QUERY ACCURACY RATE OF THE ZUBUD DATASET

                            Without feature selection   With feature selection        With feature selection & DBSCAN
# of region feature vectors 488,527                     264,311                       256,261
Recognition accuracy        89.57%                      94.8%                         95.6%
6. CONCLUSION

In this paper, we have presented a novel image retrieval system based on the MSER detector and the ZM descriptor, which is robust against geometric and photometric transformations. Experimental results illustrate that the KD-tree indexing and retrieval system with the magnitude and phase ZM feature vectors achieves a high query accuracy rate. The accuracy rates for our NCTU-Bud dataset and for the ZuBuD dataset are 100% and 95%, respectively.

The success of our system is attributed to:
(a) Selecting MSER feature vectors using multiple images of the same building captured from different viewpoints, which removes the unreliable regions.
(b) The DBSCAN clustering technique, which groups similar feature vectors into a representative feature descriptor to tackle the problem of repeated feature patterns in the image.

In the future, we will consider optimizing the programs and porting them to mobile phones for mobile device applications. Furthermore, the query results may be verified using multi-view geometry constraints to eliminate outliers, in order to lower the mis-recognition rate.

REFERENCES

[1] J. Wang, G. Wiederhold, O. Firschein and S. Wei, "Content-Based Image Indexing and Searching Using Daubechies' Wavelets," Int'l J. Digital Libraries, vol. 1, pp. 311-328, 1998.
[2] Z. Chen and S. K. Sun, "A Zernike moment phase-based descriptor for local image representation and matching," IEEE Trans. Image Processing, vol. 19, no. 1, pp. 205-219, 2009.
[3] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. V. Gool, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, pp. 43-72, 2005.
[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[5] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.
[6] F. Jing and M. Li, "An Efficient and Effective Region-Based Image Retrieval Framework," IEEE Trans. Image Processing, vol. 13, no. 5, May 2004.
[7] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance and L. Fan, "Categorizing nine visual classes using local appearance descriptors," ICPR Workshop on Learning for Adaptable Visual Systems, 2004.
[8] W. Wu and J. Yang, "Object Fingerprints for Content Analysis with Applications to Street Landmark Localization," Proceedings of ACM International Conference on Multimedia, 2008.
[9] S. Obdrzalek and J. Matas, "Image Retrieval Using Local Compact DCT-Based Representation," Pattern Recognition, 25th DAGM Symposium, vol. 2781 of Lecture Notes in Computer Science, Magdeburg, Germany: Springer Verlag, pp. 490-497, 2003.
[10] W. Zhang and J. Kosecka, "Hierarchical building recognition," Image and Vision Computing, 2007.
[11] H. Shao, T. Svoboda and L. V. Gool, "ZuBuD—Zurich Buildings Database for Image Based Recognition," Technical Report 260, Computer Vision Laboratory, Swiss Federal Institute of Technology, 2003.
[12] J. H. Friedman, J. L. Bentley and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[13] J. Matas, O. Chum, M. Urban and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, pp. 761-767, 2004.
[14] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman and J. Matas, "A comparison of affine region detectors," Int'l J. Computer Vision, vol. 65, no. 1/2, pp. 43-72, 2005.
[15] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996.
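The KD-tree indexing cited above (Friedman et al. [12]) can be sketched as a build-and-query skeleton. This is an illustrative sketch, not the authors' implementation; the 8-D random vectors stand in for the real ZM descriptors:

```python
import numpy as np

class KDNode:
    __slots__ = ("point", "idx", "axis", "left", "right")
    def __init__(self, point, idx, axis, left, right):
        self.point, self.idx, self.axis = point, idx, axis
        self.left, self.right = left, right

def build(points, indices, depth=0):
    """Recursively split on the median along cycling axes."""
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return KDNode(points[order[mid]], order[mid], axis,
                  build(points, order[:mid], depth + 1),
                  build(points, order[mid + 1:], depth + 1))

def nearest(node, q, best=None):
    """Best-match search: descend to the near side, then backtrack only if
    the hypersphere around the current best crosses the splitting plane."""
    if node is None:
        return best
    d = float(np.linalg.norm(q - node.point))
    if best is None or d < best[0]:
        best = (d, node.idx)
    diff = q[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, q, best)
    if abs(diff) < best[0]:
        best = nearest(far, q, best)
    return best

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(200, 8))   # stand-ins for ZM feature vectors
tree = build(descriptors, np.arange(200))
query = rng.normal(size=8)
dist, idx = nearest(tree, query)
```

The backtracking test is what yields the logarithmic expected query time that makes large descriptor databases searchable.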
Using Modified View-Based AAM to Reconstruct the Frontal Facial Image with Expression from Different Head Orientation

1 Po-Tsang Li (李柏蒼), 1 Sheng-Yu Wang (王勝毓), 1,2 Chung-Lin Huang (黃仲陵)
1 Dept. of Electrical Engineering, National Tsing Hua University, Hsin-Chu, Taiwan.
2 Dept. of Informatics, Fo-Guang University, I-Lan, Taiwan.
E-mail: clhuang@ee.nthu.edu.tw

Abstract

This paper develops a method to solve the unpredictable head orientation problem in 2D facial analysis. We extend the expression subspace to the view-based Active Appearance Model (AAM) so that it can be applied for multi-view face fitting and pose correction of facial images with any expression. Our multi-view model-based facial image fitting system can be applied to a 2D face image (with expression variation) at any pose. The facial image in any view can be reconstructed in another, different view. We divide the facial image into an expression component and an identity component to increase the face identification accuracy. The experimental results demonstrate that the proposed algorithm can be applied to improve the facial identification process. We tested our system on video sequences with a frame size of 320*240 pixels. It requires 30~45 ms to fit a face and 0.35~0.45 ms for warping.

Keywords: View-based AAM; Facial expression.

1. Introduction

Facial image analysis consists of face detection, facial feature extraction, face identification, and facial expression recognition. Currently, 2D face recognition technology is well developed, with high recognition accuracy. However, unpredictable head orientation often causes a big problem for 2D facial analysis. Most previous facial identity or facial expression recognition methods are limited to the frontal face and the profile face: they work only for faces in a single view with ±15 degrees of variation.

The best-known 3D model is the 3D Morphable Model (3DMM), proposed by Blanz and Vetter [11]. 3DMM and AAM are similar: both are model-based approaches consisting of a shape model and a texture model, and both use Principal Component Analysis (PCA) for dimension reduction. The two major differences between them are (1) the optimization algorithm used in fitting, and (2) the feature points in the shape model, which for 3DMM are 3D feature points, whereas in AAM they are 2D locations. For data collection, AAM can be developed using 2D facial images, whereas 3DMM captures the depth information using a 3D face laser scanner. 3DMM can accurately reconstruct the 3D human face; however, it needs so much computation that its applications are limited to academic research.

Blanz et al. [12] apply 3DMM to human identity recognition; however, the fitting process takes 4.5 minutes per frame on a workstation with a 2 GHz Pentium 4 processor. For facial expression recognition, the problem becomes more obvious. Due to insufficient 3D face expression data, one can only rely on a single-expression (neutral) 3D face model for 3D facial identity recognition. However, as more 3D face expression databases become available, researchers such as Wang et al. [16], Amor et al. [17], and Kakadiaris et al. [18] have developed methods to identify the human face in different views and with different expressions. However, because the facial expressions are complicated {surprise, sadness, happiness, disgust, anger, fear}, 3D models for different facial expressions are impractical. Lu et al. [20] only record the variations of the landmark points and then apply Thin-Plate-Spline warping to synthesize facial images with other expressions for fitting the face expression image. Chang et al. [15] also divide the training data into an identity space and an expression space, and use bilinear interpolation to synthesize human faces with other expressions. Ramanathan et al. [19] propose a method using 3DMM for facial expression recognition.

To capture 3D face information, we may use either a 3D laser scanner or multi-view 2D images. Recently, the 2D+3D active appearance model (AAM) method has been proposed by Xiao et al. [21], Koterba et al. [22], and Sung et al. [23]. Based on the known projection matrix of a certain view, the so-called 2D+3D AAM method trains a 2D AAM for a single view for later tracking and fitting of the landmark points of 2D images. Then it uses the corresponding points to calculate the 3D positions of the landmark points. Xiao et al. [21] use only 900 image frames from a single camera to develop the 3D AAM model. Because of the precision error of 2D AAM landmark tracking, Lucey et al. [24] point out that the feature points tracked by 2D+3D AAM are worse than the normalized shape obtained by 2D AAM fitting. Their argument is that 2D+3D AAM cannot obtain the depth information precisely, which causes recognition errors.

In this paper, we apply the view-based AAM proposed by Cootes et al. [4] for model fitting of an input face with any
expression in any view angle, and the face can then be warped to any target viewing angle. The view-based AAM consists of several 2D AAMs, which can be further divided into an inter model and an intra model. The inter model describes the parameter transformation between any two 2D AAMs, whereas the intra model describes the relationship between the model parameters and the viewing angle for a single 2D AAM. The view-based AAM is generated by an off-line training process. Besides the identity subspace, this paper extends the expression subspace to the inter model so that the view-based AAM can be applied for multi-view face fitting and pose correction for an input face of any expression.

The flow diagram is shown in Figure 1. For an input face image, based on the intra model, we may find the relationship between the parameters and the viewing angle, and then remove the angle effect on the parameters. Then we divide the angle-independent model parameters into identity parameters and expression parameters, which can be transformed to the target 2D AAM model by using the inter model. Finally, based on the intra model, we add the influence of the angle parameters onto the model parameters and synthesize the facial image at the target viewing angle.

Figure 1. The flowchart of our system. (Input image → facial region detection and pose classification → modified view-based AAM fitting using the i-th AAM → select the target model for target orientation θ → rotate model i→j → reconstruction with the j-th AAM → new view at angle θ.)

2. Active Appearance Model

In the modified view-based AAM, the 2D AAM plays a crucial part. This chapter introduces the overall structure of the 2D AAM and the flow of the training and fitting algorithms. The major goal of AAM, first proposed by Cootes et al. [2], is to find the model parameters that reduce the difference between the synthesized image (generated by the AAM model) and the target image. Based on the parameters and the AAM model, we may regenerate the face.

2.1 Statistical Appearance Models

A statistical appearance model consists of two parts: the shape model, describing the shape of the object, and the texture model, describing the gray-level information of the object. It uses labeled face images to train the AAM. To train the AAM model, we must have an annotated set of facial images with so-called landmark points. These landmark points are selected as salient points on the face, identifiable on any human face. Figure 2 shows some annotated face images of the training data set.

Figure 2. Examples of the training set.

The number of landmark points is determined experimentally. Although more landmark points increase the accuracy of the model, they also increase the computation of the model fitting process. The distribution of landmark points depends on the characteristics of the face, such as the eyebrows, eyes, nose, and mouth. In these regions we need to put more landmark points, whereas in the other regions (such as the ears, forehead, or other non-visible areas) we put no landmarks.

2.2 Shape Model

Here, we use triangular meshes to compose the human face. We define a shape s_i as a vector containing the coordinates of N_s landmark points in a face image I_i:

    s_i = (x_1, y_1, x_2, y_2, ..., x_N, y_N)^T    (1)

The model is constructed from the coordinates of the labeled points of the training images. We align the locations of the corresponding points on different training faces by using Procrustes analysis as normalization. Given a set of aligned shapes, we then apply Principal Component Analysis (PCA) to the data. Any shape example can then be approximated by

    s = s̄ + P_s b_s    (2)

where s̄ is the mean shape of all aligned shapes, calculated as s̄ = Σ_{i=1}^{N} s_i / N, P_s = (p_s1, p_s2, ..., p_st) is the matrix of the first t eigenvectors, and b_s is a set of shape parameters. p_st is the t-th eigenvector of the shape covariance matrix. Figure 3 shows the effects of varying the first two shape model parameters by ±2 standard deviations.

Figure 3. First two modes of shape variation (±2 sd).

2.3 Texture Model

The texture of an AAM is defined as the gray-level information at the pixels x = (x, y) that lie inside the mean shape s̄. First, we align the control points and the mean shape s̄ of every training face
image by using affine warping. Then we sample the gray-level information g_im of the warped images at the mean shape region. Before applying PCA to the texture data, to minimize the effect of lighting variation, we first normalize g_im by applying a scaling α and an offset β as

    g = (g_im − β·1) / α    (3)

where 1 is a vector of ones. Let ḡ be the mean of the normalized texture data, scaled and offset so that its sum is zero and its variance is unity. α and β are selected to normalize g_im as

    β = (g_im · 1) / n  and  α = g_im · ḡ,    (4)

where n is the number of pixels in the mean shape. We use Equations (3) and (4) iteratively to estimate ḡ until the estimate stabilizes. Then, we apply PCA to the normalized texture data so that a texture example can be expressed as

    g = ḡ + P_g b_g    (5)

where P_g is the matrix of eigenvectors and b_g is the vector of texture parameters. Figure 4 shows the effects of varying the first two texture model parameters through ±2 standard deviations.

Figure 4. First two modes of texture variation (±2 sd).

2.4 Appearance Model

The shape and texture of any example in the training set can be summarized by b_s and b_g. The appearance model combines the two parameter vectors into a single parameter b_c as

    b_c = (W_s b_s ; b_g) = (W_s P_s^T (s − s̄) ; P_g^T (g − ḡ))    (6)

where W_s is a diagonal matrix of weights for the shape parameters. A further PCA is applied to remove the possible correlations between the shape and texture variations:

    b_c = Q_c c    (7)

where Q_c is the matrix of eigenvectors and c is the appearance parameter. Given an appearance parameter c, we can synthesize a face image by generating the gray levels g in the interior of the mean shape and warping the texture from the mean shape s̄ to the model shape s, using

    s = s̄ + P_s W_s^{-1} Q_s c,   g = ḡ + P_g Q_g c    (8)

where Q_c = (Q_s, Q_g)^T. Figure 5 shows the effects of varying the first two appearance model parameters through ±2 standard deviations.

Figure 5. First two modes of appearance variation (±2 sd).

2.5 Shape Parameter Weight

The shape parameters b_s have units of distance and the texture parameters b_g have units of intensity. Because they are of different nature and different relevance, they cannot be compared directly. To estimate the correct value of W_s, we systematically displace each element of b_s from each example's best-match parameters in the training set and sample the corresponding texture difference. In addition, the active appearance model has a pose parameter vector to describe the similarity transformation of the shape. The pose parameter vector t has four elements, t = (k_x, k_y, t_x, t_y)^T, where (t_x, t_y) is the translation and (k_x, k_y) represent the scaling k and the in-plane rotation angle θ: k_x = k(cos θ − 1) and k_y = k sin θ.

2.6 Active Appearance Model Search

Here, we introduce the kernel of AAM. The ultimate goal of applying AAM is that, given an input facial image, we find the model parameters that may be applied to the AAM model to synthesize an image similar to the input image. Given a new image, we have an initial estimate of the appearance parameter c, and of the position, orientation, and scaling in the image. We need to minimize the difference E,

    E = g_image − g_model,    (9)

where, based on the pre-estimated c, we have g_model = ḡ + P_g Q_g c and s_model = s̄ + P_s W_s^{-1} Q_s c. Here g_image denotes the target image texture obtained by applying the warp function using s_model and s̄, and sampling the pixel intensities of the region. An algorithm is needed to adjust the parameters so that the image generated by the model matches the input image as closely as possible. Many optimization algorithms have been proposed for this parameter search. In this paper, we apply the so-called AAM-API method [8]. Rewriting (9) as

    E(p) = g_image − g_model    (10)

where p contains the parameters of the model, p = (c^T | t^T | u^T)^T and u = (α, β)^T, the Taylor expansion of (10) gives

    E(p + ∇p) = E(p) + (∂E/∂p) ∇p    (11)

where the ij-th element of the matrix ∂E/∂p is ∂E_i/∂p_j. Suppose E is our current matching error. We want to find ∇p that minimizes ||E(p + ∇p)||². By equating Equation (11) to zero, we obtain the RMS solution

    ∇p = −A E(p),  where  A = ((∂E/∂p)^T (∂E/∂p))^{-1} (∂E/∂p)^T.

If we applied a conventional optimization process, we would need to recalculate ∂E/∂p after every match, which requires heavy computation. To simplify the optimization process, Cootes et al. assume that A is approximately constant and that the relationship between E and ∇p is linear.
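The precomputed-A update ∇p = −A E(p) can be exercised on a toy linear "model". Everything below (the 50-sample texture, the 4-D parameter vector, the matrix M) is synthetic; it only illustrates the update rule with step-halving, not the actual AAM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: the model texture is linear in the parameters, g_model = M p.
M = rng.normal(size=(50, 4))                # 50 texture samples, 4 parameters
p_true = np.array([0.5, -1.0, 2.0, 0.3])
g_image = M @ p_true                        # texture sampled from the "image"

def residual(p):
    return g_image - M @ p                  # E(p) = g_image - g_model

# Off-line: A = ((dE/dp)^T dE/dp)^(-1) (dE/dp)^T, with dE/dp = -M here.
J = -M
A = np.linalg.inv(J.T @ J) @ J.T

p = np.zeros(4)                             # initial estimate
for _ in range(20):
    E = residual(p)
    dp = -A @ E                             # the update direction
    for k in (1.0, 0.5, 0.25):              # step-halving
        if np.sum(residual(p + k * dp) ** 2) < np.sum(E ** 2):
            p = p + k * dp
            break
    else:
        break                               # no step reduces the error: stop
```

Because A is fixed in advance, each on-line iteration costs only a matrix-vector product, which is the point of the constant-A assumption.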
Therefore, we systematically displace the parameters from their optimal values on the example images and record the corresponding effect on the texture difference. Applying multivariate linear regression to the displacements ∇p and the corresponding difference textures E, we obtain A. Therefore, we need not recalculate the matrix A: it can be computed off-line and stored in memory for later reference. When we want to match an image on-line, the procedure is as follows:

Initial estimate of the parameters p.
1. Calculate the model shape s_model and model texture g_model.
2. Warp the current image and sample the texture g_image.
3. Evaluate the difference texture E = g_image − g_model.
4. Update the model parameters p → p + k∇p, with ∇p = −AE(p), initially k = 1.
5. Calculate the new model shape s_model and model texture g_model.
6. Sample the image from the new shape to obtain g_image.
7. Calculate the new error E'.
8. If ||E'||² < ||E||², accept the new estimate; otherwise try k = 0.5, then k = 0.25.

The iteration of the preceding steps stops when ||E||² cannot be reduced any further, and we may then assume that the iterative algorithm has converged.

3. Modified View-Based AAM

Cootes et al. [4] propose the view-based AAM, based on several 2D AAMs, for model-based fitting of 2D images. The model-based fitting for model parameter estimation can be divided into an intra-model and an inter-model part. Their method has been successfully applied to the human face without expression. However, it has problems fitting a face with expression, because in the face parameter space the expression-induced changes within one person are much bigger than the changes between persons. The original linear transformation between the view angle and the AAM parameters is then no longer valid. Here we propose a method that projects the facial space onto an identity subspace and an expression subspace to solve this problem. We divide the viewing angle into five ranges: [-90, -75], [-60, -45], [-15, 15], [45, 60], and [75, 90], from leftward to rightward. Since the human face is symmetric, in the experiments we only develop the 2D AAMs for three ranges: [-15, 0], [45, 60], and [75, 90].

3.1 Training Data

Since we do not have a large multiple-expression, multi-view facial image database for the 2D AAM training process, we obtained the training data by using six cameras to capture a multiple-expression, multi-view facial image database. We captured multiple-expression, multi-view facial images of 13 people (neutral, surprised, happiness, sadness, disgust, anger, fear). There are in total 510 facial images in the training data set. Figure 6 shows some of the training samples.

3.2 Intra-Model Rotation

Cootes et al. [4] suggest that the model parameters c are related to the view angle θ as

    c = c_0 + c_c cos(θ) + c_s sin(θ)    (12)

where c_0, c_c, and c_s are vectors learned from the training data. We can find the optimal parameter values c_i of each training example and its corresponding view angle θ_i. Cootes' method does not need θ_i precisely; it allows ±10 degrees of error. In our experiments, we fixed the cameras so that the viewing angle is known beforehand. However, this creates an under-determined problem: we use facial images from two views to generate one AAM, so there are only two inputs available to estimate three unknowns. We therefore randomly perturb θ_i by ±1 degree. This is reasonable because errors during image capture are unavoidable, such as the subject slightly moving his body or head. Using this method to add more input data, we may estimate c_0, c_c, and c_s by applying multiple linear regression to the relationship equations between c_i and (1, cos(θ), sin(θ))^T.

Given a facial image, after finding the best fitting parameters c_j we may use Equations (13) and (14) to estimate the viewing angle θ_j as

    (x_j, y_j)^T = R_c^{-1} (c_j − c_0)    (13)

where R_c^{-1} is the left pseudo-inverse of (c_c | c_s), i.e., R_c^{-1} (c_c | c_s) = I_2, and

    θ_j = tan^{-1}(y_j / x_j)    (14)

Figure 6. Examples from the training set for the models. (a) Right profile face, 90° and 75°; (b) right half face, 60° and 45°; (c) frontal face, 0° and -15°.

Figure 7 shows the predicted angle compared with the actual angle over the training set for each model. The results are worse than those of Cootes et al. [4], because our model contains multiple-expression facial image data.

Figure 7. Predicted angle (degrees) vs. actual angle (degrees) across the training set: (a) result on our data; (b) Cootes' experimental results for the view-based active appearance model.

Given a new person's image, we apply AAM fitting to find the
best model parameters and estimate the head angle as well. Then, we can remove the angle effect by using

    c_residual = c_j − c_0 − c_c cos(θ_j) − c_s sin(θ_j)    (15)

The model parameters are thereby separated into two parts: one part that describes the variation due to rotation, and another part that describes the other variations (e.g., the variations of identity, expression, and illumination). We can use the parameters to reconstruct the face at a new angle φ as

    c(φ) = c_residual + c_0 + c_c cos(φ) + c_s sin(φ)    (16)

This method can only perform small-angle rotations based on a single 2D AAM. Cootes et al. [7] and Huisman et al. [25] have shown that the intra-model pose can be applied to human face recognition.

3.3 Identity and Expression Subspaces

To make a large-angle warp, we must transform the parameters between the 2D models. We intend to find a simple transformation between the two models. However, the parameters in (15) consist of an identity component and an expression component that make the transformation non-trivial. Cootes uses two different methods to remove the expression and project into the identity subspace; the parameters are simplified to the variation of identity, which is a linear transformation.

Let r be the residual parameter after (15). We divide the training data into r_neutral and r_exp, where exp ∈ {happiness, sadness, fear, anger, disgust, surprised}, to compute the expression and identity covariance matrices. We remove the identity component of r_exp by

    e_exp = r_exp − r_neutral    (17)

where e_exp is defined as the expression component. Figure 8 shows training examples of e_exp, and Figure 9 shows training examples of r_neutral. By applying PCA to r_neutral and e_exp, we can find the projection P_neutral into an identity subspace and P_exp into an expression subspace as

    e_exp = ē_exp + P_exp b_exp    (18)

and

    r_neutral = r̄_neutral + P_neutral b_neutral    (19)

Figure 8. Some examples from the expression component training set.

Figure 9. The neutral images for training the identity subspace.

Costen et al. [26] suggested that expression changes are orthogonal to the changes due to identity in this framework. For a new image with parameter r, the expression parameter b_exp can be calculated by

    b_exp = P_exp^T (r − ē_exp)    (20)

Then we can compute r_neutral using

    r_neutral = r − ē_exp − P_exp b_exp    (21)

and obtain the projection into the identity subspace,

    b_neutral = P_neutral^T (r_neutral − r̄_neutral)    (22)

Figure 10. The facial space relating identity and expression.

3.4 Inter-Model Rotation

Now we may use the multiple linear regression method to find the relationships (i.e., R_neutral^ij, R_exp^ij) between b_exp^i and r_neutral^i in the i-th AAM model and b_exp^j and r_neutral^j in the j-th AAM model as

    r_neutral^j = e_neutral^ij + R_neutral^ij r_neutral^i    (23)

and

    b_exp^j = e_exp^ij + R_exp^ij b_exp^i    (24)

where e_neutral^ij and e_exp^ij are constants.

3.5 Reconstructing a New View

Given a match to a new person in one view, we can reconstruct a new view by the following steps (as shown in Figure 11):
1. Remove the effects of orientation (Eq. 15).
2. Project into the identity and expression subspaces of the model (Eqs. 20, 21, 22).
3. Project into the subspaces of the target model (Eqs. 23, 24).
4. Project back into the residual space and combine the two vectors into one vector (inverse of Eqs. 20, 21, 22).
5. Add the assigned orientation (Eq. 16).
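Equations (12), (15), and (16) amount to a linear regression on (1, cos θ, sin θ) followed by re-synthesis at a new angle. A small numeric sketch follows; the 4-D parameter vectors and the angle set are synthetic placeholders, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training set obeying Eq. (12): c_i = c0 + cc*cos(t_i) + cs*sin(t_i)
c0, cc, cs = rng.normal(size=(3, 4))          # 4-D appearance parameter vectors
thetas = np.radians(np.array([-15.0, 0.0, 45.0, 60.0, 75.0, 90.0]))
C = c0 + np.outer(np.cos(thetas), cc) + np.outer(np.sin(thetas), cs)

# Eq. (12): multiple linear regression of c_i on (1, cos(theta_i), sin(theta_i))
X = np.column_stack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])
coef, *_ = np.linalg.lstsq(X, C, rcond=None)  # rows: estimates of c0, cc, cs
c0_e, cc_e, cs_e = coef

# Eq. (15): remove the angle effect from a parameter vector fitted at 50 deg
theta_j = np.radians(50.0)
c_j = c0 + cc * np.cos(theta_j) + cs * np.sin(theta_j)
c_residual = c_j - c0_e - cc_e * np.cos(theta_j) - cs_e * np.sin(theta_j)

# Eq. (16): reconstruct the parameters at a new viewing angle phi (frontal)
phi = np.radians(0.0)
c_frontal = c_residual + c0_e + cc_e * np.cos(phi) + cs_e * np.sin(phi)
```

With noise-free synthetic data the residual is (numerically) zero; on real fits it carries exactly the identity, expression, and illumination variation that Sections 3.3 and 3.4 go on to separate.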
Figure 11. The flowchart of the rotate model.

4. Experimental Results

Here, we illustrate the results of our methods. We used six cameras to capture the expressions of each person. There are 13 persons in our experiments, each with 5 or 6 different expressions. We selected 510 pictures as training data for the multi-pose 2D AAMs. In the testing phase, we applied model fitting to all pictures, and about 90% of the testing pictures were fitted successfully. Because we do not have enough training data, we applied leave-one-out evaluation to train and test our rotational model algorithm. Besides warping the input face to the pre-trained pose, we also tried warping the face to other poses and compared the results with video captured in those specific poses.

Although our system allows us to fit the model face and then warp the face to any pose, for some views the warping results are not as good as for others. To compare the results of the rotated model, we warp the input face image in the right-half view to the frontal pose and compare it with the ground truth pre-stored in our database, as shown in Figure 12.

Figure 12. Result of warping the right-half view to frontal vs. ground truth.

Then, we illustrate the experimental results of warping the right-side view to the frontal view and compare them with the ground truth, as shown in Figure 13. Apparently, the performance is not as good as in the previous case.
Figure 13. The experimental results of warping the right-side view to the frontal view.

We use a PC equipped with an Intel C2D 6300 CPU and 2045 MB of memory to test our algorithm. For a video sequence (with frame resolution 320*240), the processing time is less than 45 ms/frame.

The purpose of warping the non-frontal face to the frontal view is to increase the face identification accuracy. Before the warping process, we have separated the identity component and the expression component in the model parameters. To analyze the warped facial image, we may use the identity parameters or the expression parameters independently to increase the recognition rate. In the following, we synthesize the face image using only the identity component or only the expression component. The experimental results for the right-half-view and right-side-view facial images are shown in Figures 14 and 15.

In Figure 14, the lower-right image illustrates the facial image synthesized using only the identity component. The expression can hardly be seen, and it shows a neutral face. In Figure 15, the warped image using the identity component is worse than in Figure 14; however, the warped image using the expression parameters looks fine.

Figure 14. The experimental results of warping the right-half-view facial image to the front view.

Figure 15. The experimental results of warping the right-side-view facial image to the front view.

We use the distance similarity measure x_1 · x_2 / (||x_1|| ||x_2||) to evaluate whether the warped image helps to increase the recognition rate, where x_1 represents the pre-stored frontal neutral face image database and x_2 represents a test facial image with any expression and in any viewing direction.

Table 1. The improvement of identity recognition with ICO (identity component only) and PC (pose correction) within 15 degrees.

                        ICO    PC     PC+ICO
  Frontal intra-model   18%    3.7%   21.5%

In Table 2, the comparison is done with the expression parameters. We find that the identity component increases the similarity to the neutral face in the database. On the other hand, for the right-half-view faces with expression processed by PC + ICO (45-60 degrees), the average similarity is about 74.3%, which is only 4.6% lower than for the PC + ICO frontal expression face. However, the improvement for the right-side-view face with expression is very limited: the similarity is about 56.4%.

5. Conclusions

In this paper, we have demonstrated that the expression parameters can be linearly transformed between any two AAMs of the view-based AAM. This can be used to match an expression-variant face at any angle, and to predict the appearance from new viewpoints given a single image of a person. We anticipate that this approach will make face recognition and expression recognition systems more invariant to viewing angle. In the future, we may establish a wide-angle facial detection and recognition system with higher accuracy, less processing time, and more stability.

References

[1] T. F. Cootes, D. Cooper, C. J. Taylor and J. Graham, "Active Shape Models - Their Training and Application," Computer Vision and Image Understanding, Vol. 61, No. 1, pp. 38-59, 1995.
[2] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," Proc. European Conf. on Computer Vision, Vol. 2, pp. 484-498, 1998.
[3] G. J. Edwards, C. J. Taylor and T. F. Cootes, "Interpreting Face Images using Active Appearance Models," Int. Conf. on Face and Gesture Recognition, 1998.
[4] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "View-Based Active Appearance Models," Image and Vision Computing, Vol. 20, pp. 657-664, 2002.
[5] T. F. Cootes, G. V. Wheeler, K. N. Walker and C. J. Taylor, "Coupled-View Active Appearance Models," British Machine Vision Conference, 2000.
[6] T. F. Cootes, G. J. Edwards and C. J. Taylor, "Active Appearance Models," IEEE Trans. PAMI, Vol. 23, No. 6, pp. 681-685, 2001.
[7] H. Kang, T. F. Cootes and C. J. Taylor, "A Comparison of Face Verification Algorithms using Appearance Models," Proc. BMVC 2002, Vol. 2, pp. 477-486.
[8] M. B. Stegmann, B. K. Ersbøll and R. Larsen, "FAME -- A Flexible Appearance Modelling Environment," IEEE Transactions on Medical Imaging, 2003.
[9] I. Matthews and S. Baker, "Active Appearance Models revisited," IJCV, 2004, in press.
[10] 陳曉瑩, "Real-Time Multi-Angle Face Detection" (即時多角度人臉偵測), Master's thesis, Institute of Electrical Engineering, National Tsing Hua University, 2006.
[11] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," Proc. Computer Graphics SIGGRAPH 99, 1999.
[12] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Trans. PAMI, Vol. 25, No. 9, September 2003.
[13] C. Christoudias, L. Morency and T. Darrell, "Light field appearance manifolds," European Conf. on Computer Vision, Vol. 4, pp. 482-493, 2004.
[14] R. Gross, I. Matthews and S. Baker, "Eigen light-fields and face recognition across pose," Int. Conf. on Automatic Face and Gesture Recognition, 2002.
[15] J. Chang, Y. Zheng and Z. Wang, "Facial Expression Analysis and Synthesis: A Bilinear Approach," Int. Conf. on Information Acquisition (ICIA '07), 8-11 July 2007.
[16] Y. Wang, G. Pan and Z. Wu, "3D Face Recognition in the Presence of Expression: A Guidance-based Constraint Deformation Approach," IEEE CVPR, 2007.
[17] B. B. Amor, M. Ardabilian and L. Chen, "New Experiments on ICP-Based 3D Face Recognition and Authentication," ICPR 2006, Vol. 3, pp. 1195-1199.
[18] I. A. Kakadiaris, G. Passalis, G. Toderici, M. N. Murtuza, Y. Lu, N. Karampatziakis and T. Theoharis, "Three-Dimensional Face Recognition in the Presence of Facial Expressions: An Annotated Deformable Model Approach," IEEE Trans. PAMI, Vol. 29, No. 4, pp. 640-649, April 2007.
[19] S. Ramanathan, A. Kassim, Y. Venkatesh and S. W. Wu, "Human Facial Expression Recognition using a 3D Morphable Model," IEEE ICIP, Oct. 2006.
[20] X. Lu and A. Jain, "Deformation Modeling for Robust 3D Face Matching," IEEE Trans. PAMI, 2007.
[21] J. Xiao, S. Baker, I. Matthews and T. Kanade, "Real-time combined 2D+3D active appearance models," CVPR 2004, pp. II-535-II-542.
[22] S. Koterba, S. Baker, I. Matthews, C. Hu, J. Xiao, J. Cohn and T. Kanade, "Multi-view AAM fitting and camera calibration," IEEE ICCV, Vol. 1, pp. 511-518, 17-21 Oct. 2005.
[23] J. Sung and D. Kim, "STAAM: Fitting a 2D+3D AAM to Stereo Images," IEEE ICIP, 8-11 Oct. 2006.
[24] S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. de la Torre and J. Cohn, "AAM derived face representations for robust facial action recognition," Int. Conf. on Automatic Face and Gesture Recognition, pp. 155-160, 10-12 April 2006.
[25] P. Huisman, R. van Munster, S. Moro-Ellenberger, R. Veldhuis and A. Bazen, "Making 2D face recognition more robust using AAMs for pose compensation," Int. Conf. on Automatic Face and Gesture Recognition, 10-12 April 2006.
[26] N. Costen, T. F. Cootes and C. J. Taylor, "Compensating for Ensemble-Specificity Effects when Building Facial Models," Proc. British Machine Vision Conference 2000, Vol. 1, pp. 62-71.
Patch-Based Occupant Classification for Smart Airbag

Shih-Shinh Huang, Er-Liang Jian, and Chi-Liang Chien
Dept. of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology
Chung-Shan Institute of Science and Technology
Email: poww@nkfust.edu.tw, jianerliang@gmail.com

Abstract—This paper presents a vision-based approach for occupant classification. In order to circumvent the intra-class variance, we consider the empty class as reference and describe the occupant class by appearance difference rather than by appearance itself, as in traditional approaches. Each class in this work is modeled by a set of representative parts called patches, each of which is represented by a Gaussian distribution. This alleviates the mis-classification resulting from severe lighting change which makes the image locally blooming or invisible. Instead of using maximum likelihood (ML) for patch selection and for estimating the parameters of the proposed generative models, we discriminatively learn the models through a boosting algorithm by minimizing the loss of the training error.

Keywords—patch-based model, discriminative learning

Figure 1. Challenges: (a) Severe lighting change. The images have considerably large dynamic range; the observed images have significantly different appearance. (b) Intra-class variance. Persons wearing clothing with different styles or colors.

I. INTRODUCTION

Until now, the integration of airbags into automobiles has significantly improved occupant safety in vehicle crashes. However, inappropriate deployment of the airbag in some situations may cause severe or even fatal injuries; for example, when it deploys on a rear-facing infant seat, or when a passenger is sitting too close to the airbag. According to the report of the American National Highway Transportation and Safety Administration (NHTSA), since 1990 more than 200 occupants have been killed by airbags deployed in low-speed crashes. To protect occupants from this kind of injury, NHTSA defined the Federal Motor Vehicle Safety Standard (FMVSS) 208 in 2001. One of the fundamental issues of FMVSS 208 is to recognize the occupant class inside the vehicle for controlling the deployment of the airbag. The five basic classes defined in FMVSS 208 are (i) Empty, (ii) RFIS (Rear-Facing Infant Seat), (iii) FFCS (Front-Facing Child Seat), (iv) Child, and (v) Adult.

Some existing sensors, such as ultrasound, pressure, or camera, have been used to develop systems aiming at meeting the classification requirements of FMVSS 208. In this work, we choose the camera as the sensing device, since it can provide a rich representation of the occupant in front of the dashboard. This gives the proposed approach potentially higher classification accuracy. Occupant classification based on computer vision is challenging in the presence of severe change in lighting, large intra-class variance, and structure variance. Since the vehicle is moving, the observed image may have a considerably large dynamic range, from bright sunlight to dark shadow. In the extreme, this makes some regions of the image blooming or invisible and thus complicates the classification task. The intra-class variance denotes that the same occupant class may have different appearance; for instance, passengers may wear clothing with different colors, and baby seats may have different styles. The difference in scene resulting from the configuration change of objects inside the vehicle is referred to as the structure variance. Figure 1 shows some images exhibiting the lighting change and intra-class variance. Similar to the works in the literature, we assume that the monitored scene has no structure variance, and the objective of this paper is to achieve a high recognition rate against severe lighting change and intra-class variance.

A. Related Work

Owechko et al. [1], who are pioneers in this area, attempted to eliminate the illumination variance by first applying intensity normalization to the training images. The coefficients of the eigenvectors computed by principal component analysis (PCA) are used to represent the occupant class. The unknown input image is then recognized as the same class as its nearest-neighbor sample. In order to overcome lighting change, Haar wavelet filters, which describe the intensity difference among neighboring regions, have been used for occupant representation. An over-complete and dense set of Haar filters over thousands of rectangular regions is adopted in [2]. Then, a Support Vector Machine (SVM) is applied to determine the boundaries among different occupant classes for handling intra-class variance.
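The Haar-style rectangle features mentioned above are simply sums-minus-sums of pixel intensities, which is why thousands of them can be evaluated cheaply through an integral image. The following is a rough illustration only, not the exact filter bank of [2]; the helper names are ours:

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns; any rectangle sum then costs O(1)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    """Sum of img[top:top+h, left:left+w] via inclusion-exclusion on the integral image."""
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def haar_two_rect(ii, top, left, h, w):
    """Two-rectangle Haar-like response: left half minus right half of the window."""
    half = w // 2
    return rect_sum(ii, top, left, h, half) - rect_sum(ii, top, left + half, h, half)
```

Once the integral image is built, every additional rectangle feature costs a constant number of lookups, which is what makes a dense, over-complete feature pool practical.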
In [3], [4], the edge map of the passenger appearance is extracted through a background subtraction algorithm and further described by the use of high-order Legendre moments; the classification is achieved using the k-nearest-neighbors strategy. The edge map of the occupant is described by higher-order Tchebichef moments in [5], and then the AdaBoost algorithm is applied to select a set of discriminative moments for classification. To utilize more information for classification, multiple features, including range [6], motion information, and the edge map, are fused under a two-layer architecture [7], [8]; the classifiers in each layer are Non-linear Discriminant Analysis (NDA) classifiers.

The features used in the aforementioned works are all global descriptors, such as the dense edge map [7], [8], Legendre moments [3], [4], or Tchebichef moments [5]. The main limitation of this kind of approach is that the classification accuracy deteriorates in the two extreme cases (blooming and invisible) resulting from severe lighting change. To circumvent this, we present a patch-based model, commonly used in the recognition literature [9], [10] for handling occlusion effects, to describe the occupant class. Furthermore, the above works directly model the appearance of the occupant and thus suffer from the significant intra-class variance. The general way to solve or alleviate this problem in the literature is to introduce a classification algorithm such as SVM or NDA. Based on the insight that the silhouettes of different occupant classes are distinct, we instead consider the empty class as the reference and thus model the appearance difference with respect to the empty class.

B. Approach Overview

The objective of occupant classification is to assign one of five classes C = {C_Empty, C_RFIS, C_FFCS, C_Child, C_Adult} to the currently observed image. The system mainly consists of two phases: training and classification. In the training phase, we first recover a reflectance image of the empty class by removing the illumination effect. The obtained reflectance image is considered as the reference image for further feature representation. In this work, each occupant class is described by a patch-based generative model in order to handle severe lighting change which makes the image locally blooming or invisible. Traditionally, the parameters of a generative model are estimated using the ML strategy, in which only samples with the same label are considered and used for training the corresponding model. However, the models learned in this way suffer from having less discriminativity among the different classes. Instead, we adopt a discriminative boosting algorithm to estimate the model parameters by directly minimizing the training error.

In the classification phase, the appearances at the trained patches of a specific occupant model are taken into consideration for feature representation. The feature used in this work is the difference in appearance between the patch at the observed image and that at the reference image. This makes the proposed approach invariant to intra-class variance. Then, the likelihood ratios evaluating the existence confidence of the given image with respect to the five trained models are computed, and the classification result is the occupant class with the highest confidence.

The remainder of this paper is organized as follows. In Section II, we introduce the generative models for representing the occupant classes and the way to perform occupant classification. The boosting algorithm for estimating the parameters of the models in a discriminative manner is then described in Section III. Section IV demonstrates the effectiveness of the developed approach by providing experimental results on an abundant database. Finally, we conclude the paper in Section V with some discussion.

II. PATCH-BASED CLASSIFIER

For every class, we build a generative model consisting of several patches, each described by a Gaussian distribution. The observed image is classified by maximizing the likelihood probability. Here, the feature representation of a patch is the appearance difference with respect to a reference image. In order to eliminate the illumination factor, we recover the reflectance image of the empty class and consider it as the reference image. Negative normalized correlation is then introduced to measure the appearance difference for representing the feature of patches.

A. Feature Representation

The images for training are captured under various lighting conditions. As in the foreground segmentation literature [11], a reference image suitable for difference measurement should be illumination invariant and contain no moving objects. As discussed in [12], an image is the product of two images: a reflectance image and an illumination image. The reflectance image of the scene is constant, while the illumination image changes with the lighting condition in the environment. Accordingly, the reflectance image of the empty class is recovered and considered as the reference image here.

Given a set of empty-class images in the training database, we apply the approach proposed in [13] to estimate the empty-class reflectance image Ir, based on the assumption that illumination images have lower contrast than the reflectance image. This implies that the derivative filter outputs on the illumination image will be sparse, so the reflectance recovery problem can be re-formulated as a maximum-likelihood estimation problem. Figure 2 shows the decomposition of three empty-class images into a constant reflectance image and its corresponding illumination images.

Let p be a patch whose configuration is θ(p) = (t, l, w, h), where (t, l) is the coordinate of the top-left corner and (w, h) is the patch size.
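The reflectance recovery described above can be made concrete. Under the stated assumption that illumination varies smoothly while reflectance edges repeat across frames, the per-pixel median of derivative-filtered log images is a robust estimate of the log-reflectance derivatives. The following is our own simplified sketch of the estimator the text attributes to [13] (the function name is ours, and the final step of integrating the derivatives back into a reflectance image is omitted):

```python
import numpy as np

def reflectance_derivatives(images, eps=1e-6):
    """Estimate the derivatives of the log-reflectance image from a frame stack.

    Each observation is reflectance * illumination, so in the log domain all
    frames share the reflectance term while the illumination term varies but
    has sparse derivatives.  The median across frames of the derivative-
    filtered log images is then a robust estimate of the reflectance
    derivatives (the ML estimate under a Laplacian prior).
    """
    logs = np.log(np.asarray(images, dtype=float) + eps)  # shape: T x H x W
    dx = np.diff(logs, axis=2)   # horizontal derivative of each log frame
    dy = np.diff(logs, axis=1)   # vertical derivative of each log frame
    return np.median(dx, axis=0), np.median(dy, axis=0)
```

When the illumination of each frame is spatially uniform, the estimate is exact; with spatially varying illumination, the median suppresses illumination edges only insofar as they are sparse and uncorrelated across frames.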
To impose a locality property similar to the histogram of oriented gradients (HOG) [14], we divide a patch p into four quadrants {q1, q2, q3, q4}. We denote the quadrant qi at the observed image Io and at the recovered reflectance image Ir as Io(qi) and Ir(qi), respectively. The schematic form is shown in Figure 3.

Figure 2. Examples of reflectance image recovery for the three images in the first row. The second row shows the recovered reflectance image, and the three corresponding illumination images are shown in the third row.

Figure 3. The definition of the quadrant images Io(qi) and Ir(qi).

Inspired by the work [15] on dealing with the presence of severe lighting change, a matching function (MF) γ(·) is applied to measure the appearance difference between Io(qi) and Ir(qi). The γ(Io(qi), Ir(qi)) is defined as:

  γ(I_o(q_i), I_r(q_i)) = − Σ_{(x,y)∈q_i} N(x,y) / √( Σ_{(x,y)∈q_i} D_o(x,y) · Σ_{(x,y)∈q_i} D_r(x,y) )   (1)

where

  N(x,y)   = (I_o(x,y) − Ī_o(q_i)) × (I_r(x,y) − Ī_r(q_i))
  D_o(x,y) = (I_o(x,y) − Ī_o(q_i)) × (I_o(x,y) − Ī_o(q_i))
  D_r(x,y) = (I_r(x,y) − Ī_r(q_i)) × (I_r(x,y) − Ī_r(q_i))   (2)

Here, Ī_o(q_i) and Ī_r(q_i) denote the average intensities of the quadrant images I_o(q_i) and I_r(q_i), respectively. Remarkably, this function computes the negative normalized correlation between I_o(q_i) and I_r(q_i); hence, the range of γ(·) is [−1, 1]. Thus, the feature representation of the patch p is defined as a 4-D vector.

B. Classification Model

A generative occupant model M^c = {p_k^c : k = 1, ..., K^c} consisting of K^c patches is proposed to describe each class c ∈ C in this work. Each patch p_k^c is modeled by a Gaussian distribution N_k^c = {μ_k^c, Σ_k^c} associated with the patch configuration θ(p_k^c), where μ_k^c and Σ_k^c are the mean and covariance matrix, respectively. By assuming independence among patches, the log-likelihood of an observed image I belonging to the class c is defined as:

  log Pr(I | z^c = 1) = log Pr(I | M^c) = Σ_{k=1}^{K^c} log Pr( f(I(p_k^c)) | N_k^c )   (3)

where z^c ∈ {+1, −1} is the membership label for the class c, and f(I(p_k^c)) is the aforementioned patch representation of the image I at the patch p_k^c. Remarkably, the proposed model, which learns the likelihood of a given observation, is a generative one.

Instead of solving the occupant classification problem directly using maximum likelihood (ML), that is, c* = arg max_c log Pr(I | z^c = 1), we introduce an existence confidence to re-formulate it as five one-against-others binary classification problems. The work in [9] claims that this allows both classification and training to be done in a discriminative manner and thus improves the classification accuracy. Consequently, we define the existence confidence of a specific class c given an observed image I as the log-likelihood ratio test (LRT), which can be expressed as:

  H(I, c) = log [ Pr(I | z^c = 1) / Pr(I | z^c = −1) ]   (4)

Without assuming any prior, we approximate the background hypothesis Pr(I | z^c = −1) by a constant Θ^c. Accordingly, the functional form H(·) of the LRT statistic in (4) becomes:

  H(I, c) = log Pr(I | z^c = 1) − Θ^c
          = Σ_{k=1}^{K^c} log Pr( f(I(p_k^c)) | N_k^c ) − Θ^c
          = Σ_{k=1}^{K^c} { log Pr( f(I(p_k^c)) | N_k^c ) − Θ_k^c }   (5)

where Θ^c = Σ_{k=1}^{K^c} Θ_k^c. Therefore, the classification result for a given I and the five trained patch-based generative models {M^c : c ∈ C} is the class c* with the highest existence confidence value, that is, c* = arg max_c H(I, c). However, we have not yet described how to estimate the model parameters Ω^c = {(θ_k^c, μ_k^c, Σ_k^c, Θ_k^c) : k = 1, ..., K^c}. In the next section, a boosting algorithm is proposed to train these parameters in a discriminative way.

III. DISCRIMINATIVE LEARNING USING BOOSTING

In the learning literature [9], [10], several compelling arguments indicate that the model with the parameters