# CVGIP 2010 Part 3

CVGIP 2010: The 23rd IPPR Conference on Computer Vision, Graphics, and Image Processing



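Section II relates corresponding points on two planes by a homography, s·c_t = H·c_s (Eq. (1)), estimated from at least four correspondences. The sketch below is illustrative only: the helper names are ours, and a production least-squares estimate over many correspondences would typically use an SVD-based DLT. With exactly four correspondences and no three collinear, fixing H[2][2] = 1 reduces the estimation to an 8 × 8 linear solve:

```python
def solve_homography(src, dst):
    """Estimate the 3x3 homography H with s*c_t = H*c_s (Eq. 1) from four
    point correspondences, fixing H[2][2] = 1 and solving the resulting
    8x8 linear system."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = gauss_solve(A, b)
    return [[h[0], h[1], h[2]], [h[3], h[4], h[5]], [h[6], h[7], 1.0]]

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def apply_homography(H, p):
    """Map a 2-D point through H, dividing out the scalar factor s."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

Mapping in the reverse direction, as the paper notes, simply uses the inverse of H.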
relationship between points on two planes:

s c_t = H c_s ,  (1)

where s is a scalar factor and c_s and c_t are a pair of corresponding points in the source and target patches, respectively. If there are at least four correspondences where no three correspondences in each patch are collinear, we can estimate H through the least-squares approach.

We regard c_s as points of the 3-D environment model and c_t as points of the 2-D image, and then calculate the matrix H to map points from the 3-D model to the images. In the reverse order, we can also map points from the images to the 3-D model.

B. Planar patch modeling

Precise camera calibration is not an easy job [13]. In the virtual projector methods [4], [7], the texture image will be misaligned to the model if the camera calibration or the 3-D model reconstruction has a large error. Alternatively, we develop a method that approximates the 3-D environment model through multiple yet individual planar patches and then renders the image content of every patch to generate a synthesized and integrated view of the monitored scene. In this way we can easily construct a surveillance system with a 3-D view of the environment.

Mostly we can model the environment with two basic building components: horizontal planes and vertical planes. The horizontal planes for hallways and floors are usually surrounded by doors and walls, which are modeled as the vertical planes. Both kinds of planes are further divided into several patches according to the geometry of the scenes (Figure 3). If the scene consists of simple structures, a few large patches can represent the scene well with less rendering cost. On the other hand, more and smaller patches are required to accurately render a complex environment, at the expense of more computational cost.

In the proposed system, the 3-D rendering platform is developed on OpenGL and each patch is divided into triangles before rendering. Since the linear interpolation used to fill triangles with texture in OpenGL is not suitable for the perspective projection, distortion will appear in the rendering result. Although one can use a lot of triangles to reduce this kind of distortion, as shown in Figure 4, it will enlarge the computational burden and is therefore not feasible for real-time surveillance systems.

Fig. 4. The comparison of rendering layouts between different numbers and sizes of patches. A large distortion occurs if there are fewer patches for rendering (left). More patches make the rendering much better (right).

To make a compromise between visualization accuracy and rendering cost, we propose a procedure that automatically divides each patch into smaller ones and decides suitable sizes of patches for accurate rendering (Figure 4). We use the following mean-squared error to estimate the amount of distortion when rendering image patches:

MSE = (1 / (m × n)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} (I_{ij} − Ĩ_{ij})² ,  (2)

where I_{ij} is the intensity of the point obtained from homography transformation, Ĩ_{ij} is the intensity of the point obtained from texture mapping, i and j are the coordinates of row and column in the image, respectively, and m × n represents the dimension of the patch in the 2-D image. In order to have a reference scale to quantify the distortion amount, a peak signal-to-noise ratio is calculated by

PSNR = 10 log10 ( MAX_I² / MSE ) ,  (3)

where MAX_I is the maximum pixel value of the image. Typical values for the PSNR are between 30 and 50 dB, and an acceptable value is considered to be about 20 dB to 25 dB in this work. We set a threshold T to determine the quality of texture mapping by

PSNR ≥ T .  (4)

If the PSNR of a patch is lower than T, the procedure divides it into smaller patches and repeats the process until the PSNR values of all patches are greater than the given threshold T.

III. ON-LINE MONITORING

The proposed system displays the videos on the 3-D model. However, the 3-D foreground objects such as pedestrians are projected to the image frame and become 2-D objects. They will appear flattened on the floor or wall since the system displays them on planar patches. Furthermore, there might be ghosting effects when 3-D objects are in the overlapping areas of different camera views. We need to tackle this problem by separating and rendering 3-D foreground objects in addition to the background environment.
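The patch-subdivision test of Section II-B (Eqs. (2)–(4)) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the 25 dB default threshold are our assumptions (the paper states an acceptable range of about 20–25 dB).

```python
import math

def mse(patch_a, patch_b):
    """Mean-squared error between two equally sized grayscale patches (Eq. 2)."""
    m, n = len(patch_a), len(patch_a[0])
    total = sum((patch_a[i][j] - patch_b[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(patch_a, patch_b, max_i=255.0):
    """Peak signal-to-noise ratio in dB (Eq. 3); infinite for identical patches."""
    e = mse(patch_a, patch_b)
    if e == 0:
        return float("inf")
    return 10.0 * math.log10(max_i ** 2 / e)

def needs_subdivision(patch_a, patch_b, threshold_db=25.0):
    """A patch is split further whenever its PSNR falls below T (Eq. 4)."""
    return psnr(patch_a, patch_b) < threshold_db
```

A patch failing `needs_subdivision` would then be split (for example, into quadrants) and each sub-patch re-tested until every patch satisfies PSNR ≥ T.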
Fig. 6. A flowchart to illustrate the whole method. The purple part is pixel-based.

Fig. 7. Orientation determination of the axis-aligned billboarding. L is the location of the billboard, E is the location projected vertically from the viewpoint to the floor, and v is the vector from L to E. The normal vector (n) of the billboard is rotated according to the location of the viewpoint. Y is the rotation axis and φ is the rotation angle.

we propose another method to find T_θ(x, y, t) faster. The number of samples which are classified as shadow or background at time t by using FSMS is A_{b,s}^{T_r}(x, y, t). We define a ratio R(T_r) = A_{b,s}^{T_r} / A_{b,s,f}, where A_{b,s,f} is all samples at position (x, y) and b, s, f represent the background, shadow, and foreground, respectively. The threshold T_θ(x, y, t) can be updated to T′_θ(x, y, t) by R(T_r). The number of samples whose cos(θ(x, y)) values are larger than T′_θ(x, y, t) is equal to A_{b,s}, and it is required that

R(T′_θ(x, y, t)) = R(T_r) .  (10)

Besides, we add a perturbation δT_θ to T′_θ(x, y, t). Since FSMS only finds a threshold in I_{fθ}(x, y, t), if the initial threshold T_θ(x, y, 0) is set larger than the true threshold, the best updated threshold T′_θ is never smaller than the threshold T_θ. Therefore the true angle threshold will never be found with time. To solve this problem, a perturbation is added to the updating threshold:

T_θ(x, y, t) = T′_θ(x, y, t) − δT_θ .  (11)

Since the new threshold T_θ(x, y, t) has a smaller value to cover more samples, it can approach the true threshold with time. This perturbation can also make the method more adaptable to changes of the environment. A flowchart in Figure 6 illustrates the whole method.

E. Axis-aligned billboarding

In visualization, axis-aligned billboarding [14] constructs billboards in the 3-D model for moving objects, such as pedestrians, and the billboard always faces the viewpoint of the user. The billboard has three properties: location, height, and direction. By assuming that all the foreground objects are always moving on the floor, the billboards can be aligned to be perpendicular to the floor in the 3-D model. The 3-D location of the billboard is estimated by mapping the bottom-middle point of the foreground bounding box in the 2-D image through the lookup tables. The ratio between the height of the bounding box and the 3-D model determines the height of the billboard in the 3-D model. The relationship between the direction of a billboard and the viewpoint is defined as shown in Figure 7.

The following equations are used to calculate the rotation angle of the billboard:

Y = n × v ,  (12)
φ = cos⁻¹(v · n) ,  (13)

where v is the vector from the location of the billboard, L, to the location E projected vertically from the viewpoint to the floor, n is the normal vector of the billboard, Y is the rotation axis, and φ is the estimated rotation angle. The normal vector of the billboard is rotated to be parallel to the vector v, so the billboard is always facing toward the viewpoint of the operator.

F. Video content integration

If the fields of view of the cameras overlap, objects in these overlapping areas are seen by multiple cameras. In this case, there might be ghosting effects when we simultaneously display videos from these cameras. To deal with this problem, we use the 3-D locations of moving objects to identify the correspondence of objects in different views. When the operator chooses a viewpoint, the rotation angles of the corresponding billboards are estimated by the method presented above, and the system only renders the billboard whose rotation angle is the smallest among all of the corresponding billboards, as shown in Figure 8.
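Once v and n are normalized, Eqs. (12) and (13) reduce to a cross product and an arc-cosine. A minimal sketch (the function names are ours; a renderer would then apply the axis Y and angle φ to the billboard, e.g. via OpenGL's glRotatef):

```python
import math

def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def billboard_rotation(n, L, E):
    """Rotation axis Y = n x v (Eq. 12) and angle phi = acos(v . n) (Eq. 13),
    where v points from the billboard location L to the point E projected
    vertically from the viewpoint to the floor."""
    v = normalize(tuple(e - l for e, l in zip(E, L)))
    n = normalize(n)
    axis = cross(n, v)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(v, n))))
    phi = math.acos(dot)
    return axis, phi
```

Clamping the dot product guards against floating-point values slightly outside [−1, 1] before the arc-cosine.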
Fig. 8. Removal of the ghosting effects. When we render the foreground object from one view, the object may appear in another view and thus cause the ghosting effect (bottom-left). Static background images without foreground objects are used to fill the area of the foreground objects (top). Ghosting effects are removed, and the static background images can be updated by background modeling.

Fig. 9. Determination of viewpoint switch. We divide the floor area depending on the fields of view of the cameras and associate each area to one viewpoint close to a camera. The viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects.

G. Automatic change of viewpoint

The proposed surveillance system provides a target tracking feature by determining and automatically switching the viewpoints. Before rendering, several viewpoints are specified in advance to be close to the locations of the cameras. During the viewpoint switching from one to another, the parameters of the viewpoints are gradually changed from the starting point to the destination point for a smooth view transition.

The switching criterion is defined as the number of blobs found in the specific areas. First, we divide the floor area into several parts and associate them to each camera, as shown in Figure 9. When people move in the scene, the viewpoint is switched automatically to the predefined viewpoint of the area containing more foreground objects. We also make the billboard transparent by setting the alpha value of its textures, so the foreground objects appear with fitting shapes, as shown in Figure 10.

IV. EXPERIMENT RESULTS

We developed the proposed surveillance system on a PC with an Intel Core Quad Q9550 processor, 2GB RAM, and one nVidia GeForce 9800GT graphics card. Three IP cameras with 352 × 240 pixels resolution are connected to the PC through the Internet. The frame rate of the system is about 25 frames per second.

In the monitored area, automated doors and elevators are specified as background objects, albeit their images do change when the doors open or close. These areas will be modeled in background construction and not be visualized by billboards; the system uses a ground mask to indicate the region of interest. Only the moving objects located in the indicated areas are considered as moving foreground objects, as shown in Figure 11.

The experimental results shown in Figure 12 demonstrate that the viewpoint can be chosen arbitrarily in the system, and operators can track targets with a closer view or any viewing direction by moving the virtual camera. Moreover, the moving objects always face the virtual camera by billboarding, and the operators can easily perceive the spatial information of the foreground objects from any viewpoint.

V. CONCLUSIONS

In this work we have developed an integrated video surveillance system that provides a single comprehensive view of the monitored areas to facilitate tracking moving targets through its interactive control and immersive visualization. We utilize planar patches for 3-D environment model construction. The scenes from cameras are divided into several patches according to their structures, and the numbers and sizes of patches are automatically determined to compromise between the rendering effects and efficiency. To integrate video contents, homography transformations are estimated for the relationships between image regions of the video contents and the corresponding areas of the 3-D model. Moreover, the proposed method to remove moving cast shadows can automatically decide thresholds by on-line learning, so manual setting can be avoided. Compared with the work based on frames, our method increases the accuracy of shadow removal. In visualization, the foreground objects are segmented accurately and displayed on billboards.

REFERENCES

[1] R. Sizemore, "Internet protocol/networked video surveillance market: Equipment, technology and semiconductors," Tech. Rep., 2008.
[2] Y. Wang, D. Krum, E. Coelho, and D. Bowman, "Contextualized videos: Combining videos with environment models to support situational understanding," IEEE Transactions on Visualization and Computer Graphics, 2007.
Fig. 10. Automatic switching of the viewpoint for tracking targets. People walk in the lobby and the viewpoint of the operator automatically switches to keep track of the targets.

Fig. 11. Dynamic background removal by ground mask. There is an automated door in the scene (top-left) and it is visualized by a billboard (top-right). A mask covering the floor (bottom-left) is used to decide whether to visualize the foreground or not. With the mask, we can remove unnecessary billboards (bottom-right).

Fig. 12. Immersive monitoring at an arbitrary viewpoint. We can zoom out the viewpoint to monitor the whole surveillance area or zoom in the viewpoint to focus on a particular place.

[3] Y. Cheng, K. Lin, Y. Chen, J. Tarng, C. Yuan, and C. Kao, "Accurate planar image registration for an integrated video surveillance system," Computational Intelligence for Visual Intelligence, 2009.
[4] H. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna, "Video flashlights: real time rendering of multiple videos for immersive model visualization," in 13th Eurographics Workshop on Rendering, 2002.
[5] U. Neumann, S. You, J. Hu, B. Jiang, and J. Lee, "Augmented virtual environments (AVE): dynamic fusion of imagery and 3-D models," IEEE Virtual Reality, 2003.
[6] S. You, J. Hu, U. Neumann, and P. Fox, "Urban site modeling from lidar," Lecture Notes in Computer Science, 2003.
[7] I. Sebe, J. Hu, S. You, and U. Neumann, "3-D video surveillance with augmented virtual environments," in International Multimedia Conference, 2003.
[8] T. Horprasert, D. Harwood, and L. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," IEEE ICCV, 1999.
[9] K. Chung, Y. Lin, and Y. Huang, "Efficient shadow detection of color aerial images based on successive thresholding scheme," IEEE Transactions on Geoscience and Remote Sensing, 2009.
[10] J. Kim and H. Kim, "Efficient region-based motion segmentation for a video monitoring system," Pattern Recognition Letters, 2003.
[11] E. J. Carmona, J. Martínez-Cantos, and J. Mira, "A new video segmentation method of moving objects based on blob-level knowledge," Pattern Recognition Letters, 2008.
[12] N. Martel-Brisson and A. Zaccarin, "Learning and removing cast shadows through a multidistribution approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[13] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master, "Calibrated, registered images of an extended urban area," International Journal of Computer Vision, 2003.
[14] A. Fernandes, "Billboarding tutorial," 2005.