Stitching Video from Webcams    421

2 Related Work

A tremendous amount of progress has been made in static image mosaicing. For example, strip panorama techniques capture horizontal outdoor scenes continuously and then stitch them into a long panoramic picture, which can be used for digital tourism and the like. Many techniques such as plane-sweep and multi-view projection have been developed to remove ghosting and blurring artifacts.

As for panoramic video, however, the technology is still not mature. One of the main difficulties is the real-time requirement. The common frame rate is 25 ~ 30 FPS, so to create a video panorama, each panoramic frame must be created within at most 0.04 seconds, which means that the stitching algorithms for static image mosaicing cannot be applied directly to real-time frames. And because of the time-consuming computation involved, existing methods for improving static panoramas can hardly be applied to stitching videos. To skirt these troubles, some researchers resort to hardware. For example, a carefully designed camera cluster that guarantees an approximate common virtual COP (center of projection) can register the inputs easily and avoid parallax to some extent. From another perspective, however, this kind of approach is undesirable because it relies heavily on the capturing device.

Our approach does not need special hardware. Instead, it makes use of ordinary webcams, which means that the system is inexpensive and easily applicable. Besides this, the positions and directions of the webcams are flexible as long as they have some overlapped field-of-view. We design a two-stage solution to tackle this challenging situation. The whole system is discussed in Section 3, and the implementation of each stage is discussed in Sections 4 and 5 in detail.

3 System Framework

As shown in Fig. 1, the inputs of our system are independent frame sequences from two common webcams, and the output is the stitched video.
To achieve real time, we separate the processing into two stages. The first, called the initialization stage, only needs to be run once after the webcams are fixed. This stage includes several time-consuming procedures responsible for calculating the geometric relationship between the adjacent webcams. We first detect robust features in the initial frame of each webcam and then match them between the adjacent views. The correct matches are then employed to estimate the perspective matrix. The second stage runs in real time. In this stage, we use the matrix from the first stage to register the frames of the different webcams on the same plane and blend the overlapped region using a nonlinear weight mask. The implementation of the two stages is discussed later in detail.
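The two-stage split can be sketched as follows. This is a minimal skeleton, not the paper's implementation; the class and helper callables (estimate_homography, warp, blend) are illustrative names standing in for the procedures described above:

```python
import numpy as np

# Hypothetical two-stage pipeline skeleton; all names are illustrative.
class StitchingPipeline:
    def __init__(self):
        self.H = None  # perspective matrix, estimated once per camera setup

    def initialize(self, frame_base, frame_warp, estimate_homography):
        # One-off, time-consuming stage: feature detection, matching,
        # RANSAC filtering and least-squares refinement are hidden behind
        # the caller-supplied estimate_homography().
        self.H = estimate_homography(frame_base, frame_warp)

    def stitch(self, frame_base, frame_warp, warp, blend):
        # Real-time stage: only a fixed warp and a blend per frame pair.
        if self.H is None:
            raise RuntimeError("run initialize() first")
        return blend(frame_base, warp(frame_warp, self.H))
```

The point of the split is that everything expensive happens once in initialize(), while stitch() does only cheap per-frame work.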
422    M. Zheng, X. Chen, and L. Guo

Fig. 1. Framework of our system. The initialization stage estimates the geometric relationship between the webcams based on the initial frames. The real-time stage registers and blends the frame sequences in real time.

4 Initialization Stage

Since the location and orientation of the webcams are flexible, the geometric relationship between the adjacent views is unknown before registration. We choose one webcam as a base and use the full planar perspective motion model to register the other view on the same plane. The planar perspective transform warps one image into another using 8 parameters:

    \begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = u' \sim H u = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}    (1)

where u = (x, y, 1)^T and u' = (x', y', 1)^T are homogeneous coordinates in the two views, and \sim indicates equality up to scale, since H is itself homogeneous. The perspective transform is a superset of the translation, rigid, and similarity as well as affine transforms. We seek to compute an optimized matrix H between the views so that they can be aligned well in the same plane.

To recover the 8 parameters, we first extract keypoints in each input frame and then match them between the adjacent two. Many classic detectors such as Canny and Harris can be employed to extract interest points. However, they are not robust enough for matching in our case, which involves rotation and some perspective relation between the adjacent views. In this paper, we calculate the SIFT features, which were originally used in object recognition.
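Applying the transform of formula (1) to pixel coordinates amounts to a 3 × 3 matrix product followed by division by the third homogeneous component. A minimal numpy sketch (the function name is ours, not the paper's):

```python
import numpy as np

def apply_homography(H, pts):
    """Apply a 3x3 planar perspective matrix H (formula (1)) to an
    (N, 2) array of pixel coordinates, returning (N, 2) warped points."""
    pts = np.asarray(pts, dtype=float)
    hom = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (x, y, 1)
    warped = hom @ H.T                                  # u' ~ H u
    return warped[:, :2] / warped[:, 2:]                # divide out the scale

# A pure translation is one special case of the 8-parameter model.
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0]])
# apply_homography(H, [[0, 0]]) shifts the origin by (5, 3).
```

The division by the third component is what makes the model projective rather than affine: when h_{31} or h_{32} is nonzero, the scale varies across the image.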
Simply put, there are 4 extraction steps. In the first step, we filter the frame with a Gaussian kernel:

    L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)    (2)

where I(x, y) is the initial frame and G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}. Then we construct a DoG (Difference of Gaussians) space as follows:

    D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)    (3)

where k is the scaling factor. The extrema in the DoG space are taken as keypoints.

In the second step, we calculate the accurate localization of the keypoints through a Taylor expansion of the DoG function:

    D(v) = D + \frac{\partial D^T}{\partial v} v + \frac{1}{2} v^T \frac{\partial^2 D}{\partial v^2} v    (4)

where v = (x, y, \sigma)^T. From formula (4), we get the sub-pixel and sub-scale coordinates as follows:

    \hat{v} = -\left( \frac{\partial^2 D}{\partial v^2} \right)^{-1} \frac{\partial D}{\partial v}    (5)

A threshold on the D(\hat{v}) value is used to discard unstable points. Also, we make use of the Hessian matrix to eliminate edge responses:

    \frac{Tr(M_{Hes})^2}{Det(M_{Hes})} < \frac{(r + 1)^2}{r}    (6)

where M_{Hes} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix} is the Hessian matrix, and Tr(M_{Hes}) and Det(M_{Hes}) are its trace and determinant. r is an empirically chosen threshold, and we set r = 10 in this study.

In the third step, the gradient orientations and magnitudes of the sample pixels within a Gaussian window are used to build a histogram that assigns the keypoint an orientation. Finally, a 128-D descriptor of every keypoint is obtained by concatenating the orientation histograms over a 16 × 16 region.

By comparing the Euclidean distances of the descriptors, we get an initial set of corresponding keypoints (Fig. 2 (a)). The feature descriptors are invariant to translation and rotation as well as scaling. However, they are only partially affine-invariant, so the initial matched pairs often contain outliers in our case. We prune the outliers
by fitting the candidate correspondences into a perspective motion model based on RANSAC iteration. Specifically, we randomly choose 4 pairs of matched points in each iteration and calculate an initial projective matrix, then use the formula below to decide whether the matrix fits the other points:

    \left\| \begin{pmatrix} x'_n \\ y'_n \\ 1 \end{pmatrix} - H \begin{pmatrix} x_n \\ y_n \\ 1 \end{pmatrix} \right\| < \theta    (7)

Here H is the initial projective matrix and \theta is the outlier threshold. In order to obtain better toleration of parallax, a loose inlier threshold is used. The matrix consistent with the most initial matched pairs is considered the best initial matrix, and the pairs fitting it are considered correct matches (Fig. 2 (b)).

Fig. 2. (a) Two frames with large misregistration and the initial matched features between them. Note that there are mismatched pairs besides the correct ones. (b) Correct matches after RANSAC filtering.

After purifying the matched pairs, the ideal perspective matrix H is estimated using a least-squares method. In detail, we construct the error function below and minimize the sum of the squared distances between the coordinates of the corresponding features:

    F_{error} = \sum_{n=1}^{N} \| H u_{warp,n} - u_{base,n} \|^2 = \sum_{n=1}^{N} \| u'_{warp,n} - u_{base,n} \|^2    (8)

where u_{base,n} is the homogeneous coordinate of the nth feature in the image being projected onto, u_{warp,n} is the correspondence of u_{base,n} in the other view, and u'_{warp,n} = H u_{warp,n}.
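The RANSAC loop of formula (7) followed by a refit on the inliers can be sketched with numpy. Note two assumptions: fit_homography below is the standard DLT (direct linear transform) estimate, which minimizes an algebraic rather than the geometric error of formula (8), and all names, iteration counts, and thresholds are illustrative rather than the paper's:

```python
import numpy as np

def fit_homography(src, dst):
    """DLT: fit H mapping src -> dst from >= 4 point correspondences,
    each given as an (N, 2) array of pixel coordinates."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)      # null-space vector of the design matrix
    return H / H[2, 2]            # fix the scale so h33 = 1, as in formula (1)

def project(H, pts):
    hom = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return hom[:, :2] / hom[:, 2:]

def ransac_homography(src, dst, iters=500, theta=3.0, rng=None):
    """Formula (7): keep the 4-point matrix agreeing with the most pairs,
    then refit on all its inliers (an algebraic stand-in for formula (8))."""
    rng = rng or np.random.default_rng(0)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < theta
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

With noiseless inliers the refit recovers the generating matrix almost exactly; with real SIFT matches the threshold theta plays the role of the loose inlier threshold discussed above.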
5 Real-Time Stage

After obtaining the perspective matrix between the adjacent webcams, we project the frames of one webcam onto the other and blend them in real time. Since the webcams are placed relatively freely, they may not have a common center of projection and are thus likely to exhibit parallax. In other words, the frames of the different webcams cannot be registered exactly. Therefore, we design a nonlinear blending strategy to minimize the ghosting and blurring in the overlapped region. Essentially, this is a kind of alpha blending. The synthesized frames F_{syn} can be expressed as follows:

    F_{syn}(x, y) = \alpha(x, y) \cdot F_{base}(x, y) + (1 - \alpha(x, y)) \cdot F_{proj}(x, y)    (9)

where F_{base} are the frames of the base webcam, F_{proj} are the frames projected from the adjacent webcam, and \alpha(x, y) is the weight on pixel (x, y).

In the conventional blending method, the pixel weight is a linear function proportional to the distance to the image boundaries. This method treats the different views equally and performs well in normal cases. However, in the case of severe parallax, the linear combination results in blurring and ghosting over the whole overlapped region, as in Fig. 3(b), so we use a special \alpha function that gives priority to one view to avoid the conflict in the overlapped region of the two webcams. Simply put, we construct a nonlinear \alpha mask as below:

    \alpha(x, y) = \begin{cases} 1, & \text{if } \min(x, y, W - x, H - y) > T \\ \dfrac{\sin(\pi (\min(x, y, W - x, H - y)/T - 0.5)) + 1}{2}, & \text{otherwise} \end{cases}    (10)

where W and H are the width and height of the frame, and T is the width of the nonlinearly decreasing border. The mask is registered with the frames and clipped according to the region to be blended. The \alpha value stays constant in the central part of the base frame and begins to drop sharply as it approaches the boundaries of the other layer. The gradual change can be controlled by T.
The transition between the different frames is smoother and more natural if T is larger, but the clear central region is also smaller, and vice versa. We refer to this method as nonlinear mask blending. By this nonlinear synthesis, we keep a balance between the smooth transition of the boundaries and the uniqueness and clarity of the interiors.

Fig. 3. Comparison between linear blending and our blending strategy on typical scenes with severe parallax. (a) A typical pair of scenes with strong parallax. (b) Linear blending. (c) Our blending.
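Assuming the frames are already registered on the same plane, the mask of formula (10) and the blend of formula (9) can be sketched as below; the function names and the pixel-based distance convention are ours:

```python
import numpy as np

def alpha_mask(W, H, T):
    """Nonlinear mask of formula (10): 1 in the interior, dropping
    smoothly to 0 within a border of width T (all in pixel units)."""
    x = np.arange(W)[np.newaxis, :]
    y = np.arange(H)[:, np.newaxis]
    d = np.minimum(np.minimum(x, y), np.minimum(W - x, H - y))
    # sin ramp: 0 at d = 0, 1 at d = T, so the two cases join continuously.
    ramp = (np.sin(np.pi * (d / T - 0.5)) + 1.0) / 2.0
    return np.where(d > T, 1.0, ramp)

def blend(f_base, f_proj, alpha):
    """Formula (9): per-pixel alpha blend of the registered frames."""
    return alpha * f_base + (1.0 - alpha) * f_proj
```

Because the ramp reaches exactly 1 at d = T and exactly 0 at the boundary, the base frame keeps full weight in its interior and hands over to the projected frame only inside the border band, which is what suppresses ghosting in the central region.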
6 Results

In this section, we show the results of our method on different scenes. We built a prototype with two common webcams, as shown in Fig. 4. The webcams are placed together on a simple support, and the lenses are flexible and can be rotated and directed to different orientations freely. Each webcam has a resolution of QVGA (320 × 240 pixels) and a frame rate of 30 FPS.

Fig. 4. Two common webcams fixed on a simple support. The lenses are flexible and can be rotated and adjusted to different orientations freely.

Table 1. Processing time of the main procedures of the system

    Stage            Procedure                  Time (seconds)
    Initialization   Feature detection          0.320 ~ 0.450
                     Feature matching           0.040 ~ 0.050
                     RANSAC filtering           0.000 ~ 0.015
                     Matrix computation         0.000 ~ 0.001
    Real time        Projection and blending    0.000 ~ 0.020

The processing time of the main procedures is listed in Table 1. The system runs on a PC with an E4500 2.2 GHz CPU and 2 GB of memory. The initialization stage usually takes about 0.7 ~ 1 second, depending on the content of the scene. The projection and blending usually take less than 0.02 seconds per pair of frames and thus can run in real time. Note that whenever the webcams are moved, the initialization stage must be re-run to re-compute the geometric relationship between the webcams. Currently, this re-initialization is started by the user. After initialization, the system can process the video at a rate of 30 FPS.

In our system, the positions and directions of the webcams are adjustable as long as they have some overlapped field-of-view. Typically, the overlapped region should be 20% of the original view or more; otherwise there may not be enough robust features to match between the webcams. Fig. 5 shows the stitching results for some typical frames. In these cases, the webcams are intentionally rotated to a certain angle or even turned upside down. As can be seen in the figures, the system can still register and blend the frames into a natural whole scene.
Fig. 6 shows some typical stitched scenes from a real-time video. In (a), two static indoor views are stitched into a wide view. In (b) and (c), moving objects appear in the scene, either far away from or close to the lens. As illustrated in the figures, the stitched views are as clear and natural as the original narrow views.
(a) A pair of frames with 15° rotation and the stitching result. (b) A pair of frames with 90° rotation and the stitching result. (c) A pair of frames with 180° rotation and the stitching result.

Fig. 5. Stitching frames of some typical scenes. The webcams are intentionally rotated to a certain angle or turned upside down.

(a) A static scene. (b) A far-away object moving in the scene. (c) A close object moving in the scene.

Fig. 6. Stitching results from a real-time video. Moving objects in the stitched scene are as clear as in the original narrow view.
Although our system is flexible and robust enough under normal conditions, the quality of the mosaiced video does drop severely in two cases: first, when the scene lacks salient features, as in the case of a white wall, the geometric relationship of the webcams cannot be estimated correctly; second, when the parallax is too strong, there may be noticeable traces of stitching at the frame border. These problems can be avoided by targeting the lenses at scenes with salient features and adjusting the orientation of the webcams.

7 Conclusions and Future Work

In this paper, we have presented a technique for stitching videos from webcams. The system receives the frame sequences from common webcams and outputs a synthesized video with a wide field-of-view in real time. The positions and directions of the webcams are flexible as long as they have some overlapped field-of-view. There are two stages in the system. The initialization stage calculates the geometric relationship of frames from adjacent webcams. A nonlinear mask blending method, which avoids ghosting and blurring in the main part of the overlapped region, is proposed for synthesizing the frames in real time. As illustrated by the experimental results, this is an effective and inexpensive way to construct video with a wide field-of-view.

Currently, we have focused on using only two webcams. As a natural extension of our work, we would like to scale up to more webcams. We also plan to explore the hard and interesting issues of how to eliminate the exposure differences between webcams in real time and how to solve the problems mentioned at the end of the last section.

Acknowledgment

The financial support provided by the National Natural Science Foundation of China (Project ID: 60772032) and Microsoft (China) Co., Ltd. is gratefully acknowledged.

References

 1. Szeliski, R., Shum, H.Y.: Creating Full View Panoramic Mosaics and Environment Maps. In: Proc. of SIGGRAPH 1997, Computer Graphics Proceedings,
    Annual Conference Series, pp. 251–258 (1997)
 2. Agarwala, A., Agrawala, M., Chen, M., Salesin, D., Szeliski, R.: Photographing Long Scenes with Multi-Viewpoint Panoramas. In: Proc. of SIGGRAPH 2006, pp. 853–861 (2006)
 3. Zheng, J.Y.: Digital Route Panoramas. IEEE MultiMedia 10(3), 57–67 (2003)
 4. Hsu, C.-T., Cheng, T.-H., Beukers, R.A., Horng, J.-K.: Feature-based Video Mosaic. In: Proc. of the International Conference on Image Processing, pp. 887–890 (2000)
 5. Kang, S.B., Szeliski, R., Uyttendaele, M.: Seamless Stitching Using Multi-Perspective Plane Sweep. Microsoft Research, Tech. Rep. MSR-TR-2004-48 (2004)
 6. Zelnik-Manor, L., Peters, G., Perona, P.: Squaring the Circle in Panoramas. In: Proc. 10th IEEE International Conference on Computer Vision (ICCV 2005), pp. 1292–1299 (2005)
 7. Majumder, A., Gopi, M., Seales, W.B., Fuchs, H.: Immersive Teleconferencing: A New Algorithm to Generate Seamless Panoramic Imagery. In: Proc. of ACM Multimedia, pp. 169–178 (1999)
 8. Szeliski, R.: Video Mosaics for Virtual Environments. IEEE Computer Graphics and Applications, 22–23 (1996)
 9. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–698 (1986)
10. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: Proc. of the 4th Alvey Vision Conference, pp. 147–151 (1988)
11. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
12. Winder, S., Brown, M.: Learning Local Image Descriptors. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), pp. 1–8 (2007)
13. Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice-Hall, Englewood Cliffs (2003)