HoloSwap: Object Removal and Replacement using Microsoft Hololens
Yun Suk Chang∗
Deepashree Gurumurthy†
Hannah Wolfe‡
ABSTRACT
Object removal and replacement in augmented reality has applica-
tions in interior design, marker hiding and many other fields. This
project explores the viability of object removal and replacement
for the Microsoft HoloLens. We implemented gesture-based object selection, video inpainting, tracking, and three different ways to display our results: (1) the display plane and a white Stanford bunny [24] placed at the inpainted depth, (2) a solid display plane with video inpainting covering the field of view, and (3) a vignetted view of the video inpainting. We found that placing the display plane and a white Stanford bunny at the inpainted depth produced the most reasonable results. Due to the semi-transparency of the HoloLens display, replacing the object with a white mesh and placing it against a light background is important. We also tested three different inpainting masks for the updating frame. Our results show that the HoloLens can be a reasonable solution for object replacement in augmented reality in certain cases.
1 INTRODUCTION
With increasing research activity and popularity, the number of augmented reality (AR) applications has grown in many different areas. One popular AR application is object removal and replacement. Such applications range from interior design [23] to replacing a tracking marker in marker-based AR [12], using diminished reality techniques. There has been much work on tablet- and other mobile-device-based diminished reality applications [8, 12, 14]. However, little is known about diminished reality on see-through, head-mounted displays.
There are many advantages to bringing diminished reality applications to see-through, head-mounted display devices. First, the user is able to see the replaced or removed object with the correct projection from their own view rather than from the camera's projection. Second, the user gets a better experience since the actual object is hidden from their view at all times while they are wearing the display, preventing breaks in immersion. Lastly, in see-through displays, only the diminished portion of the real-world view is affected by display and camera distortions, unlike a camera view, where the whole view is affected.
In this work, we perform diminished reality on the Microsoft HoloLens using an inpainting technique to test the viability of object replacement on a see-through, head-mounted display. We test this viability in three different ways: (1) placing an inpainted display frame and a replacement object on the object selected for replacement to test object replacement, (2) running a video-based inpainting algorithm in view to test object removal under changing viewpoints, and (3) blending an inpainted display frame into the scene to test how small a seam can be achieved.
Through our evaluation, we discovered the limitations of diminished reality on see-through, head-mounted displays, including problems of eye separation and dominance, the limited field of view, and the inability to completely block out the real-world scene.
∗e-mail: ychang@cs.ucsb.edu
†e-mail: deepashree@umail.ucsb.edu
‡e-mail: wolfe@cs.ucsb.edu
However, our results show that in certain environments, a see-through display device can achieve reasonable object replacement and removal.
2 RELATED WORK
2.1 Object Selection in Augmented Reality
There has been much work on selecting objects in 3D space using augmented reality. For example, Olwal et al. [19] used a tracked glove to point and cast a primitive object into the scene to select an area of interest, which is then used for object selection by statistical analysis. Oda et al. [18], on the other hand, use a tracked Wii remote to select the object by moving an adjustable sphere onto the object of interest. More recently, Miksik et al. [15] presented the Semantic Paintbrush for selecting real-world objects via a ray-casting technique and semantic segmentation. Our drawing method can be considered a pointing or ray-casting technique, almost like using a precise can of spray paint [3, 15, 17].
2.2 Inpainting in Augmented Reality
Enomoto et al. [8] implemented inpainting of objects using multiple cameras. Later, predetermined object removal for augmented reality was implemented by Herling et al. [10]. Researchers have also used inpainting to cover markers in augmented reality applications [12, 14]. In those papers, they do not remove objects, but merely cover predetermined 2D markers with textures. We could not find any examples of video inpainting in an augmented reality head-mounted display.
2.3 Origins of Inpainting
Inpainting was first done digitally in 1995 to fix scratches in film [13]. Harrison proposed a non-hierarchical procedure for re-synthesis of complex textures [9]: an output image was generated by adding pixels selected from the input image. This method produced good results but had long run times.
Other researchers tried to accelerate Harrison's algorithm, and while their results were potentially faster, they caused more artifacts [7, 20]. Many of these papers tried to combine texture synthesis, which replicates textures ad infinitum over large regions using exemplar-based techniques, with inpainting, which fills small gaps in images by propagating linear structures via diffusion. "Fragment-based image completion" would leave blurry patches [7]. "Structure and texture filling-in of missing image blocks in wireless transmission and compression applications" would inpaint some sections and use texture synthesis in others, causing inconsistencies [20]. Criminisi et al. were the first group to combine texture synthesis and inpainting effectively and efficiently, by inpainting based on a priority function instead of raster-scan order [5].
The PatchMatch algorithm proposed by Barnes et al. is the basis for many modern inpainting algorithms. Their algorithm finds similar patches and copies data from surrounding patches to inpaint areas [1].
2.4 State of the Art Inpainting
Inpainting has been performed on video sequences in recent years. "Video Inpainting of Complex Scenes" [16] produces very good inpainting results; the algorithm stores information from previous frames and can thereby also recreate missing text. "Real time video inpainting using PixMix" [11] also accounted for illumination changes between frames while inpainting.
"Exemplar-Based Image Inpainting Using a Modified Priority Definition" [6] produced high-quality output images by propagating patches based on a modified priority definition. Lately, novel inpainting algorithms have been included in a pipeline for modular real-time diminished reality for indoor applications [22]. However, this work only proposed a method that could be used on augmented reality devices and did not use such a device in its implementation.
3 APPROACH
The pipeline for an interactive application to select, remove, and replace objects in augmented reality has many parts. The desired object to be removed or replaced is first selected through interaction with the HoloLens. For object removal, we use an inpainting method based on PatchMatch [1]. The first frame is captured through the HoloLens, and the output for that frame is an inpainted area at the location of the selected object. We used different approaches to inpaint the current frame: the first method used information from the entire frame, the second used only information surrounding the selected object, and the last used information from the previously inpainted frame plus a small area around the selected object in the current frame. In order to achieve effective inpainting of consecutive frames at the object's location in world space, we incorporated object tracking into our algorithm. This helps inpaint the current location of the object in every frame, thereby accounting for slight changes in the camera position. Object replacement is implemented by placing a plane in the scene and placing a 3D mesh (the Stanford Bunny) over the location of the selected object. We also display the full screen capture in the field of view, showing the resulting inpainted sequence of frames.
Figure 1: (a) Shows what the user would see while selecting an area. (b) Shows a selected object.
3.1 Object Selection
For object selection, we let the user draw a 3D contour around the object through the HoloLens. For the drawing method, we designed the Surface-Drawing method, which lets the user draw on the detected real-world surface in the following way: for the drawing gesture, we use a pinch-and-drag gesture to start drawing on the detected surface and a release-pinch gesture to finish it. To reduce noise in the gesture input, we sample the user's drawing positions at 30 Hz and the finished annotation's path points at one point per millimeter. For drawing on the surface, we define the drawing position as the intersection between the detected surface mesh data and a ray cast from the user's head through the fingertip position. Consequently, as the user is drawing annotations, the user can easily verge on the object of interest since the annotation is displayed at the detected surface. The drawing method was found to be effective in pilot studies.
The completed contour drawing is then transformed into the 2D pixel-space coordinate system so that it can be used by the inpainting algorithm.
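As a minimal sketch of this transformation (not the exact code used in our system), the drawn 3D contour points could be projected into the webcam's 2D pixel space with OpenCV, assuming the webcam pose (rvec, tvec) and intrinsics (cameraMatrix, distCoeffs) are available at capture time; the function and variable names below are illustrative.

#include <opencv2/opencv.hpp>
#include <vector>

// Project the 3D contour drawn on the detected surface into 2D pixel space.
std::vector<cv::Point2f> contourToPixelSpace(
        const std::vector<cv::Point3f>& contourWorld,   // drawn 3D contour points
        const cv::Mat& rvec, const cv::Mat& tvec,       // webcam pose at capture time
        const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs) {
    std::vector<cv::Point2f> contourPixels;
    cv::projectPoints(contourWorld, rvec, tvec, cameraMatrix, distCoeffs, contourPixels);
    return contourPixels;   // later rasterized into the inpainting mask
}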
3.2 Tracking
Once the object was selected, we needed to track it between frames. We used the OpenCV [4] implementations of Shi and Tomasi's "Good Features to Track" [21] and Bouguet's "Pyramidal implementation of the affine Lucas Kanade feature tracker" [2].
On the original frame we ran cv::goodFeaturesToTrack to find 1000 features and then culled them to only those in the selected area. In subsequent frames we ran cv::calcOpticalFlowPyrLK on the features to see which ones were still present. We then created a velocity vector from the centroids of the features still present in both the original frame and the update frame, and applied that velocity vector to the original selection area to find the new selection area to inpaint.
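A minimal sketch of this tracking step is shown below, assuming grayscale frames and an 8-bit mask of the selection; the parameter values and the function itself are illustrative rather than our exact implementation.

#include <opencv2/opencv.hpp>
#include <vector>

// Detect features inside the selection on the first call, track them with
// pyramidal Lucas-Kanade, and return the average displacement (the velocity
// vector applied to the selection area).
cv::Point2f selectionShift(const cv::Mat& prevGray, const cv::Mat& currGray,
                           const cv::Mat& selectionMask,          // 8-bit mask of the drawn contour
                           std::vector<cv::Point2f>& prevPts) {
    if (prevPts.empty())   // first call: detect up to 1000 features inside the selection
        cv::goodFeaturesToTrack(prevGray, prevPts, 1000, 0.01, 5, selectionMask);

    std::vector<cv::Point2f> currPts;
    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, prevPts, currPts, status, err);

    // Average displacement of the features still present in both frames
    // (equivalent to the difference of their centroids).
    cv::Point2f shift(0.f, 0.f);
    int kept = 0;
    for (size_t i = 0; i < currPts.size(); ++i) {
        if (!status[i]) continue;
        shift += currPts[i] - prevPts[i];
        ++kept;
    }
    if (kept > 0) shift *= 1.0f / kept;

    prevPts = currPts;   // reuse the tracked points for the next frame
    return shift;        // applied to the selection area before inpainting
}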
Figure 2: PatchMatch algorithm [CC BY-SA 3.0, Alexandre Delesse]
3.3 Inpainting
Our approach uses the PatchMatch algorithm for inpainting. PatchMatch finds nearest-neighbor matches between image patches: good matches are found via random sampling, and the natural coherence of imagery allows them to propagate quickly across patches.
In the first step of the PatchMatch algorithm, the nearest-neighbor field is initialized by filling it with random offset values or with information available from an earlier pass. An iterative process is then applied to the nearest-neighbor field. During this process, offsets are examined in scan order and good patches are propagated to adjacent pixels: if a coherent mapping was good earlier, the mapping is filled into the adjacent pixels of the same coherent region. A random search is then carried out in the neighborhood of the best offset found so far. The halting criterion for the iterations depends on the convergence of the nearest-neighbor field, which Barnes et al. [1] found to occur after about 4-5 iterations.
Figure 2 illustrates the basis of PatchMatch. The grey region in the ellipse is the area of the image that needs to be inpainted. The entire image is scanned for the best matches for the patches surrounding the selected region, and these patches are propagated accordingly.
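The following is a minimal, single-scale sketch of the nearest-neighbor field search at the core of PatchMatch (random initialization, propagation, and random search), written against OpenCV types. It omits the hole mask, the reverse scan order on alternate iterations, coarse-to-fine scales, and the final patch copying/voting used for actual inpainting; the names and parameters are illustrative, not our production code.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <limits>
#include <random>

static const int P = 7;   // patch size (P x P)

// Sum of squared differences between a P x P patch in a and one in b.
static double patchSSD(const cv::Mat& a, cv::Point pa, const cv::Mat& b, cv::Point pb) {
    double d = 0.0;
    for (int dy = 0; dy < P; ++dy)
        for (int dx = 0; dx < P; ++dx) {
            double diff = (double)a.at<uchar>(pa.y + dy, pa.x + dx)
                        - (double)b.at<uchar>(pb.y + dy, pb.x + dx);
            d += diff * diff;
        }
    return d;
}

// For every patch in src, find an offset into dst pointing at a similar patch.
cv::Mat patchMatchNNF(const cv::Mat& src, const cv::Mat& dst, int iters = 5) {
    std::mt19937 rng(0);
    int H = src.rows - P, W = src.cols - P;
    cv::Mat nnf(H, W, CV_32SC2);   // per-pixel (dx, dy) offset into dst

    auto cost = [&](int x, int y, cv::Vec2i off) {
        int tx = x + off[0], ty = y + off[1];
        if (tx < 0 || ty < 0 || tx > dst.cols - P - 1 || ty > dst.rows - P - 1)
            return std::numeric_limits<double>::max();
        return patchSSD(src, cv::Point(x, y), dst, cv::Point(tx, ty));
    };

    // 1. Random initialization of the nearest-neighbor field.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int tx = (int)(rng() % (dst.cols - P));
            int ty = (int)(rng() % (dst.rows - P));
            nnf.at<cv::Vec2i>(y, x) = cv::Vec2i(tx - x, ty - y);
        }

    for (int it = 0; it < iters; ++it) {
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                cv::Vec2i best = nnf.at<cv::Vec2i>(y, x);
                double bestCost = cost(x, y, best);

                // 2. Propagation: adopt a neighbor's offset if it is better.
                if (x > 0) {
                    cv::Vec2i cand = nnf.at<cv::Vec2i>(y, x - 1);
                    double c = cost(x, y, cand);
                    if (c < bestCost) { best = cand; bestCost = c; }
                }
                if (y > 0) {
                    cv::Vec2i cand = nnf.at<cv::Vec2i>(y - 1, x);
                    double c = cost(x, y, cand);
                    if (c < bestCost) { best = cand; bestCost = c; }
                }

                // 3. Random search in an exponentially shrinking window around the best offset.
                for (int r = std::max(dst.cols, dst.rows); r >= 1; r /= 2) {
                    cv::Vec2i cand(best[0] + (int)(rng() % (2 * r)) - r,
                                   best[1] + (int)(rng() % (2 * r)) - r);
                    double c = cost(x, y, cand);
                    if (c < bestCost) { best = cand; bestCost = c; }
                }
                nnf.at<cv::Vec2i>(y, x) = best;
            }
    }
    return nnf;   // offsets are then used to copy source patches into the hole
}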
3.3.1 Algorithm for Video Sequence
In our implementation for HoloSwap, this inpainting algorithm is used to "remove" the selected objects in pixel-space. We first select the region of interest, or object, in the visible scene through interaction with the HoloLens, as in Figure 3a. The selected region of interest is then used to create a mask for inpainting. The initial mask is white in the region of interest and black everywhere else, as shown in Figure 3b.

(a) Cow Selected (b) Full Inpainting Mask (c) Thin Inpainting Mask (d) Thick Inpainting Mask
Figure 3: Example masks for inpainting a selection.

The first frame of the video sequence is inpainted in pixel-space using the initial mask, and patterns from image patches of the entire frame are used to fill in the region of interest. For the consecutive frames, the mask is updated to account for slight movements in the position of the web camera on the HoloLens. The initial mask is eroded using the OpenCV [4] implementation of erosion, and the eroded mask is subtracted from the previous mask to give a ring of selection area. This ring-shaped mask, as in Figure 3c, is the new update mask for the current frame. The update mask is also shifted along with the object's location by tracking the selected object's features, so that the mask remains in the right location relative to the selected object. This process is repeated for every frame by updating the mask from the previous frame, thereby achieving object removal by inpainting at the location of the object in pixel-space.
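A minimal sketch of this mask handling with OpenCV is given below, assuming the selection contour is already in pixel coordinates; the iteration count, kernel, and function names are illustrative and do not reproduce our exact implementation.

#include <opencv2/opencv.hpp>
#include <vector>

// Initial mask: white inside the selected contour, black everywhere else (Figure 3b).
cv::Mat initialMask(cv::Size frameSize, const std::vector<cv::Point>& contourPx) {
    cv::Mat mask = cv::Mat::zeros(frameSize, CV_8UC1);
    std::vector<std::vector<cv::Point>> polys{contourPx};
    cv::fillPoly(mask, polys, cv::Scalar(255));
    return mask;
}

// Ring-shaped update mask: shift the mask with the tracked selection, erode it,
// and subtract the eroded interior, leaving a ring along the edge (Figure 3c/3d).
cv::Mat ringUpdateMask(const cv::Mat& mask, int erosionIters, cv::Point2f shift) {
    cv::Mat shifted;
    cv::Mat M = (cv::Mat_<double>(2, 3) << 1, 0, shift.x, 0, 1, shift.y);
    cv::warpAffine(mask, shifted, M, mask.size());

    cv::Mat eroded;
    cv::erode(shifted, eroded, cv::Mat(), cv::Point(-1, -1), erosionIters);
    return shifted - eroded;
}

A larger erosionIters value leaves a thicker ring, which corresponds to the thick mask discussed in the next section.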
3.3.2 Inpainting and Mask Updating Methods
Three methods of inpainting were incorporated in an attempt to improve efficiency. The first method involved inpainting by selecting matching texture patches from the entire frame; the PatchMatch algorithm scans the entire frame for suitable matches. In the second method, we use only the area around the selected object for inpainting in order to improve computation time. To further improve performance, we copied the inpainted portion of the first frame into the area enclosed by the initial mask to maintain consistency between frames and updated the inpainting using the area around the selected object. In the third method, we used the ring-shaped mask described in Section 3.3.1 to further reduce computation time. The size of the ring was varied by altering the number of iterations in OpenCV's erosion function to examine how it affected inpainting accuracy.
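As a sketch of the third method, the snippet below reuses the previous frame's inpainted pixels inside the initial mask and re-inpaints only the ring. OpenCV's cv::inpaint (Telea) stands in here for the PatchMatch-based routine we actually use, and the names are illustrative.

#include <opencv2/opencv.hpp>
#include <opencv2/photo.hpp>

cv::Mat updateFrame(const cv::Mat& currFrame, const cv::Mat& prevInpainted,
                    const cv::Mat& initialMask, const cv::Mat& ringMask) {
    cv::Mat frame = currFrame.clone();
    // Reuse the previous frame's inpainted pixels for temporal consistency.
    prevInpainted.copyTo(frame, initialMask);
    // Re-inpaint only the ring along the selection edge (thin or thick mask,
    // depending on the erosion iteration count).
    cv::Mat result;
    cv::inpaint(frame, ringMask, result, 3 /*radius*/, cv::INPAINT_TELEA);
    return result;
}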
3.4 Display
We tested two forms of displaying the results: video inpainting rendered in the field of view, and object replacement in a still frame. For both approaches, the display image is placed on a plane (hereafter referred to as the "display plane") in the scene. To display the image correctly, we get the world-to-camera matrix and projection matrix from the HoloLens webcam when the image is taken. Then, the matrices are used in the shader to calculate the UV texture space for the plane in the following way: First, the plane's corner vertex positions are calculated in world space. Second, the world-space vertex positions are transformed so that they are relative to the physical web camera on the HoloLens. Third, the camera-relative vertices are converted into normalized clipping space. Lastly, the x and y components of the vertices are used to define the UV space of the texture, which is used to correctly apply the image texture to the plane.
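A minimal sketch of this per-vertex UV calculation is shown below, written in C++ with OpenCV matrix types rather than the actual Unity shader; the matrices are the per-capture world-to-camera and projection matrices mentioned above, and the mapping from normalized device coordinates to [0, 1] UV space (including any axis flip) is an illustrative convention rather than the exact one we use.

#include <opencv2/opencv.hpp>

// Map one corner vertex of the display plane to a UV coordinate.
cv::Point2f planeVertexToUV(const cv::Matx44f& worldToCamera,
                            const cv::Matx44f& projection,
                            const cv::Vec4f& vertexWorld) {      // homogeneous world position
    // Steps 1-2: transform the world-space vertex so it is relative to the webcam.
    cv::Vec4f cam = worldToCamera * vertexWorld;
    // Step 3: project into normalized clipping space (perspective divide).
    cv::Vec4f clip = projection * cam;
    float ndcX = clip[0] / clip[3];
    float ndcY = clip[1] / clip[3];
    // Step 4: use the x and y components to define UV space, mapping [-1, 1] to [0, 1].
    return cv::Point2f(ndcX * 0.5f + 0.5f, ndcY * 0.5f + 0.5f);
}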
3.4.1 Video Field of View
For the video field of view, the display plane was placed slightly in front of the user's view. This was done by positioning the display plane and rotating it based on the camera-to-world matrix. Once the inpainted display plane is placed, we repeat the procedure: receiving the next frame, calling the inpainting update function, and displaying the result. We implemented this both as a solid display plane and with the edges of the display plane vignetted to transparency.
3.4.2 Replacement in Still Frame
Object replacement in a still frame is done in four steps. First, we retrieve the selected object's 2D center position in pixel-space and the inpainted frame image from the inpainting algorithm. Second, we unproject a ray from the 2D center position onto the 3D real-world surface mesh to find an intersection point where we can place the display plane. Third, if an intersection is found, we build the display plane with the inpainted image using the steps described previously and place it at the intersection. Lastly, we place the replacement object at the same intersection point.
To prevent the display plane from occluding the replacement object, we render both objects separately and render the replacement object last, so that it appears on top of the display plane.
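A minimal sketch of the unprojection step, again in C++ with OpenCV types rather than the actual Unity code: it assumes the inverses of the same per-capture matrices used for the display plane, the y-axis flip and near-plane depth are illustrative conventions, and the intersection test against the scanned surface mesh is omitted.

#include <opencv2/opencv.hpp>

struct Ray { cv::Vec3f origin; cv::Vec3f direction; };

// Unproject the selection's 2D center pixel into a world-space ray.
Ray pixelToWorldRay(const cv::Point2f& pixel, const cv::Size& imageSize,
                    const cv::Matx44f& projectionInv,    // inverse projection matrix
                    const cv::Matx44f& cameraToWorld) {  // inverse of world-to-camera
    // Pixel coordinates -> normalized device coordinates in [-1, 1].
    float ndcX = 2.0f * pixel.x / imageSize.width - 1.0f;
    float ndcY = 1.0f - 2.0f * pixel.y / imageSize.height;   // flip y (image rows grow downward)

    // Unproject a point on the near plane into camera space, then into world space.
    cv::Vec4f cam = projectionInv * cv::Vec4f(ndcX, ndcY, -1.0f, 1.0f);
    float invW = 1.0f / cam[3];
    cv::Vec4f camPoint(cam[0] * invW, cam[1] * invW, cam[2] * invW, 1.0f);
    cv::Vec4f worldPoint = cameraToWorld * camPoint;

    // The ray starts at the camera position and passes through the unprojected point.
    cv::Vec4f camOrigin = cameraToWorld * cv::Vec4f(0.0f, 0.0f, 0.0f, 1.0f);
    cv::Vec3f dir(worldPoint[0] - camOrigin[0],
                  worldPoint[1] - camOrigin[1],
                  worldPoint[2] - camOrigin[2]);
    return Ray{ cv::Vec3f(camOrigin[0], camOrigin[1], camOrigin[2]), cv::normalize(dir) };
}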
3.5 System Design
We used Microsoft Visual Studio and Unity to create this project. The application was written in two parts: a dynamic-link library (DLL) for inpainting, and selection and display on the HoloLens. The inpainting DLL required OpenCV, so we had to build and add OpenCV's pre-alpha DLLs, which were incomplete. We also had a series of C# scripts that managed data transfer with the inpainting DLL, taking screenshots, placing display planes correctly, and capturing gestures.
4 RESULTS
We tested three different ways to display the inpainting: (1) the display plane and a bunny placed at the inpainted depth, (2) a solid display plane covering the field of view, and (3) a vignetted display plane covering the field of view. Examples of these displays are shown in Figure 4. In all three cases, having the frames update only every 4 seconds was not ideal and led to discontinuity when the user moved too much between frames, causing ghosting until the next frame updated.
We found that object replacement, displaying a plane and a white bunny at the inpainted depth, was the best way to obscure the inpainted object. When this was done, the majority of the inpainted area was covered by the bunny. We also chose white for the bunny because it appeared the least transparent. We further found that inpainting worked better on a light or white background, though our example images use a black background.
We tested both a solid and a vignetted display plane covering the user's field of view and found that both display settings had drawbacks. The solid display plane was better at inpainting the object, because no portion of the plane was transparent. The issue was that the user could easily see the display plane's edges, which was potentially distracting. The vignetted display plane's edges melted into the background, but the inpainting was semi-transparent and therefore did not cover the object as well.

Figure 4: The original scene and example views of the three ways we tested object removal and replacement. (a) Original scene. (b) Vignetted inpainted field-of-view image overlaid. (c) Simulated image of the inpainted scene with the Stanford Bunny [24] replacing the phaser. (d) Inpainted field-of-view image overlaid.

Table 1: Inpainting runtime. Different masks drawing patches from different-sized areas were tested on a frame. The selected area to inpaint was an ellipse with width and height 1/16th of the frame (1280 x 720).

Inpainting Mask   Area Inpainted From   Seconds
Full mask         Full frame            10+
Full mask         1/4 frame             4.85
Thin mask         1/4 frame             4.05
Thick mask        1/4 frame             4.2
5 ASSESSMENT
Assessment of HoloSwap was based on the accuracy and efficiency of inpainting and object replacement. Our main goal was to assess the viability of real-time object removal and replacement on the Microsoft HoloLens, and hence runtime was an important factor in assessing our results.
5.1 Object Removal in Video Sequence
For the object removal test cases, we used an elliptical contour with a size of about 1/16th of the frame. The selected object was centered in the scene in these cases for consistency of evaluation. Table 1 shows the effective runtime for each of the methods used. We first used the initial mask, as in Figure 3b, for inpainting, with the patches extracted from the entire 1280x720 frame. For this method, we found that obtaining the inpainted result took about 10 seconds on average. In order to reduce the runtime, we reduced the area of the frame from which patches were extracted for inpainting to about a quarter of the frame, which gave us a runtime of 4.85 seconds on average.
To further improve the runtime, we retained the inpainting information from the first frame and reduced the mask area such that only the edges of the selected object's region were inpainted, to account for slight movements. We used 10 iterations of OpenCV's erosion function for the mask, which gave us a runtime of 4.05 seconds but significantly deteriorated the quality of inpainting. To improve the quality, we used a thicker mask by adjusting the number of erosion iterations to 20, which had a runtime of 4.2 seconds. Though this method was slightly slower than the one with the thin mask, we retained it as our final method due to its higher inpainting quality. There was a trade-off between inpainting quality and runtime, and we chose the method that gave the best results among those explored.
5.2 Object Replacement
We used the Stanford Bunny mesh to replace the selected object. Object replacement was performed by inpainting the selected object in pixel-space and then placing the bunny at the location of the previously selected object. In most of the test results, the bunny mesh appeared to be placed at the right location and occluded the previously selected object completely. The runtime for object replacement was similar to that of inpainting the selected object using the full mask and information from the full frame. The results were also significantly better for selected objects that had sharp contrast with their background. Object replacement seemed to produce better results on the augmented reality device than removal of the selected object alone.
6 DISCUSSION
There are some limitations of the HoloLens that affect the user experience when using HoloSwap. First, the user is still able to see the real object after inpainting, since the HoloLens uses a see-through display. However, when HoloSwap is used in an environment where the selected object is surrounded by white background objects and the HoloLens display brightness is at 100%, the real object becomes hard to see due to the display brightness.
Second, every user will have a different experience because of differences in eye dominance. For many users, the display plane will look slightly shifted, since the HoloLens display does not account for eye dominance.
Third, there is a limit to how much we can inpaint and display at once. Because the HoloLens webcam has a smaller field of view than the human eye, we are unable to inpaint the full field of view that a human can see. In addition, since the HoloLens has only a small field-of-view display, it cannot show all of the content generated by HoloSwap. Consequently, if the selected object is large or close to the user, the user cannot completely swap the selected object with the replacement object.
7 CONCLUSION
We tested the HoloLens's viability for real-time inpainting and object removal and replacement. We achieved near-real-time inpainting on the HoloLens and believe that, with a more efficient algorithm, real-time results are possible. We found that using a thick, ring-shaped mask was the best way to implement video inpainting with consistency between frames. Among our three test cases (object replacement, object inpainting with a solid plane, and object inpainting with a vignetted plane), we found that object replacement had the most believable results. The HoloLens has certain disadvantages, such as its limited field of view, eye separation, and see-through display. However, object replacement on see-through, head-mounted displays could be reasonable under certain circumstances, such as with a white or light background and white or light replacement objects.
ACKNOWLEDGEMENTS
The authors wish to thank Matthew Turk and Tobias Höllerer for their support. We would also like to thank Adam and Brandon for letting us use the HoloLens in time of need.
REFERENCES
[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. Patch-
match: a randomized correspondence algorithm for structural image
editing. ACM Transactions on Graphics-TOG, 28(3):24, 2009.
[2] J.-Y. Bouguet. Pyramidal implementation of the affine lucas kanade
feature tracker description of the algorithm. Intel Corporation, 5(1-
10):4, 2001.
[3] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev. 3D User In-
terfaces: Theory and Practice. Addison Wesley Longman Publishing
Co., Inc., Redwood City, CA, USA, 2004.
[4] G. Bradski. OpenCV library. Dr. Dobb’s Journal of Software Tools, 2000.
[5] A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-
based inpainting. In Computer Vision and Pattern Recognition, 2003.
Proceedings. 2003 IEEE Computer Society Conference on, volume 2,
pages II–721. IEEE, 2003.
[6] L.-J. Deng, T.-Z. Huang, and X.-L. Zhao. Exemplar-based im-
age inpainting using a modified priority definition. PloS one,
10(10):e0141199, 2015.
[7] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image com-
pletion. In ACM Transactions on graphics (TOG), volume 22:3, pages
303–312. ACM, 2003.
[8] A. Enomoto and H. Saito. Diminished reality using multiple handheld
cameras. In Proc. ACCV, volume 7, pages 130–135. Citeseer, 2007.
[9] P. Harrison. A non-hierarchical procedure for re-synthesis of complex
textures. University of West Bohemia, 2001.
[10] J. Herling and W. Broll. Advanced self-contained object removal for
realizing real-time diminished reality in unconstrained environments.
In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE Interna-
tional Symposium on, pages 207–212. IEEE, 2010.
[11] J. Herling and W. Broll. High-quality real-time video inpainting with
pixmix. IEEE transactions on visualization and computer graphics,
20(6):866–879, 2014.
[12] N. Kawai, M. Yamasaki, T. Sato, and N. Yokoya. Ar marker hid-
ing based on image inpainting and reflection of illumination changes.
In Mixed and Augmented Reality (ISMAR), 2012 IEEE International
Symposium on, pages 293–294. IEEE, 2012.
[13] A. C. Kokaram, R. D. Morris, W. J. Fitzgerald, and P. J. Rayner. De-
tection of missing data in image sequences. IEEE Transactions on
Image Processing, 4(11):1496–1508, 1995.
[14] O. Korkalo, M. Aittala, and S. Siltanen. Light-weight marker hiding
for augmented reality. In Mixed and Augmented Reality (ISMAR),
2010 9th IEEE International Symposium on, pages 247–248. IEEE,
2010.
[15] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner,
S. Golodetz, S. L. Hicks, P. Perez, S. Izadi, and P. H. S. Torr. The
semantic paintbrush: Interactive 3d mapping and recognition in large
outdoor spaces. Proceedings of the 33nd annual ACM conference on
Human factors in computing systems (CHI), 2015.
[16] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993–2019, 2014.
[17] B. Nuernberger, K.-C. Lien, T. Höllerer, and M. Turk. Interpreting 2d gesture annotations in 3d augmented reality. In 2016 IEEE Symposium on 3D User Interfaces (3DUI), pages 149–158, March 2016.
[18] O. Oda and S. Feiner. 3d referencing techniques for physical objects in
shared augmented reality. In 2012 IEEE International Symposium on
Mixed and Augmented Reality (ISMAR), pages 207–215, Nov 2012.
[19] A. Olwal, H. Benko, and S. Feiner. Senseshapes: Using statistical ge-
ometry for object selection in a multimodal augmented reality system.
In Proceedings of the 2Nd IEEE/ACM International Symposium on
Mixed and Augmented Reality, ISMAR ’03, pages 300–, Washington,
DC, USA, 2003. IEEE Computer Society.
[20] S. D. Rane, G. Sapiro, and M. Bertalmio. Structure and texture filling-
in of missing image blocks in wireless transmission and compression
applications. IEEE Transactions on image processing, 12(3):296–303,
2003.
[21] J. Shi and C. Tomasi. Good features to track. In Computer Vision
and Pattern Recognition, 1994. Proceedings CVPR’94., 1994 IEEE
Computer Society Conference on, pages 593–600. IEEE, 1994.
[22] S. Siltanen. Diminished reality for augmented reality interior design.
The Visual Computer, pages 1–16, 2015.
[23] S. Siltanen, H. Sarasp, and J. Karvonen. [demo] a complete interior
design solution with diminished reality. In 2014 IEEE International
Symposium on Mixed and Augmented Reality (ISMAR), pages 371–
372, Sept 2014.
[24] G. Turk and M. Levoy. The Stanford Bunny, 1993.
More Related Content

What's hot

A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...
 A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv... A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...
A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...Chennai Networks
 
Montage4D: Interactive Seamless Fusion of Multiview Video Textures
Montage4D: Interactive Seamless Fusion of Multiview Video TexturesMontage4D: Interactive Seamless Fusion of Multiview Video Textures
Montage4D: Interactive Seamless Fusion of Multiview Video TexturesRuofei Du
 
Simulation of collision avoidance by navigation
Simulation of collision avoidance by navigationSimulation of collision avoidance by navigation
Simulation of collision avoidance by navigationeSAT Publishing House
 
Overview Of Video Object Tracking System
Overview Of Video Object Tracking SystemOverview Of Video Object Tracking System
Overview Of Video Object Tracking SystemEditor IJMTER
 
Soft Shadow Rendering based on Real Light Source Estimation in Augmented Reality
Soft Shadow Rendering based on Real Light Source Estimation in Augmented RealitySoft Shadow Rendering based on Real Light Source Estimation in Augmented Reality
Soft Shadow Rendering based on Real Light Source Estimation in Augmented RealityWaqas Tariq
 
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Tomohiro Fukuda
 
Detection and Tracking of Moving Object: A Survey
Detection and Tracking of Moving Object: A SurveyDetection and Tracking of Moving Object: A Survey
Detection and Tracking of Moving Object: A SurveyIJERA Editor
 
Mixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesMixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesAlejandro Franceschi
 
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...Kan Ouivirach, Ph.D.
 
Object detection technique using bounding box algorithm for
Object detection technique using bounding box algorithm forObject detection technique using bounding box algorithm for
Object detection technique using bounding box algorithm forVESIT,Chembur,Mumbai
 
Object tracking
Object trackingObject tracking
Object trackingchirase44
 
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Tomohiro Fukuda
 
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...ijcsa
 
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...CSCJournals
 
Copy of 3 d report
Copy of 3 d reportCopy of 3 d report
Copy of 3 d reportVirajjha
 
Multiple Object Tracking
Multiple Object TrackingMultiple Object Tracking
Multiple Object TrackingRainakSharma
 

What's hot (19)

Object tracking
Object trackingObject tracking
Object tracking
 
A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...
 A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv... A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...
A Genetic Algorithm-Based Moving Object Detection For Real-Time Traffic Surv...
 
Montage4D: Interactive Seamless Fusion of Multiview Video Textures
Montage4D: Interactive Seamless Fusion of Multiview Video TexturesMontage4D: Interactive Seamless Fusion of Multiview Video Textures
Montage4D: Interactive Seamless Fusion of Multiview Video Textures
 
Simulation of collision avoidance by navigation
Simulation of collision avoidance by navigationSimulation of collision avoidance by navigation
Simulation of collision avoidance by navigation
 
Overview Of Video Object Tracking System
Overview Of Video Object Tracking SystemOverview Of Video Object Tracking System
Overview Of Video Object Tracking System
 
Soft Shadow Rendering based on Real Light Source Estimation in Augmented Reality
Soft Shadow Rendering based on Real Light Source Estimation in Augmented RealitySoft Shadow Rendering based on Real Light Source Estimation in Augmented Reality
Soft Shadow Rendering based on Real Light Source Estimation in Augmented Reality
 
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...Integrating UAV Development Technology with Augmented Reality Toward Landscap...
Integrating UAV Development Technology with Augmented Reality Toward Landscap...
 
Survey 1 (project overview)
Survey 1 (project overview)Survey 1 (project overview)
Survey 1 (project overview)
 
Detection and Tracking of Moving Object: A Survey
Detection and Tracking of Moving Object: A SurveyDetection and Tracking of Moving Object: A Survey
Detection and Tracking of Moving Object: A Survey
 
Mixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate RealitiesMixed Reality: Pose Aware Object Replacement for Alternate Realities
Mixed Reality: Pose Aware Object Replacement for Alternate Realities
 
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...
Extracting the Object from the Shadows: Maximum Likelihood Object/Shadow Disc...
 
Object detection technique using bounding box algorithm for
Object detection technique using bounding box algorithm forObject detection technique using bounding box algorithm for
Object detection technique using bounding box algorithm for
 
F1063337
F1063337F1063337
F1063337
 
Object tracking
Object trackingObject tracking
Object tracking
 
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
Point Cloud Stream on Spatial Mixed Reality: Toward Telepresence in Architect...
 
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
Automatic 3D view Generation from a Single 2D Image for both Indoor and Outdo...
 
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...
Stereo Correspondence Algorithms for Robotic Applications Under Ideal And Non...
 
Copy of 3 d report
Copy of 3 d reportCopy of 3 d report
Copy of 3 d report
 
Multiple Object Tracking
Multiple Object TrackingMultiple Object Tracking
Multiple Object Tracking
 

Similar to 291A_report_Hannah-Deepa-YunSuk

visual realism in geometric modeling
visual realism in geometric modelingvisual realism in geometric modeling
visual realism in geometric modelingsabiha khathun
 
Shadow Detection and Removal in Still Images by using Hue Properties of Color...
Shadow Detection and Removal in Still Images by using Hue Properties of Color...Shadow Detection and Removal in Still Images by using Hue Properties of Color...
Shadow Detection and Removal in Still Images by using Hue Properties of Color...ijsrd.com
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesZac Darcy
 
Gesture detection by virtual surface
Gesture detection by virtual surfaceGesture detection by virtual surface
Gesture detection by virtual surfaceAshish Garg
 
VR_Module_3_PPT.pptx
VR_Module_3_PPT.pptxVR_Module_3_PPT.pptx
VR_Module_3_PPT.pptxvrfv
 
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operator
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operatorProposed Multi-object Tracking Algorithm Using Sobel Edge Detection operator
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operatorQUESTJOURNAL
 
Implementation of Object Tracking for Real Time Video
Implementation of Object Tracking for Real Time VideoImplementation of Object Tracking for Real Time Video
Implementation of Object Tracking for Real Time VideoIDES Editor
 
Correcting garment set deformalities on virtual human model using transparanc...
Correcting garment set deformalities on virtual human model using transparanc...Correcting garment set deformalities on virtual human model using transparanc...
Correcting garment set deformalities on virtual human model using transparanc...eSAT Publishing House
 
10.1109@ecs.2015.7124874
10.1109@ecs.2015.712487410.1109@ecs.2015.7124874
10.1109@ecs.2015.7124874Ganesh Raja
 
Development of Human Tracking System For Video Surveillance
Development of Human Tracking System For Video SurveillanceDevelopment of Human Tracking System For Video Surveillance
Development of Human Tracking System For Video Surveillancecscpconf
 
Automatic Building detection for satellite Images using IGV and DSM
Automatic Building detection for satellite Images using IGV and DSMAutomatic Building detection for satellite Images using IGV and DSM
Automatic Building detection for satellite Images using IGV and DSMAmit Raikar
 
K-Means Clustering in Moving Objects Extraction with Selective Background
K-Means Clustering in Moving Objects Extraction with Selective BackgroundK-Means Clustering in Moving Objects Extraction with Selective Background
K-Means Clustering in Moving Objects Extraction with Selective BackgroundIJCSIS Research Publications
 
Conception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfConception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfSofianeHassine2
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Real time implementation of object tracking through
Real time implementation of object tracking throughReal time implementation of object tracking through
Real time implementation of object tracking througheSAT Publishing House
 

Similar to 291A_report_Hannah-Deepa-YunSuk (20)

visual realism in geometric modeling
visual realism in geometric modelingvisual realism in geometric modeling
visual realism in geometric modeling
 
Shadow Detection and Removal in Still Images by using Hue Properties of Color...
Shadow Detection and Removal in Still Images by using Hue Properties of Color...Shadow Detection and Removal in Still Images by using Hue Properties of Color...
Shadow Detection and Removal in Still Images by using Hue Properties of Color...
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
 
DragGan AI.pdf
DragGan AI.pdfDragGan AI.pdf
DragGan AI.pdf
 
Gesture detection by virtual surface
Gesture detection by virtual surfaceGesture detection by virtual surface
Gesture detection by virtual surface
 
VR_Module_3_PPT.pptx
VR_Module_3_PPT.pptxVR_Module_3_PPT.pptx
VR_Module_3_PPT.pptx
 
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operator
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operatorProposed Multi-object Tracking Algorithm Using Sobel Edge Detection operator
Proposed Multi-object Tracking Algorithm Using Sobel Edge Detection operator
 
Implementation of Object Tracking for Real Time Video
Implementation of Object Tracking for Real Time VideoImplementation of Object Tracking for Real Time Video
Implementation of Object Tracking for Real Time Video
 
Correcting garment set deformalities on virtual human model using transparanc...
Correcting garment set deformalities on virtual human model using transparanc...Correcting garment set deformalities on virtual human model using transparanc...
Correcting garment set deformalities on virtual human model using transparanc...
 
10.1109@ecs.2015.7124874
10.1109@ecs.2015.712487410.1109@ecs.2015.7124874
10.1109@ecs.2015.7124874
 
Development of Human Tracking System For Video Surveillance
Development of Human Tracking System For Video SurveillanceDevelopment of Human Tracking System For Video Surveillance
Development of Human Tracking System For Video Surveillance
 
Automatic Building detection for satellite Images using IGV and DSM
Automatic Building detection for satellite Images using IGV and DSMAutomatic Building detection for satellite Images using IGV and DSM
Automatic Building detection for satellite Images using IGV and DSM
 
2001714
20017142001714
2001714
 
K-Means Clustering in Moving Objects Extraction with Selective Background
K-Means Clustering in Moving Objects Extraction with Selective BackgroundK-Means Clustering in Moving Objects Extraction with Selective Background
K-Means Clustering in Moving Objects Extraction with Selective Background
 
Conception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdfConception_et_realisation_dun_site_Web_d.pdf
Conception_et_realisation_dun_site_Web_d.pdf
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
ei2106-submit-opt-415
ei2106-submit-opt-415ei2106-submit-opt-415
ei2106-submit-opt-415
 
D018112429
D018112429D018112429
D018112429
 
Real time implementation of object tracking through
Real time implementation of object tracking throughReal time implementation of object tracking through
Real time implementation of object tracking through
 
Animation
AnimationAnimation
Animation
 

291A_report_Hannah-Deepa-YunSuk

  • 1. HoloSwap: Object Removal and Replacement using Microsoft Hololens Yun Suk Chang∗ Deepashree Gurumurthy† Hannah Wolfe‡ ABSTRACT Object removal and replacement in augmented reality has applica- tions in interior design, marker hiding and many other fields. This project explores the viability of object removal and replacement for the Microsoft HoloLens. We implemented gesture based ob- ject selection, video inpainting, tracking and 3 different ways to display our results. We placed (1) the display plane and a white Stanford bunny[24] at the inpainted depth, (2) a solid display plane with video inpainting covering the field of view, and (3) a vignetted view of the video inpainting. We found placing the display plane and a white Stanford bunny at the inpainted depth to have the most reasonable results. Due to the semi-transparency of the HoloLens replacing the object with a white mesh and having the object on a light background is important. We also tested 3 different inpainting masks for the updating frame. Our results show that the HoloLens can be a reasonable solution in certain cases for object replacement in augmented reality. 1 INTRODUCTION With increase in research development and popularity, the number of augmented reality (AR) application increased in many different areas. One of the popular AR application is object removal and re- placement. The object removal and replacement application ranges from interior design [23] to replacing a tracking marker in maker based AR [12], using diminished reality techniques. There has been much work in tablet or other mobile device based diminished real- ity application [8, 12, 14]. However, little is known for diminished reality on see-through, head-mounted displays. There are many pros to export diminish reality applications onto see-through, head-mounted display devices. First, the user is able to see the replaced or removed object with correct projection from their view rather than from camera’s projection. Second, the user gets better experience since the actual object is hidden from their view at all times while they are wearing the display, preventing from breaking out of the immersion. Lastly, in see-through dis- plays, only the diminished portion of the real world view is affected by display and camera distortions unlike the view from the camera where whole view is affected. In this work, we perform diminished reality on Microsoft HoloLens using inpainting technique to test the viability of object replacement in see-through, head-mounted display. We test the via- bility in three different ways: (1) placing an inpainted display frame and replacement object on the object selected for replacement to test the object replacement, (2) running video-based inpainting al- gorithm in view to test object removal in changing viewpoints, and (3) blending an inpaint display frame into the scene to test the limits of minimal seam achieved. Through our evaluation, we discovered the limitations of di- minished reality on see-through, head-mounted displays, includ- ing problem of eye separation and dominance, limitation of field of view, and inability to completely block out real world scene. ∗e-mail: ychang@cs.ucsb.edu †e-mail: deepashree@umail.ucsb.edu ‡e-mail: wolfe@cs.ucsb.edu However, our results show that in certain environments, see-though display device can achieve reasonable object replacement and re- moval. 
2 RELATED WORK 2.1 Object Selection in Augmented Reality There has been much work done in selecting object in 3D space using augmented reality. For example, Olwal et al. [19] used track- ing glove to point and cast a primitive object to the scene to select an area of interest, which is used for object selection by statistical analysis. On the other hand, Oda et al. [18] uses tracked Wii remote to select the object by moving an adjustable sphere into the object of interest. More recently, Miksik et al. [15] presented the Seman- tic Paintbrush for selecting the real-world objects via a ray-casting technique and semantic segmentation. Our drawing method can be considered a pointing or ray-casting technique, almost like using a precise can of spray-paint [3, 15, 17]. 2.2 Inpainting in Augmented Reality Enomoto et al.[8] implemented inpainting of objects using multi- ple cameras. Later predetermined object removal for augmented reality was implemented by Herling et al. [10]. Researchers have implemented inpainting to cover markers in augmented reality applications[12, 14]. In their papers, they do not remove objects, but just cover predetermined 2D markers with textures. We could not find any examples of video inpainting in an augmented reality head-mounted display. 2.3 Origins of Inpainting Inpainting was first done digitally in 1995 to fix scratches in film [13]. Harrison et al.proposed a non-hierarchical procedure for re- synthesis of complex textures [9]. An output image was generated by adding pixels selected from input image. This method produced good results but took a long time for implementation. Other researchers tried to accelerate Harrison’s algorithm, and while their results were potentially faster they caused more artifacts [7, 20]. Many of these papers tried to combine, texture synthesis, replicating textures ad infinitum for large regions using exemplar- based techniques, and inpainting, filling small gaps in images by propagating linear structures via diffusion. ”Fragment-based im- age completion” would leave blurry patches [7]. ”Structure and texture filling-in of missing image blocks in wireless transmission and compression applications,” would inpaint some sections and use texture synthesis in others causing inconsistencies [20]. Cri- minisi et al.was the first group to combine texture synthesis and inpainting together effectively and efficiently. They did this by in- painting based on a priority function instead of raster-scan order [5]. The PatchMatch algorithm proposed by Barnes et al.is the ba- sis for many modern inpainting algorithms. Their algorithm finds similar patches and copies data from surrounding patches to inpaint areas [1]. 2.4 State of the Art Inpainting Inpainting has been performed on video sequences in the recent years. ”Video Inpainting of Complex Scenes” [16] produces very
  • 2. good results of inpainting. The inpainting algorithm stores infor- mation from previous frames and thereby can also recreate miss- ing text. ”Real time video inpainting using PixMix” [11] also ac- counted for illumination changes between frames while inpainting. ”Exemplar-Based Image Inpainting Using a Modified Priority Definition” [6] produced high quality output images by propagat- ing patches based on their priority definition. Lately, novel inpaint- ing algorithms have been included in pipeline for modular real time diminished reality for indoor applications [22]. This however only proposed a method which could be used in augmented reality de- vices and did not use any such device for their implementation. 3 APPROACH There are many parts of the pipeline to build an interactive applica- tion to select, remove and replace objects in augmented reality. The desired object to be removed or replaced is first selected by inter- action with HoloLens. For object removal, we use a method of in- painting using PatchMatch [1]. The first frame is captured through HoloLens and the output of the first frame is an inpainted area at the location of the selected object. We used different approaches to inpaint the current frame. The first method was by using informa- tion from entire frame, the second method only used information surrounding the selected object and the last method used informa- tion from previous inpainted frame and small area around selected object in current frame. In order to achieve effective inpainting for consecutive frames at the location of the object in world space, we incorporated object tracking into our algorithm. This would help inpainting the current location of object in every frame, thereby accounting for slight changes in position of the camera. Object replacement is implemented by placing a plane in the scene, and placing a 3D mesh(Stanford Bunny mesh) over the location of the selected object. We also display the full screen capture in field of view, which shows the resulting inpainted sequence of frames. (a) (b) Figure 1: (a) Shows what the user would see while selecting an area. (b) Shows a selected object. 3.1 Object Selection For object selection, we let the user draw a 3D contour around the object to select it through the HoloLens. For the drawing method, we designed the Surface-Drawing method, which lets the user draw on the detected real world surface in following way: For the draw- ing gesture, we use pinch-and-drag gestures to start the drawing on the detected surface and release-pinch gestures to finish the draw- ing. To reduce noise in the gesture input, we sample the user’s drawing positions at 30 Hz and the finished annotation’s path points at 1 point per 1 mm. For drawing on the surface, we define the draw- ing position as the intersection between the detected surface mesh data and a ray cast from user’s head through the finger tip position. Consequently, as the user is drawing annotations, the user can eas- ily verge on the object of interest since the annotation is displayed at the detected surface. The drawing method is determined to be effective through pilot studies. The completed contour drawing would be transformed into 2D pixel-space coordinate system so that it could be used for the in- painting algorithm. 3.2 Tracking Once the object was selected, we needed to track the objects be- tween frames. 
We used OpenCV [4] implementations of Shi et al.’s paper ”Good Features to Track” [21] and Bouguet’s ”Pyramidal im- plementation of the affine Lucas Kanade feature tracker description of the algorithm” [2]. On the original frame we ran cv::goodFeaturesToTrack to find 1000 features and then culled the features to be only ones in the selected area. In the proceeding frames we ran cv::cvCalcOpticalFlowPyrLK on the features to see which ones were still present. We then created a velocity vector from the cen- troid the of features still present both in the original frame and the update frame. We updated the original selection area with the ve- locity vector to find the new selection area to inpaint. Figure 2: Patchmatch Algorithm [CC BY-SA 3.0, Alexandre Delesse] 3.3 Inpainting Our algorithm uses the PatchMatch algorithm for inpainting. The PatchMatch algorithm aims at finding nearest-neighbor matches be- tween image patches. Best matches are found via random sampling and the coherence of imagery allows for smooth propagation across patches. In the first step of the PatchMatch algorithm, the nearest- neighbor field is initialized by filling it with random offset values or information available earlier. Iterative process is then applied to the nearest-neighbour field. During this process, offsets are examined in scan order and good patches are then propagated towards adja- cent pixels. While propagating, if the coherent mapping is good ear- lier, then all of the mapping is filled into the adjacent pixels of same coherent region. A random search is carried out in the neighbour- hood to look for the best offset found. The halting criteria for the iterations depends upon the convergence of the nearest-neighbour fields, which was found to be around 4-5 iterations as per the article by Barnes et al. [1]. Figure 2 illustrates the basis of the PatchMatch. The grey region in the ellipse is the area of the image that needs to be inpainted. The entire image is scanned for the best match for the patches sur- rounding the selected region and these patches are propagated ac- cordingly. 3.3.1 Algorithm for Video Sequence In our implementation for HoloSwap, this algorithm for inpainting is used to ”remove” the selected objects in pixel-space. We first se- lect the region of interest or the object in the visible scene through interaction with the HoloLens as in Figure 3a. The selected region of interest is then used to create a mask for inpainting. The ini- tial mask is white in the region of interest and black everywhere
  • 3. (a) Cow Selected (b) Full Inpainting Mask (c) Thin Inpainting Mask (d) Thick Inpainting Mask Figure 3: Example masks for inpainting a selection. else as shown in Figure 3b. The first frame of the video sequence is inpainted in pixel-space using the initial mask and patterns from image patches of the entire frame is used to fill in the region of inter- est. For the consecutive frames, the mask is updated to account for slight movements in the position of web camera on the HoloLens. The initial mask is altered by using the OpenCV [4] implementation of erosion, and the initial mask is subtracted from the altered previ- ous mask to give a ring of selection area on mask. This ring-shaped mask as in Figure 3c is the new update mask for current frame. The update mask is also made to shift along with object’s location by tracking the selected object’s features so that the mask remains in the right location with respect to the selected object. This process is repeated for every frame by updating masks from the previous frame, thereby achieving object removal by inpainting at the loca- tion of object in pixel-space. 3.3.2 Inpainting and Mask Updating Methods Three methods of inpainting were incorporated in attempt to im- prove efficiency. The first method involved inpainting by selecting matching texture patches from the entire frame. The PatchMatch algorithm scans the entire frame for suitable matches. In the sec- ond method, we use only the area around the selected object for in- painting in order to improve computation time. To further improve performance, we copied the inpainted portion of the first frame into the area enclosed by initial mask to maintain consistency between frames and updated the inpainting using area around selected ob- ject. In the third method, we used the ring-shaped mask as men- tioned in Section 3.2.1 to further reduce computation time. The size of the ring was varied by altering number of iterations in OpenCV’s erosion function to check variation in accuracy of inpainting. 3.4 Display We tested two forms of displaying the results: video inpainting in field of view rendering and object replacement in still frame. For both approaches, the display image is placed on a plane (in future referred to as ”display plane”) in the scene. To display the image correctly, we get the world-to-camera matrix and projection matrix from the HoloLens webcam when the image is taken. Then, the matrices are used in the shader to calculate UV texture space for the plane in following ways: First, the plane’s corner vertex posi- tions are calculated in world space. Second, the world space vertex positions are transformed so that they are relative to the physical web camera on the HoloLens. Third, the camera relative vertices are converted into normalized clipping space. Lastly, x and y com- ponents of the vertices are used to define UV space of the texture, which is used to correctly apply the image texture on to the plane. 3.4.1 Video Field of View For the video field of view, the display plane was placed slightly in front of the user’s view. This was done by positioning the display plane and rotating it based on camera to world matrix. When the in- painted display plane is placed, we repeat the procedure, receiving next frame, calling the inpainting update function, and displaying it. We originally implemented this both as a solid display plane and with the edges of the display plane vignetted to transparency. 
3.3.2 Inpainting and Mask Updating Methods

Three inpainting methods were implemented in an attempt to improve efficiency. The first method selects matching texture patches from the entire frame; the PatchMatch algorithm scans the full frame for suitable matches. In the second method, we draw patches only from the area around the selected object in order to reduce computation time. To further improve performance, we copy the inpainted portion of the first frame into the area enclosed by the initial mask to maintain consistency between frames, and update the inpainting using only the area around the selected object. In the third method, we use the ring-shaped mask described in Section 3.3.1 to reduce computation time further. The size of the ring was varied by changing the number of iterations in OpenCV's erosion function in order to study its effect on the accuracy of the inpainting.

3.4 Display

We tested two forms of displaying the results: video inpainting rendered over the field of view and object replacement in a still frame. In both approaches, the display image is placed on a plane (hereafter the "display plane") in the scene. To display the image correctly, we retrieve the world-to-camera matrix and projection matrix from the HoloLens webcam at the time the image is taken. The matrices are then used in the shader to compute the UV texture coordinates for the plane as follows. First, the plane's corner vertex positions are computed in world space. Second, the world-space vertex positions are transformed so that they are relative to the physical web camera on the HoloLens. Third, the camera-relative vertices are converted into normalized clip space. Lastly, the x and y components of the vertices are used to define the UV coordinates of the texture, which correctly maps the image texture onto the plane.

3.4.1 Video Field of View

For the video field of view, the display plane is placed slightly in front of the user's view by positioning and rotating it according to the camera-to-world matrix. Once the inpainted display plane is placed, we repeat the procedure: receive the next frame, call the inpainting update function, and display the result. We implemented this both as a solid display plane and as a display plane whose edges are vignetted to transparency.

3.4.2 Replacement in Still Frame

Object replacement in a still frame is done in four steps. First, we retrieve the selected object's 2D center position in pixel-space and the inpainted frame image from the inpainting algorithm. Second, we unproject a ray from the 2D center position onto the 3D real-world surface mesh to find an intersection point where the display plane can be placed. Third, if an intersection is found, we build the display plane with the inpainted image using the steps described above and place it at the intersection. Lastly, we place the replacement object at the same intersection point.

To prevent the display plane from occluding the replacement object, we render the two separately and draw the replacement object last, so that it appears on top of the display plane.

3.5 System Design

We used Microsoft Visual Studio and Unity to build the project. The application consists of two parts: a dynamic-link library (DLL) for inpainting, and the selection and display logic running on the HoloLens. The inpainting DLL requires OpenCV, so we had to build and include OpenCV's pre-alpha DLLs, which were not yet complete. A set of C# scripts manages data transfer with the inpainting DLL, taking screenshots, placing display planes correctly, and capturing gestures.
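The paper does not specify the DLL's exact interface. The sketch below is a hypothetical example of how such a native inpainting DLL could expose C-style entry points to the Unity-side C# scripts via P/Invoke; all names and signatures are illustrative, not HoloSwap's actual API.

```cpp
// Hypothetical C-style exports for an inpainting DLL consumed from Unity via
// P/Invoke (MSVC). Names and signatures are illustrative only.
#include <opencv2/core.hpp>
#include <cstdint>

#define HOLOSWAP_API extern "C" __declspec(dllexport)

// Receive a captured RGBA frame and the selected region, and run the initial
// full-mask inpainting. Returns 0 on success.
HOLOSWAP_API int InpaintInitialFrame(uint8_t* rgbaPixels, int width, int height,
                                     int selX, int selY, int selW, int selH)
{
    cv::Mat frame(height, width, CV_8UC4, rgbaPixels);   // wraps the managed buffer
    cv::Rect selection(selX, selY, selW, selH);
    // ... build the initial mask and run the PatchMatch-based inpainting ...
    (void)frame; (void)selection;
    return 0;
}

// Update the inpainting for a new frame using the ring-shaped mask, writing the
// result back into the caller's buffer so Unity can upload it as a texture.
HOLOSWAP_API int InpaintUpdateFrame(uint8_t* rgbaPixels, int width, int height)
{
    cv::Mat frame(height, width, CV_8UC4, rgbaPixels);
    // ... shift the mask with the tracker and inpaint only the ring region ...
    (void)frame;
    return 0;
}
```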
4 RESULTS

We tested three different ways to display the inpainting: (1) the display plane and a bunny placed at the inpainted depth, (2) a solid display plane covering the field of view, and (3) a vignetted display plane covering the field of view. Examples of these displays are shown in Figure 4. In all three cases, having the frames update every 4 seconds was not ideal and led to discontinuity when the user moved too much between frames; this caused ghosting until the next frame updated.

Figure 4: The original scene and example views of the three ways we tested object removal and replacement. (a) Original scene; (b) vignetted inpainted field-of-view image overlaid; (c) simulated image of the inpainted scene with the Stanford bunny [24] replacing the phaser; (d) inpainted field-of-view image overlaid.

We found that object replacement, displaying a plane and a white bunny at the inpainted depth, was the best way to obscure the inpainted object. In this case, the majority of the inpainted area is covered by the bunny. We chose white for the bunny because it was the least transparent color on the display. We also found that inpainting worked better against a light or white background, although our example images use a black background.

We tested both a solid and a vignetted display plane covering the user's field of view and found that both had drawbacks. The solid display plane was better at hiding the object, because no portion of the plane is transparent, but the user could easily see the display plane's edges, which was potentially distracting. The vignetted display plane's edges melted into the background, but the inpainting was semi-transparent and therefore did not cover the object as well.

5 ASSESSMENT

HoloSwap was assessed based on the accuracy and efficiency of inpainting and object replacement. Our main goal was to assess the viability of real-time object removal and replacement on the Microsoft HoloLens, so runtime was an important factor in evaluating our results.

5.1 Object Removal in Video Sequence

For the object removal test cases, we used an elliptical contour roughly 1/16th the size of the frame, centered in the scene for consistency of evaluation. Table 1 shows the effective runtime for each of the methods used.

Table 1: Inpainting runtime. Different masks drawing patches from different sized areas were tested on a 1280 x 720 frame; the selected area to inpaint was an ellipse with width and height 1/16th of the frame.

  Inpainting mask    Area inpainted from    Seconds
  Full mask          Full frame             10+
  Full mask          1/4 frame              4.85
  Thin mask          1/4 frame              4.05
  Thick mask         1/4 frame              4.2

We first used the initial mask from Figure 3b and extracted patches for inpainting from the entire 1280 x 720 frame; with this method, obtaining the inpainted result took about 10 seconds on average. To reduce the runtime, we restricted patch extraction to roughly a quarter of the frame, which gave a runtime of 4.85 seconds on average.

To improve the runtime further, we retained the inpainting information from the first frame and reduced the mask so that only the edges of the selected region were inpainted, to account for slight movements. With 10 iterations of OpenCV's erosion function (the thin mask), the runtime dropped to 4.05 seconds, but the quality of the inpainting deteriorated significantly. To improve the quality, we used a thicker mask by setting the number of erosion iterations to 20, which gave a runtime of 4.2 seconds. Although this method is slightly slower than the thin mask, we retained it as our final method because of its higher inpainting quality. There is a trade-off between inpainting quality and runtime, and we chose the method that gave the best results among those we explored.
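For illustration only, measurements like those in Table 1 could be gathered with a small timing harness of the following shape. Here inpaintWithMask is a stand-in (OpenCV's Telea inpainting) rather than our PatchMatch implementation, and the mask construction mirrors the thin (10 erosion iterations) and thick (20 iterations) variants described above.

```cpp
// Hypothetical timing harness comparing the mask variants from Table 1.
#include <opencv2/imgproc.hpp>
#include <opencv2/photo.hpp>
#include <chrono>
#include <cstdio>

// Stand-in for the PatchMatch-based inpainting in the HoloSwap DLL; OpenCV's
// Telea inpainting is used here only so that the harness is self-contained.
static cv::Mat inpaintWithMask(const cv::Mat& frame, const cv::Mat& mask)
{
    cv::Mat result;
    cv::inpaint(frame, mask, result, 3.0, cv::INPAINT_TELEA);
    return result;
}

static double timeInpaint(const cv::Mat& frame, const cv::Mat& mask)
{
    auto start = std::chrono::steady_clock::now();
    inpaintWithMask(frame, mask);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    // Synthetic 1280x720 frame and an elliptical selection whose width and
    // height are 1/16th of the frame dimensions.
    cv::Mat frame(720, 1280, CV_8UC3, cv::Scalar(200, 200, 200));
    cv::Mat fullMask = cv::Mat::zeros(frame.size(), CV_8UC1);
    cv::ellipse(fullMask, cv::Point(640, 360), cv::Size(1280 / 32, 720 / 32),
                0, 0, 360, cv::Scalar(255), cv::FILLED);

    // Thin and thick ring masks from 10 and 20 erosion iterations (Section 5.1).
    cv::Mat eroded10, eroded20;
    cv::erode(fullMask, eroded10, cv::Mat(), cv::Point(-1, -1), 10);
    cv::erode(fullMask, eroded20, cv::Mat(), cv::Point(-1, -1), 20);
    cv::Mat thinMask = fullMask - eroded10;
    cv::Mat thickMask = fullMask - eroded20;

    std::printf("full:  %.2f s\n", timeInpaint(frame, fullMask));
    std::printf("thin:  %.2f s\n", timeInpaint(frame, thinMask));
    std::printf("thick: %.2f s\n", timeInpaint(frame, thickMask));
    return 0;
}
```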
5.2 Object Replacement

We used the Stanford bunny mesh to replace the selected object. Object replacement was performed by inpainting the selected object in pixel-space and then placing the bunny at the location of the previously selected object. In most of the test results, the bunny mesh appeared to be placed at the correct location and occluded the previously selected object completely. The runtime for object replacement was similar to that of inpainting the selected object using the full mask and information from the full frame. The results were also significantly better for selected objects that contrasted sharply with their background. Object replacement appeared to produce better results on the augmented reality device than removal of the selected object alone.

6 DISCUSSION

There are several limitations of the HoloLens that affect the user experience with HoloSwap. First, the user can still see the real object behind the inpainting, since the HoloLens uses a see-through display. However, when HoloSwap is used in an environment where the selected object is surrounded by white background objects and the HoloLens display brightness is at 100%, the real object becomes hard to see because of the display brightness.

Second, every user has a different experience because of differences in eye dominance. For many users the display plane appears slightly shifted, since the HoloLens display does not account for eye dominance.

Third, there is a limit to how much we can inpaint and display at once. Because the HoloLens webcam has a smaller field of view than the human eye, we cannot inpaint the full field of view that the user can see. In addition, since the HoloLens display itself has a small field of view, it cannot show all of the content generated by HoloSwap. Consequently, if the selected object is large or close to the user, the user cannot completely swap the selected object with the replacement object.

7 CONCLUSION

We tested the viability of the HoloLens for inpainting and object removal and replacement in real time. We achieved near real-time inpainting on the HoloLens and believe that, with a more efficient algorithm, we could reach real-time performance. We found that using a thick ring-shaped mask was the best way to implement video inpainting with consistency between frames. Among our three test cases, object replacement, object inpainting with a solid plane, and object inpainting with a vignetted plane, object replacement had the most believable results. The HoloLens has certain disadvantages, such as its limited field of view, eye separation, and the see-through display. However, object replacement on see-through, head-mounted displays can be reasonable under certain circumstances, such as white or light backgrounds and replacement objects.

ACKNOWLEDGEMENTS

The authors wish to thank Matthew Turk and Tobias Höllerer for their support. We would also like to thank Adam and Brandon for letting us use the HoloLens in time of need.

REFERENCES

[1] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
[2] J.-Y. Bouguet. Pyramidal implementation of the affine Lucas Kanade feature tracker: Description of the algorithm. Intel Corporation, 5(1-10):4, 2001.
[3] D. A. Bowman, E. Kruijff, J. J. LaViola, and I. Poupyrev. 3D User Interfaces: Theory and Practice. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 2004.
[4] G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.
[5] A. Criminisi, P. Perez, and K. Toyama. Object removal by exemplar-based inpainting. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–721. IEEE, 2003.
[6] L.-J. Deng, T.-Z. Huang, and X.-L. Zhao. Exemplar-based image inpainting using a modified priority definition. PLoS ONE, 10(10):e0141199, 2015.
[7] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion. In ACM Transactions on Graphics (TOG), volume 22, pages 303–312. ACM, 2003.
[8] A. Enomoto and H. Saito. Diminished reality using multiple handheld cameras. In Proc. ACCV, volume 7, pages 130–135. Citeseer, 2007.
[9] P. Harrison. A non-hierarchical procedure for re-synthesis of complex textures. University of West Bohemia, 2001.
[10] J. Herling and W. Broll. Advanced self-contained object removal for realizing real-time diminished reality in unconstrained environments. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, pages 207–212. IEEE, 2010.
[11] J. Herling and W. Broll. High-quality real-time video inpainting with PixMix. IEEE Transactions on Visualization and Computer Graphics, 20(6):866–879, 2014.
[12] N. Kawai, M. Yamasaki, T. Sato, and N. Yokoya. AR marker hiding based on image inpainting and reflection of illumination changes. In Mixed and Augmented Reality (ISMAR), 2012 IEEE International Symposium on, pages 293–294. IEEE, 2012.
[13] A. C. Kokaram, R. D. Morris, W. J. Fitzgerald, and P. J. Rayner. Detection of missing data in image sequences. IEEE Transactions on Image Processing, 4(11):1496–1508, 1995.
[14] O. Korkalo, M. Aittala, and S. Siltanen. Light-weight marker hiding for augmented reality. In Mixed and Augmented Reality (ISMAR), 2010 9th IEEE International Symposium on, pages 247–248. IEEE, 2010.
[15] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner, S. Golodetz, S. L. Hicks, P. Perez, S. Izadi, and P. H. S. Torr. The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI), 2015.
[16] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez. Video inpainting of complex scenes. SIAM Journal on Imaging Sciences, 7(4):1993–2019, 2014.
[17] B. Nuernberger, K.-C. Lien, T. Höllerer, and M. Turk. Interpreting 2D gesture annotations in 3D augmented reality. In 2016 IEEE Symposium on 3D User Interfaces (3DUI), pages 149–158, March 2016.
[18] O. Oda and S. Feiner. 3D referencing techniques for physical objects in shared augmented reality. In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 207–215, Nov 2012.
[19] A. Olwal, H. Benko, and S. Feiner. SenseShapes: Using statistical geometry for object selection in a multimodal augmented reality system. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality, ISMAR '03, pages 300–, Washington, DC, USA, 2003. IEEE Computer Society.
[20] S. D. Rane, G. Sapiro, and M. Bertalmio. Structure and texture filling-in of missing image blocks in wireless transmission and compression applications. IEEE Transactions on Image Processing, 12(3):296–303, 2003.
[21] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on, pages 593–600. IEEE, 1994.
[22] S. Siltanen. Diminished reality for augmented reality interior design. The Visual Computer, pages 1–16, 2015.
[23] S. Siltanen, H. Sarasp, and J. Karvonen. [Demo] A complete interior design solution with diminished reality. In 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 371–372, Sept 2014.
[24] G. Turk and M. Levoy. The Stanford bunny, 1993.