Dataset creation for Deep Learning-based Geometric
Computer Vision problems
Purpose of this presentation
● This is the more ‘pragmatic’ set accompanying the slideset
analyzing the SfM-Net architecture from Google.
● The main idea in the dataset creation is to have multiple sensor
quality levels in the same rig, in order to obtain good-quality
reference data (ground truth, gold standard) with a terrestrial laser
scanner that can be used to train image restoration deep learning
networks
- so that more can be squeezed out of the inference from lower-quality
sensors as well. Think of Google Tango, iPhone 8 with depth sensing, Kinect,
etc.
● The presentation tries to address the typical problem of finding the
relevant “seed literature” for a new topic, helping fresh grad
students, postdocs, software engineers and startup founders.
- An answer to “Do you know if someone has done work on the
various steps involved in SfM?”, to identify which wheels do not need to
be re-invented
● Some of the RGB image enhancement/styling slides are not the most relevant when designing the hardware pipe per se,
but they are included to highlight the need for a systems engineering approach to the design of the whole pipeline, rather than
just obtaining the data somewhere and expecting the deep learning software to do all the magic for you.
Deep Learning for Structure-from-Motion (SfM)
https://www.slideshare.net/PetteriTeikariPhD/deconstructing-sfmnet
Future Hardware and dataset creation
Pipeline Dataset creation #1A
The Indoor Multi-sensor Acquisition System (IMAS)
presented in this paper consists of a wheeled platform
equipped with two 2D laser heads, RGB cameras,
thermographic camera, thermohygrometer, and luxmeter.
http://dx.doi.org/10.3390/s16060785
Inspired by the system of Armesto et al., one could have
a custom rig with:
● high-quality laser scanner giving the “gold standard”
for depth,
● accompanied with smart phone quality RGB and depth
sensing,
● accompanied by DSLR gold standard for RGB
● and some mid-level structured light scanner?
The rig configuration would allow multiframe exposure techniques
to be used more easily than with a handheld system (see next slide)
We saw previously that the brightness constancy assumption might be
tricky with some materials, and polarization measurement for example
can help distinguish materials (dielectric materials polarize reflected light,
whereas conductive ones do not), or there may be some other way of
estimating the Bidirectional Reflectance Distribution Function (BRDF)
Multicamera rig calibration by double-sided
thick checkerboard
Marco Marcon; Augusto Sarti; Stefano Tubaro
IET Computer Vision 2017
http://dx.doi.org/10.1049/iet-cvi.2016.0193
Pipeline Dataset creation #2a: Multiframe Techniques
Note! In deep learning, the term super-resolution refers to “statistical upsampling”, whereas in optical imaging super-resolution
typically refers to imaging techniques that resolve detail beyond the diffraction limit. Note 2! Nothing should stop someone from marrying the two, though
In practice anyone can play with super-resolution at home by putting a camera on a tripod and then taking multiple shots of the
same static scene, and post-processing them with super-resolution, which can improve the modulation transfer function (MTF) of
RGB images, and can improve depth resolution and reduce noise for laser scans and depth sensing, e.g. with Kinect.
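To make the tripod experiment concrete, the minimal sketch below aligns a burst of shots of a static scene with OpenCV’s ECC registration and simply averages them; the file-name pattern is hypothetical, and a real super-resolution pipeline would additionally upsample onto a finer grid and sharpen.

```python
# Minimal sketch: burst-average a static tripod scene to reduce noise.
# Frame file names are hypothetical; real super-resolution would also
# upsample onto a finer grid and recover MTF via deconvolution.
import glob
import cv2
import numpy as np

def align_and_average(pattern="frame_*.png"):
    paths = sorted(glob.glob(pattern))
    ref = cv2.imread(paths[0]).astype(np.float32)
    ref_gray = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)
    acc, n = ref.copy(), 1
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 100, 1e-6)
    for path in paths[1:]:
        img = cv2.imread(path).astype(np.float32)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        warp = np.eye(2, 3, dtype=np.float32)
        # ECC estimates the small residual shift between tripod shots
        _, warp = cv2.findTransformECC(ref_gray, gray, warp,
                                       cv2.MOTION_TRANSLATION, criteria)
        aligned = cv2.warpAffine(img, warp, (ref.shape[1], ref.shape[0]),
                                 flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
        acc += aligned
        n += 1
    return (acc / n).clip(0, 255).astype(np.uint8)

cv2.imwrite("averaged.png", align_and_average())
```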
https://doi.org/10.2312/SPBG/SPBG06/009-015
Cited by 47 articles
(a) One scan. (b) Final super-resolved surface from 100 scans.
“PhotoAcute software processes sets of photographs taken in continuous
mode. It utilizes superresolution algorithms to convert a sequence of images
into a single high-resolution and low-noise picture, that could only be taken with
much better camera.”
Depth looks a lot nicer when reconstructed using
50 consecutive Kinect v1 frames in comparison to
just one frame. [Data from Petteri Teikari]
Kinect multiframe reconstruction with
SiftFu [Xiao et al. (2013)]
https://github.com/jianxiongxiao/ProfXkit
Pipeline Dataset creation #2b: Multiframe Techniques
It is tedious to manually take e.g. 100 shots of the same scene, possibly involving even a 360° rotation of the imaging devices; in practice this would
need to be automated in some way, e.g. with a stepper motor driven by an Arduino, if no good commercial systems are available.
Multiframe techniques would allow another level of “nesting” of ground truths for a joint image enhancement block alongside the
proposed structure and motion network.
● The reconstructed laser scan / depth image / RGB from 100 images would be the target, and the single-frame version the input that
needs to be enhanced
Meinhardt et al. (2017)
Diamond et al. (2017)
Pipeline Dataset creation #3
A Pipeline for Generating Ground
Truth Labels for Real RGBD Data
of Cluttered Scenes
Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake
Submitted on 15 Jul 2017, last revised 25 Jul 2017
https://arxiv.org/abs/1707.04796
In this paper we develop a pipeline to rapidly
generate high quality RGBD data with
pixelwise labels and object poses. We use an
RGBD camera to collect video of a scene from
multiple viewpoints and leverage existing
reconstruction techniques to produce a 3D
dense reconstruction. We label the 3D
reconstruction using a human assisted ICP-
fitting of object meshes. By reprojecting the
results of labeling the 3D scene we can
produce labels for each RGBD image of the
scene. This pipeline enabled us to collect over
1,000,000 labeled object instances in just a
few days.
We use this dataset to answer
questions related to how much
training data is required, and of what
quality the data must be, to achieve
high performance from a DNN
architecture.
Overview of the data generation pipeline. (a) Xtion RGBD sensor
mounted on Kuka IIWA arm for raw data collection. (b) RGBD data
processed by ElasticFusion into reconstructed pointcloud. (c) User
annotation tool that allows for easy alignment using 3 clicks. User
clicks are shown as red and blue spheres. The transform mapping the
red spheres to the green spheres is then the user specified guess. (d)
Cropped pointcloud coming from user specified pose estimate is
shown in green. The mesh model shown in grey is then finely aligned
using ICP on the cropped pointcloud and starting from the user
provided guess. (e) All the aligned meshes shown in reconstructed
pointcloud. (f) The aligned meshes are rendered as masks in the RGB
image, producing pixelwise labeled RGBD images for each view.
Increasing the variety of backgrounds in the
training data for single-object scenes also
improved generalization performance for new
backgrounds, with approximately 50 different
backgrounds breaking into above-50% IoU on
entirely novel scenes. Our recommendation is to
focus on multi-object data collection in a variety
of backgrounds for the most gains in
generalization performance.
We hope that our pipeline lowers the barrier to
entry for using deep learning approaches for
perception in support of robotic manipulation
tasks by reducing the amount of human time
needed to generate vast quantities of labeled
data for your specific environment and set of
objects. It is also our hope that our analysis of
segmentation network performance provides
guidance on the type and quantity of data that
needs to be collected to achieve desired levels of
generalization performance.
Pipeline Dataset creation #4
A Novel Benchmark RGBD Dataset for Dormant Apple
Trees and Its Application to Automatic Pruning
Shayan A. Akbar, Somrita Chattopadhyay, Noha M. Elfiky, Avinash Kak;
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016
https://doi.org/10.1109/CVPRW.2016.50
Extending of the Kinect device functionality and the
corresponding database
Libor Bolecek ; Pavel Němec ; Jan Kufa ; Vaclav Ricny
Radioelektronika (RADIOELEKTRONIKA), 2017
https://doi.org/10.1109/RADIOELEK.2017.7937594
One of the possible research directions is use of infrared version of the investigated scene for improvement of the depth
map. However, the databases of the Kinect data which would contain the corresponding infrared images do not exist.
Therefore, our aim was to create such database. We want to increase the usability of the database by adding stereo
images. Moreover, the same scenes were captured by Kinect v2. The impact of the simultaneous use of
Kinect v1 and Kinect v2 to improve the depth map of the scene was also investigated. The database contains sequences of objects
on a turntable and simple scenes containing several objects.
The depth map of the scene
obtained by a) Kinect v1, b)
Kinect v2.
The comparison of the one row
of the depth map obtained by a)
Kinect v1 b) Kinect v2 with true
depth map.
Kinect infrared image after a change
of the brightness dynamics
Pipeline Multiframe Pipe #1
[Diagram: frames 1, 2, 3, …, 100 of a depth image (e.g. Kinect), a laser scan (e.g. Velodyne) and an RGB image feed into a multiframe reconstruction enhancement block, whose output serves as the target.]
Learn to improve image quality from a single image when the system is deployed.
Reconstruction could be done using traditional algorithms (e.g. OpenCV) to start with; all individual frames then need to be saved so that, when the reconstruction algorithms improve, all blocks can be iterated ad infinitum.
Then mix different image qualities and sensor
qualities in the training set to build
invariance to scan quality
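One way this quality mixing could be realized is sketched below, assuming all individual frames of each scene have been saved: each training pair uses a reconstruction from a randomly sized subset of frames as the input and the full-stack reconstruction as the target, with plain averaging standing in for the traditional reconstruction block; paths and frame counts are hypothetical.

```python
# Sketch of building quality-mixed training pairs from saved frame stacks.
# Simple averaging stands in for the "traditional reconstruction" block;
# the directory layout and frame counts here are hypothetical.
import random
import numpy as np

def load_stack(scene_dir):
    """Load all saved frames of one scene as an (N, H, W) float array."""
    return np.load(f"{scene_dir}/frames.npy")        # hypothetical file layout

def make_pair(stack, frame_counts=(1, 5, 10, 25, 50)):
    """Input = reconstruction from a random subset, target = full stack."""
    k = random.choice(frame_counts)
    subset = stack[np.random.choice(len(stack), size=k, replace=False)]
    degraded_input = subset.mean(axis=0)   # lower-quality reconstruction
    target = stack.mean(axis=0)            # best reconstruction available
    return degraded_input, target

def training_pairs(scene_dirs, pairs_per_scene=8):
    """Yield quality-mixed (input, target) pairs across all scenes."""
    for scene_dir in scene_dirs:
        stack = load_stack(scene_dir)
        for _ in range(pairs_per_scene):
            yield make_pair(stack)
```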
Pipeline Multiframe Pipe #2
You could cascade different levels of quality, if you want to make things complex, in a deeply supervised fashion
[Diagram: quality levels 1–6, from LOWEST QUALITY (just RGB) to HIGHEST QUALITY (depth map with a professional laser scanner).]
Each following step in the cascade is closer in quality to the previous one, and one could assume that this enhancement
would be easier to learn; the pipeline would also output the enhanced quality as a “side effect”, which is useful for
visualization purposes.
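A minimal PyTorch-style sketch of such a deeply supervised cascade is given below: each stage residually refines the previous output and is supervised against the next quality level; the layer sizes and plain MSE loss are arbitrary choices, not a reference design.

```python
# Sketch of a deeply supervised quality cascade (stage i is supervised
# against quality level i+1). Architecture sizes are arbitrary placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineStage(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, x):
        return x + self.net(x)              # residual refinement

class QualityCascade(nn.Module):
    def __init__(self, n_stages=5):
        super().__init__()
        self.stages = nn.ModuleList(RefineStage() for _ in range(n_stages))

    def forward(self, x):
        outputs = []
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)               # one output per quality level
        return outputs

def cascade_loss(outputs, targets):
    """targets[i] is the ground truth at quality level i+2 (levels 2..6)."""
    return sum(F.mse_loss(o, t) for o, t in zip(outputs, targets))
```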
Pipeline acquisition example with Kinect
https://arxiv.org/abs/1704.07632
KinectFusion (Newcombe et al. 2011), one of the pioneering works, showed that a real-world object as well as an
indoor scene can be reconstructed in real-time with GPU acceleration. It exploits the iterative closest point (ICP)
algorithm (Besl and McKay 1992) to track 6-DoF poses and the volumetric surface representation scheme with
signed distance functions (Curless and Levoy, 1996) to fuse 3D measurements. A number of following studies (e.g.
Choi et al. 2015) have tackled the limitation of KinectFusion; as the scale of a scene increases, it is hard to
completely reconstruct the scene due to the drift problem of the ICP algorithm as well as the large memory
consumption of volumetric integration.
To scale up the KinectFusion algorithm, Whelan et al. (2012) presented a spatially extended KinectFusion,
named Kintinuous, by incrementally adding KinectFusion results in the form of triangular meshes.
Whelan et al. (2015) also proposed ElasticFusion to tackle similar problems as well as to overcome the
problem of pose graph optimization by using surface loop closure optimization and a surfel-based
representation. Moreover, to decrease the space complexity, ElasticFusion deallocates invisible surfels from
memory; invisible surfels are allocated in memory again only if they are likely to be visible in the near
future.
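For intuition, a bare-bones sketch of the TSDF-style volumetric integration used by these systems is shown below: voxel centres are projected into each depth frame and a truncated signed distance is averaged in with a running weight; camera intrinsics and poses are assumed given, and ICP tracking, raycasting and meshing are omitted.

```python
# Bare-bones TSDF fusion of depth frames into a voxel grid (no ICP tracking,
# no raycasting or meshing). Intrinsics K and camera-to-world poses are
# assumed to come from an external tracker.
import numpy as np

def fuse_tsdf(depth_frames, poses, K, origin, voxel_size, dims, trunc=0.04):
    """depth_frames: HxW depth maps (metres); poses: camera-to-world 4x4."""
    tsdf = np.ones(dims, dtype=np.float32)       # truncated signed distances
    weight = np.zeros(dims, dtype=np.float32)

    # World coordinates of all voxel centres, shape (N, 3)
    idx = np.indices(dims).reshape(3, -1).T
    world = origin + (idx + 0.5) * voxel_size

    for depth, T_cam2world in zip(depth_frames, poses):
        T = np.linalg.inv(T_cam2world)           # world -> camera
        cam = (T[:3, :3] @ world.T + T[:3, 3:4]).T
        z = cam[:, 2]
        z_safe = np.maximum(z, 1e-6)
        u = np.round(K[0, 0] * cam[:, 0] / z_safe + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * cam[:, 1] / z_safe + K[1, 2]).astype(int)
        ok = (z > 0) & (u >= 0) & (u < depth.shape[1]) \
                     & (v >= 0) & (v < depth.shape[0])
        d = np.zeros_like(z)
        d[ok] = depth[v[ok], u[ok]]
        ok &= d > 0
        sdf = d - z                              # + in front of surface, - behind
        ok &= sdf > -trunc                       # discard voxels far behind surface
        new = np.minimum(1.0, sdf[ok] / trunc)
        vox = (idx[ok, 0], idx[ok, 1], idx[ok, 2])
        w = weight[vox]
        tsdf[vox] = (w * tsdf[vox] + new) / (w + 1.0)   # running weighted average
        weight[vox] = w + 1.0
    return tsdf, weight
```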
Pipeline Multiframe Pipe into SfM-Net
Pipeline Multiframe Pipe Quality simulation
Simulated Imagery Rendering Workflow for UAS-
Based Photogrammetric 3D Reconstruction
Accuracy Assessments
Richard K. Slocum and Christopher E. Parrish
Remote Sensing 2017, 9(4), 396; doi:10.3390/rs9040396
“Here, we present a workflow to render computer generated imagery using a virtual environment which can
mimic the independent variables that would be experienced in a real-world UAS imagery acquisition
scenario. The resultant modular workflow utilizes Blender Python API, an open source computer graphics
software, for the generation of photogrammetrically-accurate imagery suitable for SfM processing, with
explicit control of camera interior orientation, exterior orientation, texture of objects in the scene, placement
of objects in the scene, and ground control point (GCP) accuracy.”
Pictorial representation of the simUAS (simulated UAS)
imagery rendering workflow. Note: The SfM-MVS step is
shown as a “black box” to highlight the fact that the procedure
can be implemented using any SfM-MVS software, including
proprietary commercial software.
The imagery from Blender, rendered using a pinhole camera
model, is postprocessed to introduce lens and camera effects.
The magnitudes of the postprocessing effects are set high in this
example to clearly demonstrate the effect of each. The fullsize
image (left) and a close up image (right) are both shown in order to
depict both the large and small scale effects.
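The sketch below illustrates the same idea of degrading an ideal pinhole render with lens and camera effects (blur, vignetting, shot and read noise, 8-bit quantization); the magnitudes are arbitrary and this is not the simUAS post-processing code.

```python
# Sketch: degrade an ideal pinhole render with simple lens/camera effects.
# Effect magnitudes are arbitrary; this is not the simUAS implementation.
import cv2
import numpy as np

def degrade_render(img_uint8, blur_sigma=1.2, vignette_strength=0.35,
                   read_noise=2.0, photons_per_dn=40.0, seed=0):
    """img_uint8: H x W x 3 rendered image (8-bit)."""
    rng = np.random.default_rng(seed)
    img = img_uint8.astype(np.float32)

    # 1. Lens blur (stand-in for a measured point spread function)
    img = cv2.GaussianBlur(img, (0, 0), blur_sigma)

    # 2. Radial vignetting, strongest in the corners
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(xx - w / 2, yy - h / 2) / np.hypot(w / 2, h / 2)
    img *= (1.0 - vignette_strength * r ** 2)[..., None]

    # 3. Poisson shot noise + Gaussian read noise
    photons = rng.poisson(np.clip(img, 0, None) * photons_per_dn)
    img = photons / photons_per_dn + rng.normal(0, read_noise, img.shape)

    # 4. Quantize back to 8 bits
    return np.clip(img, 0, 255).astype(np.uint8)
```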
A 50 cm wide section of
the point cloud containing
a box (3 m cube) is shown
with the dense
reconstruction point
clouds overlaid to
demonstrate the effect of
point cloud dense
reconstruction quality on
accuracy near sharp
edges.
The points along the side of a
vertical plane on a box were
isolated and the error
perpendicular to the plane of
the box was visualized for
each dense reconstruction
setting, with white regions
indicating no point cloud
data. Notice that the region
with data gaps in the point
cloud from the ultra-high
setting corresponds to the
region of the plane with low
image texture, as shown in
the lower right plot.
Data fusion combining multimodal data
Pipeline data Fusion / Registration #1
“Rough estimates for 3D structure obtained using structure
from motion (SfM) on the uncalibrated images are first co-
registered with the lidar scan and then a precise alignment
between the datasets is estimated by identifying
correspondences between the captured images and
reprojected images for individual cameras from the 3D lidar
point clouds. The precise alignment is used to update both the
camera geometry parameters for the images and the individual
camera radial distortion estimates, thereby providing a 3D-to-
2D transformation that accurately maps the 3D lidar scan
onto the 2D image planes. The 3D to 2D map is then utilized to
estimate a dense depth map for each image. Experimental
results on two datasets that include independently acquired
high-resolution color images and 3D point cloud datasets
indicate the utility of the framework. The proposed approach
offers significant improvements on results obtained with
SfM alone.”
Fusing structure from motion and lidar for
dense accurate depth map estimation
Li Ding ; Gaurav Sharma
Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on
https://doi.org/10.1109/ICASSP.2017.7952363
https://arxiv.org/abs/1707.03167
“In this paper, we present RegNet, the first deep convolutional neural
network (CNN) to infer a 6 degrees of freedom (DOF) extrinsic
calibration between multimodal sensors, exemplified using a
scanning LiDAR and a monocular camera. Compared to existing
approaches, RegNet casts all three conventional calibration steps
(feature extraction, feature matching and global regression) into a single
real-time capable CNN.”
Development of the mean absolute error (MAE) of the
rotational components over training iteration for different
output representations: Euler angles are represented in red,
quaternions in brown and dual quaternions in blue. Both
quaternion representations outperform the Euler angles
representation.
“Our method yields a mean calibration error of 6 cm for translation and 0.28° for rotation, with decalibration
magnitudes of up to 1.5 m and 20°, which competes with state-of-the-art online and offline methods.”
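For reference, the translation and rotation errors quoted above are the standard metrics sketched below (Euclidean distance for translation, geodesic angle for rotation); this is not RegNet’s evaluation code.

```python
# Sketch: evaluate an estimated LiDAR-camera extrinsic against ground truth,
# reporting translation error (cm) and geodesic rotation error (degrees).
import numpy as np
from scipy.spatial.transform import Rotation as R

def calibration_errors(T_est, T_gt):
    """T_est, T_gt: 4x4 homogeneous sensor-to-sensor transforms."""
    t_err_cm = 100.0 * np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    R_delta = R.from_matrix(T_est[:3, :3]) * R.from_matrix(T_gt[:3, :3]).inv()
    rot_err_deg = np.degrees(R_delta.magnitude())   # geodesic angle
    return t_err_cm, rot_err_deg

# Example: a decalibration of 5 cm / 2 degrees about the z-axis
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[:3, :3] = R.from_euler("z", 2.0, degrees=True).as_matrix()
T_est[0, 3] = 0.05
print(calibration_errors(T_est, T_gt))   # ~ (5.0, 2.0)
```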
Pipeline data Fusion / Registration #2
Depth refinement for binocular Kinect RGB-D cameras
Jinghui Bai ; Jingyu Yang ; Xinchen Ye ; Chunping Hou
Visual Communications and Image Processing (VCIP), 2016
https://doi.org/10.1109/VCIP.2016.7805545
Pipeline data Fusion / Registration #3
Used Kinects are
inexpensive:
~£29.95 (eBay)
Use multiple Kinects at once
for better occlusion handling
Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar
IEEE Sensors Journal ( Volume: 14, Issue: 6, June 2014 )
https://doi.org/10.1109/JSEN.2014.2309987
Characterization of Different Microsoft Kinect Sensor Models
IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015)
https://doi.org/10.1109/JSEN.2015.2422611
An ANOVA analysis was performed to determine if the model of the Kinect, the operating temperature,
or their interaction were significant factors in the Kinect's ability to determine the distance to the target.
Different sized gauge blocks were also used to test how well a Kinect could reconstruct precise objects.
Machinist blocks were used to examine how well the Kinect could reconstruct objects setup on an angle
and determine the location of the center of a hole. All the Kinect models were able to determine the
location of a target with a low standard deviation (<2 mm). At close distances, the resolutions of all the
Kinect models were 1 mm. Through the ANOVA analysis, the best performing Kinect at close distances
was the Kinect model 1414, and at farther distances was the Kinect model 1473. The internal
temperature of the Kinect sensor had an effect on the distance reported by the sensor. Using different
correction factors, the Kinect was able to determine the volume of a gauge block and the angles machinist
blocks were setup at, with under a 10% error.
Pipeline data Fusion / Registration #4
A Generic Approach for Error Estimation of Depth
Data from (Stereo and RGB-D) 3D Sensors
Luis Fernandez, Viviana Avila and Luiz Gonçalves
Preprints | Posted: 23 May 2017 |
http://dx.doi.org/10.20944/preprints201705.0170.v1
“We propose an approach for estimating
the error in depth data provided by
generic 3D sensors, which are modern
devices capable of generating an image
(RGB data) and a depth map (distance)
or other similar 2.5D structure (e.g.
stereo disparity) of the scene.
We come up with a multi-platform
system and its verification and
evaluation has been done, using the
development kit of the board NVIDIA
Jetson TK1 with the MS Kinects v1/v2
and the Stereolabs ZED camera. So the
main contribution is the error
determination procedure that does
not need any data set or benchmark,
thus relying only on data acquired on-
the-fly. With a simple checkerboard, our
approach is able to determine the error
for any device”
In the article of Yang [16], an MS Kinect v2 structure is proposed to improve the accuracy of the
sensors and the depth of capture of objects that are placed more than four meters apart. It has
been concluded that an object covered with light-absorbing materials may cause less IR light to be reflected
back to the MS Kinect and therefore erroneous depth data. Other factors, such as power
consumption, complex wiring and high requirement for laptop computer also limits the use of the
sensor.
The characteristics of MS Kinect stochastic errors are presented for each direction of the axis in the work by Choo [17].
The depth error is measured using a 3D chessboard, similar to the one used in our approach. The results show that, for all
three axes, the error should be considered independently. In the work of Song [18], an approach is proposed to
generate a per-pixel confidence measure for each depth map captured by MS Kinect in indoor scenes through
supervised learning and the use of artificial intelligence.
Detection (a) and ordering (b) of corners in the three planes
of the pattern.
It would make sense to combine versions 1 and 2 for the same rig as Kinect v1 is
more accurate for close distances, and Kinect v2 more accurate for far distances
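A sketch of how such a v1/v2 combination could work once both depth maps are registered into a common viewpoint: a smooth distance-dependent blend that trusts v1 up close and v2 farther away; the crossover distance and blend width below are illustrative guesses rather than calibrated values.

```python
# Sketch: blend registered Kinect v1 and v2 depth maps with a smooth
# distance-dependent weight (v1 trusted near, v2 trusted far). The crossover
# distance and blend width are illustrative guesses, not calibrated values.
import numpy as np

def fuse_depth(d_v1, d_v2, crossover_m=2.0, width_m=0.5):
    """d_v1, d_v2: depth maps in metres, 0 = invalid, same size/viewpoint."""
    d_ref = np.where(d_v2 > 0, d_v2, d_v1)                    # rough distance estimate
    w_far = 1.0 / (1.0 + np.exp(-(d_ref - crossover_m) / width_m))  # sigmoid weight
    fused = (1.0 - w_far) * d_v1 + w_far * d_v2
    # Fall back to whichever sensor has data where the other is invalid
    fused = np.where(d_v1 == 0, d_v2, fused)
    fused = np.where(d_v2 == 0, np.where(d_v1 > 0, d_v1, 0.0), fused)
    return fused
```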
Pipeline data Fusion / Registration #5
Precise 3D/2D calibration between a RGB-D
sensor and a C-arm fluoroscope
International Journal of Computer Assisted Radiology and Surgery
August 2016, Volume 11, Issue 8, pp 1385–1395
https://doi.org/10.1007/s11548-015-1347-2
“A RMS reprojection error of 0.5 mm is achieved using
our calibration method which is promising for surgical
applications. Our calibration method is more accurate
when compared to Tsai’s method. Lastly, the simulation
result shows that using a projection matrix has a lower
error than using intrinsic and extrinsic parameters in
the rotation estimation.”
While the color camera has a relatively high resolution (1920 px ×
1080 px for Kinect 2.0), the depth camera is mid-resolution (512 px
× 424 px for Kinect 2.0) and highly noisy. Furthermore, RGB-D
sensors have a minimal distance to the scene from which they can
estimate the depth. For instance, the minimum optimal distance of
Kinect 2.0 is 50 cm.
On the other hand, C-arm fluoroscopes have a short focus, which is
typically 40 cm, and a much narrower field of view than the RGB-D
sensor with also a mid-resolution image (ours is 640 px × 480 px).
All these factors lead to a high disparity in the field of view
between the C-arm and the RGB-D sensor if the two were to be
integrated in a single system. This means that the calibration
process is crucial. We need to achieve high accuracy for the
localization of 3D points using RGB-D sensors, and we require a
calibration phantom which can be clearly imaged by both devices.
Workflow of the calibration process between the RGB-D sensor
and a C-arm. The input data include a sequence of infrared,
depth, and color images from the RGB-D sensor and X-ray images
from the C-arm. The output of the calibration pipeline is the
projection matrix, which is calculated by the 3D/2D
correspondences detected from the input data
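At its core, estimating a projection matrix from 3D/2D correspondences is the classic direct linear transform (DLT); a minimal sketch with an RMS reprojection error check is given below, leaving out the phantom detection pipeline described above.

```python
# Minimal DLT sketch: estimate a 3x4 projection matrix P from 3D/2D
# correspondences and report the RMS reprojection error (in pixels).
import numpy as np

def estimate_projection(X, x):
    """X: (N,3) 3D points, x: (N,2) 2D points, N >= 6."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        A.append([Xw, Yw, Zw, 1, 0, 0, 0, 0, -u*Xw, -u*Yw, -u*Zw, -u])
        A.append([0, 0, 0, 0, Xw, Yw, Zw, 1, -v*Xw, -v*Yw, -v*Zw, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    return Vt[-1].reshape(3, 4)          # null-space vector = flattened P

def rms_reprojection_error(P, X, x):
    Xh = np.hstack([X, np.ones((len(X), 1))])
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.sqrt(np.mean(np.sum((proj - x) ** 2, axis=1)))
```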
Pipeline data Fusion / Registration #6
Fusing Depth and Silhouette for Scanning Transparent
Object with RGB-D Sensor
Yijun Ji, Qing Xia, and Zhijiang Zhang
System overview; TSDF: truncated signed distance function; SFS: shape from silhouette.
Results on noise region. (a) Color images captured by
stationary camera with a rotating platform. (b) The noisy voxels
detected by multiple depth images are in red. (c) and (d) show
the experimental results done by a moving Kinect; the
background is changing in these two cases.
Pipeline data Fusion / Registration #7
Intensity Video Guided 4D Fusion for Improved
Highly Dynamic 3D Reconstruction
Jie Zhang, Christos Maniatis, Luis Horna, Robert B. Fisher
(Submitted on 6 Aug 2017)
https://arxiv.org/abs/1708.01946
Temporal tracking of intensity image points (of moving and
deforming objects) allows registration of the corresponding 3D
data points, whose 3D noise and fluctuations are then reduced
by spatio-temporal multi-frame 4D fusion. The results
demonstrate that the proposed algorithm is effective at
reducing 3D noise and is robust against intensity noise. It
outperforms existing algorithms with good scalability on both
stationary and dynamic objects.
The system framework (using 3 consecutive
frames as an example)
Static Plane (first row): (a) mean roughness;
(b) std of roughness vs. number of frames
fused. Falling ball (second row): (c) mean
roughness; (d) std of roughness vs.
number of frames fused
Texture-related
3D noise on a static
plane: (a) 3D frame;
(b) 3D frame with
textures. The 3D
noise is closely
related to the
textures in the
intensity image.
Illustration of 3D noise
reduction on the ball.
Spatial-temporal divisive
normalized bilateral filter (DNBF)
Pipeline data Fusion / Registration #8
Utilization of a Terrestrial Laser Scanner for
the Calibration of Mobile Mapping Systems
Seunghwan Hong, Ilsuk Park, Jisang Lee, Kwangyong Lim, Yoonjo
Choi and Hong-Gyoo Sohn
Sensors 2017, 17(3), 474; doi:10.3390/s17030474
Configuration of mobile mapping system: network video cameras (F:
front, L: left, R: right), mobile laser scanner, and Global Navigation
Satellite System (GNSS)/Inertial Navigation System (INS).
To integrate the datasets captured by each sensor mounted on the
Mobile Mapping System (MMS) into the unified single coordinate
system, the calibration, which is the process to estimate the orientation
(boresight) and position (lever-arm) parameters, is required with the
reference datasets [Schwarz and El-Sheimy 2004, Habib et al. 2010,
Chan et al. 2010].
When the boresight and lever-arm parameters defining the geometric relationship
between each sensing data and GNSS/INS data are determined, georeferenced data
can be generated. However, even after precise calibration, the boresight and lever-
arm parameters of an MMS can be shaken and the errors that deteriorate the
accuracy of the georeferenced data might accumulate. Accordingly, for the stable
operation of multiple sensors, precise calibration must be conducted periodically.
(a) Sphere target used for registration
of terrestrial laser scanning data; (b)
sphere target detected in a point cloud
(the green sphere is a fitted sphere
model).
Network video camera: AXIS F1005-E
GNSS/INS unit: OxTS Survey+
Terrestrial laser scanner (TLS): Faro Focus 3D
Mobile laser scanner: Velodyne HDL 32-E
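The sphere-target detection shown above boils down to fitting a sphere to a segmented point cluster; a linear least-squares sketch is given below, assuming the cluster has already been extracted (e.g. by RANSAC or region growing).

```python
# Sketch: least-squares sphere fit to a point cluster, as used when detecting
# sphere registration targets in terrestrial laser scans. Cluster extraction
# (e.g. RANSAC / region growing) is assumed to have happened already.
import numpy as np

def fit_sphere(points):
    """points: (N,3) array. Returns (center, radius)."""
    # ||p - c||^2 = r^2  ->  2 p.c + (r^2 - ||c||^2) = ||p||^2  (linear in c, k)
    A = np.hstack([2.0 * points, np.ones((len(points), 1))])
    b = np.sum(points ** 2, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center, k = sol[:3], sol[3]
    radius = np.sqrt(k + center @ center)
    return center, radius

# Example: noisy points on a synthetic sphere of radius 0.145 m at (1, 2, 0.5)
rng = np.random.default_rng(0)
d = rng.normal(size=(500, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)
pts = np.array([1.0, 2.0, 0.5]) + 0.145 * d + rng.normal(0, 0.002, (500, 3))
print(fit_sphere(pts))
```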
Pipeline data Fusion / Registration #9
Dense Semantic Labeling of Very-High-Resolution Aerial
Imagery and LiDAR with Fully-Convolutional Neural
Networks and Higher-Order CRFs
Yansong Liu, Sankaranarayanan Piramanayagam, Sildomar T. Monteiro, Eli Saber
http://openaccess.thecvf.com/content_cvpr_2017_workshops/w18/papers/Liu_Dense_Semantic_Labeling_CVPR_2017_paper.pdf
Our proposed decision-level fusion scheme: training one fully-convolutional
neural network on the color-infrared image (CIR) and one logistic regression using
hand-crafted features. The two probabilistic results, P_FCN and P_LR, are then combined in a
higher-order CRF framework
Main original contributions of our work are: 1) the use of energy based CRFs for efficient decision-
level multisensor data fusion for the task of dense semantic labeling. 2) the use of higher-order CRFs
for generating labeling outputs with accurate object boundaries. 3) the proposed fusion scheme has
a simpler architecture than training two separate neural networks, yet it still yields the state-of-the-
art dense semantic labeling results.
Guiding multimodal registration with learned
optimization updates
Gutierrez-Becker B, Mateus D, Peter L, Navab N
Medical Image Analysis Volume 41, October 2017, Pages 2-17
https://doi.org/10.1016/j.media.2017.05.002
Training stage (left): A set of aligned multimodal images is used to generate a training set of images with known
transformations. From this training set we train an ensemble of trees mapping the joint appearance of the images to
displacement vectors. Testing stage (right): We register a pair of multimodal images by predicting with our trained
ensemble the required displacements δ for alignment at different locations z. The predicted displacements are then
used to devise the updates of the transformation parameters to be applied to the moving image. The procedure is
repeated until convergence is achieved.
Corresponding CT (left) and MR-T1
(middle) images of the brain obtained
from the RIRE dataset. The
highlighted regions are corresponding
areas between both images (right).
Some multimodal similarity metrics
rely on structural similarities between
images obtained using different
modalities, like the ones inside the
blue boxes. However in many cases
structures which are clearly visible in
one imaging modality correspond to
regions with homogeneous voxel
values in the other modality (red and
green boxes).
Future Image restoration Natural Images (RGB)
Pipeline RGB image Restoration #1
https://arxiv.org/abs/1704.02738
Our method includes a sub-pixel motion compensation (SPMC) layer that can better handle inter-frame motion for this task, and a detail fusion (DF) network that can effectively fuse image details from multiple images after SPMC alignment.
“Hardware Super-resolution” of course all via deep learning too
https://petapixel.com/2015/02/21/a-practical-guide-to-creating-superresolution-photos-with-photoshop/
Pipeline RGB image Restoration #2A
“Data-driven Super-resolution” what super-resolution typically means in the deep learning space
Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution”
External Prior Guided Internal Prior
Learning for Real Noisy Image Denoising
Jun Xu, Lei Zhang, David Zhang
(Submitted on 12 May 2017)
https://arxiv.org/abs/1705.04505
Denoised images of a region cropped from the real noisy
image from DSLR “Nikon D800 ISO 3200 A3”,
Nam et al. 2016 (+video) by different methods. The
scene was shot 500 times with the same camera and
camera setting. The mean image of the 500 shots is
roughly taken as the “ground truth”, with which the
PSNR index can be computed. The images are better
viewed by zooming in on screen
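A small sketch of this evaluation protocol: the mean of many repeated shots of a static scene serves as an approximate ground truth, and a denoised single shot is scored by PSNR against it.

```python
# Sketch: use the mean of many shots of a static scene as approximate ground
# truth and score a denoised single shot by PSNR against it.
import numpy as np

def psnr(estimate, reference, max_val=255.0):
    mse = np.mean((estimate.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def evaluate_denoiser(shots, denoise_fn):
    """shots: (N, H, W[, 3]) stack of repeated captures of a static scene."""
    ground_truth = shots.mean(axis=0)     # e.g. mean of the 500 shots
    noisy_single = shots[0]
    return psnr(denoise_fn(noisy_single), ground_truth)
```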
Benchmarking Denoising Algorithms
with Real Photographs
Tobias Plötz, Stefan Roth
(Submitted on 5 Jul 2017)
https://arxiv.org/abs/1707.01313
“We then capture a novel benchmark dataset, the
Darmstadt Noise Dataset (DND), with consumer
cameras of differing sensor sizes. One interesting
finding is that various recent techniques that
perform well on synthetic noise are clearly
outperformed by BM3D on photographs with real
noise. Our benchmark delineates realistic evaluation
scenarios that deviate strongly from those
commonly used in the scientific literature.”
Image formation process underlying the observed low-ISO image x_r and high-ISO image x_n. They are generated from latent noise-free images y_r and y_n, respectively, which in turn are related by a linear scaling of image intensities (LS), a small camera translation (T), and a residual low-frequency pattern (LF). To obtain the denoising ground truth y_p, we apply post-processing to x_r aiming at undoing these undesirable transformations.
Mean PSNR (in dB) of the denoising methods tested on our DND benchmark. We
apply denoising either on linear raw intensities, after a variance stabilizing transformation
(VST, Anscombe), or after conversion to the sRGB space. Likewise, we evaluate the result
either in linear raw space or in sRGB space. The noisy images have a PSNR of 39.39 dB
(linear raw) and 29.98 dB (sRGB).
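For reference, the variance stabilizing transformation mentioned in the table is the Anscombe transform, which maps approximately Poisson-distributed raw intensities to roughly unit-variance Gaussian data; a minimal sketch with the simple algebraic inverse is below (an unbiased inverse is usually preferred in practice).

```python
# The Anscombe variance stabilizing transform and its simple algebraic
# inverse (in practice an unbiased inverse is usually preferred).
import numpy as np

def anscombe(x):
    """Poisson-ish counts -> approximately unit-variance Gaussian data."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

def inverse_anscombe(y):
    return (y / 2.0) ** 2 - 3.0 / 8.0

# Usage: denoise in the stabilized domain, then map back
# y = anscombe(raw); y_hat = denoiser(y); raw_hat = inverse_anscombe(y_hat)
```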
Difference between blue channels of low- and high-ISO images from Fig. 1 after various post-
processing stages. Images are smoothed for display to highlight structured residuals,
attenuating the noise.
Pipeline RGB image Restoration #2b
“Data-driven Super-resolution” what super-resolution typically means in the deep learning space
MemNet: A Persistent Memory
Network for Image Restoration
Ying Tai, Jian Yang, Xiaoming Liu, Chunyan Xu
(Submitted on 7 Aug 2017)
https://arxiv.org/abs/1708.02209
https://github.com/tyshiwo/MemNet.
Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution”
The same MemNet structure achieves the state-of-the-art performance in image denoising, super-resolution and
JPEG deblocking. Due to the strong learning ability, our MemNet can be trained to handle different levels of
corruption even using a single model.
Training Setting: Following the method of Mao et al. (2016), for
image denoising, the grayscale image is used; while for SISR and
JPEG deblocking, the luminance component is fed into the
model.
Deep Generative Adversarial
Compression Artifact Removal
Leonardo Galteri, Lorenzo Seidenari, Marco Bertini,
Alberto Del Bimbo
(Submitted on 8 Apr 2017)
https://arxiv.org/abs/1704.02518
In this work we address the problem of artifact removal using convolutional neural networks. The proposed
approach can be used as a post-processing technique applied to decompressed images, and thus can be
applied to different compression algorithms (typically applied in YCrCb color space) such as JPEG, intra-frame
coding of H.264/AVC and H.265/HEVC. Compared to super resolution techniques, working on compressed
images instead of down-sampled ones, is more practical, since it does not require to change the compression
pipeline, that is typically hardware based, to subsample the image before its coding; moreover, camera
resolutions have increased during the latest years, a trend that we can expect to continue.
Pipeline RGB image Restoration #3
An attempt to improve smartphone camera quality with a high-quality DSLR image as the ‘gold standard’, using deep learning
https://arxiv.org/abs/1704.02470
Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, Luc Van Gool
Computer Vision Laboratory, ETH Zurich, Switzerland
“Quality transfer”
Future Image Enhancement
Pipeline image enhancement #1
Aesthetics enhancement: “AI-driven Interior Design”
“Re-colorization” of scanned indoor scenes or intrinsic decomposition based editing
Limitations. We have to manually correct
inaccurate segmentation, though seldom
encountered. This is a limitation of our method.
However, segmentation errors are seldom
encountered during experiments. Since our
method is object-based, our segmentation method
does not consider the color patterns among similar
components of an image object.
Currently, our system is not capable of segmenting
the mesh according to the colored components
with similar geometry for this kind of objects. This
is another limitation of our method.
An intrinsic image decomposition method could
be helpful to our image database, for extracting
lighting-free textures to be further used in
rendering colorized scenes. However, such
methods are not so robust that can be directly
applied to various images in a large image
database. On the other hand, intrinsic image
decomposition is not essential to achieve good
results in our experiments. So we did not
incorporate it in our work, but we will further study
it to improve our database.
Pipeline image enhancement #2
“Auto-adjust” RGB texture maps for indoor scans with user interaction
We use the CIELab color space for both the input and
output images. We can use 3-channel Lab color as the
color features. However, it generates color variations in
smooth regions since each color is processed
independently. To alleviate this issue, we add the local
neighborhood information by concatenating the Lab color
and the L2 normalized first-layer convolutional feature
maps of ResNet-50.
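One possible reading of that feature construction is sketched below: per-pixel CIELab color concatenated with L2-normalized first-layer ResNet-50 feature maps, upsampled back to image resolution; this is an interpretation of the description (ImageNet input normalization omitted), not the authors’ code.

```python
# Sketch: per-pixel features = CIELab color + L2-normalized first-layer
# ResNet-50 feature maps, upsampled back to full resolution. An
# interpretation of the paper's description, not the authors' code.
import cv2
import numpy as np
import torch
import torch.nn.functional as F
import torchvision

def per_pixel_features(bgr_uint8):
    h, w = bgr_uint8.shape[:2]
    lab = cv2.cvtColor(bgr_uint8, cv2.COLOR_BGR2LAB).astype(np.float32)
    lab_t = torch.from_numpy(lab).permute(2, 0, 1).unsqueeze(0)      # 1 x 3 x H x W

    # Pretrained ResNet-50, first convolution only (torchvision >= 0.13 API)
    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
    rgb = cv2.cvtColor(bgr_uint8, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        feat = resnet.conv1(x)                   # first conv layer, stride 2
    feat = F.normalize(feat, p=2, dim=1)         # L2 normalize over channels
    feat = F.interpolate(feat, size=(h, w), mode="bilinear", align_corners=False)

    return torch.cat([lab_t, feat], dim=1)       # 1 x (3 + 64) x H x W
```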
Although the proposed method provides the users with
automatically adjusted photos, some users may want their photos
to be retouched by their own preference. In the first row of Fig. 2 for
example, a user may want only the color of the people to be changed.
For such situations, we provide a way for the users to give their own
adjustment maps to the system. Figure 4 shows some examples of
the personalization. When the input image is forwarded, we
substitute the extracted semantic adjustment map with the new
adjustment map from the user. As shown in the figure, the proposed
method effectively creates the personalized images adjusted by
user’s own style.
Deep Semantics-Aware Photo Adjustment
Seonghyeon Nam, Seon Joo Kim (Submitted on 26 Jun 2017) https://arxiv.org/abs/1706.08260
Pipeline image enhancement #3
Aesthetic-Driven Image Enhancement
by Adversarial Learning
Yubin Deng, Chen Change Loy, Xiaoou Tang
(Submitted on 17 Jul 2017)
https://arxiv.org/abs/1707.05251
Examples of image enhancement given original input (a).
The architecture of our
proposed EnhanceGAN
framework. ResNet module is
the feature extractor (for image in
CIELab color space); in this work,
we use the ResNet-101 and
removed the last average pooling
layer and the final fc layer. The
switch icons in the discriminator
network represent zero-masking
during stage-wise training
“Auto-adjust” RGB texture maps for indoor scans with GANs
Pipeline image enhancement #4
“Auto-adjust” RGB texture maps for indoor scans with GANs for “auto-matting”
Creatism: A deep-learning photographer
capable of creating professional work
Hui Fang, Meng Zhang (Submitted on 11 Jul 2017)
https://arxiv.org/abs/1707.03491
https://google.github.io/creatism/
Datasets were created that contain ratings of
photographs based on aesthetic quality
[Murray et al., 2012] [Kong et al., 2016] [Lu et al., 2015].
Using our system, we mimic the workflow of a
landscape photographer, from framing for the best
composition to carrying out various post-processing
operations. The environment for our virtual
photographer is simulated by a collection of panorama
images from Google Street View. We design a "Turing-
test"-like experiment to objectively measure quality of
its creations, where professional photographers rate a
mixture of photographs from different sources blindly.
We work with professional photographers to empirically define 4 levels of aesthetic quality:
● 1: point-and-shoot photos without consideration.
● 2: Good photos from the majority of population without art background. Nothing artistic stands out.
● 3: Semi-pro. Great photos showing clear artistic aspects. The photographer is on the right track of
becoming a professional.
● 4: Pro-work. Clearly each professional has his/her unique taste that needs calibration.
We use the AVA dataset [Murray et al., 2012] to bootstrap a consensus among them.
Assume there exists a universal aesthetics metric, Φ. By
definition, Φ needs to incorporate all aesthetic aspects, such
as saturation, detail level, composition... To define Φ with
examples, the number of images needs to grow exponentially
to cover more aspects [Jaroensri et al., 2015]. To make things
worse, unlike traditional problems such as object recognition,
what we need are not only natural images, but also pro-
level photos, which are much less in quantity.
Pipeline image enhancement #5
“Auto-adjust” images based on different user groups (or personalizing for different markets for indoor scan products)
Multimodal Prediction and Personalization of
Photo Edits with Deep Generative Models
Ardavan Saeedi, Matthew D. Hoffman, Stephen J. DiVerdi,
Asma Ghandeharioun, Matthew J. Johnson, Ryan P. Adams
CSAIL, MIT; Adobe Research; Media Lab, MIT; Harvard and Google Brain
(Submitted on 17 Apr 2017) https://arxiv.org/abs/1704.04997
The main goals of our proposed models: (a) Multimodal photo
edits: For a given photo, there may be multiple valid aesthetic
choices that are quite different from one another. (b) User
categorization: A synthetic example where different user clusters
tend to prefer different slider values. Group 1 users prefer to
increase the exposure and temperature for the baby images;
group 2 users reduce clarity and saturation for similar images.
Predictive log-likelihood for users in the test set of different datasets. For each user in the test set, we compute the predictive log-likelihood of 20 images, given 0 to
30 images and their corresponding sliders from the same user. 30 sample trajectories and the overall average ± s.e. is shown for casual, frequent and expert users.
The figure shows that knowing more about the user (up to around 10 images) can increase the predictive log-likelihood. The log-likelihood is normalized by
subtracting off the predictive log-likelihood computed given zero images. Note the different y-axis in the plots. The rightmost plot is provided for comparing the
average predictive log-likelihood across datasets.
Pipeline image enhancement #6
Combining semantic segmentation for higher quality “Instagram filters”
Exemplar-Based Image and Video Stylization
Using Fully Convolutional Semantic Features
Feida Zhu ; Zhicheng Yan ; Jiajun Bu ; Yizhou Yu
IEEE Transactions on Image Processing ( Volume: 26, Issue: 7, July 2017 )
https://doi.org/10.1109/TIP.2017.2703099
Color and tone stylization in images and videos strives to enhance unique themes with artistic color
and tone adjustments. It has a broad range of applications from professional image post-processing
to photo sharing over social networks. Mainstream photo enhancement softwares, such as Adobe
Lightroom and Instagram, provide users with predefined styles, which are often hand-crafted
through a trial-and-error process. Such photo adjustment tools lack a semantic understanding of
image contents and the resulting global color transform limits the range of artistic styles it can
represent. On the other hand, stylistic enhancement needs to apply distinct adjustments to various
semantic regions. Such an ability enables a broader range of visual styles.
Traditional professional video editing softwares (Adobe After Effects, Nuke, etc.) offer a suite of
predefined operations with tunable parameters that apply common global adjustments
(exposure/color correction, white balancing, sharpening, denoising, etc). Local adjustments within
specific spatiotemporal regions are usually accomplished with masking layers created with intensive
user interaction. Both parameter tuning and masking layer creation are labor intensive processes.
An example of learning semantics-aware photo adjustment styles. Left: Input image. Middle: Manually enhanced by
photographer. Distinct adjustments are applied to different semantic regions. Right: Automatically enhanced by our
deep learning model trained from image exemplars. (a) Input image. (b) Ground truth. (c) Our result.
Given a set of exemplar image pairs, each representing a photo before and
after pixel-level color (in CIELab space) and tone adjustments following a
particular style, we wish to learn a computational model that can automatically
adjust a novel input photo in the same style. We still cast this learning task as
a regression problem as in Yan et al. (2016). For completeness, let us first
review their problem definition and then present our new deep learning based
architecture and solution.
Pipeline image enhancement #7A
Combining semantic segmentation for higher quality “Instagram filters”
Deep Bilateral Learning for Real-Time Image Enhancement
Michaël Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, Frédo Durand MIT CSAIL, Google Research, MIT CSAIL / Inria, Université Côte d’Azur
(Submitted on 10 Jul 2017)
https://arxiv.org/abs/1707.02880 | https://github.com/mgharbi/hdrnet | https://groups.csail.mit.edu/graphics/hdrnet/
https://youtu.be/GAe0qKKQY_I
Our novel neural network architecture can reproduce sophisticated image enhancements with inference running in real
time at full HD resolution on mobile devices. It can not only be used to dramatically accelerate reference
implementations, but can also learn subjective effects from human retouching (“copycat” filter).
By performing most of its computation within a bilateral grid and by predicting local affine color transforms, our model
is able to strike the right balance between expressivity and speed. To build this model we have introduced two new
layers: a data-dependent lookup that enables slicing into the bilateral grid, and a multiplicative operation for affine
transformation. By training in an end-to-end fashion and optimizing our loss function at full resolution (despite most of
our network being at a heavily reduced resolution), our model is capable of learning full-resolution and non-scale-
invariant effects.
Pipeline image enhancement #8
Blind image quality assessment, e.g. for quantifying RGB scan quality in real time
RankIQA: Learning from Rankings for No-
reference Image Quality Assessment
Xialei Liu, Joost van de Weijer, Andrew D. Bagdanov
(Submitted on 26 Jul 2017)
https://arxiv.org/abs/1707.08347
The classical approach trains a
deep CNN regressor directly on
the ground-truth. Our approach
trains a network from an image
ranking dataset. These ranked
images can be easily generated
by applying distortions of varying
intensities. The network
parameters are then transferred
to the regression network for
finetuning. This allows for the
training of deeper and wider
networks.
Siamese network output for JPEG distortion considering 6 levels. These graphs illustrate
the fact that the Siamese network successfully manages to separate the different
distortion levels.
Blind Deep S3D Image Quality Evaluation via Local to
Global Feature Aggregation
Heeseok Oh ; Sewoong Ahn ; Jongyoo Kim ; Sanghoon Lee
IEEE Transactions on Image Processing ( Volume: 26, Issue: 10, Oct. 2017 )
https://doi.org/10.1109/TIP.2017.2725584
Future Image Styling
Pipeline image Styling #1
Aesthetics enhancement: High Dynamic Range from SfM
Large scale structure-from-motion (SfM) algorithms have recently
enabled the reconstruction of highly detailed 3-D models of our surroundings
simply by taking photographs. In this paper, we propose to leverage these
reconstruction techniques to automatically estimate the outdoor
illumination conditions for each image in a SfM photo collection. We
introduce a novel dataset of outdoor photo collections, where the ground
truth lighting conditions are known at each image. We also present an
inverse rendering approach that recovers a high dynamic range
estimate of the lighting conditions for each low dynamic range input image.
Our novel database is used to quantitatively evaluate the performance of our
algorithm. Results show that physically plausible lighting estimates can
faithfully be recovered, both in terms of light direction and intensity.
Lighting Estimation in Outdoor Image Collections
Jean-Francois Lalonde (Laval University); Iain Matthews (Disney Research)
3D Vision (3DV), 2014 2nd International Conference on
https://www.disneyresearch.com/publication/lighting-estimation-in-outdoor-image-collections/
https://doi.org/10.1109/3DV.2014.112
The main limitation of our approach is that it can recover precise lighting parameters only when lighting actually creates strongly visible
effects—such as cast shadows, shading differences amongst surfaces of different orientations—on the image. When the camera does not
observe significant lighting variations, for example when the sun is shining on a part of the building that the camera does not observe, or when the
camera only see a very small fraction of the landmark with little geometric details, our approach recovers a coarse estimate of the full lighting
conditions. In addition, our approach is sensitive to errors in geometry estimation, or to the presence of unobserved, nearby objects.
Because it does not know about these objects, our method tries to explain their cast shadows with the available geometry, which may result in
errors. Our approach is also sensitive to inter-reflections. Incorporating more sophisticated image formation models such as radiosity could
help alleviating this problem, at the expense of significantly more computation. Finally, our approach relies on knowledge of the camera
exposure and white balance settings, which might be less applicable to the case of images downloaded on the Internet. We plan to explore
these issues in future work.
Exploring material recognition for estimating
reflectance and illumination from a single image
Michael Weinmann; Reinhard Klein
MAM '16 Proceedings of the Eurographics 2016 Workshop on Material Appearance Modeling
https://doi.org/10.2312/mam.20161253
We demonstrate that reflectance and illumination can be estimated
reliably for several materials that are beyond simple Lambertian
surface reflectance behavior because of exhibiting mesoscopic
effects such as interreflections and shadows.
Shading Annotations in the Wild
Balazs Kovacs, Sean Bell, Noah Snavely, Kavita Bala
(Submitted on 2 May 2017)
https://arxiv.org/abs/1705.01156
http://opensurfaces.cs.cornell.edu/saw/
We use this data to train a
convolutional neural network
to predict per-pixel shading
information in an image. We
demonstrate the value of our
data and network in an
application to intrinsic
images, where we can reduce
decomposition artifacts
produced by existing
algorithms.
Pipeline image Styling #2A
Aesthetics enhancement: High Dynamic Range #1
Learning High Dynamic Range from
Outdoor Panoramas
Jinsong Zhang, Jean-François Lalonde
(Submitted on 29 Mar 2017 (v1), last revised 8 Aug 2017 (this version, v2))
https://arxiv.org/abs/1703.10200
http://www.jflalonde.ca/projects/learningHDR
Qualitative results on the synthetic dataset.
Top row: the ground truth HDR panorama, middle row: the LDR panorama, and
bottom row: the predicted HDR panorama obtained with our method.
To illustrate dynamic range, each panorama is shown at two exposures, with a factor of 16
between the two. For each example, we show the panorama itself (left column), and the
rendering of a 3D object lit with the panorama (right column). The object is a “spiky
sphere” on a ground plane, seen from above. Our method accurately predicts the extremely
high dynamic range of outdoor lighting in a wide variety of lighting conditions. A tonemapping
of γ = 2.2 is used for display purposes.
Real cameras have non-linear response functions. To simulate this, we randomly sample real camera
response functions from the Database of Response Functions (DoRF) [Grossberg and Nayar, 2003],
and apply them to the linear synthetic data before training.
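The sketch below captures the spirit of that augmentation: a randomly sampled nonlinear response is applied to linear synthetic radiance before training; a parametric gamma-like curve with a small toe stands in for the tabulated DoRF curves.

```python
# Sketch: apply a randomly sampled nonlinear camera response to linear
# synthetic radiance before training. A parametric gamma-like curve stands in
# for the tabulated DoRF response functions used in the paper.
import numpy as np

def sample_crf(rng):
    gamma = rng.uniform(1.8, 2.6)          # display-like gamma range (assumed)
    toe = rng.uniform(0.0, 0.05)           # small dark-end offset (assumed)
    def crf(linear):
        x = np.clip(linear, 0.0, 1.0)
        return np.clip((x + toe) / (1.0 + toe), 0.0, 1.0) ** (1.0 / gamma)
    return crf

def augment_panorama(linear_pano, rng=np.random.default_rng()):
    """linear_pano: float array in [0, 1] (already exposure-normalized)."""
    return sample_crf(rng)(linear_pano)
```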
Examples from our real dataset. For each case, we show the LDR panorama
captured by the Ricoh Theta S camera, a consumer grade point-and-shoot 360º
camera (left), and the corresponding HDR panorama captured by the Canon 5D
Mark III DSLR mounted on a tripod, equipped with a Sigma 8mm fisheye lens
(right, shown at a different exposure to illustrate the high dynamic range).
We present a full end-to-end learning approach to estimate the extremely high
dynamic range of outdoor lighting from a single, LDR 360º panorama. Our main
insight is to exploit a large dataset of synthetic data composed of a realistic virtual
city model, lit with real world HDR sky light probes [Lalonde et al. 2016
http://www.hdrdb.com/] to train a deep convolutional autoencoder
Pipeline image Styling #2b
High Dynamic Range #2: Learn illumination for relighting purposes
Learning to Predict Indoor Illumination from a Single Image
Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, Jean-François Lalonde
(Submitted on 1 Apr 2017 (v1), last revised 25 May 2017 (this version, v2))
https://arxiv.org/abs/1704.00090
Pipeline image Styling #3a
Improving photocompositing and relighting of RGB textures
Deep Image Harmonization
Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu,
Ming-Hsuan Yang
(Submitted on 28 Feb 2017)
https://arxiv.org/abs/1703.00069
Our method can adjust the appearances of the composite
foreground to make it compatible with the background
region. Given a composite image, we show the harmonized
images generated by Xue et al. (2012), Zhu et al. (2015) and
our deep harmonization network.
The overview of the proposed joint network architecture. Given a composite image and a provided foreground mask, we first pass the input through an encoder
for learning feature representations. The encoder is then connected to two decoders, including a harmonization decoder for reconstructing the harmonized output
and a scene parsing decoder to predict pixel-wise semantic labels. In order to use the learned semantics and improve harmonization results, we concatenate the
feature maps from the scene parsing decoder to the harmonization decoder (denoted as dot-orange lines). In addition, we add skip links (denoted as blue-dot
lines) between the encoder and decoders for retaining image details and textures. Note that, to keep the figure clean, we only depict the links for the harmonization
decoder, while the scene parsing decoder has the same skip links connected to the encoder.
Given an input image (a), our network
can adjust the foreground region
according to the provided mask (b)
and produce the output (c). In this
example, we invert the mask from the
one in the first row to the one in the
second row, and generate
harmonization results that account for
different context and semantic
information.
Pipeline image Styling #3b
Sky is not the limit: semantic-aware sky replacement
YH Tsai, X Shen, Z Lin, K Sunkavalli; Ming-Hsuan Yang
ACM Transactions on Graphics (TOG) - Volume 35 Issue 4, July 2016
https://doi.org/10.1145/2897824.2925942
In order to find proper skies for replacement, we propose a data-driven sky search
scheme based on semantic layout of the input image. Finally, to re-compose the
stylized sky with the original foreground naturally, an appearance transfer method is
developed to match statistics locally and semantically.
Sample sky segmentation results. Given an input image, the FCN generates results that localize the sky well but
contain inaccurate boundaries and noisy segments. The proposed online model refines segmentations that are
complete and accurate, especially around the boundaries (best viewed in color with enlarged images).
Overview of the proposed algorithm. Given an input image, we first utilize the FCN to obtain scene parsing results
and semantic response for each category. A coarse-to-fine strategy is adopted to segment sky regions (illustrated
as the red mask). To find reference images for sky replacement, we develop a method to search images with
similar semantic layout. After re-composing images with the found skies, we transfer visual semantics to match
foreground statistics between the input image and the reference image. Finally, a set of composite images with
different stylized skies are generated automatically.
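The appearance-transfer step is, at its core, statistics matching; a global Reinhard-style mean/std transfer in Lab space is sketched below (the paper applies the matching locally and semantically rather than globally).

```python
# Sketch: Reinhard-style mean/std color transfer in Lab space, the basic
# statistics-matching operation behind appearance transfer for compositing.
# The paper applies this locally per semantic region; here it is global.
import cv2
import numpy as np

def match_statistics(source_bgr, reference_bgr):
    src = cv2.cvtColor(source_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std()
        out[..., c] = (src[..., c] - s_mean) / s_std * r_std + r_mean
    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```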
GP-GAN: Towards Realistic High-Resolution Image Blending
Huikai Wu, Shuai Zheng, Junge Zhang, Kaiqi Huang
(Submitted on 21 Mar 2017 (v1), last revised 25 Mar 2017 (this version, v2))
https://arxiv.org/abs/1703.07195
Qualitative illustration of high-resolution
image blending. a) shows the composited
copy-and-paste image where the inserted
object is circled out by red lines. Users usually
expect image blending algorithms to make this
image more natural. b) represents the result
based on Modified Poisson image editing [32]. c)
indicates the result from Multi-splines approach.
d) is the result of our method Gaussian-Poisson
GAN (GP-GAN). Our approach produces better
quality images than that from the alternatives in
terms of illumination, spatial, and color
consistencies.
We advanced the state-of-the-art in conditional image generation by combining the ideas from the generative
model GAN, Laplacian Pyramid, and Gauss-Poisson Equation. This combination is the first time a generative
model could produce realistic images in arbitrary resolution. In spite of the effectiveness, our algorithm fails to
generate realistic images when the composited images are far away from the distribution of the training
dataset. We aim to address this issue in future work.
Improving photocompositing and relighting of RGB textures
Pipeline image Styling #3c
Live User-Guided Intrinsic Video for Static Scenes
Abhimitra Meka ; Gereon Fox ; Michael Zollhofer ; Christian Richardt ; Christian Theobalt
IEEE Transactions on Visualization and Computer Graphics ( Volume: PP, Issue: 99 )
https://doi.org/10.1109/TVCG.2017.2734425
Improving photocompositing and relighting of RGB textures
User constraints, in the form of constant shading and reflectance strokes,
can be placed directly on the real-world geometry using an intuitive touch-
based interaction metaphor, or using interactive mouse strokes. Fusing the
decomposition results and constraints in three-dimensional space allows for
robust propagation of this information to novel views by re-projection.
We propose a novel approach for live, user-guided intrinsic video decomposition. We
first obtain a dense volumetric reconstruction of the scene using a commodity RGB-D
sensor. The reconstruction is leveraged to store reflectance estimates and user-provided
constraints in 3D space to inform the ill-posed intrinsic video decomposition problem. Our
approach runs at real-time frame rates, and we apply it to applications such as relighting,
recoloring and material editing.
Our novel user-guided intrinsic video approach enables real-time applications such
as recoloring, relighting and material editing.
Constant reflectance strokes improve the decomposition by moving the high-frequency shading of the cloth to the shading layer.
Comparison to state-of-the-art intrinsic video decomposition techniques on the ‘girl’ dataset. Our approach matches the real-time
performance of Meka et al. (2016), while achieving the same quality as previous off-line techniques
Pipeline image Styling #4
Beyond low-level style transfer for high-level manipulation
Generative Semantic Manipulation
with Contrasting GAN
Xiaodan Liang, Hao Zhang, Eric P. Xing
(Submitted on 1 Aug 2017)
https://arxiv.org/abs/1708.00315
Generative Adversarial Networks (GANs) have recently achieved significant improvement on paired/unpaired
image-to-image translation, such as photo→sketch and artist painting style transfer. However, existing models
are only capable of transferring low-level information (e.g. color or texture changes), but fail to edit high-
level semantic meanings (e.g., geometric structure or content) of objects.
Some example semantic manipulation results by our model, which takes one image
and a desired object category (e.g. cat, dog) as inputs and then learns to
automatically change the object semantics by modifying their appearance or
geometric structure. We show the original image (left) and manipulated result (right)
in each pair.
Although our method can achieve compelling results in many semantic manipulation tasks, it shows little success for cases that require very large geometric changes, such as car ↔ truck and car ↔ bus. Integrating spatial transformation layers for explicitly learning pixel-wise offsets may help resolve very large geometric changes. More generally, our model can be extended to replace the mask annotations with predicted object masks or with attentive regions learned automatically via attention modeling. This paper pushes forward research on the unsupervised setting by demonstrating the possibility of manipulating high-level object semantics rather than the low-level color and texture changes of previous works. In addition, it would be interesting future work to develop techniques that can manipulate object interactions and activities in images/videos.
Pipeline Image Styling #5A
Aesthetics enhancement: Style Transfer | Introduction #1
Neural Style Transfer: A Review
Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song
(Submitted on 11 May 2017)
https://arxiv.org/abs/1705.04058
A list of mentioned papers in this review, corresponding codes and pre-trained models are
publicly available at: https://github.com/ycjing/Neural-Style-Transfer-Papers
One of the reasons why Neural Style Transfer has caught the eye of both academia and industry is its popularity on some social networking sites (e.g., Twitter and Facebook). The mobile application Prisma [36] is one of the first industrial applications to provide the Neural Style Transfer algorithm as a service. Before Prisma, the general public could hardly have imagined that one day they would be able to turn their photos into art paintings in only a few minutes. Due to its high quality, Prisma achieved great success and became popular around the world.
Another use of Neural Style Transfer is to act as a user-assisted creation tool. Although, to the best of our knowledge, no popular applications have yet applied the Neural Style Transfer technique in creation tools, we believe this will be a promising use in the future. Neural Style Transfer is capable of acting as a creation tool for painters and designers, making it more convenient to create an artwork of a specific style, especially computer-made fine art images. Moreover, with Neural Style Transfer algorithms it is trivial to produce stylized fashion elements for fashion designers and stylized CAD drawings for architects in a variety of styles, which would be costly to produce by hand.
Pipeline Image Styling #5b
Aesthetics enhancement: Style Transfer | Introduction #2
Neural Style Transfer: A Review
Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song
(Submitted on 11 May 2017)
https://arxiv.org/abs/1705.04058
A list of mentioned papers in this review, corresponding codes and pre-trained models are
publicly available at: https://github.com/ycjing/Neural-Style-Transfer-Papers
Promising directions for future research in Neural Style Transfer mainly focus on two aspects. The first is to solve the existing challenges of current algorithms, i.e., the problems of parameter tuning, of stroke-orientation control, and the problems in “Fast” and “Faster” Neural Style Transfer algorithms. The second is to focus on new extensions to Neural Style Transfer (e.g., Fashion Style Transfer and Character Style Transfer). There is already some preliminary work in this direction, such as the recent work of Yang et al. (2016) on Text Effects Transfer. These interesting extensions may become trending topics in the future, and related new areas may be created subsequently.
Pipeline Image Styling #5C
Aesthetics enhancement: Video Style Transfer
DeepMovie: Using Optical Flow and
Deep Neural Networks to Stylize Movies
Alexander G. Anderson, Cory P. Berg, Daniel P. Mossing,
Bruno A. Olshausen (Submitted on 26 May 2016)
https://arxiv.org/abs/1605.08153
https://github.com/anishathalye/neural-style
Coherent Online Video Style Transfer
Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, Gang Hua
(Submitted on 27 Mar 2017 (v1), last revised 28 Mar 2017 (this version, v2))
https://arxiv.org/abs/1703.09211
The main contribution of this paper is to use optical flow to initialize the
style transfer optimization so that the texture features move with the
objects in the video. Finally, we suggest a method to incorporate optical
flow explicitly into the cost function.
Overview of Our Approach: We begin by applying the style transfer algorithm to the first frame of the
movie using the content image as the initialization. Next, we calculate the optical flow field that takes the
first frame of the movie to the second frame. We apply this flow-field to the rendered version of the first
frame and use that as the initialization for the style transfer optimization for the next frame. Note, for
instance, that a blue pixel in the flow field image means that the underlying object in the video at that pixel
moved to the left from frame one to frame two. Intuitively, in order to apply the flow field to the styled
image, you move the parts of the image that have a blue pixel in the flow field to the left.
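The flow-based initialization described above can be sketched with off-the-shelf tools: estimate dense flow between consecutive content frames and warp the previously stylized frame into the new frame's coordinates as the starting point of the next optimization. Farneback flow below merely stands in for the flow methods used in these papers, and the frames are random placeholders.

import cv2
import numpy as np

h, w = 240, 320
prev_gray = np.random.randint(0, 255, (h, w), np.uint8)        # content frame t-1 (placeholder)
next_gray = np.random.randint(0, 255, (h, w), np.uint8)        # content frame t   (placeholder)
prev_styled = np.random.randint(0, 255, (h, w, 3), np.uint8)   # stylized frame t-1

# Dense flow computed from frame t back to frame t-1
# (args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Backward warp: each pixel of the initialization samples the stylized frame t-1
# at the location the flow says its content came from.
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x + flow[..., 0]).astype(np.float32)
map_y = (grid_y + flow[..., 1]).astype(np.float32)
init_for_frame_t = cv2.remap(prev_styled, map_x, map_y, cv2.INTER_LINEAR)
# init_for_frame_t would then seed the style-transfer optimization for frame t.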
We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real time. Two key ideas are an efficient network that incorporates short-term coherence, and the propagation of short-term coherence to the long term, which ensures consistency over longer periods of time. Our network can incorporate different image stylization networks. We show that the proposed method clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it achieves visually comparable coherence to optimization-based video style transfer while being three orders of magnitude faster at runtime.
There are still some limitations in our method. For instance, limited by the accuracy of the ground-truth optical flow (given by DeepFlow2 [Weinzaepfel et al. 2013]), our results may suffer from some incoherence where the motion is too large for the flow to track. And after propagation over a long period, small flow errors may accumulate, causing blurriness. These open questions are interesting for further exploration in future work.
Pipeline Image Styling #6A
Aesthetics enhancement: Texture synthesis and upsampling
TextureGAN: Controlling Deep Image
Synthesis with Texture Patches
Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu,
James Hays
(Submitted on 9 Jun 2017)
https://arxiv.org/abs/1706.02823
TextureGAN pipeline.
A feed-forward
generative network is
trained end-to-end to
directly transform a 4-
channel input to a high-
res photo with realistic
textural details.
Photo-realistic Facial Texture Transfer
Parneet Kaur, Hang Zhang, Kristin J. Dana
(Submitted on 14 Jun 2017)
https://arxiv.org/abs/1706.04306
Overview of our method. Facial identity is preserved
using Facial Semantic Regularization which
regularizes the update of meso-structures using a
facial prior and facial semantic structural loss.
Texture loss regularizes the update of local textures
from the style image. The output image is initialized
with the content image and updated at each iteration
by back-propagating the error gradients for the
combined losses. Content/style photos: Martin Schoeller/Art+Commerce.
Identity-preserving Facial Texture Transfer (FaceTex).
The textural details are transferred from style image to
content image while preserving its identity. FaceTex
outperforms existing methods perceptually as well as
quantitatively. Column 3 uses input 1 as the style image
and input 2 as the content. Column 4 uses input 1 as
the content image and input 2 as the style image.
Figure 3 shows more examples and comparison with
existing methods. Input photos: Martin Schoeller/Art+Commerce.
Pipeline Image Styling #6B
Aesthetics enhancement: Texture synthesis with style transfer
Stable and Controllable Neural Texture Synthesis
and Style Transfer Using Histogram Losses
Eric Risser, Pierre Wilmot, Connelly Barnes
Artomatix, University of Virginia
(Submitted on 31 Jan 2017 (v1), last revised 1 Feb 2017 (this version, v2))
https://arxiv.org/abs/1701.08893
Our style transfer and texture synthesis results. The input styles are
shown in (a), and style transfer results are in (b, c). Note that the angular
shapes of the Picasso painting are successfully transferred on the top row,
and that the more subtle brush strokes are transferred on the bottom row.
The original content images are inset in the upper right corner. Unless
otherwise noted, our algorithm is always run with default parameters (we do
not manually tune parameters). Input textures are shown in (d) and texture
synthesis results are in (e). For the texture synthesis, note that the algorithm
synthesizes creative new patterns and connectivities in the output.
Different statistics that can be used for neural network texture synthesis.
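The histogram losses mentioned above penalize mismatches between the value distributions of synthesized and style feature maps. The sketch below shows the underlying histogram-matching idea with the classic sorted-rank trick on plain NumPy arrays; Risser et al. formulate this as a differentiable loss on CNN activations, which is not reproduced here.

import numpy as np

def match_histogram(synth, style):
    """Remap synth so its value distribution matches style's (arrays of equal size)."""
    s_flat = synth.ravel()
    order = np.argsort(s_flat)                  # ranks of the synthesized values
    matched = np.empty_like(s_flat)
    matched[order] = np.sort(style.ravel())     # assign style values rank-by-rank
    return matched.reshape(synth.shape)

synth_feat = np.random.randn(64, 32, 32)        # placeholder activations
style_feat = 2.0 * np.random.randn(64, 32, 32) + 1.0
remapped = match_histogram(synth_feat, style_feat)
# A histogram loss can then penalize ||synth_feat - remapped||^2 during synthesis.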
Pipeline Image Styling #6C
Aesthetics enhancement: Enhancing texture maps
Depth Texture Synthesis for Realistic
Architectural Modeling
Félix Labrie-Larrivée ; Denis Laurendeau ; Jean-François Lalonde
Computer and Robot Vision (CRV), 2016 13th Conference on
https://doi.org/10.1109/CRV.2016.77
In this paper, we present a novel approach that
improves the resolution and geometry of 3D
meshes of large scenes with such repeating
elements. By leveraging structure from motion
reconstruction and an off-the-shelf depth sensor,
our approach captures a small sample of the scene
in high resolution and automatically extends that
information to similar regions of the scene.
Using RGB and SfM depth information as a guide
and simple geometric primitives as canvas, our
approach extends the high resolution mesh by
exploiting powerful, image-based texture synthesis
approaches. The final result improves on standard SfM reconstruction with higher detail.
Our approach benefits from reduced manual
labor as opposed to full RGBD reconstruction, and
can be done much more cheaply than with LiDAR-
based solutions.
In the future, we plan to work on a more
generalized 3D texture synthesis
procedure capable of synthesizing a more
varied set of objects, and able to
reconstruct multiple parts of the scene by
exploiting several high resolution scan
samples at once in an effort to address the
tradeoff mentioned above. We also plan to
improve the robustness of the approach
to a more varied set of large scale scenes,
irrespective of the lighting conditions,
material colors, and geometric
configurations. Finally, we plan to evaluate
how our approach compares to SfM on a
more quantitative level by leveraging
LiDAR data as ground truth.
Overview of the data collection and alignment procedure. Top row: a collection of photos of the scene is acquired with a typical camera, and used to generate a
point cloud via SfM [Agarwal et al. 2009] and dense multi-view stereo (MVS) [ Furukawa and Ponce, 2012]. Bottom row: a repeating feature of the scene (in this
example, the left-most window) is recorded with a Kinect sensor, and reconstructed into a high resolution mesh via the RGB-D SLAM technique KinectFusion [
Newcombe et al. 2011]. The mesh is then automatically aligned to the SfM reconstruction using bundle adjustment and our automatic scale adaptation
technique (see sec. III-C). Right: the high resolution Kinect mesh is correctly aligned to the low resolution SfM point cloud
Pipeline Image Styling #6D
Aesthetics enhancement: Towards photorealism with good maps
One Ph.D. position (supervision by Profs Niessner and Rüdiger
Westermann) is available at our chair in the area of photorealistic rendering
for deep learning and online reconstruction
Research in this project includes the development of photorealistic realtime rendering
algorithms that can be used in deep learning applications for scene understanding, and for
high-quality scalable rendering of point scans from depth sensors and RGB stereo image
reconstruction. If you are interested in applying, you should have a strong background in
computer science, i.e., efficient algorithms and data structures, and GPU programming,
have experience implementing C/C++ algorithms, and you should be excited to work on state-of-the-art research in 3D computer graphics.
https://wwwcg.in.tum.de/group/joboffers/phd-position-photorealistic-rendering-for-deep-learning-and-online-reconstruction.html
Ph.D. Position – Photorealistic Rendering for
Deep Learning and Online Reconstruction
Photorealism Explained
Blender Guru Published on May 25, 2016
http://www.blenderguru.com/tutorials/photorealism-explained/
https://youtu.be/R1-Ef54uTeU
Stop wasting time creating
texture maps by hand. All
materials on Poliigon come
with the relevant normal,
displacement, reflection and
gloss maps included. Just
plug them into your
software, and your material
is ready to render.
https://www.poliigon.com/
How to Make
Photorealistic PBR
Materials - Part 1
Blender Guru Published
on Jun 28, 2016
http://www.blenderguru.com/tutorials/pbr-shader-tutorial-pt1/
https://youtu.be/V3wghbZ-Vh4?t=24m5s
Physically Based Rendering (PBR)
Pipeline Image Styling #7
Styling line graphics (e.g. floorplans, 2D CADs) and monochrome images e.g. for desired visual identity
Real-Time User-Guided Image Colorization with
Learned Deep Priors
Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, Alexei A. Efros
(Submitted on 8 May 2017)
https://arxiv.org/abs/1705.02999
Our proposed method colorizes a grayscale image (left), guided by sparse user inputs (second), in real-time,
providing the capability for quickly generating multiple plausible colorizations (middle to right). Photograph of
Migrant Mother by Dorothea Lange, 1936 (Public Domain).
Network architecture. We train two variants of the user-interaction colorization network. Both variants use the blue layers for predicting a colorization. The Local Hints Network also uses the red layers to (a) incorporate user points Ul and (b) predict a color distribution Ẑ. The Global Hints Network uses the green layers, which transform the global input Ug by 1 × 1 conv layers and add the result into the main colorization network. Each box represents a conv layer, with the vertical dimension indicating feature-map spatial resolution and the horizontal dimension indicating the number of channels. Changes in resolution are achieved through subsampling and upsampling operations. In the main network, when resolution is decreased, the number of feature channels is doubled. Shortcut connections are added to upsampling convolution layers.
Style Transfer for Anime Sketches with
Enhanced Residual U-net and Auxiliary
Classifier GAN
Lvmin Zhang, Yi Ji, Xin Lin
(Submitted on 11 Jun 2017 (v1), last revised 13 Jun 2017 (this version, v2))
https://arxiv.org/abs/1706.03319
Examples of combination results on sketch images (top-left) and style images
(bottom-left). Our approach automatically applies the semantic features of an existing
painting to an unfinished sketch. Our network has learned to classify the hair, eyes,
skin and clothes, and has the ability to paint these features according to a sketch.
In this paper, we integrated a residual U-net with an auxiliary classifier generative adversarial network (AC-GAN, Odena et al. 2016) to apply the style to the grayscale sketch. The whole process is automatic and fast, and the results are creditable in the quality of the art style as well as the colorization.
Limitation: the pretrained VGG is for ImageNet photograph classification, not for paintings. In the future, we will train a classification network only for paintings to achieve better results. Furthermore, due to the large number of layers in our residual network, the batch size during training is limited to no more than 4. It remains for future study to reach a balance between the batch size and the number of layers.
Future Image restoration Depth Images (Kinect, etc.)
Pipeline Depth image enhancement #1a
Image Formation #1
Pinhole Camera Model: ideal projection of a 3D
object on a 2D image. Fernandez et al. (2017)
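For reference, the pinhole model reduces to a single matrix multiply with the intrinsic matrix K followed by a perspective divide; the sketch below uses made-up Kinect-like intrinsics. The inverse mapping (pixel plus depth back to 3D) is what turns a depth image into a point cloud.

import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5      # illustrative Kinect-like intrinsics
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

X = np.array([0.2, -0.1, 1.5])                    # 3D point in camera coordinates (metres), Z forward
uvw = K @ X
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]           # perspective divide -> pixel coordinates
print(f"pixel: ({u:.1f}, {v:.1f})")

depth = X[2]                                      # inverse mapping: pixel + depth -> 3D point
X_back = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))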
Dot patterns of a Kinect for Windows (a) and two Kinects for Xbox (b) and (c)
are projected on a flat wall from a distance of 1000 mm. Note that the
projection of each pattern is similar, and related by a 3-D rotation depending
on the orientation of the Kinect diffuser installation. The installation variability
can clearly be observed from differences in the bright dot locations (yellow
stars), which differ by an average distance of 10 pixels. Also displayed in (d) is
the idealized binary replication of the Kinect dot pattern [Kinect Pattern Uncovered], which was used in this project to simulate IR images. - Landau et al. (2016)
Landau et al. (2016)
Pipeline Depth image enhancement #1b
Image Formation #2
Characterizations of Noise in Kinect
Depth Images: A Review
Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar
IEEE Sensors Journal ( Volume: 14, Issue: 6, June 2014 )
https://doi.org/10.1109/JSEN.2014.2309987
Kinect outputs for a scene. (a) RGB Image. (b) Depth data rendered as an 8-
bit gray-scale image with nearer depth values mapped to lower intensities.
Invalid depth values are set to 0. Note the fixed band of invalid (black) pixels
on left. (c) Depth image showing too near depths in blue, too far depths in red
and unknown depths due to highly specular objects in green. Often these are
all taken as invalid zero depth.
Shadow is created in a depth image (Yu et al. 2013) when the incident IR
from the emitter gets obstructed by an object and no depth can be
estimated.
PROPERTIES OF IR LIGHT [Rose]
Pipeline Depth image enhancement #1c
Image Formation #3
Authors’ experiments on structural noise using a plane in 400 frames.
(a) Error at 1.2m. (b) Error at 1.6m. (c) Error at 1.8m.
Smisek et al. (2013) calibrate a Kinect against a stereo-rig
(comprising two Nikon D60 DSLR cameras) to estimate and
improve its overall accuracy. They have taken images and
fitted planar objects at 18 different distances (from 0.7 to 1.3
meters) to estimate the error between the depths measured
by the two sensors. The experiments corroborate that the
accuracy varies inversely with the square of depth [2].
However, even after the calibration of Kinect, the procedure
still exhibits relatively complex residual errors (Fig. 8).
Fig. 8. Residual noise of a plane. (a) Plane at 86cm. (b) Plane
at 104cm.
Authors’ experiments on temporal noise. Entropy and SD
of each pixel in a depth frame over 400 frames for a
stationary wall at 1.6m. (a) Entropy image. (b) SD image.
Authors’ experiments with vibrating noise showing
ZD samples as white dots. A pixel is taken as noise if
it is zero in frame i and nonzero in frames i±1. Note
that noise follows depth edges and shadow. (a)
Frame (i−1). (b) Frame i. (c) Frame (i+1). (d) Noise for
frame i.
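The per-pixel temporal-noise statistics reported above (entropy and standard deviation over ~400 frames of a static scene) are straightforward to reproduce; the sketch below computes both on a small synthetic depth stack in place of real Kinect recordings.

import numpy as np

frames = (1000.0 + 5.0 * np.random.randn(400, 120, 160)).astype(np.float32)   # synthetic static-scene depth stack (mm)

sd_image = frames.std(axis=0)                     # per-pixel standard deviation over time

def pixel_entropy(stack, bins=16):
    """Per-pixel entropy of a coarse histogram of the observed depth values."""
    _, h, w = stack.shape
    ent = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            hist, _ = np.histogram(stack[:, i, j], bins=bins)
            p = hist / hist.sum()
            p = p[p > 0]
            ent[i, j] = -(p * np.log2(p)).sum()
    return ent

entropy_image = pixel_entropy(frames)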
Pipeline Depth image enhancement #1d
Image Formation #4
The filtered intensity samples generated from
unsaturated IR dots (blue dots) were used to fit the
intensity model (red line), which follows an inverse
square model for the distance between the sensor
and the surface point Landau et al. (2016)
(a) Multiplicative speckle distribution is unitless, and can be represented as a gamma distribution Γ(4.54, 0.196). (b) Additive detector noise distribution can be represented as a normal distribution N(−0.126, 10.4), and has units of 10-bit intensity.
Landau et al. (2016)
The standard error in depth estimation (mm)
as a function of radial distance (pix) is plotted
for the (a) experimental and (b) simulated
data sets of flat walls at various depths (mm).
The experimental standard depth error
increases faster with an increase in radial
distance due to lens distortion.
Landau et al. (2016)
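The noise model above (inverse-square intensity falloff, gamma-distributed multiplicative speckle, and additive Gaussian detector noise) is easy to simulate; the sketch below samples one IR dot intensity per call, using the distribution parameters quoted in the caption and a made-up source-strength constant.

import numpy as np

rng = np.random.default_rng(0)

def simulate_ir_intensity(distance_m, i0=1.0e3):
    """One noisy intensity sample for a projected dot at the given distance (i0 is illustrative)."""
    clean = i0 / distance_m**2                     # inverse-square intensity model
    speckle = rng.gamma(shape=4.54, scale=0.196)   # multiplicative speckle, Γ(4.54, 0.196)
    detector = rng.normal(loc=-0.126, scale=10.4)  # additive detector noise, N(−0.126, 10.4)
    return clean * speckle + detector

samples = [simulate_ir_intensity(1.5) for _ in range(1000)]
print(np.mean(samples), np.std(samples))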
Pipeline Depth image enhancement #2A
Metrological Calibration #1
A New Calibration Method for
Commercial RGB-D Sensors
Walid Darwish, Shenjun Tang, Wenbin Li and Wu Chen
Sensors 2017, 17(6), 1204; doi:10.3390/s17061204
Based on these calibration algorithms, different calibration methods have been implemented and tested. Methods include the use of 1D [Liu et al. 2012], 2D [Shibo and Qing 2012], and 3D [Gui et al. 2014] calibration objects that work with the depth images directly; calibration of the manufacturer parameters of the IR camera and projector [Herrera et al. 2012]; or photogrammetric bundle adjustments used to model the systematic errors of the IR sensors [Davoodianidaliki and Saadatseresht 2013; Chow and Lichti 2013]. To enhance the depth precision, additional depth error models are added to the calibration procedure [7,8,21,22,23].
All of these error models are used to compensate only
for the distortion effect of the IR projector and camera.
Other research works have been conducted to obtain
the relative calibration between an RGB camera and an
IR camera by accessing the IR camera [24,25,26]. This
can achieve relatively high accuracy calibration
parameters for a baseline between IR and RGB cameras,
while the remaining limitation is that the distortion
parameters for the IR camera cannot represent the full
distortion effect for the depth sensor.
This study addressed these issues using a two-step
calibration procedure to calibrate all of the geometric
parameters of RGB-D sensors. The first step was related
to the joint calibration between the RGB and IR cameras,
which was achieved by adopting the procedure
discussed in [27] to compute the external baseline
between the cameras and the distortion parameters of
the RGB camera. The second step focused on the depth
sensor calibration.
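The first, joint RGB-IR step builds on standard checkerboard calibration; a generic OpenCV sketch of that building block is shown below (individual intrinsics, then the rigid baseline via stereo calibration). It is not the paper's full procedure, which additionally models depth-sensor errors, and the file patterns and board geometry are placeholders.

import glob
import cv2
import numpy as np

board = (9, 6)                                   # inner checkerboard corners (placeholder)
square = 0.025                                   # square size in metres (placeholder)
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

obj_pts, rgb_pts, ir_pts = [], [], []
for rgb_file, ir_file in zip(sorted(glob.glob("rgb_*.png")), sorted(glob.glob("ir_*.png"))):
    rgb = cv2.imread(rgb_file, cv2.IMREAD_GRAYSCALE)
    ir = cv2.imread(ir_file, cv2.IMREAD_GRAYSCALE)
    ok1, c1 = cv2.findChessboardCorners(rgb, board)
    ok2, c2 = cv2.findChessboardCorners(ir, board)
    if ok1 and ok2:
        obj_pts.append(objp); rgb_pts.append(c1); ir_pts.append(c2)

if not obj_pts:
    raise SystemExit("no checkerboard pairs found - point the globs at real image pairs")

size = rgb.shape[::-1]                           # assumes both cameras share the image size (a simplification)
_, K_rgb, d_rgb, _, _ = cv2.calibrateCamera(obj_pts, rgb_pts, size, None, None)
_, K_ir, d_ir, _, _ = cv2.calibrateCamera(obj_pts, ir_pts, size, None, None)

# Rigid transform (R, T) taking points from the IR camera frame to the RGB camera frame = the baseline.
_, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, ir_pts, rgb_pts, K_ir, d_ir, K_rgb, d_rgb, size,
    flags=cv2.CALIB_FIX_INTRINSIC)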
Point cloud of two perpendicular planes (blue
color: default depth; red color: modeled depth):
highlighted black dashed circles shows the
significant impact of the calibration method on the
point cloud quality.
The main difference between the two sensors is the baseline between the IR camera and projector. The longer the sensor's baseline, the longer the working distance that can be achieved. The working range of Kinect v1 is 0.80 m to 4.0 m, while it is 0.35 m to 3.5 m for the Structure Sensor.
Pipeline Depth image enhancement #2A
Metrological Calibration #2
Photogrammetric Bundle
Adjustment With Self-
Calibration of the PrimeSense
3D Camera Technology:
Microsoft Kinect
IEEE Access ( Volume: 1 ) 2013
https://doi.org/10.1109/ACCESS.2013.2271860
Roughness of point cloud before calibration. (Bottom) Roughness of point
cloud after calibration. The colours indicate the roughness as measured
by the normalized smallest eigenvalue.
Estimated Standard Deviation of the Observation Residuals
To quantify the external accuracy of the Kinect and the benefit of the proposed calibration,
a target board located at 1.5–1.8 m away with 20 signalized targets was imaged using an in-
house program based on the Microsoft Kinect SDK and with RGBDemo. Spatial distances
between the targets were known from surveying using the FARO Focus3D terrestrial laser
scanner with a standard deviation of 0.7 mm. By comparing the 10 independent spatial
distances measured by the Kinect to those made by the Focus3D, the RMSE was 7.8 mm
using RGBDemo and 3.7 mm using the calibrated Kinect results, showing a 53% improvement in accuracy. This accuracy check assesses the quality of all the imaging sensors and not just the IR camera-projector pair alone.
The results show improvements in geometric accuracy up to 53%
compared with uncalibrated point clouds captured using the popular
software RGBDemo. Systematic depth discontinuities were also
reduced and in the check-plane analysis the noise of the Kinect point
cloud was reduced by 17%.
Pipeline Depth image enhancement #2B
Metrological Calibration #3
Evaluating and Improving the Depth Accuracy of
Kinect for Windows v2
Lin Yang ; Longyu Zhang ; Haiwei Dong ; Abdulhameed
Alelaiwi ; Abdulmotaleb El Saddik
IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015)
https://doi.org/10.1109/JSEN.2015.2416651
Illustration of accuracy assessment of Kinect v2. (a) Depth accuracy. (b) Depth
resolution. (c) Depth entropy. (d) Edge noise. (e) Structural noise. The target plates in (a-
c) and (d-e) are parallel and perpendicular with the depth axis, respectively.
Accuracy error distribution
of Kinect for Windows v2.
Pipeline Depth image enhancement #2c
A Comparative Error Analysis of Current
Time-of-Flight Sensors
IEEE Transactions on Computational Imaging (Volume: 2, Issue: 1, March 2016)
https://doi.org/10.1109/TCI.2015.2510506
For evaluating the presence of wiggling, ground truth distance
information is required. We calculate the true distance by setting
up a stereo camera system. This system consists of the ToF camera
to be evaluated and a high resolution monochrome camera (IDS
UI-1241LE7) which we call the reference camera.
The cameras are calibrated with Zhang (2000)’s algorithm with
point correspondences computed with ROCHADE (Placht et al. 2014). Ground truth is calculated by intersecting the rays of all ToF camera pixels with the 3D plane of the checkerboard. For higher accuracy, we compute this plane from corners detected in the reference image and transform the plane into the coordinate system of the ToF camera.
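The ground-truth construction described above amounts to back-projecting every ToF pixel through the intrinsics and intersecting the resulting ray with the estimated checkerboard plane. The sketch below does exactly that with made-up intrinsics and plane parameters.

import numpy as np

K = np.array([[360.0, 0.0, 160.0],
              [0.0, 360.0, 120.0],
              [0.0, 0.0, 1.0]])                  # placeholder ToF intrinsics
h, w = 240, 320

# Plane in ToF camera coordinates: n . X = d (unit normal n, offset d in metres).
n = np.array([0.0, 0.0, 1.0])
d = 1.5                                          # checkerboard plane 1.5 m in front of the camera

u, v = np.meshgrid(np.arange(w), np.arange(h))
pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # homogeneous pixel coordinates
rays = np.linalg.inv(K) @ pix                    # back-projected ray directions
t = d / (n @ rays)                               # ray parameter at the plane
points = (rays * t).T.reshape(h, w, 3)           # 3D intersection per pixel
gt_range = np.linalg.norm(points, axis=-1)       # ground-truth radial distance image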
This experiment aims to quantify the so-
called amplitude-related distance error and
also to show that this effect is not related to
scattering. This effect can be observed when
looking at a planar surface with high
reflectivity variations. With some sensors the
distance measurements for pixels with
different amplitudes do not lie on the same
plane, even though they should.
To the best of our knowledge no evaluation
setup has been presented for this error
source so far. In the past this error has been
typically observed with images of
checkerboards or other high contrast
patterns. However, the analysis of single images allows no differentiation between amplitude-related errors and internal scattering.
Metrological Calibration #4
Pipeline Depth image enhancement #2c
Metrological Calibration #5
Low-Cost Reflectance-Based Method for the
Radiometric Calibration of Kinect 2
IEEE Sensors Journal ( Volume: 16, Issue: 7, April1, 2016 )
https://doi.org/10.1109/JSEN.2015.2508802
In this paper, a reflectance-based radiometric
method for the second generation of gaming
sensors, Kinect 2, is presented and discussed. In
particular, a repeatable methodology generalizable
to different gaming sensors by means of a calibrated
reference panel with Lambertian behavior is
developed.
The relationship between the received power and
the final digital level is obtained by means of a
combination of linear sensor relationship and
signal attenuation, into a least squares adjustment
with an outlier detector. The results confirm that the quality of the method (standard deviation better than 2% in laboratory conditions and discrepancies lower than 7%) is valid for exploiting the radiometric possibilities of this low-cost sensor, whose applications range from pathological analysis (moisture, crusts, etc.) to agricultural and forest resource evaluation.
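The "least squares adjustment with an outlier detector" can be illustrated with a generic iteratively re-fit linear model that rejects 3-sigma residuals; the data below are synthetic and the paper's attenuation model is not reproduced.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 1.0, 200)                  # e.g. modelled received power (arbitrary units)
y = 800.0 * x + 30.0 + rng.normal(0, 5, 200)    # e.g. recorded digital numbers (DN)
y[::25] += 150.0                                # inject a few outliers

A = np.column_stack([x, np.ones_like(x)])
mask = np.ones_like(x, dtype=bool)
for _ in range(5):                              # iteratively re-fit after rejecting outliers
    coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
    resid = y - A @ coef
    sigma = resid[mask].std()
    mask = np.abs(resid) < 3.0 * sigma          # 3-sigma outlier detector

print(f"gain = {coef[0]:.1f} DN per unit power, offset = {coef[1]:.1f} DN")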
3D data acquired with Kinect 2 (left) and digital
number (DN) distribution (right) for the reference
panel at 0.7 m (units: counts).
Visible-RGB view of the brick wall (a), intensity-IR
digital levels (DN) (b-d) and calibrated reflectance
values (e-g) for the three acquisition distances
The objective of this paper was to develop a radiometric calibration equation of an IR projector-
camera for the second generation of gaming sensors, Kinect 2, to convert the recorded
digital levels into physical values (reflectance). By the proposed equation, the reflectance
properties of the IR projector-camera set of Kinect 2 were obtained. This new equation will
increase the number of application fields of gaming sensors, favored by the possibility of
working outdoors.
The process of radiometric calibration should be incorporated as part of an integral process
where the geometry obtained is also corrected (i.e., lens distortion, mapping function, depth
errors, etc.). As future perspectives, the effects of the diffuse radiance, which does not belong to the sensor footprint and contaminates the received signal, will be evaluated to determine the error budget in the active sensor.
Pipeline Depth image enhancement #3
‘Old-school’ depth refining techniques
Depth enhancement with improved exemplar-based
inpainting and joint trilateral guided filtering
Liang Zhang ; Peiyi Shen ; Shu'e Zhang ; Juan Song ; Guangming Zhu
Image Processing (ICIP), 2016 IEEE International Conference on
https://doi.org/10.1109/ICIP.2016.7533131
In this paper, a novel depth enhancement algorithm with improved
exemplar-based inpainting and joint trilateral guided filtering is
proposed. The improved exemplar-based inpainting method is
applied to fill the holes in the depth images, in which the level set
distance component is introduced in the priority evaluation function.
Then a joint trilateral guided filter is adopted to denoise and smooth
the inpainted results. Experimental results reveal that the proposed
algorithm can achieve better enhancement results compared with the
existing methods in terms of subjective and objective quality
measurements.
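A minimal 'old-school' baseline in the same spirit (hole filling followed by edge-preserving smoothing) can be put together from stock OpenCV calls, as sketched below on a synthetic depth frame. This is a generic stand-in, not the papers' exemplar-based inpainting or (joint) trilateral filters.

import cv2
import numpy as np

depth = (1000 + 30 * np.random.randn(240, 320)).astype(np.float32)   # fake depth map (mm)
depth[100:120, 150:180] = 0                                           # fake zero-depth holes

valid = depth > 0
hole_mask = (~valid).astype(np.uint8)

# Inpaint on an 8-bit proxy, then copy the filled values back into the float map.
d_min, d_max = depth[valid].min(), depth.max()
depth8 = np.zeros(depth.shape, np.uint8)
depth8[valid] = np.clip((depth[valid] - d_min) / (d_max - d_min) * 255, 0, 255).astype(np.uint8)
filled8 = cv2.inpaint(depth8, hole_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
filled = depth.copy()
filled[~valid] = d_min + filled8[~valid].astype(np.float32) / 255.0 * (d_max - d_min)

# Edge-preserving denoising of the filled map (sigmaColor is in depth units, here mm).
smoothed = cv2.bilateralFilter(filled, d=5, sigmaColor=25, sigmaSpace=5)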
Robust depth enhancement and optimization based on
advanced multilateral filters
Ting-An Chang, Yang-Ting Chou, Jar-Ferr Yang
EURASIP Journal on Advances in Signal Processing December 2017, 2017:51
https://doi.org/10.1186/s13634-017-0487-7
Results of the depth enhancement coupled with hole filling obtained by (a) a noisy depth map, (b) joint bilateral filter (JBF) [16], (c) intensity-guided depth super-resolution (IGDS) [39], (d) compressive sensing based depth upsampling (CSDU) [40], (e) adaptive joint trilateral filter (AJTF) [18], and (f) the proposed AMF, for Art, Books, Doily, Moebius, RGBD_1, and RGBD_2.
Pipeline Depth image enhancement #4A
Deep learning-based depth refining techniques
DepthComp : real-time depth image completion based on
prior semantic scene segmentation
Atapour-Abarghouei, A. and Breckon, T.P.
28th British Machine Vision Conference (BMVC) 2017 London, UK, 4-7 September 2017.
http://dro.dur.ac.uk/22375/
Exemplar results on the KITTI dataset. S denotes the segmented images [3] and D the original (unfilled) disparity maps.
Results are compared with [1, 2, 29, 35, 45]. Results of cubic and linear interpolations are omitted due to space.
Comparison of the proposed method using different initial segmentation techniques on the KITTI dataset [27].
Original color and disparity image (top-left), results with manual labels (top-right), results with SegNet [3] (bottom-left)
and results with mean-shift [26] (bottom-right).
Fast depth image denoising and enhancement using a deep
convolutional network
Xin Zhang and Ruiyuan Wu
Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on
https://doi.org/10.1109/ICASSP.2016.7472127
Pipeline Depth image enhancement #4b
Deep learning-based depth refining techniques
Guided deep network for depth map super-resolution: How
much can color help?
Wentian Zhou ; Xin Li ; Daryl Reynolds
Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE
https://doi.org/10.1109/ICASSP.2017.7952398
https://anvoy.github.io/publication.html
Depth map upsampling using joint edge-guided
convolutional neural network for virtual view synthesizing
Yan Dong; Chunyu Lin; Yao Zhao; Chao Yao
Journal of Electronic Imaging Volume 26, Issue 4
http://dx.doi.org/10.1117/1.JEI.26.4.043004
Depth map upsampling. Input: (a) low-resolution depth map and (b) the corresponding color image.
Output: (c) recovered high-resolution depth map.
When the depth edges become unreliable, our network
tends to rely on color-based prediction network (CBPN) for
restoring more accurate depth edges. Therefore,
contribution of color image increases when the reliability of
the LR depth map decreases (e.g., as noise gets stronger).
We adopt the popular deep CNN to learn non-linear
mapping between LR and HR depth maps. Furthermore, a
novel color-based prediction network is proposed to
properly exploit supplementary color information in
addition to the depth enhancement network.
In our experiments, we have shown that the deep neural network-based approach is superior to several existing state-of-the-art methods. Further comparisons are reported to confirm our analysis that the contribution of the color image varies significantly depending on the reliability of the LR depth maps.
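The general recipe shared by these works (upsample the LR depth, extract guidance features from the registered color image, fuse, and predict a residual) can be sketched as a small two-branch network. The PyTorch layers below are illustrative only and do not reproduce the cited architectures.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedDepthSR(nn.Module):
    def __init__(self, scale=4):
        super().__init__()
        self.scale = scale
        self.depth_branch = nn.Sequential(          # features from the upsampled depth
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.color_branch = nn.Sequential(          # guidance features from the color image
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, lr_depth, hr_color):
        up = F.interpolate(lr_depth, scale_factor=self.scale,
                           mode="bicubic", align_corners=False)
        feats = torch.cat([self.depth_branch(up), self.color_branch(hr_color)], dim=1)
        return up + self.fuse(feats)                # residual prediction on top of bicubic upsampling

lr = torch.rand(1, 1, 60, 80)                       # low-resolution depth
rgb = torch.rand(1, 3, 240, 320)                    # registered high-resolution color
hr_pred = GuidedDepthSR()(lr, rgb)                  # -> (1, 1, 240, 320)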
Future Image restoration Depth Images (Laser scanning)
Pipeline Laser Range Finding #1a
Versatile Approach to Probabilistic
Modeling of Hokuyo UTM-30LX
IEEE Sensors Journal ( Volume: 16, Issue: 6, March15, 2016 )
https://doi.org/10.1109/JSEN.2015.2506403
When working with Laser Range Finding (LRF), it is necessary to know the sensor's measurement principle and its properties. There are several measurement principles used in LRFs [Nejad and Olyaee 2006; Łabęcki et al. 2012; Adams 1999]; see the toy distance calculations after the list below:
● Triangulation
● Time of flight (TOF)
● Frequency modulation continuous wave (FMCW)
● Phase shift measurement (PSM)
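The toy distance calculations below illustrate the two most common principles from the list above, direct time of flight and phase-shift measurement, with purely illustrative numbers.

import math

c = 299_792_458.0                    # speed of light (m/s)

# Time of flight: the pulse travels to the target and back.
t_round_trip = 20e-9                 # 20 ns echo delay (illustrative)
d_tof = c * t_round_trip / 2.0       # -> ~3.0 m

# Phase shift: distance from the phase delay of an amplitude-modulated signal.
f_mod = 10e6                         # 10 MHz modulation frequency (illustrative)
phi = math.pi / 2                    # measured phase shift (rad)
d_psm = c * phi / (4.0 * math.pi * f_mod)   # -> ~3.75 m, unambiguous only up to c / (2 * f_mod)

print(d_tof, d_psm)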
The geometry of terrestrial laser scanning; identification of
errors, modeling and mitigation of scanning geometry
Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016)
http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5
Distance measurement principle of time-of-flight laser
scanners (top) and phase based laser scanners
(bottom).
Laser Range Finding : Image formation #1
Pipeline Laser Range Finding #1b
Laser Range Finding : Image formation #2
The geometry of terrestrial laser scanning; identification of
errors, modeling and mitigation of scanning geometry
Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016)
http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5
Two ways link budget between the receiver (Rx) and
the transmitter (Tx) in a Free Space Path (FSP)
propagation model.
Schematic representation of the signal propagation from
the transmitter to the receiver.
Effect of increasing incidence angle and range on the signal deterioration. (left) Plot of the signal deterioration due to increasing incidence angle α; (right) plot of the signal deterioration due to increasing range ρ, with ρmin = 0 m and ρmax = 100 m.
Relationship between scan angle and normal vector orientation used for the segmentation of the point cloud with respect to planar features. A point P = [θ, φ, ρ] is measured on the plane with the normal parameters N = [α, β, γ]. The different angles used for the range image gradients are plotted.
Theoretical number of points. Practical
example of a plate of 1×1 m placed at 3 m, oriented
at 0º and being rotated at 60º.
Theoretical number of points. (left) Number of
points with respect to the orientation of the patch
and the distance.
Reference plate
measurement set-up. A white
coated plywood board is
mounted on a tripod via a
screw clamp mechanism
provided with a 2º precision
goniometer.
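As a first-order summary of the incidence-angle and range effects illustrated earlier in this slide, the returned signal from a Lambertian patch falls off roughly with cos(α) and 1/ρ²; the sketch below evaluates that simplified relation (with an arbitrary constant), leaving out the full link-budget terms of the thesis.

import numpy as np

def relative_return(incidence_angle_rad, range_m, k=1.0):
    """Relative received signal for a Lambertian patch (arbitrary units, first-order model)."""
    return k * np.cos(incidence_angle_rad) / range_m**2

angles = np.deg2rad([0, 30, 60, 80])
print(relative_return(angles, 10.0))                               # drop with incidence angle at 10 m
print(relative_return(0.0, np.array([1.0, 10.0, 50.0, 100.0])))    # drop with range at normal incidence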
Pipeline Laser Range Finding #1c
Laser Range Finding : Image formation #3
The geometry of terrestrial laser scanning; identification of
errors, modeling and mitigation of scanning geometry
Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016)
http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5
Terrestrial Laser Scanning (TLS) good practice of survey planning
Future directions: At the time this research started, terrestrial laser scanners were mainly
being used by research institutes and
manufacturers. However, nowadays, terrestrial
laser scanners are present in almost every field
of work, e.g. forensics, architecture, civil
engineering, gaming industry, movie industry.
Mobile mapping systems, such as scanners
capturing a scene while driving a car, or
scanners mounted on drones are currently
making use of the same range determination
techniques used in terrestrial laser scanners.
The number of applications that make use of 3D point clouds is rapidly growing. The need for a sound-quality product is all the more significant as it impacts the quality of a wide range of end-products.
Pipeline Laser Range Finding #1D
Laser Range Finding : Image formation #4
Ray-Tracing Method for Deriving
Terrestrial Laser Scanner Systematic
Errors
Derek D. Lichti, Ph.D., P.Eng.
Journal of Surveying Engineering | Volume 143 Issue 2 - May 2017
https://www.doi.org/10.1061/(ASCE)SU.1943-5428.0000213
Error model of direct georeferencing
procedure of terrestrial laser scanning
Pandžić, Jelena; Pejić, Marko; Božić, Branko; Erić, Verica
Automation in Construction Volume 78, June 2017, Pages 13-23
https://doi.org/10.1016/j.autcon.2017.01.003
Pipeline Laser Range Finding #2A
Calibration #1
Statistical Calibration Algorithms for Lidars
Anas Alhashimi, Luleå University of Technology, Control Engineering
Licentiate thesis (2016), ORCID iD: 0000-0001-6868-2210
A rigorous cylinder-based self-calibration approach for
terrestrial laser scanners
Ting On Chan; Derek D. Licht; David Belton
ISPRS Journal of Photogrammetry and Remote Sensing; Volume 99, January 2015
https://doi.org/10.1016/j.isprsjprs.2014.11.003
The proposed method and its variants were first applied to two simulated datasets, to compare their effectiveness, and then to three real datasets captured by three different types of scanners: a Faro Focus 3D (a phase-based panoramic scanner); a Velodyne HDL-32E (a pulse-based multi-spinning-beam scanner); and a Leica ScanStation C10 (a dual operating-mode scanner).
In situ self-calibration is essential for
terrestrial laser scanners (TLSs) to
maintain high accuracy for many
applications such as structural
deformation monitoring (Lindenbergh, 2010). This is particularly true for aged TLSs and
instruments being operated for long hours
outdoors with varying environmental
conditions.
Although the plane-based methods are now widely adopted for TLS
calibration, they also suffer from the problem of high parameter correlation
when there is a low diversity in the plane orientations (Chow et al., 2013). In
practice, not all locations possess large and smooth planar features that can be
used to perform a calibration. Even though planar features are available, their
planarity is not always guaranteed. Because of the drawbacks to the point-
based and plane-based calibrations, an alternative geometric feature, namely circular cylindrical features (e.g. Rabbani et al., 2007), should be considered and incorporated into the self-calibration procedure.
Estimating d without being aware of the mode hopping, i.e., assuming a certain λ0 without actually knowing that the average λ jumps between different lasing modes, thus results in a multimodal measurement of d.
Potential temperature-bias dependencies for the
polynomial model.
The plot explains the cavity modes, gain profile and lasing modes for a typical laser diode. The upper drawing shows the wavelength v1 as the dominant lasing mode, while the lower drawing shows how both wavelengths v1 and v2 are competing; this latter case is responsible for the mode-hopping effects.
Pipeline Laser Range Finding #2b
Calibration #2
Calibration of a multi-beam Laser System
by using a TLS-generated Reference
Gordon, M.; Meidow, J.
ISPRS Annals of Photogrammetry, Remote Sensing and Spatial
Information Sciences, Volume II-5/W2, 2013, pp.85-90
http://dx.doi.org/10.5194/isprsannals-II-5-W2-85-2013
Extrinsic calibration of a multi-beam LiDAR
system with improved intrinsic laser parameters
using v-shaped planes and infrared images
Po-Sen Huang ; Wen-Bin Hong ; Hsiang-Jen Chien ; Chia-Yen Chen
IVMSP Workshop, 2013 IEEE 11th
https://doi.org/10.1109/IVMSPW.2013.6611921
The Velodyne HDL-64E S2, the LiDAR system studied in this work, for example, is a mobile scanner consisting of 64 laser emitter-receiver pairs rigidly attached to a rotating motor, and it provides real-time panoramic range data with measurement errors of around 2.5 mm.
In this paper we propose a
method to use IR images as
feedbacks in finding
optimized intrinsic and
extrinsic parameters of the
LiDAR-vision scanner.
First, we apply the IR-based calibration technique to a LiDAR system that
fires multiple beams, which significantly increases the problem's
complexity and difficulty. Second, the adjustment of parameters is applied
to not only the extrinsic parameters, but also the laser parameters as well
as the intrinsic parameters of the camera. Third, we use two different
objective functions to avoid generalization failure of the optimized
parameters.
It is assumed that the accuracy of this point cloud is considerably higher than that from the multi-beam LIDAR and that the data represent faces of man-made objects at different distances. We inspect the Velodyne HDL-64E S2 system as the best-known representative of this kind of sensor system, while a Z+F Imager 5010 serves as the reference. Besides the improvement of the point accuracy by considering the calibration results, we test the significance of the parameters related to the sensor model and consider the uncertainty of measurements w.r.t. the measured distances.
The standard deviation of the planar misclosure is nearly halved, from 3.2 cm to 1.7 cm. The variance component estimation as well as the standard deviation of the range residuals reveal that the manufacturer's stated distance accuracy of 2 cm is a bit too optimistic.
The histograms of the planar misclosures and the residuals reveal that these quantities are not normally distributed. Our investigation of the distance-dependent change of the misclosure variance is one reason. Other sources were investigated by Glennie and Lichti (2010): the incidence angle and the vertical angle. A further possibility is the focal distance, which is different for each laser; the average is at 8 m for the lower block and at 15 m for the upper block. This may introduce a distance-dependent, but nonlinear, variance change. Further research is needed to find the sources of these observations.
PetteriTeikariPhD
 
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
PetteriTeikariPhD
 
Wearable Continuous Acoustic Lung Sensing
Wearable Continuous Acoustic Lung SensingWearable Continuous Acoustic Lung Sensing
Wearable Continuous Acoustic Lung Sensing
PetteriTeikariPhD
 
Precision Medicine for personalized treatment of asthma
Precision Medicine for personalized treatment of asthmaPrecision Medicine for personalized treatment of asthma
Precision Medicine for personalized treatment of asthma
PetteriTeikariPhD
 
Two-Photon Microscopy Vasculature Segmentation
Two-Photon Microscopy Vasculature SegmentationTwo-Photon Microscopy Vasculature Segmentation
Two-Photon Microscopy Vasculature Segmentation
PetteriTeikariPhD
 
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phaseSkin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
PetteriTeikariPhD
 
Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...
PetteriTeikariPhD
 
Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...
PetteriTeikariPhD
 
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging featuresIntracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
PetteriTeikariPhD
 
Hand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical ApplicationsHand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical Applications
PetteriTeikariPhD
 
Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1
PetteriTeikariPhD
 
Multimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysisMultimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysis
PetteriTeikariPhD
 
Creativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technologyCreativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technology
PetteriTeikariPhD
 
Light Treatment Glasses
Light Treatment GlassesLight Treatment Glasses
Light Treatment Glasses
PetteriTeikariPhD
 
Deep Learning for Biomedical Unstructured Time Series
Deep Learning for Biomedical  Unstructured Time SeriesDeep Learning for Biomedical  Unstructured Time Series
Deep Learning for Biomedical Unstructured Time Series
PetteriTeikariPhD
 
Hyperspectral Retinal Imaging
Hyperspectral Retinal ImagingHyperspectral Retinal Imaging
Hyperspectral Retinal Imaging
PetteriTeikariPhD
 
Instrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopyInstrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopy
PetteriTeikariPhD
 
Future of Retinal Diagnostics
Future of Retinal DiagnosticsFuture of Retinal Diagnostics
Future of Retinal Diagnostics
PetteriTeikariPhD
 
Optical Designs for Fundus Cameras
Optical Designs for Fundus CamerasOptical Designs for Fundus Cameras
Optical Designs for Fundus Cameras
PetteriTeikariPhD
 

More from PetteriTeikariPhD (20)

ML and Signal Processing for Lung Sounds
ML and Signal Processing for Lung SoundsML and Signal Processing for Lung Sounds
ML and Signal Processing for Lung Sounds
 
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and OculomicsNext Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
Next Gen Ophthalmic Imaging for Neurodegenerative Diseases and Oculomics
 
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
Next Gen Computational Ophthalmic Imaging for Neurodegenerative Diseases and ...
 
Wearable Continuous Acoustic Lung Sensing
Wearable Continuous Acoustic Lung SensingWearable Continuous Acoustic Lung Sensing
Wearable Continuous Acoustic Lung Sensing
 
Precision Medicine for personalized treatment of asthma
Precision Medicine for personalized treatment of asthmaPrecision Medicine for personalized treatment of asthma
Precision Medicine for personalized treatment of asthma
 
Two-Photon Microscopy Vasculature Segmentation
Two-Photon Microscopy Vasculature SegmentationTwo-Photon Microscopy Vasculature Segmentation
Two-Photon Microscopy Vasculature Segmentation
 
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phaseSkin temperature as a proxy for core body temperature (CBT) and circadian phase
Skin temperature as a proxy for core body temperature (CBT) and circadian phase
 
Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...Summary of "Precision strength training: The future of strength training with...
Summary of "Precision strength training: The future of strength training with...
 
Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...Precision strength training: The future of strength training with data-driven...
Precision strength training: The future of strength training with data-driven...
 
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging featuresIntracerebral Hemorrhage (ICH): Understanding the CT imaging features
Intracerebral Hemorrhage (ICH): Understanding the CT imaging features
 
Hand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical ApplicationsHand Pose Tracking for Clinical Applications
Hand Pose Tracking for Clinical Applications
 
Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1Precision Physiotherapy & Sports Training: Part 1
Precision Physiotherapy & Sports Training: Part 1
 
Multimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysisMultimodal RGB-D+RF-based sensing for human movement analysis
Multimodal RGB-D+RF-based sensing for human movement analysis
 
Creativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technologyCreativity as Science: What designers can learn from science and technology
Creativity as Science: What designers can learn from science and technology
 
Light Treatment Glasses
Light Treatment GlassesLight Treatment Glasses
Light Treatment Glasses
 
Deep Learning for Biomedical Unstructured Time Series
Deep Learning for Biomedical  Unstructured Time SeriesDeep Learning for Biomedical  Unstructured Time Series
Deep Learning for Biomedical Unstructured Time Series
 
Hyperspectral Retinal Imaging
Hyperspectral Retinal ImagingHyperspectral Retinal Imaging
Hyperspectral Retinal Imaging
 
Instrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopyInstrumentation for in vivo intravital microscopy
Instrumentation for in vivo intravital microscopy
 
Future of Retinal Diagnostics
Future of Retinal DiagnosticsFuture of Retinal Diagnostics
Future of Retinal Diagnostics
 
Optical Designs for Fundus Cameras
Optical Designs for Fundus CamerasOptical Designs for Fundus Cameras
Optical Designs for Fundus Cameras
 

Recently uploaded

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 

Recently uploaded (20)

Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 

Dataset creation for Deep Learning-based Geometric Computer Vision problems

  • 5. Pipeline Dataset creation #2a : Multiframe Techniques Note! In deep learning, the term super-resolution refers to “statistical upsampling” whereas in optical imaging super-resolution typically refers to imaging techniques. Note2! Nothing should stop someone marrying them two though In practice anyone can play with super-resolution at home by putting a camera on a tripod and then taking multiple shots of the same static scene, and post-processing them through super-resolution that can improve modulation transfer function (MTF) for RGB images, improve depth resolution and reduce noise for laser scans and depth sensing e.g. with Kinect. https://doi.org/10.2312/SPBG/SPBG06/009-015 Cited by 47 articles (a) One scan. (b) Final super-resolved surface from 100 scans. “PhotoAcute software processes sets of photographs taken in continuous mode. It utilizes superresolution algorithms to convert a sequence of images into a single high-resolution and low-noise picture, that could only be taken with much better camera.” Depth looks a lot nicer when reconstructed using 50 consecutive Kinec v1 frames in comparison to just one frame. [Data from Petteri Teikari[ Kinect multiframe reconstruction with SiftFu [Xiao et al. (2013)] https://github.com/jianxiongxiao/ProfXkit
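As a toy illustration of the multiframe idea for depth (not the SiftFu pipeline itself), here is a minimal sketch that fuses a stack of already-registered depth frames of a static scene by masking out invalid readings and taking a per-pixel median; the array layout and variable names are assumptions for illustration only.

# Naive multi-frame depth fusion for a static scene: per-pixel median over
# a stack of already-registered depth frames, ignoring invalid (zero) readings.
import numpy as np

def fuse_depth_stack(depth_stack):
    """depth_stack: (N, H, W) array of depth frames in millimetres; 0 = no reading."""
    stack = depth_stack.astype(np.float64)
    stack[stack == 0] = np.nan                  # treat missing depth as NaN
    fused = np.nanmedian(stack, axis=0)         # robust per-pixel estimate
    coverage = np.isfinite(stack).sum(axis=0)   # how many frames saw each pixel
    fused[coverage == 0] = 0.0                  # still no data at these pixels
    return fused, coverage

# fused, coverage = fuse_depth_stack(np.stack(depth_frames))  # e.g. 50 Kinect frames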
  • 6. Pipeline Dataset creation #2b : Multiframe Techniques It is tedious to take e.g. 100 shots of the same scene manually, possibly involving a full 360° rotation of the imaging devices; in practice this would need to be automated in some way, for example with a stepper motor driven by an Arduino, if no suitable commercial system is available (see the capture-automation sketch below). Multiframe techniques would allow another level of “nesting” of ground truths for a joint image enhancement block along with the proposed structure and motion network. ● The reconstructed laser scan / depth image / RGB from 100 images would be the target, and the single-frame version the input that needs to be enhanced. Meinhardt et al. (2017) Diamond et al. (2017)
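A minimal sketch of how such capture automation could look from the host side, assuming a hypothetical Arduino sketch that advances a turntable by a fixed angle when it receives a "STEP" command over serial, and gphoto2 being used to trigger a tethered camera; the serial port, protocol and file names are illustrative assumptions, not part of the original slides.

# Hypothetical capture loop: trigger camera, rotate turntable, repeat.
# Assumes an Arduino listening on /dev/ttyACM0 that advances a stepper
# by one increment when it receives the line "STEP\n" (illustrative protocol).
import subprocess
import time
import serial  # pyserial

N_SHOTS = 100

def capture_sequence(port="/dev/ttyACM0", baud=115200, n_shots=N_SHOTS):
    with serial.Serial(port, baud, timeout=2) as turntable:
        time.sleep(2)  # let the Arduino reset after the port is opened
        for i in range(n_shots):
            # Trigger a tethered DSLR shot via gphoto2 (one file per frame).
            subprocess.run(
                ["gphoto2", "--capture-image-and-download",
                 "--filename", f"frame_{i:03d}.jpg"],
                check=True)
            turntable.write(b"STEP\n")      # advance one angular increment
            turntable.flush()
            time.sleep(1.0)                 # wait for vibrations to settle

if __name__ == "__main__":
    capture_sequence()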
  • 7. Pipeline Dataset creation #3 A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake Submitted on 15 Jul 2017, last revised 25 Jul 2017 https://arxiv.org/abs/1707.04796 In this paper we develop a pipeline to rapidly generate high quality RGBD data with pixelwise labels and object poses. We use an RGBD camera to collect video of a scene from multiple viewpoints and leverage existing reconstruction techniques to produce a 3D dense reconstruction. We label the 3D reconstruction using a human-assisted ICP-fitting of object meshes. By reprojecting the results of labeling the 3D scene we can produce labels for each RGBD image of the scene. This pipeline enabled us to collect over 1,000,000 labeled object instances in just a few days. We use this dataset to answer questions related to how much training data is required, and of what quality the data must be, to achieve high performance from a DNN architecture. Overview of the data generation pipeline. (a) Xtion RGBD sensor mounted on Kuka IIWA arm for raw data collection. (b) RGBD data processed by ElasticFusion into reconstructed pointcloud. (c) User annotation tool that allows for easy alignment using 3 clicks. User clicks are shown as red and blue spheres. The transform mapping the red spheres to the green spheres is then the user specified guess. (d) Cropped pointcloud coming from user specified pose estimate is shown in green. The mesh model shown in grey is then finely aligned using ICP on the cropped pointcloud and starting from the user provided guess. (e) All the aligned meshes shown in reconstructed pointcloud. (f) The aligned meshes are rendered as masks in the RGB image, producing pixelwise labeled RGBD images for each view. Increasing the variety of backgrounds in the training data for single-object scenes also improved generalization performance for new backgrounds, with approximately 50 different backgrounds breaking into above-50% IoU on entirely novel scenes. Our recommendation is to focus on multi-object data collection in a variety of backgrounds for the most gains in generalization performance. We hope that our pipeline lowers the barrier to entry for using deep learning approaches for perception in support of robotic manipulation tasks by reducing the amount of human time needed to generate vast quantities of labeled data for your specific environment and set of objects. It is also our hope that our analysis of segmentation network performance provides guidance on the type and quantity of data that needs to be collected to achieve desired levels of generalization performance.
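A minimal sketch of the "rough user guess, then ICP refinement" step described above, here using Open3D as a stand-in for the paper's own tooling; file paths, the sampling density and the initial pose are placeholders, and the 3-click annotation that produces the initial pose is not reproduced.

# Refine a rough object pose against a dense scene reconstruction with ICP.
import open3d as o3d

def refine_object_pose(scene_pcd_path, object_mesh_path, init_pose):
    scene = o3d.io.read_point_cloud(scene_pcd_path)       # dense reconstruction
    mesh = o3d.io.read_triangle_mesh(object_mesh_path)    # known object model
    model = mesh.sample_points_uniformly(number_of_points=20000)

    result = o3d.pipelines.registration.registration_icp(
        model, scene,
        max_correspondence_distance=0.02,                 # 2 cm search radius
        init=init_pose,                                   # rough 4x4 pose from user annotation
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
        criteria=o3d.pipelines.registration.ICPConvergenceCriteria(max_iteration=100))
    return result.transformation                          # refined 4x4 object pose

# init_pose would come from the 3-click user annotation described above,
# e.g. an identity 4x4 matrix as a trivial placeholder.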
  • 8. Pipeline Dataset creation #4 A Novel Benchmark RGBD Dataset for Dormant Apple Trees and Its Application to Automatic Pruning Shayan A. Akbar, Somrita Chattopadhyay, Noha M. Elfiky, Avinash Kak; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016 https://doi.org/10.1109/CVPRW.2016.50 Extending of the Kinect device functionality and the corresponding database Libor Bolecek ; Pavel Němec ; Jan Kufa ; Vaclav Ricny Radioelektronika (RADIOELEKTRONIKA), 2017 https://doi.org/10.1109/RADIOELEK.2017.7937594 One possible research direction is the use of an infrared view of the investigated scene to improve the depth map. However, no Kinect database containing the corresponding infrared images existed, so the authors' aim was to create such a database, and its usability is increased by adding stereo images. Moreover, the same scenes were captured with Kinect v2, and the impact of using Kinect v1 and Kinect v2 simultaneously to improve the depth map of the investigated scene was also studied. The database contains sequences of objects on a turntable and simple scenes containing several objects. Figures: the depth map of the scene obtained by a) Kinect v1, b) Kinect v2; a comparison of one row of the depth map obtained by a) Kinect v1 and b) Kinect v2 with the true depth map; Kinect infrared image after adjusting the brightness dynamics.
  • 9. Pipeline Multiframe Pipe #1 [Diagram: frames 1, 2, 3, ..., 100 of the depth image (e.g. Kinect), laser scan (e.g. Velodyne) and RGB image feed a multiframe reconstruction enhancement block whose output is the target.] Learn to improve image quality from a single image when the system is deployed. Reconstruction could be done using traditional algorithms (e.g. OpenCV) to start with; all individual frames then need to be saved so that, when reconstruction algorithms improve, all blocks can be iterated ad infinitum. Mix different image and sensor qualities in the training set to build invariance to scan quality (see the training-pair sketch below).
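A minimal sketch of how such single-frame / multiframe-reconstruction training pairs could be wrapped for a deep learning framework, assuming the raw frames and their fused reconstructions have already been saved to disk as NumPy arrays under an illustrative naming convention; the directory layout and class name are assumptions.

# Hypothetical training-pair loader: input = one raw frame, target = the
# reconstruction fused from the whole 100-frame burst of the same scene.
import glob
import numpy as np
import torch
from torch.utils.data import Dataset

class MultiframePairs(Dataset):
    def __init__(self, root):
        # e.g. root/scene_0007/frame_042.npy and root/scene_0007/fused.npy
        self.frames = sorted(glob.glob(f"{root}/scene_*/frame_*.npy"))

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        frame_path = self.frames[idx]
        target_path = frame_path.rsplit("/", 1)[0] + "/fused.npy"
        x = np.load(frame_path).astype(np.float32)   # noisy single frame
        y = np.load(target_path).astype(np.float32)  # multiframe "ground truth"
        return torch.from_numpy(x), torch.from_numpy(y)

# loader = torch.utils.data.DataLoader(MultiframePairs("data"), batch_size=8, shuffle=True)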
  • 10. Pipeline Multiframe Pipe #2 You could cascade different levels of quality if you want to make things complex, in a deeply supervised fashion. [Diagram: six quality levels (1–6), from the lowest quality (RGB only) to the highest quality (depth map from a professional laser scanner).] Each step in the cascade is closer in quality to the previous one, so one could assume that this enhancement would be easier to learn, and the pipeline would output the intermediate enhanced qualities as a “side effect”, which is useful for visualization purposes (see the cascade-loss sketch below).
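A minimal sketch of what the deeply supervised cascade objective could look like, assuming a model that returns one intermediate prediction per quality level and that a target is available for each level; the loss choice, weighting and usage line are illustrative assumptions, not the slides' prescription.

# Hypothetical deeply supervised loss over a cascade of quality levels:
# each intermediate output is penalized against its corresponding target.
import torch
import torch.nn as nn

def cascade_loss(predictions, targets, weights=None):
    """predictions, targets: lists of tensors, ordered low -> high quality."""
    if weights is None:
        weights = [1.0] * len(predictions)
    l1 = nn.L1Loss()
    total = torch.zeros((), device=predictions[0].device)
    for pred, tgt, w in zip(predictions, targets, weights):
        total = total + w * l1(pred, tgt)   # supervision at every cascade stage
    return total

# Usage sketch: preds = model(rgb_only_input)  # list of increasingly refined outputs
#               loss = cascade_loss(preds, [q2_target, q3_target, q4_target, q5_target, q6_target])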
  • 11. Pipeline acquisition example with Kinect https://arxiv.org/abs/1704.07632 KinectFusion (Newcombe et al. 2011), one of the pioneering works, showed that a real-world object as well as an indoor scene can be reconstructed in real-time with GPU acceleration. It exploits the iterative closest point (ICP) algorithm (Besl and McKay 1992) to track 6-DoF poses and the volumetric surface representation scheme with signed distance functions (Curless and Levoy, 1996) to fuse 3D measurements. A number of following studies (e.g. Choi et al. 2015) have tackled the limitations of KinectFusion; as the scale of a scene increases, it is hard to completely reconstruct the scene due to the drift problem of the ICP algorithm as well as the large memory consumption of volumetric integration. To scale up the KinectFusion algorithm, Whelan et al. (2012) presented a spatially extended KinectFusion, named Kintinuous, which incrementally adds KinectFusion results in the form of triangular meshes. Whelan et al. (2015) also proposed ElasticFusion to tackle similar problems as well as to avoid pose graph optimization, by using surface loop closure optimization and a surfel-based representation. Moreover, to decrease the space complexity, ElasticFusion deallocates invisible surfels from memory; invisible surfels are allocated in memory again only if they are likely to be visible in the near future.
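For reference, a minimal sketch of the TSDF-style volumetric fusion described above, using Open3D's ScalableTSDFVolume as a stand-in for the KinectFusion-style integration (the cited systems use their own GPU implementations); camera poses are assumed to come from an external tracker, and the file lists, voxel size and truncation distance are placeholders.

# Fuse a short RGB-D sequence into a TSDF volume and extract a mesh.
import numpy as np
import open3d as o3d

def fuse_sequence(color_files, depth_files, poses):
    intrinsic = o3d.camera.PinholeCameraIntrinsic(
        o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)  # Kinect-like intrinsics
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=0.005,          # 5 mm voxels
        sdf_trunc=0.02,              # 2 cm truncation band
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    for c_path, d_path, pose in zip(color_files, depth_files, poses):
        color = o3d.io.read_image(c_path)
        depth = o3d.io.read_image(d_path)
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            color, depth, depth_trunc=4.0, convert_rgb_to_intensity=False)
        # 'pose' is the 4x4 camera-to-world transform from an external tracker (e.g. ICP).
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

    return volume.extract_triangle_mesh()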
  • 12. Pipeline Multiframe Pipe into Sfm-Net
  • 13. Pipeline Multiframe Pipe Quality simulation Simulated Imagery Rendering Workflow for UAS- Based Photogrammetric 3D Reconstruction Accuracy Assessments Richard K. Slocum and Christopher E. Parrish Remote Sensing 2017, 9(4), 396; doi :10.3390/rs9040396 “Here, we present a workflow to render computer generated imagery using a virtual environment which can mimic the independent variables that would be experienced in a real-world UAS imagery acquisition scenario. The resultant modular workflow utilizes Blender Python API, an open source computer graphics software, for the generation of photogrammetrically-accurate imagery suitable for SfM processing, with explicit control of camera interior orientation, exterior orientation, texture of objects in the scene, placement of objects in the scene, and ground control point (GCP) accuracy.” Pictorial representation of the simUAS (simulated UAS) imagery rendering workflow. Note: The SfM-MVS step is shown as a “black box” to highlight the fact that the procedure can be implemented using any SfM-MVS software, including proprietary commercial software. The imagery from Blender, rendered using a pinhole camera model, is postprocessed to introduce lens and camera effects. The magnitudes of the postprocessing effects are set high in this example to clearly demonstrate the effect of each. The fullsize image (left) and a close up image (right) are both shown in order to depict both the large and small scale effects. A 50 cm wide section of the point cloud containing a box (3 m cube) is shown with the dense reconstruction point clouds overlaid to demonstrate the effect of point cloud dense reconstruction quality on accuracy near sharp edges. The points along the side of a vertical plane on a box were isolated and the error perpendicular to the plane of the box were visualized for each dense reconstruction setting, with white regions indicating no point cloud data. Notice that the region with data gaps in the point cloud from the ultra-high setting corresponds to the region of the plane with low image texture, as shown in the lower right plot.
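A minimal sketch of how one Blender-Python rendering step in such a simulated-imagery workflow might look, assuming a scene that already contains a camera object named "Camera"; the pose, focal length, resolution and output path are illustrative, and the simUAS post-processing (lens distortion, noise, etc.) is not reproduced here.

# Render one synthetic view with explicit camera interior/exterior orientation.
# Intended to be run inside Blender's bundled Python, e.g.:
#   blender scene.blend --background --python render_view.py
import math
import bpy

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]

# Exterior orientation (pose) of the virtual camera.
cam.location = (10.0, -5.0, 30.0)                       # metres in the scene frame
cam.rotation_euler = (math.radians(60), 0.0, math.radians(45))

# Interior orientation: focal length and sensor size define the pinhole model.
cam.data.lens = 20.0                                    # mm
cam.data.sensor_width = 23.5                            # mm

scene.camera = cam
scene.render.resolution_x = 4000
scene.render.resolution_y = 3000
scene.render.filepath = "//renders/view_0001.png"       # relative to the .blend file
bpy.ops.render.render(write_still=True)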
  • 14. Data fusion combining multimodal data
  • 15. Pipeline data Fusion / Registration #1 “Rough estimates for 3D structure obtained using structure from motion (SfM) on the uncalibrated images are first co-registered with the lidar scan and then a precise alignment between the datasets is estimated by identifying correspondences between the captured images and reprojected images for individual cameras from the 3D lidar point clouds. The precise alignment is used to update both the camera geometry parameters for the images and the individual camera radial distortion estimates, thereby providing a 3D-to-2D transformation that accurately maps the 3D lidar scan onto the 2D image planes. The 3D to 2D map is then utilized to estimate a dense depth map for each image. Experimental results on two datasets that include independently acquired high-resolution color images and 3D point cloud datasets indicate the utility of the framework. The proposed approach offers significant improvements on results obtained with SfM alone.” Fusing structure from motion and lidar for dense accurate depth map estimation Li Ding ; Gaurav Sharma Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on https://doi.org/10.1109/ICASSP.2017.7952363 https://arxiv.org/abs/1707.03167 “In this paper, we present RegNet, the first deep convolutional neural network (CNN) to infer a 6 degrees of freedom (DOF) extrinsic calibration between multimodal sensors, exemplified using a scanning LiDAR and a monocular camera. Compared to existing approaches, RegNet casts all three conventional calibration steps (feature extraction, feature matching and global regression) into a single real-time capable CNN.” Development of the mean absolute error (MAE) of the rotational components over training iteration for different output representations: Euler angles are represented in red, quaternions in brown and dual quaternions in blue. Both quaternion representations outperform the Euler angles representation. “Our method yields a mean calibration error of 6 cm for translation and 0.28◦ for rotation with decalibration magnitudes of up to 1.5 m and 20◦, which competes with state-of-the-art online and offline methods.”
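A minimal sketch of the 3D-to-2D mapping step that both papers above rely on: projecting lidar points into an image plane with a pinhole model given intrinsics K and extrinsics (R, t). Lens distortion and the papers' alignment/learning machinery are omitted; the function and argument names are illustrative assumptions.

# Project 3D lidar points into a 2D image using a pinhole camera model.
import numpy as np

def project_lidar_to_image(points_xyz, K, R, t, image_shape):
    """points_xyz: (N, 3) lidar points; K: (3, 3) intrinsics;
    R, t: extrinsics mapping the lidar frame into the camera frame."""
    p_cam = points_xyz @ R.T + t          # (N, 3) points in the camera frame
    in_front = p_cam[:, 2] > 0.1          # keep points in front of the camera
    p_cam = p_cam[in_front]

    uv = p_cam @ K.T                      # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3]           # perspective division

    h, w = image_shape[:2]
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[valid], p_cam[valid, 2]     # pixel positions and their depths

# The returned (pixel, depth) pairs give a sparse depth map that a dense
# depth-estimation step can then interpolate, as in the fusion paper above.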
  • 16. Pipeline data Fusion / Registration #2 Depth refinement for binocular Kinect RGB-D cameras Jinghui Bai ; Jingyu Yang ; Xinchen Ye ; Chunping Hou Visual Communications and Image Processing (VCIP), 2016 https://doi.org/10.1109/VCIP.2016.7805545
  • 17. Pipeline data Fusion / Registration #3 Used Kinects are inexpensive (~£29.95 on eBay); multiple Kinects can be used at once for better occlusion handling. Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar IEEE Sensors Journal (Volume: 14, Issue: 6, June 2014) https://doi.org/10.1109/JSEN.2014.2309987 Characterization of Different Microsoft Kinect Sensor Models IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015) https://doi.org/10.1109/JSEN.2015.2422611 An ANOVA analysis was performed to determine if the model of the Kinect, the operating temperature, or their interaction were significant factors in the Kinect's ability to determine the distance to the target. Different sized gauge blocks were also used to test how well a Kinect could reconstruct precise objects. Machinist blocks were used to examine how well the Kinect could reconstruct objects set up at an angle and determine the location of the center of a hole. All the Kinect models were able to determine the location of a target with a low standard deviation (<2 mm). At close distances, the resolution of all the Kinect models was 1 mm. Through the ANOVA analysis, the best performing Kinect at close distances was the Kinect model 1414, and at farther distances the Kinect model 1473. The internal temperature of the Kinect sensor had an effect on the distance reported by the sensor. Using different correction factors, the Kinect was able to determine the volume of a gauge block and the angles machinist blocks were set up at, with under 10% error.
  • 18. Pipeline data Fusion / Registration #4 A Generic Approach for Error Estimation of Depth Data from (Stereo and RGB-D) 3D Sensors Luis Fernandez, Viviana Avila and Luiz Gonçalves Preprints | Posted: 23 May 2017 | http://dx.doi.org/10.20944/preprints201705.0170.v1 “We propose an approach for estimating the error in depth data provided by generic 3D sensors, which are modern devices capable of generating an image (RGB data) and a depth map (distance) or other similar 2.5D structure (e.g. stereo disparity) of the scene. We come up with a multi-platform system and its verification and evaluation has been done, using the development kit of the board NVIDIA Jetson TK1 with the MS Kinects v1/v2 and the Stereolabs ZED camera. So the main contribution is the error determination procedure that does not need any data set or benchmark, thus relying only on data acquired on-the-fly. With a simple checkerboard, our approach is able to determine the error for any device.” In the article of Yang [16], an MS Kinect v2 structure is proposed to improve the accuracy of the sensors and the depth capture of objects placed more than four meters away. It was concluded that an object covered with light-absorbing materials may reflect less IR light back to the MS Kinect and therefore yield erroneous depth data. Other factors, such as power consumption, complex wiring and high requirements for the laptop computer, also limit the use of the sensor. The characteristics of MS Kinect stochastic errors are presented for each axis direction in the work by Choo [17]. The depth error is measured using a 3D chessboard, similar to the one used in our approach. The results show that, for all three axes, the error should be considered independently. The work of Song [18] proposes an approach to generate a per-pixel confidence measure for each depth map captured by MS Kinect in indoor scenes through supervised learning and the use of artificial intelligence. Figure: detection (a) and ordering (b) of corners in the three planes of the pattern. It would make sense to combine versions 1 and 2 in the same rig, as Kinect v1 is more accurate at close distances and Kinect v2 at far distances (see the corner-detection sketch below).
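A minimal OpenCV sketch of the checkerboard corner detection underlying such error-estimation and calibration procedures; the pattern size and image path are placeholders, and the cited papers use their own three-plane 3D targets rather than a single flat board.

# Detect and refine checkerboard corners in one image with OpenCV.
import cv2

def find_corners(image_path, pattern_size=(9, 6)):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if not found:
        return None
    # Sub-pixel refinement around the initial corner estimates.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
    return corners.reshape(-1, 2)   # (N, 2) pixel coordinates, ordered along the pattern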
  • 19. Pipeline data Fusion / Registration #5 Precise 3D/2D calibration between a RGB-D sensor and a C-arm fluoroscope International Journal of Computer Assisted Radiology and Surgery August 2016, Volume 11, Issue 8, pp 1385–1395 https://doi.org/10.1007/s11548-015-1347-2 “A RMS reprojection error of 0.5 mm is achieved using our calibration method which is promising for surgical applications. Our calibration method is more accurate when compared to Tsai’s method. Lastly, the simulation result shows that using a projection matrix has a lower error than using intrinsic and extrinsic parameters in the rotation estimation.” While the color camera has a relative high resolution (1920 px × 1080 px for Kinect 2.0), the depth camera is mid-resolution (512 px × 424 px for Kinect 2.0) and highly noisy. Furthermore, RGB-D sensors have a minimal distance to the scene from which they can estimate the depth. For instance, the minimum optimal distance of Kinect 2.0 is 50 cm. On the other hand, C-arm fluoroscopes have a short focus, which is typically 40 cm, and a much narrower field of view than the RGB-D sensor with also a mid-resolution image (ours is 640 px × 480 px). All these factors lead to a high disparity in the field of view between the C-arm and the RGB-D sensor if the two were to be integrated in a single system. This means that the calibration process is crucial. We need to achieve high accuracy for the localization of 3D points using RGB-D sensors, and we require a calibration phantom which can be clearly imaged by both devices. Workflow of the calibration process between the RGB-D sensor and a C-arm. The input data include a sequence of infrared, depth, and color images from the RGB-D sensor and X-ray images from the C-arm. The output of the calibration pipeline is the projection matrix, which is calculated by the 3D/2D correspondences detected from the input data
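A minimal sketch of how a 3x4 projection matrix can be estimated from 3D/2D correspondences with the direct linear transform (DLT), the same kind of output the calibration pipeline above produces; real calibration pipelines add coordinate normalization, outlier rejection and nonlinear refinement, which are omitted here, and the function name is an assumption.

# Estimate a 3x4 projection matrix P from >= 6 known 3D-2D correspondences (DLT).
import numpy as np

def dlt_projection_matrix(points_3d, points_2d):
    """points_3d: (N, 3) world points; points_2d: (N, 2) pixel points; N >= 6."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    A = np.asarray(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    return P / P[2, 3]   # fix the arbitrary scale for readability

# Reprojection check: x_hom = P @ [X, Y, Z, 1]; pixel = x_hom[:2] / x_hom[2]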
  • 20. Pipeline data Fusion / Registration #6 Fusing Depth and Silhouette for Scanning Transparent Object with RGB-D Sensor Yijun Ji, Qing Xia, and Zhijiang Zhang System overview; TSDF: truncated signed distance function; SFS: shape from silhouette. Results on noise region. (a) Color images captured by stationary camera with a rotating platform. (b) The noisy voxels detected by multiple depth images are in red. (c) and (d) show the experimental results done by a moving Kinect; the background is changing in these two cases.
  • 21. Pipeline data Fusion / Registration #7 Intensity Video Guided 4D Fusion for Improved Highly Dynamic 3D Reconstruction Jie Zhang, Christos Maniatis, Luis Horna, Robert B. Fisher (Submitted on 6 Aug 2017) https://arxiv.org/abs/1708.01946 Temporal tracking of intensity image points (of moving and deforming objects) allows registration of the corresponding 3D data points, whose 3D noise and fluctuations are then reduced by spatio-temporal multi-frame 4D fusion. The results demonstrate that the proposed algorithm is effective at reducing 3D noise and is robust against intensity noise. It outperforms existing algorithms with good scalability on both stationary and dynamic objects. The system framework (using 3 consecutive frames as an example) Static Plane (first row): (a) mean roughness; (b) std of roughness vs. number of frames fused. Falling ball (second row): (c) mean roughness; (d) std of roughness vs. number of frames fused Texture-related 3D noise on a static plane: (a) 3D frame; (b) 3D frame with textures. The 3D noise is closely related to the textures in the intensity image. Illustration of 3D noise reduction on the ball. Spatial-temporal divisive normalized bilateral filter (DNBF)
  • 22. Pipeline data Fusion / Registration #8 Utilization of a Terrestrial Laser Scanner for the Calibration of Mobile Mapping Systems Seunghwan Hong, Ilsuk Park, Jisang Lee, Kwangyong Lim, Yoonjo Choi and Hong-Gyoo Sohn Sensors 2017, 17(3), 474; doi:10.3390/s17030474 Configuration of mobile mapping system: network video cameras (F: front, L: left, R: right), mobile laser scanner, and Global Navigation Satellite System (GNSS)/Inertial Navigation System (INS). To integrate the datasets captured by each sensor mounted on the Mobile Mapping System (MMS) into the unified single coordinate system, the calibration, which is the process to estimate the orientation (boresight) and position (lever-arm) parameters, is required with the reference datasets [Schwarz and El-Sheimy 2004, Habib et al. 2010, Chan et al. 2010]. When the boresight and lever-arm parameters defining the geometric relationship between each sensing data and GNSS/INS data are determined, georeferenced data can be generated. However, even after precise calibration, the boresight and lever-arm parameters of an MMS can be shaken and the errors that deteriorate the accuracy of the georeferenced data might accumulate. Accordingly, for the stable operation of multiple sensors, precise calibration must be conducted periodically. (a) Sphere target used for registration of terrestrial laser scanning data; (b) sphere target detected in a point cloud (the green sphere is a fitted sphere model). Network video camera: AXIS F1005-E GNSS/INS unit: OxTS Survey+ Terrestrial laser scanner (TLS): Faro Focus 3D Mobile laser scanner: Velodyne HDL 32-E
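A minimal sketch of the sphere-target fitting used to register terrestrial laser scans, here as a linear least-squares (algebraic) sphere fit on points segmented around a target; real workflows add robust outlier rejection, and the function name and usage comment are illustrative.

# Algebraic least-squares sphere fit: find center (a, b, c) and radius r
# from points lying on the sphere surface by solving a linear system.
import numpy as np

def fit_sphere(points):
    """points: (N, 3) array of 3D points on (or near) a sphere surface."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.column_stack([2 * x, 2 * y, 2 * z, np.ones(len(points))])
    b = x**2 + y**2 + z**2
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center = sol[:3]
    radius = np.sqrt(sol[3] + center @ center)
    return center, radius

# Usage sketch: center, r = fit_sphere(segmented_target_points)
# Matching sphere centers seen by the TLS and by the mobile scanner gives
# 3D-3D correspondences for estimating the boresight/lever-arm parameters.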
  • 23. Pipeline data Fusion / Registration #9 Dense Semantic Labeling of Very-High-Resolution Aerial Imagery and LiDAR with Fully-Convolutional Neural Networks and Higher-Order CRFs Yansong Liu, Sankaranarayanan Piramanayagam, Sildomar T. Monteiro, Eli Saber http://openaccess.thecvf.com/content_cvpr_2017_workshops/w18/papers/Liu_Dense_Semantic_Labeling_CVPR_2017_paper.pdf Our proposed decision-level fusion scheme: training one fully-convolutional neural network on the color-infrared image (CIR) and one logistic regression using hand-crafted features. Two probabilistic results: PFCN and PLR are then combined in a higher-order CRF framework Main original contributions of our work are: 1) the use of energy based CRFs for efficient decision- level multisensor data fusion for the task of dense semantic labeling. 2) the use of higher-order CRFs for generating labeling outputs with accurate object boundaries. 3) the proposed fusion scheme has a simpler architecture than training two separate neural networks, yet it still yields the state-of-the- art dense semantic labeling results. Guiding multimodal registration with learned optimization updates Gutierrez-Becker B, Mateus D, Peter L, Navab N Medical Image Analysis Volume 41, October 2017, Pages 2-17 https://doi.org/10.1016/j.media.2017.05.002 Training stage (left): A set of aligned multimodal images is used to generate a training set of images with known transformations. From this training set we train an ensemble of trees mapping the joint appearance of the images to displacement vectors. Testing stage (right): We register a pair of multimodal images by predicting with our trained ensemble the required displacements δ for alignment at different locations z. The predicted displacements are then used to devise the updates of the transformation parameters to be applied to the moving image. The procedure is repeated until convergence is achieved. Corresponding CT (left) and MR-T1 (middle) images of the brain obtained from the RIRE dataset. The highlighted regions are corresponding areas between both images (right). Some multimodal similarity metrics rely on structural similarities between images obtained using different modalities, like the ones inside the blue boxes. However in many cases structures which are clearly visible in one imaging modality correspond to regions with homogeneous voxel values in the other modality (red and green boxes).
  • 24. Future Image restoration Natural Images (RGB)
  • 25. Pipeline RGB image Restoration #1 https://arxiv.org/abs/1704.02738 Our method includes a sub-pixel motion compensation (SPMC) layer that can better handle inter-frame motion for this task, and a detail fusion (DF) network that can effectively fuse image details from multiple images after SPMC alignment. “Hardware super-resolution” can of course also be done entirely via deep learning. https://petapixel.com/2015/02/21/a-practical-guide-to-creating-superresolution-photos-with-photoshop/
  • 26. PipelineRGB image Restoration #2A “Data-driven Super-resolution” what super-resolution typically means in the deep learning space Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution” External Prior Guided Internal Prior Learning for Real Noisy Image Denoising Jun Xu, Lei Zhang, David Zhang (Submitted on 12 May 2017) https://arxiv.org/abs/1705.04505 Denoised images of a region cropped from the real noisy image from DSLR “Nikon D800 ISO 3200 A3”, Nam et al. 2016 (+video) by different methods. The scene was shot 500 times with the same camera and camera setting. The mean image of the 500 shots is roughly taken as the “ground truth”, with which the PSNR index can be computed. The images are better viewed by zooming in on screen Benchmarking Denoising Algorithms with Real Photographs Tobias Plötz, Stefan Roth (Submitted on 5 Jul 2017) https://arxiv.org/abs/1707.01313 “We then capture a novel benchmark dataset, the Darmstadt Noise Dataset (DND), with consumer cameras of differing sensor sizes. One interesting finding is that various recent techniques that perform well on synthetic noise are clearly outperformed by BM3D on photographs with real noise. Our benchmark delineates realistic evaluation scenarios that deviate strongly from those commonly used in the scientific literature.” Image formation process underlying the observed low-ISO image xr and high-ISO image xn . They are generated from latent noise-free images yr and yn , respectively, which in turn are related by a linear scaling of image intensities (LS), a small camera translation (T), and a residual low- frequency pattern (LF). To obtain the denoising ground truth yp , we apply post-processing to xr aiming at undoing these undesirable transformations. Mean PSNR (in dB) of the denoising methods tested on our DND benchmark. We apply denoising either on linear raw intensities, after a variance stabilizing transformation (VST, Anscombe), or after conversion to the sRGB space. Likewise, we evaluate the result either in linear raw space or in sRGB space. The noisy images have a PSNR of 39.39 dB (linear raw) and 29.98 dB (sRGB). Difference between blue channels of low- and high-ISO images from Fig. 1 after various post- processing stages. Images are smoothed for display to highlight structured residuals, attenuating the noise.
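A minimal sketch of the evaluation logic described above: treating the mean of many shots of a static scene as a pseudo ground truth and scoring a denoised single shot by PSNR. Images are assumed to be aligned and in the [0, 255] range; the array names are placeholders.

# PSNR of a denoised frame against a mean-of-N-shots pseudo ground truth.
import numpy as np

def psnr(estimate, reference, max_val=255.0):
    mse = np.mean((estimate.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# stack: (N, H, W, 3) array of N aligned shots of the same static scene
# ground_truth = stack.mean(axis=0)          # e.g. the mean of 500 shots
# score = psnr(denoised_single_shot, ground_truth)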
  • 27. PipelineRGB image Restoration #2b “Data-driven Super-resolution” what super-resolution typically means in the deep learning space MemNet: A Persistent Memory Network for Image Restoration Ying Tai, Jian Yang, Xiaoming Liu, Chunyan Xu (Submitted on 7 Aug 2017) https://arxiv.org/abs/1708.02209 https://github.com/tyshiwo/MemNet. Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution” The same MemNet structure achieves the state-of-the-art performance in image denoising, super-resolution and JPEG deblocking. Due to the strong learning ability, our MemNet can be trained to handle different levels of corruption even using a single model. Training Setting: Following the method of Mao et al. (2016), for image denoising, the grayscale image is used; while for SISR and JPEG deblocking, the luminance component is fed into the model. Deep Generative Adversarial Compression Artifact Removal Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, Alberto Del Bimbo (Submitted on 8 Apr 2017) https://arxiv.org/abs/1704.02518 In this work we address the problem of artifact removal using convolutional neural networks. The proposed approach can be used as a post-processing technique applied to decompressed images, and thus can be applied to different compression algorithms (typically applied in YCrCb color space) such as JPEG, intra-frame coding of H.264/AVC and H.265/HEVC. Compared to super resolution techniques, working on compressed images instead of down-sampled ones, is more practical, since it does not require to change the compression pipeline, that is typically hardware based, to subsample the image before its coding; moreover, camera resolutions have increased during the latest years, a trend that we can expect to continue.
  • 28. PipelineRGB image Restoration #3 An attempt to improve smartphone camera quality with DSLR high quality image as the ‘gold standard’ with deep learning https://arxiv.org/abs/1704.02470 Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, Luc Van Gool Computer Vision Laboratory, ETH Zurich, Switzerland “Quality transfer”
  • 30. Pipeline image enhancement #1 Aesthetics enhancement: “AI-driven Interior Design”; “re-colorization” of scanned indoor scenes or intrinsic-decomposition-based editing. Limitations. We have to manually correct inaccurate segmentation, which is a limitation of our method; however, segmentation errors are seldom encountered during experiments. Since our method is object-based, our segmentation method does not consider the color patterns among similar components of an image object. Currently, our system is not capable of segmenting the mesh according to the colored components with similar geometry for this kind of object, which is another limitation of our method. An intrinsic image decomposition method could be helpful to our image database, for extracting lighting-free textures to be further used in rendering colorized scenes. However, such methods are not robust enough to be directly applied to the various images in a large image database. On the other hand, intrinsic image decomposition is not essential to achieve good results in our experiments, so we did not incorporate it in our work, but we will further study it to improve our database.
  • 31. Pipelineimage enhancement #2 “Auto-adjust” RGB texture maps for indoor scans with user interaction We use the CIELab color space for both the input and output images. We can use 3-channel Lab color as the color features. However, it generates color variations in smooth regions since each color is processed independently. To alleviate this issue, we add the local neighborhood information by concatenating the Lab color and the L2 normalized first-layer convolutional feature maps of ResNet-50. Although the proposed method provides the users with automatically adjusted photos, some users may want their photos to be retouched by their own preference. In the first row of Fig. 2 for example, a user may want only the color of the people to be changed. For such situations, we provide a way for the users to give their own adjustment maps to the system. Figure 4 shows some examples of the personalization. When the input image is forwarded, we substitute the extracted semantic adjustment map with the new adjustment map from the user. As shown in the figure, the proposed method effectively creates the personalized images adjusted by user’s own style. Deep Semantics-Aware Photo Adjustment Seonghyeon Nam, Seon Joo Kim (Submitted on 26 Jun 2017) https://arxiv.org/abs/1706.08260
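A minimal sketch of the per-pixel feature described above (CIELab color concatenated with L2-normalised first-layer conv feature maps of ResNet-50), assuming PyTorch/torchvision and scikit-image are available. The `weights` argument follows recent torchvision versions and the ImageNet mean/std normalisation is omitted for brevity, so this approximates rather than reproduces the authors' exact setup.

```python
import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from skimage import color

def per_pixel_features(rgb_uint8):
    """Concatenate CIELab color with L2-normalised ResNet-50 conv1 feature maps (sketch)."""
    lab = color.rgb2lab(rgb_uint8)                              # (H, W, 3)
    h, w = lab.shape[:2]

    resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
    x = torch.from_numpy(rgb_uint8 / 255.0).float().permute(2, 0, 1)[None]   # (1, 3, H, W)
    with torch.no_grad():
        fmap = resnet.conv1(x)                                  # first conv layer only
    fmap = F.interpolate(fmap, size=(h, w), mode="bilinear", align_corners=False)
    fmap = F.normalize(fmap, p=2, dim=1)[0].permute(1, 2, 0).numpy()         # (H, W, 64)

    return np.concatenate([lab, fmap], axis=-1)                 # (H, W, 3 + 64)
```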
  • 32. Pipelineimage enhancement #3 Aesthetic-Driven Image Enhancement by Adversarial Learning Yubin Deng, Chen Change Loy, Xiaoou Tang (Submitted on 17 Jul 2017) https://arxiv.org/abs/1707.05251 Examples of image enhancement given original input (a). The architecture of our proposed EnhanceGAN framework: the ResNet module is the feature extractor (for the image in CIELab color space); in this work, we use ResNet-101 with the last average pooling layer and the final fc layer removed. The switch icons in the discriminator network represent zero-masking during stage-wise training. “Auto-adjust” RGB texture maps for indoor scans with GANs
  • 33. Pipelineimage enhancement #4 “Auto-adjust” RGB texture maps for indoor scans with GANs for “auto-matting” Creatism: A deep-learning photographer capable of creating professional work Hui Fang, Meng Zhang (Submitted on 11 Jul 2017) https://arxiv.org/abs/1707.03491 https://google.github.io/creatism/ Datasets were created that contain ratings of photographs based on aesthetic quality [Murray et al., 2012] [Kong et al., 2016] [Lu et al., 2015]. Using our system, we mimic the workflow of a landscape photographer, from framing for the best composition to carrying out various post-processing operations. The environment for our virtual photographer is simulated by a collection of panorama images from Google Street View. We design a "Turing-test"-like experiment to objectively measure the quality of its creations, where professional photographers blindly rate a mixture of photographs from different sources. We work with professional photographers to empirically define 4 levels of aesthetic quality: ● 1: Point-and-shoot photos taken without consideration. ● 2: Good photos from the majority of the population without an art background. Nothing artistic stands out. ● 3: Semi-pro. Great photos showing clear artistic aspects. The photographer is on the right track to becoming a professional. ● 4: Pro-work. Clearly each professional has his/her unique taste that needs calibration; we use the AVA dataset [Murray et al., 2012] to bootstrap a consensus among them. Assume there exists a universal aesthetics metric, Φ. By definition, Φ needs to incorporate all aesthetic aspects, such as saturation, detail level, composition... To define Φ with examples, the number of images needs to grow exponentially to cover more aspects [Jaroensri et al., 2015]. To make things worse, unlike traditional problems such as object recognition, what we need are not only natural images but also pro-level photos, which are far fewer in quantity.
  • 34. Pipelineimage enhancement #5 “Auto-adjust” images based on different user groups (or personalizing for different markets for indoor scan products) Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models Ardavan Saeedi, Matthew D. Hoffman, Stephen J. DiVerdi, Asma Ghandeharioun, Matthew J. Johnson, Ryan P. Adams CSAIL, MI; Adobe Research; Media Lab, MIT; Harvard and Google Brain (Submitted on 17 Apr 2017) https://arxiv.org/abs/1704.04997 The main goals of our proposed models: (a) Multimodal photo edits: For a given photo, there may be multiple valid aesthetic choices that are quite different from one another. (b) User categorization: A synthetic example where different user clusters tend to prefer different slider values. Group 1 users prefer to increase the exposure and temperature for the baby images; group 2 users reduce clarity and saturation for similar images. Predictive log-likelihood for users in the test set of different datasets. For each user in the test set, we compute the predictive log-likelihood of 20 images, given 0 to 30 images and their corresponding sliders from the same user. 30 sample trajectories and the overall average ± s.e. is shown for casual, frequent and expert users. The figure shows that knowing more about the user (up to around 10 images) can increase the predictive log-likelihood. The log-likelihood is normalized by subtracting off the predictive log-likelihood computed given zero images. Note the different y-axis in the plots. The rightmost plot is provided for comparing the average predictive log-likelihood across datasets.
  • 35. Pipelineimage enhancement #6 Combining semantic segmentation for higher quality “Instagram filters” Exemplar-Based Image and Video Stylization Using Fully Convolutional Semantic Features Feida Zhu ; Zhicheng Yan ; Jiajun Bu ; Yizhou Yu IEEE Transactions on Image Processing ( Volume: 26, Issue: 7, July 2017 ) https://doi.org/10.1109/TIP.2017.2703099 Color and tone stylization in images and videos strives to enhance unique themes with artistic color and tone adjustments. It has a broad range of applications from professional image post-processing to photo sharing over social networks. Mainstream photo enhancement software, such as Adobe Lightroom and Instagram, provides users with predefined styles, which are often hand-crafted through a trial-and-error process. Such photo adjustment tools lack a semantic understanding of image contents, and the resulting global color transform limits the range of artistic styles they can represent. On the other hand, stylistic enhancement needs to apply distinct adjustments to various semantic regions. Such an ability enables a broader range of visual styles. Traditional professional video editing software (Adobe After Effects, Nuke, etc.) offers a suite of predefined operations with tunable parameters that apply common global adjustments (exposure/color correction, white balancing, sharpening, denoising, etc.). Local adjustments within specific spatiotemporal regions are usually accomplished with masking layers created through intensive user interaction. Both parameter tuning and masking layer creation are labor-intensive processes. An example of learning semantics-aware photo adjustment styles. Left: Input image. Middle: Manually enhanced by a photographer. Distinct adjustments are applied to different semantic regions. Right: Automatically enhanced by our deep learning model trained from image exemplars. (a) Input image. (b) Ground truth. (c) Our result. Given a set of exemplar image pairs, each representing a photo before and after pixel-level color (in CIELab space) and tone adjustments following a particular style, we wish to learn a computational model that can automatically adjust a novel input photo in the same style. We still cast this learning task as a regression problem as in Yan et al. (2016). For completeness, let us first review their problem definition and then present our new deep learning based architecture and solution.
  • 36. Pipelineimage enhancement #7A Combining semantic segmentation for higher quality “Instagram filters” Deep Bilateral Learning for Real-Time Image Enhancement Michaël Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, Frédo Durand MIT CSAIL, Google Research, MIT CSAIL / Inria, Université Côte d’Azur (Submitted on 10 Jul 2017) https://arxiv.org/abs/1707.02880 | https://github.com/mgharbi/hdrnet | https://groups.csail.mit.edu/graphics/hdrnet/ https://youtu.be/GAe0qKKQY_I Our novel neural network architecture can reproduce sophisticated image enhancements with inference running in real time at full HD resolution on mobile devices. It can not only be used to dramatically accelerate reference implementations, but can also learn subjective effects from human retouching (“copycat” filter). By performing most of its computation within a bilateral grid and by predicting local affine color transforms, our model is able to strike the right balance between expressivity and speed. To build this model we have introduced two new layers: a data-dependent lookup that enables slicing into the bilateral grid, and a multiplicative operation for affine transformation. By training in an end-to-end fashion and optimizing our loss function at full resolution (despite most of our network being at a heavily reduced resolution), our model is capable of learning full-resolution and non-scale- invariant effects.
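The two operations highlighted above for HDRNet — slicing into a low-resolution bilateral grid with a luminance-guided lookup, and applying the resulting per-pixel affine color transform — can be sketched in NumPy/SciPy as follows. The grid here is a stand-in for the coefficients such a network would predict, and all names and shapes are illustrative rather than the paper's actual implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def slice_and_apply(image, grid):
    """Apply per-pixel 3x4 affine color transforms sliced from a low-res bilateral grid.

    image : (H, W, 3) float array in [0, 1]
    grid  : (D, Gh, Gw, 12) affine coefficients over (luma, y, x) bins, e.g. CNN output
    """
    h, w, _ = image.shape
    d, gh, gw, _ = grid.shape
    luma = image.mean(axis=-1)                                  # guide map in [0, 1]

    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([luma * (d - 1),                          # position along the luma bins
                       yy / (h - 1) * (gh - 1),                 # position along the y bins
                       xx / (w - 1) * (gw - 1)])                # position along the x bins

    # Trilinear "slicing": interpolate each of the 12 coefficients at every full-res pixel.
    coeffs = np.stack([map_coordinates(grid[..., k], coords, order=1, mode="nearest")
                       for k in range(12)], axis=-1)            # (H, W, 12)
    A = coeffs[..., :9].reshape(h, w, 3, 3)
    b = coeffs[..., 9:]

    return np.einsum("hwij,hwj->hwi", A, image) + b             # per-pixel affine transform
```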
  • 37. Pipelineimage enhancement #8 Blind image quality assessment, e.g. for quantifying RGB scan quality in real time RankIQA: Learning from Rankings for No-reference Image Quality Assessment Xialei Liu, Joost van de Weijer, Andrew D. Bagdanov (Submitted on 26 Jul 2017) https://arxiv.org/abs/1707.08347 The classical approach trains a deep CNN regressor directly on the ground-truth. Our approach trains a network from an image ranking dataset. These ranked images can be easily generated by applying distortions of varying intensities. The network parameters are then transferred to the regression network for finetuning. This allows for the training of deeper and wider networks. Siamese network output for JPEG distortion considering 6 levels. The graph illustrates that the Siamese network successfully manages to separate the different distortion levels. Blind Deep S3D Image Quality Evaluation via Local to Global Feature Aggregation Heeseok Oh ; Sewoong Ahn ; Jongyoo Kim ; Sanghoon Lee IEEE Transactions on Image Processing ( Volume: 26, Issue: 10, Oct. 2017 ) https://doi.org/10.1109/TIP.2017.2725584
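A hedged sketch of the ranking stage described above: pairs are generated by distorting the same images at two intensities, and a Siamese scorer is trained with a margin ranking loss so the less-distorted image scores higher. `scorer` and `distort` are placeholders for a CNN and a degradation routine, not the RankIQA code itself.

```python
import torch
import torch.nn as nn

# scorer: any CNN mapping an image batch (B, 3, H, W) to a quality score (B, 1);
# distort(imgs, level): hypothetical function returning the batch degraded at the given level.
ranking_loss = nn.MarginRankingLoss(margin=1.0)

def ranking_step(scorer, clean_batch, distort, levels=(1, 3)):
    """One training step on synthetically ranked pairs: lower distortion => higher score."""
    weak = distort(clean_batch, level=levels[0])       # mildly distorted version
    strong = distort(clean_batch, level=levels[1])     # heavily distorted version
    s_weak, s_strong = scorer(weak), scorer(strong)
    target = torch.ones_like(s_weak)                   # +1: first input should rank higher
    return ranking_loss(s_weak, s_strong, target)
```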
  • 39. Pipelineimage Styling #1 Aesthetics enhancement: High Dynamic Range from SfM Large scale structure-from-motion (SfM) algorithms have recently enabled the reconstruction of highly detailed 3-D models of our surroundings simply by taking photographs. In this paper, we propose to leverage these reconstruction techniques to automatically estimate the outdoor illumination conditions for each image in a SfM photo collection. We introduce a novel dataset of outdoor photo collections, where the ground truth lighting conditions are known at each image. We also present an inverse rendering approach that recovers a high dynamic range estimate of the lighting conditions for each low dynamic range input image. Our novel database is used to quantitatively evaluate the performance of our algorithm. Results show that physically plausible lighting estimates can faithfully be recovered, both in terms of light direction and intensity. Lighting Estimation in Outdoor Image Collections Jean-Francois Lalonde (Laval University); Iain Matthews (Disney Research) 3D Vision (3DV), 2014 2nd International Conference on https://www.disneyresearch.com/publication/lighting-estimation-in-outdoor-image-collections/ https://doi.org/10.1109/3DV.2014.112 The main limitation of our approach is that it can recover precise lighting parameters only when lighting actually creates strongly visible effects—such as cast shadows, shading differences amongst surfaces of different orientations—on the image. When the camera does not observe significant lighting variations, for example when the sun is shining on a part of the building that the camera does not observe, or when the camera only see a very small fraction of the landmark with little geometric details, our approach recovers a coarse estimate of the full lighting conditions. In addition, our approach is sensitive to errors in geometry estimation, or to the presence of unobserved, nearby objects. Because it does not know about these objects, our method tries to explain their cast shadows with the available geometry, which may result in errors. Our approach is also sensitive to inter-reflections. Incorporating more sophisticated image formation models such as radiosity could help alleviating this problem, at the expense of significantly more computation. Finally, our approach relies on knowledge of the camera exposure and white balance settings, which might be less applicable to the case of images downloaded on the Internet. We plan to explore these issues in future work. Exploring material recognition for estimating reflectance and illumination from a single image Michael Weinmann; Reinhard Klein MAM '16 Proceedings of the Eurographics 2016 Workshop on Material Appearance Modeling https://doi.org/10.2312/mam.20161253 We demonstrate that reflectance and illumination can be estimated reliably for several materials that are beyond simple Lambertian surface reflectance behavior because of exhibiting mesoscopic effects such as interreflections and shadows. Shading Annotations in the Wild Balazs Kovacs, Sean Bell, Noah Snavely, Kavita Bala (Submitted on 2 May 2017) https://arxiv.org/abs/1705.01156 http://opensurfaces.cs.cornell.edu/saw/ We use this data to train a convolutional neural network to predict per-pixel shading information in an image. We demonstrate the value of our data and network in an application to intrinsic images, where we can reduce decomposition artifacts produced by existing algorithms.
  • 40. Pipelineimage Styling #2A Aesthetics enhancement: High Dynamic Range #1 Learning High Dynamic Range from Outdoor Panoramas Jinsong Zhang, Jean-François Lalonde (Submitted on 29 Mar 2017 (v1), last revised 8 Aug 2017 (this version, v2)) https://arxiv.org/abs/1703.10200 http://www.jflalonde.ca/projects/learningHDR Qualitative results on the synthetic dataset. Top row: the ground truth HDR panorama, middle row: the LDR panorama, and bottom row: the predicted HDR panorama obtained with our method. To illustrate dynamic range, each panorama is shown at two exposures, with a factor of 16 between the two. For each example, we show the panorama itself (left column), and the rendering of a 3D object lit with the panorama (right column). The object is a “spiky sphere” on a ground plane, seen from above. Our method accurately predicts the extremely high dynamic range of outdoor lighting in a wide variety of lighting conditions. A tonemapping of γ = 2.2 is used for display purposes. Real cameras have non-linear response functions. To simulate this, we randomly sample real camera response functions from the Database of Response Functions (DoRF) [Grossberg and Nayar, 2003], and apply them to the linear synthetic data before training. Examples from our real dataset. For each case, we show the LDR panorama captured by the Ricoh Theta S camera, a consumer grade point-and-shoot 360º camera (left), and the corresponding HDR panorama captured by the Canon 5D Mark III DSLR mounted on a tripod, equipped with a Sigma 8mm fisheye lens (right, shown at a different exposure to illustrate the high dynamic range). We present a full end-to-end learning approach to estimate the extremely high dynamic range of outdoor lighting from a single, LDR 360º panorama. Our main insight is to exploit a large dataset of synthetic data composed of a realistic virtual city model, lit with real world HDR sky light probes [Lalonde et al. 2016 http://www.hdrdb.com/] to train a deep convolutional autoencoder
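The synthetic LDR/HDR pairing described above can be emulated roughly as below: expose a linear HDR image, clip the highlights, apply a non-linear response and quantise to 8 bits. The simple gamma curve and the random exposure jitter are stand-ins for the DoRF response functions and the paper's actual augmentation, so treat this only as a data-generation sketch.

```python
import numpy as np

def hdr_to_ldr(hdr, exposure=1.0, gamma=2.2, rng=None):
    """Render a linear HDR image (H, W, 3) to an 8-bit LDR observation (training-pair sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    exposure = exposure * rng.uniform(0.5, 2.0)            # random exposure jitter
    linear = np.clip(hdr * exposure, 0.0, 1.0)             # saturate the highlights
    response = linear ** (1.0 / gamma)                     # stand-in for a sampled DoRF curve
    return (response * 255.0 + 0.5).astype(np.uint8)       # quantise to 8 bits
```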
  • 41. Pipelineimage Styling #2b High Dynamic Range #2: Learn illumination for relighting purposes Learning to Predict Indoor Illumination from a Single Image Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, Jean-François Lalonde (Submitted on 1 Apr 2017 (v1), last revised 25 May 2017 (this version, v2)) https://arxiv.org/abs/1704.00090
  • 42. Pipelineimage Styling #3a Improving photocompositing and relighting of RGB textures Deep Image Harmonization Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, Ming-Hsuan Yang (Submitted on 28 Feb 2017) https://arxiv.org/abs/1703.00069 Our method can adjust the appearances of the composite foreground to make it compatible with the background region. Given a composite image, we show the harmonized images generated by Xue et al. (2012), Zhu et al. (2015) and our deep harmonization network. The overview of the proposed joint network architecture. Given a composite image and a provided foreground mask, we first pass the input through an encoder for learning feature representations. The encoder is then connected to two decoders, including a harmonization decoder for reconstructing the harmonized output and a scene parsing decoder to predict pixel-wise semantic labels. In order to use the learned semantics and improve harmonization results, we concatenate the feature maps from the scene parsing decoder to the harmonization decoder (denoted as dot-orange lines). In addition, we add skip links (denoted as blue-dot lines) between the encoder and decoders for retaining image details and textures. Note that, to keep the figure clean, we only depict the links for the harmonization decoder, while the scene parsing decoder has the same skip links connected to the encoder. Given an input image (a), our network can adjust the foreground region according to the provided mask (b) and produce the output (c). In this example, we invert the mask from the one in the first row to the one in the second row, and generate harmonization results that account for different context and semantic information.
  • 43. Pipelineimage Styling #3b Sky is not the limit: semantic-aware sky replacement YH Tsai, X Shen, Z Lin, K Sunkavalli; Ming-Hsuan Yang ACM Transactions on Graphics (TOG) - Volume 35 Issue 4, July 2016 https://doi.org/10.1145/2897824.2925942 In order to find proper skies for replacement, we propose a data-driven sky search scheme based on semantic layout of the input image. Finally, to re-compose the stylized sky with the original foreground naturally, an appearance transfer method is developed to match statistics locally and semantically. Sample sky segmentation results. Given an input image, the FCN generates results that localize the sky well but contain inaccurate boundaries and noisy segments. The proposed online model refines segmentations that are complete and accurate, especially around the boundaries (best viewedin color with enlarged images). Overview of the proposed algorithm. Given an input image, we first utilize the FCN to obtain scene parsing results and semantic response for each category. A coarse-to-fine strategy is adopted to segment sky regions (illustrated as the red mask). To find reference images for sky replacement, we develop a method to search images with similar semantic layout. After re-composing images with the found skies, we transfer visual semantics to match foreground statistics between the input image and the reference image. Finally, a set of composite images with different stylized skies are generated automatically. GP-GAN: Towards Realistic High-Resolution Image Blending Huikai Wu, Shuai Zheng, Junge Zhang, Kaiqi Huang (Submitted on 21 Mar 2017 (v1), last revised 25 Mar 2017 (this version, v2)) https://arxiv.org/abs/1703.07195 Qualitative illustration of high-resolution image blending. a) shows the composited copy-and-paste image where the inserted object is circled out by red lines. Users usually expect image blending algorithms to make this image more natural. b) represents the result based on Modified Poisson image editing [32]. c) indicates the result from Multi-splines approach. d) is the result of our method Gaussian-Poisson GAN (GP-GAN). Our approach produces better quality images than that from the alternatives in terms of illumination, spatial, and color consistencies. We advanced the state-of-the-art in conditional image generation by combining the ideas from the generative model GAN, Laplacian Pyramid, and Gauss-Poisson Equation. This combination is the first time a generative model could produce realistic images in arbitrary resolution. In spite of the effectiveness, our algorithm fails to generate realistic images when the composited images are far away from the distribution of the training dataset. We aim to address this issue in future work. Improving photocompositing and relighting of RGB textures
  • 44. Pipelineimage Styling #3c Live User-Guided Intrinsic Video for Static Scenes Abhimitra Meka ; Gereon Fox ; Michael Zollhofer ; Christian Richardt ; Christian Theobalt IEEE Transactions on Visualization and Computer Graphics ( Volume: PP, Issue: 99 ) https://doi.org/10.1109/TVCG.2017.2734425 Improving photocompositing and relighting of RGB textures User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch- based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We propose a novel approach for live, user-guided intrinsic video decomposition. We first obtain a dense volumetric reconstruction of the scene using a commodity RGB-D sensor. The reconstruction is leveraged to store reflectance estimates and user-provided constraints in 3D space to inform the ill-posed intrinsic video decomposition problem. Our approach runs at real-time frame rates, and we apply it to applications such as relighting, recoloring and material editing. Our novel user-guided intrinsic video approach enables real-time applications such as recoloring, relighting and material editing. Constant reflectance strokes improve the decomposition by moving the high-frequency shading of the cloth to the shading layer. Comparison to state-of-the-art intrinsic video decomposition techniques on the ‘girl’ dataset. Our approach matches the real-time performance of Meka et al. (2016), while achieving the same quality as previous off-line techniques
  • 45. Pipelineimage Styling #4 Beyond low-level style transfer for high-level manipulation Generative Semantic Manipulation with Contrasting GAN Xiaodan Liang, Hao Zhang, Eric P. Xing (Submitted on 1 Aug 2017) https://arxiv.org/abs/1708.00315 Generative Adversarial Networks (GANs) have recently achieved significant improvement on paired/unpaired image-to-image translation, such as photo→sketch and artist painting style transfer. However, existing models are only capable of transferring low-level information (e.g. color or texture changes), but fail to edit high-level semantic meanings (e.g., geometric structure or content) of objects. Some example semantic manipulation results by our model, which takes one image and a desired object category (e.g. cat, dog) as inputs and then learns to automatically change the object semantics by modifying their appearance or geometric structure. We show the original image (left) and manipulated result (right) in each pair. Although our method can achieve compelling results in many semantic manipulation tasks, it shows little success for some cases which require very large geometric changes, such as car↔truck and car↔bus. Integrating spatial transformation layers for explicitly learning pixel-wise offsets may help resolve very large geometric changes. To be more general, our model can be extended to replace the mask annotations with predicted object masks or automatically learned attentive regions via attention modeling. This paper pushes forward research on the unsupervised setting by demonstrating the possibility of manipulating high-level object semantics rather than the low-level color and texture changes of previous works. In addition, it would be interesting to develop techniques that are able to manipulate object interactions and activities in images/videos in future work.
  • 46. Pipelineimage Styling #5A Aesthetics enhancement: Style Transfer | Introduction #1 Neural Style Transfer: A Review Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song (Submitted on 11 May 2017) https://arxiv.org/abs/1705.04058 A list of mentioned papers in this review, corresponding codes and pre-trained models are publicly available at: https://github.com/ ycjing/Neural-Style-Transfer-Papers One of the reasons why Neural Style Transfer catches eyes in both academia and industry is its popularity in some social networking sites (e.g., Twitter and Facebook). A mobile application Prisma [36] is one of the first industrial applications that provides the Neural Style Transfer algorithm as a service. Before Prisma, the general public almost never imagines that one day they are able to turn their photos into art paintings in only a few minutes. Due to its high quality, Prisma achieved great success and is becoming popular around the world. Another use of Neural Style Transfer is to act as user-assisted creation tools. Although, to the best of our knowledge, there are no popular applications that applied the Neural Style Transfer technique in creation tools, we believe that it will be a promising potential usage in the future. Neural Style Transfer is capable of acting as a creation tool for painters and designers. Neural Style Transfer makes it more convenient for a painter to create an artifact of a specific style, especially when creating computer-made fine art images. Moreover, with Neural Style Transfer algorithms it is trivial to produce stylized fashion elements for fashion designers and stylized CAD drawings for architects in a variety of styles, which is costly to produce them by hand.
  • 47. Pipelineimage Styling #5b Aesthetics enhancement: Style Transfer | Introduction #2 Neural Style Transfer: A Review Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song (Submitted on 11 May 2017) https://arxiv.org/abs/1705.04058 A list of mentioned papers in this review, corresponding codes and pre-trained models are publicly available at: https://github.com/ ycjing/Neural-Style-Transfer-Papers Promising directions for future research in Neural Style Transfer mainly focus on two aspects. The first aspect is to solve the existing aforementioned challenges for current algorithms, i.e., problem of parameter tuning, problem of stroke orientation control and problem existing in “Fast” and “Faster” Neural Style Transfer algorithms. The second aspect of promising directions is to focus on new extensions to Neural Style Transfer (e.g., Fashion Style Transfer and Character Style Transfer). There are already some preliminary work related with this direction, such as the recent work of Yang et al. (2016) on Text Effects Transfer. These interesting extensions may become trending topics in the future and related new areas may be created subsequently.
  • 48. Pipelineimage Styling #5C Aesthetics enhancement: Video Style Transfer DeepMovie: Using Optical Flow and Deep Neural Networks to Stylize Movies Alexander G. Anderson, Cory P. Berg, Daniel P. Mossing, Bruno A. Olshausen (Submitted on 26 May 2016) https://arxiv.org/abs/1605.08153 https://github.com/anishathalye/neural-style Coherent Online Video Style Transfer Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, Gang Hua (Submitted on 27 Mar 2017 (v1), last revised 28 Mar 2017 (this version, v2)) https://arxiv.org/abs/1703.09211 The main contribution of this paper is to use optical flow to initialize the style transfer optimization so that the texture features move with the objects in the video. Finally, we suggest a method to incorporate optical flow explicitly into the cost function. Overview of Our Approach: We begin by applying the style transfer algorithm to the first frame of the movie using the content image as the initialization. Next, we calculate the optical flow field that takes the first frame of the movie to the second frame. We apply this flow-field to the rendered version of the first frame and use that as the initialization for the style transfer optimization for the next frame. Note, for instance, that a blue pixel in the flow field image means that the underlying object in the video at that pixel moved to the left from frame one to frame two. Intuitively, in order to apply the flow field to the styled image, you move the parts of the image that have a blue pixel in the flow field to the left. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network by incorporating short- term coherence, and propagating short-term coherence to long-term, which ensures the consistency over larger period of time. Our network can incorporate different image stylization networks. We show that the proposed method clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitudes faster in runtime. There are still some limitations in our method. For instance, limited by the accuracy of ground-truth optical flow (given by DeepFlow2 [Weinzaepfel et al. 2013] ), our results may suffer from some incoherence where the motion is too large for the flow to track. And after propagation over a long period, small flow errors may accumulate, causing blurriness. These open questions are interesting for further exploration in the future work.
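A rough sketch of the flow-based initialisation described above, using OpenCV's Farnebäck flow in place of DeepFlow: the previously stylized frame is warped into the geometry of the next frame and can then seed the next stylization. Occlusion handling and the temporal loss terms of the cited papers are omitted, so this only illustrates the warping step.

```python
import cv2
import numpy as np

def warp_previous_stylization(stylized_prev, gray_prev, gray_next):
    """Warp the stylized previous frame into the next frame's geometry (initialisation only)."""
    # Backward flow: for each pixel of the next frame, where it came from in the previous frame.
    # Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(gray_next, gray_prev, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_next.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(stylized_prev, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REFLECT)
```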
  • 49. Pipelineimage Styling #6A Aesthetics enhancement: Texture synthesis and upsampling TextureGAN: Controlling Deep Image Synthesis with Texture Patches Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, James Hays (Submitted on 9 Jun 2017) https://arxiv.org/abs/1706.02823 TextureGAN pipeline. A feed-forward generative network is trained end-to-end to directly transform a 4- channel input to a high- res photo with realistic textural details. Photo-realistic Facial Texture Transfer Parneet Kaur, Hang Zhang, Kristin J. Dana (Submitted on 14 Jun 2017) https://arxiv.org/abs/1706.04306 Overview of our method. Facial identity is preserved using Facial Semantic Regularization which regularizes the update of meso-structures using a facial prior and facial semantic structural loss. Texture loss regularizes the update of local textures from the style image. The output image is initialized with the content image and updated at each iteration by back-propagating the error gradients for the combined losses. Content/style photos: Martin Scheoller/Art+Commerce. Identity-preserving Facial Texture Transfer (FaceTex). The textural details are transferred from style image to content image while preserving its identity. FaceTex outperforms existing methods perceptually as well as quantitatively. Column 3 uses input 1 as the style image and input 2 as the content. Column 4 uses input 1 as the content image and input 2 as the style image. Figure 3 shows more examples and comparison with existing methods. Input photos: Martin Scheoller/Art+Commerce.
  • 50. Pipelineimage Styling #6B Aesthetics enhancement: Texture synthesis with style transfer Stable and Controllable Neural Texture Synthesis and Style Transfer Using Histogram Losses Eric Risser, Pierre Wilmot, Connelly Barnes Artomatix, University of Virginia (Submitted on 31 Jan 2017 (v1), last revised 1 Feb 2017 (this version, v2)) https://arxiv.org/abs/1701.08893 Our style transfer and texture synthesis results. The input styles are shown in (a), and style transfer results are in (b, c). Note that the angular shapes of the Picasso painting are successfully transferred on the top row, and that the more subtle brush strokes are transferred on the bottom row. The original content images are inset in the upper right corner. Unless otherwise noted, our algorithm is always run with default parameters (we do not manually tune parameters). Input textures are shown in (d) and texture synthesis results are in (e). For the texture synthesis, note that the algorithm synthesizes creative new patterns and connectivities in the output. Different statistics that can be used for neural network texture synthesis.
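The histogram losses above penalise mismatches between feature-activation histograms; the remapping they build on is classical histogram matching, sketched here per image channel purely for illustration (the paper applies the idea to VGG feature maps, not raw pixels).

```python
import numpy as np

def match_histograms(source, template):
    """Per-channel histogram matching: remap `source` so its empirical CDF follows `template`'s."""
    out = np.empty_like(source, dtype=np.float64)
    for c in range(source.shape[-1]):
        src, tmp = source[..., c].ravel(), template[..., c].ravel()
        s_values, s_idx, s_counts = np.unique(src, return_inverse=True, return_counts=True)
        t_values, t_counts = np.unique(tmp, return_counts=True)
        s_cdf = np.cumsum(s_counts) / src.size          # empirical CDF of the source channel
        t_cdf = np.cumsum(t_counts) / tmp.size          # empirical CDF of the template channel
        matched = np.interp(s_cdf, t_cdf, t_values)     # template value at the same quantile
        out[..., c] = matched[s_idx].reshape(source.shape[:-1])
    return out
```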
  • 51. Pipelineimage Styling #6C Aesthetics enhancement: Enhancing texture maps Depth Texture Synthesis for Realistic Architectural Modeling Félix Labrie-Larrivée ; Denis Laurendeau ; Jean-François Lalonde Computer and Robot Vision (CRV), 2016 13th Conference on https://doi.org/10.1109/CRV.2016.77 In this paper, we present a novel approach that improves the resolution and geometry of 3D meshes of large scenes with such repeating elements. By leveraging structure from motion reconstruction and an off-the-shelf depth sensor, our approach captures a small sample of the scene in high resolution and automatically extends that information to similar regions of the scene. Using RGB and SfM depth information as a guide and simple geometric primitives as canvas, our approach extends the high resolution mesh by exploiting powerful, image-based texture synthesis approaches. The final results improves on standard SfM reconstruction with higher detail. Our approach benefits from reduced manual labor as opposed to full RGBD reconstruction, and can be done much more cheaply than with LiDAR- based solutions. In the future, we plan to work on a more generalized 3D texture synthesis procedure capable of synthesizing a more varied set of objects, and able to reconstruct multiple parts of the scene by exploiting several high resolution scan samples at once in an effort to address the tradeoff mentioned above. We also plan to improve the robustness of the approach to a more varied set of large scale scenes, irrespective of the lighting conditions, material colors, and geometric configurations. Finally, we plan to evaluate how our approach compares to SfM on a more quantitative level by leveraging LiDAR data as ground truth. Overview of the data collection and alignment procedure. Top row: a collection of photos of the scene is acquired with a typical camera, and used to generate a point cloud via SfM [Agarwal et al. 2009] and dense multi-view stereo (MVS) [ Furukawa and Ponce, 2012]. Bottom row: a repeating feature of the scene (in this example, the left-most window) is recorded with a Kinect sensor, and reconstructed into a high resolution mesh via the RGB-D SLAM technique KinectFusion [ Newcombe et al. 2011]. The mesh is then automatically aligned to the SfM reconstruction using bundle adjustment and our automatic scale adaptation technique (see sec. III-C). Right: the high resolution Kinect mesh is correctly aligned to the low resolution SfM point cloud
  • 52. Pipelineimage Styling #6D Aesthetics enhancement: Towards photorealism with good maps One Ph.D. position (supervision by Profs Niessner and Rüdiger Westermann) is available at our chair in the area of photorealistic rendering for deep learning and online reconstruction. Research in this project includes the development of photorealistic realtime rendering algorithms that can be used in deep learning applications for scene understanding, and for high-quality scalable rendering of point scans from depth sensors and RGB stereo image reconstruction. If you are interested in applying, you should have a strong background in computer science, i.e., efficient algorithms and data structures, and GPU programming, have experience implementing C/C++ algorithms, and you should be excited to work on state-of-the-art research in 3D computer graphics. https://wwwcg.in.tum.de/group/joboffers/phd-position-photorealistic-rendering-for-deep-learning-and-online-reconstruction.html Ph.D. Position – Photorealistic Rendering for Deep Learning and Online Reconstruction Photorealism Explained Blender Guru Published on May 25, 2016 http://www.blenderguru.com/tutorials/photorealism-explained/ https://youtu.be/R1-Ef54uTeU Stop wasting time creating texture maps by hand. All materials on Poliigon come with the relevant normal, displacement, reflection and gloss maps included. Just plug them into your software, and your material is ready to render. https://www.poliigon.com/ How to Make Photorealistic PBR Materials - Part 1 Blender Guru Published on Jun 28, 2016 http://www.blenderguru.com/tutorials/pbr-shader-tutorial-pt1/ https://youtu.be/V3wghbZ-Vh4?t=24m5s Physically Based Rendering (PBR)
  • 53. Pipelineimage Styling #7 Styling line graphics (e.g. floorplans, 2D CADs) and monochrome images, e.g. for a desired visual identity Real-Time User-Guided Image Colorization with Learned Deep Priors Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, Alexei A. Efros (Submitted on 8 May 2017) https://arxiv.org/abs/1705.02999 Our proposed method colorizes a grayscale image (left), guided by sparse user inputs (second), in real-time, providing the capability for quickly generating multiple plausible colorizations (middle to right). Photograph of Migrant Mother by Dorothea Lange, 1936 (Public Domain). Network architecture: We train two variants of the user interaction colorization network. Both variants use the blue layers for predicting a colorization. The Local Hints Network also uses red layers to (a) incorporate user points U_l and (b) predict a color distribution Ẑ. The Global Hints Network uses the green layers, which transform the global input U_g by 1 × 1 conv layers and add the result into the main colorization network. Each box represents a conv layer, with the vertical dimension indicating feature map spatial resolution, and the horizontal dimension indicating the number of channels. Changes in resolution are achieved through subsampling and upsampling operations. In the main network, when resolution is decreased, the number of feature channels is doubled. Shortcut connections are added to upsampling convolution layers. Style Transfer for Anime Sketches with Enhanced Residual U-net and Auxiliary Classifier GAN Lvmin Zhang, Yi Ji, Xin Lin (Submitted on 11 Jun 2017 (v1), last revised 13 Jun 2017 (this version, v2)) https://arxiv.org/abs/1706.03319 Examples of combination results on sketch images (top-left) and style images (bottom-left). Our approach automatically applies the semantic features of an existing painting to an unfinished sketch. Our network has learned to classify the hair, eyes, skin and clothes, and has the ability to paint these features according to a sketch. In this paper, we integrated a residual U-net to apply the style to the grayscale sketch with an auxiliary classifier generative adversarial network (AC-GAN, Odena et al. 2016). The whole process is automatic and fast, and the results are creditable in the quality of art style as well as colorization. Limitation: the pretrained VGG is for ImageNet photograph classification, but not for paintings. In the future, we will train a classification network only for paintings to achieve better results. Furthermore, due to the large quantity of layers in our residual network, the batch size during training is limited to no more than 4. It remains for future study to reach a balance between the batch size and quantity of layers.
  • 54. Future Image restoration Depth Images (Kinect, etc.)
  • 55. PipelineDepth image enhancement #1a Image Formation #1 Pinhole Camera Model: ideal projection of a 3D object on a 2D image. Fernandez et al. (2017) Dot patterns of a Kinect for Windows (a) and two Kinects for Xbox (b) and (c) are projected on a flat wall from a distance of 1000 mm. Note that the projection of each pattern is similar, and related by a 3-D rotation depending on the orientation of the Kinect diffuser installation. The installation variability can clearly be observed from differences in the bright dot locations (yellow stars), which differ by an average distance of 10 pixels. Also displayed in (d) is the idealized binary replication of the Kinect dot pattern [Kinect Pattern Uncovered], which was used in this project to simulate IR images. – Landau et al. (2016)
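For reference, the ideal pinhole projection mentioned above amounts to a couple of lines once the intrinsics are known; lens distortion, which does matter for real Kinect-class sensors, is deliberately ignored in this sketch.

```python
import numpy as np

def project_pinhole(points_cam, fx, fy, cx, cy):
    """Project 3-D points given in the camera frame (N, 3) to pixel coordinates (N, 2).

    Ideal pinhole model: u = fx * X/Z + cx,  v = fy * Y/Z + cy  (no lens distortion).
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.stack([u, v], axis=-1)

# Example: a point 2 m in front of a VGA depth camera with ~580 px focal length.
# print(project_pinhole(np.array([[0.1, 0.0, 2.0]]), 580, 580, 320, 240))
```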
  • 56. PipelineDepth image enhancement #1b Image Formation #2 Characterizations of Noise in Kinect Depth Images: A Review Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar IEEE Sensors Journal ( Volume: 14, Issue: 6, June 2014 ) https://doi.org/10.1109/JSEN.2014.2309987 Kinect outputs for a scene. (a) RGB Image. (b) Depth data rendered as an 8- bit gray-scale image with nearer depth values mapped to lower intensities. Invalid depth values are set to 0. Note the fixed band of invalid (black) pixels on left. (c) Depth image showing too near depths in blue, too far depths in red and unknown depths due to highly specular objects in green. Often these are all taken as invalid zero depth. Shadow is created in a depth image (Yu et al. 2013) when the incident IR from the emitter gets obstructed by an object and no depth can be estimated. PROPERTIES OF IR LIGHT [Rose]
  • 57. Pipeline Depth image enhancement #1c Image Formation #3 Authors’ experiments on structural noise using a plane in 400 frames. (a) Error at 1.2m. (b) Error at 1.6m. (c) Error at 1.8m. Smisek et al. (2013) calibrate a Kinect against a stereo-rig (comprising two Nikon D60 DSLR cameras) to estimate and improve its overall accuracy. They have taken images and fitted planar objects at 18 different distances (from 0.7 to 1.3 meters) to estimate the error between the depths measured by the two sensors. The experiments corroborate that the accuracy varies inversely with the square of depth [2]. However, even after the calibration of Kinect, the procedure still exhibits relatively complex residual errors (Fig. 8). Fig. 8. Residual noise of a plane. (a) Plane at 86cm. (b) Plane at 104cm. Authors’ experiments on temporal noise. Entropy and SD of each pixel in a depth frame over 400 frames for a stationary wall at 1.6m. (a) Entropy image. (b) SD image. Authors’ experiments with vibrating noise showing ZD samples as white dots. A pixel is taken as noise if it is zero in frame i and nonzero in frames i±1. Note that noise follows depth edges and shadow. (a) Frame (i−1). (b) Frame i. (c) Frame (i+1). (d) Noise for frame i.
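The per-pixel temporal statistics used in these noise characterisations (standard deviation and entropy over a few hundred frames of a static scene) can be reproduced along these lines; the bin count for the entropy estimate is an arbitrary choice of this sketch.

```python
import numpy as np

def temporal_noise_maps(frames, n_bins=64):
    """Per-pixel standard deviation and entropy over a (N, H, W) stack of depth frames
    of a static scene (e.g. N = 400 frames of a flat wall)."""
    frames = frames.astype(np.float64)
    sd_map = frames.std(axis=0)

    # Per-pixel Shannon entropy of the depth values, quantised into n_bins levels.
    lo, hi = frames.min(), frames.max()
    quantised = np.clip(((frames - lo) / max(hi - lo, 1e-9) * n_bins).astype(int),
                        0, n_bins - 1)
    entropy = np.zeros(frames.shape[1:])
    for b in range(n_bins):
        p = (quantised == b).mean(axis=0)          # frequency of bin b at every pixel
        p_safe = np.where(p > 0, p, 1.0)           # avoid log(0)
        entropy -= p * np.log2(p_safe)
    return sd_map, entropy
```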
  • 58. PipelineDepth image enhancement #1d Image Formation #4 The filtered intensity samples generated from unsaturated IR dots (blue dots) were used to fit the intensity model (red line), which follows an inverse square model for the distance between the sensor and the surface point Landau et al. (2016) (a) Multiplicative speckle distribution is unitless, and can be represented as a gamma distribution Γ (4.54, 0.196). (b) Additive detector noise distribution can be represented as a normal distribution Ν (−0.126, 10.4), and has units of 10-bit intensity. Landau et al. (2016) The standard error in depth estimation (mm) as a function of radial distance (pix) is plotted for the (a) experimental and (b) simulated data sets of flat walls at various depths (mm). The experimental standard depth error increases faster with an increase in radial distance due to lens distortion. Landau et al. (2016)
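A rough simulation of the reported intensity and noise model: inverse-square fall-off, multiplicative gamma speckle Γ(4.54, 0.196) and additive Gaussian detector noise. The second parameter of the normal distribution is treated here as a standard deviation in 10-bit counts, and the source-strength constant is arbitrary, so this is only a sketch in the spirit of Landau et al., not their simulator.

```python
import numpy as np

def simulate_ir_intensity(depth_m, albedo=1.0, a=2000.0, rng=None):
    """Simulate 10-bit Kinect IR dot intensities with inverse-square fall-off,
    multiplicative gamma speckle and additive Gaussian detector noise (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    ideal = a * albedo / depth_m ** 2                                        # inverse-square model
    speckle = rng.gamma(shape=4.54, scale=0.196, size=np.shape(depth_m))     # multiplicative noise
    detector = rng.normal(loc=-0.126, scale=10.4, size=np.shape(depth_m))    # additive noise (DN)
    return np.clip(ideal * speckle + detector, 0, 1023)                      # saturate to 10 bits
```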
  • 59. PipelineDepth image enhancement #2A Metrological Calibration #1 A New Calibration Method for Commercial RGB-D Sensors Walid Darwish, Shenjun Tang, Wenbin Li and Wu Chen Sensors 2017, 17(6), 1204; doi:10.3390/s17061204 Based on these calibration algorithms, different calibration methods have been implemented and tested. Methods include the use of 1D [Liu et al. 2012] 2D [Shibo and Qing 2012] , and 3D [Gui et al. 2014] calibration objects that work with the depth images directly; calibration of the manufacture parameters of the IR camera and projector [Herrera et al. 2012] ; or photogrammetric bundle adjustments used to model the systematic errors of the IR sensors [ Davoodianidaliki and Saadatseresht 2013; Chow and Lichti 2013] . To enhance the depth precision, additional depth error models are added to the calibration procedure [7,8,21,22,23]. All of these error models are used to compensate only for the distortion effect of the IR projector and camera. Other research works have been conducted to obtain the relative calibration between an RGB camera and an IR camera by accessing the IR camera [24,25,26]. This can achieve relatively high accuracy calibration parameters for a baseline between IR and RGB cameras, while the remaining limitation is that the distortion parameters for the IR camera cannot represent the full distortion effect for the depth sensor. This study addressed these issues using a two-step calibration procedure to calibrate all of the geometric parameters of RGB-D sensors. The first step was related to the joint calibration between the RGB and IR cameras, which was achieved by adopting the procedure discussed in [27] to compute the external baseline between the cameras and the distortion parameters of the RGB camera. The second step focused on the depth sensor calibration. Point cloud of two perpendicular planes (blue color: default depth; red color: modeled depth): highlighted black dashed circles shows the significant impact of the calibration method on the point cloud quality. The main difference between both sensors is the baseline between the IR camera and projector. The longer the sensor’s baseline, the longer working distance can be achieved. The working range of Kinect v1 is 0.80 m to 4.0 m, while it is 0.35 m to 3.5 m for Structure Sensor.
  • 60. PipelineDepth image enhancement #2A Metrological Calibration #2 Photogrammetric Bundle Adjustment With Self- Calibration of the PrimeSense 3D Camera Technology: Microsoft Kinect IEEE Access ( Volume: 1 ) 2013 https://doi.org/10.1109/ACCESS.2013.2271860 Roughness of point cloud before calibration. (Bottom) Roughness of point cloud after calibration. The colours indicate the roughness as measured by the normalized smallest eigenvalue. Estimated Standard Deviation of the Observation Residuals To quantify the external accuracy of the Kinect and the benefit of the proposed calibration, a target board located at 1.5–1.8 m away with 20 signalized targets was imaged using an in- house program based on the Microsoft Kinect SDK and with RGBDemo. Spatial distances between the targets were known from surveying using the FARO Focus3D terrestrial laser scanner with a standard deviation of 0.7 mm. By comparing the 10 independent spatial distances measured by the Kinect to those made by the Focus3D, the RMSE was 7.8 mm using RGBDemo and 3.7 mm using the calibrated Kinect results; showing a 53% improvement to the accuracy. This accuracy check assesses the quality of all the imaging sensors and not just the IR camera-projector pair alone. The results show improvements in geometric accuracy up to 53% compared with uncalibrated point clouds captured using the popular software RGBDemo. Systematic depth discontinuities were also reduced and in the check-plane analysis the noise of the Kinect point cloud was reduced by 17%.
  • 61. PipelineDepth image enhancement #2B Metrological Calibration #3 Evaluating and Improving the Depth Accuracy of Kinect for Windows v2 Lin Yang ; Longyu Zhang ; Haiwei Dong ; Abdulhameed Alelaiwi ; Abdulmotaleb El Saddik IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015) https://doi.org/10.1109/JSEN.2015.2416651 Illustration of accuracy assessment of Kinect v2. (a) Depth accuracy. (b) Depth resolution. (c) Depth entropy. (d) Edge noise. (e) Structural noise. The target plates in (a- c) and (d-e) are parallel and perpendicular with the depth axis, respectively. Accuracy error distribution of Kinect for Windows v2.
  • 62. PipelineDepth image enhancement #2c Metrological Calibration #4 A Comparative Error Analysis of Current Time-of-Flight Sensors IEEE Transactions on Computational Imaging (Volume: 2, Issue: 1, March 2016) https://doi.org/10.1109/TCI.2015.2510506 For evaluating the presence of wiggling, ground truth distance information is required. We calculate the true distance by setting up a stereo camera system. This system consists of the ToF camera to be evaluated and a high resolution monochrome camera (IDS UI-1241LE), which we call the reference camera. The cameras are calibrated with the algorithm of Zhang (2000), with point correspondences computed with ROCHADE (Placht et al. 2014). Ground truth is calculated by intersecting the rays of all ToF camera pixels with the 3D plane of the checkerboard. For higher accuracy, we compute this plane from corners detected in the reference image and transform the plane into the coordinate system of the ToF camera. This experiment aims to quantify the so-called amplitude-related distance error and also to show that this effect is not related to scattering. This effect can be observed when looking at a planar surface with high reflectivity variations. With some sensors the distance measurements for pixels with different amplitudes do not lie on the same plane, even though they should. To the best of our knowledge no evaluation setup has been presented for this error source so far. In the past this error has been typically observed with images of checkerboards or other high contrast patterns. However, the analysis of single images allows no differentiation between amplitude-related errors and internal scattering.
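The ray/plane ground-truth construction quoted above reduces to intersecting each pixel's viewing ray with the checkerboard plane. A sketch assuming known intrinsics and a plane already expressed in the ToF camera frame follows; names and the plane parameterisation (n·X + d = 0) are illustrative.

```python
import numpy as np

def ray_plane_ranges(fx, fy, cx, cy, width, height, n, d):
    """Ground-truth range per pixel by intersecting camera rays with the plane n·X + d = 0.

    n : (3,) plane normal in the ToF camera frame, d : plane offset (metres).
    Returns an (H, W) map of ray lengths (the quantity a ToF camera measures).
    """
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u, dtype=np.float64)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)      # unit viewing directions
    denom = rays @ np.asarray(n, dtype=np.float64)            # n · ray for every pixel
    denom = np.where(np.abs(denom) > 1e-9, denom, np.nan)     # rays parallel to the plane -> NaN
    return -d / denom                                         # distance along each ray
```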
  • 63. PipelineDepth image enhancement #2c Metrological Calibration #5 Low-Cost Reflectance-Based Method for the Radiometric Calibration of Kinect 2 IEEE Sensors Journal ( Volume: 16, Issue: 7, April 1, 2016 ) https://doi.org/10.1109/JSEN.2015.2508802 In this paper, a reflectance-based radiometric method for the second generation of gaming sensors, Kinect 2, is presented and discussed. In particular, a repeatable methodology generalizable to different gaming sensors by means of a calibrated reference panel with Lambertian behavior is developed. The relationship between the received power and the final digital level is obtained by means of a combination of a linear sensor relationship and signal attenuation, fitted in a least squares adjustment with an outlier detector. The results confirm that the quality of the method (standard deviation better than 2% in laboratory conditions and discrepancies lower than 7%) is valid for exploiting the radiometric possibilities of this low-cost sensor, with applications ranging from pathological analysis (moisture, crusts, etc.) to agricultural and forest resource evaluation. 3D data acquired with Kinect 2 (left) and digital number (DN) distribution (right) for the reference panel at 0.7 m (units: counts). Visible-RGB view of the brick wall (a), intensity-IR digital levels (DN) (b-d) and calibrated reflectance values (e-g) for the three acquisition distances. The objective of this paper was to develop a radiometric calibration equation of an IR projector-camera for the second generation of gaming sensors, Kinect 2, to convert the recorded digital levels into physical values (reflectance). By the proposed equation, the reflectance properties of the IR projector-camera set of Kinect 2 were obtained. This new equation will increase the number of application fields of gaming sensors, favored by the possibility of working outdoors. The process of radiometric calibration should be incorporated as part of an integral process where the geometry obtained is also corrected (i.e., lens distortion, mapping function, depth errors, etc.). As future perspectives, the effects of the diffuse radiance, which does not belong to the sensor footprint and contaminates the received signal, will be evaluated to determine the error budget of the active sensor.
  • 64. PipelineDepth image enhancement #3 ‘Old-school’ depth refining techniques Depth enhancement with improved exemplar-based inpainting and joint trilateral guided filtering Liang Zhang ; Peiyi Shen ; Shu'e Zhang ; Juan Song ; Guangming Zhu Image Processing (ICIP), 2016 IEEE International Conference on https://doi.org/10.1109/ICIP.2016.7533131 In this paper, a novel depth enhancement algorithm with improved exemplar-based inpainting and joint trilateral guided filtering is proposed. The improved exemplar-based inpainting method is applied to fill the holes in the depth images, in which a level set distance component is introduced in the priority evaluation function. Then a joint trilateral guided filter is adopted to denoise and smooth the inpainted results. Experimental results reveal that the proposed algorithm can achieve better enhancement results compared with the existing methods in terms of subjective and objective quality measurements. Robust depth enhancement and optimization based on advanced multilateral filters Ting-An Chang; Yang-Ting Chou; Jar-Ferr Yang EURASIP Journal on Advances in Signal Processing December 2017, 2017:51 https://doi.org/10.1186/s13634-017-0487-7 Results of depth enhancement coupled with hole filling obtained by (a) a noisy depth map, (b) joint bilateral filter (JBF) [16], (c) intensity guided depth superresolution (IGDS) [39], (d) compressive sensing based depth upsampling (CSDU) [40], (e) adaptive joint trilateral filter (AJTF) [18], and (f) the proposed AMF for Art, Books, Doily, Moebius, RGBD_1, and RGBD_2
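For orientation, the building block behind the JBF/AJTF variants compared above — a joint (cross) bilateral filter that smooths depth with weights from both spatial distance and guide-image similarity, skipping zero (invalid) depths — can be written directly. This is a slow brute-force reference implementation, not the filters from either paper.

```python
import numpy as np

def joint_bilateral_depth(depth, guide, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Brute-force joint bilateral filtering of a depth map guided by a grayscale image.

    depth : (H, W) depth map with zeros marking invalid pixels
    guide : (H, W) registered grayscale guide (e.g. luminance of the RGB image)
    """
    h, w = depth.shape
    pad = radius
    d = np.pad(depth.astype(np.float64), pad, mode="edge")
    g = np.pad(guide.astype(np.float64), pad, mode="edge")
    out = np.zeros_like(depth, dtype=np.float64)
    weight_sum = np.zeros_like(out)

    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            d_shift = d[pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            g_shift = g[pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            w_spatial = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
            w_range = np.exp(-(g_shift - guide) ** 2 / (2 * sigma_r ** 2))
            wgt = w_spatial * w_range * (d_shift > 0)       # exclude invalid (zero) depths
            out += wgt * d_shift
            weight_sum += wgt
    return np.where(weight_sum > 0, out / np.maximum(weight_sum, 1e-12), depth)
```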
  • 65. PipelineDepth image enhancement #4A Deep learning-based depth refining techniques DepthComp : real-time depth image completion based on prior semantic scene segmentation Atapour-Abarghouei, A. and Breckon, T.P. 28th British Machine Vision Conference (BMVC) 2017 London, UK, 4-7 September 2017. http://dro.dur.ac.uk/22375/ Exemplar results on the KITTI dataset. S denotes the segmented images [3] and D the original (unfilled) disparity maps. Results are compared with [1, 2, 29, 35, 45]. Results of cubic and linear interpolations are omitted due to space. Comparison of the proposed method using different initial segmentation techniques on the KITTI dataset [27]. Original color and disparity image (top-left), results with manual labels (top-right), results with SegNet [3] (bottom-left) and results with mean-shift [26] (bottom-right). Fast depth image denoising and enhancement using a deep convolutional network Xin Zhang and Ruiyuan Wu Acoustics, Speech and Signal Processing (ICASSP), 2016 IEE https://doi.org/10.1109/ICASSP.2016.7472127
  • 66. PipelineDepth image enhancement #4b Deep learning-based depth refining techniques Guided deep network for depth map super-resolution: How much can color help? Wentian Zhou ; Xin Li ; Daryl Reynolds Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE https://doi.org/10.1109/ICASSP.2017.7952398 https://anvoy.github.io/publication.html Depth map upsampling using joint edge-guided convolutional neural network for virtual view synthesizing Yan Dong; Chunyu Lin; Yao Zhao; Chao Yao Journal of Electronic Imaging Volume 26, Issue 4 http://dx.doi.org/10.1117/1.JEI.26.4.043004 Depth map upsampling. Input: (a) low-resolution depth map and (b) the corresponding color image. Output: (c) recovered high-resolution depth map. When the depth edges become unreliable, our network tends to rely on color-based prediction network (CBPN) for restoring more accurate depth edges. Therefore, contribution of color image increases when the reliability of the LR depth map decreases (e.g., as noise gets stronger). We adopt the popular deep CNN to learn non-linear mapping between LR and HR depth maps. Furthermore, a novel color-based prediction network is proposed to properly exploit supplementary color information in addition to the depth enhancement network. In our experiments, we have shown that deep neural network based approach is superior to several existing state-of-the-art methods. Further comparisons are reported to confirm our analysis that the contributions of color image vary significantly depending on the reliability of LR depth maps.
  • 67. Future Image restoration Depth Images (Laser scanning)
  • 68. PipelineLaser range Finding #1a Versatile Approach to Probabilistic Modeling of Hokuyo UTM-30LX IEEE Sensors Journal ( Volume: 16, Issue: 6, March 15, 2016 ) https://doi.org/10.1109/JSEN.2015.2506403 When working with Laser Range Finding (LRF), it is necessary to know the sensor’s measurement principle and its properties. There are several measurement principles used in LRFs [Nejad and Olyaee 2006], [Łabęcki et al. 2012], [Adams 1999]: ● Triangulation ● Time of flight (TOF) ● Frequency modulation continuous wave (FMCW) ● Phase shift measurement (PSM) (the TOF and PSM range equations are sketched below) The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry Soudarissanane, S.S. TU Delft. Doctoral Thesis (2016) http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5 Distance measurement principle of time-of-flight laser scanners (top) and phase-based laser scanners (bottom). Laser Range Finding: Image formation #1
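The TOF and phase-shift (PSM) principles listed above reduce to two small range equations, sketched here together with the ambiguity interval of a phase-based scanner; the example values are purely illustrative.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def range_time_of_flight(round_trip_time_s):
    """Pulse TOF: the pulse travels to the target and back, so range = c * t / 2."""
    return C * round_trip_time_s / 2.0

def range_phase_shift(delta_phi_rad, modulation_freq_hz):
    """Phase-shift measurement: range = c * delta_phi / (4 * pi * f_mod),
    unique only within the ambiguity interval c / (2 * f_mod)."""
    return C * delta_phi_rad / (4.0 * math.pi * modulation_freq_hz)

def ambiguity_interval(modulation_freq_hz):
    """Maximum unambiguous range of a phase-based scanner."""
    return C / (2.0 * modulation_freq_hz)

# Example: a 10 MHz modulation gives a ~15 m ambiguity interval;
# a measured phase shift of pi/2 then corresponds to ~3.75 m.
```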
• 69. Pipeline Laser Range Finding #1b
Laser Range Finding : Image formation #2

The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry
Soudarissanane, S.S. TU Delft, Doctoral Thesis (2016). http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5
Two-way link budget between the receiver (Rx) and the transmitter (Tx) in a Free Space Path (FSP) propagation model. Schematic representation of the signal propagation from the transmitter to the receiver.
Effect of increasing incidence angle and range on the signal deterioration: (left) signal deterioration due to increasing incidence angle α, (right) signal deterioration due to increasing range ρ, with ρmin = 0 m and ρmax = 100 m.
Relationship between the scan angle and the normal vector orientation used for the segmentation of a point cloud with respect to planar features. A point P = [θ, φ, ρ] is measured on a plane with normal parameters N = [α, β, γ]. The different angles used for the range image gradients are plotted.
Theoretical number of points: (left) number of points with respect to the orientation of the patch and the distance; practical example of a 1×1 m plate placed at 3 m, oriented at 0° and rotated to 60°.
Reference plate measurement set-up: a white coated plywood board mounted on a tripod via a screw clamp mechanism provided with a 2° precision goniometer.
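A hedged toy version of the deterioration behaviour described above (assuming a Lambertian surface and a free-space 1/ρ² falloff, with all instrument constants dropped) makes the combined effect of incidence angle and range easy to tabulate:

```python
import numpy as np

def relative_received_power(range_m, incidence_angle_rad, range_ref_m=1.0):
    """Toy deterioration model: Lambertian backscatter scales with the cosine
    of the incidence angle and free-space propagation with 1/range^2.
    Instrument constants are dropped, so values are relative to a patch
    hit at normal incidence from range_ref_m."""
    return np.cos(incidence_angle_rad) * (range_ref_m / range_m) ** 2

# tabulate the combined effect of range and grazing incidence
for rho in (1.0, 10.0, 50.0):
    for alpha_deg in (0.0, 45.0, 80.0):
        p = relative_received_power(rho, np.radians(alpha_deg))
        print(f"rho = {rho:5.1f} m  alpha = {alpha_deg:4.1f} deg  ->  {p:.2e}")
```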
• 70. Pipeline Laser Range Finding #1c
Laser Range Finding : Image formation #3

The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry
Soudarissanane, S.S. TU Delft, Doctoral Thesis (2016). http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5
Terrestrial Laser Scanning (TLS) good practice of survey planning.

Future directions: At the time this research started, terrestrial laser scanners were mainly used by research institutes and manufacturers. Nowadays, however, terrestrial laser scanners are present in almost every field of work, e.g. forensics, architecture, civil engineering, the gaming industry and the movie industry. Mobile mapping systems, such as scanners capturing a scene while driving a car, or scanners mounted on drones, currently use the same range determination techniques as terrestrial laser scanners. The number of applications that make use of 3D point clouds is growing rapidly, and the need for a product of sound quality is all the more significant as it impacts the quality of a wide range of end-products.
• 71. Pipeline Laser Range Finding #1d
Laser Range Finding : Image formation #4

Ray-Tracing Method for Deriving Terrestrial Laser Scanner Systematic Errors
Derek D. Lichti, Ph.D., P.Eng.
Journal of Surveying Engineering, Volume 143, Issue 2, May 2017. https://www.doi.org/10.1061/(ASCE)SU.1943-5428.0000213

Error model of direct georeferencing procedure of terrestrial laser scanning
Pandžić, Jelena; Pejić, Marko; Božić, Branko; Erić, Verica
Automation in Construction, Volume 78, June 2017, Pages 13-23. https://doi.org/10.1016/j.autcon.2017.01.003
• 72. Pipeline Laser Range Finding #2a : Calibration #1

Statistical Calibration Algorithms for Lidars
Anas Alhashimi, Luleå University of Technology, Control Engineering Licentiate thesis (2016), ORCID iD: 0000-0001-6868-2210
Estimating d without being aware of the mode hopping, i.e. assuming a certain λ0 without knowing that the average λ jumps between different lasing modes, thus shows up as a multimodal measurement of d. Potential temperature-bias dependencies for the polynomial model. The plot explains the cavity modes, gain profile and lasing modes for a typical laser diode: the upper drawing shows the wavelength ν1 as the dominant lasing mode, while the lower drawing shows both wavelengths ν1 and ν2 competing; this latter case is responsible for the mode-hopping effects.

A rigorous cylinder-based self-calibration approach for terrestrial laser scanners
Ting On Chan; Derek D. Lichti; David Belton
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 99, January 2015. https://doi.org/10.1016/j.isprsjprs.2014.11.003
The proposed method and its variants were first applied to two simulated datasets, to compare their effectiveness, and then to three real datasets captured by three different types of scanners: a Faro Focus 3D (a phase-based panoramic scanner); a Velodyne HDL-32E (a pulse-based multi-spinning-beam scanner); and a Leica ScanStation C10 (a dual operating-mode scanner).
In situ self-calibration is essential for terrestrial laser scanners (TLSs) to maintain high accuracy in many applications such as structural deformation monitoring (Lindenbergh, 2010). This is particularly true for aged TLSs and instruments operated for long hours outdoors under varying environmental conditions. Although plane-based methods are now widely adopted for TLS calibration, they suffer from high parameter correlation when there is low diversity in the plane orientations (Chow et al., 2013). In practice, not all locations offer large and smooth planar features that can be used for calibration, and even where planar features are available, their planarity is not always guaranteed. Because of the drawbacks of point-based and plane-based calibration, an alternative geometric feature, namely circular cylindrical features (e.g. Rabbani et al., 2007), should be considered and incorporated into the self-calibration procedure.
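As a simplified illustration of what plane-based self-calibration estimates (reduced here, as an assumption for the sketch, to a single additive range bias against a known reference plane; real TLS self-calibration solves many more additional parameters together with the plane parameters in an adjustment), a closed-form least-squares solve looks like this:

```python
import numpy as np

def estimate_range_bias(directions, measured_ranges, plane_n, plane_d):
    """Closed-form least-squares estimate of a constant additive range bias b,
    assuming the corrected points (rho_i - b) * u_i lie on the known reference
    plane n.x = d. 'directions' are unit ray vectors u_i in the scanner frame."""
    a = directions @ plane_n                       # a_i = n . u_i
    return float(np.sum(a * (measured_ranges * a - plane_d)) / np.sum(a * a))

# synthetic check: plane x = 5 m, rays within +/-30 deg of the x-axis,
# true additive range bias of +2 cm and 3 mm of range noise
rng = np.random.default_rng(0)
theta, phi = rng.uniform(-np.pi / 6, np.pi / 6, size=(2, 200))
u = np.column_stack([np.cos(theta) * np.cos(phi),
                     np.sin(theta) * np.cos(phi),
                     np.sin(phi)])
n, d, true_bias = np.array([1.0, 0.0, 0.0]), 5.0, 0.02
rho = d / (u @ n) + true_bias + rng.normal(0.0, 0.003, 200)
print(estimate_range_bias(u, rho, n, d))           # expected ~0.02
```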
• 73. Pipeline Laser Range Finding #2b : Calibration #2

Calibration of a multi-beam Laser System by using a TLS-generated Reference
Gordon, M.; Meidow, J.
ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-5/W2, 2013, pp. 85-90. http://dx.doi.org/10.5194/isprsannals-II-5-W2-85-2013
It is assumed that the accuracy of this point cloud is considerably higher than that of the multi-beam LiDAR and that the data represent faces of man-made objects at different distances. The Velodyne HDL-64E S2 system is inspected as the best-known representative of this kind of sensor system, while the Z+F Imager 5010 serves as the reference. Besides the improvement of point accuracy by considering the calibration results, the significance of the parameters related to the sensor model is tested and the uncertainty of measurements w.r.t. the measured distances is considered. The standard deviation of the planar misclosure is nearly halved, from 3.2 cm to 1.7 cm. The variance component estimation, as well as the standard deviation of the range residuals, reveals that the manufacturer's stated distance accuracy of 2 cm is somewhat optimistic. The histograms of the planar misclosures and the residuals show that these quantities are not normally distributed; the distance-dependent change in misclosure variance is one reason. Other sources were investigated by Glennie and Lichti (2010): the incidence angle and the vertical angle. A further possibility is the focal distance, which differs for each laser and averages 8 m for the lower block and 15 m for the upper block; this may introduce a distance-dependent but nonlinear variance change. Further research is needed to find the sources of these observations.

Extrinsic calibration of a multi-beam LiDAR system with improved intrinsic laser parameters using v-shaped planes and infrared images
Po-Sen Huang; Wen-Bin Hong; Hsiang-Jen Chien; Chia-Yen Chen
IVMSP Workshop, 2013 IEEE 11th. https://doi.org/10.1109/IVMSPW.2013.6611921
The Velodyne HDL-64E S2, the LiDAR system studied in this work, is a mobile scanner consisting of 64 pairs of laser emitter-receiver rigidly attached to a rotating motor, providing real-time panoramic range data with measurement errors of around 2.5 mm. The paper proposes a method to use IR images as feedback in finding optimized intrinsic and extrinsic parameters of the LiDAR-vision scanner. First, the IR-based calibration technique is applied to a LiDAR system that fires multiple beams, which significantly increases the problem's complexity and difficulty. Second, the adjustment is applied not only to the extrinsic parameters, but also to the laser parameters as well as the intrinsic parameters of the camera. Third, two different objective functions are used to avoid generalization failure of the optimized parameters.
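The planar-misclosure statistic used by Gordon and Meidow to quantify the calibration gain can be reproduced in a few lines: fit a reference plane to the TLS points and report the standard deviation of the multi-beam LiDAR point-to-plane residuals before and after applying the calibration. The sketch below uses synthetic placeholder arrays whose noise levels merely echo the reported 3.2 cm / 1.7 cm figures; function names are invented for the example.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a TLS reference patch: centroid plus the
    right singular vector belonging to the smallest singular value."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)
    n = vt[-1]
    return n, float(n @ c)          # plane n.x = d

def planar_misclosure_std(lidar_points, plane_n, plane_d):
    """Standard deviation of the point-to-plane residuals (misclosures)."""
    n = plane_n / np.linalg.norm(plane_n)
    return (lidar_points @ n - plane_d).std(ddof=1)

# synthetic placeholder data: a wall patch at z = 2 m
rng = np.random.default_rng(1)
xy = rng.uniform(0, 4, size=(500, 2))
tls = np.column_stack([xy, 2.0 + rng.normal(0, 0.002, 500)])   # reference cloud
raw = np.column_stack([xy, 2.0 + rng.normal(0, 0.032, 500)])   # before calibration
cal = np.column_stack([xy, 2.0 + rng.normal(0, 0.017, 500)])   # after calibration

n, d = fit_plane(tls)
print(planar_misclosure_std(raw, n, d), planar_misclosure_std(cal, n, d))
```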