Susang Kim (healess1@gmail.com)
3D Representation
GIRAFFE : Representing Scenes as Compositional Generative Neural Feature Fields
(CVPR 2021 Best Paper Award)
Related work & References
NeRF : Neural Radiance Fields (ECCV 2020 - Best Paper Honorable Mention)
The input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)); the output is the volume density and the view-dependent emitted radiance at that spatial location.
FΘ : (x, d) → (c, σ), whose weights Θ are optimized to map each input 5D coordinate to its corresponding volume density and directionally emitted color.
Positional encoding : γ(·) is applied separately to each of the three coordinate values in x (which are
normalized to lie in [−1, 1]) and to the three components of the Cartesian viewing direction unit vector d
(which by construction lie in [−1, 1]). In our experiments, we set L = 10 for γ(x) and L = 4 for γ(d).
Mapping the inputs to this higher-dimensional space enables the MLP to more easily approximate a higher-frequency function.
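A minimal NumPy sketch of this positional encoding; the function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar in p (assumed normalized to [-1, 1]) to 2L features.

    Implements gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^(L-1) pi p), cos(2^(L-1) pi p)).
    """
    p = np.asarray(p)[..., None]             # (..., 1)
    freqs = (2.0 ** np.arange(L)) * np.pi    # (L,) frequency octaves
    angles = p * freqs                       # (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# NeRF uses L = 10 for the 3 coordinates of x and L = 4 for d:
x = np.array([0.1, -0.3, 0.7])
gamma_x = positional_encoding(x, L=10)       # shape (3, 20)
```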
GRAF: Generative Radiance Fields (NeurIPS 2020)
A generative model for radiance fields for high-resolution, 3D-aware image synthesis from unposed images. A patch-based discriminator samples the image at multiple scales, which is key to learning high-resolution generative radiance fields efficiently.
(Figure: camera matrix, camera pose, and the 2D sampling pattern; Γ(I, ν) denotes the bilinear sampling operation.)
GRAF generates high-resolution images with better multi-view consistency than voxel-based approaches. Its limitation: results are restricted to simple scenes with single objects (possible remedies include inductive biases such as depth maps or symmetry, and scaling to real-world scenes). The shape and appearance latent codes can have different dimensionalities.
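As a hedged illustration of the multi-scale patch sampling behind GRAF's discriminator, the sketch below draws a K×K grid of query coordinates at a random scale and offset; the function name, the K = 16 default, and the sampling ranges are assumptions, not the paper's exact scheme:

```python
import numpy as np

def sample_patch_coords(height, width, K=16, rng=np.random):
    """Sample a K x K grid of pixel coordinates at a random scale and offset.

    Small scales yield local, high-frequency patches; scales near 1 yield
    a global low-resolution view, so one discriminator covers both. Real
    and generated images are then queried at these coordinates via
    bilinear sampling, i.e. Gamma(I, nu).
    """
    s = rng.uniform(K / min(height, width), 1.0)       # random patch scale
    uy = rng.uniform(0.0, (1.0 - s) * (height - 1))    # random top offset
    ux = rng.uniform(0.0, (1.0 - s) * (width - 1))     # random left offset
    ys = uy + np.linspace(0.0, s * (height - 1), K)
    xs = ux + np.linspace(0.0, s * (width - 1), K)
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (K, K, 2)
```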
Fréchet Inception Distance (FID) (NeurIPS 2017)
"GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium"
For evaluating GANs at image generation, FID captures the similarity of generated images to real ones better and more consistently than the Inception Score.
FID((m, C), (m_w, C_w)) = ‖m − m_w‖₂² + Tr(C + C_w − 2(C C_w)^{1/2}),
where (m, C) and (m_w, C_w) are the mean and covariance of the Gaussians fitted to Inception features of real and generated images, and Tr(A) denotes the trace of a square matrix A.
GIRAFFE reports the FID score to quantify image quality, computed over 20,000 real and fake samples.
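A minimal sketch of the FID computation given two sets of Inception features (the feature extraction itself is omitted; `real_feats` and `fake_feats` are assumed to be N×2048 arrays):

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """Frechet distance between Gaussians fitted to Inception features.

    FID = ||m - m_w||^2 + Tr(C + C_w - 2 (C C_w)^(1/2))
    """
    m, c = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mw, cw = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(c @ cw)          # matrix square root
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((m - mw) ** 2) + np.trace(c + cw - 2 * covmean))
```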
GIRAFFE: Representing Scenes as Compositional
Generative Neural Feature Fields
(CVPR 2021 Best Paper Award)
Abstract
Most approaches, however, do not consider the compositional nature of scenes, and they reason in 2D rather than 3D. A key limitation of NeRF and GRAF is that the entire scene is represented by a single model. While many methods learn to disentangle the underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional.
Incorporating a compositional 3D scene
representation into the generative model
leads to more controllable image synthesis.
Our model is able to disentangle individual
objects and allows for translating and
rotating them in the scene as well as
changing the camera pose.
Image Generation
(Figure: disentangled representations, shown without and with changing an attribute.)
Definitions of disentanglement vary, but commonly refer to being able to control an attribute of
interest, e.g. object shape, size, or pose, without changing other attributes.
Overview
Incorporating a compositional
3D scene representation
A novel method for generating scenes in a controllable and photorealistic manner while training from
raw unstructured image collections
A neural renderer processes these feature images and outputs the final renderings, enabling controllable image synthesis.
GIRAFFE achieves high-quality images and scales to real-world scenes.
Method
Object Representation: to disentangle different entities in the scene, each object is represented using a separate feature field in combination with an affine transformation T = {s, t, R}, with scale s, translation t, and rotation matrix R ∈ SO(3); points are mapped from the canonical object space into the scene via k(x) = R · diag(s₁, s₂, s₃) · x + t.
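A small sketch of this per-object affine transformation, assuming for simplicity a rotation about the z-axis; names and values are illustrative:

```python
import numpy as np

def object_to_scene(x, scale, translation, yaw):
    """Map points x (N, 3) from canonical object space into the scene:
    k(x) = R @ diag(s1, s2, s3) @ x + t.
    Here R is a rotation about the z-axis for simplicity.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (R @ (np.diag(scale) @ x.T)).T + translation

# Evaluating an object's feature field at a scene point uses the
# inverse transform, i.e. the field is queried at k^-1(x).
```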
Neural Radiance Fields: the low-dimensional inputs x and d need to be mapped to higher-dimensional features to represent complex signals when f is parameterized with an MLP. This is done with the positional encoding γ(t, L) = (sin(2⁰tπ), cos(2⁰tπ), …, sin(2^{L−1}tπ), cos(2^{L−1}tπ)), where t is a scalar component of x or d and L is the number of frequency octaves, which determines the output dimensionality of the encoding. A radiance field maps a 3D point x and viewing direction d to a volume density σ and an RGB color value c.
Generative Neural Feature Fields: to learn a latent space of NeRFs, the MLP is conditioned on shape and appearance codes. GIRAFFE further replaces the 3D color output c with a generic multi-dimensional feature f.
GIRAFFE
A novel method for generating scenes in a controllable and photorealistic manner while training from raw unstructured image collections.
(Pipeline figure: objects and the background are modeled as separate feature fields, each with its own shape and appearance codes; orange indicates learnable and blue non-learnable operations.)
At shared sample points, entities are composed by summing their densities and taking the density-weighted mean of their features; volume rendering then composites the result along the ray through a given pixel, using transmittance and alpha values derived from the distance between neighboring sample points.
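A hedged NumPy sketch of these two steps, composition followed by standard NeRF-style quadrature along the ray; variable names are illustrative:

```python
import numpy as np

def compose_fields(sigmas_i, feats_i):
    """Combine N entities (objects + background) at shared sample points:
    total density is the sum, features are the density-weighted mean.

    sigmas_i: (N, S) per-entity densities
    feats_i:  (N, S, M) per-entity features
    """
    sigma = sigmas_i.sum(axis=0)                          # (S,)
    f = (sigmas_i[..., None] * feats_i).sum(axis=0) \
        / np.maximum(sigma[:, None], 1e-8)                # (S, M)
    return sigma, f

def render_ray(sigmas, feats, t_vals):
    """Composite features along one ray.

    sigmas: (S,) volume densities at S sample points
    feats:  (S, M) feature vectors (RGB in NeRF, M_f-dim in GIRAFFE)
    t_vals: (S,) depths of the sample points along the ray
    """
    deltas = np.append(np.diff(t_vals), 1e10)   # neighbor distances;
                                                # last interval open-ended
    alphas = 1.0 - np.exp(-sigmas * deltas)     # opacity per sample
    # transmittance: probability the ray reaches sample j unoccluded
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                    # (S,)
    return (weights[:, None] * feats).sum(axis=0)   # (M,) composited feature
```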
Feature Fields Architecture: a 3D point x and viewing direction d, together with the latent shape and appearance codes z_s, z_a, are processed by fully-connected layers (yellow) with ReLU activations (red); the positional encoding γ is applied to the viewing direction d, and γ(d) is concatenated with the latent appearance code z_a.
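A compact PyTorch sketch of such a conditioned feature field; the layer sizes and the exact way the codes are injected are assumptions, and the paper's architecture has more blocks:

```python
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """MLP mapping (gamma(x), z_s) to a density sigma and, together with
    (gamma(d), z_a), an M_f-dimensional feature f."""
    def __init__(self, dim_gx, dim_gd, dim_zs, dim_za, hidden=128, m_f=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(dim_gx + dim_zs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.feat_head = nn.Sequential(
            nn.Linear(hidden + dim_gd + dim_za, hidden), nn.ReLU(),
            nn.Linear(hidden, m_f),
        )

    def forward(self, gx, gd, z_s, z_a):
        h = self.trunk(torch.cat([gx, z_s], dim=-1))
        sigma = self.sigma_head(h)
        # gamma(d) is concatenated with the appearance code for the feature
        f = self.feat_head(torch.cat([h, gd, z_a], dim=-1))
        return sigma, f
```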
2D Neural Rendering
StyleGAN architecture : Analyzing and Improving
the Image Quality of StyleGAN (CVPR 2020)
The neural rendering operator maps the feature image to an RGB image at every spatial resolution, and adds the previous output to the next via bilinear upsampling. These skip connections ensure a strong gradient flow back to the feature fields. The final image prediction Î is obtained by applying a sigmoid activation to the last RGB layer.
Gray color indicates outputs, orange learnable,
and blue non-learnable operations.
(Figure: N rendering blocks followed by the final stage; the whole model is trained as a GAN.)
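A hedged PyTorch sketch of one upsampling block with the RGB skip connection; channel counts and block structure are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RenderBlock(nn.Module):
    """Upsample features, refine them, and update a running RGB image."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.to_rgb = nn.Conv2d(c_out, 3, 3, padding=1)

    def forward(self, feat, rgb):
        feat = F.leaky_relu(self.conv(
            F.interpolate(feat, scale_factor=2, mode="bilinear")))
        # skip connection: upsample the previous RGB output and add
        # this resolution's RGB prediction
        rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear") \
              + self.to_rgb(feat)
        return feat, rgb

# After N blocks, the final prediction is torch.sigmoid(rgb).
```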
Datasets PhotoShape: Photorealistic Materials for Large-Scale Shape Collections (ACM 2018)
CLEVR: A Diagnostic Dataset for Compositional Language
and Elementary Visual Reasoning (CVPR 2017)
We present a diagnostic dataset that tests a range of visual reasoning
abilities. It contains minimal biases and has detailed annotations.
(3D Shapes : Color, Material, Rotation, Size)
The script renders multi-object scenes of random primitives; the camera position is adjusted to have a rotation of 0° instead of 43°. Renderings and positions of the placed primitives are saved to files, and during training the translations of object feature fields are sampled from the saved positions. Controllable Image Synthesis (CIS) and GIRAFFE are compared on scenes with 0, 1, 2, or 3 primitives (Clevr-0123) at 64² pixels. https://knowyourdata-tfds.withgoogle.com/#tab=STATS&dataset=clevr
PhotoShape automatically assigns high-quality, realistic appearance models to large-scale 3D shape collections.
Datasets (in the supplementary document)
Dataset Parameters : object rotation, background rotation, camera elevation, horizontal and depth translation, and object size are sampled from uniform distributions over the indicated ranges (see the sketch at the end of this slide). For the Clevr datasets, object locations are sampled from the distribution obtained during dataset generation.
Image Center Cropping : center crop (CelebA, CelebA-HQ)
Image Random Cropping : rescale and random crop (CompCars)
Data Augmentation : for all experiments, images are randomly flipped horizontally during training
Clevr Dataset Generation: the script that renders multi-object scenes of random primitives adjusts the camera position to have a rotation of 0° instead of 43°.
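A minimal sketch of sampling these scene parameters from uniform priors; the ranges below are placeholders, since the actual ranges are dataset-specific:

```python
import numpy as np

def sample_scene_params(rng=np.random):
    """Draw one set of pose/transformation parameters from uniform priors."""
    return {
        "object_rotation": rng.uniform(0.0, 2 * np.pi),    # radians
        "background_rotation": rng.uniform(0.0, 2 * np.pi),
        "camera_elevation": rng.uniform(0.1, 0.3),         # placeholder range
        "translation_xy": rng.uniform(-0.2, 0.2, size=2),  # horizontal/depth
        "object_scale": rng.uniform(0.8, 1.2),
    }
```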
Experiments: the model correctly disentangles individual objects when trained on multi-object scenes with a fixed or varying number of objects.
Conclusion & Limitations
Conclusion
A novel method for controllable image synthesis.
Incorporate a compositional 3D scene representation into the generative model
Disentangle individual objects from the background as well as their shape and
appearance without explicit supervision
Limitations (Dataset Bias & Object Transformation Distributions)
Investigate how the distributions over object-level transformations and camera poses can be learned from data (the method currently assumes simple uniform priors).
Incorporating supervision that is easy to obtain (e.g., predicted object masks) could help scale to more complex, multi-object scenes.
(Failure case: eye and hair attributes remain entangled.)
Thanks
Any Questions?
You can send mail to
Susang Kim (healess1@gmail.com)

[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
