Susang Kim (healess1@gmail.com)
3D Representation
GIRAFFE : Representing Scenes as Compositional Generative Neural Feature Fields
(CVPR 2021 Best Paper Award)
Related work & References
NeRF : Neural Radiance Fields (ECCV 2020 - Best Paper Honorable Mention)
The input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)); the output is the volume density and the view-dependent emitted radiance at that spatial location.
FΘ : (x, d) → (c, σ), whose weights Θ are optimized to map each input 5D coordinate to its corresponding volume density and directionally emitted color.
Positional encoding : γ(·) is applied separately to each of the three coordinate values in x (which are
normalized to lie in [−1, 1]) and to the three components of the Cartesian viewing direction unit vector d
(which by construction lie in [−1, 1]). In our experiments, we set L = 10 for γ(x) and L = 4 for γ(d).
Mapping the inputs to this higher-dimensional space enables the MLP to more easily approximate a higher-frequency function.
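A minimal NumPy sketch of this positional encoding; the function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar in p (assumed normalized to [-1, 1]) to 2L features.

    Implements gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ...,
    sin(2^(L-1) pi p), cos(2^(L-1) pi p)).
    """
    p = np.asarray(p)[..., None]             # (..., 1)
    freqs = (2.0 ** np.arange(L)) * np.pi    # (L,) frequency octaves
    angles = p * freqs                       # (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# NeRF uses L = 10 for the 3 coordinates of x and L = 4 for d:
x = np.array([0.1, -0.3, 0.7])
gamma_x = positional_encoding(x, L=10)       # shape (3, 20)
```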
GRAF: Generative Radiance Fields (NeurIPS 2020)
A generative model for radiance fields for high-resolution, 3D-aware image synthesis from unposed images. A patch-based discriminator samples the image at multiple scales, which is key to learning high-resolution generative radiance fields efficiently.
(Figure: camera matrix, camera pose, and the 2D sampling pattern; Γ(I, ν) denotes the bilinear sampling operation.)
GRAF generates high-resolution images with better multi-view consistency than voxel-based approaches. Its limitation: results are restricted to simple scenes with single objects (possible remedies include inductive biases such as depth maps or symmetry, and scaling to real-world scenes). The shape and appearance latent codes can have different dimensionalities.
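As a hedged illustration of the multi-scale patch sampling behind GRAF's discriminator, the sketch below draws a K×K grid of query coordinates at a random scale and offset; the function name, the K = 16 default, and the sampling ranges are assumptions, not the paper's exact scheme:

```python
import numpy as np

def sample_patch_coords(height, width, K=16, rng=np.random):
    """Sample a K x K grid of pixel coordinates at a random scale and offset.

    Small scales yield local, high-frequency patches; scales near 1 yield
    a global low-resolution view, so one discriminator covers both. Real
    and generated images are then queried at these coordinates via
    bilinear sampling, i.e. Gamma(I, nu).
    """
    s = rng.uniform(K / min(height, width), 1.0)       # random patch scale
    uy = rng.uniform(0.0, (1.0 - s) * (height - 1))    # random top offset
    ux = rng.uniform(0.0, (1.0 - s) * (width - 1))     # random left offset
    ys = uy + np.linspace(0.0, s * (height - 1), K)
    xs = ux + np.linspace(0.0, s * (width - 1), K)
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (K, K, 2)
```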
Fréchet Inception Distance (FID) (NeurIPS 2017)
"GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium"
For evaluating GANs at image generation, FID captures the similarity of generated images to real ones better and more consistently than the Inception Score.
FID((m, C), (m_w, C_w)) = ‖m − m_w‖₂² + Tr(C + C_w − 2(C C_w)^{1/2}),
where (m, C) and (m_w, C_w) are the mean and covariance of the Gaussians fitted to Inception features of real and generated images, and Tr(A) denotes the trace of a square matrix A.
GIRAFFE reports the FID score to quantify image quality, computed over 20,000 real and fake samples.
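A minimal sketch of the FID computation given two sets of Inception features (the feature extraction itself is omitted; `real_feats` and `fake_feats` are assumed to be N×2048 arrays):

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """Frechet distance between Gaussians fitted to Inception features.

    FID = ||m - m_w||^2 + Tr(C + C_w - 2 (C C_w)^(1/2))
    """
    m, c = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mw, cw = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(c @ cw)          # matrix square root
    if np.iscomplexobj(covmean):            # drop tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((m - mw) ** 2) + np.trace(c + cw - 2 * covmean))
```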
GIRAFFE: Representing Scenes as Compositional
Generative Neural Feature Fields
(CVPR 2021 Best Paper Award)
Abstract
Most approaches, however, do not consider the compositional nature of scenes, and they reason in 2D rather than 3D. A key limitation of NeRF and GRAF is that the entire scene is represented by a single model. While many methods learn to disentangle the underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional.
Incorporating a compositional 3D scene
representation into the generative model
leads to more controllable image synthesis.
Our model is able to disentangle individual
objects and allows for translating and
rotating them in the scene as well as
changing the camera pose.
Image Generation
(Figure: disentangled representations, shown without and with changing an attribute.)
Definitions of disentanglement vary, but commonly refer to being able to control an attribute of
interest, e.g. object shape, size, or pose, without changing other attributes.
Overview
Incorporating a compositional
3D scene representation
A novel method for generating scenes in a controllable and photorealistic manner while training from
raw unstructured image collections
A neural renderer processes these feature images and outputs the final renderings, enabling controllable image synthesis.
GIRAFFE achieves high-quality images and scales to real-world scenes.
Method
Object Representation: to disentangle different entities in the scene, each object is represented using a separate feature field in combination with an affine transformation T = {s, t, R}, with scale s, translation t, and rotation matrix R ∈ SO(3); points are mapped from the canonical object space into the scene via k(x) = R · diag(s₁, s₂, s₃) · x + t.
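A small sketch of this per-object affine transformation, assuming for simplicity a rotation about the z-axis; names and values are illustrative:

```python
import numpy as np

def object_to_scene(x, scale, translation, yaw):
    """Map points x (N, 3) from canonical object space into the scene:
    k(x) = R @ diag(s1, s2, s3) @ x + t.
    Here R is a rotation about the z-axis for simplicity.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (R @ (np.diag(scale) @ x.T)).T + translation

# Evaluating an object's feature field at a scene point uses the
# inverse transform, i.e. the field is queried at k^-1(x).
```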
Neural Radiance Fields: the low-dimensional inputs x and d need to be mapped to higher-dimensional features to represent complex signals when f is parameterized with an MLP. This is done with the positional encoding γ(t, L) = (sin(2⁰tπ), cos(2⁰tπ), …, sin(2^{L−1}tπ), cos(2^{L−1}tπ)), where t is a scalar component of x or d and L is the number of frequency octaves, which determines the output dimensionality of the encoding. A radiance field maps a 3D point x and viewing direction d to a volume density σ and an RGB color value c.
Generative Neural Feature Fields: to learn a latent space of NeRFs, the MLP is conditioned on shape and appearance codes. GIRAFFE further replaces the 3D color output c with a generic multi-dimensional feature f.
GIRAFFE
A novel method for generating scenes in a controllable and photorealistic manner while training from raw unstructured image collections.
(Pipeline figure: objects and the background are modeled as separate feature fields, each with its own shape and appearance codes; orange indicates learnable and blue non-learnable operations.)
At shared sample points, entities are composed by summing their densities and taking the density-weighted mean of their features; volume rendering then composites the result along the ray through a given pixel, using transmittance and alpha values derived from the distance between neighboring sample points.
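A hedged NumPy sketch of these two steps, composition followed by standard NeRF-style quadrature along the ray; variable names are illustrative:

```python
import numpy as np

def compose_fields(sigmas_i, feats_i):
    """Combine N entities (objects + background) at shared sample points:
    total density is the sum, features are the density-weighted mean.

    sigmas_i: (N, S) per-entity densities
    feats_i:  (N, S, M) per-entity features
    """
    sigma = sigmas_i.sum(axis=0)                          # (S,)
    f = (sigmas_i[..., None] * feats_i).sum(axis=0) \
        / np.maximum(sigma[:, None], 1e-8)                # (S, M)
    return sigma, f

def render_ray(sigmas, feats, t_vals):
    """Composite features along one ray.

    sigmas: (S,) volume densities at S sample points
    feats:  (S, M) feature vectors (RGB in NeRF, M_f-dim in GIRAFFE)
    t_vals: (S,) depths of the sample points along the ray
    """
    deltas = np.append(np.diff(t_vals), 1e10)   # neighbor distances;
                                                # last interval open-ended
    alphas = 1.0 - np.exp(-sigmas * deltas)     # opacity per sample
    # transmittance: probability the ray reaches sample j unoccluded
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                    # (S,)
    return (weights[:, None] * feats).sum(axis=0)   # (M,) composited feature
```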
Feature Fields Architecture: a 3D point x and viewing direction d, together with the latent shape and appearance codes z_s, z_a, are processed by fully-connected layers (yellow) with ReLU activations (red); the positional encoding γ is applied to the viewing direction d, and γ(d) is concatenated with the latent appearance code z_a.
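A compact PyTorch sketch of such a conditioned feature field; the layer sizes and the exact way the codes are injected are assumptions, and the paper's architecture has more blocks:

```python
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """MLP mapping (gamma(x), z_s) to a density sigma and, together with
    (gamma(d), z_a), an M_f-dimensional feature f."""
    def __init__(self, dim_gx, dim_gd, dim_zs, dim_za, hidden=128, m_f=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(dim_gx + dim_zs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)
        self.feat_head = nn.Sequential(
            nn.Linear(hidden + dim_gd + dim_za, hidden), nn.ReLU(),
            nn.Linear(hidden, m_f),
        )

    def forward(self, gx, gd, z_s, z_a):
        h = self.trunk(torch.cat([gx, z_s], dim=-1))
        sigma = self.sigma_head(h)
        # gamma(d) is concatenated with the appearance code for the feature
        f = self.feat_head(torch.cat([h, gd, z_a], dim=-1))
        return sigma, f
```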
2D Neural Rendering
StyleGAN architecture : Analyzing and Improving
the Image Quality of StyleGAN (CVPR 2020)
The neural rendering operator maps the feature image to an RGB image at every spatial resolution, and adds the previous output to the next via bilinear upsampling. These skip connections ensure a strong gradient flow back to the feature fields. The final image prediction Î is obtained by applying a sigmoid activation to the last RGB layer.
Gray color indicates outputs, orange learnable,
and blue non-learnable operations.
(Figure: N rendering blocks followed by the final stage; the whole model is trained as a GAN.)
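A hedged PyTorch sketch of one upsampling block with the RGB skip connection; channel counts and block structure are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RenderBlock(nn.Module):
    """Upsample features, refine them, and update a running RGB image."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.to_rgb = nn.Conv2d(c_out, 3, 3, padding=1)

    def forward(self, feat, rgb):
        feat = F.leaky_relu(self.conv(
            F.interpolate(feat, scale_factor=2, mode="bilinear")))
        # skip connection: upsample the previous RGB output and add
        # this resolution's RGB prediction
        rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear") \
              + self.to_rgb(feat)
        return feat, rgb

# After N blocks, the final prediction is torch.sigmoid(rgb).
```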
Datasets PhotoShape: Photorealistic Materials for Large-Scale Shape Collections (ACM 2018)
CLEVR: A Diagnostic Dataset for Compositional Language
and Elementary Visual Reasoning (CVPR 2017)
We present a diagnostic dataset that tests a range of visual reasoning
abilities. It contains minimal biases and has detailed annotations.
(3D Shapes : Color, Material, Rotation, Size)
The script renders multi-object scenes of random primitives; the camera position is adjusted to have a rotation of 0° instead of 43°. Renderings and positions of the placed primitives are saved to files, and during training the translations of object feature fields are sampled from the saved positions. Controllable Image Synthesis (CIS) and GIRAFFE are compared on scenes with 0, 1, 2, or 3 primitives (Clevr-0123) at 64² pixels. https://knowyourdata-tfds.withgoogle.com/#tab=STATS&dataset=clevr
PhotoShape automatically assigns high-quality, realistic appearance models to large-scale 3D shape collections.
Datasets (in the supplementary document)
Dataset Parameters : object rotation, background rotation, camera elevation, horizontal and depth translation, and object size are sampled from uniform distributions over the indicated ranges (see the sketch at the end of this slide). For the Clevr datasets, object locations are sampled from the distribution obtained during dataset generation.
Image Center Cropping : center crop (CelebA, CelebA-HQ)
Image Random Cropping : rescale and random crop (CompCars)
Data Augmentation : for all experiments, images are randomly flipped horizontally during training
Clevr Dataset Generation: the script that renders multi-object scenes of random primitives adjusts the camera position to have a rotation of 0° instead of 43°.
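A minimal sketch of sampling these scene parameters from uniform priors; the ranges below are placeholders, since the actual ranges are dataset-specific:

```python
import numpy as np

def sample_scene_params(rng=np.random):
    """Draw one set of pose/transformation parameters from uniform priors."""
    return {
        "object_rotation": rng.uniform(0.0, 2 * np.pi),    # radians
        "background_rotation": rng.uniform(0.0, 2 * np.pi),
        "camera_elevation": rng.uniform(0.1, 0.3),         # placeholder range
        "translation_xy": rng.uniform(-0.2, 0.2, size=2),  # horizontal/depth
        "object_scale": rng.uniform(0.8, 1.2),
    }
```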
Experiments: the model correctly disentangles individual objects when trained on multi-object scenes with a fixed or varying number of objects.
Conclusion & Limitations
Conclusion
A novel method for controllable image synthesis.
Incorporate a compositional 3D scene representation into the generative model
Disentangle individual objects from the background as well as their shape and
appearance without explicit supervision
Limitations (Dataset Bias & Object Transformation Distributions)
Investigate how the distributions over object-level transformations and camera poses can be learned from data (the method currently assumes simple uniform priors).
Incorporating supervision that is easy to obtain (e.g., predicted object masks) could help scale to more complex, multi-object scenes.
(Failure case: eye and hair attributes remain entangled.)
Thanks
Any Questions?
You can send mail to
Susang Kim (healess1@gmail.com)

[Paper] GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
