Implementation: Initial ‘deep learning’ idea
The .XYZ point cloud is better suited than the reconstructed .obj file for automatic segmentation due to its higher resolution.
Input Point Cloud
3D CAD MODEL
No need to have
planar surfaces
Sampled too densely
www.outsource3dcadmodeling.com
2D CAD MODEL
Straightforward from 3D to 2D
cadcrowd.com
RECONSTRUCT 3D
“Deep Learning”
3D Semantic Segmentation
from point cloud / reconstructed mesh
youtube.com/watch?v=cGuoyNY54kU
arxiv.org/abs/1608.04236
Primitive-based deep learning segmentation
The order between semantic segmentation and reconstruction could be swapped
NIPS 2016: 3D Workshop
Still very early days for point cloud pipelines compared to “ordered images”.
Deep learning has proven to be a powerful tool for building models for language (one-dimensional) and image (two-dimensional) understanding. Tremendous efforts have been devoted to these areas; however, applying deep learning to 3D data is still at an early stage, despite its great research value and broad real-world applications. In particular, existing methods poorly serve the three-dimensional data that drives a broad range of critical applications such as augmented reality, autonomous driving, graphics, robotics, medical imaging, neuroscience, and scientific simulations. These problems have drawn the attention of researchers in different fields such as neuroscience, computer vision, and graphics.
The goal of this workshop is to foster interdisciplinary communication among researchers working on 3D data (computer vision and computer graphics) so that more attention from the broader community can be drawn to 3D deep learning problems. Through these studies, new ideas and discoveries are expected to emerge, which can inspire advances in related fields.
The workshop is composed of invited talks, oral presentations of outstanding submissions, and a poster session to showcase state-of-the-art results on the topic. In particular, a panel discussion among leading researchers in the field is planned, so as to provide a common playground for inspiring discussions and stimulating debates.
The workshop will be held on Dec 9 at NIPS 2016 in
Barcelona, Spain. http://3ddl.cs.princeton.edu/2016/
ORGANIZERS
● Fisher Yu - Princeton University
● Joseph Lim - Stanford University
● Matthew Fisher - Stanford University
● Qixing Huang - University of Texas at Austin
● Jianxiong Xiao - AutoX Inc.
http://cvpr2017.thecvf.com/ (Honolulu, Hawaii)
“I am co-organizing the 2nd Workshop on Visual Understanding for Interaction in conjunction with CVPR 2017. Stay tuned for the details!”
“Our workshop on Large-Scale Scene Understanding Challenge is accepted by CVPR 2017.”
http://3ddl.cs.princeton.edu/2016/slides/su.pdf
PointNet: Deep learning for point cloud classification and segmentation
https://github.com/charlesq34/pointnet
https://arxiv.org/abs/1612.00593
Applications of PointNet. We propose a novel deep net
architecture that consumes raw unordered point cloud (set of
points) without voxelization or rendering.
It is a unified architecture that learns both global and local
point features, providing a simple, efficient and effective
approach for a number of 3D recognition tasks.
PointNet Architecture
Our network has three key modules:
1) the max pooling layer as a symmetric function to aggregate information from all the points,
2) a local and global information combination structure,
3) and two joint alignment networks that align both input points and point features.
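To make the interplay of these modules concrete, here is a minimal NumPy sketch of the core idea, a shared per-point function followed by a symmetric max pool; the weights and shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Minimal sketch of the PointNet idea: a shared per-point function h
# (the MLP), then a symmetric aggregation (max pooling) that yields a
# permutation-invariant global feature. Weights are random here purely
# for illustration.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 64))             # shared per-point weights

def h(points):                            # per-point MLP (one layer here)
    return np.maximum(points @ W1, 0.0)   # ReLU, shape (N, 64)

def pointnet_global_feature(points):
    return h(points).max(axis=0)          # symmetric max pool -> (64,)

pts = rng.normal(size=(1024, 3))
feat = pointnet_global_feature(pts)       # permutation-invariant descriptor
```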
PointNet symmetry function #1: Multi-layer Perceptron
http://iamaaditya.github.io/2016/03/one-by-one-convolution/
https://github.com/charlesq34/pointnet/blob/master/models/pointnet_cls_basic.py
MLP implemented as a 1x1 2D convolution
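As a sanity check on this equivalence, the sketch below (hypothetical shapes, not the repo's code) shows that a 1x1 kernel applied at every location of an N x 1 "image" is exactly a per-point dense layer:

```python
import numpy as np

# A shared MLP can be implemented as a 1x1 2D convolution: treat the
# N points as an (N, 1) "image" with C_in channels; a 1x1 kernel then
# applies the same C_in -> C_out weights at every point.

rng = np.random.default_rng(0)
N, c_in, c_out = 1024, 3, 64
img = rng.normal(size=(N, 1, c_in))       # points laid out as an N x 1 image
kernel = rng.normal(size=(c_in, c_out))   # 1x1 convolution kernel

# "Convolve": at each of the N x 1 locations, multiply the channel
# vector by the kernel -- exactly a per-point dense layer.
conv_out = np.stack([img[n, 0] @ kernel for n in range(N)])
dense_out = img[:, 0, :] @ kernel
assert np.allclose(conv_out, dense_out)
```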
PointNet symmetry function #2: Max Pooling
https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features
Jean Da Rolt, PhD, Computer Engineer, Professor: “After some thought, I do not believe that pooling
operations are responsible for the translation invariant property in CNNs. I believe that invariance (at least to
translation) is due to the convolution filters (not specifically the pooling) and due to the fully-connected layer. In
conclusion, what makes a CNN invariant to object translation is the architecture of the neural network: the
convolution filters and the fully-connected layer.”
Artem Rozantsev, PhD Computer Vision & Machine Learning: “In addition to the previous answers,
standard ConvNets are invariant only to transformations that are present in the training data. However, there are
works, which made a step towards training networks that are inherently invariant to transformations such as
rotation and translation, for example”
https://arxiv.org/abs/1703.00356,
https://arxiv.org/abs/1612.04642
https://arxiv.org/abs/1512.07108
University College London
École Polytechnique Fédérale de Lausanne (EPFL),
Lausanne, Switzerland
Key to our approach is the use of a single symmetric function, max pooling. Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learnt optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict per point labels (shape segmentation).
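A quick way to see why max pooling makes the network invariant to input permutation: pooling over the point axis ignores point order entirely. A minimal NumPy check (illustrative shapes):

```python
import numpy as np

# Max pooling over the point axis is a symmetric function: permuting
# the input points leaves the pooled global feature unchanged.

rng = np.random.default_rng(1)
point_feats = rng.normal(size=(1024, 64))   # per-point features
global_feat = point_feats.max(axis=0)       # (64,) global descriptor

perm = rng.permutation(1024)                # shuffle the points
assert np.allclose(global_feat, point_feats[perm].max(axis=0))
```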
PointNet Combination Structure
(pg. 3)
" Therefore, the model needs to be able to capture local structures from nearby points,
and the combinatorial interactions among local structures"
(pg. 4)
" After computing the global point cloud feature vector, we feed it back to per point
features by concatenating the global feature with each of the point features. Then we
extract new per point features based on the combined point features - this time the per
point feature is aware of both the local and global information"
(pg. 8)
"As discussed in Sec 4.2 (pg. 4), our network computes K (we take K = 1024 in this
experiment) dimension point features for each point and aggregates all the *per-point
local features* via a max pooling layer into a single K-dim vector, which forms the global
shape descriptor."
(pg. 13)
"Normal Estimation In segmentation version of PointNet, local point features and global
feature are concatenated in order to provide context to local points. However, it’s unclear
whether the context is learnt through this concatenation. In this experiment, we
validate our design by showing that our segmentation network can be trained to predict
point normals, a local geometric property that is determined by a point’s neighborhood"
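A sketch of the concatenation described in the pg. 4 quote, with the paper's K = 1024 global feature and an assumed 64-dim per-point feature (illustrative NumPy, not the authors' code):

```python
import numpy as np

# Local/global combination: tile the K-dim global feature and
# concatenate it onto each point's local feature.

n_points, k = 1024, 1024
local_feats = np.random.randn(n_points, 64)   # per-point features
global_feat = np.random.randn(k)              # max-pooled shape descriptor

combined = np.concatenate(
    [local_feats, np.broadcast_to(global_feat, (n_points, k))], axis=1
)  # (1024, 64 + 1024): each point now carries local + global context
```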
PointNet Alignment Network
PointNet: (pg. 1)
"Thus we can add a data-dependent
spatial transformer network that
attempts to canonicalize the data before
the PointNet processes them, so as to
further improve the results."
PointNet: (pg. 4)
However, the transformation matrix in the feature space has a much higher dimension than the spatial transform matrix (e.g. from 3 × 3 to 64 × 64), which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to an orthogonal matrix.
We find that by adding the regularization
term, the optimization becomes more
stable and our model achieves better
performance.
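The regularization term described here is the paper's L_reg = ||I - AA^T||_F^2, where A is the feature alignment matrix. A minimal NumPy version:

```python
import numpy as np

# The regularizer L_reg = ||I - A A^T||_F^2 pushes the 64x64 feature
# alignment matrix A toward orthogonality.

def orthogonality_loss(A):
    k = A.shape[0]
    diff = np.eye(k) - A @ A.T
    return np.sum(diff ** 2)  # squared Frobenius norm

A = np.eye(64) + 0.01 * np.random.randn(64, 64)  # near-orthogonal example
print(orthogonality_loss(A))  # small; grows as A drifts from orthogonal
```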
In Fig 15 we see that performance grows as we increase the
number of points however it saturates at around 1K points.
The max layer size plays an important role, increasing the layer
size from 64 to 1024 results in a 2−4% performance gain. It
indicates that we need enough point feature functions to cover
the 3D space in order to discriminate different shapes.
PointNet Modifications: input data, increase dimensionality?
PointNet: (pg. 1)
"In the basic setting each point is represented by
just its three coordinates (x, y, z). Additional
dimensions may be added by computing normals
and other local or global features."
Data columns: x, y, z, red, green, blue, no normals
Point clouds can be huge
https://www.we-get-around.com/wegetaround-atlanta-our-blog/2015/10/cubicasa-creates-2d-and-3d-floor-plans-for-matterport-photographers-from-3d-showcase-tours
6-dimensional input data
Along with the x, y, z coordinates one also obtains R, G, B values (or CIE LAB color space values), which are very useful for segmenting objects.
7-dimensional input data
Normals could be obtained too if the
camera position were known
Eurographics Symposium on Geometry Processing 2016, Volume 35
(2016), Number 5 http://dx.doi.org/10.1111/cgf.12983
PointNet: (pg. 13)
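A sketch of assembling the 6-dimensional (xyz + RGB) input described above; the file name and column layout are assumptions for illustration:

```python
import numpy as np

# Build a 6-dimensional (x, y, z, r, g, b) per-point input array.
# "scan.xyz" and its column order are hypothetical.

data = np.loadtxt("scan.xyz")                 # assumed columns: x y z r g b
xyz, rgb = data[:, :3], data[:, 3:6] / 255.0  # scale colors to [0, 1]
points6d = np.hstack([xyz, rgb])              # (N, 6) input for the network
```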
PointNet Modifications Architecture #1: Uncertainty estimation?
https://arxiv.org/pdf/1703.04977.pdf
http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
[in the classification pipeline only, not in the segmentation part]
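One way to get such uncertainty estimates, following the Monte Carlo dropout line of work linked above, is to keep dropout active at test time and average several stochastic forward passes; the spread across passes serves as an uncertainty signal. Hypothetical sketch, where `model` stands in for a classifier with a Keras-style `training` flag:

```python
import numpy as np

# Monte Carlo dropout sketch: run T stochastic forward passes with
# dropout left on; mean = prediction, std = uncertainty estimate.
# `model` is a hypothetical classifier returning class probabilities.

def mc_dropout_predict(model, x, T=20):
    probs = np.stack([model(x, training=True) for _ in range(T)])
    return probs.mean(axis=0), probs.std(axis=0)
```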
PointNet Modifications Architecture #2: component variations?
Nonlinearity / Pooling Layer / Normalization
In order to make a model invariant to input
permutation, the authors use max pooling
as the simple symmetric function to
aggregate the information from each point.
[in classification] All layers, except the last one, include ReLU and batch normalization.
http://arxiv.org/abs/1604.04112
“One possible future line of work is to embed the network in its
entirety in the frequency domain. In models that employ Fourier
transforms to compute convolutions, at every convolutional layer
the input is FFT-ed and the elementwise multiplication output is
then inverse FFT-ed. These back-and-forth transformations are very
computationally intensive, and as such it would be desirable to
strictly remain in the frequency domain. However, the reason for
these repeated transformations is the application of nonlinearities
in the forward domain: if one were to propose a sensible
nonlinearity in the frequency domain, this would spare us from
the incessant domain switching.”
Our reparameterization is inspired by batch
normalization but does not introduce any
dependencies between the examples in a
minibatch. This means that our method can also
be applied successfully to recurrent models such
as LSTMs and to noise-sensitive applications
such as deep reinforcement learning or
generative models, for which batch
normalization is less well suited.
https://arxiv.org/abs/1602.07868
https://arxiv.org/abs/1605.09332
http://arxiv.org/abs/1512.07108
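The reparameterization quoted above (arxiv.org/abs/1602.07868) writes each weight vector as w = g · v / ||v||, decoupling its direction from its magnitude. Minimal NumPy sketch:

```python
import numpy as np

# Weight normalization: w = g * v / ||v||, separating direction (v)
# from magnitude (g), so the norm of w is controlled by g alone.

def weight_norm(v, g):
    return g * v / np.linalg.norm(v)

v = np.random.randn(64)   # direction parameters
g = 2.0                   # learned scalar magnitude
w = weight_norm(v, g)
assert np.isclose(np.linalg.norm(w), abs(g))
```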
PointNet Modifications Architecture #3: Unsupervised/Semi-supervised extensions?
