PetteriTeikariPhD
May 03, 2017 · 13 slides
About This Presentation
Journal club done with Vid Stojevic for PointNet:
https://arxiv.org/abs/1612.00593
https://github.com/charlesq34/pointnet
http://stanford.edu/~rqi/pointnet/
Deep learning for indoor point cloud processing. PointNet provides a unified architecture operating directly on unordered point clouds, without voxelisation, for applications ranging from object classification and part segmentation to scene semantic parsing.
Alternative download link:
https://www.dropbox.com/s/ziyhgi627vg9lyi/3D_v2017_initReport.pdf?dl=0
Slide Content
Implementation: initial ‘deep learning’ idea
The .XYZ point cloud is better suited than the reconstructed .obj file for automatic segmentation, due to its higher resolution.
[Slide diagram] Pipeline: Input Point Cloud → 3D CAD model (no need to have planar surfaces; sampled too densely; www.outsource3dcadmodeling.com) → 2D CAD model (straightforward to go from 3D to 2D; cadcrowd.com).
RECONSTRUCT 3D → “Deep Learning”: 3D semantic segmentation from the point cloud / reconstructed mesh (youtube.com/watch?v=cGuoyNY54kU; primitive-based deep learning segmentation, arxiv.org/abs/1608.04236).
The order of the semantic segmentation and reconstruction steps could be swapped.
NIPS 2016: 3D Workshop
Point cloud pipelines are still at a very early stage compared to “ordered” images.
Deep learning is proven to be a powerful tool to build models for language (one-dimensional) and image (two-dimensional) understanding. Tremendous efforts have been devoted to these areas; however, it is still at an early stage to apply deep learning to 3D data, despite its great research value and broad real-world applications. In particular, existing methods poorly serve the three-dimensional data that drives a broad range of critical applications such as augmented reality, autonomous driving, graphics, robotics, medical imaging, neuroscience, and scientific simulations. These problems have drawn the attention of researchers in different fields such as neuroscience, computer vision, and graphics.

The goal of this workshop is to foster interdisciplinary communication between researchers working on 3D data (computer vision and computer graphics) so that more attention from the broader community can be drawn to 3D deep learning problems. Through these studies, new ideas and discoveries are expected to emerge, which can inspire advances in related fields.

The workshop is composed of invited talks, oral presentations of outstanding submissions, and a poster session to showcase state-of-the-art results on the topic. In particular, a panel discussion among leading researchers in the field is planned, so as to provide a common playground for inspiring discussions and stimulating debates.

The workshop will be held on Dec 9 at NIPS 2016 in Barcelona, Spain. http://3ddl.cs.princeton.edu/2016/
ORGANIZERS
● Fisher Yu - Princeton University
● Joseph Lim - Stanford University
● Matthew Fisher - Stanford University
● Qixing Huang - University of Texas at Austin
● Jianxiong Xiao - AutoX Inc.
http://cvpr2017.thecvf.com/ In Honolulu, Hawaii
“I am co-organizing the 2nd Workshop on Visual Understanding for Interaction in conjunction with CVPR 2017. Stay tuned for the details!”
“Our workshop on Large-Scale Scene Understanding Challenge is accepted by CVPR 2017.”
http://3ddl.cs.princeton.edu/2016/slides/su.pdf
PointNet: deep learning for point cloud classification and segmentation
https://github.com/charlesq34/pointnet
https://arxiv.org/abs/1612.00593
Applications of PointNet. “We propose a novel deep net architecture that consumes raw unordered point cloud (set of points) without voxelization or rendering. It is a unified architecture that learns both global and local point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks.”
PointNet Architecture
Our network has three key modules:
1) the max pooling layer as a symmetric function to aggregate information from all the points,
2) a local and global information combination structure,
3) two joint alignment networks that align both input points and point features.
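How the shared per-point feature extraction and the symmetric max pool compose can be sketched in a few lines of numpy. This is a rough illustration with toy layer sizes and a ReLU-only shared MLP, not the authors' TensorFlow implementation:

import numpy as np

def pointnet_cls_sketch(points, W1, W2):
    # points: (N, 3) unordered point cloud; W1: (3, 64), W2: (64, 1024) shared weights.
    h = np.maximum(points @ W1, 0.0)   # shared MLP applied identically to every point
    h = np.maximum(h @ W2, 0.0)        # lift each point to a 1024-D feature
    return h.max(axis=0)               # module (1): symmetric max pool -> global descriptor

The returned global descriptor would then feed fully connected layers for classification scores.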
PointNet symmetry function #1: multi-layer perceptron (MLP)
http://iamaaditya.github.io/2016/03/one-by-one-convolution/
https://github.com/charlesq34/pointnet/blob/master/models/pointnet_cls_basic.py
The MLP is implemented as a 1x1 2D convolution.
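A 1x1 convolution mixes channels at each position independently, which over an N-point cloud is exactly a weight-shared per-point MLP. A small numpy check of that equivalence, with hypothetical shapes:

import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1024, 3))   # N points, 3 input channels
W = rng.normal(size=(3, 64))          # shared weights, 3 -> 64 channels

# Per-point MLP layer: one matrix multiply shared across all points.
mlp_out = points @ W

# "1x1 convolution" view: treat the cloud as an N x 1 image with 3 channels;
# a 1x1 kernel mixes channels without mixing points.
conv_out = np.einsum('nc,cd->nd', points, W)

assert np.allclose(mlp_out, conv_out)  # identical by construction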
PointNet symmetry function #2: max pooling
https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features
Jean Da Rolt, PhD, Computer Engineer, Professor: “After some thought, I do not believe that pooling operations are responsible for the translation-invariant property in CNNs. I believe that invariance (at least to translation) is due to the convolution filters (not specifically the pooling) and due to the fully-connected layer. In conclusion, what makes a CNN invariant to object translation is the architecture of the neural network: the convolution filters and the fully-connected layer.”
Artem Rozantsev, PhD Computer Vision & Machine Learning: “In addition to the previous answers, standard ConvNets are invariant only to transformations that are present in the training data. However, there are works which made a step towards training networks that are inherently invariant to transformations such as rotation and translation, for example:”
https://arxiv.org/abs/1703.00356
https://arxiv.org/abs/1612.04642
https://arxiv.org/abs/1512.07108
University College London
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
“Key to our approach is the use of a single symmetric function, max pooling. Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learnt optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict per point labels (shape segmentation).”
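The crucial property is that pooling over the point axis is a symmetric function: any reordering of the input points yields the same global descriptor. A quick numpy check:

import numpy as np

rng = np.random.default_rng(1)
features = rng.normal(size=(2048, 1024))       # per-point features (N x K)
perm = rng.permutation(features.shape[0])

pooled = features.max(axis=0)                  # global shape descriptor
pooled_shuffled = features[perm].max(axis=0)   # same points, different order

assert np.allclose(pooled, pooled_shuffled)    # point order does not matter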
PointNet Combination Structure
(pg. 3) “Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.”
(pg. 4) “After computing the global point cloud feature vector, we feed it back to per point features by concatenating the global feature with each of the point features. Then we extract new per point features based on the combined point features - this time the per point feature is aware of both the local and global information.”
(pg. 8) “As discussed in Sec 4.2 (pg. 4), our network computes K (we take K = 1024 in this experiment) dimension point features for each point and aggregates all the *per-point local features* via a max pooling layer into a single K-dim vector, which forms the global shape descriptor.”
(pg. 13) “Normal Estimation. In the segmentation version of PointNet, local point features and global feature are concatenated in order to provide context to local points. However, it’s unclear whether the context is learnt through this concatenation. In this experiment, we validate our design by showing that our segmentation network can be trained to predict point normals, a local geometric property that is determined by a point’s neighborhood.”
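The concatenation described above is simple to express. A minimal numpy sketch, assuming 64-D local features and a 1024-D global descriptor (so each point ends up with 64 + 1024 = 1088 dimensions, as in the paper's segmentation network):

import numpy as np

def concat_local_global(local_feats, global_feat):
    # local_feats: (N, 64) per-point features; global_feat: (1024,) pooled descriptor.
    # Returns (N, 1088): every point now sees both local and global context.
    n = local_feats.shape[0]
    tiled = np.broadcast_to(global_feat, (n, global_feat.shape[0]))
    return np.concatenate([local_feats, tiled], axis=1)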
PointNet Alignment Network
PointNet (pg. 1): “Thus we can add a data-dependent spatial transformer network that attempts to canonicalize the data before the PointNet processes them, so as to further improve the results.”
PointNet (pg. 4): “However, transformation matrix in the feature space has much higher dimension than the spatial transform matrix (e.g. from 3 × 3 to 64 × 64), which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to an orthogonal matrix. We find that by adding the regularization term, the optimization becomes more stable and our model achieves better performance.”
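The paper's regularizer penalizes the predicted feature transform A for drifting from orthogonality, L_reg = ||I - A Aᵀ||²_F, which is zero exactly when A is orthogonal. A minimal numpy version:

import numpy as np

def orthogonality_penalty(A):
    # A: (64, 64) predicted feature-space transform.
    # Squared Frobenius norm of the residual; zero iff A @ A.T == I.
    k = A.shape[0]
    residual = np.eye(k) - A @ A.T
    return np.sum(residual ** 2)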
In Fig 15 we see that performance grows as we increase the number of points; however, it saturates at around 1K points. The max layer size plays an important role: increasing the layer size from 64 to 1024 results in a 2-4% performance gain. This indicates that we need enough point feature functions to cover the 3D space in order to discriminate different shapes.
PointNet Modifications: input data, increase dimensionality?
PointNet (pg. 1): “In the basic setting each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.”
Data columns: x, y, z, red, green, blue; no normals.
Point clouds can be huge.
https://www.we-get-around.com/wegetaround-atlanta-our-blog/2015/10/cubicasa-creates-2d-and-3d-floor-plans-for-matterport-photographers-from-3d-showcase-tours
6-dimensional input data: along with the x, y, z coordinates one also obtains R, G, B values (or the CIE LAB colorspace), which are very useful for segmenting objects.
7-dimensional input data: normals could be obtained too if the camera position were known.
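One common way to estimate such normals (a sketch of a standard PCA approach, not part of PointNet itself): fit a plane to each point's k nearest neighbours and, given the camera position, flip the normal to face the sensor:

import numpy as np

def estimate_normal(points, idx, k=16, camera=np.zeros(3)):
    # PCA normal for points[idx] from its k nearest neighbours.
    # points: (N, 3); camera: sensor position used to orient the normal.
    d = np.linalg.norm(points - points[idx], axis=1)
    nbrs = points[np.argsort(d)[:k]]
    cov = np.cov(nbrs.T)                        # 3x3 neighbourhood covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                      # smallest-variance direction
    if normal @ (camera - points[idx]) < 0:     # orient toward the camera
        normal = -normal
    return normal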
Eurographics Symposium on Geometry Processing 2016, Volume 35 (2016), Number 5. http://dx.doi.org/10.1111/cgf.12983
PointNet (pg. 13)
PointNet Modifications, Architecture #1: uncertainty estimation?
https://arxiv.org/pdf/1703.04977.pdf
http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
[In the classification pipeline only, not in the segmentation part.]
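Following the Monte Carlo dropout idea in the references above, one could approximate predictive uncertainty by keeping dropout active at test time and averaging several stochastic forward passes. A sketch, where `forward` is a hypothetical model call that applies dropout on every invocation:

import numpy as np

def mc_dropout_predict(forward, points, n_samples=20):
    # forward: model call with test-time dropout (hypothetical); points: one input cloud.
    # Returns mean class probabilities and their per-class variance (the uncertainty).
    probs = np.stack([forward(points) for _ in range(n_samples)])
    return probs.mean(axis=0), probs.var(axis=0)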
PointNet Modifications, Architecture #2: component variations?
Nonlinearity · Pooling Layer · Normalization
In order to make the model invariant to input permutation, the authors use max pooling as the simple symmetric function to aggregate the information from each point.
[In classification] All layers, except the last one, include ReLU and batch normalization.
http://arxiv.org/abs/1604.04112
“One possible future line of work is to embed the network in its entirety in the frequency domain. In models that employ Fourier transforms to compute convolutions, at every convolutional layer the input is FFT-ed and the elementwise multiplication output is then inverse FFT-ed. These back-and-forth transformations are very computationally intensive, and as such it would be desirable to strictly remain in the frequency domain. However, the reason for these repeated transformations is the application of nonlinearities in the forward domain: if one were to propose a sensible nonlinearity in the frequency domain, this would spare us from the incessant domain switching.”
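The “back-and-forth” the quote refers to is the convolution theorem: circular convolution in the forward domain equals elementwise multiplication in the frequency domain, so each nonlinearity forces an inverse FFT. A quick 1-D numpy check of the identity:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=64)
k = rng.normal(size=64)

# Circular convolution computed directly...
direct = np.array([sum(x[j] * k[(i - j) % 64] for j in range(64)) for i in range(64)])

# ...equals elementwise multiplication in the frequency domain.
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

assert np.allclose(direct, spectral)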
“Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited.”
https://arxiv.org/abs/1602.07868
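Weight normalization (Salimans & Kingma, the paper above) reparameterizes each weight vector as w = g · v / ||v||, decoupling the direction of the weights from their length without using any minibatch statistics. A minimal sketch:

import numpy as np

def weight_norm(v, g):
    # v: direction parameters for one weight vector; g: scalar gain.
    # Returns the effective weight w = g * v / ||v||.
    return g * v / np.linalg.norm(v)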
https://arxiv.org/abs/1605.09332
http://arxiv.org/abs/1512.07108