# Oleksandr Obiednikov “Affine transforms and how CNN lives with them”

Published in: Data & Analytics


1. Affine transforms and how CNNs live with them. Oleksandr Obiednikov, Head of Face Recognition Research @ RingUkraine. AI & Big Data Day, March 10, 2018
2. Agenda: ● Gentle reminder: What is a CNN? What is an affine transformation? ● Problem formulation ● Common methods to increase robustness. Practical solution: ● Spatial Transformer Networks ○ STN structure ○ Examples, applications, results. Not-yet-practical solution: ● Capsule Networks ○ What’s wrong with CNNs? ○ Capsule Networks ○ Results, applications?
3. Gentle reminder: What is a CNN? Key operations: ● Dense / Fully Connected = σ(w • x) ● Convolution = σ(w ✳ x) ● Max or Avg Pooling ● Others: BatchNorm, Dropout, skip-connections, etc.
4. Gentle reminder: What is an affine transformation? “In geometry, an affine transformation (from the Latin affinis, ‘connected with’) is a function between affine spaces which preserves points, straight lines and planes.” (Wikipedia) In other words, it’s a function f(x) = Mx + b, where M is an invertible matrix. Examples: ● Shifts ● Rotations ● Reflections ● Stretching and compression ● …
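A minimal NumPy sketch of the definition above: an invertible matrix M plus a translation b applied to a batch of points. The particular matrix and points are illustrative, not from the slides.

```python
import numpy as np

def affine_transform(points, M, b):
    """Apply f(x) = M @ x + b to each point (one point per row)."""
    M = np.asarray(M, dtype=float)
    assert abs(np.linalg.det(M)) > 1e-12, "M must be invertible"
    return points @ M.T + b

# Example (made-up values): a 90-degree rotation followed by a shift of (1, 0)
M = np.array([[0.0, -1.0],
              [1.0,  0.0]])
b = np.array([1.0, 0.0])
pts = np.array([[1.0, 0.0],
                [0.0, 1.0]])
out = affine_transform(pts, M, b)   # (1, 0) -> (1, 1), (0, 1) -> (0, 0)
```

Shifts, rotations, reflections, and stretching from the bullet list are all special cases of the choice of M and b.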
5. Problem formulation: ● When we switch from an MLP to a CNN, we get shift invariance for free, but NO rotation invariance, NO scale invariance, NO flip invariance, etc. ● A key object “in the wild” may come with a varying amount of context.
6. Common ways to solve it: ● Getting more data ● Augmentations ● Pooling ● Attention mechanisms
7. Spatial Transformer Network. Reference: https://arxiv.org/abs/1506.02025 (DeepMind). Key features: ● The action of the STN is conditioned on the individual sample ● Does much more than attention: cropping, translation, rotation, scale, skew ● No additional supervision ● No modifications to the optimization process ● Computationally efficient ● Can be applied to any feature map, or in particular to the input image
8. Spatial Transformer structure. An STN consists of: 1. Localization net: predicts the parameters theta of the transform. For the 2D case it’s a 2 x 3 matrix; for the 3D case, a 3 x 4 matrix. 2. Grid generator: uses the predictions of the localization net to create a sampling grid, the set of points where the input map should be sampled to produce the transformed output. 3. Sampler: produces the output map by sampling the input feature map at the predicted grid points.
9. Notes: Localization network. 1. Should predict 6 (2 x 3) parameters for a 2D transform, 12 (3 x 4) for a 3D transform, and N(N+1) parameters in the general N-dimensional case. 2. The LocNet’s “feature extractor” can have any structure; in practice, for most problems, 2 x Conv + Dense is enough. 3. Can be modified to predict several transformations.
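The parameter count in point 1 follows directly from the shape of the affine matrix; a quick sanity check:

```python
def n_affine_params(n_dims):
    """An n-D affine transform is an n x (n+1) matrix (linear part plus a
    translation column), giving n * (n + 1) free parameters."""
    return n_dims * (n_dims + 1)

# 2D: 2 x 3 matrix -> 6 parameters; 3D: 3 x 4 matrix -> 12 parameters
```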
10. Notes: Grid generator and transforms. 1. The transformation can have any parameterized form that is differentiable with respect to the parameters. 2. In the 2D case, the grid generator maps target coordinates to source coordinates with the 2 x 3 matrix theta: (x_s, y_s)^T = [[θ11, θ12, θ13], [θ21, θ22, θ23]] · (x_t, y_t, 1)^T. 3. In particular, attention corresponds to theta = [[s, 0, t_x], [0, s, t_y]]: an isotropic scale s plus a translation.
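A sketch of the grid generator in NumPy, assuming the paper's convention of normalized target coordinates in [-1, 1]; the scale and translation values are made up:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a regular H x W target grid (coords in [-1, 1]) through the
    2 x 3 matrix theta to get source sampling coordinates."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H),
                         np.linspace(-1, 1, W), indexing="ij")
    # Homogeneous target coordinates, shape (3, H*W)
    tgt = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    src = theta @ tgt                     # (2, H*W) source (x, y) coordinates
    return src.reshape(2, H, W)

# Attention as a special case: isotropic scale s plus translation (tx, ty)
s, tx, ty = 0.5, 0.25, 0.0                # made-up values
theta = np.array([[s, 0.0, tx],
                  [0.0, s, ty]])
grid = affine_grid(theta, 4, 4)           # samples a zoomed, shifted window
```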
11. Notes: Sampling mechanism. ● THE KEY REASON WHY EVERYTHING WORKS. ● A sub-differentiable sampling mechanism: in other words, the gradient can flow all the way back to the “input” feature map. ● For details, please refer to the paper.
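The bilinear sampler from the paper can be sketched in a few lines of NumPy. The interpolation weights are piecewise-linear in the coordinates, which is what makes the sampler sub-differentiable with respect to them:

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img at fractional pixel coordinates via bilinear interpolation.
    The weights (wx, wy) are piecewise-linear in (xs, ys), so gradients can
    flow back through the sampling coordinates."""
    H, W = img.shape
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    wx, wy = xs - x0, ys - y0
    return ((1 - wy) * (1 - wx) * img[y0, x0] +
            (1 - wy) * wx       * img[y0, x0 + 1] +
            wy       * (1 - wx) * img[y0 + 1, x0] +
            wy       * wx       * img[y0 + 1, x0 + 1])

img = np.arange(16, dtype=float).reshape(4, 4)
val = bilinear_sample(img, np.array([1.5]), np.array([1.5]))
# midpoint of pixels (1,1), (1,2), (2,1), (2,2): (5 + 6 + 9 + 10) / 4
```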
12. STN examples and applications: ● Alignment-free face recognition ● “OCR in the wild” (SOTA) ● Street View House Numbers (SOTA) ● Traffic Signs dataset (SOTA) ● etc.
13. STN modifications: ● Grid extrapolation & padding with the original image ● A chain of STNs ● Generating several transforms for N-way siamese networks ● <Your suggestions>
14. Capsule Networks. References (all from Geoffrey Hinton): ● Dynamic Routing Between Capsules https://arxiv.org/abs/1710.09829 ● What is wrong with convolutional neural nets? https://youtu.be/rTawFwUvnLE ● Matrix capsules with EM routing https://openreview.net/forum?id=HJWLfGWRb&noteId=HJWLfGWRb
15. What is wrong with CNNs? (according to G. Hinton) 1. Too few levels of structure (neuron → layer → net) 2. No explicit notion of an entity 3. Pooling loses information about the position of features
16. How our brain works: inverse graphics. (Figure: an initial object, its inverse pose matrix, and renderings from different viewpoints.)
17. How our brain works: inverse graphics. (Figure: front and top views of the same object.)
18. Equivariance vs invariance. ● CNNs try to make the neural activities invariant to small changes in viewpoint. ● But it is better to aim for equivariance: changes in viewpoint should lead to corresponding changes in neural activities.
19. Capsule Network. Architecture
21. Capsules vs traditional neurons
22. Capsules: 1. Matrix multiplication of the input vectors 2. Scalar weighting of the obtained vectors 3. A vector-to-vector “squash” non-linearity
23. Capsules. Step 1. ● u: the output vectors of the previous capsules. ○ The length encodes the probability that the corresponding object was detected. ○ The direction encodes some internal state of the detected object. ● W: affine transformation matrices. They encode important spatial and other relationships between lower-level features. ○ For example, matrix W12 may encode the relationship between a nose and a face. ● u_hat: the predicted position of the higher-level feature. ○ For example, u1_hat represents where the face should be according to the detected position of the eyes.
24. Capsules. Step 2: scalar weighting of the input vectors via a softmax routing of each low-level capsule’s output (send more to one high-level capsule j, send less to another capsule k; coefficients c11, c12). ● A lower-level capsule sends its output to the higher-level capsule that “agrees” with its input. ● This is the essence of the dynamic routing algorithm.
25. Capsules forward path. Step 3: the new “squash” non-linearity
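The squash function from the Dynamic Routing paper, squash(s) = (‖s‖² / (1 + ‖s‖²)) · s/‖s‖, preserves a vector's direction while mapping its length into [0, 1), so the length can be read as a probability. A NumPy sketch:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Keep the direction of s; map its length into [0, 1).
    Long vectors approach length 1, short vectors shrink toward 0."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

short = squash(np.array([0.1, 0.0]))    # weak evidence  -> length near 0
long_ = squash(np.array([10.0, 0.0]))   # strong evidence -> length near 1
```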
26. Dynamic routing. Aim: find the coupling coefficients C such that each low-level capsule’s output vector is “routed” to the high-level capsule that “agrees” with its input.
27. Dynamic routing. General idea: ● Once again: a capsule’s output is a vector, an internal representation of a feature. ● We want routing to favor capsule K over capsule J, since K’s output is closer to the “red” cluster of prediction points.
28. Dynamic routing. “Vivisection”. Step 2: initialize all log priors to 0 (b11 = b12 = b13 = 0 for low-level capsule i = 1 and high-level capsules j = 1, 2, 3 with weights W11, W12, W13). This way, due to the “routing softmax”, the low-level capsule’s output is initially distributed evenly across the high-level capsules.
29. Dynamic routing. Steps 3-4 (first iteration): calculate the coupling coefficients c. With b11 = b12 = b13 = 0, the routing softmax gives c11 = c12 = c13 = 0.33.
30. Dynamic routing. Steps 5-6 (first iteration): ● Calculate the squashed output of each high-level capsule from the weighted low-level inputs. ● Suppose that |V3| < |V1| < |V2|.
31. Dynamic routing. Step 7 (update priors): ● Update the priors b using the agreement between each prediction and the high-level output: b_ij ← b_ij + u_hat_ij · v_j. ● This agreement is treated as if it were a log likelihood and is added to the initial logit. ● Example: b11 = 0 + 0.4 → c11 = 0.3; b12 = 0 + 0.9 → c12 = 0.47; b13 = 0 + 0.2 → c13 = 0.23.
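The whole loop (steps 2 through 7 above) can be sketched in NumPy. The capsule counts, dimensions, and random predictions here are purely illustrative:

```python
import numpy as np

def squash(s, eps=1e-8):
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement. u_hat[i, j] is low-level capsule i's prediction
    for high-level capsule j; u_hat has shape (n_low, n_high, dim)."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                  # step 2: log priors = 0
    for _ in range(n_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # steps 3-4: routing softmax
        s = (c[..., None] * u_hat).sum(axis=0)     # step 5: weighted sum
        v = squash(s)                              # step 6: squash -> (n_high, dim)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # step 7: agreement update
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(4, 3, 8))                 # 4 low-, 3 high-level capsules
v, c = dynamic_routing(u_hat)
```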
32. Loss function (margin loss). ● Capsules use a separate margin loss for each category c in the picture: L_c = T_c · max(0, m⁺ − ‖v_c‖)² + λ (1 − T_c) · max(0, ‖v_c‖ − m⁻)², where T_c = 1 iff an object of class c is present, m⁺ = 0.9 and m⁻ = 0.1. ● The λ down-weighting (default 0.5) stops the initial learning from shrinking the activity vectors of all classes. ● The total loss is the sum of the losses of all classes.
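The margin loss translates almost directly into NumPy; the example capsule lengths below are made up:

```python
import numpy as np

def margin_loss(lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_c = T_c * max(0, m+ - |v_c|)^2 + lam * (1 - T_c) * max(0, |v_c| - m-)^2,
    summed over classes. `lengths` are the output capsule lengths |v_c|."""
    present = targets * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, lengths - m_neg) ** 2
    return np.sum(present + absent)

# Made-up capsule lengths for 3 classes; only class 0 is present
lengths = np.array([0.95, 0.05, 0.2])
targets = np.array([1.0, 0.0, 0.0])
loss = margin_loss(lengths, targets)   # only class 2 contributes: 0.5 * 0.1^2
```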
33. Reconstruction as regularization
34. CapsNets. Cons: ● SUPER SLOW. ● SUPER² SLOW training due to the expensive routing procedure. ● Uses the length of a vector to represent a probability, so to keep the length ≤ 1 it needs an “unprincipled” non-linearity, which prevents there from being any sensible objective function that is minimized by the iterative routing procedure. ● Cosine distance as the agreement measure is worse than the log variance of a Gaussian cluster, since it cannot distinguish “good” from “very good” agreement. ● Operates on vectors instead of scalars, so a transformation matrix requires N² rather than N parameters.
35. CapsNets. Pros: ● Naturally equivariant to affine transforms ● Fresh ideas ● SOTA on MNIST ● SOTA on brain tumor type classification [https://arxiv.org/pdf/1802.10200.pdf]
36. Questions. Oleksandr Obiednikov, Head of Face Recognition Research @ RingUkraine. obednikov.alex@gmail.com / alexandr.obednikov@ring.com