Wild Data - The Data Science Meetup

Fernando Velasco @fer_maat
Raúl de la Fuente @neurozetta
Wild Data

© Stratio 2018. Confidential, All Rights Reserved.
Confusion Matrix

Layers, layers, layers

Building the structures: how can we define a neuron?

BackPropagation Basics
Forward Propagation: get a result
Backward Propagation: who’s to blame?
[Diagram: Input → hidden → hidden → hidden → Output → Error estimation (evaluate performance)]
● A cost function C is defined
● Every parameter has its impact on the cost, given some training examples
● Impacts are computed in terms of derivatives
● Use the chain rule to propagate the error backwards
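
The chain-rule bookkeeping above can be illustrated on the smallest possible network. This is a minimal sketch in plain Python, assuming a single sigmoid neuron, a squared-error cost, and made-up values for the input, target and learning rate; none of these choices come from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward propagation: get a result."""
    return sigmoid(w * x + b)

def cost(w, b, x, y):
    """Cost function C = 0.5 * (a - y)^2 (an assumed choice)."""
    return 0.5 * (forward(w, b, x) - y) ** 2

def backward(w, b, x, y):
    """Backward propagation: how much is each parameter to blame?"""
    a = forward(w, b, x)
    dC_da = a - y              # derivative of the cost w.r.t. the output
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dC_dz = dC_da * da_dz      # chain rule
    return dC_dz * x, dC_dz    # dC/dw, dC/db

# Hypothetical training example and starting parameters
w, b, x, y, lr = 0.5, 0.0, 1.5, 1.0, 0.5
initial_cost = cost(w, b, x, y)
for _ in range(100):
    dw, db = backward(w, b, x, y)
    w, b = w - lr * dw, b - lr * db
final_cost = cost(w, b, x, y)
```

In a deeper network the same product of local derivatives is simply extended layer by layer, from the error back to the input.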

Image Data
● Images are composed of pixels.
● Grayscale images can be seen as matrices.
● Coloured images are usually represented as mixes of three colours: Red, Green and Blue.
● Each colour channel can be seen as a grayscale-like filter.

Convolutions

Introducing Keras
Convolutional representations

Convolution Examples (I)
Edge Detection
Edge Enhance (right)

Convolution Examples (II)
Blur
Sharpen
Emboss
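
The filters named on these slides can be tried out by hand. This is a minimal sketch in plain Python rather than Keras; the kernel values are commonly used textbook choices and the tiny test image is made up, so neither is taken from the slides.

```python
def convolve2d(image, kernel):
    """Slide the kernel over a grayscale image (list of rows), no padding.

    Like most CNN libraries, this does not flip the kernel; for the
    symmetric kernels below that makes no difference.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Assumed, commonly used 3x3 kernels
EDGE_DETECT = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
SHARPEN = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
BLUR = [[1 / 9] * 3 for _ in range(3)]

# 5x5 image with a vertical edge: dark columns (0) then bright columns (9)
image = [[0, 0, 9, 9, 9] for _ in range(5)]
edges = convolve2d(image, EDGE_DETECT)   # each row is [-27, 27, 0]
```

The response is non-zero only where the window straddles the edge and zero in the flat bright region, which is what makes this kernel an edge detector.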

ReLU activation
- Avoids vanishing gradient
- Efficient computation
- Sparsity
- Adaptability
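
Some of the properties listed above can be checked numerically. This is a minimal sketch in plain Python; the pre-activation values and the depth of 10 layers are made-up numbers for illustration.

```python
import math

def relu(z):
    """ReLU: a single comparison, so it is cheap to compute."""
    return z if z > 0 else 0.0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Sparsity: every non-positive pre-activation is mapped to exactly zero
pre_activations = [-2.0, -0.5, 0.0, 0.7, 3.0]
activations = [relu(z) for z in pre_activations]
sparsity = activations.count(0.0) / len(activations)

# Vanishing gradient: product of local derivatives across 10 stacked layers
depth = 10
relu_chain = relu_grad(3.0) ** depth        # stays at 1.0 for active units
sigmoid_chain = sigmoid_grad(3.0) ** depth  # shrinks towards zero
```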

Putting it all together: Backprop to the rescue!
● Forward propagation is performed the usual way
● So is the loss computation (remember that in most cases we are performing a classification)
● Backprop allows the filter parameters to be computed (Conv layers)
● Pooling layers have no parameters to learn, so they are (mostly) not affected by backprop
[Figure: LeNet-like network]

CNN
Are Convolutional Neural Networks invariant to…
● Scale?
● Rotation?
● Translation?

CNN
Are Convolutional Neural Networks invariant to…
● Scale? No
● Rotation? No
● Translation? Partially

Be careful!!!

What will I need?
1. Data
2. Data
3. and more Data

Rank the top 10 ways to do data augmentation
1. You
2. Can
3. Not
4. Rank
5. Them
6. Without
7. Knowing
8. The
9. Data
10. Distribution

Change labeling

Weighting the loss

Ignore Sampling

Over- or undersample

Augmentation
“porg”

Augmentation
“porg”

Get creative!
Mix of:
● translation
● rotation
● stretching
● shearing
● random erasing
● adding noise
● lens distortion, … (go crazy)
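
A few of the transformations in this list can be sketched in plain Python on an image stored as a list of rows. The function names, the 90-degree rotation step, the noise level and the erased patch size are all illustrative assumptions; a real pipeline would normally rely on an image library instead.

```python
import random

def translate(image, dx, dy, fill=0):
    """Shift the image dx pixels right and dy pixels down."""
    h, w = len(image), len(image[0])
    return [
        [image[r - dy][c - dx] if 0 <= r - dy < h and 0 <= c - dx < w else fill
         for c in range(w)]
        for r in range(h)
    ]

def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def add_noise(image, rng, sigma=0.1):
    """Add Gaussian noise to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in image]

def random_erase(image, rng, size=2, fill=0):
    """Blank out a randomly placed size x size square."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    out = [list(row) for row in image]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = fill
    return out

def augment(image, rng):
    """Apply a random mix of the transformations above."""
    out = translate(image, rng.randint(-1, 1), rng.randint(-1, 1))
    if rng.random() < 0.5:
        out = rotate90(out)
    if rng.random() < 0.5:
        out = random_erase(out, rng)
    return add_noise(out, rng)

rng = random.Random(42)
image = [[1.0] * 4 for _ in range(4)]
augmented = augment(image, rng)
```

Calling `augment` on every image at every epoch yields a different variant each time, so the model rarely sees exactly the same input twice.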

DATA
AUGMENTATIOOOON!!!

Test Time Augmentation
While augmentation helped give us a better model, prediction accuracy can be further improved by Test Time Augmentation (TTA)

Data Augmentation: Takeaway
● Simple to implement, can be done on the fly!
● Especially useful for small datasets
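
Test Time Augmentation boils down to averaging the model's predictions over several augmented views of the same input. This is a minimal sketch in plain Python, assuming flips as the augmentations and a deliberately naive stand-in classifier (`toy_model` is hypothetical and only exists to make the example runnable).

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]

def predict_with_tta(model, image, transforms):
    """Average class probabilities over the original and augmented views."""
    views = [image] + [t(image) for t in transforms]
    predictions = [model(view) for view in views]
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / len(predictions)
            for k in range(n_classes)]

def toy_model(image):
    """Hypothetical classifier that only looks at the first row of pixels."""
    p_bright = sum(image[0]) / len(image[0])
    return [1.0 - p_bright, p_bright]

image = [[0.2, 0.4], [0.8, 1.0]]
single = toy_model(image)                                    # one view only
averaged = predict_with_tta(toy_model, image, [hflip, vflip])
```

The single prediction depends entirely on where the bright pixels happen to sit, while the averaged one also takes the flipped views into account.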

IDEA / SOLUTION
● Let’s extract the content from the original photo.
● Let’s extract the style from a reference photo.
● Now combine content and style together to get a new “styled” result.

I HAVE GOT AN IDEA!!!

Content Loss
- The complexity of the features captured by a layer increases with its position in the network
- The responses in a layer l can be stored in a matrix F(l), where F(l; i, j) is the activation of the i-th filter at position j in layer l.
- Let p and x be the original image and the generated one, and P(l) and F(l) their feature representations in layer l. We define the content loss as the squared-error loss between these two representations.
- When this content loss is minimized, the mixed image has feature activations in the given layers that are very similar to the activations of the content image
- The input image is transformed into representations that are increasingly sensitive to the content of the image, but relatively invariant to its precise appearance.
- Higher layers capture the high-level content in terms of objects and their arrangement in the input image but do not constrain the exact pixel values of the reconstruction very much
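
The squared-error content loss described above can be written out explicitly. The formula did not survive in the slide text, so the following is a reconstruction using the symbols defined on this slide; the factor 1/2 follows the usual formulation in the neural style transfer literature and should be read as an assumption.

```latex
\[
\mathcal{L}_{\text{content}}(p, x, l)
  = \frac{1}{2} \sum_{i,j} \bigl( F^{l}_{ij} - P^{l}_{ij} \bigr)^{2}
\]
```

Here F^l_ij and P^l_ij are the activations of the i-th filter at position j in layer l for the generated image x and the original image p, respectively.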

Style Loss
- Which features in the style layers activate simultaneously for the style image? To obtain a representation of the style, we use the correlations between the different filter responses. These feature correlations are given by the Gram matrix.
- If an entry in the Gram matrix has a value close to zero, the two corresponding features in the given layer do not activate simultaneously for the given style image, and vice versa.
- We can construct an image that matches the style representation of a given input image by minimising the distance between the Gram matrices of the original image and the generated one (A and G, respectively).
- We can then define a style loss as a weighted sum over layers, with the weights chosen according to which layers we want to boost.
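
The quantities described above can be written out as follows. The formulas are missing from the slide text, so this is a reconstruction under assumptions: N_l (number of filters in layer l), M_l (size of each feature map), the layer weights w_l and the normalisation constant are borrowed from the usual formulation in the style transfer literature, not from the slide itself.

```latex
\[
G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}
\]
\[
E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \bigl( G^{l}_{ij} - A^{l}_{ij} \bigr)^{2}
\]
\[
\mathcal{L}_{\text{style}}(a, x) = \sum_{l} w_{l} E_{l}
\]
```

G^l is the Gram matrix of the generated image x in layer l and A^l that of the style image a; setting w_l to zero for a layer removes its contribution to the style.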

Gradient of an image
- To transfer the style of an artwork a onto a photograph p, we start from a white noise image and jointly minimise the distance of its feature representations from the content representation of the photograph in one layer and from the style representation of the painting (where α and β are the weighting factors for content and style reconstruction, respectively).
- Please note that both the style and content loss functions are differentiable with respect to the activations F(l; i, j). We can then differentiate the loss function with respect to the pixel values x in order to obtain a gradient that can be used as input for some numerical optimisation strategy.
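
The joint objective described above, with its weighting factors α and β, can be written as a single loss. The update rule on the second line assumes plain gradient descent with a hypothetical step size η, purely as an illustration; any numerical optimiser that consumes the gradient with respect to x could be used instead.

```latex
\[
\mathcal{L}_{\text{total}}(p, a, x)
  = \alpha \, \mathcal{L}_{\text{content}}(p, x)
  + \beta \, \mathcal{L}_{\text{style}}(a, x)
\]
\[
x \leftarrow x - \eta \, \frac{\partial \mathcal{L}_{\text{total}}}{\partial x}
\]
```

Note that the network weights stay fixed throughout: only the pixels of the generated image x are updated.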

Let’s make an experiment
Concept Image
Knowledge
Residual Ideas

Autoencoders (Idea)
[Diagram: Input → hidden → hidden → hidden → Output]
● Supervised neural networks try to predict labels from input data
● It is not always possible to obtain labels
● Unsupervised learning can help uncover the structure of the data
● What if we make the output be the input itself?

Autoencoders (Idea)
This is not the Generative Model you are looking for
[Diagram: Input image → Output image]
● It tries to predict x from x, so no labels are needed.
● The idea is to learn an approximation of the identity function.
● Along the way, some restrictions are imposed: typically the hidden layers compress the data.
● The original input is reconstructed at the output, even if it comes from noisy or corrupted data.

Autoencoders (Encoder and decoder)
[Diagram: Encode → Latent Space → Decode]
● The latent space is commonly a narrow hidden layer between encoder and decoder
● It learns the data structure
● Encoder and decoder can share the same (mirrored) structure or be different.
● Each one can have its own depth (number of layers) and complexity.

Autoencoders BackPropagation
[Diagram: Encode → Latent Space → Decode, with BackPropagation running back through both]
● A cost function can be defined from the differences between the input and Decoded(Encoded(Input))
● This allows BackProp to be carried through both Encoder and Decoder
● To prevent the composed function from simply being the identity, some regularization can be applied
● One of the most common is simply reducing the latent space dimension (i.e. compressing the data in the encoding)
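
The idea of backpropagating a reconstruction cost through decoder and encoder can be sketched with a deliberately tiny model. This is an illustration in plain Python under strong assumptions: a linear encoder and decoder without biases, a 1-D latent space, 2-D toy data lying on a line, and made-up initial weights and learning rate.

```python
def encode(e, x):
    """Encoder: compress a 2-D point into a 1-D latent code."""
    return e[0] * x[0] + e[1] * x[1]

def decode(d, z):
    """Decoder: expand the 1-D code back into a 2-D point."""
    return [d[0] * z, d[1] * z]

def reconstruction_loss(e, d, data):
    """Mean squared difference between Input and Decoded(Encoded(Input))."""
    total = 0.0
    for x in data:
        x_hat = decode(d, encode(e, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train_step(e, d, data, lr):
    """One full-batch gradient descent step on both encoder and decoder."""
    grad_e, grad_d = [0.0, 0.0], [0.0, 0.0]
    for x in data:
        z = encode(e, x)
        x_hat = decode(d, z)
        err = [2.0 * (a - b) / len(data) for a, b in zip(x_hat, x)]
        dz = err[0] * d[0] + err[1] * d[1]   # error reaching the latent code
        for k in range(2):
            grad_d[k] += err[k] * z          # decoder gradient
            grad_e[k] += dz * x[k]           # encoder gradient (chain rule)
    new_e = [e[k] - lr * grad_e[k] for k in range(2)]
    new_d = [d[k] - lr * grad_d[k] for k in range(2)]
    return new_e, new_d

# Toy data: 2-D points on the line y = 2x, so one latent dimension suffices
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
e, d = [0.5, 0.5], [0.5, 0.5]
initial_loss = reconstruction_loss(e, d, data)
for _ in range(500):
    e, d = train_step(e, d, data, lr=0.05)
final_loss = reconstruction_loss(e, d, data)
```

Because the latent code has a single dimension, the network cannot copy its input through unchanged; it has to discover the direction along which the data actually varies.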

Generative Models (Idea)
Generative Models
“What I cannot create, I do not understand.”
—Richard Feynman

Generative Models (Idea 2)
● They model how the data was generated in order to categorize a signal.
● Instead of modeling P(y|x), as the usual discriminative models do, the distribution under the hood is the joint P(x, y)
● The number of parameters is significantly smaller than the amount of data on which they are trained.
● This forces the models to discover the essence of the data
● What the model does is understand the world around the data and provide good representations of it

Generative Models Applications
● Generate potentially unfeasible examples for Reinforcement Learning
● Denoising/Pretraining
● Structured prediction exploration in RL
● Entirely plausible generation of images to depict image/video
● Feature understanding

Variational Autoencoder Idea
[Diagram: Encoder Network → Latent Space (Mean Vector, Standard Deviation Vector; Prior distribution) → Decoder Network]
Sample on Latent Space => Generate new representations

Introducing Keras
Demogorgon smile generation is beyond the state of the art

Latent Space Distribution (I)
[Diagram: Encoder Network → Latent Space (Mean Vector, Standard Deviation Vector) → Decoder Network]

Latent Space Distribution (II): VAE Loss function
[Diagram: Encoder Network → Latent Space (Mean Vector, Standard Deviation Vector) → Decoder Network]
● Encoder and decoder can be expressed as conditional probability distributions over the data.
● Typically the encoder reduces dimensionality while the decoder increases it, so some information is lost when reconstructing the inputs. This information loss can be measured using the reconstruction log-likelihood.
● In order to keep the latent distribution under control, we can introduce a regularizer into the loss function: the Kullback-Leibler divergence between the encoder distribution and a given, known distribution, such as the standard Gaussian.
● With this penalty in the loss, encoder outputs are forced to be sufficiently diverse: similar inputs will be kept close together (smoothly) in the latent space.
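
The two terms of the loss described above can be sketched in plain Python. The notation is an assumption, since the formulas are missing from the slide text: the encoder is taken to output a mean and standard deviation vector defining a diagonal Gaussian q(z|x), the prior is the standard Gaussian, and the decoder likelihood p(x|z) is taken to be a unit-variance Gaussian, so the reconstruction term reduces to half the squared error up to a constant.

```python
import math
import random

def kl_to_standard_normal(mean, std):
    """KL( N(mean, std^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mean, std))

def sample_latent(mean, std, rng):
    """Draw z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def reconstruction_nll(x, x_hat):
    """Negative log-likelihood of x under the assumed Gaussian decoder,
    dropping the additive constant."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))

def vae_loss(x, x_hat, mean, std):
    """Reconstruction term plus the K-L regularizer."""
    return reconstruction_nll(x, x_hat) + kl_to_standard_normal(mean, std)

# Made-up encoder outputs and reconstruction for a single example
rng = random.Random(0)
mean, std = [0.3, -0.2], [0.9, 1.1]
z = sample_latent(mean, std, rng)
loss = vae_loss([1.0, 0.0], [0.5, 0.0], mean, std)
```

The K-L term is zero only when the encoder outputs exactly the standard Gaussian, so it pulls every encoded input towards the same region of the latent space.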

[Figure labels: ReLU; Distribution Divergence (K-L); Reconstruction Loss]

Latent Space Distribution (III): Probability overview
[Diagram: Encoder Network → Latent Space (Mean Vector, Standard Deviation Vector) → Decoder Network]
● The VAE contains a specific probability model of data x and latent variables z.
● We can write the joint probability of the model as p(x, z): “how likely is observation x under the joint distribution”.
● By definition, p(x, z) = p(x∣z) p(z)
● In order to generate the data, the process is as follows:
For each datapoint i:
- Draw latent variables z_i ∼ p(z)
- Draw datapoint x_i ∼ p(x∣z)
● We need to figure out p(z) and p(x∣z)
● The likelihood is the representation to be learnt by the decoder
● Encoder likelihood can be used to estimate parameters from the prior.
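
The two-step generative process above translates directly into code. This is a minimal sketch in plain Python, assuming a standard Gaussian prior and a Gaussian likelihood; `toy_decoder` is a hypothetical fixed function standing in for a trained decoder network.

```python
import random

def sample_prior(rng, latent_dim):
    """Draw z_i ~ p(z), assumed here to be a standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: the mean of p(x | z)."""
    return [2.0 * z[0] + 1.0, -z[1], z[0] + z[1]]

def sample_likelihood(rng, z, noise_std=0.1):
    """Draw x_i ~ p(x | z_i), a Gaussian centred on the decoder output."""
    return [m + rng.gauss(0.0, noise_std) for m in toy_decoder(z)]

rng = random.Random(0)
samples = []
for _ in range(5):
    z = sample_prior(rng, latent_dim=2)       # step 1: latent variables
    samples.append(sample_likelihood(rng, z)) # step 2: datapoint
```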

Generative Adversarial Networks (GAN)

Generator
● The generator is trained to fool the discriminator.
● It creates samples that are intended to come from the same distribution as the training data.
● We define the generator as a function G that takes z as input and has its own set of parameters. G is simply a differentiable function. When z is sampled from some simple prior distribution, G(z) yields a sample of x drawn from p_model.
● The generator wishes to minimize its own cost and must do so while controlling only its own parameters.

Discriminator
● The discriminator examines samples to determine whether they are real or fake.
● It learns using traditional supervised learning techniques, dividing inputs into two classes (real or fake).
● The discriminator is a function D that takes x as input and has its own set of parameters.
● The discriminator wishes to minimize its own cost and must do so while controlling only its own parameters.

Come together
[Diagram: Generate n fake images → Get n training examples → Train Discriminator → Train Generator → Repeat]
● The generator must learn to fool the discriminator by learning to create samples from the same distribution as the training data.
● The players are represented by two functions, each differentiable with respect to both its inputs and its parameters.
● The training process consists of simultaneous SGD. On each step, two minibatches are sampled: a minibatch of x values from the dataset and a minibatch of z values from the model’s prior over latent variables. Then both players’ parameters are updated simultaneously, each to reduce its own cost.
● Each player’s cost depends on the other player’s parameters, but each player cannot control the other player’s parameters.
This is a game, not an optimization!
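
The tug of war between the two costs can be made concrete with a few numbers. This is a minimal sketch in plain Python, assuming the standard cross-entropy cost for the discriminator and the zero-sum (minimax) choice for the generator; the slides do not state which cost functions are used, and the discriminator outputs below are made up.

```python
import math

def discriminator_cost(d_real, d_fake):
    """Cross-entropy for labelling real samples as 1 and fake samples as 0.

    d_real: discriminator outputs D(x) on a minibatch of real data
    d_fake: discriminator outputs D(G(z)) on a minibatch of generated data
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -0.5 * real_term - 0.5 * fake_term

def generator_cost(d_real, d_fake):
    """Zero-sum game: the generator's cost is minus the discriminator's."""
    return -discriminator_cost(d_real, d_fake)

d_real = [0.9, 0.8]
fakes_spotted = [0.1, 0.2]   # discriminator correctly rejects the fakes
fakes_fooling = [0.9, 0.8]   # discriminator is fooled by the fakes

cost_d_when_winning = discriminator_cost(d_real, fakes_spotted)
cost_d_when_fooled = discriminator_cost(d_real, fakes_fooling)
cost_g_when_spotted = generator_cost(d_real, fakes_spotted)
cost_g_when_fooling = generator_cost(d_real, fakes_fooling)
```

Whatever lowers one player's cost raises the other's, which is why neither cost can simply be driven down in isolation.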

So far, so good
● Network architectures can also be used to hardcode invariances: convolutional networks bake in translation invariance, whereas physics models bake in invariance to translations, rotations, and permutations of atoms.
● Elastic distortions, scaling, translation, and rotation during training are effective data augmentation methods on MNIST, thanks to the symmetries present in that dataset.
● On natural image datasets, such as CIFAR-10 and ImageNet, random cropping, image mirroring and color shifting / whitening are more common.
● Common data augmentation methods for image recognition have been designed manually, and the best augmentation policies are dataset-specific.

Reformulating the problem
● Finding the best augmentation policy can be formulated as a discrete search problem: each sub-policy consists of two operations applied in sequence, each with two hyperparameters:
  1) the probability of applying the operation
  2) the magnitude of the operation
● A policy has 5 sub-policies, and there are 16 operations to choose from. Magnitudes and probabilities are discretized to 10 and 11 values, respectively
● For every image in a mini-batch, a sub-policy is chosen uniformly at random to train the neural network
● Stochasticity
● The search space with 5 sub-policies then has roughly (16 × 10 × 11)^10 ≈ 2.9 × 10^32 possibilities
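
The size of the search space follows from a short calculation: each operation slot has 16 × 10 × 11 options, and a policy has 5 × 2 = 10 such slots. The sketch below checks the arithmetic in plain Python and samples a random policy; the dictionary layout and the evenly spaced probability grid are illustrative assumptions.

```python
import random

NUM_OPERATIONS = 16
NUM_MAGNITUDES = 10
NUM_PROBABILITIES = 11
OPS_PER_SUB_POLICY = 2
NUM_SUB_POLICIES = 5

choices_per_operation = NUM_OPERATIONS * NUM_MAGNITUDES * NUM_PROBABILITIES
search_space_size = choices_per_operation ** (OPS_PER_SUB_POLICY * NUM_SUB_POLICIES)

def random_sub_policy(rng):
    """Two operations in sequence, each with a magnitude and a probability."""
    return [
        {
            "operation": rng.randrange(NUM_OPERATIONS),
            "magnitude": rng.randrange(NUM_MAGNITUDES),
            # assumed grid: 0.0, 0.1, ..., 1.0
            "probability": rng.randrange(NUM_PROBABILITIES) / (NUM_PROBABILITIES - 1),
        }
        for _ in range(OPS_PER_SUB_POLICY)
    ]

def random_policy(rng):
    return [random_sub_policy(rng) for _ in range(NUM_SUB_POLICIES)]

def operations_to_apply(policy, rng):
    """For one image: pick a sub-policy uniformly at random, then keep
    each of its operations with that operation's probability."""
    sub_policy = rng.choice(policy)
    return [op for op in sub_policy if rng.random() < op["probability"]]

rng = random.Random(0)
policy = random_policy(rng)
ops_for_this_image = operations_to_apply(policy, rng)
```

Because both the sub-policy and the application of each operation are random, the same image can be transformed differently in every mini-batch.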

Algo highlights
● The search algorithm has two components: a controller (an RNN, fed at each step with its own previous prediction) and a training algorithm, which is Proximal Policy Optimization (an RL algorithm)
● In total the controller makes 30 softmax predictions in order to predict 5 sub-policies (each with 2 operations, and each operation with its type, magnitude and probability: 5 × 2 × 3 = 30)
● The controller is trained with a reward signal: how much the policy improves the generalization of a "child model" (a neural network trained as part of the search process) trained with augmented data generated by applying the 5 sub-policies to the training set
● For each example in the mini-batch, one of the 5 sub-policies is chosen randomly to augment the image.
● On each dataset, the controller samples about 15,000 policies.
● At the end of the search, the sub-policies from the best 5 policies are concatenated into a single policy (with 25 sub-policies), which is then used to train the models for each dataset.

Results
ImageNet
Fine-Grained Visual Classification Datasets
CIFAR-100
CIFAR-10

Table of Contents
1. Introduction: why combine models?
2. Boosting & Bagging basics
3. Demo:
○ Implementing AdaBoost with binary trees
○ Feature selection with Random Forest
Not all that wander are lost
Any Questions?

THANK YOU!

BE AWARE!

Let me introduce you to my friend Cajal. He knew something about neurons
dendrite
axon
synapses: impulse transmission
