Fernando Velasco @fer_maat
Raúl de la Fuente @neurozetta
Wild Data
Introduction
© Stratio 2018. Confidential, All Rights Reserved.
Confusion Matrix
Layers, layers, layers
CNN Stuff
Building the structures: how can we define a neuron?
BackPropagation Basics
Forward Propagation: get a result
Backward Propagation: who’s to blame?
Input hidden hidden hidden
Output
Error
Estimation: evaluate performance
● A cost function C is defined
● Every parameter has its impact on the cost, given some training examples
● Impacts are computed as partial derivatives of the cost
● Use the chain rule to propagate the error backwards
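A minimal sketch of the two passes, assuming a tiny network with one sigmoid hidden layer, a linear output and a squared-error cost C. The function name `forward_backward` and the layer sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward_backward(x, y, W1, W2):
    """One forward and one backward pass for input x and target y."""
    # Forward propagation: get a result.
    h = sigmoid(W1 @ x)                    # hidden activations
    y_hat = W2 @ h                         # linear output
    cost = 0.5 * np.sum((y_hat - y) ** 2)  # cost function C

    # Backward propagation: who is to blame? Apply the chain rule layer by layer.
    d_yhat = y_hat - y                     # dC/dy_hat
    dW2 = np.outer(d_yhat, h)              # dC/dW2
    d_h = W2.T @ d_yhat                    # dC/dh
    d_s = d_h * h * (1.0 - h)              # through the sigmoid derivative
    dW1 = np.outer(d_s, x)                 # dC/dW1
    return cost, dW1, dW2
```

Each gradient has the same shape as the parameter it belongs to, so a gradient-descent update is just `W -= learning_rate * dW`.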
Image Data
● Images are composed of pixels.
● Grayscale images can be seen as matrices.
● Colour images are usually represented as mixes of three colours: Red, Green and Blue.
● Each channel can be seen as a grayscale-like image.
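As a small illustration of the points above, here is how a grayscale image and an RGB image might be held as NumPy arrays. The pixel values and the simple channel-average conversion are arbitrary choices for the sketch.

```python
import numpy as np

# A grayscale image is a 2-D matrix of intensities (here 2 rows x 3 columns).
gray = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# A colour image adds a third axis with one slice per channel: R, G, B.
rgb = np.zeros((2, 3, 3), dtype=np.uint8)
rgb[..., 0] = 255  # fill the red channel: a pure red image

# Each channel, taken alone, is again a grayscale-like matrix.
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

def to_grayscale(img):
    """Collapse the channels with a plain average (one simple convention)."""
    return img.astype(float).mean(axis=-1)
```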
Convolutions
Introducing
Keras
Convolutional
representations
Convolution Examples (I)
Edge
Detection
Edge
Enhance
(right)
Convolution Examples (II)
Blur
Sharpen
Emboss
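The effects named on these two slides can be reproduced with small 3×3 kernels. The sketch below uses commonly quoted kernel values, which are an assumption and not necessarily the ones shown on the slides, together with a plain "valid" (no padding) sliding-window product as used in CNN layers.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding ('valid' mode)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Commonly used example kernels (assumed values).
KERNELS = {
    "edge_detection": np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=float),
    "blur": np.full((3, 3), 1.0 / 9.0),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
    "emboss": np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], dtype=float),
}
```

On a perfectly flat image the edge-detection kernel responds with zeros, while blur and sharpen leave the intensity unchanged, because their weights sum to 0 and 1 respectively.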
ReLU activation
- Mitigates the vanishing gradient problem
- Efficient computation
- Sparsity
- Adaptability
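For reference, a minimal NumPy version of the activation and its derivative. The derivative at exactly zero is set to 0 here, which is a convention rather than something stated on the slide.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): cheap to compute and zero for negative inputs."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative is 1 for positive inputs and 0 otherwise, so it does not shrink gradients."""
    return (x > 0).astype(float)
```

The zeros produced for negative inputs are the source of the sparsity mentioned above.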
Putting it all together: Backprop to the rescue!
● Forward propagation is performed the usual way
● So is the loss (remember in most cases we are performing a
classification)
LeNet-like network
● Backprop computes the updates for the filter parameters (Conv layers)
● Pooling layers have no parameters to learn; backprop only routes gradients through them
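To make the pooling remark concrete, here is a sketch of 2×2 max pooling with its backward pass, assuming a single-channel input with even height and width. There is nothing to learn in this layer: the backward pass only sends each incoming gradient to the position that held the maximum.

```python
import numpy as np

def maxpool2x2_forward(x):
    """2x2 max pooling with stride 2; also returns the argmax of each block."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    return blocks.max(axis=-1), blocks.argmax(axis=-1)

def maxpool2x2_backward(grad_out, idx, in_shape):
    """Route each output gradient back to the input position that was the maximum."""
    h, w = in_shape
    grad_blocks = np.zeros((h // 2, w // 2, 4))
    rows, cols = np.indices(idx.shape)
    grad_blocks[rows, cols, idx] = grad_out
    return grad_blocks.reshape(h // 2, w // 2, 2, 2).transpose(0, 2, 1, 3).reshape(h, w)
```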
Classic Data
Augmentation
CNN
Are Convolutional Neural Networks
invariant to…
● Scale?
● Rotation?
● Translation?
CNN
Are Convolutional Neural Networks
invariant to…
● Scale? No
● Rotation? No
● Translation? Partially
Be careful!!!
What will I need?
1. Data
2. Data
3. and more Data
Top 10 ways to do data augmentation, ranked
1. You
2. Can
3. Not
4. Rank
5. Them
6. Without
7. Knowing
8. The
9. Data
10. Distribution
Change labeling
Weighting the loss
Ignore Sampling
Over- or undersample
Augmentation
“porg”
Augmentation
“porg”
Get creative!
Mix of:
● translation
● rotation
● stretching
● shearing
● random erasing
● adding noise
● lens distortion, … (go crazy)
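A few of the transformations in this list can be chained in a handful of NumPy lines. This is only a rough sketch under simplifying assumptions: square single-channel images with values roughly in [0, 1], a translation that wraps around the border, rotations restricted to multiples of 90 degrees, and arbitrary noise and patch sizes.

```python
import numpy as np

def augment(img, rng):
    """Return a randomly mirrored, shifted, rotated, noisy and partly erased copy of img."""
    out = img.astype(float).copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                                  # horizontal mirroring
    out = np.roll(out, shift=rng.integers(-2, 3), axis=1)   # crude translation (wraps around)
    out = np.rot90(out, k=rng.integers(0, 4))               # rotation by a multiple of 90 degrees
    out = out + rng.normal(0.0, 0.05, size=out.shape)       # additive Gaussian noise
    h, w = out.shape
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    out[y:y + h // 4, x:x + w // 4] = 0.0                   # random erasing of a small patch
    return out
```

Calling `augment(image, np.random.default_rng(seed))` repeatedly yields different variants of the same labelled example, which is what makes augmentation possible on the fly.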
DATA
AUGMENTATIOOOON!!!
Test Time Augmentation
While augmentation helped give us a better model...
prediction accuracy can be further improved by TTA
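In its simplest form, TTA averages the model's predictions over several augmented views of the same test image. The sketch below assumes a hypothetical `predict_fn` that maps a 2-D image to class probabilities; `toy_predict` is a stand-in for a trained model, and the flips and shifts are arbitrary choices.

```python
import numpy as np

def predict_with_tta(predict_fn, image, n_aug=8, seed=0):
    """Average predict_fn over the original image and n_aug augmented views."""
    rng = np.random.default_rng(seed)
    views = [image]
    for _ in range(n_aug):
        view = image[:, ::-1] if rng.random() < 0.5 else image  # random mirroring
        view = np.roll(view, rng.integers(-2, 3), axis=1)       # small shift (wraps around)
        views.append(view)
    probs = np.stack([predict_fn(v) for v in views])
    return probs.mean(axis=0)

def toy_predict(image):
    """Hypothetical two-class model: compares the brightness of the two image halves."""
    half = image.shape[1] // 2
    scores = np.array([image[:, :half].mean(), image[:, half:].mean()])
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

Because every view is a valid probability vector, the average is one too.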
Data Augmentation: Takeaway
• Simple to implement, can be done on the fly!
• Especially useful for small datasets
Transfer Style
IDEA / SOLUTION
● Let’s extract content from original photo.
● Let’s extract style from reference photo.
● Now combine content and style together
to get a new “styled” result.
RECREATION
I HAVE GOT AN IDEA!!!
Content Loss
- The complexity of the features increases with the depth of the layer.
- The responses in a layer l can be stored in a matrix F(l), where F(l; i, j), written $F^l_{ij}$ in the formulas, is the activation of the i-th filter at position j in layer l.
- Let p and x be the original image and the generated one, and P(l) and F(l) their feature representations in layer l. We define the squared-error loss between the representations:
  $\mathcal{L}_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$
- When this content loss is minimized, the mixed image has feature activations in the given layers that are very similar to those of the content image.
- The input image is transformed into representations increasingly sensitive to the content of the image, but relatively invariant to its precise appearance.
- Higher layers capture the high-level content in terms of objects and their arrangement in the input image, but do not constrain the exact pixel values of the reconstruction very much.
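The squared-error content loss is short enough to write out directly. In this sketch `F` and `P` are assumed to be arrays of shape (filters, positions) holding the layer activations for the generated and the original image; how those activations are extracted from a network is left out.

```python
import numpy as np

def content_loss(F, P):
    """Half the sum of squared differences between two feature maps of one layer."""
    return 0.5 * np.sum((F - P) ** 2)
```

The loss is zero only when both images produce identical activations in the chosen layer.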
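Style Loss
- Which features in the style layers activate simultaneously for the style image? To obtain a representation of the style, we use the correlations between the different filter responses. These feature correlations are given by the Gram matrix, whose entry (i, j) in layer l is the inner product between the responses of filters i and j: $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$, where $F^l_{ik}$ is the activation of filter i at position k.
- If an entry in the Gram matrix has a value close to zero, the two features in the given layer do not activate simultaneously for the given style image; a large value means that they do.
- We can construct an image that matches the style representation of a given input image by minimising the distance between the Gram matrices of the original image and the generated one (A, G). With $N_l$ filters and $M_l$ positions in layer l, the contribution of that layer is:
  $E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2$
- And we can define a style loss, with weights $w_l$ chosen according to the layers to boost:
  $\mathcal{L}_{style}(a, x) = \sum_l w_l E_l$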
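A sketch of the Gram matrix and the weighted style loss in NumPy. It assumes each layer's activations come as an array of shape (filters, positions) and uses the 1/(4 N² M²) normalisation from the original style-transfer formulation; the function names and the list-of-layers interface are illustrative.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between filter responses: entry (i, j) correlates filters i and j."""
    return F @ F.T

def layer_style_loss(F_generated, F_style):
    """Normalised squared distance between the Gram matrices of one layer."""
    n_filters, n_positions = F_generated.shape
    G = gram_matrix(F_generated)
    A = gram_matrix(F_style)
    return np.sum((G - A) ** 2) / (4.0 * n_filters ** 2 * n_positions ** 2)

def style_loss(generated_feats, style_feats, weights):
    """Weighted sum of the per-layer losses; larger weights boost the chosen layers."""
    return sum(w * layer_style_loss(Fg, Fs)
               for w, Fg, Fs in zip(weights, generated_feats, style_feats))
```

Two filters that never fire at the same positions give a zero off-diagonal entry in the Gram matrix, which matches the interpretation given above.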
Gradient of an image
- To transfer the style of an artwork a onto a photograph p, we start from a white noise image x and jointly minimise the distance of its feature representations from the content representation of the photograph in one layer and from the style representation of the painting (where α and β are the weighting factors for content and style reconstruction, respectively):
  $\mathcal{L}_{total}(p, a, x) = \alpha \, \mathcal{L}_{content}(p, x) + \beta \, \mathcal{L}_{style}(a, x)$
- Please note that both the style and content loss functions are differentiable with respect to the activations F(l; i, j). We can then differentiate the loss function with respect to the pixel values x in order to obtain a gradient that can be used as input for some numerical optimisation strategy.
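The key idea, optimising the pixels rather than the network weights, can be shown on a toy problem. The sketch below replaces the CNN with a hypothetical fixed linear feature extractor `W` and keeps only a content term, so the gradient with respect to the pixels has a closed form; a real implementation would obtain it by backpropagating through the network and would add the β-weighted style term.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 16)) / 4.0   # hypothetical linear "feature extractor"
p = rng.random(16)                   # flattened content image
P = W @ p                            # its content representation

def loss(x):
    return 0.5 * np.sum((W @ x - P) ** 2)

def grad_wrt_pixels(x):
    # Chain rule: d loss / d x = W^T (W x - P)
    return W.T @ (W @ x - P)

x = rng.normal(size=16)              # start from a white noise image
initial_loss = loss(x)
for _ in range(200):
    x -= 0.1 * grad_wrt_pixels(x)    # plain gradient descent on the pixels
final_loss = loss(x)
```

The image `x` is the only thing being updated; the feature extractor stays fixed throughout.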
Generative Models
Let’s make an experiment
Concept Image
Knowledge
Residual Ideas
Autoencoders (Idea)
Input hidden hidden hidden
Output
● Supervised neural networks try to predict
labels from input data
● It is not always possible to obtain labels
● Unsupervised learning can help obtain data
structure.
● What if we set the target output to be the input itself?
Autoencoders (Idea)
This is not the Generative
Model you are looking for
Input image Output image
● It tries to predict x from x, but no labels are
needed.
● The idea is learning an approximation of the
identity function.
● Along the way, some restrictions are placed:
typically the hidden layers compress the data.
● The original input is reconstructed at the output,
even if it comes from noisy or corrupted data.
Autoencoders (Encoder and decoder)
This is not the Generative
Model you are looking for
Input image Output image
● The latent space is commonly a narrow hidden
layer between encoder and decoder
● It learns the data structure
● Encoder and decoder can share the same
(mirrored) structure or be different.
● Each one can have its own depth (number of
layers) and complexity.
Encode Decode
Latent Space
Autoencoders BackPropagation
This is not the Generative
Model you are looking for
Input image Output image
● A cost function can be defined from the
differences between the input and
Decoded(Encoded(Input))
● This allows BackProp to be carried through
both Encoder and Decoder
● To prevent the composition from collapsing to
the identity, some regularization can be applied
● One of the most common is simply reducing the
latent space dimension (i.e. compressing the
data in the encoding)
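A minimal sketch of this training loop, assuming a purely linear encoder and decoder, synthetic 8-dimensional data that really lies on a 2-dimensional subspace, and a 2-dimensional latent space. All sizes, the learning rate and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 8-D that lie on a 2-D subspace.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
X = X / X.std()

W_enc = rng.normal(scale=0.1, size=(8, 2))  # encoder: 8 -> 2 (compression)
W_dec = rng.normal(scale=0.1, size=(2, 8))  # decoder: 2 -> 8

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared difference between Input and Decoded(Encoded(Input))."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)
learning_rate = 0.01
for _ in range(500):
    Z = X @ W_enc                        # encode into the latent space
    X_hat = Z @ W_dec                    # decode back to the input space
    err = 2.0 * (X_hat - X) / len(X)     # d loss / d X_hat
    grad_dec = Z.T @ err                 # backprop through the decoder
    grad_enc = X.T @ (err @ W_dec.T)     # ...and on through the encoder
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
```

The narrow latent layer is the only regularizer here: with 2 latent dimensions the network cannot copy all 8 inputs through, so it has to find the structure of the data.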
Encode Decode
Latent Space
BackPropagation
Generative Models (Idea)
Generative Models
“What I cannot create, I do
not understand.”
—Richard Feynman
Generative Models (Idea 2)
● They model how the data was generated in
order to categorize a signal.
● Instead of modeling P(y|x), as the usual
discriminative models do, the distribution under
the hood is P(x, y).
● The number of parameters is significantly
smaller than the amount of data on which they
are trained.
● This forces the models to discover the essence
of the data.
● The model has to understand the world around
the data and provide good representations of it.
Generative Models Applications
● Generate potentially unfeasible examples for
Reinforcement Learning
● Denoising/Pretraining
● Structured prediction exploration in RL
● Entirely plausible generation of images to
depict image/video
● Feature understanding
Variational Autoencoder Idea
Input image Output image
Latent Space
Mean Vector
Standard Deviation
Vector
Encoder
Network
Decoder
Network
Sample on Latent Space => Generate new representations
Prior distribution
Keras
Introducing
Keras
Demogorgon smile
generation is beyond the
state of the art
Latent Space Distribution (I)
Latent Space
Mean Vector
Standard Deviation
Vector
Encoder
Network
Decoder
Network
Latent Space Distribution (II): VAE Loss function
Latent Space
Mean Vector
Standard Deviation
Vector
Encoder
Network
Decoder
Network
● Encoder and decoder can be written as conditional
probability distributions: the encoder as $q(z \mid x)$ and
the decoder as $p(x \mid z)$.
● Typically the encoder reduces the dimension and the
decoder increases it again, so some information is lost
when reconstructing the inputs. This information loss
can be measured using the reconstruction log-likelihood:
$\mathbb{E}_{z \sim q(z \mid x)}\left[\log p(x \mid z)\right]$
● In order to keep the latent distribution under control,
we can introduce a regularizer into the loss function:
the Kullback-Leibler divergence between the encoder
distribution and a given, known distribution, such as the
standard Gaussian:
$\mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$, with $p(z) = \mathcal{N}(0, I)$
● With this penalty in the loss, encoder outputs are forced
to be sufficiently diverse, and similar inputs are kept close
together (smoothly) in the latent space.
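The two terms of the loss can be sketched in NumPy under common assumptions that the slides do not spell out: a diagonal Gaussian encoder parameterised by a mean and a log-variance vector (instead of the standard deviation vector in the diagram), a standard Gaussian prior, and a unit-variance Gaussian decoder, so that the negative log-likelihood reduces to a squared error up to a constant.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL regularizer, averaged over the batch."""
    reconstruction = 0.5 * np.sum((x - x_hat) ** 2, axis=-1)
    return np.mean(reconstruction + kl_to_standard_normal(mu, log_var))
```

The KL term vanishes exactly when the encoder outputs the prior itself (zero mean, unit variance) and grows as the encoded distribution drifts away from it.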
Relu
Distribution
Divergence (K-L)
Reconstruction
Loss
Latent Space Distribution (III): Probability overview
Latent Space
Mean Vector
Standard Deviation
Vector
Encoder
Network
Decoder
Network
● The VAE contains a specific probability model of
data x and latent variables z.
● We can write the joint probability of the model as
p(x,z): “how likely is observation x under the joint
distribution”.
● By definition, p(x, z)=p(x∣z)p(z)
● In order to generate the data, the process is as
follows:
For each datapoint i:
- Draw latent variables $z_i \sim p(z)$
- Draw datapoint $x_i \sim p(x \mid z_i)$
● We need to figure out p(z) and p(x|z)
● The likelihood is the representation to be learnt from
the decoder
● Encoder likelihood can be used to estimate
parameters from the prior.
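The generative process described above is ancestral sampling: first the latent variable, then the datapoint given that latent. A sketch with a hypothetical linear decoder `W` and a small fixed observation noise, both chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5
W = rng.normal(size=(latent_dim, data_dim))   # hypothetical decoder weights

def sample(n):
    """Draw n datapoints by sampling z from the prior and then x given z."""
    z = rng.standard_normal((n, latent_dim))             # z_i ~ p(z) = N(0, I)
    mean = z @ W                                         # decoder output for each z_i
    x = mean + 0.1 * rng.standard_normal(mean.shape)     # x_i ~ p(x | z_i)
    return z, x
```

In a trained VAE the linear map would be replaced by the decoder network, while the sampling order stays the same.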
Generative
Adversarial
Networks
(GAN)
Generator
● The generator is trained to fool the
discriminator.
● It creates samples that are intended to
come from the same distribution as the
training data.
● We define the generator as a function G
that takes z as input and uses $\theta^{(G)}$ as
parameters. This is simply a
differentiable function G. When z is
sampled from some simple prior
distribution, G(z) yields a sample of x
drawn from $p_{model}$.
● The generator wishes to minimize its cost
$J^{(G)}(\theta^{(D)}, \theta^{(G)})$ and must do so while
controlling only $\theta^{(G)}$.
Discriminator
● The discriminator examines samples to
determine whether they are real or fake.
● It learns using traditional supervised
learning techniques, dividing inputs into
two classes (real or fake).
● The discriminator is a function D that
takes x as input and uses $\theta^{(D)}$ as
parameters.
● The discriminator wishes to minimize its cost
$J^{(D)}(\theta^{(D)}, \theta^{(G)})$ and must do so while
controlling only $\theta^{(D)}$.
Come together
Generate n
Fake
Images
Get n
training
examples
Train
Discriminator
Train
Generator
Repeat
● The generator must learn to fool the
discriminator by learning to create samples from
the same distribution as the training data.
● The players are represented by two functions,
each differentiable with respect to both its
inputs and its parameters.
● The training process consists of simultaneous
SGD. On each step, two minibatches are
sampled: a minibatch of x values from the
dataset and a minibatch of z values from the
model’s prior over latent variables. Then both
players’ parameters are updated simultaneously,
each one reducing its own cost.
● Each player’s cost depends on the other
player’s parameters, but each player cannot
control the other player’s parameters.
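The simultaneous update can be illustrated on a one-dimensional toy game. Everything here is an assumption made for the sketch: a generator G(z) = a·z + b, a logistic discriminator D(x) = sigmoid(w·x + c), the usual cross-entropy cost for the discriminator, the non-saturating cost −log D(G(z)) for the generator, and hand-derived gradients.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gan_step(theta_d, theta_g, x_real, z, lr=0.05):
    """One simultaneous SGD step: each player updates only its own parameters."""
    w, c = theta_d
    a, b = theta_g
    x_fake = a * z + b                       # generator samples G(z)
    d_real = sigmoid(w * x_real + c)         # D on the training minibatch
    d_fake = sigmoid(w * x_fake + c)         # D on the generated minibatch

    # Discriminator cost: -mean(log D(x)) - mean(log(1 - D(G(z))))
    grad_w = np.mean((d_real - 1.0) * x_real) + np.mean(d_fake * x_fake)
    grad_c = np.mean(d_real - 1.0) + np.mean(d_fake)

    # Generator cost (non-saturating variant): -mean(log D(G(z)))
    grad_a = np.mean((d_fake - 1.0) * w * z)
    grad_b = np.mean((d_fake - 1.0) * w)

    # Both gradients were computed at the old parameters: the updates are simultaneous.
    return (w - lr * grad_w, c - lr * grad_c), (a - lr * grad_a, b - lr * grad_b)

rng = np.random.default_rng(0)
theta_d, theta_g = (0.1, 0.0), (1.0, 0.0)
for _ in range(500):
    x_real = rng.normal(3.0, 1.0, size=64)   # minibatch from the "dataset"
    z = rng.standard_normal(64)              # minibatch from the prior over z
    theta_d, theta_g = gan_step(theta_d, theta_g, x_real, z)
```

Note that the generator's gradient contains the discriminator's weight `w`: each cost depends on the other player's parameters even though neither player can change them.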
This is a game, not an
optimization!
Google
AutoAugment
So far, so good
● Network architectures can also be used to hardcode
invariances: convolutional networks bake in translation
invariance, whereas physics models bake in
invariance to translations, rotations, and permutations
of atoms.
● Elastic distortions, scale, translation, and rotation
applied during training are effective data augmentation
methods on MNIST, due to the symmetries
present in that dataset.
● On natural image datasets, such as CIFAR-10 and
ImageNet, random cropping, image mirroring and
color shifting / whitening are more common.
● Common data augmentation methods for image
recognition have been designed manually and the
best augmentation policies are dataset-specific.
Reformulating the problem
● Finding the best augmentation policy
can be formulated as a discrete search
problem. Each sub-policy consists of two
operations applied in sequence, and each
operation has two hyperparameters:
● 1) the probability of applying the operation
● 2) the magnitude of the operation.
● A policy has 5 sub-policies, and there are 16
operations to choose from. Magnitudes and
probabilities are discretized to 10 and 11
values, respectively.
● For every image in a mini-batch, a sub-policy
is chosen uniformly at random to
train the neural network.
● Stochasticity
● The search space with 5 sub-policies
then has roughly (16 × 10 × 11)**10 ≈
2.9 × 10**32 possibilities.
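The size of the search space and the per-image sampling can be checked with a few lines of Python. The operation names are placeholders (`op_0` … `op_15`), and the evenly spaced probability levels 0.0, 0.1, …, 1.0 are an assumption consistent with 11 discrete values.

```python
import random

N_OPS, N_MAGNITUDES, N_PROBABILITIES = 16, 10, 11
OPS = [f"op_{i}" for i in range(N_OPS)]   # placeholder operation names

# Each operation slot has 16 * 10 * 11 choices; 5 sub-policies x 2 operations = 10 slots.
search_space = (N_OPS * N_MAGNITUDES * N_PROBABILITIES) ** 10

def sample_sub_policy(rng):
    """Two operations in sequence, each with a probability and a magnitude level."""
    return [(rng.choice(OPS), rng.randrange(N_PROBABILITIES) / 10, rng.randrange(N_MAGNITUDES))
            for _ in range(2)]

def sample_policy(rng):
    """A policy is a list of 5 sub-policies."""
    return [sample_sub_policy(rng) for _ in range(5)]

def ops_for_image(policy, rng):
    """Pick one sub-policy uniformly at random, then apply each operation with its probability."""
    sub_policy = rng.choice(policy)
    return [(op, magnitude) for op, prob, magnitude in sub_policy if rng.random() < prob]
```

The two layers of randomness (which sub-policy, and whether each operation fires) are what the "Stochasticity" bullet refers to: the same image can be transformed differently in every mini-batch.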
Algo highlights
● The search algorithm has two components: a controller (an RNN in
which each prediction is fed into the next step) and the training
algorithm, which is Proximal Policy Optimization (RL)
● In total the controller makes 30 softmax predictions to produce 5
sub-policies: each sub-policy has 2 operations, and each operation
needs an operation type, a magnitude and a probability (5 × 2 × 3 = 30)
● The controller is trained with a reward signal: how much the
policy improves the generalization of a "child model" (a neural
network trained as part of the search process) trained with
augmented data generated by applying the 5 sub-policies to the
training set
● For each example in the mini-batch, one of the 5 sub-policies is
chosen randomly to augment the image.
● On each dataset, the controller samples about 15,000 policies.
● At the end of the search, the sub-policies from the best 5 policies
are concatenated into a single policy (with 25 sub-policies), which
is used to train the models for each dataset.
Results
ImageNet
Fine-Grained Visual Classification Datasets
CIFAR-100
CIFAR-10
Table of Contents
1. Introduction: why combine models?
2. Boosting & Bagging basics
3. Demo:
○ AdaBoost implementation with binary trees
○ Feature selection with Random Forest
Not all those who
wander are lost
Any Questions?
Fernando Velasco @fer_maat
Raúl de la Fuente @neurozetta
THANK YOU!
Fernando Velasco @fer_maat
Raúl de la Fuente @neurozetta
I.A.
BE AWARE!
Let me introduce you to my friend Cajal. He knew something about neurons
dendrite
axon
synapses: impulse transmission

Wild Data - The Data Science Meetup


Editor's Notes

  • #2 Start SimpleScreenRecorder
  • #3 Maybe cover the Google paper right after the classic data augmentation, with a demo. I am adding image ideas to your parts in case you would like to include them. I am worried about the NN and CNN introduction because it does not quite fit, but it has to go somewhere, and before the transfer style. Maybe between the classic part and the TS?
  • #4 What is presales? Neuroscience. Emotional engineers. Lecturer at ESIC, universities and Factoría Cultural. Mathematician, data scientist, supporter of "el Glorioso"
  • #7 The Man Who Mistook His Wife for a Hat
  • #10 Biological systems: a natural cognitive system that we already know and try to replicate with machines...
  • #14 "valid" means "no padding". "same" results in padding the input such that the output has the same length as the original input.
  • #17 "valid" means "no padding". "same" results in padding the input such that the output has the same length as the original input.
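    To make the two padding modes concrete, here is a small sketch of the output length of a 1-D convolution under each mode, assuming the usual Keras convention (no dilation). The helper name `conv_output_length` is illustrative, not a library function.

```python
import math

def conv_output_length(input_length, kernel_size, padding, stride=1):
    """Output length of a 1-D convolution without dilation."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        # Input is padded so that, with stride 1, output length equals input length.
        return math.ceil(input_length / stride)
    raise ValueError("padding must be 'valid' or 'same'")

valid_len = conv_output_length(28, 5, "valid")  # 24: the border is lost
same_len = conv_output_length(28, 5, "same")    # 28: length is preserved
```

    For 2-D images the same formula applies independently to height and width.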
  • #23 The brain is wired to look for logical solutions, and since we are used to seeing two eyes and a mouth in a face, we try to find something that would probably lead us to solve the mystery we are mainly focused on. All this leads to a jumble of thoughts clearly scattered inside the cerebellum and makes us hungry, so you will go and eat something right now :)
  • #24 Note that you have to be careful when adjusting the distribution of training labels, as this will have an impact on the way the model predicts at inference: if you increase the number of sick patients in your training set, the model will also predict sickness more often.
  • #25 The obvious approach is to try to collect more data from the rare classes. For the medical vision example, this implies that we try to focus on collecting images of patients who have a certain disease we try to diagnose. If this is not possible because it is too expensive, there might be other ways to obtain training data, as mentioned in the previous section. Note that you have to be careful when adjusting the distribution of training labels, as this will have an impact on the way the model predicts at inference: if you increase the number of sick patients in your training set, the model will also predict sickness more often.
  • #27 If you cannot get more data from the rare classes, another approach is to rethink the taxonomy. For the practical application it might not be necessary to differentiate between disease A and disease B, as long as you recognize that it is one of the two. In that case, you can merge the two classes, either at training time to simplify the training procedure, or during inference, which means there is no penalty if disease A and B are confused.
  • #28 False positives and false negatives; the cost function used in backpropagation
  • #29 Negative mining. The third group of sampling methods is a bit more complex but indeed the most powerful one. Instead of over- or undersampling, we choose the samples intentionally. Although we have many more samples of the frequent class, we care most about the most difficult samples, i.e. the samples which are misclassified with the highest probabilities. Thus, we can regularly evaluate the model during training and inspect the samples to identify those that are more likely to be misclassified. This enables us to wisely select the samples that are shown to the algorithm more often.
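    One way to sketch the selection step, assuming we already have, for each sample, the probability the current model assigns to its true class. The function names are illustrative; ranking by cross-entropy loss is one common choice and is an assumption here, not necessarily the criterion used in the talk.

```python
import math

def cross_entropy(prob_true_class):
    """Loss for one sample given the probability assigned to its true class."""
    return -math.log(max(prob_true_class, 1e-12))

def hardest_samples(probs_true_class, k):
    """Return the indices of the k samples with the highest loss."""
    losses = [cross_entropy(p) for p in probs_true_class]
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k]

# Probabilities assigned to the correct class by the current model (made-up values).
probs = [0.9, 0.1, 0.6, 0.3]
to_replay = hardest_samples(probs, k=2)  # indices 1 and 3: the most confidently wrong
```

    The selected indices would then be shown to the model more often in the following training rounds, and the ranking recomputed at each evaluation.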
  • #30 The advantage of this technique compared to the previous is that no samples are ignored.
  • #35 While augmentation helped us get a better model, prediction accuracy can be improved further with what is called test-time augmentation (TTA). To mitigate errors like these we use TTA, where we predict the class for the original test image along with 4 random variants of the same image. We then average the predictions to determine which class the image belongs to.
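    The averaging step can be sketched as follows, assuming `predict` returns a list of class probabilities. The toy classifier and the deterministic `flip` augmentation are hypothetical stand-ins for a trained model and the random variants mentioned in the note.

```python
def tta_predict(predict, image, augmentations):
    """Average class probabilities over the original image and its augmented views."""
    views = [image] + [augment(image) for augment in augmentations]
    predictions = [predict(view) for view in views]
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / len(views) for c in range(n_classes)]

# Hypothetical stand-ins: a 1-D "image" and a classifier that only looks at the first pixel.
def toy_predict(image):
    return [1.0, 0.0] if image[0] > 0 else [0.0, 1.0]

def flip(image):
    return image[::-1]

averaged = tta_predict(toy_predict, [1, -1], [flip])
predicted_class = max(range(len(averaged)), key=lambda c: averaged[c])
```

    In practice `augmentations` would hold four randomly parameterised transforms, so each test image costs five forward passes.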
  • #36 Data augmentation has been shown to produce promising ways to increase the accuracy of classification tasks. While traditional augmentation is very effective alone, other techniques enabled by generative models have proved to be even better
  • #41 balrog
  • #43 In this method we do not use a neural network in the true sense. That is, we are not training a network to do anything. We are simply taking advantage of backpropagation to minimize two defined loss values. You may wonder why we use the outputs of intermediate layers of a pretrained image classification network to compute our style and content losses. This is because, for a network to be able to classify images, it has to understand the image. So, between taking the image as input and outputting its guess at what it is, it performs transformations that turn the image pixels into an internal understanding of the image content. As an example: if we pass two images of cats through an image classification network, even if the initial images look very different, after passing through many internal layers their representations will be very close in raw value. This is the content loss: pass both the pastiche image and the content image through some layers of an image classification network and find the Euclidean distance between the intermediate representations of those images. We can interpret these internal understandings as intermediate semantic representations of the initial image and use those representations to "compare" the content of two images. We can do this with an optimization algorithm. Q: An optimization algorithm? A: It is a way to minimize (or maximize) a function. Since we have a total loss function that depends on the combined image, the optimization algorithm will tell us how to change the combined image to make the loss a little smaller.
Q: What optimization algorithms are there? A: First-order methods minimize or maximize the function using its gradient. The most widely used first-order method is gradient descent and its variants; these are the methods normally used, and backpropagation can be applied on top of them. Second-order methods rely on the second derivative (the Hessian), or an approximation of it, to minimize or maximize the function. The problem is that computing the Hessian has quadratic complexity (for a function of N variables, N² entries). Since the second derivative is costly to compute, the second-order method used here, L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno), works with a numerical approximation of the Hessian instead of computing it exactly. For the content image, we would extract the hidden-layer activations of layer (l). If hidden layer (l) is a very early one, for instance hidden layer one, it will force the generated image to have pixel values very similar to those of the content image. Whereas if you use a very deep layer, it only asks: "Well, if there is a dog in your content image, then make sure there is a dog somewhere in your generated image." So in practice layer (l) is chosen somewhere in between.
The steps behind the optimization are the following: (1) synthesize a white noise image, our_image; (2) extract the content and the style of our_image; (3) calculate the distance between the content of our_image and the content of content_image; (4) calculate the distance between the style of our_image and the style of style_image; (5) calculate the cost function and the gradient; (6) if the gradient is zero, end the optimization; (7) if the gradient is not zero, run one more iteration of the optimization, which generates a new our_image that is closer to content_image content-wise and closer to style_image style-wise; (8) if the preset number of iterations is reached, finish; otherwise, go back to step 2.
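    The loop described in these steps can be sketched with a deliberately simplified stand-in. Assumptions: the "content" and "style" representations are the raw pixel values themselves (the real method uses intermediate activations of a pretrained network and Gram matrices), and plain gradient descent replaces L-BFGS. All names and default values are illustrative.

```python
import random

def total_loss_and_grad(image, content, style, alpha, beta):
    """Stand-in cost: weighted squared distances to the content and style targets."""
    loss, grad = 0.0, []
    for x, c, s in zip(image, content, style):
        loss += alpha * (x - c) ** 2 + beta * (x - s) ** 2
        grad.append(2 * alpha * (x - c) + 2 * beta * (x - s))
    return loss, grad

def optimize(content, style, alpha=1.0, beta=1.0, lr=0.1,
             max_iters=500, tol=1e-8, seed=0):
    rng = random.Random(seed)
    image = [rng.random() for _ in content]        # step 1: white noise image
    for _ in range(max_iters):                     # step 8: preset iteration budget
        # Steps 2-5: distances, cost and gradient for the current image.
        _, grad = total_loss_and_grad(image, content, style, alpha, beta)
        if max(abs(g) for g in grad) < tol:        # step 6: gradient (nearly) zero
            break
        # Step 7: one more iteration moves the image towards both targets.
        image = [x - lr * g for x, g in zip(image, grad)]
    return image

result = optimize(content=[0.0, 1.0], style=[1.0, 1.0])
```

    With equal weights the toy image settles halfway between the two targets; changing `alpha` and `beta` shifts the balance between content and style, which is the same trade-off the real method exposes.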
  • #45 The Gram matrix approach is excellent for artistic purposes, but it struggles when we want some control over which textures are transferred to each part of the result. The reason is that it destroys the semantics of the style image, preserving only the basic components of the texture. So we cannot, for example, transfer the style of the nose, because the texture estimator has destroyed the hair information. Improvements → Markov random fields
  • #46 The Gram matrix approach is excellent for artistic purposes, but it struggles when we want some control over which textures are transferred to each part of the result. The reason is that it destroys the semantics of the style image, preserving only the basic components of the texture. So we cannot, for example, transfer the style of the nose, because the texture estimator has destroyed the hair information. Improvements → Markov random fields
  • #48 The Gram matrix is essentially a matrix of dot products between the feature-activation vectors of a style layer.
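    A small sketch of that computation, assuming the activations of a style layer have been flattened so that each feature map (channel) is a list of N spatial activations. The function name and the example values are illustrative.

```python
def gram_matrix(features):
    """features: C feature maps, each flattened to a list of N activations.
    Returns the C x C matrix of dot products between feature maps."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

# Two channels with three spatial positions each (made-up activations).
features = [[1, 0, 2],
            [0, 1, 1]]
gram = gram_matrix(features)

# The same activations with their spatial positions shuffled consistently.
shuffled = [[2, 1, 0],
            [1, 0, 1]]
gram_shuffled = gram_matrix(shuffled)
```

    Both calls produce the same matrix: the dot products sum over positions, so the Gram matrix records which features co-occur but not where, which is why it captures texture while discarding spatial layout.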
  • #49 Photo of Salamanca
  • #57 These neural networks are learning what the visual world looks like! These models usually have only about 100 million parameters, so a network trained on ImageNet has to (lossily) compress 200GB of pixel data into 100MB of weights. This incentivizes it to discover the most salient features of the data: for example, it will likely learn that pixels nearby are likely to have the same color, or that the world is made up of horizontal or vertical edges, or blobs of different colors. Eventually, the model may discover many more complex regularities: that there are certain types of backgrounds, objects, textures, that they occur in certain likely arrangements, or that they transform in certain ways over time in videos, etc.
  • #66 ImageNet has run a challenge every year since 2010. The chart shows how the results have improved since then.
  • #77 Photo of Salamanca
  • #78 Photo of Salamanca