
Artificial Intelligence on Data Centric Platform


Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it? This training includes a demonstration on a distributed data-centric platform that provides a data intelligence layer, composed of artificial intelligence models able to make use of a whole company’s data.

Nowadays, one of the most innovative techniques in the realm of artificial intelligence is Deep Neural Nets. Among the many applications, language modelling, machine translation and image generation are receiving particular attention. Deep nets are also powerful in predictive modelling domains such as stock pricing and the energy industry. We will address a few case studies modeled with TensorFlow, running on Stratio’s data-centric product in a distributed cluster.
By: Fernando Velasco


Artificial Intelligence on Data Centric Platform

  1. 1. By: Fernando Velasco Training Big Data Spain http://xlic.es/v/7208D8
  2. 2. © Stratio 2016. Confidential, All Rights Reserved. 2 A man needs three names
  3. 3. © Stratio 2016. Confidential, All Rights Reserved. 3 ● Mathematician A man needs three names
  4. 4. © Stratio 2016. Confidential, All Rights Reserved. 4 ● Data Scientist ● Mathematician A man needs three names
  5. 5. © Stratio 2016. Confidential, All Rights Reserved. 5 ● Data Scientist ● Mathematician ● Stratian A man needs three names
  6. 6. © Stratio 2016. Confidential, All Rights Reserved. 6 ● Data Scientist ● Mathematician ● Stratian fvelasco@stratio.com A man needs three names
  7. 7. 1 2 3 4 5 © Stratio 2016. Confidential, All Rights Reserved. INDEX Introduction Data Centric Environment ● Distributed TensorFlow example. Keras Neural Nets ● BackPropagation Recurrent Neural Networks ● LSTM Autoencoders ● Data Augmentation ● VAE
  8. 8. 1 2 3 4 5 © Stratio 2016. Confidential, All Rights Reserved. INDEX Introduction Data Centric Environment ● Distributed TensorFlow example. Keras Neural Nets ● BackPropagation Recurrent Neural Networks ● LSTM Autoencoders ● Data Augmentation ● VAE
  9. 9. © Stratio 2016. Confidential, All Rights Reserved. Who are we? Where do we come from? Where are we going? 9
  10. 10. © Stratio 2016. Confidential, All Rights Reserved. Who are we? Where do we come from? Where are we going? 10
  11. 11. © Stratio 2016. Confidential, All Rights Reserved. 11
  12. 12. © Stratio 2016. Confidential, All Rights Reserved. 12
  13. 13. © Stratio 2016. Confidential, All Rights Reserved. 13
  14. 14. Technical Environment
  15. 15. © Stratio 2016. Confidential, All Rights Reserved. Data Centricity 15 Mobile APP Campaign Management E-commerce Digital Marketing Legacy Application Call center SAP: ERP ATG TPV APP CRM
  16. 16. © Stratio 2016. Confidential, All Rights Reserved. Data Centricity 16 DATA Mobile APP Campaign Management E-commerce Digital Marketing Legacy Application Call center SAP: ERP ATG TPV APP CRM Data Intelligence API DaaS
  17. 17. © Stratio 2016. Confidential, All Rights Reserved. Data Centricity 17 DATA Mobile APP Campaign Management E-commerce Digital Marketing Legacy Application Call center SAP: ERP ATG TPV APP CRM Data Intelligence API DaaS ● Unique data at the center, surrounded by applications that use it in real time, gaining maximum data intelligence ● In order to allow simultaneous updates, consistency is eventual ● Applications use the microservices in the DaaS layer to access the data ● The Data Intelligence layer gives applications access to data intelligence ● Applications are developed through microservices orchestration
  18. 18. Environment Summary: a multiuser environment manages users and provisions notebooks; each user has a front-end and a back-end running user code inside the analytic environment
  19. 19. © Stratio 2016. Confidential, All Rights Reserved. tf.motivation 19 ● Growing Community: One of the main reasons to use TensorFlow is the huge community behind it. TensorFlow is widely known and used. ● Great Technical Capabilities: - Multi-GPU support - Distributed training - Queues for operations like data loading and preprocessing on the graph. - Graph visualization using TensorBoard. - Model checkpointing. - High Performance and GPU memory usage optimization ● High-quality metaframeworks: Keras because of TensorFlow and perhaps also TensorFlow because of Keras. Once again both lead the list of Deep Learning libraries. Continuous release schedule and maintenance. New features and tests are integrated first so that early adopters can try them before documentation. This is great for such a big community and allows the framework to keep improving.
  20. 20. Distribution strategies: Data vs. Model Parallelism When splitting the training of a neural network across multiple compute nodes, two strategies are commonly employed: ● Data parallelism: individual instances of the model are created on each node and fed different training samples; this allows for higher training throughput. ● Model parallelism: a single instance of the model is split across multiple nodes, allowing larger models, ones which may not necessarily fit in the memory of a single node, to be trained. ● Mixed: if desired, these two strategies can also be composed, resulting in multiple instances of a given model, with each instance spanning multiple nodes.
  21. 21. Distributed Computation Synchrony There are many ways to specify a distributed structure in TensorFlow. Possible approaches include: Asynchronous training: in this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both in-graph and between-graph replication. Synchronous training: in this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging) and between-graph replication.
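To make the between-graph, data-parallel setup above concrete, here is a minimal sketch using the classic TensorFlow 1.x distributed API (tf.train.ClusterSpec, tf.train.Server, replica_device_setter). Hostnames, ports, the toy linear model and the task index are illustrative placeholders, not taken from the deck; this is the asynchronous flavour, with no gradient averaging.

```python
import tensorflow as tf  # TF 1.x-style distributed API (tf.compat.v1 in TF 2.x)

# Hypothetical cluster layout: one parameter server and two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# In practice each worker process runs this script with its own job name/index.
task_index = 0
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# Between-graph replication: every worker builds the same graph,
# while variables are automatically placed on the parameter server.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.zeros([10, 1]))
    b = tf.Variable(tf.zeros([1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# MonitoredTrainingSession handles chief election, checkpoints and recovery.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0)) as sess:
    pass  # sess.run(train_op, feed_dict=...) would go in the real training loop
```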
  22. 22. © Stratio 2016. Confidential, All Rights Reserved. tf.motivation 22 ● Growing Community: One of the main reasons to use TensorFlow is the huge community behind it. TensorFlow is widely known and used. ● Great Technical Capabilities: - Multi-GPU support - Distributed training - Queues for operations like data loading and preprocessing on the graph. - Graph visualization using TensorBoard. - Model checkpointing. - High Performance and GPU memory usage optimization ● High-quality metaframeworks: Keras because of TensorFlow and perhaps also TensorFlow because of Keras. Once again both lead the list of Deep Learning libraries. Continuous release schedule and maintenance. New features and tests are integrated first so that early adopters can try them before documentation. This is great for such a big community and allows the framework to keep improving.
  23. 23. © Stratio 2016. Confidential, All Rights Reserved. Who are we? Where do we come from? Where are we going? 23 Stimulating the brain
  24. 24. © Stratio 2016. Confidential, All Rights Reserved. Let me introduce you to my friend Cajal. He knew something about neurons 24
  25. 25. © Stratio 2016. Confidential, All Rights Reserved. Let me introduce you to my friend Cajal. He knew something about neurons 25 dendrite
  26. 26. © Stratio 2016. Confidential, All Rights Reserved. Let me introduce you to my friend Cajal. He knew something about neurons 26 dendrite axon
  27. 27. © Stratio 2016. Confidential, All Rights Reserved. Let me introduce you to my friend Cajal. He knew something about neurons 27 dendrite axon synapses: impulse transmission
  28. 28. © Stratio 2016. Confidential, All Rights Reserved. Building the structures: how can we define a neuron? 28
  29. 29. © Stratio 2016. Confidential, All Rights Reserved. Layers, layers, layers 29 Activation Functions
  30. 30. © Stratio 2016. Confidential, All Rights Reserved. Layers, layers, layers 30 Activation Functions
  31. 31. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Basics 31 Input hidden hidden hidden Output
  32. 32. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Basics 32 Forward Propagation: get a result Input hidden hidden hidden Output
  33. 33. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Basics 33 Forward Propagation: get a result Input hidden hidden hidden Output Error Estimation: evaluate performances
  34. 34. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Basics 34 Forward Propagation: get a result Backward Propagation: who’s to blame? Input hidden hidden hidden Output Error Estimation: evaluate performances
  35. 35. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Basics 35 Forward Propagation: get a result Backward Propagation: who’s to blame? Input hidden hidden hidden Output Error Estimation: evaluate performances ● A cost function C is defined ● Every parameter has its impact on the cost given some training examples ● Impacts are computed in terms of derivatives ● Use the chain rule to propagate the error backwards
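A compact way to write the bullets above, for a fully connected net with pre-activations z_l = W_l a_{l-1} + b_l and activations a_l = σ(z_l) (the notation is mine, not from the slides):

```latex
\[
\delta_L = \nabla_{a_L} C \odot \sigma'(z_L), \qquad
\delta_l = \bigl(W_{l+1}^{\top}\,\delta_{l+1}\bigr) \odot \sigma'(z_l),
\]
\[
\frac{\partial C}{\partial W_l} = \delta_l\, a_{l-1}^{\top}, \qquad
\frac{\partial C}{\partial b_l} = \delta_l .
\]
```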
  36. 36. Activation Functions: Outputs ● Linear ● Binomial: sigmoid ● Multinomial: softmax
  37. 37. Sigmoid and Relu functions - Bounded - Probability-like function - Dense computation - Differentiable - On many examples of fully connected layers
  38. 38. Sigmoid and Relu functions - Bounded - Probability-like function - Dense computation - Differentiable - On many examples of fully connected layers We are too cool to speak about linear activators, aren’t we? Not entirely...
  39. 39. Sigmoid and Relu functions - Sparse activation - Efficient computation - “Differentiable” - Unbounded - Potential Dying Relu - Convolutional-friendly - Bounded - Probability-like function - Dense computation - Differentiable - On many examples of fully connected layers We are too cool to speak about linear activators, aren’t we? Not entirely...
  40. 40. Hyperbolic Tangent - Bounded - Positive/negative values - Dense computation - Differentiable - Well suited to LSTM-like architectures
  41. 41. Softmax - Represents probability on a categorical distribution - Multiclass normalization - Bounded - Differentiable - Used on final layers
  42. 42. Activation Functions: Outputs Differentiation is the key
  43. 43. On the ease of Derivations ● Sigmoid ● Hyperbolic Tangent ● ReLU ● Softmax
  44. 44. On the ease of Derivations ● Sigmoid ● Hyperbolic Tangent ● ReLU ● Softmax Hand-set value (e.g. the ReLU derivative at 0 is fixed by convention)
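The derivative formulas on this slide are images in the original deck; for reference, the standard forms (with s = softmax(z)) are:

```latex
\[
\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr), \qquad
\tanh'(x) = 1 - \tanh^2(x),
\]
\[
\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}
\quad (\text{value at } x = 0 \text{ set by convention}),
\]
\[
\frac{\partial s_i}{\partial z_j} = s_i\,\bigl(\delta_{ij} - s_j\bigr).
\]
```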
  45. 45. Activation Functions: Outputs ● Linear ● Binomial: sigmoid ● Multinomial: softmax Loss Functions
  46. 46. Regression error ● The most classic measure ● Heavily penalizes large errors ● Less interpretable ● Scale invariant ● Symmetric ● Interpretable ● Harder differentiability and convergence ● Harder differentiability and convergence ● Penalizes large errors less ● Interpretable
  47. 47. Regression error ● The most classic measure ● Heavily penalizes large errors ● Less interpretable ● Scale invariant ● Symmetric ● Interpretable ● Harder differentiability and convergence ● Harder differentiability and convergence ● Penalizes large errors less ● Interpretable The choice is always problem-dependent
  48. 48. Cost Functions ● Regression: ● Classification: The shortest way is not always the best one
  49. 49. Classification and Categorical Cross-Entropy ● Categorical Cross-Entropy, where indexes i and j stand for each example and each class respectively, the y’s are the true labels and the p’s their assigned probabilities. On two classes it turns into the easy-to-understand, most common binary cross-entropy. Compared to accuracy, Cross-Entropy is a more granular way to measure error, as it takes into account the closeness of each prediction. Its derivative also eases the calculus compared with RMSE
  50. 50. Classification and Categorical Cross-Entropy ● Categorical Cross-Entropy, where indexes i and j stand for each example and each class respectively, the y’s are the true labels and the p’s their assigned probabilities. On two classes it turns into the easy-to-understand, most common binary cross-entropy. Compared to accuracy, Cross-Entropy is a more granular way to measure error, as it takes into account the closeness of each prediction. Its derivative also eases the calculus compared with RMSE Classifier 1 Classifier 2
  51. 51. Classification and Categorical Cross-Entropy ● Categorical Cross-Entropy, where indexes i and j stand for each example and each class respectively, the y’s are the true labels and the p’s their assigned probabilities. On two classes it turns into the easy-to-understand, most common binary cross-entropy. Compared to accuracy, Cross-Entropy is a more granular way to measure error, as it takes into account the closeness of each prediction. Its derivative also eases the calculus compared with RMSE Classifier 1 Classifier 2
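The cross-entropy formulas referenced above also appear as images in the deck; written out, with i indexing the N examples and j the C classes, they are:

```latex
\[
\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log p_{ij},
\]
\[
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\bigr].
\]
```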
  52. 52. Regularization
  53. 53. Regularization: Norm penalties ● Add a penalty to the loss function: ● L2: ○ Keep weights near zero. ○ Simplest one, differentiable. ● L1: ○ Sparse results, feature selection. ○ Not differentiable, slower.
  54. 54. Regularization: Dropout ● Randomly drop neurons (along with their connections) during training. ● Acts like adding noise. ● Very effective, computationally inexpensive. ● Ensemble of all sub-networks generated.
  55. 55. Regularization: Dropout ● Randomly drop neurons (along with their connections) during training. ● Acts like adding noise. ● Very effective, computationally inexpensive. ● Ensemble of all sub-networks generated.
  56. 56. Regularization: Dropout ● Randomly drop neurons (along with their connections) during training. ● Acts like adding noise. ● Very effective, computationally inexpensive. ● Ensemble of all sub-networks generated.
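A minimal Keras sketch combining the two regularizers from the slides above, an L2 weight penalty and dropout. The layer sizes, input shape and rates are illustrative choices, not values from the deck:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    # L2 penalty keeps weights near zero; l1() would instead push them to exact zeros.
    layers.Dense(128, activation="relu", input_shape=(20,),
                 kernel_regularizer=regularizers.l2(1e-4)),
    # Dropout randomly zeroes 50% of the activations, at training time only.
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```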
  57. 57. Optimization
  58. 58. Optimization: Challenges ● The difficulty in training neural networks is mainly attributed to their optimization part. ● The number of plateaus, saddle points and local minima grows exponentially with the dimension ● Classical convex optimization algorithms don’t perform well.
  59. 59. Optimization: Batch Gradient descent ● Goes over the whole training set. ● Very expensive. ● There isn’t an easy way to incorporate new data into the training set.
  60. 60. Optimization: Mini-Batch Gradient descent ● Stochastic Gradient Descent (SGD) ● Randomly sample a small number of examples (minibatch) ● Estimate cost function and gradient: ● Batch size: Length of the minibatch ● Iteration: Every time we update the weights ● Epoch: One pass over the whole training set. ● k = 1 => online learning ● Small batches => regularization
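In Keras terms, the vocabulary above maps directly onto the arguments of fit(). A small self-contained sketch with synthetic data (all numbers are illustrative):

```python
import numpy as np
from tensorflow import keras

x_train = np.random.rand(10000, 20)   # 10,000 synthetic examples, 20 features
y_train = np.random.rand(10000, 1)

model = keras.Sequential([keras.layers.Dense(1, input_shape=(20,))])
model.compile(optimizer="sgd", loss="mse")

# batch_size=32 -> ceil(10000 / 32) = 313 weight updates (iterations) per epoch;
# epochs=10 -> ten full passes over the training set.
# batch_size=1 would be online learning; small batches also act as a mild regularizer.
model.fit(x_train, y_train, batch_size=32, epochs=10)
```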
  61. 61. Optimization: Variants ● Momentum: the momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. ● AdaGrad: the learning rate is adapted component-wise, dividing by the square root of the sum of squares of the historical gradients. ● RMSProp: modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average ● ADAM (Adaptive Moments): combination of RMSProp and momentum.
  62. 62. Momentum basics: the real movement combines the negative of the gradient with the accumulated momentum
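The arrows in the slide's figure correspond to the classic momentum update; with learning rate ε and momentum coefficient α:

```latex
\[
v_{t} = \alpha\, v_{t-1} - \varepsilon\, \nabla_{\theta} C(\theta_{t-1}), \qquad
\theta_{t} = \theta_{t-1} + v_{t}.
\]
```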
  63. 63. © Stratio 2016. Confidential, All Rights Reserved. A fistful of cool applications 63 Not all that wander are lost
  64. 64. © Stratio 2016. Confidential, All Rights Reserved. A fistful of cool applications 64 Not all that wander are lost Object Classification and Detection
  65. 65. © Stratio 2016. Confidential, All Rights Reserved. A fistful of cool applications 65 Not all that wander are lost Object Classification and Detection RBM on Recommender Systems
  66. 66. © Stratio 2016. Confidential, All Rights Reserved. A fistful of cool applications 66 Not all that wander are lost Object Classification and Detection Instant Visual translation RBM on Recommender Systems
  67. 67. © Stratio 2016. Confidential, All Rights Reserved. A fistful of cool applications 67 Not all that wander are lost Object Classification and Detection Instant Visual translation RBM on Recommender Systems Generative Models (GAN/VAE)
  68. 68. Keras Introducing Keras
  69. 69. © Stratio 2016. Confidential, All Rights Reserved. tf.motivation 69 ● Growing Community: One of the main reasons to use TensorFlow is the huge community behind it. TensorFlow is widely known and used. ● Great Technical Capabilities: - Multi-GPU support - Distributed training - Queues for operations like data loading and preprocessing on the graph. - Graph visualization using TensorBoard. - Model checkpointing. - High Performance and GPU memory usage optimization ● High-quality metaframeworks: Keras because of TensorFlow and perhaps also TensorFlow because of Keras. Once again both lead the list of Deep Learning libraries. Continuous release schedule and maintenance. New features and tests are integrated first so that early adopters can try them before documentation. This is great for such a big community and allows the framework to keep improving.
  70. 70. Keras Introducing Keras
  71. 71. Welcome to the jungle! ● Me Tarzan, you Cheetah. Human-friendly interface. User actions are minimized in order to ease the process, isolating users from the backend. ● Territorial behaviors are allowed. Several backends can be used: TensorFlow, CNTK and Theano (poor Theano!), but there is also another interesting property on modularization: every model is a sequence of standalone modules plugged together with as few restrictions as possible, allowing us to fully configure cost functions, optimizers, initializations, activation functions ... ● Keeps your model herd a-growin’. New modules are simple to add, and existing modules provide ample examples. ● Kaa is our friend. We love Python! It makes the lives of data scientists easier. The code is compact, easier to debug, and allows for ease of extensibility.
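A minimal sketch of the "standalone modules plugged together" idea using the tf.keras API, explicitly choosing initialization, activation, loss and optimizer (all specific values are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Each layer is a standalone module: activation and initializer are plugged in.
    layers.Dense(64, activation="relu", kernel_initializer="glorot_uniform",
                 input_shape=(10,)),
    layers.Dense(3, activation="softmax"),
])
# Loss and optimizer are configured just as freely.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```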
  72. 72. © Stratio 2016. Confidential, All Rights Reserved. Ever felt lost in Automatic Translation? 72
  73. 73. Not all that wander are lost What do we say to those who think machine translation sucks? Not today!
  74. 74. © Stratio 2016. Confidential, All Rights Reserved. Neural Machine Translation Idea 74 Not all that wander are lost Encoder: words => hidden state Decoder: hidden state => words Hidden states are not entirely universal languages!!
  75. 75. © Stratio 2016. Confidential, All Rights Reserved. Attention Basics 75 ● Not every word is a one-to-one translation ● Weighting the whole input sequence increases computation time ● Some other, more human, approaches can be taken (e.g. reinforcement learning)
  76. 76. © Stratio 2016. Confidential, All Rights Reserved. 76 Sequential Data
  77. 77. © Stratio 2016. Confidential, All Rights Reserved. Sequence Statement 77 ● Most machine learning algorithms are designed for independent, unordered data. ● Many real problems use sequential data: ○ Time series, behavior, audio signals… ○ t does not have to be time; it can be a spatial measure (images) or any order measure (recommender systems) ● Sequences are a natural way of representing reality: vision, hearing, action-reaction, words, sentences, etc. ● Don’t forget: order matters!!
  78. 78. © Stratio 2016. Confidential, All Rights Reserved. Introducing Recurrent Neural Networks 78 ● Neural Networks with recurrent connections, specialized in processing sequential data. ● Recurrent connections allow a ‘memory’ of previous inputs. ● Can scale to long sequences (variable length), which is not practical for other types of nets. ● Same parameters for every timestep (t) => generalize RNN images by Christopher Olah
  79. 79. © Stratio 2016. Confidential, All Rights Reserved. Recurrent Neural Networks Architecture 79 0 1 2 t Looping the loop: Backpropagation Through Time ● Same idea as in standard backpropagation, but the recurrent net needs to be unfolded through time for a certain number of timesteps. ● The weight changes calculated for each network copy are summed before individual weights are adapted. ● The set of weights for each copy (time step) always remains the same.
  80. 80. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Through Time 80 0 1 2 t ● Cost function: where each Li stands for the usual cost on one timestep (e.g. MSE on regression, etc) ● Network parameters depend on the parameters of the previous timestep, and so do their derivatives during backprop. ● Chain rule application leads to a long product of derivatives.
  81. 81. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Through Time 81 0 1 2 t ● Cost function: where each Li stands for the usual cost on one timestep (e.g. MSE on regression, etc) ● Network parameters depend on the parameters of the previous timestep, and so do their derivatives during backprop. ● Chain rule application leads to a long product of derivatives.
  82. 82. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Through Time 82 0 1 2 t ● Cost function: where each Li stands for the usual cost on one timestep (e.g. MSE on regression, etc) ● Network parameters depend on the parameters of the previous timestep, and so do their derivatives during backprop. ● Chain rule application leads to a long product of derivatives. BackProp
  83. 83. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Through Time 83 0 1 2 t ● Cost function: where each Li stands for the usual cost on one timestep (e.g. MSE on regression, etc) ● Network parameters depend on the parameters of the previous timestep, and so do their derivatives during backprop. ● Chain rule application leads to a long product of derivatives. BackProp
  84. 84. © Stratio 2016. Confidential, All Rights Reserved. BackPropagation Through Time 84 0 1 2 t ● Cost function: where each Li stands for the usual cost on one timestep (e.g. MSE on regression, etc) ● Network parameters depend on the parameters of the previous timestep, and so do their derivatives during backprop. ● Chain rule application leads to a long product of derivatives. BackProp
  85. 85. © Stratio 2016. Confidential, All Rights Reserved. 85 Beware of the Vanishing Gradient!!
  86. 86. © Stratio 2016. Confidential, All Rights Reserved. Gradients in time 86 ● Backpropagating the error in time involves as many recurrent derivative terms as timesteps in the net. ● It can be problematic if the values of the matrix W are too large or too small. ● Thus, the very first terms would have no influence on the result, as there is no memory related to them
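Spelling out the product behind the vanishing/exploding gradient, assuming a vanilla RNN cell h_i = σ(W h_{i-1} + U x_i + b) (notation mine, not the deck's): the gradient of the loss at timestep t with respect to an earlier hidden state h_k contains a product of Jacobians, so repeated factors with norms below one make it vanish, and factors above one make it explode.

```latex
\[
\frac{\partial L_t}{\partial h_k}
  = \frac{\partial L_t}{\partial h_t}
    \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}}
  = \mathrm{diag}\!\bigl(\sigma'(z_i)\bigr)\, W .
\]
```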
  87. 87. Short-term modulation Long-term modulation
  88. 88. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (Sepp Hochreiter and Jürgen Schmidhuber, 1997) 88 ● Timesteps are still the key ● From now on, we are going to have two connections (states) ● Each timestep receives an input and up to three outputs; two of them are states: long and short ● The third output (if present) is similar to the classic output LSTM images also by Christopher Olah
  89. 89. Analytical Index
  90. 90. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (II) 90 ● Each timestep may have one or more units ● Each state corresponds to each kind of memory at play: Long and Short ● Inside each cell, there are four questions asked: ○ Which part of the Long memory has to be deleted? ○ From the new info, is there anything interesting to be remembered? ○ If there is, How do we combine it along with the Long memory? ○ What is the Short term impression for this step?
  91. 91. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (II) 91 ● Each timestep may have one or more units ● Each state corresponds to each kind of memory at play: Long and Short ● Inside each cell, there are four questions asked: ○ Which part of the Long memory has to be deleted? ○ From the new info, is there anything interesting to be remembered? ○ If there is, How do we combine it along with the Long memory? ○ What is the Short term impression for this step? forget gate f
  92. 92. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (II) 92 ● Each timestep may have one or more units ● Each state corresponds to each kind of memory at play: Long and Short ● Inside each cell, there are four questions asked: ○ Which part of the Long memory has to be deleted? ○ From the new info, is there anything interesting to be remembered? ○ If there is, How do we combine it along with the Long memory? ○ What is the Short term impression for this step? input gate forget gate if
  93. 93. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (II) 93 ● Each timestep may have one or more units ● Each state corresponds to each kind of memory at play: Long and Short ● Inside each cell, there are four questions asked: ○ Which part of the Long memory has to be deleted? ○ From the new info, is there anything interesting to be remembered? ○ If there is, How do we combine it along with the Long memory? ○ What is the Short term impression for this step? input gate forget gate candidate gate cif
  94. 94. © Stratio 2016. Confidential, All Rights Reserved. LSTM Briefing (II) 94 ● Each timestep may have one or more units ● Each state corresponds to each kind of memory at play: Long and Short ● Inside each cell, there are four questions asked: ○ Which part of the Long memory has to be deleted? ○ From the new info, is there anything interesting to be remembered? ○ If there is, How do we combine it along with the Long memory? ○ What is the Short term impression for this step? input gate forget gate candidate gate output gate ocif
  95. 95. © Stratio 2016. Confidential, All Rights Reserved. 95 Focusing on the forget gate, the question is answered as follows: where h is the activation, b the associated bias and W the weight matrix of the forget gate. Or, in a more explicit way: where Wfx is the input weight matrix (the classic one) and Whh is the hidden-state matrix between timesteps. In a similar way, one can express the input and output equations, obtaining the analogous hi and ho. The candidate gate, however, differs, mainly in its activation function: the hyperbolic tangent. In the same notation:
  96. 96. © Stratio 2016. Confidential, All Rights Reserved. 96 Focusing on the forget gate, the question is answered as follows: where h is the activation, b the associated bias and W the weight matrix of the forget gate. Or, in a more explicit way: where Wfx is the input weight matrix (the classic one) and Whh is the hidden-state matrix between timesteps. In a similar way, one can express the input and output equations, obtaining the analogous hi and ho. The candidate gate, however, differs, mainly in its activation function: the hyperbolic tangent. In the same notation: tanh takes values in a [-1, 1] range, so we are able to both add to and subtract from the long-term memory
  97. 97. © Stratio 2016. Confidential, All Rights Reserved. Inside a LSTM Cell (II) 97 And finally, we can update the states, including the output. This way: or, in simpler words, we forget what is to be forgotten and we add what is to be added. At the very end, with the same tanh idea, we put the short and long terms together:
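Putting the gate equations described on the last three slides in one place, in the usual notation (σ is the sigmoid, ⊙ the element-wise product; the weight names are the conventional ones, not the deck's):

```latex
\[
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),
\]
\[
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),
\]
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t).
\]
```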
  98. 98. So, LSTM nets are... that easy?
  99. 99. © Stratio 2016. Confidential, All Rights Reserved. Cool Applications 99 Not all that wander are lost CNN + LSTM to describe pictures Film scripts. Yes, it’s for real
  100. 100. © Stratio 2016. Confidential, All Rights Reserved. 100 The man who creates the network should write the code Demo Time!!
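The demo itself ran on Stratio's platform; as a stand-in, here is a minimal sketch of a Keras LSTM regressor of the kind such a demo might build. The data is synthetic, and every shape and hyperparameter is an illustrative placeholder:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, features = 30, 1
x = np.random.rand(1000, timesteps, features)   # synthetic input sequences
y = np.random.rand(1000, 1)                     # synthetic next-value targets

model = keras.Sequential([
    layers.LSTM(32, input_shape=(timesteps, features)),  # 32 LSTM units
    layers.Dense(1),                                      # regression head
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, batch_size=64, epochs=5, validation_split=0.1)
```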
  101. 101. © Stratio 2016. Confidential, All Rights Reserved. 101
  102. 102. © Stratio 2016. Confidential, All Rights Reserved. 102 AutoEncoders
  103. 103. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 103 Input hidden hidden hidden Output ● Supervised neural networks try to predict labels from input data ● It is not always possible to obtain labels ● Unsupervised learning can help discover the data structure ● What if we make the output the input itself?
  104. 104. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 104 This is not the Generative Model you are looking for Input image
  105. 105. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 105 This is not the Generative Model you are looking for Input image
  106. 106. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 106 This is not the Generative Model you are looking for Input image
  107. 107. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 107 This is not the Generative Model you are looking for Input image
  108. 108. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 108 This is not the Generative Model you are looking for Input image Output image
  109. 109. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Idea) 109 This is not the Generative Model you are looking for Input image Output image It tries to predict x from x, but no labels are needed. The idea is learning an approximation of the identity function. Along the way, some restrictions are placed: typically the hidden layers compress the data. The original input is represented at the output, even if it comes from noisy or corrupted data.
  110. 110. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Encoder and decoder) 110 This is not the Generative Model you are looking for Input image Output image
  111. 111. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Encoder and decoder) 111 This is not the Generative Model you are looking for Input image Output image Encode Decode
  112. 112. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders (Encoder and decoder) 112 This is not the Generative Model you are looking for Input image Output image The latent space is commonly a narrow hidden layer between encoder and decoder It learns the data structure Encoder and decoder can share the same (inverted) structure or be different. Each one can have its own depth (number of layers) and complexity. Encode Decode Latent Space
  113. 113. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders BackPropagation 113 This is not the Generative Model you are looking for Input image Output image Encode Decode Latent Space
  114. 114. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders BackPropagation 114 This is not the Generative Model you are looking for Input image Output image A cost function can be defined taking into account the differences between the input and Decoded(Encoded(Input)) This allows BackProp to be carried out through encoder and decoder To prevent the composed function from being the identity, some regularizations can be applied One of the most common is simply reducing the latent space dimension (i.e. compressing the data in the encoding) Encode Decode Latent Space BackPropagation
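A minimal fully connected autoencoder in Keras following the encode / latent space / decode picture above; the dimensions are illustrative (e.g. 784 for flattened 28x28 images) and not taken from the deck:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32          # e.g. flattened 28x28 images

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(128, activation="relu")(inputs)
latent = layers.Dense(latent_dim, activation="relu")(encoded)   # narrow latent space
decoded = layers.Dense(128, activation="relu")(latent)
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
# The target is the input itself: no labels required.
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```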
  115. 115. © Stratio 2016. Confidential, All Rights Reserved. Autoencoders Applications 115 Reduction of dimensionality Data Structure/Feature learning Denoising or data cleaning Pre-training deep networks
  116. 116. © Stratio 2016. Confidential, All Rights Reserved. Data Augmentation 116
  117. 117. © Stratio 2016. Confidential, All Rights Reserved. Data Augmentation 117
  118. 118. © Stratio 2016. Confidential, All Rights Reserved. Data Augmentation 118 ● Specialized image and video classification tasks often have insufficient data. ● Traditional transformations consist of using a combination of affine transformations to manipulate the training data ● Data augmentation has been shown to produce promising ways to increase the accuracy of classification tasks. ● While traditional augmentation is very effective alone, other techniques enabled by generative models have proved to be even better
  119. 119. © Stratio 2016. Confidential, All Rights Reserved. Generative Models (Idea) 119 Generative Models “What I cannot create, I do not understand.” —Richard Feynman
  120. 120. © Stratio 2016. Confidential, All Rights Reserved. Generative Models (Idea) 120 ● They model how the data was generated in order to categorize a signal. ● Instead of modeling P(y|x) as the usual discriminative models do, the distribution under the hood is P(x, y) ● The number of parameters is significantly smaller than the amount of data on which they are trained. ● This forces the models to discover the data essence ● What the model does is understand the world around the data and provide good representations of it
  121. 121. © Stratio 2016. Confidential, All Rights Reserved. Generative Models Applications 121 ● Generate potentially unfeasible examples for Reinforcement Learning ● Denoising/Pretraining ● Structured prediction exploration in RL ● Entirely plausible generation of images to depict image/video ● Feature understanding
  122. 122. © Stratio 2016. Confidential, All Rights Reserved. Generative Models Applications 122 ● Generate potentially unfeasible examples for Reinforcement Learning ● Denoising/Pretraining ● Structured prediction exploration in RL ● Entirely plausible generation of images to depict image/video ● Feature understanding
  123. 123. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder Idea (I) 123 Input image Output image Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network
  124. 124. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder Idea (II) 124 Input image Output image Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network
  125. 125. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder Idea (II) 125 Output image Latent Space Mean Vector Standard Deviation Vector Decoder Network
  126. 126. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder Idea (II) 126 Latent Space Mean Vector Standard Deviation Vector Decoder Network
  127. 127. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder Idea (II) 127 Output image Latent Space Mean Vector Standard Deviation Vector Decoder Network Sample on Latent Space => Generate new representations Prior distribution
  128. 128. Keras Introducing Keras Demogorgon smile generation is beyond the state of the art
  129. 129. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (I) 129 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network
  130. 130. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (II): VAE Loss function 130 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network ● Encoder and decoder can be denoted as conditional probability representations of data:
  131. 131. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (II): VAE Loss function 131 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network ● Encoder and decoder can be denoted as conditional probability representations of data: ● Typically the encoder reduces dimensionality while the decoder increases it. So, when reconstructing the inputs, some information is lost. This information loss can be measured using the reconstruction log-likelihood:
  132. 132. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (II): VAE Loss function 132 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network ● Encoder and decoder can be denoted as conditional probability representations of data: ● Typically the encoder reduces dimensionality while the decoder increases it. So, when reconstructing the inputs, some information is lost. This information loss can be measured using the reconstruction log-likelihood: ● In order to keep the latent space distribution under control, we can introduce a regularizer into the loss function: the Kullback-Leibler divergence between the encoder distribution and a given, known distribution, such as the standard Gaussian:
  133. 133. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (II): VAE Loss function 133 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network ● Encoder and decoder can be denoted as conditional probability representations of data: ● Typically the encoder reduces dimensionality while the decoder increases it. So, when reconstructing the inputs, some information is lost. This information loss can be measured using the reconstruction log-likelihood: ● In order to keep the latent space distribution under control, we can introduce a regularizer into the loss function: the Kullback-Leibler divergence between the encoder distribution and a given, known distribution, such as the standard Gaussian: ● With this penalty in the loss, encoder outputs are forced to be sufficiently diverse: similar inputs will be kept close (smoothly) together in the latent space.
  134. 134. Reconstruction Loss + Distribution Divergence (K-L)
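Written out, the two terms sketched above are the (negative) reconstruction log-likelihood plus the KL regularizer, which has a closed form when the encoder is Gaussian and the prior is a standard Gaussian:

```latex
\[
\mathcal{L}(\theta,\phi;\,x) =
  -\,\mathbb{E}_{z\sim q_{\phi}(z\mid x)}\bigl[\log p_{\theta}(x\mid z)\bigr]
  + \mathrm{KL}\bigl(q_{\phi}(z\mid x)\,\|\,p(z)\bigr),
\]
\[
\mathrm{KL}\bigl(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,I)\bigr)
  = -\tfrac{1}{2}\sum_{j}\bigl(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr).
\]
```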
  135. 135. © Stratio 2016. Confidential, All Rights Reserved. Latent Space Distribution (III): Probability overview 135 Latent Space Mean Vector Standard Deviation Vector Encoder Network Decoder Network● The VAE contains a specific probability model of data x and latent variables z. ● We can write the joint probability of the model as p(x,z): “how likely is observation x under the joint distribution”. ● By definition, p(x, z)=p(x∣z)p(z) ● In order to generate the data, the process is as follows: For each datapoint i: - Draw latent variables zi∼p(z) - Draw datapoint xi∼p(x∣z) ● We need to figure out p(z) and p(x|z) ● The likelihood is the representation to be learnt from the decoder ● Encoder likelihood can be used to estimate parameters from the prior.
  136. 136. © Stratio 2016. Confidential, All Rights Reserved. Variational Autoencoder: BackProp + reparametrization trick 136 ● VAEs are built by using Backpropagation on the previously defined loss function. ● Mean and variance estimations don’t give us z, only its distribution parameters. ● In order to get z we could sample directly from the true posterior given the parameters, but sampling cannot be differentiated. ● Instead, a trick can be applied so that the non-differentiable part is left outside the network ● By stating z = μ + σ·ε, with ε ~ N(0, I), we can remove the sampling from the backprop part
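A minimal sketch of the reparameterization trick as a Keras layer, in the style of the well-known Keras VAE example; the input dimension, hidden size and latent_dim are illustrative placeholders:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2

class Sampling(layers.Layer):
    """z = mu + sigma * epsilon: epsilon is external noise, so gradients
    flow through z_mean and z_log_var while the sampling stays outside backprop."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

inputs = keras.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)       # mean vector
z_log_var = layers.Dense(latent_dim)(h)    # log-variance vector
z = Sampling()([z_mean, z_log_var])        # reparameterized latent sample
encoder = keras.Model(inputs, [z_mean, z_log_var, z])
```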
  137. 137. Not all that wander are lost Any Questions?
  138. 138. THANK YOU!
