
Deep recurrent neural networks for Sequence Learning in Spark


  1. www.thalesgroup.com
     Deep recurrent neural networks for Sequence Learning in Spark
     YVES MABIALA
  2. Outline
     ▌Thales & Big Data
     ▌On the difficulty of Sequence Learning
     ▌Deep Learning for Sequence Learning
     ▌Spark implementation of Deep Learning
     ▌Use cases
        - Predictive maintenance
        - NLP
  3. Thales & Big Data
     ▌Thales systems produce a huge quantity of data
        - Transportation systems (ticketing, supervision, …)
        - Security (radar traces, network logs, …)
        - Satellite (photos, videos, …)
     ▌which is often
        - Massive
        - Heterogeneous
        - Extremely dynamic
     ▌and where understanding the dynamics of the monitored phenomena is mandatory
        → Sequence Learning
  4. What is sequence learning?
     ▌Sequence learning refers to a set of ML tasks where a model has to deal with
       sequences as input, produce sequences as output, or both
     ▌Goal: understand the dynamics of a sequence in order to
        - Classify
        - Predict
        - Model
     ▌Typical applications
        - Text: classify texts (sentiment analysis), generate textual descriptions
          of images (image captioning)
        - Video: video classification
        - Speech: speech to text
  5. How is it typically handled?
     ▌Taking the dynamics into account is difficult
     ▌Often people do not bother
        - E.g. text analysis using bag of words (one-hot encoding); a problem for
          tasks such as sentiment classification, where the order of the words matters
     ▌Or they use popular statistical approaches
        - (Hidden) Markov models for prediction (and classification), which only
          capture short-term dependencies (order 1):
          $P(X_k = x_k \mid X_{k-1} = x_{k-1}, \dots, X_{k-n} = x_{k-n}) = P(X_k = x_k \mid X_{k-1} = x_{k-1})$
        - Autoregressive approaches for time series forecasting
     ▌Bag-of-words example (vocabulary: the, is, chair, red, young, cat, on, a;
       see the encoding sketch below):

                                  the  is  chair  red  young  cat  on  a
        The chair is red           1    1    1     1     0     0   0   0
        The cat is on a chair      1    1    1     0     0     1   1   1
        The cat is young           1    1    0     0     1     1   0   0
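To make the bag-of-words encoding concrete, here is a minimal Scala sketch that reproduces the table above (the vocabulary and sentences are the slide's; everything else is illustrative):

```scala
// Bag-of-words (one-hot presence) encoding over a fixed vocabulary.
val vocabulary = Seq("the", "is", "chair", "red", "young", "cat", "on", "a")

def bagOfWords(sentence: String): Seq[Int] = {
  val words = sentence.toLowerCase.split("\\s+").toSet
  vocabulary.map(w => if (words.contains(w)) 1 else 0)
}

Seq("The chair is red", "The cat is on a chair", "The cat is young")
  .foreach(s => println(f"$s%-25s -> ${bagOfWords(s).mkString(" ")}"))
// Word order is discarded, which is exactly the weakness noted above:
// "the cat chased the dog" and "the dog chased the cat" encode identically.
```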
  6. Link with artificial neural networks?
     ▌Artificial neural networks are statistical models inspired by the brain
        - They transform the input by applying (non-linear) functions at each layer
        - More layers means more capability (≥ 2 hidden layers: Deep Learning)
     ▌Set of transformation and activation operations
        - Affine: $Y = W^T X + b$; sigmoid activation: $Y = \frac{1}{1 + \exp(-X)}$;
          tanh activation: $Y = \tanh(X)$
        - Convolutional: apply a spatial convolution to the 1D/2D input (signal,
          image, …): $Y = \mathrm{conv}(X, W) + b$. Learns spatial features used
          for classification or prediction (mostly on images/videos)
        - Recurrent: learn dependencies between successive observations (features
          related to the dynamics)
     ▌Objective: find the best weights W that minimize the difference between the
       predicted output and the desired one (using the back-propagation algorithm)
     [Figure: feed-forward network with input, hidden layers, output]
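To make the layer operations concrete, here is a minimal sketch of a forward pass through one affine layer with a sigmoid (or tanh) activation. It assumes the Breeze linear algebra library; the deck does not name its actual backend:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.{sigmoid, tanh}

// One affine layer Y = W^T X + b.
// Shapes: W is (nIn x nOut), X is (nIn), b and Y are (nOut).
def affineForward(w: DenseMatrix[Double],
                  b: DenseVector[Double],
                  x: DenseVector[Double]): DenseVector[Double] =
  w.t * x + b

// Affine layer followed by an elementwise sigmoid activation.
def layerForward(w: DenseMatrix[Double],
                 b: DenseVector[Double],
                 x: DenseVector[Double]): DenseVector[Double] =
  sigmoid(affineForward(w, b, x))  // swap in tanh(...) for a tanh layer
```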
  7. Recurrent Neural Network basics
     ▌Artificial neural networks with one or more recurrent layers
        - Classical neural network: $Y_k = f(W^T X_k)$
        - Recurrent neural network: $Y_k = f(W^T X_k + H Y_{k-1})$, unrolled
          through time (see the sketch below)
     ▌Able to cope with varying-size sequences at the input and/or the output
        - One to many (fixed-size input, sequence output), e.g. image captioning
        - Many to many (sequence input to sequence output), e.g. speech to text
        - Many to one (sequence input to fixed-size output), e.g. text classification
     [Figure: classical vs. recurrent network, and the recurrent network unrolled
      through time over inputs X_{k-3} … X_k and outputs Y_{k-3} … Y_k]
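The recurrence $Y_k = f(W^T X_k + H Y_{k-1})$ can be unrolled as a simple fold over the sequence. A sketch (Breeze again assumed), returning only the final state, i.e. the many-to-one case:

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.tanh

// Simple recurrent layer: Y_k = f(W^T X_k + H Y_{k-1}), with f = tanh.
// Folds over the sequence and returns the last hidden state (many-to-one).
def rnnForward(w: DenseMatrix[Double],             // input weights (nIn x nHidden)
               h: DenseMatrix[Double],             // recurrent weights (nHidden x nHidden)
               xs: Seq[DenseVector[Double]]): DenseVector[Double] = {
  val y0 = DenseVector.zeros[Double](h.rows)       // initial state Y_0 = 0
  xs.foldLeft(y0)((yPrev, xk) => tanh(w.t * xk + h * yPrev))
}
```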
  8. On the difficulty of training recurrent networks
     ▌RNNs are (were) known to be difficult to train
     ▌More weights and more computational steps
        - More computationally expensive (an accelerator is needed for the matrix
          ops: BLAS or GPU)
        - More data needed to converge (scalability over Big Data architectures:
          Spark). Theano, TensorFlow, and Caffe do not have distributed versions
     ▌Unable to learn long-range dependencies (Graves et al., 2014)
        - At a given time t, an RNN does not remember the observations before
          $X_{t-n}$
        → New RNN architectures with memory preservation (more context): LSTM, GRU
     ▌GRU update equations:
          $Z_k = f(W_z^T X_k + H_z Y_{k-1})$
          $R_k = f(W_r^T X_k + H_r Y_{k-1})$
          $\tilde{H}_k = \tanh(W_h^T X_k + U (Y_{k-1} \circ R_k))$
          $Y_k = (1 - Z_k) Y_{k-1} + Z_k \tilde{H}_k$
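Written out as code, one GRU step follows the four equations directly (a sketch under the same Breeze assumption; f is the sigmoid and ∘ the elementwise product):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.{sigmoid, tanh}

// Weight matrices for the GRU equations (names match the slide).
case class GruWeights(wz: DenseMatrix[Double], hz: DenseMatrix[Double],
                      wr: DenseMatrix[Double], hr: DenseMatrix[Double],
                      wh: DenseMatrix[Double], u:  DenseMatrix[Double])

// One GRU step: gates Z and R, candidate state H~, then the blended output.
def gruStep(p: GruWeights,
            xk: DenseVector[Double],
            yPrev: DenseVector[Double]): DenseVector[Double] = {
  val z      = sigmoid(p.wz.t * xk + p.hz * yPrev)      // update gate Z_k
  val r      = sigmoid(p.wr.t * xk + p.hr * yPrev)      // reset gate R_k
  val hTilde = tanh(p.wh.t * xk + p.u * (yPrev *:* r))  // candidate state H~_k
  val ones   = DenseVector.ones[Double](z.length)
  (ones - z) *:* yPrev + z *:* hTilde                   // Y_k
}
```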
  9. Recurrent neural networks in Spark
     ▌Spark implementation of DL algorithms (data parallel); one training step is
       sketched below
     ▌All the needed building blocks
        - Affine, convolutional, and recurrent layers (simple and GRU)
        - SGD, RMSProp, AdaDelta optimizers
        - Sigmoid, tanh, ReLU activations
     ▌CPU (and GPU) backend
     ▌Fully compatible with the existing DL library in Spark ML
     ▌Performance
        - On a 6-node cluster (CPU): 5.46x average speedup (some communication
          overhead), about the same speedup as the MLP in Spark ML
     [Figure: driver and three workers; (1) model broadcast to the workers,
      (2) resulting gradients sent back to the driver]
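A sketch of one such data-parallel step against an RDD, following the (1) broadcast / (2) gradient-aggregation scheme in the figure. `computeGradient` is a hypothetical stand-in for the network's back-propagation, and the update shown is plain SGD:

```scala
import breeze.linalg.DenseVector
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// One data-parallel training step:
// (1) broadcast the current model to the workers,
// (2) aggregate per-example gradients back on the driver, then update.
def trainStep(sc: SparkContext,
              data: RDD[(DenseVector[Double], Double)],   // (features, label)
              weights: DenseVector[Double],
              learningRate: Double,
              computeGradient: (DenseVector[Double], (DenseVector[Double], Double)) => DenseVector[Double])
    : DenseVector[Double] = {
  val bcWeights = sc.broadcast(weights)                   // (1) model broadcast
  val n = data.count().toDouble
  val gradSum = data
    .map(ex => computeGradient(bcWeights.value, ex))
    .treeAggregate(DenseVector.zeros[Double](weights.length))(_ + _, _ + _)  // (2)
  bcWeights.unpersist()
  weights - gradSum * (learningRate / n)                  // plain SGD update
}
```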
 10. Use case 1: predictive maintenance (1)
     ▌Context
        - Thales and its clients build systems in different domains: transportation
          (ticketing, control), defense (radar), satellites
        - Need for better and more accurate maintenance services
          - From planned maintenance (every x days) to alert-driven maintenance
          - From expert detection to automatic failure prediction
          - From whole-subsystem replacement to more localized repairs
     ▌Goal
        - Detect early signs of a (sub)system failure using data coming from
          sensors monitoring the health of the system (HUMS)
 11. Use case 1: predictive maintenance (2)
     ▌Example on a real system
        - 20 sensors (20 values every 5 minutes), plus a label (failure or not)
        - Take 3 hours of data and predict the probability of failure in the next
          hour (fully customizable)
        - Learning using MLlib (see the windowing sketch below)
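A hedged sketch of how such windows might feed MLlib: the 36-step history and 12-step horizon follow from the 3 h / 1 h figures above at 5-minute sampling, but the data layout and helper names are hypothetical:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// readings: time-ordered rows of 20 sensor values; failures: per-timestep flag.
// 36 steps of history (3 h at 5 min) predict failure in the next 12 steps (1 h).
def toWindows(readings: Array[Array[Double]],
              failures: Array[Boolean]): Seq[LabeledPoint] =
  (0 until readings.length - 36 - 12).map { t =>
    val features = readings.slice(t, t + 36).flatten      // 36 x 20 = 720 features
    val label =
      if (failures.slice(t + 36, t + 36 + 12).exists(identity)) 1.0 else 0.0
    LabeledPoint(label, Vectors.dense(features))
  }

// Baseline learner on the flattened windows.
def train(data: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)
```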
 12. Use case 1: predictive maintenance (3)
     ▌Recurrent net learning
     ▌Impact of recurrent nets
        - Logistic regression: 70% detection with 70% accuracy
        - Recurrent neural network: 85% detection with 75% accuracy
 13. Use case 2: sentiment analysis (1)
     ▌Context
        - Social network analysis application developed at Thales (Twitter,
          Facebook, blogs, forums)
        - Analyzes both the content of the texts and the relations (texts, actors)
     ▌Multiple (big data) analyses
        - Actor community detection
        - Text clustering (themes)
        - …
     ▌Focus on sentiment analysis of the collected texts
        - Classify texts based on their sentiment
 14. Use case 2: sentiment analysis (2)
     ▌Learning dataset
        - Sentiment140 + Kaggle challenge (1.5M labeled tweets)
        - 50% positive, 50% negative
     ▌Compare bag of words + traditional classifiers (Naïve Bayes, SVM, logistic
       regression) versus an RNN (see the baseline pipeline sketch below)
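For the bag-of-words side of the comparison, a minimal Spark ML pipeline sketch; this is one plausible setup, as the deck does not show the exact configuration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Bag-of-words baseline: tokenize tweets, hash tokens into a sparse
// term-frequency vector, then fit a logistic regression on top.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(100)
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// tweets: DataFrame with columns "text" (tweet body) and "label" (0.0 / 1.0)
// val model = pipeline.fit(tweets)
```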
 15. Use case 2: sentiment analysis (3)
     ▌Results: classification accuracy (%) by number of training examples

        Training examples   NB    SVM   Log Reg   Neural Net (perceptron)   RNN (GRU)
        100                 61.4  58.4  58.4      55.6                      NA
        1 000               70.6  70.6  70.6      70.8                      68.1
        10 000              75.4  75.1  75.4      76.1                      72.3
        100 000             78.1  76.6  76.9      78.5                      79.2
        700 000             80    78.3  78.3      80                        84.1

     [Figure: bar chart of the same accuracies, y-axis from 40 to 90%]
 16. The end… THANK YOU!
