Academic project developing an LSTM, distributing it on Spark and using TensorFlow for numerical operations.
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark
3. Introduction
Distributed environment:
● Many computation units;
● Each unit is called ‘node’;
● Node collaboration/competition;
● Message passing;
● Synchronization and global state management;
4. Apache Spark
● Large-scale data processing framework;
● In-memory processing;
● General purpose:
○ MapReduce;
○ Batch and streaming processing;
○ Machine learning;
○ Graph processing;
○ Etc…
● Scalable;
● Open source;
5. Apache Spark
● Resilient Distributed Dataset (RDD):
○ Fault-tolerant collection of elements;
○ Transformations and actions;
○ Lazy computation;
● Spark core:
○ Task dispatching;
○ Scheduling;
○ I/O;
● Essentially:
○ A master driver organizes the nodes and assigns tasks to workers, passing an RDD;
○ Worker executors run the tasks and return the results in a new RDD (a minimal sketch follows);
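A minimal PySpark sketch of the driver/worker workflow just described; the data and operations are illustrative, not taken from the project:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")      # driver with 4 local workers

rdd = sc.parallelize(range(10), numSlices=4)   # RDD split into 4 partitions
squares = rdd.map(lambda x: x * x)             # transformation: lazy, nothing runs yet
total = squares.reduce(lambda a, b: a + b)     # action: triggers the computation
print(total)                                   # 285
```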
6. Apache Spark Streaming
● Streaming computation;
● Mini-batch strategy (sketched below);
● Latency depends on mini-batch processing time/size;
● Easy to combine with batch strategy;
● Fault tolerance;
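A minimal Spark Streaming sketch of the mini-batch strategy; the socket source, port, and 5-second batch interval are illustrative assumptions, not taken from the project:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "mini-batch-demo")
ssc = StreamingContext(sc, batchDuration=5)      # each mini-batch covers 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # illustrative text source
lines.count().pprint()                           # per-mini-batch record count

ssc.start()                                      # computation starts only now
ssc.awaitTermination()
```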
7. Apache Spark
● API for many languages:
○ Java;
○ Python;
○ Scala;
○ R;
● Runs on:
○ Hadoop;
○ Mesos;
○ Standalone;
○ Cloud;
● It can access diverse data sources including:
○ HDFS;
○ Cassandra;
○ HBase;
8. TensorFlow
● Numerical computation library;
● Computation is graph-based:
○ Nodes are mathematical operations;
○ Edges are the multidimensional arrays (tensors) used for I/O;
● Distributed on multiple CPU/GPU;
● API:
○ Python;
○ C++;
● Open source;
● A Google product;
9. TensorFlow
● Data Flow Graph:
○ Directed graph;
○ Nodes are mathematical operations or data I/O;
○ Edges are I/O tensors;
○ Operations are asynchronous and parallel:
■ Performed once all input tensors are available;
● Flexible and easily extensible;
● Auto-differentiation;
● Lazy computation (sketched below);
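A minimal sketch of graph construction, lazy execution, and auto-differentiation, assuming the TensorFlow 1.x API used by the project; the values are illustrative:

```python
import tensorflow as tf  # TensorFlow 1.x API assumed

# Building the graph performs no computation yet.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
c = tf.multiply(a, b)           # a node; its edges carry tensors
grad_a = tf.gradients(c, a)[0]  # auto-differentiation: dc/da = b

with tf.Session() as sess:      # execution is deferred to session.run
    print(sess.run([c, grad_a], feed_dict={a: 3.0, b: 4.0}))  # [12.0, 4.0]
```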
10. RNN-LSTM
● Recurrent Neural Network;
● Cyclic networks:
○ At each training step the output of the previous step is fed back into the same layer together with new input data;
● The input Xt is transformed in the hidden layer A; the output is also used to feed the layer itself;
*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
11. RNN-LSTM
● Recurrent Neural Network;
● Cyclic networks:
○ At each training step the output of the previous step is fed back into the same layer together with new input data;
● Unrolled network:
○ Each input feeds the network;
○ The output is passed to the next step as a supplementary input;
*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
12. RNN-LSTM
● This kind of network has a major problem:
○ It is unable to learn long data sequences;
○ It works only in the short term;
● A ‘long memory’ model is needed:
○ Long short-term memory (LSTM);
● The hidden layer is able to memorize long data sequences using:
○ Current input;
○ Previous output;
○ Network memory state;
*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
13. RNN-LSTM
● The hidden layer is able to memorize long data sequences using:
○ Current input;
○ Previous output;
○ Network memory state;
● Four ‘gate layers’ to preserve information:
○ Forget gate layer;
○ Input gate layer;
○ ‘Candidate’ gate layer;
○ Output gate layer;
● Multiple activation functions (see the equations below):
○ Sigmoid for the forget, input, and output gate layers;
○ Tanh for the ‘candidate’ layer and for squashing the cell state in the output;
*Image from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
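For reference, the standard LSTM equations behind the four gate layers, in the notation of the cited blog post (σ is the sigmoid and * is element-wise multiplication):

```latex
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)         % forget gate
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)         % input gate
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)  % candidate values
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t              % new cell state
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)         % output gate
h_t = o_t * \tanh(C_t)                               % new output
```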
14. Implementation
● RNN-LSTM:
○ Distributed on Spark;
○ Mathematical operations with TensorFlow;
● Distribution of mini-batch computation:
○ Each partition takes care of a subset of the whole dataset;
○ Each subset has the same size; with proper techniques the mini-batch strategy does not require this, but we want to test performance across all partitions under a balanced load;
● TensorFlow provides many LSTM implementations, but we decided to implement the network from scratch for learning purposes;
15. Implementation
● A master driver splits the input data into partitions organized by key:
○ Input data is shuffled and normalized;
○ Each partition will have its own RDD;
● Each spark-worker runs an entire LSTM training cycle:
○ We will have as many LSTMs as partitions;
○ It is possible to choose the number of epochs, the number of hidden layers, and the number of partitions;
○ The memory to assign to each worker and many other parameters can also be set;
● At the end of the training step the returned RDD is mapped into a key-value data structure holding the weight and bias values;
● Finally, all elements in the RDDs are averaged to obtain the final result (sketched below);
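A hedged sketch of the driver-side flow, assuming a hypothetical train_lstm function that runs one training cycle per partition and yields (name, array) parameter pairs; the real project code lives in the linked repository:

```python
import numpy as np
from pyspark import SparkContext

def train_lstm(partition):
    # Hypothetical stand-in for the per-partition LSTM training cycle;
    # a real version would train on list(partition) and yield the
    # learned parameters. Dummy arrays are yielded here instead.
    _ = list(partition)
    yield ("weights", np.random.rand(4, 3))
    yield ("biases", np.random.rand(3))

sc = SparkContext("local[3]", "lstm-train")
rdd = sc.parallelize(range(150), numSlices=3)  # stand-in for the shuffled, normalized dataset

# One LSTM per partition; results come back as (name, array) pairs.
trained = rdd.mapPartitions(train_lstm)

# Average each parameter across partitions to obtain the final model.
sums = trained.mapValues(lambda v: (v, 1)) \
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
final = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()
```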
16. Implementation
● A new LSTM is created using TensorFlow mathematical operations:
○ Operations are executed in a lazy manner;
○ Initialization builds and organizes the data graph;
● Weights and biases are initialized randomly;
● An optimizer is chosen and an OutputLayer is instantiated;
● Because of the lazy strategy, all operations must be placed inside a ‘session’ scope:
○ Session handles initialization ops and graph execution;
○ All variables must be initialized before any run;
● Taking advantage of Python’s support for passing functions, all computation layers are performed with a single method:
○ Each call uses a different activation function and the appropriate variables (sketched below);
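A minimal sketch of the single gate-layer method, assuming TensorFlow 1.x; gate_layer and the variable names are illustrative, not the project's identifiers:

```python
import tensorflow as tf  # TensorFlow 1.x API assumed

def gate_layer(x, h_prev, W, b, activation):
    # One method computes any of the four gate layers; only the
    # activation function and the variables change per call.
    concat = tf.concat([h_prev, x], axis=1)
    return activation(tf.matmul(concat, W) + b)

# f_t = gate_layer(x_t, h_prev, W_f, b_f, tf.sigmoid)   # forget gate
# c_tilde = gate_layer(x_t, h_prev, W_c, b_c, tf.tanh)  # candidate layer
```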
17. Implementation
● At the end, minimization is performed:
○ The loss function is computed in the output layer;
○ Minimization uses TensorFlow’s auto-differentiation (sketched below);
● Finally, the data are organized in a key-value structure holding the weights and biases;
● It is also possible to perform evaluation on the data, but since it is not a very time-consuming task, its timings are not reported.
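A minimal sketch of the minimization step, assuming TensorFlow 1.x; the stand-in output layer, gradient-descent optimizer, and learning rate are illustrative choices, not necessarily the project's:

```python
import tensorflow as tf  # TensorFlow 1.x API assumed

x = tf.placeholder(tf.float32, [None, 4])       # e.g. Iris features
labels = tf.placeholder(tf.float32, [None, 3])  # one-hot targets
W = tf.Variable(tf.random_normal([4, 3]))       # stand-in output layer
b = tf.Variable(tf.zeros([3]))
logits = tf.matmul(x, W) + b

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# minimize() uses auto-differentiation to build the gradient ops.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # all variables before any run
    # sess.run(train_op, feed_dict={x: ..., labels: ...})  # one training step
```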
18. Results
● Tested locally in a multicore environment:
○ A distributed environment was not available;
○ Each partition is assigned to a core;
● No GPU usage;
● Iris dataset*;
● Overloaded CPUs vs. idle CPUs;
● 12 cores, 64 GB RAM;
* http://archive.ics.uci.edu/ml/datasets/Iris
19. Results
● 3 partitions:
Partition              T. exec (s)   T. exec (min)
1                      1385.62       ~23
2                      1675.76       ~28
3                      1692.48       ~28
Tot + weight average   1704.81       ~28
Tot + repartition      1704.81       ~28