Deep Learning: Evolution of ML from Statistical to Brain-like Computing
Dr. Dobbs Conference Keynote, 20th Nov 2014, Bangalore
Dr. Vijay Srinivas Agneeswaran, Director, Big-data Labs, Impetus
Contents 
Introduction to Artificial Neural Networks 
Deep learning networks 
Towards deep learning 
From ANNs to DLNs. 
Basics of DLNs. 
Related Approaches. 
Distributed DLNs: Challenges 
Introduction to Spark 
Distributed DLNs over Spark 
Deep Learning: Evolution Timeline 
Introduction to Artificial Neural Networks (ANNs) 
Perceptron 
Introduction to Artificial Neural Networks (ANNs): Sigmoid Neuron
• Small change in input = small change in behaviour (output).
• The output of a sigmoid neuron is given below:
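The formula appears as an image in the original slide; the standard sigmoid-neuron output it depicts is reproduced here for completeness:
\sigma(z) = 1 / (1 + e^{-z}),    output = \sigma(\sum_j w_j x_j + b)
where the w_j are the weights, the x_j the inputs, and b the bias. Unlike the perceptron's step function, \sigma varies smoothly, which is why small changes in the input produce small changes in the output.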
Introduction to Artificial Neural Networks (ANNs): Back Propagation
What is this? NAND Gate!
http://zerkpage.tripod.com/ann.htm
initialize network weights (often to small random values)
do
  for each training example ex:
    prediction = neural-net-output(network, ex)               // forward pass
    actual     = teacher-output(ex)
    compute error (prediction - actual) at the output units
    compute delta(w_h) for all weights from hidden layer to output layer   // backward pass
    compute delta(w_i) for all weights from input layer to hidden layer    // backward pass, continued
    update network weights
until all examples are classified correctly or another stopping criterion is satisfied
return the network
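As a concrete illustration of the loop above (my own sketch, not code from the talk), the Scala snippet below trains a single sigmoid neuron to reproduce the NAND gate mentioned on the slide; with a single neuron there is no hidden layer, so the backward pass reduces to the output-layer delta. After training, the outputs should approach 1, 1, 1, 0 for the four input rows.

object NandNeuron {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    // NAND truth table: the output is 0 only when both inputs are 1.
    val data = Seq(
      (Array(0.0, 0.0), 1.0),
      (Array(0.0, 1.0), 1.0),
      (Array(1.0, 0.0), 1.0),
      (Array(1.0, 1.0), 0.0))

    var w  = Array(0.0, 0.0)   // weights (small initial values)
    var b  = 0.0               // bias
    val lr = 0.5               // learning rate

    for (_ <- 1 to 10000; (x, t) <- data) {
      val y     = sigmoid(w(0) * x(0) + w(1) * x(1) + b)   // forward pass
      val delta = (y - t) * y * (1 - y)                    // output error * sigmoid'
      w = Array(w(0) - lr * delta * x(0), w(1) - lr * delta * x(1))   // weight update
      b -= lr * delta
    }

    data.foreach { case (x, t) =>
      val y = sigmoid(w(0) * x(0) + w(1) * x(1) + b)
      println(f"input=(${x(0)}%.0f, ${x(1)}%.0f) target=$t%.0f output=$y%.3f")
    }
  }
}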
The network to identify the individual digits from the input image
http://neuralnetworksanddeeplearning.com/chap1.html
Different Shallow Architectures
A weighted sum over a single layer of units – template matchers, fixed basis functions, or simple trainable basis functions – covering linear predictors, kernel machines, and ANNs / radial basis functions.
Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large Scale Kernel Machines (L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, eds.), MIT Press, 2007.
ANNs for Face Recognition? 
DLN for Face Recognition 
http://theanalyticsstore.com/deep-learning/ 
Deep Learning Networks: Learning
• No general learning algorithm (no-free-lunch theorem, Wolpert 1996).
• Learning algorithms for specific tasks – perception, control, prediction, planning, reasoning, language understanding.
• Limitations of back-propagation – local minima, optimization challenges for non-convex objective functions.
• Hinton's deep belief networks as a stack of RBMs.
• LeCun's energy-based learning for DBNs.
Deep Belief Networks
• A deep neural network composed of multiple layers of latent variables (hidden units or feature detectors).
• Can be viewed as a stack of RBMs.
• Hinton and his co-authors proposed that these networks can be trained greedily, one layer at a time.
• A Boltzmann Machine is a specific energy-based model with a linear energy function.
http://www.iro.umontreal.ca/~lisa/twiki/pub/Public/DeepBeliefNetworks/DBNs.png
Other DL Networks: Auto-encoders (Auto-associators or Diabolo Networks)
• The aim of an auto-encoder network is to learn a compressed representation of a data set.
• It is an unsupervised learning algorithm that applies back propagation, setting the target values equal to the inputs (identity function).
• A denoising auto-encoder avoids trivially learning the identity function by randomly corrupting the input, which the network must then reconstruct, i.e. denoise (see the sketch after this list).
• Best applied when there is structure in the data.
• Applications: dimensionality reduction, feature selection.
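A minimal sketch of the idea (my own illustration, not from the slides): corrupt the input, encode it with a sigmoid hidden layer, decode with tied weights, and measure the squared reconstruction error against the clean input. A real implementation would also back-propagate this error to train W, b and c.

import scala.util.Random

object DenoisingAutoEncoderSketch {
  val rnd = new Random(0)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]): Unit = {
    val nVisible = 6
    val nHidden  = 3
    // Randomly initialised parameters: weights W, hidden bias b, visible bias c.
    val W = Array.fill(nHidden, nVisible)(rnd.nextGaussian() * 0.1)
    val b = Array.fill(nHidden)(0.0)
    val c = Array.fill(nVisible)(0.0)

    val x = Array(1.0, 0.0, 1.0, 1.0, 0.0, 0.0)            // clean input

    // Denoising: randomly zero out roughly 30% of the input components.
    val corrupted = x.map(v => if (rnd.nextDouble() < 0.3) 0.0 else v)

    // Encode: h = sigmoid(W * corrupted + b)
    val h = Array.tabulate(nHidden) { j =>
      sigmoid((0 until nVisible).map(i => W(j)(i) * corrupted(i)).sum + b(j))
    }
    // Decode with tied weights: xHat = sigmoid(W' * h + c)
    val xHat = Array.tabulate(nVisible) { i =>
      sigmoid((0 until nHidden).map(j => W(j)(i) * h(j)).sum + c(i))
    }
    // Reconstruction error against the *clean* input – the training signal.
    val loss = x.zip(xHat).map { case (t, y) => (t - y) * (t - y) }.sum
    println(s"reconstruction error = $loss")
  }
}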
Why Deep Learning Networks are Brain-like?
• The statistical approach of traditional ML – SVMs or kernel approaches – is not applicable in deep learning networks.
• Human brain – trophic factors.
• Traditional ML needs a lot of data munging and representational work (feature abstraction) before the classifier can kick in.
• Deep learning allows the system to learn the representations as well, naturally.
Success stories of DLNs
• Android voice recognition system – based on DLNs; improves accuracy by 25% compared to the state of the art.
• Microsoft Skype Translate software and the digital assistant Cortana.
• ImageNet data (1.2 million images, 1000 classes) – error rate of 15.3%, better than the previous state of the art at 26.1%.
Success stories of DLNs (contd.)
• Senna system – PoS tagging, chunking, NER, semantic role labeling, syntactic parsing; F1 score comparable to the state of the art, with a huge speed advantage (5 days vs. a few hours).
• DLNs vs. TF-IDF for relevance search over 1 million documents: 3.2 ms vs. 1.2 s.
• Robot navigation.
Potential Applications of DLNs
• Speech recognition/enhancement
• Video sequencing
• Emotion recognition (video/audio)
• Malware detection
• Robotics – navigation
• Multi-modal learning (text and image)
• Natural language processing
Available resources
• Deeplearning4j – open-source implementation of Jeffrey Dean's distributed deep learning paper.
• Theano: a Python library of math functions; makes efficient use of GPUs transparently.
• Hinton's courses on Coursera: https://www.coursera.org/instructor/~154
Challenges in Realizing DLNs
• A large number of training examples is needed for high accuracy; a large number of parameters can also improve accuracy.
• Inherently sequential nature – freeze up one layer at a time for learning.
• GPUs improve training speed, but are limited by CPU-to-GPU data transfers.
• Distributed DLNs – Jeffrey Dean's work.
Distributed DLNs
• Motivation
  • Scalable, low-latency training
  • Parallelize the training data and learn fast
• Jeffrey Dean's work: DistBelief
  • Pseudo-centralized realization
What is Spark?
• Spark provides a computing abstraction that generalizes MapReduce.
• A more powerful set of operations than just map and reduce – group by, order by, sort, reduce by key, sample, union, etc.
• Provides an efficient execution environment based on distributed shared memory – keeps the working set of data in memory.
• Shark provides a Hive Query Language (HQL) interface over Spark.
What is Spark? Data Flow in Hadoop 
What is Spark? Data Flow in Spark 
Real-world use-case example: HITS algorithm
The hub score and authority score for a node are calculated with the following algorithm:
1. Start with each node having a hub score and authority score of 1, i.e. auth(p) = 1 and hub(p) = 1.
2. Run the Authority Update Rule: update each node's authority score to be the sum of the hub scores of each node that points to it. That is, a node is given a high authority score by being linked to by pages that are recognized as hubs for information.
3. Run the Hub Update Rule: update each node's hub score to be the sum of the authority scores of each node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.
4. Normalize the values by dividing each hub score by the square root of the sum of the squares of all hub scores, and dividing each authority score by the square root of the sum of the squares of all authority scores.
5. Repeat from the second step as necessary.
Solve HITS algorithm using Hadoop MR
(Diagram: the data flows through HDFS storage, with reads and writes between the steps.)
Step 1: auth(p) = 1 and hub(p) = 1
Step 2: Run Authority Update Rule – auth(p) = X
Step 3: Run Hub Update Rule – hub(p) = Y
Step 4: Normalize hub(p) and auth(p)
Solve HITS algorithm using Spark
(Diagram: a single read from HDFS at the start and a single write at the end; intermediate results stay in memory.)
Step 1: auth(p) = 1 and hub(p) = 1
Step 2: Run Authority Update Rule – auth(p) = X
Step 3: Run Hub Update Rule – hub(p) = Y
Step 4: Normalize hub(p) and auth(p)
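The four steps map naturally onto RDD operations. Below is a minimal sketch of repeated HITS iterations in Spark (the toy in-memory graph and the local master are my assumptions; this is not code from the talk):

import org.apache.spark.{SparkConf, SparkContext}

object HitsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HITS").setMaster("local[*]"))

    // A toy link graph as (source, destination) edges; in practice this would be read from HDFS.
    val links = sc.parallelize(Seq(("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"))).cache()

    // Step 1: every node starts with hub(p) = auth(p) = 1.
    val nodes = links.flatMap { case (s, d) => Seq(s, d) }.distinct()
    var hub  = nodes.map(n => (n, 1.0))
    var auth = nodes.map(n => (n, 1.0))

    for (_ <- 1 to 10) {
      // Step 2 (Authority Update Rule): auth(p) = sum of hub scores of nodes pointing to p.
      val newAuth = links.join(hub)                        // (src, (dst, hub(src)))
        .map { case (_, (dst, h)) => (dst, h) }
        .reduceByKey(_ + _)

      // Step 3 (Hub Update Rule): hub(p) = sum of authority scores of nodes p points to.
      val newHub = links.map(_.swap).join(newAuth)         // (dst, (src, auth(dst)))
        .map { case (_, (src, a)) => (src, a) }
        .reduceByKey(_ + _)

      // Step 4: normalize each score vector by its L2 norm.
      val authNorm = math.sqrt(newAuth.values.map(v => v * v).sum())
      val hubNorm  = math.sqrt(newHub.values.map(v => v * v).sum())
      auth = newAuth.mapValues(_ / authNorm).cache()
      hub  = newHub.mapValues(_ / hubNorm).cache()
    }

    auth.collect().sorted.foreach { case (n, a) => println(s"auth($n) = $a") }
    hub.collect().sorted.foreach  { case (n, h) => println(s"hub($n) = $h") }
    sc.stop()
  }
}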
Spark Transformations/Actions
• Map(function f1): Pass each element of the RDD through f1 in parallel and return the resulting RDD.
• Filter(function f2): Select the elements of the RDD that return true when passed through f2.
• flatMap(function f3): Similar to Map, but f3 returns a sequence, to facilitate mapping a single input to multiple outputs.
• Union(RDD r1): Returns the union of the RDD r1 with self.
• Sample(flag, p, seed): Returns a randomly sampled (with seed) p percentage of the RDD.
• groupByKey(noTasks): Can only be invoked on key-value paired data – returns the data grouped by key. The number of parallel tasks is given as an argument (default is 8).
• reduceByKey(function f4, noTasks): Aggregates the result of applying f4 to elements with the same key. The number of parallel tasks is the second argument.
• Join(RDD r2, noTasks): Joins RDD r2 with self – computes all possible pairs for a given key.
• groupWith(RDD r3, noTasks): Joins RDD r3 with self and groups by key.
• sortByKey(flag): Sorts the self RDD in ascending or descending order based on flag.
• Reduce(function f5): Aggregates the result of applying function f5 to all elements of the self RDD.
• Collect(): Returns all elements of the RDD as an array.
• Count(): Counts the number of elements in the RDD.
• take(n): Gets the first n elements of the RDD.
• First(): Equivalent to take(1).
• saveAsTextFile(path): Persists the RDD in a file in HDFS or another Hadoop-supported file system at the given path.
• saveAsSequenceFile(path): Persists the RDD as a Hadoop sequence file. Can be invoked only on key-value paired RDDs that implement the Hadoop Writable interface or equivalent.
• foreach(function f6): Runs f6 in parallel on the elements of the self RDD.
[MZ12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, 2012.
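To show a few of these operations chained together, here is a small word-count example (the inline data and the local master are my assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object TransformationsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val lines = sc.parallelize(Seq("spark generalizes map reduce",
                                   "spark keeps the working set in memory"))

    val counts = lines
      .flatMap(_.split("\\s+"))        // flatMap: one line -> many words
      .filter(_.length > 3)            // filter: drop short words
      .map(word => (word, 1))          // map: turn each word into a key-value pair
      .reduceByKey(_ + _)              // reduceByKey: aggregate the counts per word

    counts.collect().foreach(println)  // collect: bring the results back to the driver
    sc.stop()
  }
}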
Berkeley Big-data Analytics Stack (BDAS) 
Spark: Use Cases
Ooyala
• Uses Cassandra for video data personalization.
• Pre-computed aggregates vs. on-the-fly queries.
• Moved to Spark for ML and computing views.
• Moved to Shark for on-the-fly queries – C* OLAP aggregate queries take 130 s on Cassandra, 60 ms in Spark.
Conviva
• Uses Hive for repeatedly running ad-hoc queries on video data.
• Optimized ad-hoc queries using Spark RDDs – found Spark to be 30 times faster than Hive.
• ML for connection analysis and video streaming optimization.
Yahoo
• Advertisement targeting: 30K nodes on Hadoop YARN.
• Hadoop – batch processing; Spark – iterative processing; Storm – on-the-fly processing.
• Content recommendation – collaborative filtering.
Spark Use Cases: Spark is good for linear algebra, optimization and N-body problems
Computations/operations – the "giants" of [1]:
• Giant 1 (simple statistics) is a perfect fit for Hadoop 1.0.
• For Giants 2 (linear algebra), 3 (N-body) and 4 (optimization), Spark from UC Berkeley is efficient: logistic regression, kernel SVMs, conjugate gradient descent, collaborative filtering, Gibbs sampling, alternating least squares. An example is the social group-first approach for consumer churn analysis [2].
• Interactive/on-the-fly data processing – Storm.
• OLAP – data cube operations: Dremel/Drill.
• Data sets that are not embarrassingly parallel? Deep learning: artificial neural networks / deep belief networks – machine vision from Google [3], speech analysis from Microsoft.
• Giant 5 – graph processing: GraphLab, Pregel, Giraph.
[1] National Research Council. Frontiers in Massive Data Analysis. Washington, DC: The National Academies Press, 2013.
[2] Richter, Yossi; Yom-Tov, Elad; Slonim, Noam. Predicting Customer Churn in Mobile Networks through Analysis of Social Groups. In Proceedings of the SIAM International Conference on Data Mining, 2010, pp. 732-741.
[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng. Large Scale Distributed Deep Networks. NIPS 2012.
Some Spark(ling) examples 
Scala code (serial) 
var count = 0 
for (i <- 1 to 100000) 
{ val x = Math.random * 2 - 1 
val y = Math.random * 2 - 1 
if (x*x + y*y < 1) count += 1 } 
println("Pi is roughly " + 4 * count / 100000.0) 
Sample random points in the square [-1, 1] x [-1, 1] and count how many fall inside the unit circle (roughly a fraction PI/4 of them). Hence you get an approximate value for PI:
points-in-square / points-in-circle ≈ area-of-square / area-of-circle = 4/PI, so PI ≈ 4 * (points-in-circle / points-in-square).
Some Spark(ling) examples 
Spark code (parallel) 
val spark = new SparkContext(<Mesos master>) 
var count = spark.accumulator(0) 
for (i <- spark.parallelize(1 to 100000, 12)) 
{ val x = Math.random * 2 - 1 
val y = Math.random * 2 - 1 
if (x*x + y*y < 1) count += 1 } 
println("Pi is roughly " + 4 * count / 100000.0) 
Notable points:
1. A Spark context is created – it talks to the Mesos¹ master.
2. count becomes a shared variable – an accumulator.
3. parallelize creates an RDD by breaking the Scala range object (1 to 100000) into 12 slices.
4. The for loop invokes the foreach method of the RDD, so its body runs in parallel on the workers.
¹ Mesos is an Apache-incubated clustering system – http://mesosproject.org
Logistic Regression in Spark: Serial Code 
// Read data file and convert it into Point objects 
val lines = scala.io.Source.fromFile("data.txt").getLines() 
val points = lines.map(x => parsePoint(x)) 
// Run logistic regression 
var w = Vector.random(D) 
for (i <- 1 to ITERATIONS) { 
val gradient = Vector.zeros(D) 
for (p <- points) { 
val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y 
gradient += scale * p.x 
} 
w -= gradient 
} 
println("Result: " + w)
Logistic Regression in Spark 
// Read data file and transform it into Point objects 
val spark = new SparkContext(<Mesos master>) 
val lines = spark.hdfsTextFile("hdfs://.../data.txt") 
val points = lines.map(x => parsePoint(x)).cache() 
// Run logistic regression 
var w = Vector.random(D) 
for (i <- 1 to ITERATIONS) { 
val gradient = spark.accumulator(Vector.zeros(D)) 
for (p <- points) { 
val scale = (1/(1+Math.exp(-p.y*(w dot p.x)))-1)*p.y 
gradient += scale * p.x 
} 
w -= gradient.value 
} 
println("Result: " + w)
Deep Learning on Spark
• Fully distributed deep learning network implementation on Spark.
• Spark handles the parallelism, synchronization, distribution, and failover.
• The input data set lives in HDFS; intermediate data lives in the local file system.
• Publish/subscribe message-passing framework built on top of Apache Spark using the Akka framework (a toy Akka sketch follows below).
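For readers unfamiliar with Akka, here is a toy publish/subscribe sketch (purely illustrative – it is not the Impetus framework, ignores Spark entirely, and assumes the classic Akka actor API on the classpath): a parameter actor subscribes to gradient messages on the actor system's event stream and applies them as they arrive.

import akka.actor.{Actor, ActorSystem, Props}

case class GradientMsg(values: Array[Double])      // hypothetical message type

class ParameterActor extends Actor {
  private var weights = Array.fill(4)(0.0)
  def receive: Receive = {
    case GradientMsg(g) =>
      // Apply the published gradient with a fixed learning rate.
      weights = weights.zip(g).map { case (w, d) => w - 0.1 * d }
      println(s"updated weights: ${weights.mkString(", ")}")
  }
}

object PubSubSketch {
  def main(args: Array[String]): Unit = {
    val system = ActorSystem("dln")
    val params = system.actorOf(Props[ParameterActor], "parameters")

    // Subscribe the parameter actor to all GradientMsg events, then publish a few.
    system.eventStream.subscribe(params, classOf[GradientMsg])
    system.eventStream.publish(GradientMsg(Array(0.1, -0.2, 0.05, 0.0)))
    system.eventStream.publish(GradientMsg(Array(0.0, 0.1, -0.1, 0.2)))

    Thread.sleep(500)        // let the messages be processed before shutting down
    system.terminate()
  }
}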
Conclusions 
• ANN to Distributed Deep Learning 
• Key ideas in deep learning 
• Need for distributed realizations. 
• DistBelief, deeplearning4j etc. 
• Our work on large scale distributed deep learning 
• Deep learning leads us from statistics based 
machine learning towards brain inspired AI. 
Thank You! 
Mail • vijay.sa@impetus.co.in 
LinkedIn • http://in.linkedin.com/in/vijaysrinivasagneeswaran 
Blogs • blogs.impetus.com 
Twitter • @a_vijaysrinivas. 
Backup Slides 
Energy Based Models
• RBMs are Energy Based Models (EBMs).
• EBMs associate an energy with every configuration of a system.
• Learning corresponds to modifying the shape of the energy function so that it has desirable properties.
• As in physics, lower energy = more stability.
• So, modify the shape of the energy function such that the desirable configurations have lower energy.
http://www.cs.nyu.edu/~yann/research/ebm/loss-func.png
Other DL networks: Convolutional Networks
Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object Recognition with Gradient-Based Learning. In Shape, Contour and Grouping in Computer Vision, David A. Forsyth, Joseph L. Mundy, Vito Di Gesù, and Roberto Cipolla (Eds.), Springer-Verlag, London, 1999.
Other Brain-like Approaches
• Recurrent neural networks
  • Long Short-Term Memory (LSTM), temporal data
• Sum-product networks
  • Deep architectures of sum-product networks
• Hierarchical temporal memory
  • Online structural and algorithmic model of the neocortex
Recurrent Neural Networks
• Connections between units form a directed cycle, i.e. there are feedback connections.
• RNNs can use their internal memory to process arbitrary sequences of inputs.
• Plain RNNs cannot learn to look far back into the past.
• LSTMs solve this problem by introducing memory cells.
• These memory cells can remember a value for an arbitrary amount of time.
Sum-Product Networks (SPNs)
• An SPN is a deep network model structured as a directed acyclic graph.
• These networks allow the probability of an event to be computed quickly.
• SPNs try to convert multilinear functions into computationally short forms, i.e. ones consisting of multiple additions and multiplications.
• Leaves correspond to variables, and internal nodes correspond to sums and products.
Hierarchical Temporal Memory
• An online machine learning model developed by Jeff Hawkins.
• The model learns one instance at a time.
• Best explained by an online stock model: today's state of a stock helps in predicting tomorrow's.
• An HTM network is a tree-shaped hierarchy of levels.
• Higher levels of the hierarchy can reuse patterns learned at lower levels; this is adapted from the brain's learning model in the form of the neocortex.
http://en.wikipedia.org/wiki/Hierarchical_temporal_memory 
Mathematical Equations
• The energy function of an RBM is defined as follows:
  E(x, h) = -b'x - c'h - h'Wx
  where W represents the weights connecting the visible layer and the hidden layer, and b and c are the biases of the visible and hidden units (primes denote transposes).
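To make the notation concrete, here is a small sketch (my own illustration; the sizes and parameter values are made up) that evaluates E(x, h) = -b'x - c'h - h'Wx for a toy RBM:

object RbmEnergy {
  // E(x, h) = -b'x - c'h - h'Wx for a small RBM with 3 visible and 2 hidden units.
  def energy(x: Array[Double], h: Array[Double],
             W: Array[Array[Double]],               // W(j)(i): hidden unit j <-> visible unit i
             b: Array[Double], c: Array[Double]): Double = {
    val visibleTerm = b.zip(x).map { case (bi, xi) => bi * xi }.sum
    val hiddenTerm  = c.zip(h).map { case (cj, hj) => cj * hj }.sum
    val interaction = (for (j <- h.indices; i <- x.indices) yield h(j) * W(j)(i) * x(i)).sum
    -visibleTerm - hiddenTerm - interaction
  }

  def main(args: Array[String]): Unit = {
    val W = Array(Array(0.5, -0.2, 0.1), Array(0.3, 0.4, -0.6))
    val b = Array(0.1, 0.0, -0.1)       // visible biases
    val c = Array(0.2, -0.3)            // hidden biases
    println(energy(Array(1.0, 0.0, 1.0), Array(1.0, 1.0), W, b, c))
  }
}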
Learning Energy Based Models
• Energy based models can be learnt by performing gradient descent on the negative log-likelihood of the training data.
• The gradient has the following form:
  -∂ log p(x) / ∂θ = ∂F(x)/∂θ − Σ_x̃ p(x̃) ∂F(x̃)/∂θ
  where F is the free energy. The first term is the positive phase; the second term (an expectation under the model distribution) is the negative phase.

Editor's Notes

  • #8 Reference: http://neuralnetworksanddeeplearning.com/chap1.html. Consider the problem of identifying the individual digits in an input image, where each image is a 28 by 28 pixel image. The network is then designed as follows: the input layer (the image) has 28*28 = 784 neurons, one per pixel; the output layer is sized by the number of digits to be identified, i.e. 10 (0 to 9); the intermediate hidden layer can be experimented with using varying numbers of neurons – here it is fixed at 10 nodes.
  • #10 Reference: http://neuralnetworksanddeeplearning.com/chap1.html. How about recognizing a human face from a given set of random images? Attack this problem in the same fashion as explained earlier: input – the image pixels; output – is it a face or not? (a single node). A face can be recognized by answering questions like "Is there an eye in the top left?", "Is there a nose in the middle?", etc., and each question corresponds to a hidden layer. Why can SVMs or any kernel-based approach not be used here? They make an implicit assumption of a locally smooth function around each training example. Instead, the problem is decomposed into sub-problems solvable by sub-networks; a complex problem requires more sub-networks and more hidden layers, hence the need for deep neural networks.
  • #14 http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity
  • #42 http://deeplearning4j.org/convolutionalnets.html. Introduced in 1980 by Fukushima and refined by LeCun in 1989, mainly to apply CNNs to identify variability in 2D image data. A type of RBM where communication is absent across nodes in the same layer; nodes are not connected to every node of the next layer, so the symmetry is not there. Convolutional networks learn images piece by piece rather than as a whole (as an RBM does), and are designed to use minimal amounts of pre-processing.
  • #44 http://www.idsia.ch/~juergen/rnn.html
  • #45 http://deep-awesomeness.tumblr.com/post/63736448581/sum-product-networks-spm http://lessoned.blogspot.in/2011/10/intro-to-sum-product-networks.html
  • #46 http://en.wikipedia.org/wiki/Hierarchical_temporal_memory