This talk presents how to tackle the challenge of implementing Deep Learning on top of Spark's compute infrastructure. It covers how to build a project using the Spark ML infrastructure and Intel's BigDL, and how to put it into production.
Find out more at https://madrid2018.codemotionworld.com/speakers/
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
1. Deep Learning in Spark
Emiliano Martínez
Codemotion 2018
2. ABOUT ME
BBVA Innovation Labs
Cybersecurity
Hyperscale
Artificial Intelligence
https://www.bbva.com/en/guest-authors/bbva-labs/
3. THE FIRST: Deep Learning
ML - “the field of study that gives
computers the ability to learn
without being explicitly
programmed”
Arthur Samuel
DL - “for most flavors of the old
generations of learning algorithms …
performance will plateau. … deep
learning … is the first class of
algorithms … that is scalable. …
performance just keeps getting
better as you feed them more data”
Andrew Ng
http://www.cs.ox.ac.uk/activities/machlearn/
4. Intuition
https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
Logistic Regression vs. ANN
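The intuition can be made concrete in a few lines: a logistic regression is a single sigmoid unit, and an ANN is those same units stacked in layers. A plain-Python sketch with made-up weights (illustrative, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, w, b):
    # A single logistic unit: one weighted sum followed by a sigmoid.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def tiny_ann(x, hidden, output):
    # An ANN stacks logistic units: each hidden unit is itself a
    # "logistic regression" over the inputs, and the output unit is a
    # logistic regression over the hidden activations.
    h = [logistic_regression(x, w, b) for (w, b) in hidden]
    w_out, b_out = output
    return logistic_regression(h, w_out, b_out)

x = [1.0, 2.0]
p_lr = logistic_regression(x, [0.5, -0.25], 0.1)
p_ann = tiny_ann(x,
                 hidden=[([0.5, -0.25], 0.1), ([-0.3, 0.8], 0.0)],
                 output=([1.0, -1.0], 0.0))
```

The extra hidden layer is what lets the network learn non-linear decision boundaries that a single logistic unit cannot represent.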
5. THE LAST: Apache Spark I
https://databricks.com/spark/about
6. Apache Spark II
“Apache Spark is a high-performance,
general-purpose distributed computing system that
enables us to process large … very large
quantities of data, beyond what can fit in a
single machine”
“Internet powerhouses such as Netflix,
Yahoo, and eBay have deployed Spark
at massive scale, collectively processing
multiple petabytes of data on clusters
of over 8,000 nodes”
https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/
https://techvidvan.com/tutorials/spark-cluster-manager-yarn-mesos-and-standalone/
8. Deep Learning Over Spark - Software
https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom
You may want to write your deep learning programs using BigDL if:
- You want to analyze a large amount of data on the same Big Data (Hadoop/Spark) cluster where the data are stored (in, say, HDFS, HBase, Hive, etc.).
- You want to add deep learning functionalities (either training or prediction) to your Big Data (Spark) programs and/or workflow.
- You want to leverage existing Hadoop/Spark clusters to run your deep learning applications, which can then be dynamically shared with other workloads (e.g., ETL, data warehouse, feature engineering, classical machine learning, graph analytics, etc.).
9. Deep Learning Over Spark - Software
DL4J takes advantage of the latest distributed computing frameworks including Apache Spark and
Hadoop to accelerate training. On multi-GPUs, it is equal to Caffe in performance.
Deeplearning4j is written in Java and is compatible with any JVM language, such as Scala, Clojure
or Kotlin. The underlying computations are written in C, C++ and Cuda. Keras will serve as the
Python API.
10. Deep Learning Over Spark - Software
“MMLSpark integrates the distributed computing framework Apache Spark with the flexible deep
learning framework CNTK. Enabling deep learning at unprecedented scales.”
“Spark is well known for its ability to switch between batch and streaming workloads by modifying a
single line. We push this concept even further and enable distributed web services with the same API
as batch and streaming workloads.”
11. Deep Learning Over Spark - Software
Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O
with the capabilities of Spark. Spark is an elegant and powerful general-purpose, open-source,
in-memory platform with tremendous momentum. H2O is an in-memory platform for machine
learning that is reshaping how people apply math and predictive analytics to their business
problems. Integrating these two open-source environments provides a seamless experience for
users who want to make a query using Spark SQL, feed the results into H2O to build a model
and make predictions, and then use the results again in Spark. For any given problem, better
interoperability between tools provides a better experience.
12. Deep Learning Over Spark - Exploration
01 Gathering data
02 Exploration
03 Cleaning
04 Join
05 Over/Undersampling
Initial Step: a dataframe distributed across partitions (Part 1 … Part 12)
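The Over/Undersampling step can be sketched in plain Python: duplicate minority-class rows until both classes are the same size. A toy, framework-free version (in the actual pipeline this would run over Spark dataframes, and the `attrition` column name is an assumption):

```python
import random

def oversample(rows, label_col="attrition", seed=42):
    # Balance a binary dataset by re-sampling the minority class with
    # replacement until both classes have the same number of rows.
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_col], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# 2 positives vs 8 negatives -> 8 vs 8 after oversampling.
rows = [{"attrition": 1}] * 2 + [{"attrition": 0}] * 8
balanced = oversample(rows)
```

Undersampling is the mirror image: drop majority-class rows down to the minority count instead of duplicating.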
13. ETL - Spark ML
01 Spark Transformers. Unary and row transformers.
02 Custom transformers, to apply any function to one or more columns and create a new dataframe from the original.
03 Estimators. “It fits a model to the input DataFrame and ParamMap to produce a Transformer (a Model) that can calculate predictions for any DataFrame-based input datasets.”
04 Spark ML pipelines. Sequences of transformers and estimators; they can be fitted sequentially.
05 Serialization. Pipelines are serialized and stored to be reused in the inference process.
06 Narrow transformations. High performance: no shuffle needed.
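The Transformer / Estimator / Pipeline contract described above can be illustrated without Spark. A minimal plain-Python sketch of the pattern (not Spark ML's real classes; a "dataframe" here is just a list of dicts):

```python
class Transformer:
    def transform(self, df):
        raise NotImplementedError

class ColumnTransformer(Transformer):
    # A custom unary transformer: applies a function to one input column
    # and produces a new dataframe with one extra column.
    def __init__(self, input_col, output_col, fn):
        self.input_col, self.output_col, self.fn = input_col, output_col, fn

    def transform(self, df):
        return [{**row, self.output_col: self.fn(row[self.input_col])}
                for row in df]

class ScalerEstimator:
    # An Estimator: fit() learns state from the data and returns a
    # Transformer (the fitted "model"), mirroring Spark ML's Estimator API.
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, df):
        max_val = max(row[self.input_col] for row in df)
        return ColumnTransformer(self.input_col, self.output_col,
                                 lambda v: v / max_val)

class Pipeline:
    # A sequence of transformers and estimators, fitted sequentially:
    # each stage sees the output of the previous one.
    def __init__(self, stages):
        self.stages = stages

    def fit(self, df):
        fitted = []
        for stage in self.stages:
            model = stage.fit(df) if hasattr(stage, "fit") else stage
            fitted.append(model)
            df = model.transform(df)
        return fitted

df = [{"income": 1000.0}, {"income": 4000.0}]
model = Pipeline([ScalerEstimator("income", "income_scaled")]).fit(df)
out = model[0].transform(df)
```

The fitted stages are what gets serialized after training, so the exact same transformations can be replayed at inference time.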
15. Distributed Training
Parameter Server | All-Reduce
https://eng.uber.com/horovod/
https://www.slideshare.net/JenAman/scaling-machine-learning-to-billions-of-parameters
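The difference between the two schemes can be sketched numerically. A single-process plain-Python toy (real systems such as Horovod perform the reduction over the network; all weights and gradients below are made up):

```python
def all_reduce_average(worker_grads):
    # All-reduce: every worker ends up with the same averaged gradient,
    # with no central node holding the parameters.
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

def parameter_server_step(params, worker_grads, lr=0.1):
    # Parameter server: workers push gradients to a central node, which
    # averages them and applies one update to the shared parameters.
    avg = all_reduce_average(worker_grads)
    return [p - lr * g for p, g in zip(params, avg)]

grads = [[1.0, 2.0], [3.0, 4.0]]          # gradients from two workers
avg = all_reduce_average(grads)           # what each worker sees after all-reduce
new_params = parameter_server_step([0.5, 0.5], grads)
```

Either way the synchronous update is mathematically the same averaged gradient step; the schemes differ in where the reduction happens and how communication scales with the number of workers.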
16. Example - IBM Human Resources Analytics Employee Attrition
17. Overview
Public Dataset extracted from Kaggle
“Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a
breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by
education and attrition’. This is a fictional data set created by IBM data scientists.”
Very small dataset with 33 features.
Classification problem with two classes.
Using Spark ML + BigDL.
18. TRAIN PROCESS
Spark ML Transformers: StringIndexer → OneHot → Custom I → Custom II → VectorAssembler
Spark Training Dataframe → Dataframe to RDD → Train
Save ETL Model + BigDL Model
Save Metrics: Precision + Recall + Confusion Matrix
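The first two stages of the train process can be sketched in plain Python. A toy version of what `pyspark.ml.feature.StringIndexer` (most frequent label gets index 0) and `OneHotEncoder` do; the `roles` data is made up:

```python
def string_indexer(values):
    # Mimics StringIndexer's default ordering: most frequent label
    # first, ties broken alphabetically.
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, size):
    # Mimics OneHotEncoder: a 0/1 indicator vector, materialized densely
    # here (Spark keeps it sparse).
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

roles = ["Sales", "Research", "Sales", "HR"]
mapping = string_indexer(roles)
encoded = [one_hot(mapping[r], len(mapping)) for r in roles]
```

The remaining stages follow the same shape: the custom transformers apply arbitrary column functions, and VectorAssembler concatenates all feature columns into the single vector the model trains on.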
19. INFERENCE PROCESS
“Get model from storage and make predictions IN BATCH MODE”
Spark Test Dataframe → Dataframe to RDD → Load Model → Predict → true / false
20. Metrics
“We use precision and recall for both classes”
https://en.wikipedia.org/wiki/Precision_and_recall
https://educationalresearchtechniques.com/2016/08/22/using-confusion-matrices-to-evaluate-performance/
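Computing precision and recall for both classes from the confusion matrix can be sketched as follows (plain Python; the example labels are made up, but per-class metrics matter here because accuracy alone would hide the minority "attrition = true" class):

```python
def confusion_matrix(y_true, y_pred):
    # 2x2 matrix for a binary problem: m[actual][predicted].
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def precision_recall(m, cls):
    # Precision: of everything predicted as `cls`, how much really was.
    # Recall: of everything that really was `cls`, how much we caught.
    tp = m[cls][cls]
    predicted = m[0][cls] + m[1][cls]
    actual = m[cls][0] + m[cls][1]
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
m = confusion_matrix(y_true, y_pred)
p1, r1 = precision_recall(m, 1)   # metrics for the positive class
p0, r0 = precision_recall(m, 0)   # metrics for the negative class
```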
21. INFERENCE IN STREAMING
Spark Structured Streaming: Socket → TCP Message → Load Model → Transform → Predict → true / false
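A toy version of that loop, with the socket source and the stored model replaced by stand-ins (the comma-separated message format and the model weights are assumptions, not from the slides; in the real setup Spark Structured Streaming reads the socket and the serialized ETL + BigDL model does the scoring):

```python
def parse_message(line):
    # Assume each TCP message is one comma-separated feature row.
    return [float(v) for v in line.strip().split(",")]

def load_model():
    # Stand-in for loading the serialized model from storage:
    # a fixed linear score with a 0.5 decision threshold.
    weights = [0.6, -0.4]
    return lambda x: sum(w * v for w, v in zip(weights, x)) > 0.5

model = load_model()                       # load once, reuse per message
stream = ["1.0,0.2\n", "0.1,0.9\n"]        # stand-in for the socket source
predictions = [model(parse_message(msg)) for msg in stream]
```

The key point the slide makes is that the same transform + predict code serves both batch and streaming; only the source changes.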
22. Recap
01 Knowing how Spark works under the hood gives you much more power.
02 Don't waste resources.
03 Use types. Spark Datasets can help, but they are not enough: Frameless.
04 Generic Pipelines.
23. Emiliano Martínez
Email: emiliano.martinez@bbva.com
Twitter: @EmiCareOfCell44