This talk presents how to tackle the challenge of implementing Deep Learning on top of Spark's compute infrastructure. It covers how to build a project using the Spark ML infrastructure and Intel's BigDL, and how to put it into production.
Find out more at https://madrid2018.codemotionworld.com/speakers/
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
1. Deep Learning in Spark
Emiliano Martínez
Codemotion 2018
2. ABOUT ME
BBVA Innovation Labs
Cybersecurity
Hyperscale
Artificial Intelligence
https://www.bbva.com/en/guest-authors/bbva-labs/
3. THE FIRST: Deep Learning
ML - “the field of study that gives
computers the ability to learn
without being explicitly
programmed”
Arthur Samuel
DL - “for most flavors of the old
generations of learning algorithms …
performance will plateau. … deep
learning … is the first class of
algorithms … that is scalable. …
performance just keeps getting
better as you feed them more data”
Andrew Ng
http://www.cs.ox.ac.uk/activities/machlearn/
4. Intuition
https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
Logistic Regression vs. ANN
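The intuition can be made concrete in a few lines: a logistic regression is a single sigmoid unit, and an ANN is those same units stacked in layers. A plain-Python sketch with made-up weights (illustrative, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression(x, w, b):
    # A single logistic unit: one weighted sum followed by a sigmoid.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def tiny_ann(x, hidden, output):
    # An ANN stacks logistic units: each hidden unit is itself a
    # "logistic regression" over the inputs, and the output unit is a
    # logistic regression over the hidden activations.
    h = [logistic_regression(x, w, b) for (w, b) in hidden]
    w_out, b_out = output
    return logistic_regression(h, w_out, b_out)

x = [1.0, 2.0]
p_lr = logistic_regression(x, [0.5, -0.25], 0.1)
p_ann = tiny_ann(x,
                 hidden=[([0.5, -0.25], 0.1), ([-0.3, 0.8], 0.0)],
                 output=([1.0, -1.0], 0.0))
```

The extra hidden layer is what lets the network learn non-linear decision boundaries that a single logistic unit cannot represent.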
5. THE LAST: Apache Spark I
https://databricks.com/spark/about
6. Apache Spark II
“Apache Spark is a high-performance,
general-purpose distributed computing system that
enables us to process large … very large
quantities of data, beyond what can fit in a
single machine”
“Internet powerhouses such as Netflix,
Yahoo, and eBay have deployed Spark
at massive scale, collectively processing
multiple petabytes of data on clusters
of over 8,000 nodes”
https://amplab.cs.berkeley.edu/projects/spark-lightning-fast-cluster-computing/
https://techvidvan.com/tutorials/spark-cluster-manager-yarn-mesos-and-standalone/
8. Deep Learning Over Spark - Software
https://software.intel.com/en-us/articles/building-large-scale-image-feature-extraction-with-bigdl-at-jdcom
You may want to write your deep learning programs using BigDL if:
- You want to analyze a large amount of data on the same Big Data (Hadoop/Spark) cluster where the data are stored (in, say, HDFS, HBase, Hive, etc.).
- You want to add deep learning functionalities (either training or prediction) to your Big Data (Spark) programs and/or workflow.
- You want to leverage existing Hadoop/Spark clusters to run your deep learning applications, which can then be dynamically shared with other workloads (e.g., ETL, data warehouse, feature engineering, classical machine learning, graph analytics, etc.).
9. Deep Learning Over Spark - Software
DL4J takes advantage of the latest distributed computing frameworks including Apache Spark and
Hadoop to accelerate training. On multi-GPUs, it is equal to Caffe in performance.
Deeplearning4j is written in Java and is compatible with any JVM language, such as Scala, Clojure
or Kotlin. The underlying computations are written in C, C++ and Cuda. Keras will serve as the
Python API.
10. Deep Learning Over Spark - Software
“MMLSpark integrates the distributed computing framework Apache Spark with the flexible deep
learning framework CNTK. Enabling deep learning at unprecedented scales.”
“Spark is well known for its ability to switch between batch and streaming workloads by modifying a
single line. We push this concept even further and enable distributed web services with the same API
as batch and streaming workloads.”
11. Deep Learning Over Spark - Software
Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O
with the capabilities of Spark. Spark is an elegant and powerful general-purpose, open-source,
in-memory platform with tremendous momentum. H2O is an in-memory platform for machine
learning that is reshaping how people apply math and predictive analytics to their business
problems. Integrating these two open-source environments provides a seamless experience for
users who want to make a query using Spark SQL, feed the results into H2O to build a model
and make predictions, and then use the results again in Spark. For any given problem, better
interoperability between tools provides a better experience.
12. Deep Learning Over Spark - Exploration
01 Gathering data
02 Exploration
03 Cleaning
04 Join
05 Over/Undersampling
Initial Step: a dataframe distributed across partitions (Part 1 … Part 12)
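The Over/Undersampling step can be sketched in plain Python: duplicate minority-class rows until both classes are the same size. A toy, framework-free version (in the actual pipeline this would run over Spark dataframes, and the `attrition` column name is an assumption):

```python
import random

def oversample(rows, label_col="attrition", seed=42):
    # Balance a binary dataset by re-sampling the minority class with
    # replacement until both classes have the same number of rows.
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_col], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# 2 positives vs 8 negatives -> 8 vs 8 after oversampling.
rows = [{"attrition": 1}] * 2 + [{"attrition": 0}] * 8
balanced = oversample(rows)
```

Undersampling is the mirror image: drop majority-class rows down to the minority count instead of duplicating.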
13. ETL - Spark ML
01 Spark Transformers. Unary and row transformers.
02 Custom transformers, to apply any function to one or more columns and create a new dataframe from the original.
03 Estimators. “It fits a model to the input DataFrame and ParamMap to produce a Transformer (a Model) that can calculate predictions for any DataFrame-based input datasets.”
04 Spark ML pipelines. Sequences of transformers and estimators; they can be fitted sequentially.
05 Serialization. Pipelines are serialized and stored to be reused in the inference process.
06 Narrow transformations. High performance: no shuffle needed.
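The Transformer / Estimator / Pipeline contract described above can be illustrated without Spark. A minimal plain-Python sketch of the pattern (not Spark ML's real classes; a "dataframe" here is just a list of dicts):

```python
class Transformer:
    def transform(self, df):
        raise NotImplementedError

class ColumnTransformer(Transformer):
    # A custom unary transformer: applies a function to one input column
    # and produces a new dataframe with one extra column.
    def __init__(self, input_col, output_col, fn):
        self.input_col, self.output_col, self.fn = input_col, output_col, fn

    def transform(self, df):
        return [{**row, self.output_col: self.fn(row[self.input_col])}
                for row in df]

class ScalerEstimator:
    # An Estimator: fit() learns state from the data and returns a
    # Transformer (the fitted "model"), mirroring Spark ML's Estimator API.
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, df):
        max_val = max(row[self.input_col] for row in df)
        return ColumnTransformer(self.input_col, self.output_col,
                                 lambda v: v / max_val)

class Pipeline:
    # A sequence of transformers and estimators, fitted sequentially:
    # each stage sees the output of the previous one.
    def __init__(self, stages):
        self.stages = stages

    def fit(self, df):
        fitted = []
        for stage in self.stages:
            model = stage.fit(df) if hasattr(stage, "fit") else stage
            fitted.append(model)
            df = model.transform(df)
        return fitted

df = [{"income": 1000.0}, {"income": 4000.0}]
model = Pipeline([ScalerEstimator("income", "income_scaled")]).fit(df)
out = model[0].transform(df)
```

The fitted stages are what gets serialized after training, so the exact same transformations can be replayed at inference time.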
15. Distributed Training
Parameter Server | All-Reduce
https://eng.uber.com/horovod/
https://www.slideshare.net/JenAman/scaling-machine-learning-to-billions-of-parameters
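The difference between the two schemes can be sketched numerically. A single-process plain-Python toy (real systems such as Horovod perform the reduction over the network; all weights and gradients below are made up):

```python
def all_reduce_average(worker_grads):
    # All-reduce: every worker ends up with the same averaged gradient,
    # with no central node holding the parameters.
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

def parameter_server_step(params, worker_grads, lr=0.1):
    # Parameter server: workers push gradients to a central node, which
    # averages them and applies one update to the shared parameters.
    avg = all_reduce_average(worker_grads)
    return [p - lr * g for p, g in zip(params, avg)]

grads = [[1.0, 2.0], [3.0, 4.0]]          # gradients from two workers
avg = all_reduce_average(grads)           # what each worker sees after all-reduce
new_params = parameter_server_step([0.5, 0.5], grads)
```

Either way the synchronous update is mathematically the same averaged gradient step; the schemes differ in where the reduction happens and how communication scales with the number of workers.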
16. Example - IBM Human Resources Analytics Employee Attrition
17. Overview
Public Dataset extracted from Kaggle
“Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a
breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by
education and attrition’. This is a fictional data set created by IBM data scientists.”
Very small dataset with 33 features.
Classification problem with two classes.
Using Spark ML + BigDL.
18. TRAIN PROCESS
Spark ML Transformers: StringIndexer → OneHot → Custom I → Custom II → VectorAssembler
Spark Training Dataframe → Dataframe to RDD → Train
Save ETL Model + BigDL Model
Save Metrics: Precision + Recall + Confusion Matrix
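The first two stages of the train process can be sketched in plain Python. A toy version of what `pyspark.ml.feature.StringIndexer` (most frequent label gets index 0) and `OneHotEncoder` do; the `roles` data is made up:

```python
def string_indexer(values):
    # Mimics StringIndexer's default ordering: most frequent label
    # first, ties broken alphabetically.
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, size):
    # Mimics OneHotEncoder: a 0/1 indicator vector, materialized densely
    # here (Spark keeps it sparse).
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

roles = ["Sales", "Research", "Sales", "HR"]
mapping = string_indexer(roles)
encoded = [one_hot(mapping[r], len(mapping)) for r in roles]
```

The remaining stages follow the same shape: the custom transformers apply arbitrary column functions, and VectorAssembler concatenates all feature columns into the single vector the model trains on.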
19. INFERENCE PROCESS
“Get model from storage and make predictions IN BATCH MODE”
Spark Test Dataframe → Dataframe to RDD → Load Model → Predict → true / false
20. Metrics
“We use precision and recall for both classes”
https://en.wikipedia.org/wiki/Precision_and_recall
https://educationalresearchtechniques.com/2016/08/22/using-confusion-matrices-to-evaluate-performance/
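Computing precision and recall for both classes from the confusion matrix can be sketched as follows (plain Python; the example labels are made up, but per-class metrics matter here because accuracy alone would hide the minority "attrition = true" class):

```python
def confusion_matrix(y_true, y_pred):
    # 2x2 matrix for a binary problem: m[actual][predicted].
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def precision_recall(m, cls):
    # Precision: of everything predicted as `cls`, how much really was.
    # Recall: of everything that really was `cls`, how much we caught.
    tp = m[cls][cls]
    predicted = m[0][cls] + m[1][cls]
    actual = m[cls][0] + m[cls][1]
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
m = confusion_matrix(y_true, y_pred)
p1, r1 = precision_recall(m, 1)   # metrics for the positive class
p0, r0 = precision_recall(m, 0)   # metrics for the negative class
```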
21. INFERENCE IN STREAMING
Spark Structured Streaming: Socket → TCP Message → Load Model → Transform → Predict → true / false
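A toy version of that loop, with the socket source and the stored model replaced by stand-ins (the comma-separated message format and the model weights are assumptions, not from the slides; in the real setup Spark Structured Streaming reads the socket and the serialized ETL + BigDL model does the scoring):

```python
def parse_message(line):
    # Assume each TCP message is one comma-separated feature row.
    return [float(v) for v in line.strip().split(",")]

def load_model():
    # Stand-in for loading the serialized model from storage:
    # a fixed linear score with a 0.5 decision threshold.
    weights = [0.6, -0.4]
    return lambda x: sum(w * v for w, v in zip(weights, x)) > 0.5

model = load_model()                       # load once, reuse per message
stream = ["1.0,0.2\n", "0.1,0.9\n"]        # stand-in for the socket source
predictions = [model(parse_message(msg)) for msg in stream]
```

The key point the slide makes is that the same transform + predict code serves both batch and streaming; only the source changes.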
22. Recap
01 Knowing how Spark works under the hood gives you much more power.
02 Don't waste resources.
03 Use types. Spark Datasets can help, but they are not enough: Frameless.
04 Generic Pipelines.
23. Emiliano Martínez
Email: emiliano.martinez@bbva.com
Twitter: @EmiCareOfCell44