SPARK AND DEEP LEARNING FRAMEWORKS AT SCALE
Vartika Singh
2 © Cloudera, Inc. All rights reserved.
OBJECTIVE
• Enabling Machine Learning in field
• Enablement and use case discovery
• Data and ML: what do we focus on?
• Typical data ingest architecture
• Extending Spark
• Deep Learning - how does it fit in?
• Hardware
DATA - MARKET PROPOSITION
• Click stream: smart clicks, impressions, and conversions
• Videos: fraud, navigation, ad placement
• Medical data: tumor detection, patient mortality, anomaly identification
• City data: planning, resource distribution
• Wafer, oil and gas data: pipeline optimization, fault detection
?? ...
Ref: https://hbr.org/2017/05/whats-your-data-strategy
• Less than half of an organization’s structured data is actively used in making decisions
• Less than 1% of its unstructured data is analyzed or used at all
• More than 70% of employees have access to data they should not
• 80% of analysts’ time is spent simply discovering and preparing data
• Data breaches are common
• Rogue data sets propagate in silos
• Companies’ data technology often is not up to the demands put on it
Use case discovery
Model serving
Hidden feedback loops
Undeclared consumer dependencies
Change in the external world
Ref: Hidden Technical Debt in Machine Learning ... - NIPS Proceedings
Is evolving Science
We are not very good at anticipating what the next emerging serious flaw will
be.
What we’re missing is an engineering discipline with its principles of analysis
and design.
Keep It Simple Stupid!
https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7
Data
Processes
ML
● Deconstruct the problem.
● Democratize
● Paved Pathways
INTELLIGENT INFRASTRUCTURE!!!
CLOUDERA DATA SCIENCE WORKBENCH
OVERVIEW - PROJECTS
OVERVIEW - GPUS
OVERVIEW - WEBUIS
OVERVIEW - DISTRIBUTED COMPUTING WITH WORKERS
OTHER FEATURES
• Git
• S3/HDFS
EXPERIMENTS
• Create a snapshot of the model code, dependencies, and configuration
necessary to train the model.
• Build and execute the training run in an isolated container.
• Track specified model metrics, performance, and model artifacts.
• Inspect, compare, or deploy prior models.
MODELS
• In model parallelism, different machines in
the distributed system are responsible for
the computations in different parts of a
single network - for example, each layer in
the neural network may be assigned to a
different machine.
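As a rough illustration of model parallelism, the sketch below assigns each layer of a two-layer network to its own worker. The `Worker` class, the layer sizes, and the weights are hypothetical and everything runs in one process; on a real cluster each worker would be a separate machine and the activations would travel over the network.

```python
"""Toy model parallelism: each worker owns one layer of a two-layer network."""


def dense(weights, bias, inputs):
    """Compute one fully connected layer followed by a ReLU activation."""
    outputs = []
    for row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(max(0.0, z))
    return outputs


class Worker:
    """Stands in for one machine; it stores and computes a single layer."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def forward(self, activations):
        return dense(self.weights, self.bias, activations)


# Hypothetical layer parameters, chosen only for illustration.
layer1 = ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0])
layer2 = ([[2.0, 1.0]], [-1.0])

workers = [Worker(*layer1), Worker(*layer2)]


def pipeline_forward(x):
    """Hand the activations from worker to worker, as a network hop would."""
    for worker in workers:
        x = worker.forward(x)
    return x


result = pipeline_forward([3.0, 1.0])
```

No single worker ever holds the whole model, which is the point of this scheme; the cost is that every forward pass has to move activations between workers.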
• In data parallelism, each machine holds a complete copy of the model and
receives a different portion of the data; the results from the machines are
then combined, for example by averaging their gradients or parameters.
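One common way to combine the per-machine results is to average the gradients. The sketch below is a minimal single-process illustration under assumed conditions: a one-variable linear model, two equal-sized shards, and made-up data. The function names are illustrative and do not come from any framework.

```python
"""Toy data parallelism: every worker has the same model but a different shard."""


def gradient(w, b, shard):
    """Mean-squared-error gradient of y = w*x + b over one shard of (x, y) pairs."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def average(grads):
    """Combine per-worker gradients by simple averaging."""
    k = len(grads)
    return (sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k)


# Hypothetical data drawn from y = 2x + 1, split into equal-sized shards.
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
shards = [data[0:4], data[4:8]]

w, b = 0.0, 0.0          # identical model copy on every worker
learning_rate = 0.01

for step in range(5000):
    # Each worker computes a gradient on its own shard (in parallel in practice).
    worker_grads = [gradient(w, b, shard) for shard in shards]
    grad_w, grad_b = average(worker_grads)
    # The combined gradient updates the shared model.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

With equal-sized shards the averaged gradient matches the gradient over the full dataset, so the result is the same as single-machine training; with unequal shards a weighted average would be needed.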
SPARK AND JNI
• OpenCV
• Tesseract
• Common Implementations using JavaCPP
Ref: https://github.com/bytedeco/javacpp
SPARK/HPC WORKLOADS
Gene Sequencing/ Assembling/ Analysis
• Data parallelism and statistical methods lie at the core of all DNA sequencing
workloads.
• Sequencing - Base calling
• Variant calling
• GATK - Can run on Spark
• Canu - Transform to PySpark workload using Python C extensions
• Analysis - Hail
Ref: https://software.broadinstitute.org/gatk/
Ref: https://hail.is/
Ref: https://blog.cloudera.com/blog/2017/05/hail-scalable-genomics-analysis-with-spark/
HPC WORKLOADS
• Portions of the Hadoop ecosystem can open your grid to more users.
• PySpark allows a company that is using a legacy C++ grid to re-use their C++ library assets
with very little to no changes. Python to C++ bindings result in minimal performance penalties.
• Cloudera Data Science Workbench (CDSW) allows data scientists to rapidly develop and
visualize models with more involvement from the business.
• In infrastructures with direct attached storage, Hadoop’s locality based processing allows for
fast efficient movement of data between storage and compute.
• Deploying Hadoop on a portion or on all of your grid allows you to use the same tools on the
grid that you would use on a Cloud Based Hadoop Cluster.
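To make the point about reusing native libraries concrete, here is a minimal sketch that calls a compiled C function from Python through `ctypes`. It assumes a Unix-like system and uses `sqrt` from the C math library as a stand-in for a legacy routine; a C++ library would first need a C-compatible (`extern "C"`) interface. The helper names are hypothetical.

```python
"""Calling a compiled C function from Python with ctypes (Unix-like systems)."""
import ctypes
import ctypes.util


def load_native_sqrt():
    """Load the C math library and declare the signature double sqrt(double)."""
    # The fallback file name is an assumption that holds on typical Linux hosts.
    lib = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
    lib.sqrt.argtypes = [ctypes.c_double]
    lib.sqrt.restype = ctypes.c_double
    return lib.sqrt


def native_sqrt_partition(partition):
    """Apply the native function to every value in one partition (an iterator)."""
    native_sqrt = load_native_sqrt()  # loaded once per partition
    for value in partition:
        yield native_sqrt(value)


# Plain-Python stand-in for a single Spark partition.
roots = list(native_sqrt_partition(iter([4.0, 9.0, 16.0])))
```

In PySpark the same generator could be passed to `rdd.mapPartitions(native_sqrt_partition)`. Loading the library inside the function keeps the library handle out of the serialized closure, at the cost of one load per partition; the shared library must be present on every executor.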
DEEP LEARNING IN BIG DATA
• A major source of difficulty in many real-world artificial intelligence
applications is that many of the factors of variation influence every single
piece of data we can observe.
• Deep learning solves this central problem
via representation learning by introducing
representations that are expressed in terms
of other, simpler representations.
BIOINFORMATICS
• Protein Structure
• Gene Expression Regulation
• Protein Classification
• Anomaly Classification
• Segmentation
BIOINFORMATICS: THE NATURE OF DATA
• Complex and expensive data acquisition processes limit the size of
bioinformatics datasets.
• Significantly unequal class distributions
• In clinical or disease-related cases, there is inevitably less data from treatment groups than
from the normal (control) group.
• Visualization
• Multimodal Deep Learning
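One simple response to the unequal class distributions noted above is to weight each class inversely to its frequency when computing the loss. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count); the label names and counts are invented for illustration.

```python
"""Inverse-frequency class weights for an imbalanced label set."""
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class so every class contributes equally in total."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical dataset: many control samples, few treatment samples.
labels = ["control"] * 90 + ["treatment"] * 10
weights = inverse_frequency_weights(labels)
```

Here each treatment sample is weighted ten times more heavily than each control sample, so both groups carry the same total weight during training.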
IOT
• A time series is a sequence of regular time-ordered observations
• Example: stock prices, weather readings, smartphone sensor data
• Challenges
• Large scale streaming data
• Heterogeneity
• Time and space correlation
• High noise data
• Near-real-time (NRT) decisions on multimodal data
IOT DEVICES
• Network compression
• Convert to sparse network
• Not general enough
• Factors to consider
• Running time
• Energy consumption
• Architectural considerations
• Feed-forward layers (FFL) are much faster than convolutional layers in CNNs
• Activation functions (ReLU is more time-efficient than tanh, which is more time-efficient than sigmoid)
• CNNs use less storage than fully connected DNNs because convolutional layers store fewer parameters
• Accelerators
• Tinymotes
• Fog Computing
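The storage claim above can be checked with a quick parameter count. The sketch compares a convolutional layer with a fully connected layer that maps the same input to an output of the same size; the 32x32 RGB input and the 16 output feature maps are hypothetical figures chosen for illustration.

```python
"""Parameter counts: convolutional layer versus fully connected layer."""


def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights are shared across positions: one kernel per output channel, plus biases."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """Every input connects to every output, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical layer: 32x32 RGB input, 16 output maps of the same spatial size.
conv_count = conv2d_params(3, 3, 3, 16)
dense_count = dense_params(32 * 32 * 3, 32 * 32 * 16)
```

The convolutional layer needs 448 parameters while the equivalent fully connected layer needs about 50 million, which is why storage on small devices favors convolutional architectures.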
NLP
• Word Embeddings: GloVe, Word2Vec
• RNN -> LSTMs -> Attention Mechanism
• Applications
• Sentiment analysis
• Gene sequencing
• Natural language generation
DEEP LEARNING - THE HYPERPARAMETERS
• Architecture
• How many layers
• How many nodes/filters
• Which type
• Data
• Batch size
• Size of filters
• Number of steps the memory cells will learn
• Training
• Regularization
• Learning rate
• Gradient expressions
• Init policy
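The hyperparameters listed above multiply quickly into a large search space. The sketch below enumerates a small grid; the parameter names and candidate values are hypothetical and not taken from any particular framework.

```python
"""Enumerating a hyperparameter grid with the standard library."""
from itertools import product

# Hypothetical search space covering architecture, data, and training choices.
search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128],
    "batch_size": [32, 64],
    "learning_rate": [0.1, 0.01, 0.001],
}


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(grid(search_space))
```

Even this small space yields 24 independent training runs. Because the runs do not depend on each other, they are a natural fit for distribution across a cluster, one configuration per task.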
TRANSFER LEARNING
• Deep neural networks trained on natural images exhibit a curious phenomenon
in common:
• In the first layer they learn features similar to Gabor filters and color blobs.
• Such first-layer features appear not to be specific to a particular dataset or task, but general in
that they are applicable to many datasets and tasks.
• Initializing a network with transferred features from almost any number of layers
can produce a boost to generalization that lingers even after fine-tuning to the
target dataset.
• The effectiveness of feature transfer is expected to decline as the base and
target tasks become less similar.
SPARK DEEP LEARNING PIPELINES
• Transfer learning
• Distributed hyperparameter tuning
• Deploying models in SQL
DISTRIBUTED TRAINING - WHEN TO DO IT
• Distributed training isn’t free
• Setup time
• Continue to train your networks on a single machine, until the training time
becomes prohibitive
OPERATIONAL IMPLICATIONS
• Model exploration using small data
• Computational limits
• Irreducible errors
• Predictable
DNN
• Neurons and synapses
• Compute the weighted sum for each layer
• Compute the gradient of the loss relative to the filter inputs
• Compute the gradient of the loss relative to the weights
M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep Learning for IoT Big Data and Streaming Analytics: A Survey,” arXiv preprint arXiv:1712.04301v1 [cs.NI], 2017.
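The three computations on the DNN slide (weighted sum, gradient with respect to the inputs, gradient with respect to the weights) can be sketched for a single fully connected layer. The weights, inputs, and the squared-sum loss are assumptions made for illustration only.

```python
"""Forward and backward pass for one fully connected layer."""


def forward(weights, bias, x):
    """Weighted sum per neuron: z[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def backward(weights, x, grad_z):
    """Given dL/dz, return dL/dx (passed to the previous layer) and dL/dW."""
    n_out, n_in = len(weights), len(x)
    grad_x = [sum(weights[j][i] * grad_z[j] for j in range(n_out)) for i in range(n_in)]
    grad_w = [[grad_z[j] * x[i] for i in range(n_in)] for j in range(n_out)]
    return grad_x, grad_w


# Hypothetical parameters and input.
weights = [[0.5, -1.0], [2.0, 0.0]]
bias = [0.1, -0.2]
x = [1.0, 3.0]

z = forward(weights, bias, x)
# Assumed loss L = 0.5 * sum(z[j] ** 2), so dL/dz is simply z.
grad_x, grad_w = backward(weights, x, z)
```

Note that `backward` needs the layer input `x` that was seen during the forward pass, which is why training has to keep intermediate values that inference can discard.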
DEEP LEARNING AT SCALE
• First, backpropagation requires the intermediate outputs of the network to be
preserved for the backward computation, so training has increased storage
requirements.
• Second, because gradients are used for hill-climbing, the precision
requirement for training is generally higher than for inference.
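The precision point can be seen with a small experiment: a weight update that is small relative to the weight disappears entirely when rounded to 16-bit floating point. This sketch uses the half-precision format code of Python's `struct` module; the weight and update values are hypothetical.

```python
"""A small gradient update is lost when arithmetic is rounded to FP16."""
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


weight = 1.0
update = 1e-4  # hypothetical learning_rate * gradient

# The sum is rounded back to FP16, as it would be if stored in half precision.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
fp64_result = weight + update
```

Near 1.0, adjacent FP16 values are about 0.001 apart, so the update of 0.0001 rounds away and `fp16_result` stays at 1.0, whereas the double-precision sum does move. Inference only needs the final weights, so it tolerates this coarser spacing far better than training does.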
DEEP LEARNING AT SCALE
• A significant amount of effort has been put into developing deep learning
systems that can scale to very large models and large training sets
• Large models in the literature are now top performers in supervised visual
recognition tasks
• Can even learn to detect objects when trained from unlabeled images alone
• The very largest of these systems are able to train neural networks with over 1
billion trainable parameters
HARDWARE FOR DNN
• Intel Knights Landing CPU features special vector instructions for deep learning
• Nvidia Pascal GP100 GPU features 16-bit floating point (FP16) arithmetic
support to perform two FP16 operations on a single-precision core for faster
deep learning computation
• Systems have also been built specifically for DNN processing such as Nvidia
DGX-1 and Facebook’s Big Basin custom DNN server
• DNN inference has also been demonstrated on various embedded Systems-on-Chip
(SoC) such as Nvidia Tegra and Samsung Exynos, as well as on FPGAs
GPU SUPPORT IN YARN
• As of now, only Nvidia GPUs are supported by YARN
• YARN node managers have to be pre-installed with Nvidia drivers.
• When Docker is used as the container runtime, nvidia-docker 1.0 needs to
be installed (the nvidia-docker version currently supported by YARN).
• https://issues.apache.org/jira/browse/YARN-3926
• https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, Joel Emer, Efficient Processing of Deep Neural Networks: A Tutorial and Survey
ACCELERATORS FOR TEMPORAL ARCHITECTURES
• The downside of using matrix multiplication for the CONV layers is that there is
redundant data in the input feature map matrix, which can lead to either
inefficiency in storage or a complex memory access pattern
• There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL)
and GPUs (e.g., cuBLAS, cuDNN) that optimize for matrix
multiplications
• The matrix multiplications on these platforms can be further sped up by
applying computational transforms to the data to reduce the number of
multiplications
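The redundancy mentioned in the first bullet can be made visible with a small sketch that unrolls image patches into matrix rows (often called im2col) so that the convolution becomes a matrix-vector product. The single-channel 4x4 image and the 2x2 kernel are made up, and the operation is the un-flipped convolution (cross-correlation) that deep learning libraries typically use.

```python
"""Convolution as matrix multiplication: unrolled patches duplicate input data."""


def im2col(image, k):
    """Unroll every k x k patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows


def conv_as_matmul(image, kernel):
    """Multiply the patch matrix by the flattened kernel (no padding, stride 1)."""
    flat_kernel = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(patch, flat_kernel))
            for patch in im2col(image, len(kernel))]


# Hypothetical 4x4 single-channel input and 2x2 kernel.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]

patches = im2col(image, 2)
outputs = conv_as_matmul(image, kernel)
stored_values = sum(len(patch) for patch in patches)
```

The 16 input pixels expand into 36 stored values because overlapping patches repeat the same pixels. That duplication is the storage overhead traded for the ability to use a highly tuned matrix multiplication routine.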
ACCELERATORS FOR SPATIAL ARCHITECTURES
• For DNNs, the bottleneck for processing is in the
memory access
• Accelerators, such as spatial architectures,
provide an opportunity to reduce the energy cost
of data movement by introducing several levels
of local memory hierarchy with different energy
costs
• The multiple levels of memory hierarchy help to
improve energy efficiency by providing low-cost
data accesses
1) How do you collect your data?
2) Where do your data scientists play?
3) Let’s talk to the business
THANK YOU

Spark and Deep Learning Frameworks at Scale 7.19.18
