About
@MLnick on Twitter & GitHub
Principal Engineer, IBM
CODAIT - Center for Open-Source Data
& AI Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
Center for Open Source Data
and AI Technologies
CODAIT
codait.org
DBG / Oct 4, 2018 / © 2018 IBM Corporation
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
Gather
Data
Analyze
Data
Machine
Learning
Deep
Learning
Deploy
Model
Maintain
Model
Python
Data Science
Stack
Fabric for
Deep Learning
(FfDL)
MLeap +
PFA
Scikit-Learn
Pandas
Apache
Spark
Apache
Spark
Jupyter
Model
Asset
eXchange
Keras +
TensorFlow
The Machine Learning
Workflow
Applying Machine Learning: Perception
In reality the workflow spans teams …
… and tools
Spark provides a unified platform
Machine Learning
Pipelines
What is a “model”?
Pipelines in Spark ML
Example – Text Classifier Pipeline
Example – Text Classifier PipelineModel
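
The pipeline slides above can be illustrated with a minimal sketch. This is plain Python, not the Spark ML API: the class names only mirror Spark ML's Transformer / Estimator / Pipeline / PipelineModel concepts, and the tiny naive Bayes classifier and the training sentences are assumptions made for illustration. The point it shows is that fitting a Pipeline (which may contain estimators) yields a PipelineModel made only of transformers.

```python
import math
from collections import Counter


class Tokenizer:
    """Transformer: adds a 'words' field by lower-casing and splitting 'text'."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class NaiveBayes:
    """Estimator (illustrative stand-in): fit() returns a fitted model."""

    def fit(self, rows):
        counts = {}
        for r in rows:
            counts.setdefault(r["label"], Counter()).update(r["words"])
        return NaiveBayesModel(counts)


class NaiveBayesModel:
    """Transformer produced by NaiveBayes.fit(); adds a 'prediction' field."""

    def __init__(self, counts):
        self.counts = counts
        self.vocab = {w for c in counts.values() for w in c}

    def _score(self, label, words):
        c = self.counts[label]
        total = sum(c.values()) + len(self.vocab)  # add-one smoothing
        return sum(math.log((c[w] + 1) / total) for w in words)

    def transform(self, rows):
        return [
            {**r, "prediction": max(self.counts, key=lambda l: self._score(l, r["words"]))}
            for r in rows
        ]


class Pipeline:
    """Chains stages; fit() replaces each estimator with its fitted model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # estimator: learn from the data so far
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # feed transformed rows to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """The fitted pipeline: transformers only, applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [
    {"text": "spark makes big data simple", "label": "spark"},
    {"text": "spark pipelines chain transformers", "label": "spark"},
    {"text": "neural networks need gpus", "label": "dl"},
    {"text": "deep neural networks learn features", "label": "dl"},
]
model = Pipeline([Tokenizer(), NaiveBayes()]).fit(train)
predictions = model.transform([{"text": "spark pipelines"}, {"text": "deep networks"}])
print([p["prediction"] for p in predictions])
```

In this sketch the "model" is the whole fitted pipeline, including the feature-preparation step, rather than the classifier alone.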
Spark ML Components
Source: http://spark.apache.org/docs/latest
Deep Learning
DBG / June 6, 2018 / © 2018 IBM Corporation
Deep Learning Overview
• Original theory from 1940s; computer models
originated around 1960s; fell out of favor in
1980s/90s
• Recent resurgence due to:
  • Bigger (and better) data; standard datasets (e.g. ImageNet)
  • Better hardware (GPUs)
  • Improvements to algorithms, architectures and optimization
• Leading to new state-of-the-art results in
computer vision (images and video);
speech/text; language translation and more
Source: Wikipedia
Modern Neural Networks
• Deep (multi-layer) networks
• Computer vision
  • Convolutional neural networks (CNNs)
  • Image classification, object detection, segmentation
• Sequences and time-series
  • Recurrent neural networks (RNNs)
  • Machine translation, text generation
• Embeddings
  • Text, categorical features
• Deep learning frameworks
  • Flexibility, computation graphs, auto-differentiation, GPUs
Source: Stanford CS231n
Deep Learning Frameworks
* Logos trademarks of their respective projects
Computation Graphs
Source: Google AI Blog
*MnasNet Network
*Inception V3
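
To make the computation graph and auto-differentiation ideas concrete, here is a minimal sketch in plain Python. It is not how any particular framework implements them; the `Node` class and `backward` function are hypothetical names chosen for illustration. Each operation records its inputs together with the local derivative, and a reverse pass over the graph accumulates gradients with the chain rule.

```python
class Node:
    """A value in the computation graph, with links to the nodes it was built from."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        # d(a + b)/da = 1, d(a + b)/db = 1
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        # d(a * b)/da = b, d(a * b)/db = a
        return Node(self.value * other.value, ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse-mode differentiation: fills in .grad for every node feeding output."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before the nodes that use them

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # process each node before its parents
        for parent, local_grad in node.parents:
            parent.grad += local_grad * node.grad


x, w, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b      # forward pass builds the graph
loss = y * y
backward(loss)     # backward pass computes d(loss)/d(input) for every input
print(loss.value, w.grad, x.grad, b.grad)
```

Frameworks apply the same idea to tensor operations, and because the graph is explicit they can also place its operations on GPUs.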
DL Frameworks on Spark
Major Frameworks
• Deeplearning4J
• BigDL
• Deep Learning Pipelines
• TensorFlowOnSpark
• Microsoft Machine Learning on Spark
(MMLSpark)
Deeplearning4J
• Distributed GPU support for all major deep
learning architectures
• CPU / Distributed CPU / Single GPU options exist
• Supports Convolutional Nets, LSTMs / RNNs,
Feedforward Nets, Word2Vec, custom layers
• Supported by startup Skymind.io
• Backed by its own linear algebra library –
ND4J
• APIs in Scala, Java, Python
• Newer Scala API, Keras-like
• Keras import / export for Python API
• Production serving is through proprietary
layer
• DataVec for ETL
BigDL
• Distributed CPU with Intel MKL
• No GPU support
• Most DL models – CNN, RNN
• Backed by Intel
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Uses private internal Spark components for
distributed training
• Load Keras, Caffe, Torch models
• New Keras-style API
Deep Learning Pipelines
• Created by Databricks
• Focus on scoring models (TensorFlow / Keras) and
basic transfer learning
• No support for training the DL model
• Focus on image data & use cases
• Natively integrated with Spark
• Scala, Python API
• Support for Spark ML pipelines
• Support for scoring models as a SQL UDF
• Largely dormant currently
TensorFlowOnSpark
• Created by Yahoo
• Scale out TF on Spark clusters
• Use Spark executors to launch TF processes
• Supports distributed training through TF parameter
servers
• RDMA / InfiniBand improvement to TF to speed up
distributed training
• Good support for TensorBoard
• Good integration with Spark
• But only Python API
• Some support for Spark ML pipelines
• Relatively inactive recently
MMLSpark
• Created by Microsoft
• Supports training with CNTK, including distributed training
• Image, text data
• Good integration with Spark
• Scala, Python, R API
• Support for Spark ML pipelines
• Varied deployment options
• Relatively active, seems quite well supported
Other Frameworks
• H2O.ai / Deep Water
• Apache MXNet Spark integration
• TensorFrames
• CaffeOnSpark
• scalable-deep-learning on GitHub
• MLlib – MLPClassifier only
• SparkNet (abandoned)
Integration Challenges
• Moving data from Spark to DL framework (and
back)
• Serialization overhead – especially Python
• Managing DL computation graphs from Spark
executors means fault tolerance is difficult to
achieve
• GPU awareness
• Optimize and standardize data exchange -
SPARK-24579
• Apache Arrow
• Barrier Execution Mode - SPARK-24374
• Accelerator-aware scheduling - SPARK-24615
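
The serialization-overhead point above can be illustrated with a small standard-library sketch. It does not use Spark or Apache Arrow; `pickle` stands in for row-at-a-time object serialization and `array` stands in for a columnar batch format, and the 1,000-row dataset is made up for the example. It only compares payload sizes, which is one part of the overhead.

```python
import pickle
from array import array

# Hypothetical dataset: 1,000 rows with two float columns.
rows = [(float(i), float(i) * 0.5) for i in range(1000)]

# Row-at-a-time exchange: every row is serialized as its own Python object.
row_bytes = sum(len(pickle.dumps(r)) for r in rows)

# Columnar batch exchange: each column is one contiguous buffer of raw doubles.
col_a = array("d", (r[0] for r in rows))
col_b = array("d", (r[1] for r in rows))
batch = col_a.tobytes() + col_b.tobytes()
batch_bytes = len(batch)

# The receiver can rebuild both columns from the raw buffer without per-row parsing.
restored = array("d")
restored.frombytes(batch)
restored_a = list(restored[: len(rows)])
restored_b = list(restored[len(rows):])

print(f"row-at-a-time: {row_bytes} bytes, columnar batch: {batch_bytes} bytes")
```

Standardizing on a columnar in-memory format is the direction the data-exchange item above points to, with Apache Arrow listed as the candidate format.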
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://ibm.biz/BdYhXz
https://datascience.ibm.com/
MAX
Brought to you by community.ibm.com/icpfordata Catch the replay at ibmaicommunity.bemyapp.com

AI and Spark - IBM Community AI Day
