Build Deep Learning Pipelines on Apache Spark for Ads Optimization
Big Data Consultant & Senior Data Scientist
Craig Chao
chaocraig@gmail.com
Slideshare: Craig Chao
Agenda
!  Prolog
!  Data Become a Weapon of New Colonialism
!  Why Not TensorFlow but Deep Learning on Apache Spark?
!  Data Engineer * Data Science
!  ML Pipelines on Apache Spark
!  ML & DL for Ads Optimization
!  Deep Learning on Apache Spark
!  Conclusion
Prolog
!  Data Become a Weapon of New Colonialism
!  Why Not TensorFlow but Deep Learning on Apache Spark?
!  Data Engineer * Data Science
Data Become a Weapon of New Colonialism
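SF Express and Cainiao cut off each other's data interfaces
Who owns the user data of Tencent apps running on Huawei phones?
iFlytek, hailed by MIT in the US as "China's smartest company"
Face recognition as a "cheating gadget"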
A Judge Just Ordered LinkedIn to Allow Scraping (08/2017)
Data Become a Weapon of New Colonialism
Src: https://twitter.com/jason_kint/ 	
Src: https://www.iab.com/insights/iab-internet-advertising-revenue-report-conducted-by-pricewaterhousecoopers-pwc-2/
Why Not TensorFlow but Deep Learning on Apache Spark?
Data Developer/Engineer vs. Data Scientist
Src: https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/ 	 https://www.oreilly.com/ideas/2016-data-science-salary-survey-results 	
5 ~ 10 : 1
ML Pipelines on Apache Spark
Src: https://dzone.com/articles/distingish-pop-music-from-heavy-metal-using-apache6
ML Pipelines on Apache Spark
!  DataFrame: an ML dataset that holds a variety of data types
!  Transformer: an algorithm that transforms one DataFrame into another DataFrame
!  Estimator: an algorithm that is fit on a DataFrame to produce a Transformer
!  Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow
!  Parameter: Parameters belong to specific instances of Estimators and Transformers. Any parameters in the ParamMap override parameters previously specified via setter methods.
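The contract between these abstractions can be sketched without a cluster. The following is a toy, pure-Python illustration and not Spark's API: the class names `Tokenizer`, `VocabEstimator`, `VocabModel` and `PipelineModel`, the `min_count` parameter, and the use of lists of dicts in place of a DataFrame are all assumptions made for this example. It shows an Estimator producing a Transformer when fit, a Pipeline chaining stages, and fit-time parameters overriding a value set earlier through a setter.

```python
class Transformer:
    """Maps one dataset (here a list of dict rows) to another."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""
    def fit(self, rows, params=None):
        raise NotImplementedError


class Tokenizer(Transformer):
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        # Add a new column instead of mutating the input rows.
        return [dict(r, **{self.output_col: r[self.input_col].lower().split()})
                for r in rows]


class VocabModel(Transformer):
    """Fitted model: token lists become count vectors over a learned vocabulary."""
    def __init__(self, vocab, input_col, output_col):
        self.vocab, self.input_col, self.output_col = vocab, input_col, output_col

    def transform(self, rows):
        return [dict(r, **{self.output_col: [r[self.input_col].count(w) for w in self.vocab]})
                for r in rows]


class VocabEstimator(Estimator):
    def __init__(self, input_col, output_col, min_count=1):
        self.input_col, self.output_col, self.min_count = input_col, output_col, min_count

    def set_min_count(self, value):
        self.min_count = value
        return self

    def fit(self, rows, params=None):
        # Fit-time parameters win over values set earlier via the setter.
        min_count = (params or {}).get("min_count", self.min_count)
        counts = {}
        for r in rows:
            for token in r[self.input_col]:
                counts[token] = counts.get(token, 0) + 1
        vocab = sorted(w for w, c in counts.items() if c >= min_count)
        return VocabModel(vocab, self.input_col, self.output_col)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows, params=None):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                # Per-stage overrides are looked up by the stage instance.
                stage = stage.fit(rows, (params or {}).get(stage))
            fitted.append(stage)
            rows = stage.transform(rows)
        return PipelineModel(fitted)


train = [{"text": "metal metal riff"}, {"text": "pop hook metal"}]
vocab = VocabEstimator("tokens", "features").set_min_count(1)
pipeline = Pipeline([Tokenizer("text", "tokens"), vocab])

model = pipeline.fit(train)                                 # uses the setter value
overridden = pipeline.fit(train, {vocab: {"min_count": 2}}) # fit-time override
```

Fitting the same pipeline twice leaves the estimator's own `min_count` untouched, which is the point of passing overrides at fit time rather than mutating the stage.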
ML Pipelines on Apache Spark
Raw unknown lyrics → After Cleanser → After StopWordsRemover → After Stemmer → After Word2Vec → After LogisticRegression → Pop or Heavy Metal?
ML Pipelines on Apache Spark
!  Advantages
!  Model selection (a.k.a. hyperparameter tuning) via cross-validation & train-validation split
!  Pipeline/Model save/reload
https://github.com/tmatyashovsky/spark-ml-samples
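Model selection via cross-validation can be illustrated with a small, dependency-free sketch. This is not Spark code (in Spark ML the corresponding roles are played by `CrossValidator`, `TrainValidationSplit` and `ParamGridBuilder`); the one-dimensional ridge model, the function names and the toy data below are assumptions chosen to keep the example short.

```python
def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, lam, num_folds=3):
    """Average held-out error over num_folds interleaved folds."""
    scores = []
    for fold in range(num_folds):
        held_out = list(range(fold, len(xs), num_folds))
        train = [i for i in range(len(xs)) if i not in held_out]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / len(scores)


def select_model(xs, ys, grid, num_folds=3):
    """Score every candidate, keep the best, then refit it on all the data."""
    cv_scores = {lam: cross_validate(xs, ys, lam, num_folds) for lam in grid}
    best = min(cv_scores, key=cv_scores.get)
    return best, fit_ridge(xs, ys, best), cv_scores


# Noise-free toy data (y = 2x), so the unregularized candidate should win.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_lam, best_w, cv_scores = select_model(xs, ys, grid=[0.0, 1.0, 10.0])
```

A train-validation split is the same idea with a single held-out fold: cheaper, but with a noisier estimate of each candidate's quality.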
ML & DL for Ads Optimization
ML & DL for Ads Optimization
User	Rose	Navy	Olive
Alice	0	+4	0
Bob	0	0	+2
Carol	-1	0	-2
Dave	+3	0	0
(Alice) (Blue) (Navy) (Periwinkle)
ML & DL for Ads Optimization
•  Optimizing X and Y simultaneously is non-convex and hard
•  If either X or Y is fixed, the problem becomes a system of linear equations: convex and easy
•  Initialize Y with random values
•  Solve for X
•  Fix X, solve for Y
•  Repeat (“Alternating”)
Factor matrices: X, Yᵀ
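The alternating steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speaker's implementation or Spark's ALS: every cell of the ratings matrix is treated as observed (including the zeros), an L2 penalty `lam` is added so each linear system is well conditioned, and the helper names `solve`, `solve_factors`, `objective` and `als` are hypothetical. With Y fixed, each row x of X solves (YᵀY + λI) x = Yᵀa, where a is the corresponding row of the ratings matrix; the update for Y is symmetric.

```python
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def solve_factors(ratings, fixed, lam):
    """With `fixed` held constant, solve one regularized least-squares problem per row."""
    k = len(fixed[0])
    gram = [[sum(f[p] * f[q] for f in fixed) + (lam if p == q else 0.0)
             for q in range(k)] for p in range(k)]
    factors = []
    for row in ratings:
        rhs = [sum(f[p] * r for f, r in zip(fixed, row)) for p in range(k)]
        factors.append(solve(gram, rhs))
    return factors


def objective(ratings, X, Y, lam):
    """Squared reconstruction error plus the L2 penalty on both factor matrices."""
    err = sum((ratings[u][i] - sum(x * y for x, y in zip(X[u], Y[i]))) ** 2
              for u in range(len(ratings)) for i in range(len(ratings[0])))
    reg = sum(v * v for row in X for v in row) + sum(v * v for row in Y for v in row)
    return err + lam * reg


def als(ratings, k=2, lam=0.1, iters=10, seed=0):
    rng = random.Random(seed)
    # Initialize Y with random values.
    Y = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in ratings[0]]
    transposed = [list(col) for col in zip(*ratings)]
    history = []
    for _ in range(iters):
        X = solve_factors(ratings, Y, lam)      # fix Y, solve for X
        Y = solve_factors(transposed, X, lam)   # fix X, solve for Y
        history.append(objective(ratings, X, Y, lam))
    return X, Y, history


# The user-by-color matrix from the earlier slide (rows: Alice, Bob, Carol, Dave).
ratings = [[0, 4, 0], [0, 0, 2], [-1, 0, -2], [3, 0, 0]]
X, Y, history = als(ratings)
```

Because each half-step is an exact minimization with the other factor held fixed, the objective recorded in `history` cannot increase from one iteration to the next, which is what makes the alternating scheme attractive despite the joint problem being non-convex.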
ML & DL for Ads Optimization
Singular Value Decomposition (SVD): $A = S \, \Sigma \, T'$, where m and n are the dimensions of A and k is the number of retained factors
Context-aware Matrix Factorization
ML & DL for Ads Optimization
DeepWalk (2014)	A Multi-View Deep Learning (2015)
ML & DL for Ads Optimization
Wide & Deep Learning Models (Google, 2016)	
Deep Candidate Generation Model (YouTube, 2016)	 Session-based Recommendation With RNN (2016)
Deep Learning on Apache Spark
Framework	Vendor
Spark	Databricks
MMLSpark	Microsoft
DL4J	DeepLearning4J
SystemML	Apache
BigDL	Intel

Framework	Reference
TensorFlowOnSpark	https://github.com/yahoo/TensorFlowOnSpark
DeepDist	http://deepdist.com/
OpenDL	https://github.com/guoding83128/OpenDL
CaffeOnSpark	https://github.com/yahoo/CaffeOnSpark
TensorFrames	https://github.com/databricks/tensorframes
Dist-keras	https://github.com/cerndb/dist-keras
Source: Craig Chao, DataConf 2017, Taipei
Deep Learning on Apache Spark
Apache SystemML
!  Apache Top-Level Project
!  Declarative large-scale machine learning
!  OS: Linux, macOS, Windows
!  Written in: Java
!  Open-sourced by IBM in 2015
A machine learning platform optimal for big data
Deep Learning on Apache Spark
Apache SystemML
https://github.com/dusenberrymw/systemml-nn/blob/master/nn/examples/mnist_lenet.dml 	
Built-in NN modules
Deep Learning on Apache Spark:
MS MMLSpark
!  Seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV
!  CNTK Model Gallery
!  https://www.microsoft.com/en-us/cognitive-toolkit/features/model-gallery/
!  Including GAN, Reinforcement Learning, ResNet152…
Deep Learning on Apache Spark:
MS MMLSpark
It implicitly converts the data into the format expected by the algorithm: it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on.
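What such implicit featurization involves can be sketched in a few lines of plain Python. This is a toy illustration, not MMLSpark's API or its actual rules: the column-type heuristic (strings without spaces are treated as categorical, other strings as free text), the bucket count, and the function names `hash_tokens`, `infer_kind` and `featurize` are all assumptions made for the example.

```python
import zlib


def hash_tokens(text, num_buckets):
    """Tokenize on whitespace and count tokens per hash bucket (the hashing trick)."""
    vec = [0.0] * num_buckets
    for token in text.lower().split():
        # crc32 is used because Python's built-in hash() is salted per process.
        vec[zlib.crc32(token.encode("utf-8")) % num_buckets] += 1.0
    return vec


def infer_kind(values):
    """Assumed heuristic: numbers are numeric, single-word strings are categorical."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    if all(isinstance(v, str) and " " not in v for v in values):
        return "categorical"
    return "text"


def featurize(rows, num_buckets=8):
    """Pick a treatment per column, then assemble one flat feature vector per row."""
    columns = list(rows[0])
    kinds = {c: infer_kind([r[c] for r in rows]) for c in columns}
    levels = {c: sorted({r[c] for r in rows})
              for c in columns if kinds[c] == "categorical"}
    vectors = []
    for r in rows:
        vec = []
        for c in columns:
            if kinds[c] == "numeric":
                vec.append(float(r[c]))
            elif kinds[c] == "categorical":
                vec.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
            else:
                vec.extend(hash_tokens(r[c], num_buckets))
        vectors.append(vec)
    return kinds, vectors


rows = [
    {"age": 31, "device": "mobile", "query": "cheap flights to Taipei"},
    {"age": 45, "device": "desktop", "query": "deep learning on spark"},
]
kinds, vectors = featurize(rows)
```

Each row ends up as one vector: the numeric value, the one-hot block for the categorical column, then the hashed token counts for the text column.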
Deep Learning on Apache Spark:
MS MMLSpark
ML Pipeline to evaluate CNTK model.	
Windows Azure Storage Blob
Deep Learning on Apache Spark:
Databricks
!  Founded by the creators of Apache Spark; CEO Ali Ghodsi is an adjunct professor at UC Berkeley
!  Total funding: $100M+
!  Import models from TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda
Deep Learning on Apache Spark:
Databricks
Build a NN model from scratch	
Easy on a driver-only cluster, complicated on distributed nodes.
Deep Learning on Apache Spark:
DL4J
!  DeepLearning4J is a Java-based toolkit for building, training and deploying neural networks
!  An open-source, distributed deep-learning project in Java and Scala spearheaded by the people at Skymind
!  ND4J is the Java scientific computing engine powering its matrix manipulations; ND4S is its Scala wrapper
!  Includes RL and model import from Keras (Theano, TensorFlow, Caffe and CNTK)
Machine learning models are served in production with Skymind's model server.
Secure, Scalable, Stable, Debuggable, Certified
Deep Learning on Apache Spark:
DL4J
Src: Anatolii(2017)
Deep Learning on Apache Spark
BigDL
!  A distributed deep learning library for Apache Spark released by Intel®
!  Can load pre-trained Caffe or Torch models
!  Uses Intel® Math Kernel Library (Intel MKL) and multi-threaded programming in each Spark task
Deep Learning on Apache Spark
BigDL
Build a NN model from scratch
Deep Learning on Apache Spark
Framework	BigDL	DL4J	Databricks	MMLSpark	SystemML
Vendor	Intel	DeepLearning4J	Databricks	Microsoft	Apache
Pre-trained models	Caffe/Torch/TensorFlow	Keras, TensorFlow, Caffe and Theano	TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda	CNTK Gallery/Keras	DML/Caffe2DML
Train a NN from scratch	Y	Y	Y	N	Y / DML
Notebook	Python/Scala	Scala / Reactive	Python/Scala/R/SQL	Python/Scala	Python/Scala
Free	Y	N / if model server	N	Y	Y
Usability	High	High	High	Middle	Low
Docker	Y	Y / Spark Notebook	N	Y	Y
Cloud	Y / (AWS, Azure, Cloudera…)	N	Y / AWS	Azure	N
Source: Craig Chao, DataConf 2017
Conclusions
!  Data Wars
!  Unified Data Platform
!  Data engineers/developers are key roles
!  Reusable/portable ML Pipelines
!  DL has deep layers of hidden factors
!  DL models for Ads/RecSys
!  Code-level introduction to DL solutions on Apache Spark
chaocraig@gmail.com	
Slideshare: Craig Chao
