
Build a Deep Learning Pipeline on Apache Spark for Ads Optimization


Published in: Internet


  1. 1. Build Deep Learning Pipelines on Apache Spark for Ads Optimization Big Data Consultant & Senior Data Scientist Craig Chao chaocraig@gmail.com Slideshare: Craig Chao
  2. 2. Agenda
      • Prologue
          • Data Become a Weapon of New Colonialism
          • Why Not TensorFlow but Deep Learning on Apache Spark?
          • Data Engineer * Data Science
      • ML Pipelines on Apache Spark
      • ML & DL for Ads Optimization
      • Deep Learning on Apache Spark
      • Conclusion
  3. 3. Prologue
      • Data Become a Weapon of New Colonialism
      • Why Not TensorFlow but Deep Learning on Apache Spark?
      • Data Engineer * Data Science
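  4. 4. Data Become a Weapon of New Colonialism
      • SF Express and Cainiao cut off each other's data interfaces
      • Who owns the user data of Tencent apps running on Huawei phones?
      • iFlytek, hailed by MIT in the US as "China's smartest company"
      • The face-recognition "food-theft gadget"
      • A Judge Just Ordered LinkedIn to Allow Scraping (08/2017)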
  5. 5. Data Become a Weapon of New Colonialism Src: https://twitter.com/jason_kint/ Src: https://www.iab.com/insights/iab-internet-advertising-revenue-report-conducted-by-pricewaterhousecoopers-pwc-2/
  6. 6. Data Become a Weapon of New Colonialism
  7. 7. Data Become a Weapon of New Colonialism
  8. 8. Why Not TensorFlow but Deep Learning on Apache Spark?
  9. 9. Data Developer/Engineer vs. Data Scientist
  10. 10. Data Developer/Engineer vs. Data Scientist Src: https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/ https://www.oreilly.com/ideas/2016-data-science-salary-survey-results 5 ~ 10 : 1
  11. 11. ML Pipelines on Apache Spark Src: https://dzone.com/articles/distingish-pop-music-from-heavy-metal-using-apache6
  12. 12. ML Pipelines on Apache Spark
      • DataFrame: an ML dataset that can hold a variety of data types
      • Transformer: an algorithm that transforms one DataFrame into another DataFrame
      • Estimator: an algorithm that is fit on a DataFrame to produce a Transformer
      • Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow
      • Parameter: parameters belong to specific instances of Estimators and Transformers. Any parameters in the ParamMap override parameters previously specified via setter methods.
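The Transformer / Estimator / Pipeline contract described on this slide can be sketched in a few lines of plain Python. This is a toy imitation, not Spark's API: every class name below is hypothetical, a "DataFrame" is just a list of dicts, and the parameter override is a plain dict keyed by name rather than a real ParamMap.

```python
# Toy, pure-Python imitation of the Transformer / Estimator / Pipeline contract.
# NOT Spark's API: all names are hypothetical and a "DataFrame" is a list of dicts.

class Lowercase:
    """Transformer: maps one dataset to another; nothing to fit."""
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]

class VocabularyIndexerModel:
    """Transformer produced by fitting a VocabularyIndexer."""
    def __init__(self, index):
        self.index = index
    def transform(self, rows):
        return [
            {**r, "ids": [self.index[w] for w in r["text"].split() if w in self.index]}
            for r in rows
        ]

class VocabularyIndexer:
    """Estimator: fit() learns from the data and returns a Transformer."""
    def __init__(self, min_count=1):
        self.min_count = min_count  # value set on the instance ("setter")
    def fit(self, rows, params=None):
        # Values passed at fit time override the value set on the instance.
        min_count = (params or {}).get("min_count", self.min_count)
        counts = {}
        for r in rows:
            for w in r["text"].split():
                counts[w] = counts.get(w, 0) + 1
        kept = sorted(w for w, c in counts.items() if c >= min_count)
        return VocabularyIndexerModel({w: i for i, w in enumerate(kept)})

class PipelineModel:
    """Fitted pipeline: a chain of Transformers only."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class Pipeline:
    """Chains Transformers and Estimators; fit() turns every Estimator into a Transformer."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows, params=None):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows, params)
            fitted.append(stage)
            rows = stage.transform(rows)  # later stages see transformed data
        return PipelineModel(fitted)

train = [{"text": "Love love me"}, {"text": "Metal love"}]
pipeline = Pipeline([Lowercase(), VocabularyIndexer(min_count=1)])

model = pipeline.fit(train)
ids = model.transform([{"text": "LOVE Metal"}])[0]["ids"]

# Same pipeline, but the fit-time parameter overrides min_count=1.
strict_model = pipeline.fit(train, params={"min_count": 2})
strict_ids = strict_model.transform([{"text": "LOVE Metal"}])[0]["ids"]
```

The point of the design is that fitting a Pipeline yields an object with the same `transform` interface as any single stage, so the whole workflow can be reused on new data as one unit.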
  13. 13. ML Pipelines on Apache Spark Src: https://dzone.com/articles/distingish-pop-music-from-heavy-metal-using-apache6
  14. 14. ML Pipelines on Apache Spark: Raw unknown lyrics → After Cleanser → After StopWordsRemover → After Stemmer → After Word2Vec → After LogisticRegression → Pop or Heavy Metal?
  15. 15. ML Pipelines on Apache Spark
  16. 16. ML Pipelines on Apache Spark
  17. 17. ML Pipelines on Apache Spark
      • Advantages
          • Model selection (a.k.a. hyperparameter tuning) via cross-validation and train-validation split
          • Pipeline/Model save and reload
      https://github.com/tmatyashovsky/spark-ml-samples
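To make "model selection via cross-validation" concrete, here is a minimal pure-Python sketch of k-fold cross-validation over a small hyperparameter grid. It is not Spark's implementation; the one-feature ridge model, the data, and the grid values are all assumptions chosen only to keep the example short.

```python
# Minimal k-fold cross-validation sketch (illustrative only, not Spark's API).
# Model: one-feature ridge regression without intercept, w = sum(x*y) / (sum(x*x) + lam).

def k_fold_indices(n, k):
    """Return k (train, validation) index pairs; every index validates exactly once."""
    pairs = []
    for fold in range(k):
        validation = list(range(fold, n, k))
        held_out = set(validation)
        train = [i for i in range(n) if i not in held_out]
        pairs.append((train, validation))
    return pairs

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for a single weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, lam, k=3):
    """Average validation error of the model trained with hyperparameter lam."""
    errors = []
    for train, validation in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(mse(w, [xs[i] for i in validation], [ys[i] for i in validation]))
    return sum(errors) / len(errors)

# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

grid = [0.0, 1.0, 100.0]  # candidate regularisation strengths (assumed values)
cv_error = {lam: cross_validate(xs, ys, lam) for lam in grid}
best_lam = min(cv_error, key=cv_error.get)
```

A train-validation split is the same idea with a single (train, validation) pair instead of k of them, which is cheaper but gives a noisier estimate.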
  18. 18. ML Pipelines on Apache Spark https://github.com/tmatyashovsky/spark-ml-samples
  19. 19. ML & DL for Ads Optimization
  20. 20. ML & DL for Ads Optimization
      |       | Rose | Navy | Olive |
      | Alice |  0   |  +4  |   0   |
      | Bob   |  0   |  0   |  +2   |
      | Carol |  -1  |  0   |  -2   |
      | Dave  |  +3  |  0   |   0   |
      (Alice) (Blue) (Navy) (Periwinkle)
  21. 21. ML & DL for Ads Optimization (factorizing the rating matrix as X Yᵀ)
      • Optimizing X and Y simultaneously is non-convex and hard
      • If either X or Y is fixed, the problem becomes a system of linear equations: convex and easy
      • Initialize Y with random values
      • Fix Y, solve for X
      • Fix X, solve for Y
      • Repeat ("Alternating")
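The alternating procedure on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the talk or in Spark MLlib: it assumes rank-1 factors so that each "solve" is a scalar closed form (a rank-k version would solve a k×k linear system per user and per item), treats the zeros in the small user-by-colour table as "not rated", and uses an arbitrary regularisation value.

```python
import random

# Observed ratings taken from the small user x colour table on the previous slide.
# Assumption for this sketch: a 0 in that table means "not rated" and is skipped.
ratings = {
    ("Alice", "Navy"): 4.0,
    ("Bob", "Olive"): 2.0,
    ("Carol", "Rose"): -1.0,
    ("Carol", "Olive"): -2.0,
    ("Dave", "Rose"): 3.0,
}
users = sorted({u for u, _ in ratings})
items = sorted({i for _, i in ratings})
lam = 0.1  # L2 regularisation strength (hypothetical value)

def objective(x, y):
    """Squared error on observed ratings plus L2 penalty on both factor sets."""
    fit = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    penalty = lam * (sum(v * v for v in x.values()) + sum(v * v for v in y.values()))
    return fit + penalty

rng = random.Random(0)
y = {i: rng.uniform(-1.0, 1.0) for i in items}  # initialise Y with random values
x = {u: 0.0 for u in users}

history = []
for sweep in range(20):
    # Fix Y, solve for X: a regularised least-squares problem per user.
    for u in users:
        num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
        den = lam + sum(y[i] ** 2 for (uu, i) in ratings if uu == u)
        x[u] = num / den
    # Fix X, solve for Y: the same problem with the roles swapped.
    for i in items:
        num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
        den = lam + sum(x[u] ** 2 for (u, ii) in ratings if ii == i)
        y[i] = num / den
    history.append(objective(x, y))  # value after each full sweep

predicted_alice_navy = x["Alice"] * y["Navy"]
```

Because each half-step minimises the regularised objective exactly over one block of variables, the recorded objective never increases from one sweep to the next, which is what makes the alternating scheme attractive despite the joint problem being non-convex.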
  22. 22. ML & DL for Ads Optimization: Singular Value Decomposition (SVD), $A_{m \times n} = S_{m \times k}\,\Sigma_{k \times k}\,T'_{k \times n}$; Context-aware Matrix Factorization
  23. 23. ML & DL for Ads Optimization
  24. 24. ML & DL for Ads Optimization: DeepWalk (2014); A Multi-View Deep Learning (2015)
  25. 25. ML & DL for Ads Optimization: Wide & Deep Learning Models (Google, 2016); Deep Candidate Generation Model (YouTube, 2016); Session-based Recommendation with RNN (2016)
  26. 26. Deep Learning on Apache Spark
      |        | Spark      | MMLSpark  | DL4J           | SystemML | BigDL |
      | Vendor | Databricks | Microsoft | DeepLearning4J | Apache   | Intel |
      Other projects and references:
      | TensorFlowOnSpark | https://github.com/yahoo/TensorFlowOnSpark |
      | DeepDist          | http://deepdist.com/ |
      | OpenDL            | https://github.com/guoding83128/OpenDL |
      | CaffeOnSpark      | https://github.com/yahoo/CaffeOnSpark |
      | TensorFrames      | https://github.com/databricks/tensorframes |
      | Dist-keras        | https://github.com/cerndb/dist-keras |
      Source: Craig Chao, DataConf 2017, Taipei
  27. 27. Deep Learning on Apache Spark: Apache SystemML
      • Apache top-level project
      • Declarative large-scale machine learning
      • OS: Linux, macOS, Windows
      • Written in: Java
      • Open-sourced by IBM in 2015
      A machine learning platform optimal for big data
  28. 28. Deep Learning on Apache Spark: Apache SystemML. Built-in NN modules: https://github.com/dusenberrymw/systemml-nn/blob/master/nn/examples/mnist_lenet.dml
  29. 29. Deep Learning on Apache Spark: Apache SystemML
  30. 30. Deep Learning on Apache Spark: MS MMLSpark
      • Seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV
      • CNTK Model Gallery: https://www.microsoft.com/en-us/cognitive-toolkit/features/model-gallery/
      • Including GAN, Reinforcement Learning, ResNet152…
  31. 31. Deep Learning on Apache Spark: MS MMLSpark. It implicitly converts the data into the format expected by the algorithm: it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on.
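The three conversions named on this slide (tokenize and hash strings, one-hot encode categoricals, assemble one vector) can be illustrated with a small standalone sketch. This is not MMLSpark code: the function names, the column names `query`, `device`, and `bid`, and the bucket count are all hypothetical, chosen only to show what such automatic featurization does to a single row.

```python
import hashlib

def hash_tokens(text, num_buckets=8):
    """Tokenise on whitespace and count tokens per hashed bucket (the hashing trick)."""
    vector = [0.0] * num_buckets
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike Python's built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_buckets] += 1.0
    return vector

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator per known category."""
    return [1.0 if value == c else 0.0 for c in categories]

def featurize(row, categories, num_buckets=8):
    """Assemble hashed text, one-hot category, and numeric column into one flat vector."""
    return (
        hash_tokens(row["query"], num_buckets)
        + one_hot(row["device"], categories)
        + [float(row["bid"])]
    )

# Hypothetical ad-request row.
row = {"query": "cheap flights to Taipei", "device": "mobile", "bid": 0.35}
features = featurize(row, categories=["desktop", "mobile", "tablet"])
```

Hashing keeps the text part at a fixed width regardless of vocabulary size, at the cost of occasional collisions between tokens that land in the same bucket.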
  32. 32. Deep Learning on Apache Spark: MS MMLSpark ML Pipeline to evaluate CNTK model. Windows Azure Storage Blob
  33. 33. Deep Learning on Apache Spark: Databricks
      • Founded by the creators of Apache Spark; Ali Ghodsi, CEO, is an adjunct professor at UC Berkeley
      • Total funding: $100M+
      • Imports models from TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda
  34. 34. Deep Learning on Apache Spark: Databricks
  35. 35. Deep Learning on Apache Spark: Databricks. Build a NN model from scratch: easy on a driver-only cluster, complicated on distributed nodes.
  36. 36. Deep Learning on Apache Spark: DL4J
      • DeepLearning4J is a Java-based toolkit for building, training and deploying neural networks
      • An open-source, distributed deep-learning project in Java and Scala spearheaded by the people at Skymind
      • ND4J is the Java scientific computing engine powering our matrix manipulations. ND4S is its Scala wrapper.
      • Includes RL and model import from Keras (Theano, TensorFlow, Caffe and CNTK)
      Machine learning models are served in production with Skymind's model server: Secure, Scalable, Stable, Debuggable, Certified
  37. 37. Deep Learning on Apache Spark: DL4J Src: Anatolii(2017)
  38. 38. Deep Learning on Apache Spark: BigDL
      • A distributed deep learning library for Apache Spark released by Intel®
      • Can load pre-trained Caffe or Torch models
      • Uses Intel MKL (Intel® Math Kernel Library) and multi-threaded programming in each Spark task
  39. 39. Deep Learning on Apache Spark: BigDL. Build a NN model from scratch
  40. 40. Deep Learning on Apache Spark
      |                         | BigDL                        | DL4J                                | Databricks                                              | MMLSpark           | SystemML      |
      | Vendor                  | Intel                        | DeepLearning4J                      | Databricks                                              | Microsoft          | Apache        |
      | Pre-trained models      | Caffe/Torch/TensorFlow       | Keras, TensorFlow, Caffe and Theano | TF, MXNet, Keras, PyTorch, Caffe, CNTK, Theano, JCuda   | CNTK Gallery/Keras | DML/Caffe2DML |
      | Train a NN from scratch | Y                            | Y                                   | Y                                                       | N                  | Y / DML       |
      | Notebook                | Python/Scala                 | Scala / Reactive                    | Python/Scala/R/SQL                                      | Python/Scala       | Python/Scala  |
      | Free                    | Y                            | N / if model server                 | N                                                       | Y                  | Y             |
      | Usability               | High                         | High                                | High                                                    | Middle             | Low           |
      | Docker                  | Y                            | Y / Spark Notebook                  | N                                                       | Y                  | Y             |
      | Cloud                   | Y / (AWS, Azure, Cloudera…)  | N                                   | Y / AWS                                                 | Azure              | N             |
      Source: Craig Chao, DataConf 2017
  41. 41. Conclusions
      • Data Wars
      • Unified Data Platform
      • Data Engineers/Developers are key roles
      • Reusable/Portable ML Pipelines
      • DL has deep layers of hidden factors
      • DL models for Ads/RecSys
      • Code-level introduction to DL solutions on Apache Spark
  42. 42. chaocraig@gmail.com Slideshare: Craig Chao
