TENSORFLOW + SPARK DATAFRAMES
=
TENSORFRAMES
Tallinn Advanced Java Meetup
Oct 24, 2016
Chris Fregly
Research Scientist @ PipelineIO
Thank You for Hosting, Planet OS!!
WHO AM I
Chris Fregly
• Currently
Research Scientist @ PipelineIO (http://pipeline.io)
Contributor @ Apache Spark
Committer @ Netflix Open Source
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (http://advancedspark.com)
Creator @ PANCAKE STACK (http://pancake-stack.com)
• Previously
Streaming Data Engineer @ Netflix, Databricks, IBM Spark
ADVANCED SPARK AND TENSORFLOW MEETUP
4,600+ Members
Top 4 Spark Meetup!
Github Repo Stars + Forks
DockerHub Repo Pulls
CURRENT PIPELINE.IO RESEARCH
• Model Deploying and Testing
• Model Scaling and Serving
• Online Model Training
• Dynamic Model Optimizing
PIPELINE.IO DELIVERABLES
• 100% Open Source!!
• GitHub:
• https://github.com/fluxcapacitor/
• DockerHub:
• https://hub.docker.com/r/fluxcapacitor
PIPELINE.IO WORKSHOPS
AGENDA
• Neural Networks
• GPUs
• TensorFlow
• TensorFrames
WHAT ARE NEURAL NETWORKS?
• Like All Machine Learning, Goal is to Minimize Loss (Error)
• Mostly Supervised Learning Classification
• Many labeled training samples exist
• Training
• Step 1: Start with Random Guesses for Input Weights
• Step 2: Calculate Error Against Labeled Data
• Step 3: Determine Gradient Amount and Direction (+ or -)
• Step 4: Back-propagate Gradient to Update Each Input Weight
• Step 5: Repeat from Step 2 (Keeping the Updated Weights) until Convergence or Max Epochs Reached
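
The training steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a single weight, a model of the form y = w * x, mean squared error as the loss, and hypothetical names (`train`, `samples`, `learning_rate`) that do not come from the talk.

```python
import random

def train(samples, learning_rate=0.01, max_epochs=1000, tolerance=1e-9):
    """Fit y = w * x by gradient descent on mean squared error."""
    rng = random.Random(0)
    w = rng.uniform(-1.0, 1.0)            # Step 1: random initial guess for the weight
    for epoch in range(max_epochs):
        # Step 2: error against the labeled data (mean squared error)
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < tolerance:              # Step 5: stop once converged
            break
        # Step 3: gradient of the loss with respect to w (amount and sign)
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        # Step 4: move the weight against the gradient
        w -= learning_rate * grad
    return w, loss

# Hypothetical labeled data that follows y = 3x
samples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
weight, final_loss = train(samples)
```

With these samples the learned weight should settle close to 3.0. A real network repeats the same loop over many weights at once.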
BACK PROPAGATION
http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Chain Rule
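
As a small worked example of the chain rule in back propagation, the sketch below pushes a gradient through two nodes: a linear unit and a squared-error loss. The function names and the numbers are assumptions made for illustration; the finite-difference value is only there to sanity-check the analytic gradient.

```python
def forward(w, b, x, y):
    z = w * x + b          # linear node
    loss = (z - y) ** 2    # squared-error node
    return z, loss

def backward(w, b, x, y):
    z, _ = forward(w, b, x, y)
    dloss_dz = 2.0 * (z - y)      # local gradient of the loss node
    dz_dw, dz_db = x, 1.0         # local gradients of the linear node
    # Chain rule: multiply local gradients along the path back to each input
    return dloss_dz * dz_dw, dloss_dz * dz_db

w, b, x, y = 0.5, 0.1, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, y)

# Numerical check with a central finite difference
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[1] - forward(w - eps, b, x, y)[1]) / (2 * eps)
```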
CONVOLUTIONAL NEURAL NETWORKS
• Apply Many Layers (aka Filters) to Input
• Each Layer/Filter Picks up on Features
• Features not necessarily human-grokkable
• Brute Force – Try Different numLayers & layerSizes
• Filter Examples
• 3 Color Filters: RGB
• Moving AVG for Time Series
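
To make the moving-average example concrete, here is a minimal sketch of a filter sliding over a 1-D series. The helper name `convolve_1d`, the window size, and the sample series are assumptions for illustration. In a convolutional network the kernel values are learned during training rather than fixed by hand.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal, keeping only fully covered positions.

    No kernel flip is applied (as in most deep learning libraries); for a
    symmetric kernel like a moving average the result is the same either way.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

window = 3
moving_avg_kernel = [1.0 / window] * window    # every point weighted equally
series = [3.0, 6.0, 9.0, 6.0, 3.0, 0.0]
smoothed = convolve_1d(series, moving_avg_kernel)
```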
MY FAVORITE USE CASE – STITCH FIX
StitchFix Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
RECURRENT NEURAL NETWORKS
Maintain State
Enables Learning of Sequential Patterns
Useful for Text/NLP Prediction
CHARACTER RNNS
Preserving State differentiates between 1st and 2nd ‘l’ to improve prediction
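
The point about state can be illustrated without a neural network. The sketch below assumes the example word is "hello" (the slide's figure is not part of this text) and uses lookup tables as a crude stand-in for an RNN: a real RNN learns a hidden-state vector instead of memorizing the previous character.

```python
from collections import defaultdict

text = "hello"   # assumed example word

# Stateless: predict the next character from the current character only
stateless = defaultdict(set)
for cur, nxt in zip(text, text[1:]):
    stateless[cur].add(nxt)

# Stateful: also carry the previous character as a very crude "hidden state"
stateful = defaultdict(set)
padded = "^" + text          # "^" marks "no previous character"
for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
    stateful[(prev, cur)].add(nxt)

ambiguous = stateless["l"]            # both 'l' and 'o' can follow an 'l'
after_first_l = stateful[("e", "l")]  # first 'l' (preceded by 'e')
after_second_l = stateful[("l", "l")] # second 'l' (preceded by 'l')
```

Without state, 'l' has two possible successors. With state, the first and second 'l' each map to exactly one next character.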
AGENDA
• Neural Networks
• GPUs
• TensorFlow
• TensorFrames
CPU VS GPU
• GPUs are Fundamentally Different than CPUs
• Therefore, GPU/CUDA Programming is Fundamentally Different
SAME INSTRUCTION, MULTIPLE DATA
MINIMIZE DATA DEPENDENCIES
• More natural for structured, independent data
• Tasks perform identical instructions in parallel on same-structured data
• Reduce data dependencies as they limit parallelism
• Dependency Examples: Previous Instruction, Previous Loop Iteration
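
The difference between independent and dependent work can be shown with two small loops. This is a plain Python sketch with hypothetical function names, not GPU code: the first loop is the kind of work that maps naturally onto many parallel threads, while the second carries a value from one iteration to the next.

```python
def scale(values, factor):
    # No element depends on another: every iteration could run in parallel
    return [v * factor for v in values]

def running_sum(values):
    # Loop-carried dependency: iteration i needs the total from iteration i - 1
    totals, total = [], 0
    for v in values:
        total += v
        totals.append(total)
    return totals

scaled = scale([1, 2, 3], 2)
prefix = running_sum([1, 2, 3, 4])
```

Dependent loops like the running sum can still be parallelized, for example with a parallel scan, but only after restructuring the algorithm.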
MEMORY AND CORES
EXPLORE YOUR SURROUNDINGS
`nvidia-smi`
AGENDA
• Neural Networks
• GPUs
• TensorFlow
• TensorFrames
WHAT IS TENSORFLOW?
• Google Open Source General Purpose Numerical Computation Engine
• Happens to be Good for Neural Networks!
• Tooling
• TensorBoard (port 6006 == `goog` upside down!)
• DAG-based like Spark!
• Computation graph is logical plan
• Stored in Protobufs
• TensorFlow converts logical to physical plan
• Lots of Libraries
• TFLearn (TensorFlow’s Scikit-learn Impl)
• TensorFlow Serving (Prediction Layer)
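
The "computation graph is a logical plan" idea can be sketched with a toy graph in plain Python. This is deliberately not TensorFlow's API: the `Node` class and the `placeholder`, `add`, `mul`, and `run` helpers are hypothetical names chosen to show that building the graph and executing it are separate steps.

```python
class Node:
    """One vertex of a toy computation graph (not TensorFlow's API)."""
    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, tuple(inputs), name

def placeholder(name):
    return Node("placeholder", name=name)

def add(a, b):
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

def run(node, feed):
    """Walk the graph and execute it; nothing is computed before this call."""
    if node.op == "placeholder":
        return feed[node.name]
    args = [run(i, feed) for i in node.inputs]
    return args[0] + args[1] if node.op == "add" else args[0] * args[1]

x = placeholder("x")
y = placeholder("y")
z = add(mul(x, x), y)          # builds the graph only: z is a Node, not a number
result = run(z, {"x": 3, "y": 4})
```

Because the graph exists as data before anything runs, an engine is free to serialize it, ship it to other machines, or decide where each operation executes.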
DEMO!
AWS + Docker + GPU + Docker + TensorFlow
DEMO!
TensorFlow Serving
AGENDA
• Neural Networks
• GPUs
• TensorFlow
• TensorFrames
WHAT ARE TENSORFRAMES?
• Bridge between Spark (JVM) and TensorFlow (C++)
• Python and Scala Bindings for Application Code
• Uses JavaCPP for JNI-level Integration
• Must Install TensorFrames C++ Runtime Libs on All Spark Workers
• Developed by Old Co-worker @ Databricks, Tim Hunter
• PhD in Tensors – He’s “Mr. Tensor”
WHY TENSORFRAMES?
• Why Not?!
• Best of Both Worlds: Legacy Spark Support + TensorFlow
• Mix and Match Spark ML + TensorFlow AI on Same Data
• TensorFlow is DAG-based, Similar to Spark
• Enables Data-Parallel Model Training
DATA-PARALLEL MODEL TRAINING
• Large Datasets are Partitioned Across HDFS Cluster
• Computation Graph (Logical Plan) Passed to Spark Workers
• Workers Train on Each Data Partition in Parallel
• Workers Periodically Aggregate (i.e. AVG) Results
• Aggregations happen in “Parameter Server”
• Spark Master/Driver is Parameter Server
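
The flow above can be simulated on one machine. The sketch below is an assumption-laden illustration, not TensorFrames or Spark code: the partitions are ordinary lists, the "workers" run one after another, the model is y = w * x, and the function names are hypothetical. Each worker trains locally for a few steps, then the driver averages the resulting weights.

```python
def local_train(w, partition, steps, lr):
    """One worker: a few gradient steps on its own partition, fitting y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in partition) / len(partition)
        w -= lr * grad
    return w

def data_parallel_train(partitions, rounds=50, steps=5, lr=0.01):
    w = 0.0                                   # shared weight held by the "parameter server"
    for _ in range(rounds):
        # Every worker starts from the shared weight; on a cluster these run in parallel
        local = [local_train(w, p, steps, lr) for p in partitions]
        w = sum(local) / len(local)           # periodic aggregation (AVG)
    return w

data = [(float(x), 2.0 * x) for x in range(1, 9)]   # hypothetical labels following y = 2x
partitions = [data[0:4], data[4:8]]                 # stand-in for HDFS partitions
weight = data_parallel_train(partitions)
```

With this data the averaged weight should approach 2.0. How often to aggregate is a trade-off: fewer aggregations mean less communication but let the workers drift further apart between rounds.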
TENSORFLOW + MULTIPLE HOSTS/GPUS
Multi-GPU, Data-Parallel Training
Step 1: CPU transfers model replica and (initial) parameters to each GPU
Step 2: CPU synchronizes and waits for all GPUs to process batch
Step 3: CPU copies all training results (gradients) back from GPU
Step 4: CPU averages gradients from all GPUs and updates the parameters
Step 5: Repeat Step 1 with (new) parameters
Code
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py
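
A minimal sketch of the synchronous scheme in the steps above, written in plain Python rather than taken from the linked CIFAR-10 code. The "GPUs" here are just separate batches processed in a list comprehension, and `tower_gradient`, `sync_step`, and the data are hypothetical. Unlike averaging weights after several local steps, this variant averages gradients on every step and applies one update to the shared weight.

```python
def tower_gradient(w, batch):
    """Gradient of mean squared error for y = w * x on one replica's batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def sync_step(w, batches, lr):
    # Steps 1-3: every replica gets the same weight and returns its gradient
    grads = [tower_gradient(w, b) for b in batches]
    # Step 4: average the gradients on the "CPU"
    avg_grad = sum(grads) / len(grads)
    # Apply a single update to the shared weight
    return w - lr * avg_grad

# Hypothetical batches, one per replica, with labels following y = 2x
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
first_step = sync_step(0.0, batches, lr=0.01)

w = 0.0
for _ in range(200):          # Step 5: repeat with the updated weight
    w = sync_step(w, batches, lr=0.01)
```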
TENSORFRAMES PERFORMANCE
• Depends on Algorithm and Dataset, of course!
• TensorFrames Requires Extra Serialization JVM <-> C++
• What about Python Serialization from Python Bindings?
• Should be minimal unless using Python UDFs
• PySpark keeps small logical plan in Python layer
• Physical operations happen in JVM (except Python UDFs!)
DEMO!
TensorFrames in Python and Scala
THANK YOU!!
Chris Fregly, Research Scientist @ PipelineIO
• LinkedIn: https://linkedin.com/in/cfregly
• Twitter: @cfregly
http://pipeline.io

Tallinn Estonia Advanced Java Meetup Spark + TensorFlow = TensorFrames Oct 24, 2016
