Atlanta Hadoop Users Meetup 09 21 2016
1. TENSORFLOW + SPARK DATAFRAMES = TENSORFRAMES
Atlanta Hadoop Users Group
Sept 21, 2016
Chris Fregly
Research Scientist @ http://pipeline.io
Thank You for Hosting, HashMap!!
2. WHO AM I
Chris Fregly
• Currently
Research Scientist @ PipelineIO (http://pipeline.io)
Contributor @ Apache Spark
Committer @ Netflix Open Source
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (http://advancedspark.com)
Creator @ PANCAKE STACK (http://pancake-stack.com)
• Previously
Streaming Data Engineer @ Netflix, Databricks, IBM Spark
3. ADVANCED SPARK AND TENSORFLOW
MEETUP
4,400+ Members!
Top 4 Spark Meetup!!
Github Repo Stars + Forks
DockerHub Repo Pulls
10. WHAT ARE NEURAL NETWORKS?
• Like All Machine Learning, Goal is to Minimize Loss (Error)
• Mostly Supervised Learning Classification
• Many labeled training samples exist
• Training
• Step 1: Start with Random Guesses for Input Weights
• Step 2: Calculate Error Against Labeled Data
• Step 3: Determine Gradient Amount and Direction (+ or -)
• Step 4: Back-propagate Gradient to Update Each Input Weight
• Step 5: Repeat Steps 2–4 until Convergence or Max Epochs Reached
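The training loop above can be sketched in a few lines of plain Python. This is a minimal illustration fitting y = w * x with gradient descent on squared error; all names and constants are illustrative, not from the talk.

```python
import random

def train(samples, lr=0.01, max_epochs=500, tol=1e-6):
    """Fit y = w * x to labeled (x, y) samples with gradient descent."""
    w = random.uniform(-1.0, 1.0)              # Step 1: random initial weight
    for epoch in range(max_epochs):
        # Steps 2-3: error against labeled data, and the gradient's
        # magnitude and sign: d(MSE)/dw for mean squared error.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        # Step 4: propagate the gradient back into the weight.
        w -= lr * grad
        # Step 5: repeat until convergence or max epochs reached.
        if abs(grad) < tol:
            break
    return w

samples = [(x, 3.0 * x) for x in range(1, 6)]  # labeled data drawn from y = 3x
w = train(samples)
print(round(w, 3))  # prints 3.0
```

A real network repeats the same loop per layer, with the chain rule carrying the gradient backward through each one.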
12. CONVOLUTIONAL NEURAL NETWORKS
• Apply Many Layers (aka. Filters) to Input
• Each Layer/Filter Picks up on Features
• Features not necessarily human-grokkable
• Brute Force: Try Different numLayers & layerSizes
• Filter Examples
• 3 Color Filters: RGB
• Moving AVG for Time Series
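The moving-average filter mentioned above is itself a 1-D convolution: slide a small uniform kernel along the series, exactly as a CNN layer slides its filters over an image. A small sketch in plain Python (names are illustrative):

```python
def moving_avg(series, window):
    """Convolve `series` with a uniform kernel of length `window`."""
    kernel = [1.0 / window] * window           # uniform filter weights
    out = []
    for i in range(len(series) - window + 1):  # slide kernel along series
        out.append(sum(series[i + j] * kernel[j] for j in range(window)))
    return out

print([round(v, 6) for v in moving_avg([1.0, 2.0, 3.0, 4.0, 5.0], 3)])
# prints [2.0, 3.0, 4.0]
```

A CNN differs only in that the kernel weights are learned rather than fixed, and the filters are 2-D (or 3-D, with one plane per color channel).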
13. MY FAVORITE USE CASE – STITCH FIX
StitchFix Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
19. MINIMIZE DATA DEPENDENCIES
• More natural for structured, independent data
• Tasks perform identical instructions in parallel on same-structured data
• Reduce data dependencies as they limit parallelism
(Diagram: dependencies on the previous instruction or the previous loop iteration limit parallelism)
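The contrast above is easy to see in code. In the first loop below, each iteration depends only on its own input, so the work can fan out across tasks; in the second, each iteration needs the previous one's result, so the iterations must run in order. A minimal illustration:

```python
data = [1, 2, 3, 4]

# Independent: identical instruction on same-structured data.
# Every iteration stands alone, so this parallelizes trivially.
squares = [x * x for x in data]

# Loop-carried dependency: each iteration reads the previous
# iteration's accumulator, forcing sequential execution.
running = []
acc = 0
for x in data:
    acc = acc + x          # depends on the previous loop iteration
    running.append(acc)

print(squares)   # prints [1, 4, 9, 16]
print(running)   # prints [1, 3, 6, 10]
```

Spark's map-style operations have the first shape; reductions with ordering constraints have the second, which is why minimizing such dependencies matters.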
23. WHAT IS TENSORFLOW?
• Google Open Source General Purpose Numerical Computation Engine
• Happens to be Good for Neural Networks!
• Tooling
• Tensorboard (port 6006 == `goog` upside down!)
• DAG-based like Spark!
• Computation graph is logical plan
• Stored as Protobufs
• Tensorflow converts logical to physical plan
• Lots of Libraries
• TFLearn (Tensorflow’s Scikit-learn Impl)
• Tensorflow Serving (Prediction Layer)
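The "computation graph as logical plan" idea above can be shown with a toy DAG in plain Python. This is a stand-in to illustrate the concept, not the TensorFlow API: building nodes only records the plan, and nothing executes until `eval()` walks the graph (TensorFlow's physical execution plays the same role).

```python
class Node:
    """One vertex in a toy computation DAG; illustrative only."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def eval(self):
        if self.op == "const":
            return self.value
        args = [n.eval() for n in self.inputs]  # evaluate dependencies first
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: %s" % self.op)

# Logical plan for (2 + 3) * 4 -- nothing runs until eval() is called.
a = Node("const", value=2)
b = Node("const", value=3)
c = Node("add", (a, b))
d = Node("mul", (c, Node("const", value=4)))
print(d.eval())  # prints 20
```

Spark's DataFrame API works the same way: transformations build a logical plan, and an action triggers the optimized physical execution.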
28. WHAT ARE TENSORFRAMES?
• Bridge between Spark (JVM) and Tensorflow (C++)
• Python and Scala Bindings for Application Code
• Uses JavaCPP for JNI-level Integration
• Must Install TensorFrames C++ Runtime Libs on All Spark Workers
• Developed by Old Co-worker @ Databricks, Tim Hunter
• PhD in Tensors – He’s ”Mr. Tensor”
29. WHY TENSORFRAMES?
• Why Not?!
• Best of Both Worlds: Legacy Spark Support + Tensorflow
• Mix and Match Spark ML + Tensorflow AI on Same Data
• Tensorflow is DAG-based Similar to Spark
• Enables Data-Parallel Model Training
30. DATA-PARALLEL MODEL TRAINING
• Large Datasets are Partitioned Across HDFS Cluster
• Computation Graph (Logical Plan) Passed to Spark Workers
• Workers Train on Each Data Partition in Parallel
• Workers Periodically Aggregate (e.g. AVG) Results
• Aggregations happen in “Parameter Server”
• Spark Master/Driver is Parameter Server
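The data-parallel scheme above can be sketched in a few lines: each "worker" computes a gradient on its own partition, and the "parameter server" averages them and updates the shared weight. This is an illustrative simulation in plain Python, not TensorFrames code; the model (y = w * x), partition layout, and constants are all made up for the example.

```python
def worker_gradient(w, partition):
    """Gradient of mean squared error for y = w * x on one data partition."""
    return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)

def train_data_parallel(partitions, lr=0.01, steps=200):
    w = 0.0                                              # shared model weight
    for _ in range(steps):
        # Each worker trains on its own partition (in parallel on a cluster).
        grads = [worker_gradient(w, p) for p in partitions]
        # The parameter server (Spark driver) aggregates by averaging...
        avg = sum(grads) / len(grads)
        # ...and broadcasts the updated weight back to the workers.
        w -= lr * avg
    return w

# Two partitions of labeled data, both drawn from y = 3x.
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train_data_parallel(parts)
print(round(w, 2))  # prints 3.0
```

The key property: workers never exchange data, only small gradient/weight tensors, which is what makes the scheme scale to HDFS-sized datasets.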
31. TENSORFLOW + MULTIPLE HOSTS/GPUS
Multi-GPU, Data-Parallel Training
Step 1: CPU transfers model replica and (initial) gradients to each GPU
Step 2: CPU synchronizes and waits for all GPUs to process batch
Step 3: CPU copies all training results (gradients) back from GPU
Step 4: CPU averages gradients from all GPUs
Step 5: Repeat Step 1 with (new) gradients
Code:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py
32. TENSORFRAME PERFORMANCE
• Depends on Algorithm and Dataset, of course!
• TensorFrames Require Extra Serialization JVM <-> C++
• What about Python Serialization from Python Bindings?
• Should be minimal unless using Python UDFs
• PySpark keeps small logical plan in Python layer
• Physical operations happen in JVM (except Python UDFs!)