Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:

- the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
- how to work with complex data such as images in Spark and Deep Learning Pipelines;
- how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.

Finally, we discuss integration with popular deep learning frameworks.


End-to-end workflow with DL Pipelines:

Data ingestion

Featurization

Training

Cross-validation and evaluation

Model deployment

Deep Learning easily and at scale: the vision

Challenges

Overview of a DL pipeline

How Spark can help: starting from an example

Processing images with DLP

the API and an example

some tips for Databricks users

Building simple Deep Learning models with transfer learning

What is transfer learning?

Example from demo

Trying different combinations.

Tips: memory, GPU, etc.

Integration with SQL

Why is it useful?

Example with Keras

TODO: might want to show GAN / cat generators as well

TODO: give a sense of how long it takes to train / use these models

With an emphasis on the models often being used for feature transform

- We have no native support for images in Spark, so we added one in DL Pipelines.

- Numpy arrays – since most deep learning libraries support Python, they tend to work with numpy arrays

And underneath, the transformer is using Spark to massively scale out the prediction.

Focus on image classification for now.

There is no great model to transfer learn off of

Want to try getting better accuracy by re-training or fine-tuning a model with your own data

Currently we have SQL UDF creation for Keras models

- 1. Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark Sue Ann Hong, Databricks Tim Hunter, Databricks #EUdd3
- 2. About Us Sue Ann Hong • Software engineer @ Databricks • Ph.D. from CMU in Machine Learning Tim Hunter • Software engineer @ Databricks • Ph.D. from UC Berkeley in Machine Learning • Very early Spark user
- 3. This talk • Deep Learning at scale: current state • Deep Learning Pipelines: the vision • End-to-end workflow with DL Pipelines • Future
- 4. Deep Learning at Scale: current state
- 5. What is Deep Learning? • A set of machine learning techniques that use layers that transform numerical inputs • Classification • Regression • Arbitrary mapping • Popular in the '80s as Neural Networks • Recently came back thanks to advances in data collection, computation techniques, and hardware.
- 6. Success of Deep Learning Tremendous success for applications with complex data • AlphaGo • Image interpretation • Automatic translation • Speech recognition
- 7. But requires a lot of effort • No exact science around deep learning • Success requires many engineer-hours • Low level APIs with steep learning curve • Not well integrated with other enterprise tools • Tedious to distribute computations
- 8. What does Spark offer? Very little in Apache Spark MLlib itself (multilayer perceptron) Many Spark packages Integrations with existing DL libraries • Deep Learning Pipelines (from Databricks) • Caffe (CaffeOnSpark) • Keras (Elephas) • mxnet • Paddle • TensorFlow (TensorFlow on Spark, TensorFrames) • CNTK (mmlspark) Implementations of DL on Spark • BigDL • DeepDist • DeepLearning4J • MLlib • SparkCL • SparkNet
- 9. Deep Learning in industry • Currently limited adoption • Huge potential beyond the industrial giants • How do we accelerate the road to massive availability?
- 10. Deep Learning Pipelines
- 11. Deep Learning Pipelines: Deep Learning with Simplicity • Open-source Databricks library • Focuses on ease of use and integration • without sacrificing performance • Primary language: Python • Uses Apache Spark for scaling out common tasks • Integrates with MLlib Pipelines to capture the ML workflow concisely
- 12. A typical Deep Learning workflow • Load data (images, text, time series, …) • Interactive work • Train • Select an architecture for a neural network • Optimize the weights of the NN • Evaluate results, potentially re-train • Apply: • Pass the data through the NN to produce new features or output (diagram: Load data → Interactive work → Train → Evaluate → Apply)
- 13. A typical Deep Learning workflow Load data Interactive work Train Evaluate Apply • Image loading in Spark • Distributed batch prediction • Deploying models in SQL • Transfer learning • Distributed tuning • Pre-trained models Part 1 Part 2
- 14. End-to-End Workflow with Deep Learning Pipelines
- 15. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply
- 16. Adds support for images in Spark • ImageSchema, reader, conversion functions to/from numpy arrays • Most of the tools we’ll describe work on ImageSchema columns from sparkdl import readImages image_df = readImages(sample_img_dir)
- 17. Upcoming: built-in support in Spark • SPARK-21866 • Contributing image format & reading to Spark • Targeted for Spark 2.3 • Joint work with Microsoft
- 18. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply
- 19. Applying popular models • Popular pre-trained models accessible through MLlib Transformers predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3") predictions_df = predictor.transform(image_df)
- 20. Applying popular models predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3") predictions_df = predictor.transform(image_df)
- 21. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Hyperparameter tuning Transfer learning
- 22. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Hyperparameter tuning Transfer learning
- 23. Transfer learning • Pre-trained models may not be directly applicable • New domain, e.g. shoes • Training from scratch requires • Enormous amounts of data • A lot of compute resources & time • Idea: intermediate representations learned for one task may be useful for other related tasks
- 24. Transfer Learning (diagram: pre-trained network ending in a SoftMax layer; outputs GIANT PANDA 0.9, RACCOON 0.05, RED PANDA 0.01, …)
- 25. Transfer Learning (diagram: the final SoftMax layer is removed)
- 26. Transfer Learning (diagram: a new classifier is attached to the network's intermediate representation)
- 27. Transfer Learning (diagram: the new classifier outputs Rose: 0.7, Daisy: 0.3)
- 28. MLlib Pipelines primer • MLlib: the machine learning library included with Spark • Transformer • Takes in a Spark dataframe • Returns a Spark dataframe with new column(s) containing “transformed” data • e.g. a Model is a Transformer • Estimator • A learning algorithm, e.g. lr = LogisticRegression() • Produces a Model via lr.fit() • Pipeline: a sequence of Transformers and Estimators
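The Transformer / Estimator / Pipeline pattern on this slide can be sketched in plain Python. This is a hypothetical stand-in, not the actual pyspark.ml classes: rows play the role of a DataFrame, fit() on an Estimator yields a Model (which is itself a Transformer), and a Pipeline chains the stages.

```python
class AddDoubled:
    # A Transformer: returns the dataset with a new "doubled" column,
    # like an MLlib feature transformer adds a column.
    def transform(self, rows):
        return [dict(r, doubled=r["x"] * 2) for r in rows]

class MeanModel:
    # A Model is a Transformer produced by fitting an Estimator.
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        # "Prediction": flag rows above the training-time mean.
        return [dict(r, above_mean=r["doubled"] > self.mean) for r in rows]

class MeanEstimator:
    # An Estimator: fit() scans the training data and returns a Model.
    def fit(self, rows):
        mean = sum(r["doubled"] for r in rows) / len(rows)
        return MeanModel(mean)

class Pipeline:
    # A sequence of Transformers and Estimators; fitting replaces each
    # Estimator with the Model it produces.
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed output to the next stage
            fitted.append(stage)
        return Pipeline(fitted)            # a fitted Pipeline of Transformers
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

p = Pipeline([AddDoubled(), MeanEstimator()])
model = p.fit([{"x": 1}, {"x": 2}, {"x": 3}])
result = model.transform([{"x": 5}])
```

This mirrors the slide's point that a fitted Pipeline is just a chain of Transformers, which is why a transfer-learning pipeline (featurizer + classifier) composes so naturally.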
- 29. Transfer Learning as a Pipeline (diagram: DeepImageFeaturizer feeds a classifier that outputs Rose / Daisy)
- 30. Transfer Learning as a Pipeline (diagram: MLlib Pipeline of Image Loading, Preprocessing, DeepImageFeaturizer, Logistic Regression)
- 31. Transfer Learning as a Pipeline featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3") lr = LogisticRegression(labelCol="label") p = Pipeline(stages=[featurizer, lr]) p_model = p.fit(train_df)
- 32. Transfer Learning • Usually for classification tasks • Similar task, new domain • But other forms of learning leveraging learned representations can be loosely considered transfer learning
- 33. Featurization for similarity-based ML (diagram: Image Loading, Preprocessing, DeepImageFeaturizer, Logistic Regression)
- 34. Featurization for similarity-based ML (diagram: featurizer output feeds Clustering: KMeans, GaussianMixture; Nearest Neighbor: KNN, LSH; Distance computation)
- 35. Featurization for similarity-based ML (diagram adds use cases: Duplicate Detection, Recommendation, Anomaly Detection, Search result diversification)
- 36. Break?
- 37. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Hyperparameter tuning Transfer learning
- 38. Distributed Hyperparameter Tuning (diagram: Spark driver sends configurations to workers, e.g. {model = ResNet50, learning_rate = 0.01, batch_size = 64}, {model = InceptionV3, learning_rate = 0.1, batch_size = 128})
- 39. MLlib Components • Estimators: learning algorithms; implement model = fit(train_df, params), where params are e.g. the learning rate • Evaluators: a metric used to measure a model's goodness on validation data; e.g. BinaryClassificationEvaluator computes accuracy
- 40. Model Selection (Parameter Search) paramGrid = ( ParamGridBuilder() .addGrid(hashingTF.numFeatures, [10, 100]) .addGrid(lr.regParam, [0.1, 0.01]) .build() ) cv = CrossValidator( estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=BinaryClassificationEvaluator()) best_model = cv.fit(train_df)
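The grid-building step on this slide can be sketched in plain Python. build_param_grid below is a hypothetical stand-in for ParamGridBuilder, not the pyspark.ml.tuning class: the builder simply takes the cross-product of all candidate values, and CrossValidator then fits and evaluates one model per combination per fold.

```python
from itertools import product

def build_param_grid(grids):
    """Cross-product of parameter values, like ParamGridBuilder.build().

    grids: dict mapping a param name to its list of candidate values.
    Returns a list of param maps, one per combination.
    """
    names = list(grids)
    return [dict(zip(names, values))
            for values in product(*(grids[n] for n in names))]

# The same grid as on the slide: 2 x 2 = 4 candidate param maps.
grid = build_param_grid({
    "hashingTF.numFeatures": [10, 100],
    "lr.regParam": [0.1, 0.01],
})
```

With 3-fold cross-validation this grid means 4 x 3 = 12 training runs, which is exactly the work Spark can distribute across the cluster.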
- 41. Deep Learning Estimators • Transfer learning Pipelines • KerasImageFileEstimator
- 42. Deep Learning Estimators • Transfer learning Pipelines • KerasImageFileEstimator
- 43. Distributed Tuning Transfer Learning (diagram: Spark driver dispatches Logistic Regression + ResNet50 and Logistic Regression + InceptionV3 to workers)
- 44. Distributed Tuning Transfer Learning pipeline = Pipeline([DeepImageFeaturizer(), LogisticRegression()]) paramGrid = ( ParamGridBuilder() .addGrid(modelName=["ResNet50", "InceptionV3"]) ) cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=BinaryClassificationEvaluator(), numFolds=3) best_model = cv.fit(train_df)
- 45. Deep Learning Estimators • Transfer learning Pipelines • KerasImageFileEstimator
- 46. Keras • A popular, declarative interface to build DL models • High level, expressive API in Python • Executes on TensorFlow, Theano, CNTK model = Sequential() model.add(Dense(32, input_dim=784)) model.add(Activation('relu'))
- 47. Keras Estimator model = Sequential() model.add(...) model.save(model_filename) estimator = KerasImageFileEstimator( kerasOptimizer="adam", kerasLoss="categorical_crossentropy", kerasFitParams={"batch_size": 100}, modelFile=model_filename) model = estimator.fit(dataframe)
- 48. Keras Estimator in Model Selection estimator = KerasImageFileEstimator( kerasOptimizer="adam", kerasLoss="categorical_crossentropy") paramGrid = ( ParamGridBuilder() .addGrid(kerasFitParams=[{"batch_size": 100}, {"batch_size": 200}]) .addGrid(modelFile=[model1, model2]) )
- 49. Keras Estimator in Model Selection estimator = KerasImageFileEstimator( kerasOptimizer="adam", kerasLoss="categorical_crossentropy") paramGrid = ( ParamGridBuilder() .addGrid(kerasFitParams=[{"batch_size": 100}, {"batch_size": 200}]) .addGrid(modelFile=[model1, model2]) ) cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=BinaryClassificationEvaluator(), numFolds=3) best_model = cv.fit(train_df)
- 50. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply
- 51. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Spark SQL Batch prediction
- 52. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Spark SQL Batch prediction
- 53. Batch prediction as an MLlib Transformer • Recall a model is a Transformer in MLlib predictor = XXTransformer(inputCol="image", outputCol="predictions", modelSpecification={…}) predictions = predictor.transform(test_df)
- 54. Hierarchy of DL transformers for images (diagram: TFImageTransformer (tf.Graph), KerasImageTransformer (keras.Model), NamedImageTransformer (model_name), DeepImageFeaturizer, DeepImagePredictor; Input: Image, Output: vector)
- 55. Hierarchy of DL transformers for images (diagram: same hierarchy, showing how each level is parameterized: model_name, keras.Model, tf.Graph)
- 56. Hierarchy of DL transformers (diagram: the image hierarchy alongside a parallel text hierarchy: TFTextTransformer, KerasTextTransformer, NamedTextTransformer, DeepTextFeaturizer, DeepTextPredictor; Input: Text, Output: vector; both hierarchies built on a common TFTransformer)
- 57. TFImageTransformer • Defined by a tf.Graph 1. Convert ImageSchema data into a vector 2. Use tensorframes to • Efficiently distribute the tf.Graph to workers • Apply the graph to the partitioned data 3. Return the result as a spark.ml.Vector
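Step 2 above (apply the same graph to every row of the partitioned data) can be sketched without Spark or TensorFlow. This is a toy stand-in with hypothetical names: apply_graph plays the role of the tf.Graph, and the per-partition loop plays the role of the distributed workers.

```python
def apply_graph(vector):
    # Stand-in for the tf.Graph: e.g. scale each element of the image vector.
    return [v * 2.0 for v in vector]

def transform_partition(partition):
    # Each worker applies the same graph to every row in its partition.
    return [apply_graph(row) for row in partition]

# Two "partitions" of image vectors, as Spark would hold them on workers.
partitions = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
result = [row for part in partitions for row in transform_partition(part)]
```

The real work of shipping the graph to workers and applying it efficiently is what tensorframes handles; the point here is only that the graph is pure per-row computation, so it parallelizes trivially over partitions.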
- 58. Aside: high performance with Spark • TensorFlow (and other frameworks) have 2 mechanisms to ingest data: • memory-based API (tensors) • file-based API (Queue, …) • DLP makes all data transfers in memory • Spark responsible for reading and assembling data • Low-level transformations handled by TensorFrames
- 59. High performance with Spark • Every DL transform is a TensorFlow graph (diagram: graph with inputs x: int32 and y: int32, a constant 3, a mul node, and output z)
- 60. High performance with Spark • Every DL transform is a TensorFlow graph • Applied on each of the rows of the dataframe (diagram: df: DataFrame[x: int, y: int] passed through the graph yields output_df: DataFrame[x: int, y: int, z: int])
- 61. Example: running batch inference (diagram: the Driver JVM, running the Spark driver and Deep Learning Pipelines, broadcasts the TensorFlow program (tf.Graph, as protocol buffers) to Worker JVMs, where the TensorFlow runtime (C++) executes it on the Spark workers)
- 62. Deep Learning Pipelines • Load data • Interactive work • Train • Evaluate model • Apply Spark SQL Batch prediction
- 63. Shipping predictors in SQL Take a trained model / Pipeline, register a SQL UDF usable by anyone in the organization In Spark SQL: registerKerasUDF("my_object_recognition_function", keras_model_file="/mymodels/007model.h5") select image, my_object_recognition_function(image) as objects from traffic_imgs This means you can apply deep learning models in streaming!
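Conceptually, registering a model as a SQL UDF just binds a name to a prediction function in the engine's function registry, so any SQL user can call it per row. A plain-Python sketch of that idea (the registry and the predict function here are hypothetical illustrations, not Spark's implementation):

```python
# Hypothetical in-process "UDF registry": name -> callable,
# like the function registry a SQL engine consults when resolving a query.
udf_registry = {}

def register_udf(name, fn):
    udf_registry[name] = fn

def fake_model_predict(image):
    # Stand-in for a trained Keras model: label by mean pixel intensity.
    return "bright" if sum(image) / len(image) > 0.5 else "dark"

register_udf("my_object_recognition_function", fake_model_predict)

# The engine would now resolve the name and apply it to each row,
# as in: select my_object_recognition_function(image) from traffic_imgs
rows = [[0.9, 0.8], [0.1, 0.2]]
objects = [udf_registry["my_object_recognition_function"](img) for img in rows]
```

Because the UDF is just a per-row function, the same registration works for batch queries and for streaming queries alike, which is the slide's point.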
- 64. Almost done
- 65. Deep Learning Pipelines : Future In progress • Scala API for DeepImageFeaturizer • Text featurization (embeddings) • TFTransformer for arbitrary vectors Future • Distributed training • Support for more backends, e.g. MXNet, PyTorch, BigDL
- 66. Deep Learning without Deep Pockets • Simple API for Deep Learning, integrated with MLlib • Scales common tasks with transformers and estimators • Embeds Deep Learning models in MLlib and SparkSQL • Check out https://github.com/databricks/spark-deep-learning !
- 67. Questions? Thank you!
- 68. Resources Blog posts & webinars (http://databricks.com/blog) • Deep Learning Pipelines • GPU acceleration in Databricks • BigDL on Databricks • Deep Learning and Apache Spark Docs for Deep Learning on Databricks (http://docs.databricks.com) • Getting started • Deep Learning Pipelines Example • Spark integration
