SlideShare a Scribd company logo
1 of 29
Download to read offline
Data science and deep
learning on Spark with
1/10th of the code
Roope Astala, Sudarshan Ragunathan
Microsoft Corporation
Agenda
• User story: Snow Leopard Conservation
• Introducing Microsoft Machine Learning Library for Apache Spark
– Vision: Productivity, Scalability, State-of-Art Algorithms, Open Source
– Deep Learning and Image Processing
– Text Analytics
Disclaimer:​ Any roadmap items are subject to change without notice.
Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache
Software Foundation in the United States and/or other countries.
Rhetick Sengupta
President, Board of Directors
Snow leopards
• 3,900-6,500 individuals left in the wild
• Variety of Threats
– Poaching
– Retribution killing
– Loss of prey
– Loss of habitat (mining)
• Little known about their ecology,
behavior, movement patterns,
survival rates
• More data required to influence
survival
Range spread across 1.5 million km2
Camera trapping since 2009
• 1,700 sq km
• 42 camera
traps
• 8,490 trap
nights
• 4 primary
sampling
periods
• 56 secondary
sampling
periods
• ~1.3 mil
images
Camera Trap Images
Manually classifying
images averages 300
hours per survey
Automated Image
Classification Benefits
Short term
• Thousands of hours of researcher and volunteer
time saved
• Resources redeployed to science and
conservation vs image sorting
• Much more accurate data on range and
population
Long term
• Population numbers that can be accurately
monitored
• Influence governments on protected areas
• Enhance community based conservation
programs
• Predict threats before they happen
www.snowleopard.org
How can you help?
We need more camera surveys!
• 1,700 sq km surveyed of
1,500,000
• $500 will buy an additional
camera
• $5,000 will fund a researcher
• Any amount helps
Contact me directly at rhetick@hotmail.com or donate online.
Microsoft Machine Learning Library
for Apache Spark (MMLSpark)
GitHub Repo: https://github.com/Azure/mmlspark
Get started now using Docker image:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888 to view example Jupyter notebooks
Challenges when using Spark for ML
• User needs to write lots of “ceremonial” code to prepare
features for ML algorithms.
– Coerce types and data layout to that what’s expected by learner
– Use different conventions for different learners
• Lack of domain-specific libraries: computer vision or text
analytics…
• Limited capabilities for model evaluation & model
management
Vision of Microsoft Machine Learning Library for
Apache Spark
• Enable you to solve ML problems dealing with large amounts of
data.
• Make you as a professional data scientist more productive on
Spark, so you can focus on ML problems, not software
engineering problems.
• Provide cutting edge ML algorithms on Spark: deep learning,
computer vision, text analytics…
• Open Source at GitHub
Example: Hello MMLSpark
Predict income from US Census data
Using Plain PySpark
Using MMLSpark
from mmlspark import TrainClassifier, ComputeModelStatistics
from pyspark.ml.classification import LogisticRegression
model = TrainClassifier(model=LogisticRegression(), labelCol=" income").fit(train)
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
Algorithms
• Deep learning through Microsoft Cognitive Toolkit
(CNTK)
– Scale-out DNN featurization and scoring. Take an existing
DNN model or train locally on big GPU machine, and
deploy it to Spark cluster to score large data.
– Scale-up training on a GPU VM. Preprocess large data on
Spark cluster workers and feed to GPU to train the DNN.
• Scale-out algorithms for “traditional” ML through
SparkML
Design Principles
• Run on every platform and on language supported by Spark
• Follow SparkML pipeline model for composability. MML-Spark consists
of Estimators and Transforms that can be combined with existing
SparkML components into pipelines.
• Use SparkML DataFrames as common format. You can use existing
capabilities in Spark for reading in the data to your model.
• Consistent handling of different datatypes – text, categoricals, images
– for different algorithms. No need for low-level type coercion,
encoding or vector assembly.
Deep Neural Net Featurization
• Basic idea: Interior layers of pre-trained DNN models have high-order information about
features
• Using “headless” pre-trained DNNs allows us to extract really good set of features from
images that can in turn be used to train more “traditional” models like random forests,
SVM, logistic regression, etc.
– Pre-trained DNNs are typically state-of-the-art models on datasets like ImageNet, MSCoco or
CIFAR, for example ResNet (Microsoft), GoogLeNet (Google), Inception (Google), VGG, etc.
• Transfer learning enables us to train effective models where we don’t have enough data,
computational power or domain expertise to train a new DNN from scratch
• Performance scales with executors
High-order features
Predictions
DNN Featurization using MML-Spark
cntkModel = CNTKModel().setInputCol("images").
setOutputCol("features").setModelLocation(resnetModel).
setOutputNode("z.x")
featurizedImages = cntkModel.transform(imagesWithLabels).
select(['labels','features'])
model = TrainClassifier(model=LogisticRegression(),labelCol="labels").
fit(featurizedImages)
The DNN featurization is incorporated as SparkML pipeline stage. The evaluation
happens directly on JVM from Scala: no Python UDF overhead!
Image Processing Transforms
DNNs are often picky about their input data shape and normalization.
We provide bindings to OpenCV image ingress and processing operations,
exposed as SparkML PipelineStages:
images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1)
tr = ImageTransform().setOutputCol("transformed")
.resize(height = 200, width = 200)
.crop(0, 0, height = 180, width = 180)
smallImages = tr.transform(images).select("transformed")
Image Pipeline for Object Detection and
Classification
• In real data, objects are often not framed and 1 per image. There can be manycandidate sub-
images.
• You could extract candidate sub-images using flatMapsand UDFs, but this gets cumbersome quickly.
• We plan to provide simplified pipeline components to support generic object detection workflow.
Detect
candidate
ROI
Extract
sub-image
Classify
sub-images
Prediction:
Cloud,
None,
Sun,
Smiley,
None
…
Virtual network
Training of DNNs on GPU node
• GPUs are very powerful for training DNNs. However, running an entire cluster of GPUs is often too
expensive and unnecessary.
• Instead, load and prep large data on CPU Spark cluster, then feed the prepped data to GPU node on
virtual network for training. Once DNN is trained, broadcast the model to CPU nodes for evaluation.
learner = CNTKLearner(brainScript=brainscriptText, dataTransfer='hdfs-mount',
gpuMachines=‘my-gpu-vm’, workingDir=‘file:/tmp/’).fit(trainData)
predictions = learner.setOutputNode(‘z’).transform(testData)
Raw
data
Processed data
as DataFrame
Trained DNN as
PipelineStage
Text Analytics
• Goal: provide one-step text featurization capability that lets you
take free-form text columns and turn them into feature vectors for
ML algorithms.
– Tokenization, stop-word removal, case normalization
– N-Gram creation, feature hashing, IDF weighing.
• Future: multi-language support, more advanced text
preprocessing and DNN featurization capabilities.
Example: Text analytics for document
classification by sentiment
from pyspark.ml import Pipeline
from mmlspark import TextFeaturizer, SelectColumns, TrainClassifier
from pyspark.ml.classification import LogisticRegression
textFeaturizer = TextFeaturizer(inputCol="text", outputCol = "features",
useStopWordsRemover=True, useIDF=True, minDocFreq=5,
numFeatures=2**16)
columnSelector = SelectColumns(cols=["features","label"])
classifier = TrainClassifier(model = LogisticRegression(), labelCol='label')
textPipeline = Pipeline(stages= [textFeaturizer,columnSelector,classifier])
Easy to get started
GitHub Repo: https://github.com/Azure/mmlspark
Get started now using Docker image:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888 for example Jupyter notebooks
Spark package installable for generic Spark 2.1 clusters
Script Action installation on Azure HDInsight cluster

More Related Content

What's hot

Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Spark Summit
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
Víctor Zabalza
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
Víctor Zabalza
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Naoki (Neo) SATO
 

What's hot (20)

Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves MabialaDeep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with DaskAUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
AUTOMATED DATA EXPLORATION - Building efficient analysis pipelines with Dask
 
Lens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgetsLens: Data exploration with Dask and Jupyter widgets
Lens: Data exploration with Dask and Jupyter widgets
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Intro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNetIntro to Scalable Deep Learning on AWS with Apache MXNet
Intro to Scalable Deep Learning on AWS with Apache MXNet
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
20160908 hivemall meetup
20160908 hivemall meetup20160908 hivemall meetup
20160908 hivemall meetup
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learning
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
 

Similar to Data Science and Deep Learning on Spark with 1/10th of the Code with Roope Astala and Sudarshan Ragunathan ()

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
hdhappy001
 

Similar to Data Science and Deep Learning on Spark with 1/10th of the Code with Roope Astala and Sudarshan Ragunathan () (20)

BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
DeepLearning001&ApacheMXNetWithSparkForInference-ACNA2018
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Deep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an IntroductionDeep Learning with Apache Spark: an Introduction
Deep Learning with Apache Spark: an Introduction
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 

Data Science and Deep Learning on Spark with 1/10th of the Code with Roope Astala and Sudarshan Ragunathan ()

  • 1. Data science and deep learning on Spark with 1/10th of the code Roope Astala, Sudarshan Ragunathan Microsoft Corporation
  • 2. Agenda • User story: Snow Leopard Conservation • Introducing Microsoft Machine Learning Library for Apache Spark – Vision: Productivity, Scalability, State-of-Art Algorithms, Open Source – Deep Learning and Image Processing – Text Analytics Disclaimer:​ Any roadmap items are subject to change without notice. Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
  • 4. Snow leopards • 3,900-6,500 individuals left in the wild • Variety of Threats – Poaching – Retribution killing – Loss of prey – Loss of habitat (mining) • Little known about their ecology, behavior, movement patterns, survival rates • More data required to influence survival
  • 5. Range spread across 1.5 million km2
  • 6.
  • 7.
  • 8. Camera trapping since 2009 • 1,700 sq km • 42 camera traps • 8,490 trap nights • 4 primary sampling periods • 56 secondary sampling periods • ~1.3 mil images
  • 9. Camera Trap Images Manually classifying images averages 300 hours per survey
  • 10. Automated Image Classification Benefits Short term • Thousands of hours of researcher and volunteer time saved • Resources redeployed to science and conservation vs image sorting • Much more accurate data on range and population Long term • Population numbers that can be accurately monitored • Influence governments on protected areas • Enhance community based conservation programs • Predict threats before they happen
  • 11. www.snowleopard.org How can you help? We need more camera surveys! • 1,700 sq km surveyed of 1,500,000 • $500 will buy an additional camera • $5,000 will fund a researcher • Any amount helps Contact me directly at rhetick@hotmail.com or donate online.
  • 12. Microsoft Machine Learning Library for Apache Spark (MMLSpark) GitHub Repo: https://github.com/Azure/mmlspark Get started now using Docker image: docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark Navigate to http://localhost:8888 to view example Jupyter notebooks
  • 13. Challenges when using Spark for ML • User needs to write lots of “ceremonial” code to prepare features for ML algorithms. – Coerce types and data layout to that what’s expected by learner – Use different conventions for different learners • Lack of domain-specific libraries: computer vision or text analytics… • Limited capabilities for model evaluation & model management
  • 14. Vision of Microsoft Machine Learning Library for Apache Spark • Enable you to solve ML problems dealing with large amounts of data. • Make you as a professional data scientist more productive on Spark, so you can focus on ML problems, not software engineering problems. • Provide cutting edge ML algorithms on Spark: deep learning, computer vision, text analytics… • Open Source at GitHub
  • 15. Example: Hello MMLSpark Predict income from US Census data
  • 17.
  • 18.
  • 19. Using MMLSpark from mmlspark import TrainClassifier, ComputeModelStatistics from pyspark.ml.classification import LogisticRegression model = TrainClassifier(model=LogisticRegression(), labelCol=" income").fit(train) prediction = model.transform(test) metrics = ComputeModelStatistics().transform(prediction)
  • 20. Algorithms • Deep learning through Microsoft Cognitive Toolkit (CNTK) – Scale-out DNN featurization and scoring. Take an existing DNN model or train locally on big GPU machine, and deploy it to Spark cluster to score large data. – Scale-up training on a GPU VM. Preprocess large data on Spark cluster workers and feed to GPU to train the DNN. • Scale-out algorithms for “traditional” ML through SparkML
  • 21. Design Principles • Run on every platform and on language supported by Spark • Follow SparkML pipeline model for composability. MML-Spark consists of Estimators and Transforms that can be combined with existing SparkML components into pipelines. • Use SparkML DataFrames as common format. You can use existing capabilities in Spark for reading in the data to your model. • Consistent handling of different datatypes – text, categoricals, images – for different algorithms. No need for low-level type coercion, encoding or vector assembly.
  • 22. Deep Neural Net Featurization • Basic idea: Interior layers of pre-trained DNN models have high-order information about features • Using “headless” pre-trained DNNs allows us to extract really good set of features from images that can in turn be used to train more “traditional” models like random forests, SVM, logistic regression, etc. – Pre-trained DNNs are typically state-of-the-art models on datasets like ImageNet, MSCoco or CIFAR, for example ResNet (Microsoft), GoogLeNet (Google), Inception (Google), VGG, etc. • Transfer learning enables us to train effective models where we don’t have enough data, computational power or domain expertise to train a new DNN from scratch • Performance scales with executors High-order features Predictions
  • 23. DNN Featurization using MML-Spark cntkModel = CNTKModel().setInputCol("images"). setOutputCol("features").setModelLocation(resnetModel). setOutputNode("z.x") featurizedImages = cntkModel.transform(imagesWithLabels). select(['labels','features']) model = TrainClassifier(model=LogisticRegression(),labelCol="labels"). fit(featurizedImages) The DNN featurization is incorporated as SparkML pipeline stage. The evaluation happens directly on JVM from Scala: no Python UDF overhead!
  • 24. Image Processing Transforms DNNs are often picky about their input data shape and normalization. We provide bindings to OpenCV image ingress and processing operations, exposed as SparkML PipelineStages: images = spark.readImages(IMAGE_PATH, recursive = True, sampleRatio = 0.1) tr = ImageTransform().setOutputCol("transformed") .resize(height = 200, width = 200) .crop(0, 0, height = 180, width = 180) smallImages = tr.transform(images).select("transformed")
  • 25. Image Pipeline for Object Detection and Classification • In real data, objects are often not framed and 1 per image. There can be manycandidate sub- images. • You could extract candidate sub-images using flatMapsand UDFs, but this gets cumbersome quickly. • We plan to provide simplified pipeline components to support generic object detection workflow. Detect candidate ROI Extract sub-image Classify sub-images Prediction: Cloud, None, Sun, Smiley, None …
  • 26. Virtual network Training of DNNs on GPU node • GPUs are very powerful for training DNNs. However, running an entire cluster of GPUs is often too expensive and unnecessary. • Instead, load and prep large data on CPU Spark cluster, then feed the prepped data to GPU node on virtual network for training. Once DNN is trained, broadcast the model to CPU nodes for evaluation. learner = CNTKLearner(brainScript=brainscriptText, dataTransfer='hdfs-mount', gpuMachines=‘my-gpu-vm’, workingDir=‘file:/tmp/’).fit(trainData) predictions = learner.setOutputNode(‘z’).transform(testData) Raw data Processed data as DataFrame Trained DNN as PipelineStage
  • 27. Text Analytics • Goal: provide one-step text featurization capability that lets you take free-form text columns and turn them into feature vectors for ML algorithms. – Tokenization, stop-word removal, case normalization – N-Gram creation, feature hashing, IDF weighing. • Future: multi-language support, more advanced text preprocessing and DNN featurization capabilities.
  • 28. Example: Text analytics for document classification by sentiment from pyspark.ml import Pipeline from mmlspark import TextFeaturizer, SelectColumns, TrainClassifier from pyspark.ml.classification import LogisticRegression textFeaturizer = TextFeaturizer(inputCol="text", outputCol = "features", useStopWordsRemover=True, useIDF=True, minDocFreq=5, numFeatures=2**16) columnSelector = SelectColumns(cols=["features","label"]) classifier = TrainClassifier(model = LogisticRegression(), labelCol='label') textPipeline = Pipeline(stages= [textFeaturizer,columnSelector,classifier])
  • 29. Easy to get started GitHub Repo: https://github.com/Azure/mmlspark Get started now using Docker image: docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark Navigate to http://localhost:8888 for example Jupyter notebooks Spark package installable for generic Spark 2.1 clusters Script Action installation on Azure HDInsight cluster