Dask and V100s for Fast,
Distributed Batch Scoring of
Computer Vision Workloads
Mathew Salvaris
• What is Dask?
• Batch scoring use cases
• Style transfer
• Mask- RCNN for object detection
and segmentation
http://docs.dask.org/en/latest/why.html
Dask for batch scoring
• Same code runs on single node
locally or in a cluster
• No orchestration code to write,
Dask handles orchestration
• Can create and execute complex
DAGs that can be generated on the
fly
Batch scoring use cases
• Style Transfer
• Object detection and segmentation
ReturnModel Write
Style transfer process
Read
Apply
ReturnModel Write
Dask Graph of Style Transfer Process
Load
Image
Load
Image
Load
Image
Load
Image
Stack
Style
Transfer
Write
N
@curry
def process_batch(client, style_model, output_path, batch_filenames):
remote_batch_f = client.scatter(batch_filenames)
img_array_f = client.map(load_image, remote_batch_f)
stacked_array_f = client.submit(stack, img_array_f)
styled_array_f = client.submit(stylize_batch, style_model, stacked_array_f)
return client.submit(write, batch, styled_array_f, output_path)
Testing Dask locally on GPU
• Need GPU based libraries for workers and CPU based ones client
• Would limit the visibility of CUDA devices to each worker
• Spawn as many workers as there are GPUs
CUDA_VISIBLE_DEVICES=0 dask-worker 127.0.0.1 --nprocs 1 --nthreads 1 --resources 'GPU=1
What is Azure ML Pipelines
Create experimentation graph that gets you from data to a model
The pipeline can be exported as a parameterised end point
Dask on Azure pipelines
• Need to set off workers, scheduler and run client
• Azure ML pipelines has MPIStep which allows us to trigger MPI job
• Run workers on all ranks - Run client and scheduler on rank 0
GPU: V100
# Images: 823
180.4
96.7
48.9
25.4
15.5
0
20
40
60
80
100
120
140
160
180
200
1 Process 2 Processes 4 Processes 8 Processes 12 Processes
Time Taken (S)
4-5 images/second
53 images/second
180.4
96.7
48.9
25.4
15.5 12.3
0
20
40
60
80
100
120
140
160
180
200
1 Process 2 Processes 4 Processes 8 Processes 12 Processes
Blob Premium Blob
GPU: V100
# Images: 823
180.4
96.7
48.9
25.4
15.5
106.8
56.5
33.7
16.8
12.3
0
20
40
60
80
100
120
140
160
180
200
1 Process 2 Processes 4 Processes 8 Processes 12 Processes
Blob Premium Blob
69 images/second
8 images/second
GPU: V100
# Images: 823
Azure ML Pipelines
Notebook
Mask-RCNN process
ReturnCombine Write
Read Model
BBOX
Mask
BBOX
Mask
Dask Graph of Mask-RCNN
Load
Image
Load
Image
Load
Image
Load
Image
Preprocessing
Score
Merge image and
annotations
N
Write
@curry
def process_batch(client, style_model, preprocessing, output_path, batch):
remote_batch_f = client.scatter(batch)
img_array_f = client.map(load_image, remote_batch_f)
pre_img_array_f = client.map(preprocessing, img_array_f)
bbox_list_f = client.submit(score_batch, style_model, pre_img_array_f)
results_f = client.submit(loop_annotations, img_array_f, bbox_list_f)
return client.submit(loop_write(output_path), batch, results_f)
Dask on Kubernetes
• Use Helm chart to deploy, workers, scheduler and Jupyter lab
• Provision 3 VMs with 4 V100s each
• Installed Nvidia device plugin
• Installed plugin for storage
Demo Kubernetes
Summary
Able to easily prototype two
batch scoring scenarios locally
then deploy on Kubernetes as
well as Azure pipelines
Using GPUs through Dask could
be more straight forward, better
interaction between Dask and DL
libraries
Acknowledgements
JS Tan
Azure Machine Learning Pipelines Team
Thanks
&
Questions?

Dask for Fast Distributed Batch Scoring of Computer Vision Workloads

  • 1.
    Dask and V100sfor Fast, Distributed Batch Scoring of Computer Vision Workloads Mathew Salvaris
  • 2.
    • What isDask? • Batch scoring use cases • Style transfer • Mask- RCNN for object detection and segmentation
  • 3.
  • 4.
    Dask for batchscoring • Same code runs on single node locally or in a cluster • No orchestration code to write, Dask handles orchestration • Can create and execute complex DAGs that can be generated on the fly
  • 5.
    Batch scoring usecases • Style Transfer • Object detection and segmentation ReturnModel Write
  • 6.
  • 7.
    Dask Graph ofStyle Transfer Process Load Image Load Image Load Image Load Image Stack Style Transfer Write N
  • 8.
    @curry def process_batch(client, style_model,output_path, batch_filenames): remote_batch_f = client.scatter(batch_filenames) img_array_f = client.map(load_image, remote_batch_f) stacked_array_f = client.submit(stack, img_array_f) styled_array_f = client.submit(stylize_batch, style_model, stacked_array_f) return client.submit(write, batch, styled_array_f, output_path)
  • 9.
    Testing Dask locallyon GPU • Need GPU based libraries for workers and CPU based ones client • Would limit the visibility of CUDA devices to each worker • Spawn as many workers as there are GPUs CUDA_VISIBLE_DEVICES=0 dask-worker 127.0.0.1 --nprocs 1 --nthreads 1 --resources 'GPU=1
  • 11.
    What is AzureML Pipelines Create experimentation graph that gets you from data to a model The pipeline can be exported as a parameterised end point
  • 12.
    Dask on Azurepipelines • Need to set off workers, scheduler and run client • Azure ML pipelines has MPIStep which allows us to trigger MPI job • Run workers on all ranks - Run client and scheduler on rank 0
  • 13.
    GPU: V100 # Images:823 180.4 96.7 48.9 25.4 15.5 0 20 40 60 80 100 120 140 160 180 200 1 Process 2 Processes 4 Processes 8 Processes 12 Processes Time Taken (S) 4-5 images/second 53 images/second
  • 14.
    180.4 96.7 48.9 25.4 15.5 12.3 0 20 40 60 80 100 120 140 160 180 200 1 Process2 Processes 4 Processes 8 Processes 12 Processes Blob Premium Blob GPU: V100 # Images: 823
  • 15.
    180.4 96.7 48.9 25.4 15.5 106.8 56.5 33.7 16.8 12.3 0 20 40 60 80 100 120 140 160 180 200 1 Process 2Processes 4 Processes 8 Processes 12 Processes Blob Premium Blob 69 images/second 8 images/second GPU: V100 # Images: 823
  • 16.
  • 17.
  • 18.
    Dask Graph ofMask-RCNN Load Image Load Image Load Image Load Image Preprocessing Score Merge image and annotations N Write
  • 19.
    @curry def process_batch(client, style_model,preprocessing, output_path, batch): remote_batch_f = client.scatter(batch) img_array_f = client.map(load_image, remote_batch_f) pre_img_array_f = client.map(preprocessing, img_array_f) bbox_list_f = client.submit(score_batch, style_model, pre_img_array_f) results_f = client.submit(loop_annotations, img_array_f, bbox_list_f) return client.submit(loop_write(output_path), batch, results_f)
  • 20.
    Dask on Kubernetes •Use Helm chart to deploy, workers, scheduler and Jupyter lab • Provision 3 VMs with 4 V100s each • Installed Nvidia device plugin • Installed plugin for storage
  • 21.
  • 22.
    Summary Able to easilyprototype two batch scoring scenarios locally then deploy on Kubernetes as well as Azure pipelines Using GPUs through Dask could be more straight forward, better interaction between Dask and DL libraries
  • 23.
  • 24.