Processing Large Datasets for ADAS Applications using Apache Spark


Semantic segmentation is the classification of every pixel in an image or video. The segmentation partitions a digital image into multiple objects, simplifying or changing its representation into something more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications, ranging from perception in autonomous driving to cancer-cell segmentation for medical diagnosis.

Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data, which now extends to 3-D point-cloud data. This growth is compounded by rapid advances in cloud technologies that expand the storage and compute available for such applications. Semantically segmented datasets are a key requirement for improving the accuracy of the inference engines built upon them.

Improving the accuracy and efficiency of these systems directly affects the business value for organizations developing such functionality as part of their AI strategy.

This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance and accuracy. Scientists and engineers leverage domain-specific tools that support the entire workflow: labeling the ground truth, handling data from a wide variety of sources and formats, developing models, and finally deploying those models. Users can scale their deployments on GPU-based cloud infrastructure to build accelerated training and inference pipelines against big datasets, developing locally with ease and then scaling out against large datasets on Spark-based clusters in the cloud.


  1. © 2020 The MathWorks, Inc. Processing Large Datasets for ADAS applications using Spark. Arvind Hosagrahara (ahosagra@mathworks.com), Lead Solutions Architect, May 2021
  2. Outline § The Big Picture – Problem Statement / Challenges § Architecture of the system (ETL, Training, Inference) § Deep-Dive into Spark Workflows – Cluster Management – Interactive exploration – Pushdown of analytics to Spark Clusters – Experiment and metadata management § Design considerations – Scaling, Accuracy and Interoperability – Security, Governance, Model Lifecycle management § Conclusions / Key Takeaways § Q&A
  3. The Big Picture – Automated Driving. How can I analyze & simulate scenarios? How can I design & deploy algorithms? How can I integrate & test systems? Control – Planning – Perception
  4. Semantic Segmentation. What? § Semantic segmentation is the classification of every pixel in an image/video. Why? § Higher-quality perception outcomes: segmented scenes are easier to analyze and support validation of localization modules for automated driving § Automation: faster design iterations (see the sketch below)
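To make slide 4 concrete, here is a minimal MATLAB sketch of pixel-level classification; the trained network file and camera frame are hypothetical placeholders, not artifacts from the deck.

```matlab
% Minimal sketch: classify every pixel of one frame with a trained
% semantic segmentation network (MAT-file and image are placeholders).
S   = load("trainedSegNet.mat");    % hypothetical file holding variable "net"
net = S.net;
I   = imread("highwayFrame.png");   % one camera frame
C   = semanticseg(I, net);          % categorical label for every pixel
B   = labeloverlay(I, C);           % overlay predictions for inspection
imshow(B)
```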
  5. “…autonomous vehicles would have to be driven hundreds of millions of miles and sometimes hundreds of billions of miles to demonstrate their reliability in terms of fatalities and injuries.” [1] Driving to Safety, RAND Corporation: https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf
  6. Problem Statement. How can we: 1. Accelerate development of semantic segmentation models from prototype to production in order to build better perception software. 2. Scale the workflows, capabilities and throughput of engineering processes by leveraging automation, cloud storage, compute and other infrastructure.
  7. Challenges. Business Challenges: § Prototype to production: establish system & software engineering processes § Verification and validation: connect data, embedded software and simulation § Workforce development: hire & integrate engineers with disparate backgrounds. Technical Challenges: § Handling large datasets and specialized formats § Leveraging the cloud to scale computational requirements § Enabling self-serve analytics for the data scientist / engineer
  8. Demonstration / Results (Image-level features with batch normalization)
  10. Demo / Results: 3-D simulation data to train a semantic segmentation network and fine-tune it to real-world data using generative adversarial networks (GANs)
  12. Big Picture: Architecture
  13. Architecture of the demo system (block diagram, built up incrementally and shown in full on slide 16)
  16. A simplified view of the system architecture. Components: Development Environment (Desktop / Cloud); Storage (Large Datasets / Models / Metadata); Model Inference Engine for ETL and Performance Monitoring; Monitoring / Visualization Front-ends; Cloud-based training system (Spark Clusters / GPU support / Fast networks); Recorded and Simulated Data (Batch and Streaming Ingest); Burst, Ingest, and Data access paths; Experiment Management / Governance / Security
  17. The ETL (Extract Transform Load) Pipeline. Components involved: Storage (Large Datasets / Models / Metadata); Model Inference Engine for ETL and Performance Monitoring; Recorded and Simulated Data (Batch and Streaming Ingest); Ingest; Experiment Management / Governance / Security
  18. The Training Pipeline. Components involved: Development Environment (Desktop / Cloud); Storage (Large Datasets / Models / Metadata); Cloud-based training system (Spark Clusters / GPU support / Fast networks); Burst, Ingest, and Data access; Experiment Management / Governance / Security
  19. The Inference Pipeline. Components involved: Storage (Large Datasets / Models / Metadata); Model Inference Engine for ETL and Performance Monitoring; Monitoring / Visualization Front-ends; Burst, Ingest, and Data access; Experiment Management / Governance / Security
  20. The Software Stack (same architecture blocks as slide 16)
  21. The Software Stack, mapped to products: MATLAB® and Simulink®; MATLAB Production Server™; MATLAB Web App Server™; third-party (3P) dashboards; Amazon S3™; AWS Identity and Access Management (IAM); AWS™; MLflow™
  22. The DevOps perspective
  23. Workflows supporting DevOps. Across the Development Environment and Operational Environment, the DevOps phases (PLAN, DESIGN, BUILD, TEST, DEPLOY, OPERATE, MONITOR, FEEDBACK) map to: Cluster Management; JDBC based connectivity to data sources; Databricks-connect and interactive exploration; CI/CD integration; Jobs and Spark-Submit; Model Lifecycle Management; Experiment Management; Coder based acceleration; Embedded Applications; Inference engine; Central result, config, data management
  24. Deep-Dive: Spark Workflows
  25. Workflow: Cluster management and Data Engineering via REST. Engineers, developers, scientists and data analysts drive a Compute Cluster (Cloud/On-Prem) from a MATLAB HTTP Client (jobsubmission.m) through the Databricks REST 2.0 API: create clusters, CRUD with DBFS, create jobs, authentication / tokens, secrets, notifications, libraries, workspace. The cluster runs a Databricks-specific runtime image with the MATLAB Runtime (init_script, Dockerfile, dockerized runtime) alongside the deployed analytics.jar. A sketch of the REST call follows below.
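As a sketch of this REST workflow, the call below creates a small cluster with MATLAB's webwrite. The endpoint and JSON fields follow the public Databricks REST 2.0 API, but the workspace URL, token variable, and runtime version are assumptions for illustration.

```matlab
% Sketch: create a Databricks cluster via the REST 2.0 API
% (workspace URL and field values are placeholders).
host  = "https://myworkspace.cloud.databricks.com";   % hypothetical workspace
token = getenv("DATABRICKS_TOKEN");                   % personal access token
opts  = weboptions("HeaderFields", ["Authorization", "Bearer " + token], ...
                   "MediaType", "application/json");
body  = struct( ...
    "cluster_name",  "matlab-etl", ...
    "spark_version", "7.3.x-scala2.12", ...           % assumed runtime version
    "node_type_id",  "i3.xlarge", ...
    "num_workers",   2);
resp = webwrite(host + "/api/2.0/clusters/create", body, opts);
disp(resp.cluster_id)   % cluster id, reused later for job submission
```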
  26. Video 1 Outline § Planning capacity and performance § Cluster creation § Data and application transfer via the DBFS API
  27. Workflow: JDBC and Database Toolbox (a). Engineers, developers, scientists and data analysts connect from Database Toolbox through ODBC 2.6.x / JDBC 2.6.x drivers to an existing Compute Cluster (Cloud/On-Prem) using SQL.
  28. Workflow: JDBC and Database Toolbox (b). The same ODBC 2.6.x / JDBC 2.6.x driver path reaches a SparkSession on the cluster, issuing SQL against Databases / Delta Tables (see the sketch below).
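A minimal Database Toolbox sketch of this path, assuming a JDBC data source named "Databricks" has already been configured with the cluster's connection string, and a hypothetical Delta table adas.frames:

```matlab
% Sketch: query a Delta table over JDBC with Database Toolbox
% ("Databricks" data source and table name are assumptions).
conn   = database("Databricks", "token", getenv("DATABRICKS_TOKEN"));
frames = fetch(conn, "SELECT frame_id, camera, ts FROM adas.frames LIMIT 100");
close(conn)
head(frames)    % results arrive as a MATLAB table
```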
  29. Workflow: Databricks-connect. Spark API calls made through Databricks-connect build a Spark DAG that executes on the Spark Master of an existing Compute Cluster (Cloud/On-Prem).
  30. Video 2 Outline § Databricks-Connect workflows § Read/Write delta tables
  31. The Training Pipeline
  32. The Training Pipeline (Local Experimentation) § Optimized for GPU § High-level API (“write less and do more”) § Language abstractions to improve accuracy (eg: imageDataAugmenter) § Workflows for ease-of-use and debug – apps for model selection – monitor accuracy, loss and training performance – many quick-start examples (see the sketch below)
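The local-experimentation loop on slide 32 might look like the following sketch; the data locations, class names, input size, and the choice of a DeepLab v3+ network are assumptions for illustration, not details from the deck.

```matlab
% Sketch: local training loop with augmentation (paths/classes assumed).
imds = imageDatastore("data/images");
pxds = pixelLabelDatastore("data/labels", ...
           ["road" "vehicle" "background"], [1 2 3]);   % hypothetical classes/IDs

augmenter = imageDataAugmenter("RandXReflection", true, ...
                               "RandRotation", [-10 10]);
trainingData = pixelLabelImageDatastore(imds, pxds, ...
                   "DataAugmentation", augmenter);

lgraph  = deeplabv3plusLayers([720 960 3], 3, "resnet18");  % assumed input size
options = trainingOptions("sgdm", ...
    "ExecutionEnvironment", "gpu", ...      % optimized for GPU, as on the slide
    "MaxEpochs", 30, "MiniBatchSize", 8, ...
    "Plots", "training-progress");          % monitor accuracy and loss
net = trainNetwork(trainingData, lgraph, options);
```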
  33. Workflow: Push-down via MATLAB Compiler. analytics.m, written against the MATLAB API for Spark or Tall Arrays, is compiled by MATLAB Compiler/SDK into analytics.jar; it is pushed down either as a spark-submit job (for new clusters) or as libraries (for existing clusters), and the resulting Spark DAG runs on the Spark Master of the Compute Cluster (Cloud/On-Prem). A tall-based sketch follows below.
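Deployable tall-array code for this push-down path can be quite small, as in this sketch; the S3 bucket and CSV schema are hypothetical.

```matlab
% analytics.m: a tall-based analytic that MATLAB Compiler/SDK can package
% into analytics.jar (bucket and column names are placeholders).
ds = datastore("s3://my-adas-bucket/telemetry/*.csv");
t  = tall(ds);
% Execution is deferred; gather() triggers the underlying Spark job.
stats = gather(grpstats(t(:, ["camera" "speed"]), "camera"));
disp(stats)
```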
  34. Video 3 Outline § SparkSession and RDD abstractions § Tall-based workflows § Deploytool and app based workflows – Compiler SDK for Java § mcc – Spark Submit § DBFS API to push application to Databricks
  35. Build and Test § Improve quality of code and models by testing on CI servers (self-hosted or cloud-hosted) § Package libraries for deployment § Adhere to agile development practices (see the sketch below). https://www.mathworks.com/solutions/continuous-integration.html
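On a CI server, the test stage often reduces to a few lines like the sketch below; the "tests" folder and the JUnit report path are assumptions about the project layout.

```matlab
% Sketch: run the test suite and export JUnit XML for the CI server
% ("tests" folder and report path are placeholders).
import matlab.unittest.TestRunner
import matlab.unittest.plugins.XMLPlugin

suite  = testsuite("tests", "IncludeSubfolders", true);
runner = TestRunner.withTextOutput;
runner.addPlugin(XMLPlugin.producingJUnitFormat("results/junit.xml"));
results = runner.run(suite);
assertSuccess(results)   % errors out (non-zero exit) to fail the CI job
```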
  36. The Training Pipeline (Cloud-based training) § Infrastructure defined as code § Leverage the latest hardware, optimized drivers and toolkits § APIs for command and control of jobs, artifacts, notifications, execution, etc. § Capital Expenses (CapEx) → Operating Expenses (OpEx)
  37. Management of Experiments and tracking of performance
  38. Video 4 Outline § MLflow Demo – Creation of Experiments / Runs – Logging of Parameters and Metrics – Integration with Experiment Management features (see the sketch below)
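MLflow's tracking server exposes a REST API, so run creation and metric logging from MATLAB can be sketched with webwrite as below. The tracking URL and experiment id are placeholders, and a production setup would add error handling around the run_id returned by runs/create.

```matlab
% Sketch: create an MLflow run and log a parameter and a metric over REST
% (tracking server URL and experiment id are placeholders).
mlflow = "http://mlflow.example.com:5000/api/2.0/mlflow";   % hypothetical server
opts   = weboptions("MediaType", "application/json");
nowMs  = posixtime(datetime("now")) * 1000;                 % epoch milliseconds

run = webwrite(mlflow + "/runs/create", ...
    struct("experiment_id", "0", "start_time", nowMs), opts);
runId = run.run.info.run_id;

webwrite(mlflow + "/runs/log-parameter", ...
    struct("run_id", runId, "key", "miniBatchSize", "value", "8"), opts);
webwrite(mlflow + "/runs/log-metric", ...
    struct("run_id", runId, "key", "valAccuracy", ...
           "value", 0.91, "timestamp", nowMs), opts);
```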
  39. Training Benchmarks and Performance: VGG-16 based network on an NVIDIA Quadro GV100
  40. The Inference Pipeline
  41. The Inference Pipeline § Directed Acyclic Graphs (DAG) used to resolve changes to: – Training models, ETL changes – The training data – New data / changes to the cache § Trained models can be deployed as cloud services § Cloud reference architectures available on github.com to integrate with existing systems
  42. Scaling Inference on the MATLAB Production Server § Scales gracefully § Optimal usage of available CPU and GPU resources § Can increase throughput as desired § Pay-as-you-go (PAYG) models available on the cloud (see the client sketch below)
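Client code calling a model hosted this way typically hits the server's RESTful endpoint. The sketch below assumes a deployed archive named segmodel exposing a segmentFrame function; both names and the host are hypothetical.

```matlab
% Sketch: call a model deployed on MATLAB Production Server over REST
% (host, archive "segmodel", and function "segmentFrame" are placeholders).
url  = "http://mps.example.com:9910/segmodel/segmentFrame";
opts = weboptions("MediaType", "application/json", "Timeout", 60);
% MPS JSON protocol: inputs go in "rhs", requested output count in "nargout".
req  = struct("nargout", 1, "rhs", {{"s3://my-adas-bucket/frame0001.png"}});
resp = webwrite(url, req, opts);
labels = resp.lhs{1};   % decoded result returned by the deployed function
```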
  43. The ETL (Extract-Transform-Load) Pipeline
  44. The ETL Pipeline § Wide variety of automotive and engineering formats (eg: ROSBag, MP4, MDF, BLF, PCAP) § RESTful endpoints or client based access § Connectivity to cloud storage (S3) and distributed streaming systems (Kafka) § Horizontally scalable and designed to run 24x7x365 on headless cloud systems § Provides data proximity when executing on the cloud § Accelerated CUDA execution on NVIDIA® via GPU Coder™ (see the ingestion sketch below)
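For the specialized formats listed, MATLAB has format-specific readers; a minimal ROS bag ingestion sketch (the bag file and topic name are hypothetical) could be:

```matlab
% Sketch: extract camera frames from a ROS bag for the ETL pipeline
% (bag file and topic name are placeholders).
bag    = rosbag("drive_2020_07_01.bag");
camSel = select(bag, "Topic", "/camera/image_raw");
msgs   = readMessages(camSel, 1:10);      % first ten image messages
img    = readImage(msgs{1});              % convert the message to an image
imwrite(img, "frame0001.png")             % hand off to downstream storage
```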
  45. Putting it all together
  46. Build Graph (DAG) § Engine code § Trained models § New/Cached Data § Operates on S3 version metadata § Can be triggered manually or automatically (eg: via CI/CD pipelines)
  47. Design Considerations
  48. Scaling considerations. Storage: § Cloud based storage allows ingestion of larger datasets. Compute: § Quicker training iterations both locally and on the cloud § Faster inference pipeline. Cost: § CapEx → OpEx. Workflows: § Self-serve analytics optimized for agility along a maturity framework. “Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated ‘research’ and ‘engineering’ roles… …engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly.” [1] The Hidden Technical Debt of Machine Learning Systems: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
  49. Accuracy considerations § More training data § Simulation / scene generation – RoadRunner simulation
  50. Interop considerations § ONNX interop (see the sketch below) § Transfer learning techniques § Import/Export to – OpenDRIVE®, – FBX®, – glTF™, – OpenFlight, – OpenSceneGraph, – OBJ, and USD formats § Automated driving simulators and game engines (eg: CARLA, VIRES VTD, NVIDIA DRIVE Sim®, Metamoto®, LGSVL, Baidu Apollo®, Unity®, and Unreal Engine®)
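The ONNX interop on slide 50 corresponds to two documented functions; the file names below are placeholders, and "net" stands in for a trained network such as the one from the training sketch above.

```matlab
% Sketch: ONNX round-trip for interop (file names are placeholders).
exportONNXNetwork(net, "segnet.onnx");          % share a trained MATLAB network
net2 = importONNXNetwork("partnerModel.onnx", ...
           "OutputLayerType", "pixelclassification");  % bring in an external model
analyzeNetwork(net2)                            % inspect the imported layers
```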
  51. Embedded Targeting § GPU Coder™ Support Package for NVIDIA GPUs § NVIDIA® DRIVE™ and Jetson hardware platforms (see the sketch below)
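Targeting Jetson with the GPU Coder support package follows a documented pattern; the entry-point function segmentFrame and the input size below are assumptions for illustration.

```matlab
% Sketch: generate CUDA for a Jetson target with GPU Coder
% (entry-point "segmentFrame" and input size are placeholders).
cfg = coder.gpuConfig("exe");
cfg.Hardware = coder.hardware("NVIDIA Jetson");       % support package target
cfg.Hardware.BuildDir = "~/remoteBuildDir";           % build location on the board
cfg.GenerateExampleMain = "GenerateCodeAndCompile";   % auto-generate a main()
codegen -config cfg segmentFrame -args {ones(720,960,3,'uint8')}
```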
  52. Governance and Lifecycle management § Local experimentation § Centralized tracking
  53. Security Considerations § Cloud storage best practices § ACL based permissions § IAM based identity and access
  54. Conclusions § Well-architected systems [1] accelerate development of automated semantic segmentation against large datasets § Local development provides smooth workflows for developing and refining deep learning models § Cloud-based scaling of compute and storage can be leveraged on Databricks to enable self-service analytics § Techniques that are relevant for ADAS development apply equally to other domains such as medical imaging, geo-exploration, etc. [1] AWS Well-Architected: https://aws.amazon.com/architecture/well-architected/
  55. Key Takeaways § Build upon proven, safety-compliant, and DevOps-ready development tools § Leveraging best-in-class simulation integration platforms and modeling tools leads to faster time-to-market for ADAS development § Domain specific tooling allows practitioners to write less and do more § If this is interesting, please contact us at databricks@mathworks.com for more details or to get started.
  56. Q&A. Contact us at: databricks@mathworks.com
