Raela Wang
Databricks
Geospatial Analytics at Scale with Deep Learning and Apache Spark
#UnifiedAnalytics #SparkAISummit
About Me
• Raela Wang
• Solutions Architect @ Databricks
• Specialist in Machine Learning solutions
#UnifiedAnalytics #SparkAISummit
What this talk covers
- Image Processing with Apache Spark
- Object Detection with Transfer Learning
- Deep Learning Pipelines
- Run Geolocalized queries to analyze results
- Magellan
- Demo
Mapping the world
• One of the most ancient big data activities in the world
• Critical for navigation, warfare, commercial exploitation
Mapping the world with images
- A lot of tools and companies now provide geospatial solutions
- Increasingly done with a combination of satellites, airplanes and drones
©Wired/PlanetLabs
Mapping the world with images
Large range of new applications
- Disaster Recovery: flood surveys, fallen trees
- Infrastructure Management: road damage
- Economic Intelligence: roof inclination for solar panels
New Challenges
- Increasing amounts of Rich Data
  - Cost-effective solutions for acquiring data at scale (Drones, CubeSats)
- Difficulty Scaling
  - Traditional tools not designed for scalability: how to work at the scale of a country or a continent?
- Pipelining Challenges
  - Geospatial combines a lot of tools and problems: alignment, image corrections, object detection, …
All these technologies need to communicate data in a timely fashion
# Class IDs from the detection model mapped to display labels and colors
vehicle_classes = {
    18: ('car', 'red'),
    23: ('truck', 'orange'),
    19: ('bus', 'white', 0.0)}
Apache Spark: the glue of big data
- Technologies exist in isolation
  - OpenCV - Image manipulation
  - TensorFlow, Keras, PyTorch - Deep Learning
  - PostGIS, GeoMesa, Magellan - Geospatial Analytics
  - Leaflet.js/OpenStreetMap - Visualization
- Apache Spark ties all these libraries together
  - At scale
  - Allows pipelining
  - Easily move data from one technology to another without having to think about data representation
High-level View of the Pipeline
[Pipeline diagram with stages: map data (XML metadata, UDFs); transfer learning with Deep Learning Pipelines; geospatial analytics with Magellan; analyze and visualize]
Parsing Image Data
Ingesting Images
Spark 2.3 -- ImageSchema to read/write image data
- Use the same schema across packages
  - Scikit-image, MMLSpark, OpenCV, PIL, Deep Learning Pipelines, ...
images = spark.readImages(img_dir,
                          recursive = True,
                          sampleRatio = 0.1)
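With the built-in Spark 2.3 ImageSchema API named above, the same read looks roughly like this (a minimal sketch; img_dir is a placeholder path):

from pyspark.ml.image import ImageSchema

img_dir = "dbfs:/path/to/images"  # placeholder path
images = ImageSchema.readImages(img_dir, recursive=True, sampleRatio=0.1)
images.printSchema()  # image struct: origin, height, width, nChannels, mode, data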
Image Transformations with Spark
- Spark Joins
  - Combine images with XML metadata
- Spark UDFs
  - Eastings and Northings → Latitudes and Longitudes (see the sketch below)
  - Creating image chips and respective coordinates
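One possible shape for the coordinate-conversion UDF, assuming pyproj is available on the cluster and the imagery's UTM zone is known from the XML metadata (EPSG:32633 and the column names below are only placeholders):

from pyproj import Transformer
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType

coord_schema = StructType([StructField("lat", DoubleType()),
                           StructField("lon", DoubleType())])

@udf(coord_schema)
def to_lat_lon(easting, northing):
    # Placeholder UTM zone; in practice it comes from the per-image XML metadata
    t = Transformer.from_crs("EPSG:32633", "EPSG:4326", always_xy=True)
    lon, lat = t.transform(easting, northing)
    return (float(lat), float(lon))

# chips = images_with_metadata.withColumn("coords", to_lat_lon("easting", "northing"))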
[Figure: sliced image chips, each annotated with its (lat, long) coordinates]
Deep Learning
Success of Deep Learning
• Tremendous success of image-based applications
• Increased availability of pre-trained models
• Quickly building domain-specific models using transfer learning
Existing frameworks
● Mostly Python
● Google's TensorFlow is the most popular (easy to install/use)
● PyTorch popular in research
● Others: MXNet, Theano, Caffe, Keras, DeepLearning4J (Java)
Spark Deep Learning Pipelines: Deep Learning with Simplicity
• Open-source Databricks library
• Focuses on ease of use and integration without sacrificing performance
• Primary language: Python
• Includes APIs to transform images
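A rough illustration of transfer learning with this library (not the talk's exact code; train_df and test_df are assumed DataFrames of labeled images):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer

# Reuse a pre-trained InceptionV3 network as a fixed feature extractor,
# then fit a lightweight classifier on top of its features.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train_df)
predictions = model.transform(test_df)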
Geospatial Analytics with Magellan
Common geospatial tasks
● Find all objects within an area
● Build geometries
● Cluster and aggregate similar objects
● Infer geometries (roads, buildings, etc.)
Magellan
• Open-source library for geospatial analytics with Spark
• Understands various formats (GeoJSON, …)
• Performs basic geometric operations at scale (polygon intersections, joins, …)
• Integrates into the Spark SQL engine and builds indices for high performance
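Magellan's geometric predicates are exposed mainly through its Scala/Spark SQL DSL. As a simplified Python stand-in for the "find all objects within an area" task, a plain bounding-box filter over the detection coordinates would look like this (a sketch; predictions, lat and lon are assumed names, and a real polygon query would go through Magellan itself):

from pyspark.sql import functions as F

# Placeholder bounding box (min/max latitude and longitude) for the area of interest
min_lat, max_lat = 37.70, 37.82
min_lon, max_lon = -122.52, -122.35

in_area = predictions.filter(
    F.col("lat").between(min_lat, max_lat) &
    F.col("lon").between(min_lon, max_lon))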
Demo
#UnifiedAnalytics #SparkAISummit
Recap
1) Read images with Spark
2) Parse image data with OpenCV and Spark UDFs
a) Slice images into smaller image chips
b) Generate respective coordinates for image chips
3) Pass data into a pre-trained TensorFlow model and extract predictions with Spark Deep Learning Pipelines
a) Model was trained on the xView dataset
b) Model classifies objects identified in images
4) Visualize identified vehicles on a heatmap (see the sketch below)
5) Cross-check with Magellan
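A minimal sketch of the heatmap in step 4, using folium (a Python wrapper around Leaflet.js, which the deck lists for visualization); the detections list and map center are illustrative:

import folium
from folium.plugins import HeatMap

# Placeholder (lat, lon) pairs; in practice collected from the predictions DataFrame,
# e.g. [(r.lat, r.lon) for r in predictions.select("lat", "lon").collect()]
detections = [(37.77, -122.42), (37.78, -122.41), (37.76, -122.43)]

m = folium.Map(location=[37.77, -122.42], zoom_start=12)  # placeholder map center
HeatMap(detections).add_to(m)
m.save("vehicle_heatmap.html")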
DON’T FORGET TO RATE AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
