This session will explore how to apply geospatial analytics using Apache Spark to high-velocity streaming data (data in motion) and high-volume batch data (data at rest). Demonstrations will be performed throughout the session to cement these concepts.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite (Khai Tran)
Metrics play an important role in data-driven companies like LinkedIn, where we leverage them extensively for reporting, experimentation, and in-product applications. We built an offline platform that helps people define and produce metrics through their transformation code, mostly in Pig or Hive, and metadata-rich configurations. Many of our users would like to look at these metrics in real time. To support this, we recently built an extension to the platform that auto-generates a Samza real-time flow from existing offline transformation code with a single command. Combined with the existing offline platform, this delivered a Lambda architecture without maintaining multiple code bases.
In this talk, we will describe how we use Apache Calcite to translate our offline logic, which serves as the single source of truth, into both Samza code and configuration for real-time execution.
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Case... (Databricks)
With the continued development of semiconductor devices, manufacturing systems have improved the productivity and efficiency of wafer fabrication. As a result, the number of wafers yielded by the fabrication process has been increasing rapidly. However, current software systems for semiconductor wafers are not designed to process such large numbers of wafers. To address this, BISTel (a provider of manufacturing intelligence solutions and services for manufacturers) is building several big data products on Apache Spark, including Trace Analyzer (TA) and Map Analyzer (MA). TA analyzes raw trace data from a manufacturing process: it captures details of all variable changes, big and small, and produces a statistical summary of the traces (min, max, slope, average, etc.). Several of BISTel's customers, including top-tier semiconductor companies, use TA to analyze massive raw trace data from their manufacturing processes; by applying Apache Spark's APIs, TA can handle terabytes of data. MA is an advanced pattern-recognition tool that sorts wafer yield maps and automatically identifies common yield-loss patterns. Some semiconductor companies use MA to identify clustering patterns across more than 100,000 wafers, which qualifies as big data in the semiconductor domain. This talk will introduce these two Apache Spark-based products and present the software techniques used to handle large-scale semiconductor data.
Speakers: Seungchul Lee, Daeyoung Kim
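The per-trace summary statistics TA reports (min, max, average, slope) can be sketched in plain Python; the least-squares slope here is the standard textbook fit, offered only as an illustration of the computation, not BISTel's implementation.

```python
def trace_summary(trace):
    """Summarize one sensor trace: min, max, average, and least-squares slope.

    `trace` is a list of (time, value) samples. The slope is the ordinary
    least-squares fit: cov(t, v) / var(t).
    """
    n = len(trace)
    ts = [t for t, _ in trace]
    vs = [v for _, v in trace]
    mean_t = sum(ts) / n
    mean_v = sum(vs) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in trace)
    var = sum((t - mean_t) ** 2 for t in ts)
    return {
        "min": min(vs),
        "max": max(vs),
        "avg": mean_v,
        "slope": cov / var if var else 0.0,
    }

# A short synthetic trace: the value rises 2 units per time step.
summary = trace_summary([(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)])
print(summary)  # min 1.0, max 7.0, avg 4.0, slope 2.0
```

In TA's setting the same per-trace aggregation would be distributed with Spark's APIs, with one such summary computed per sensor per wafer.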
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Databricks)
Zeus is an efficient, highly scalable, distributed shuffle-as-a-service that powers all data processing (Spark and Hive) at Uber. Uber runs some of the largest Spark and Hive clusters on top of YARN in the industry, which leads to many issues such as hardware failures (burned-out disks) and reliability and scalability challenges.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data Processing (Flink Forward)
http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
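Beam's core idea that windowing is a property of the data (event time) rather than of the engine can be illustrated with a tiny stdlib Python sketch of tumbling event-time windows; this is conceptual only and uses none of Beam's actual API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Group (event_time, key) pairs into fixed-size event-time windows.

    Each event is assigned to the window containing its own timestamp,
    so the result is the same no matter when, where, or in what order
    the events are processed -- the separation of data properties from
    runtime characteristics that Beam's model is built on.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(3, "click"), (7, "click"), (12, "click"), (14, "view")]
print(tumbling_window_counts(events, 10))
# {(0, 'click'): 2, (10, 'click'): 1, (10, 'view'): 1}
```

In Beam itself the same intent is declared once (a windowing strategy on a PCollection) and then executed unchanged on Flink, Spark, or Cloud Dataflow.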
A talk given by Julian Hyde at Flink Forward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
A Journey to Building an Autonomous Streaming Data Platform—Scaling to Trilli... (Databricks)
We will present the design and evolution of Nvidia's 100% self-service streaming big-data platform (ETL, analytics, AI training and inferencing) powered by Spark and Nvidia GPUs. We will discuss the architecture, major challenges we faced, and lessons learned along the way. Nvidia's data platform processes tens of billions of events per day, supporting several Nvidia products such as GPU Cloud, GeForce NOW cloud gaming, AI Smart Cities, and DriveSim for self-driving cars. In this talk, we are going to deep dive into Nvidia's next-generation data platform, with new custom-built frameworks, automation tools, and a monitoring system on top of Spark, empowering our developers to build new Spark-powered applications at the speed of light (SOL) with fully self-service, unified data flows. We will showcase these new tools: a) zero-engineering dashboards, b) out-of-the-box Spark Streaming applications with automated schema management, c) a custom Spark Streaming to Elasticsearch connector with enhanced security, d) GDPR-compliant SQL access control and auditing with a new custom token management framework, and e) migration from Logstash clusters to Spark Streaming for log parsing. We will discuss how decoupling the data platform and applications helped us achieve the next level of scale, self-service, and security. Finally, we will demo our platform's app store, where developers can shop for new apps and deploy them with ease, with automated dashboards, streaming ETL, analytics, monitoring, and AI training and inferencing.

Extended description: With structured telemetry events and unstructured logs growing at a 1000% rate year-over-year, it is extremely important to handle this scale with strict SLAs and high reliability while maintaining extremely low latency. We will discuss how we handled these scaling and security concerns to meet business requirements. Additionally, we will be open-sourcing some of our custom Spark frameworks during the talk.
Speakers: Satish Dandu, Rohit Kulkarni
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Cost-Based Optimizer in Apache Spark 2.2 (Databricks)
Apache Spark 2.2 ships with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length) to improve the quality of query execution plans. Leveraging these reliable statistics helps Spark make better decisions in picking the optimal query plan. Examples of these optimizations include selecting the correct build side in a hash join, choosing the right join type (broadcast hash join vs. shuffled hash join), and adjusting a multi-way join order, among others. In this talk, we'll take a deep dive into Spark's cost-based optimizer and discuss how we collect and store these statistics, the query optimizations it enables, and its performance impact on TPC-DS benchmark queries.
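As a minimal sketch of the kind of decision a cost-based optimizer makes, the function below picks a join strategy from per-side size estimates. The 10 MB default mirrors Spark's spark.sql.autoBroadcastJoinThreshold, but the function itself is illustrative, not Spark's actual planner logic.

```python
def choose_join_strategy(left_bytes, right_bytes,
                         broadcast_threshold=10 * 1024 * 1024):
    """Pick a join strategy from estimated relation sizes, CBO-style.

    If the smaller side's estimated size fits under the broadcast
    threshold (Spark's spark.sql.autoBroadcastJoinThreshold defaults to
    10 MB), broadcast that side to every executor and do a hash join
    locally; otherwise fall back to a shuffle-based join.
    """
    smaller = min(left_bytes, right_bytes)
    if smaller <= broadcast_threshold:
        side = "left" if left_bytes <= right_bytes else "right"
        return ("broadcast-hash-join", side)
    return ("shuffle-join", None)

print(choose_join_strategy(5_000_000, 800_000_000))     # broadcast the left side
print(choose_join_strategy(2_000_000_000, 900_000_000)) # both too big: shuffle
```

The quality of such a decision depends entirely on the size and cardinality estimates, which is exactly why the framework described in the talk collects per-column statistics.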
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020 (Eren Avşaroğulları)
Presented by Pavel Hardak and Eren Avsarogullari (ApacheCon 2020)
https://www.linkedin.com/in/pavelhardak/
https://www.linkedin.com/in/erenavsarogullari/
Title:
Apache Spark Development Lifecycle at Workday
Abstract:
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting data processing use cases such as data ingestion, preparation (cleaning, transformation, and publishing), and discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases that layer our custom patches on top of the Spark OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions from a single repo and serving large numbers of customers, each of which can run their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important for catching performance regressions: we run the standard TPC-H and TPC-DS queries against both Spark versions and monitor the runtime behavior of the Spark driver and executors before production. At the deployment phase, we also follow a progressive roll-out plan driven by feature toggles that enable or disable new Spark features at runtime. As part of our development lifecycle, feature toggles help with use cases such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out when dealing with large numbers of customers and long-running Spark applications. Separately, the operation-level runtime behavior of executed Spark queries is important for debugging and troubleshooting. The upcoming Spark release will introduce a new SQL REST API exposing operation-level runtime metrics for executed queries, and we transform these into queryable Hive tables in order to track operation-level runtime behavior per executed query. With all this in mind, this session covers the Spark feature development lifecycle at Workday: the custom Spark upgrade model, the benchmark and monitoring pipeline, and the Spark runtime metrics pipeline, step by step through the patterns and technologies used.
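The feature-toggle pattern described above can be sketched in a few lines of Python. The class, toggle name, and customer IDs here are all hypothetical; Workday's actual implementation is not public, so this only illustrates the general shape of per-customer gating for progressive roll-out.

```python
class FeatureToggles:
    """Minimal per-customer feature-toggle store (illustrative only).

    A real system would back this with a config service and support
    percentage-based roll-out; here a feature maps to the set of
    customers it is enabled for.
    """
    def __init__(self, enabled_per_customer):
        # e.g. {"new_spark_version": {"customer_a"}}
        self._enabled = enabled_per_customer

    def is_enabled(self, feature, customer):
        return customer in self._enabled.get(feature, set())

# Roll the (hypothetical) new Spark runtime out to one customer first.
toggles = FeatureToggles({"new_spark_version": {"customer_a"}})
for customer in ("customer_a", "customer_b"):
    version = "3.x" if toggles.is_enabled("new_spark_version", customer) else "2.x"
    print(customer, "->", version)
```

The same check can gate both build-time choices (which Spark version to compile against) and runtime behavior of long-running applications, which is what makes toggles useful across the whole lifecycle described above.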
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit... (Spark Summit)
This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed, with help from Databricks support, in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline that solves an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high-dimensional labels and relatively low-dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we have come across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced, and walk through the key steps that made our project successful.
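As a rough illustration of a multi-stage decomposition for high-dimensional labels, here is a stdlib Python sketch: stage one models the total, stage two splits it across label dimensions by their average historical shares. This is one plausible shape of such a model, not Cadent's actual method.

```python
def fit_two_stage(history):
    """Fit a toy two-stage forecaster for a high-dimensional label vector.

    `history` is a list of past label vectors. Stage 1: model the total
    (here just its historical mean). Stage 2: model how the total splits
    across dimensions (each dimension's average historical share).
    """
    totals = [sum(labels) for labels in history]
    avg_total = sum(totals) / len(totals)
    dims = len(history[0])
    shares = [
        sum(labels[d] / total for labels, total in zip(history, totals)) / len(history)
        for d in range(dims)
    ]
    return avg_total, shares

def predict(avg_total, shares):
    """Recombine the stages into a full label-vector forecast."""
    return [avg_total * s for s in shares]

# Two-dimensional labels whose split is stable at 80% / 20%.
history = [[8.0, 2.0], [12.0, 3.0], [16.0, 4.0]]
avg_total, shares = fit_two_stage(history)
print([round(x, 2) for x in predict(avg_total, shares)])  # [12.0, 3.0]
```

The point of the decomposition is that each stage is a low-dimensional problem, which is how a large historical record can make an otherwise intractable high-dimensional target workable.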
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at... (Data Con LA)
It isn't easy to drink from the technology firehose of today's Internet economy. At Connexity, we have gone from home-grown MapReduce frameworks and custom in-house search-engines to extensive use of Apache Hadoop, Hive, Pig, Cassandra, Solr and other technologies to power our business. This talk will explore some of the evolutionary steps that we've made and what lessons you might draw from our 15+ years of experience of swimming with the Internet sharks.
Big Data Day LA 2015 - Using data visualization to find patterns in multidime... (Data Con LA)
While machine learning methods have made great strides in predictive analytics, there are many components of data science that still require human intervention. In particular, people are great at finding visual patterns in data. John Tukey was talking about exploratory data analysis in the 1970s, but advances in computer graphics have given us additional powers. I'll demonstrate methods for finding patterns in high-dimensional data, including the generalized pairs plot, the Grand Tour, and the lineup protocol for graphical inference. Of course, we will be implementing these methods using R and Shiny.
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter (Data Con LA)
Tajo is an advanced open source data warehouse system on Hadoop that has evolved rapidly over the past couple of years. In this talk, I will present how Tajo has been improved over that time. In particular, this talk will introduce new features of the most recent major release, Tajo 0.10: HBase storage support, a thin JDBC driver, direct JSON support, and better Amazon EMR support. Then, I will present the upcoming features that the Tajo community is currently working on: a multi-tenant scheduler, allowing multiple users to submit multiple queries to one cluster; nested schema support, allowing users to directly handle complex data types without flattening; and more advanced SQL features such as the WITH clause, window frames, and subqueries.
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist... (Data Con LA)
Leading entrepreneurial outfits are disrupting traditional companies by rapidly building data-driven apps. They employ top software talent and effectively use storage, analytics and app-dev tools from various open source ecosystems. We show how companies of all sizes are now transforming into data-driven enterprises using their existing software skill sets by leveraging a single platform that combines flexible data storage systems, advanced analytics and agile app-dev PaaS frameworks, all available now in open source forums.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban (Data Con LA)
Jay Kreps, open source visionary and co-founder of Confluent and several open source projects, will be visiting LA. I have asked him to come present at our group. He will present his vision and will answer questions regarding Kafka and other projects.
Bio:-
Jay is the co-founder and CEO of Confluent, a company built around real-time data streams and the open source messaging system Apache Kafka. He is the original author of several open source projects, including Apache Kafka, Apache Samza, Voldemort, and Azkaban.
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by... (Data Con LA)
Companies analyzing big data achieve important business objectives such as customer retention, real-time in-context marketing, omni-channel marketing productivity, campaign productivity, and operational efficiencies. Cloud-based big data architectures offer lower risk, lower startup costs, and faster time to market. This session will examine the key advantages of deploying big data in the cloud, such as the flexibility to auto-scale and the ability to experiment with on-demand and hybrid nodes. We will also discuss lessons learned from big data in the cloud, such as how to avoid bottlenecks by building caches and how to design instances to leverage spot pricing.
Big Data Day LA 2016 / Data Science Track - Data Storytelling for Impact - Dav... (Data Con LA)
How can our data make the biggest impact? How do we find the stories worth sharing buried in our analytics? How important are visuals, hooks, connections, content? As data science and journalism have co-evolved, the potential for effectively communicating with data has skyrocketed. We'll look at case studies of impactful data stories and share the process for developing data stories that drive action.
Do you know how the ultra affluent use social media? Find out. (The Social Executive)
The social media real estate you put time into is as important as the suburb you invest in. The right place at the right price is what gives good returns.
For time-poor professionals looking to start out in social media the sheer number of platforms to choose from can feel overwhelming – LinkedIn, Twitter, Facebook, YouTube, Pinterest, Google Plus? It’s a bit like selecting the ‘all suburbs’ search when you are trying to find somewhere to live.
While I am loath to suggest one platform to the exclusion of others (because together they create an amplification effect), if you're a professional or need to reach high-net-worth individuals, then research suggests that a great place to live is LinkedIn.
This is why.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing, including Alternating Least Squares collaborative filtering and PageRank graph processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
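The PageRank graph processing the demos mention boils down to a short power iteration. A minimal pure-Python sketch (the toy "who-likes-whom" graph, damping factor, and tolerance are illustrative defaults, not values from the talk):

```python
# Minimal PageRank power iteration on a toy graph.
# links: dict mapping node -> list of nodes it points to.

def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    nodes = set(links) | {t for ts in links.values() for t in ts}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new = {v: (1.0 - damping) / n for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[src] / n
        converged = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if converged:
            break
    return rank

# "bob" is liked by both other users, so he ends up with the highest rank.
ranks = pagerank({"alice": ["bob"], "bob": ["carol"], "carol": ["alice", "bob"]})
```

The same iteration is what Spark's GraphX runs in parallel over a distributed edge list.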
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
Abstract:-
With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and how it can even do exactly-once processing.
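The exactly-once guarantee rests on one idea: commit processed offsets atomically together with the results, so replayed records after a failure become no-ops. A toy simulation of that idea (no real Kafka or Spark APIs are used; all names here are illustrative):

```python
# Toy simulation of exactly-once processing: the consumer stores the last
# processed offset in the same "atomic" store as its results, so records
# redelivered after a restart are skipped instead of double-counted.

def process_batch(records, store):
    """records: list of (offset, value); store holds results + offset together."""
    for offset, value in records:
        if offset <= store["last_offset"]:
            continue  # already processed: replay is a no-op
        store["total"] += value          # the "result"
        store["last_offset"] = offset    # committed together with the result

store = {"total": 0, "last_offset": -1}
batch = [(0, 10), (1, 20), (2, 30)]
process_batch(batch, store)
process_batch(batch, store)  # simulate redelivery after a failure
```

Had the offset been committed separately (e.g. back to the broker after processing), a crash between the two commits would double-count; keeping them in one transaction closes that window.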
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
6 damaging myths about social media and the truths behind themThe Social Executive
Why, with so much evidence of the value of social media, do so few executives use it? They're anchored to 6 damaging myths about social media that hold them back. Here are the truths.
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Data Con LA
At IRIS.TV, our business builds algorithmic solutions for video recommendation with the end goal to deliver a great user experience as evidenced by users viewing more video content. This talk outlines our reasons for expanding from a descriptive/predictive approach to data analytics toward a philosophy that features more prescriptive analytics, driven by our data science team.
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Data Con LA
Today’s Software Defined environments attempt to remove the weakness of computing hardware from the operational equation. There is no doubt that this is a natural progression away from overpriced, proprietary compute and storage layers. However, at the heart of any Software Defined universe is an underlying hardware stack that must be robust, reliable, and cost effective. Our 20+ years of experience delivering over 2000 clusters and clouds has taught us how to properly design and engineer the right hardware solution for Big Data, Cluster and Cloud environments. This presentation will share this knowledge, allowing users to make better design decisions for any deployment.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes per day of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. This session is especially recommended for data infrastructure engineers and architects planning, building, or maintaining similar systems.
Big Data Day LA 2016/ NoSQL track - Big Data and Real Estate, Jon Zifcak, CEO...Data Con LA
The real estate industry is generating terabytes of data, but only a very small percentage is being utilized or processed. ZULLOO Inc. is creating an artificial intelligence engine utilizing big data and machine learning. The question is, why aren't more data scientists exploring the real estate industry when it represents 15% of US GDP, measured in the trillions?
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...Data Con LA
This talk explores the path taken at Intuit, the maker of TurboTax, Mint and Quickbooks, to operationalize predictive analytics and highlights automations that have allowed Intuit to stay ahead of the fraud curve.
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Data Con LA
There is a novel approach to identifying big data use cases, one which will ultimately lower the barrier to entry to big data projects and increase overall implementation success. This talk describes the approach used by big data pioneer and Datameer CEO Stefan Groschupf to drive over 200 production implementations.
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax Academy
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
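One common way to make geotemporal data tractable, described here as a general pattern rather than DSE's actual indexing, is to partition events by a coarse geohash cell plus a time bucket, so nearby readings from the same hour land in the same partition. A sketch (the precision and bucket granularity are illustrative choices):

```python
# Geohash encoder plus a geotemporal partition-key sketch for IoT events.
from datetime import datetime, timezone

def geohash(lat, lon, precision=6):
    """Standard geohash: interleave lon/lat bisection bits, base32-encode."""
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even, code = 0, 0, True, []
    while len(code) < precision:
        rng = lon_rng if even else lat_rng
        mid = (rng[0] + rng[1]) / 2
        if (lon if even else lat) > mid:
            bits = bits * 2 + 1
            rng[0] = mid
        else:
            bits = bits * 2
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base32 character
            code.append(base32[bits])
            bits, bit_count = 0, 0
    return "".join(code)

def bucket_key(lat, lon, ts, geo_precision=4):
    """Partition key: coarse geohash cell + UTC hour bucket."""
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")
    return (geohash(lat, lon, geo_precision), hour)

key = bucket_key(42.6, -5.6, 0)  # ("ezs4", "1970-01-01-00")
```

Because geohashes are prefix-consistent, a bounding-box query can be turned into a small set of cell prefixes plus a time range, which maps naturally onto partition-key lookups.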
Scalable Data Analytics and Visualization with Cloud Optimized ServicesGlobus
These slides were presented by Esri's Sudhir Shrestha at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
Building an intelligent big data application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon, and Cassandra; Jaws, our open-sourced Spark SQL RESTful service; our contributions to the Spark and Mesos projects; and lessons learned.
This presentation covers architectural principles for Software Defined "Everything", microservices and their impact on Azure, a geospatial fleet analysis using Spark and HDFS with Esri, and flow-based programming.
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
The analysis of large amounts of data requires a NoSQL database, a software framework that supports distributed computing, and a search engine. On these fronts, Amazon Web Services provides the DynamoDB, Elastic MapReduce, and CloudSearch services.
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of Streaming SQL
- Ease of development with streaming SQL: Graphical and Streaming SQL query editors
- Business value of streaming SQL and its related tools: Domain-specific UIs
- Scalable deployment of streaming SQL: Distributed processing
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
Mike Limcaco, Analytics Specialist / Customer Engineer at Google
Measure trends in a particular topic or search term across Google Search in the US, down to the city level. Integrate these data signals into analytic pipelines to drive product, retail, and media (video, audio, digital content) recommendations tailored to your audience segment. We'll discuss how Google's unique datasets can be used with Google Cloud's smart analytics services to process, enrich, and surface the most relevant product or content that matches the ever-changing interests of your local customer segment.
Melinda Thielbar, Data Science Practice Lead and Director of Data Science at Fidelity Investments
From corporations to governments to private individuals, most of the AI community has recognized the growing need to incorporate ethics into the development and maintenance of AI models. Much of the current discussion, though, is meant for leaders and managers. This talk is directed at data scientists, data engineers, ML Ops specialists, and anyone else who is responsible for the hands-on, day-to-day work of building, productionalizing, and maintaining AI models. We'll give a short overview of the business case for why technical AI expertise is critical to developing an AI Ethics strategy. Then we'll discuss the technical problems that cause AI models to behave unethically, how to detect problems at all phases of model development, and the tools and techniques that are available to support technical teams in Ethical AI development.
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
Antje Barth, Principal Developer Advocate, AI/ML at AWS & Chris Fregly, Principal Engineer, AI & ML at AWS
The frequency and severity of natural disasters are increasing. In response, governments, businesses, nonprofits, and international organizations are placing more emphasis on disaster preparedness and response. Many organizations are accelerating their efforts to make their data publicly available for others to use. Repositories such as the Registry of Open Data on AWS and Humanitarian Data Exchange contain troves of data available for use by developers, data scientists, and machine learning practitioners. In this session, see how a community of developers came together through the AWS Disaster Response hackathon to build models to support natural disaster preparedness and response.
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
Sig Narvaez, Executive Solution Architect at MongoDB
MongoDB is now a Developer Data Platform. Come learn what's new in the 6.0 release and Atlas, following all the recent announcements made at MongoDB World 2022. Topics will include:
- Atlas Search which combines 3 systems into one (database, search engine, and sync mechanisms) letting you focus on your product's differentiation.
- Atlas Data Federation to seamlessly query, transform, and aggregate data from one or more MongoDB Atlas databases, Atlas Data Lake and AWS S3 buckets
- Queryable Encryption lets you run expressive queries on fully randomized encrypted data to meet the most stringent security requirements
- Relational Migrator which analyzes your existing relational schemas and helps you design a new MongoDB schema.
- And more!
Data Con LA 2022 - Real world consumer segmentationData Con LA
Jaysen Gillespie, Head of Analytics and Data Science at RTB House
1. Shopkick has over 30M downloads, but the userbase is very heterogeneous. Anecdotal evidence indicated a wide variety of users for whom the app holds long-term appeal.
2. Marketing and other teams challenged Analytics to get beyond basic summary statistics and develop a holistic segmentation of the userbase.
3. Shopkick's data science team used SQL and python to gather data, clean data, and then perform a data-driven segmentation using a k-means algorithm.
4. Interpreting the results is more work -- and more fun -- than running the algorithm itself. We'll discuss how we transform from "segment 1", "segment 2", etc. to something that non-analytics users (Marketing, Operations, etc.) could actually benefit from.
5. So what? How did teams across Shopkick change their approach given what Analytics had discovered?
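The k-means step in point 3 can be sketched in a few lines. This toy version uses made-up (visits_per_week, avg_spend) features and fixed initial centroids for determinism; the actual segmentation ran on far richer user data:

```python
# Toy k-means in pure Python: assign each point to the nearest centroid,
# then move each centroid to the mean of its assigned points, and repeat.

def kmeans(points, centroids, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Users described by (visits_per_week, avg_spend): two obvious segments.
users = [(1, 5), (2, 6), (1, 4), (9, 80), (10, 90), (11, 85)]
centroids, clusters = kmeans(users, centroids=[(0, 0), (10, 50)])
```

Naming the resulting clusters ("casual browsers", "power shoppers", and so on) is exactly the interpretive work point 4 describes; the algorithm only hands back anonymous groups.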
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
Ravi Pillala, Chief Data Architect & Distinguished Engineer at Intuit
TurboTax is one of the best-known consumer software brands, serving 385K+ concurrent users at its peak. In this session, we start by looking at how user behavioral data and tax domain events are captured in real time using the event bus and analyzed to drive real-time personalization with various TurboTax data pipelines. We will also look at solutions performing analytics on these events with the help of Kafka, Apache Flink, Apache Beam, Spark, Amazon S3, Amazon EMR, Redshift, Athena, and AWS Lambda functions. Finally, we look at how SageMaker is used to create the TurboTax model to predict whether a customer is at risk or needs help.
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
George Mansoor, Chief Information Systems Officer at California State University
Overview of the CSU Data Architecture on moving on-prem ERP data to the AWS Cloud at scale using Delphix for Data Replication/Virtualization and AWS Data Migration Service (DMS) for data extracts
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
Anand Ranganathan, Chief AI Officer at Unscrambl
Conversational AI is getting more and more widely used for customer support and employee support use-cases. In this session, I'm going to talk about how it can be extended for data analysis and data science use-cases ... i.e., how users can interact with a bot to ask analytical questions on data in relational databases.
This allows users to explore complex datasets using a combination of text and voice questions, in natural language, and then get back results in a combination of natural language and visualizations. Furthermore, it allows collaborative exploration of data by a group of users in a channel in platforms like Microsoft Teams, Slack or Google Chat.
For example, a group of users in a channel can ask questions to a bot in plain English like "How many cases of Covid were there in the last 2 months by state and gender" or "Why did the number of deaths from Covid increase in May 2022", and jointly look at the results that come back. This facilitates data awareness, data-driven collaboration and joint decision making among teams in enterprises and outside.
In this talk, I'll describe how we can bring together various features including natural-language understanding, NL-to-SQL translation, dialog management, data story-telling, semantic modeling of data and augmented analytics to facilitate collaborate exploration of data using conversational AI.
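One small piece of the pipeline, the NL-to-SQL translation, can be illustrated with a toy template matcher. A production system would use a learned semantic parser; the pattern, table, and column names below are invented for the example:

```python
# Illustrative-only NL-to-SQL sketch: map "how many X ... by A and B"
# onto a GROUP BY query via a regex template. Real systems use trained
# semantic parsers plus a semantic model of the schema.
import re

def to_sql(question, table):
    m = re.match(r"how many (\w+).*? by (\w+) and (\w+)", question.lower())
    if not m:
        return None  # question doesn't fit this toy template
    entity, dim1, dim2 = m.groups()
    return (f"SELECT {dim1}, {dim2}, COUNT(*) AS n_{entity} "
            f"FROM {table} GROUP BY {dim1}, {dim2}")

sql = to_sql(
    "How many cases of Covid were there in the last 2 months by state and gender",
    table="covid_cases",
)
```

Even this crude version shows the shape of the problem: the bot must recognize the measure ("cases"), the grouping dimensions ("state", "gender"), and map them onto schema elements before any SQL can run.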
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
Anil Inamdar, VP & Head of Data Solutions at Instaclustr
The most modernized enterprises utilize polyglot architecture, applying the best-suited database technologies to each of their organization's particular use cases. To successfully implement such an architecture, though, you need a thorough knowledge of the expansive NoSQL data technologies now available.
Attendees of this Data Con LA presentation will come away with:
-- A solid understanding of the decision-making process that should go into vetting NoSQL technologies and how to plan out their data modernization initiatives and migrations.
-- The types of functionality that best match the strengths of NoSQL key-value stores, graph databases, columnar databases, document databases, time-series databases, and more.
-- How to navigate database technology licensing concerns and recognize the types of vendors they'll encounter across the NoSQL ecosystem. This includes sniffing out open-core vendors that may advertise as "open source" but are driven by a business model that hinges on achieving proprietary lock-in.
-- How to determine whether vendors offer open-code solutions that apply restrictive licensing, or whether they support true open source technologies like Hadoop, Cassandra, Kafka, OpenSearch, Redis, Spark, and many more that offer total portability and true freedom of use.
Data Con LA 2022 - Intro to Data ScienceData Con LA
Zia Khan, Computer Systems Analyst and Data Scientist at LearningFuze
Data Science tutorial is designed for people who are new to Data Science. This is a beginner level session so no prior coding or technical knowledge is required. Just bring your laptop with WiFi capability. The session starts with a review of what is data science, the amount of data we generate and how companies are using that data to get insight. We will pick a business use case, define the data science process, followed by hands-on lab using python and Jupyter notebook. During the hands-on portion we will work with pandas, numpy, matplotlib and sklearn modules and use a machine learning algorithm to approach the business use case.
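The hands-on flow the session describes (pick a use case, prepare data, fit a model, predict) can be previewed with nothing but the standard library. The session itself uses pandas, numpy, matplotlib, and sklearn in a Jupyter notebook; the ad-spend data below is synthetic:

```python
# Miniature data science workflow: fit a least-squares line y = a*x + b
# in closed form, then predict for a new point.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# "Train" on ad spend vs. sales (synthetic: exactly y = 2x + 1),
# then predict for a held-out spend level.
spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
a, b = fit_line(spend, sales)
prediction = a * 5 + b
```

In the notebook, `fit_line` is replaced by `sklearn.linear_model.LinearRegression` and the lists by a pandas DataFrame, but the fit/predict rhythm is the same.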
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
Mariana Danilovic, Managing Director at Infiom, LLC
We will address:
(1) Community creation and engagement using tokens and NFTs
(2) Organization of DAO structures and ways to incentivize Web3 communities
(3) DeFi business models applied to Web3 ventures
(4) Why Metaverse matters for new entertainment and community engagement models.
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
Curtis ODell, Global Director Data Integrity at Tricentis
Join me to learn about a new end-to-end data testing approach designed for modern data pipelines that fills dangerous gaps left by traditional data management tools—one designed to handle structured and unstructured data from any source. You'll hear how you can use unique automation technology to reach up to 90 percent test coverage rates and deliver trustworthy analytical and operational data at scale. Several real world use cases from major banks/finance, insurance, health analytics, and Snowflake examples will be presented.
Key Learning Objectives
1. Data journeys are complex and you have to ensure integrity of the data end to end across this journey from source to end reporting for compliance
2. Data management tools do not test data; at best they profile and monitor, leaving serious gaps in your data testing coverage
3. Automation with integration into DevOps and DataOps CI/CD processes is key to solving this.
4. How this approach has impact in your vertical
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
Arif Ansari, Professor at University of Southern California
A Super Bowl ad costs $7 million, and each year a few Super Bowl ads go viral. Traditional A/B testing does not predict virality. Some highly shared ads reach over 60 million organic views, which can be more valuable than views on TV. Not only are these views voluntary, but they are typically without distraction and win viewer engagement in the form of likes, comments, or shares. A Super Bowl ad that wins 69 million views on YouTube (e.g., Alexa Mind Reader) costs less than 10 cents per quality view! The challenge, however, is triggering virality. We developed a method to predict virality and engineer virality into ads.
1. Prof. Gerard J. Tellis and co-authors recommended that advertisers use YouTube to tease, test, and tweak (TTT) their ads to maximize sharing and viewing. 2022 saw that maxim put into practice.
2. We developed viral Ads prediction using two scientific models:
a. Prof. Gerard Tellis et al.'s model for viral prediction
b. Deep Learning viral prediction using social media effect
3. The model identified all of the top 15 viral ads and performed better than traditional agencies.
4. The newly proposed method is Tease, Test, Tweak, Target and Spot Ads.
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
Jai Bansal, Senior Manager, Data Science at Aetna
This talk describes an internal data product called Member Embeddings that facilitates modeling of member medical journeys with machine learning.
Medical claims are the key data source we use to understand health journeys at Aetna. Claims are the data artifacts that result from our members' interactions with the healthcare system. Claims contain data like the amount the provider billed, the place of service, and provider specialty. The primary medical information in a claim is represented in codes that indicate the diagnoses, procedures, or drugs for which a member was billed. These codes give us a semi-structured view into the medical reason for each claim and so contain rich information about members' health journeys. However, since the codes themselves are categorical and high-dimensional (10K cardinality), it's challenging to extract insight or predictive power directly from the raw codes on a claim.
To transform claim codes into a more useful format for machine learning, we turned to the concept of embeddings. Word embeddings are widely used in natural language processing to provide numeric vector representations of individual words.
We use a similar approach with our claims data. We treat each claim code as a word or token and use embedding algorithms to learn lower-dimensional vector representations that preserve the original high-dimensional semantic meaning.
This process converts the categorical features into dense numeric representations. In our case, we use sequences of anonymized member claim diagnosis, procedure, and drug codes as training data. We tested a variety of algorithms to learn embeddings for each type of claim code.
We found that the trained embeddings showed relationships between codes that were reasonable from the point of view of subject matter experts. In addition, using the embeddings to predict future healthcare-related events outperformed other basic features, making this tool an easy way to improve predictive model performance and save data scientist time.
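The core trick, codes that appear in similar contexts end up with similar vectors, can be illustrated with simple co-occurrence counts standing in for learned embeddings. The platform described above trains word2vec-style embeddings; the claim sequences and code names here are invented for the sketch:

```python
# Stand-in for claim-code embeddings: represent each code by the counts of
# codes seen near it, then compare codes with cosine similarity. Learned
# embeddings replace these sparse count vectors with dense trained ones.
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sequences, window=1):
    vecs = {}
    for seq in sequences:
        for i, code in enumerate(seq):
            context = seq[max(0, i - window):i] + seq[i + 1:i + 1 + window]
            vecs.setdefault(code, Counter()).update(context)
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda w: sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Invented member claim sequences: two diabetes journeys, one fracture.
claims = [
    ["diabetes", "metformin", "a1c_test"],
    ["diabetes", "insulin", "a1c_test"],
    ["fracture", "xray", "cast"],
]
vecs = cooccurrence_vectors(claims)
```

Because metformin and insulin occur in the same contexts, their vectors are nearly identical, while the fracture codes sit far away, which is exactly the "reasonable relationships" property the subject matter experts checked.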
Data Con LA 2022 - Data Streaming with KafkaData Con LA
Jie Chen, Manager Advisory, KPMG
Data is the new oil. However, many organizations have data fragmented across siloed lines of business. In this talk, we will focus on identifying legacy patterns and their limitations, and on introducing new patterns built on Kafka's core design ideas. The goal is to tirelessly pursue better solutions that let organizations overcome bottlenecks in their data pipelines and modernize their digital assets, ready to scale their businesses. In summary, we will walk through three use cases and recommend dos, don'ts, and takeaways for data engineers, data scientists, and data architects developing forefront data-oriented skills.
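Kafka's core design idea referenced here, an append-only log that many consumers read at their own pace instead of point-to-point pipes between silos, can be modeled in a few lines (class and method names are illustrative, not Kafka's API):

```python
# Toy model of a Kafka-style log: producers append, and each consumer
# group tracks its own read offset, so consumers are fully decoupled.

class Log:
    def __init__(self):
        self.records = []   # the append-only log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def consume(self, group, max_records=10):
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

log = Log()
for event in ["order_placed", "order_paid", "order_shipped"]:
    log.produce(event)

analytics = log.consume("analytics")             # reads all three at once
billing = log.consume("billing", max_records=1)  # reads at its own pace
```

Adding a new downstream team is just a new group name reading the same log, which is why the log pattern breaks the N-to-N integration sprawl of siloed pipelines.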
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
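Conceptually, an automation routes events from triggers (manual, schedule, directory watcher) to one or more actions. The generic sketch below illustrates that routing idea only; it is not FME's API, since FME Flow automations are authored graphically:

```python
# Generic trigger/action dispatcher: register actions per trigger type,
# then fire a trigger with an event payload and collect the results.

class Automation:
    def __init__(self):
        self.routes = {}  # trigger name -> list of actions

    def on(self, trigger, action):
        self.routes.setdefault(trigger, []).append(action)

    def fire(self, trigger, payload):
        return [action(payload) for action in self.routes.get(trigger, [])]

auto = Automation()
# Hypothetical actions for a directory-watcher trigger:
auto.on("directory_watch", lambda p: f"run workspace on {p['path']}")
auto.on("directory_watch", lambda p: f"notify team about {p['path']}")
results = auto.fire("directory_watch", {"path": "new_parcels.gdb"})
```

The same structure covers manual triggers and schedules: only the event source changes, while the registered actions stay the same, which is what makes automations composable.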
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
2. What we do
Geographic Information System (GIS)
• Founded in 1969
• Esri develops GIS software
• Global company with over 350,000 user organizations worldwide
• Headquarters in Redlands, CA
• 80 Esri distributors worldwide
4. Continuous & Batch Analytics
on high velocity & volume spatiotemporal data
[Architecture diagram: Desktop, Web, and Device apps access services from ArcGIS Server with the GeoEvent and GeoAnalytics extensions; pipeline stages: Ingestion, Storage, Continuous Analytics, Batch Analytics, Visualization]
• Ingesting real-time spatiotemporal data
• Performing continuous processing and real-time analytics
• Sending updates and alerts to those who need it, where they need it
6. High Velocity Ingestion
Requirements
• Sustain a single node throughput of tens of thousands of events per second
• Achieve near linear scalability of throughput when adding additional machines
• Gracefully handle bursty data
7. Apache Kafka
Publish-subscribe messaging rethought as a distributed commit log
• Fast
- single broker can handle hundreds of MBs of reads and writes per second
• Scalable
- data streams are partitioned and spread over a cluster of machines
• Durable
- messages are persisted to disk and replicated within the cluster
• Distributed
- cluster-centric design that offers strong durability and fault-tolerance guarantees
8. Apache Spark
A fast and general engine for large-scale data processing
• Unified big data processing
- write streaming jobs the same way you write batch jobs
- can combine streaming with batch and interactive queries
• Spark apps can be written in Java, Scala, Python, and R
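The "write streaming jobs the same way you write batch jobs" point can be sketched as below. The `Event` case class, the `enrichAndFilter` function, and the commented-out call sites are hypothetical illustrations, not the presenter's code; the idea is that one transformation is shared by a batch RDD job and a streaming DStream job.

```scala
// Hypothetical sketch of sharing one transformation between batch and
// streaming jobs (Spark 1.x-era APIs assumed).
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, x: Double, y: Double)

// One transformation, written once against RDDs.
def enrichAndFilter(events: RDD[Event]): RDD[Event] =
  events
    .filter(e => e.x >= -180 && e.x <= 180) // drop events with bad longitudes
    .map(e => e.copy(id = e.id.trim))       // normalize the id field

// Batch: apply directly to an RDD loaded from storage (path is a placeholder).
// val batchResult = enrichAndFilter(sc.objectFile[Event]("hdfs://namenode/events"))

// Streaming: the same function, applied to each micro-batch via transform.
// val streamResult: DStream[Event] = eventStream.transform(enrichAndFilter _)
```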
11. Gracefully Handle Bursty Data
Direct API for Kafka + Back-pressure
• Direct API for Kafka (introduced in Spark 1.3)
- Provides exactly-once semantics and offset ranges
• Back-pressure (planned for Spark 1.5; see SPARK-7398)
- Signals a fast publisher to slow down for a slow subscriber
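A minimal sketch of wiring these two features together, assuming the Spark 1.3+ `spark-streaming-kafka` module; the app name, broker address, and topic name are placeholders:

```scala
// Sketch: Kafka direct stream with back-pressure enabled.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("geo-ingest")
  // Back-pressure (SPARK-7398): let the ingestion rate adapt to how fast
  // batches are actually being processed.
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(1))

// Direct API: no receivers; each RDD partition maps 1:1 to a Kafka
// partition, and Spark tracks the consumed offset ranges itself,
// which is what enables exactly-once semantics.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("vehicle-positions"))
```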
13. GIS Tools for Hadoop
http://esri.github.io/gis-tools-for-hadoop/
• Esri Geometry API for Java:
- Geometry objects: points, lines, polygons
- Spatial relations: intersects, touches, overlaps, …
- Spatial operations: buffer, cut, union, …
• Spatial Framework for Hadoop
- Includes Spatial UDFs (User Defined Functions) that extend Hive
• GeoProcessing Tools for Hadoop
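To make the geometry-API bullet concrete, here is a small sketch using the Esri Geometry API for Java from Scala. The coordinates are illustrative only, and the code assumes the `esri-geometry-api` jar is on the classpath:

```scala
// Sketch: point-in-polygon test with the Esri Geometry API for Java.
import com.esri.core.geometry.{GeometryEngine, Point, Polygon, SpatialReference}

val wgs84 = SpatialReference.create(4326) // WGS 84 lon/lat

// A rough triangular area of interest (x = longitude, y = latitude).
val area = new Polygon()
area.startPath(-117.3, 33.9)
area.lineTo(-117.1, 33.9)
area.lineTo(-117.2, 34.1)

val inside  = new Point(-117.2, 33.95) // lies within the triangle
val outside = new Point(-110.0, 40.0)  // lies far outside it

// Spatial relation "contains": does the polygon contain the point?
GeometryEngine.contains(area, inside, wgs84)
GeometryEngine.contains(area, outside, wgs84)
```

The same geometry objects and relations back the Hive UDFs in the Spatial Framework for Hadoop, which is what makes them reusable from Spark.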
Ch. 8 Geospatial & Temporal Data Analysis
14. High Velocity Geospatial Analytics
Continuous analytics on data-in-motion
• A GeoEvent Service configures the flow of events,
- the Filtering and Processing steps to perform,
- what ingestion stream(s) to apply them to,
- and where to send the results.
=> DAG (Directed Acyclic Graph)
KafkaUtils.createStream(ssc, …)
  .map( event => FieldEnricher.enrich(event, …) )
  .filter( event => IncidentDetector.evaluate(event, …) )
  .map( event => FieldEnricher.enrich(event, …) )
  .map( event => FieldMapper(event, …) )
  .saveTo…
20. High Velocity & Volume Storage
Requirements
• Sustain a write throughput of tens of thousands of events per second
• Achieve growth in volume capacity & write throughput when adding additional machines
• Efficiently access and query a large volume of data
- Query by any combination of id, time, space, and attributes
21. Elasticsearch
Store and Search Data in Real-Time
• Distributed, Scalable, and Highly Available
- Detect new or failed nodes, and reorganize and rebalance data automatically
• Near real-time
- All data is immediately made available for search and analytics
• Spatial and Full Text Search
- Comes with GeoPoint and GeoShape (polygon and polyline)
• RESTful API
• Spark Elasticsearch Connector
- https://github.com/elastic/elasticsearch-hadoop/tree/master/spark/core/main/scala/org/elasticsearch/spark/rdd
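Writing from Spark into Elasticsearch with the connector can be sketched as follows. The index/type name, node address, and document fields are placeholders; the sketch assumes the `elasticsearch-hadoop` (Spark support) dependency and an index whose `location` field is mapped as `geo_point`:

```scala
// Sketch: persisting events to Elasticsearch via elasticsearch-hadoop.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("geo-storage")
  .set("es.nodes", "es-host:9200") // Elasticsearch node(s) to write to
val sc = new SparkContext(conf)

// Each Map becomes one JSON document; a geo_point-mapped "location"
// field lets Elasticsearch index the position for spatial queries.
val docs = sc.parallelize(Seq(
  Map("id" -> "v1", "time" -> 1438387200000L, "location" -> "33.95,-117.20"),
  Map("id" -> "v2", "time" -> 1438387201000L, "location" -> "34.00,-117.10")
))
docs.saveToEs("vehicles/positions")
```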
28. High Velocity & Volume Visualization
Requirements
• Render a map service that has the ability to do aggregation-on-the-fly
- aggregations are calculated at various levels of detail and are specific to each user session
- when zoomed in far enough raw features are returned and rendered
31. ArcGIS API for JavaScript
https://developers.arcgis.com/javascript/
• A lightweight way to embed maps and tasks in web apps
• Connects to any Map Service or Feature Service compliant source
47. Applying Geospatial Analytics Using Apache Spark: Summary
• When working with high velocity & volume spatiotemporal data, we have found the best technology selections are as follows:
- Ingestion = Spark Streaming + Kafka
- Storage = Elasticsearch + Spark Elasticsearch Connector
- Visualization = ArcGIS API for JavaScript + on-the-fly aggregations in Elasticsearch
- Continuous Analytics = Spark Streaming + GIS Tools for Hadoop
- Batch Analytics = Spark Core +/- Spark SQL + GIS Tools for Hadoop
- GIS Tools for Hadoop can be used as a basis to add spatial geometries, relations, and operators to Spark
- http://esri.github.io/gis-tools-for-hadoop/
48. Questions / Feedback?
C. Adam Mollenkopf
Real-Time GIS Capability Lead, Esri
amollenkopf@esri.com
@amollenkopf