During the rise of “big data,” the geospatial analytics landscape has grown and evolved. We are beyond just analyzing static maps. Geospatial data streams in from devices, sensors, infrastructure systems, and social media, and our applications and use cases must scale dynamically to meet the increased demand.
The cloud can provide cost-effective storage and the ephemeral burst capacity needed for fast, low-latency processing, all to monetize the immediate value of fresh geospatial data. Geospatial analytics requires optimized spatial data types and algorithms to distill data into knowledge. Such processing, especially under strict latency requirements, has always been a challenge.
We propose an open source big data stack for geospatial analytics in the cloud, based on Apache NiFi, Apache Spark, and LocationTech GeoMesa. GeoMesa is a geospatial framework deployed on a modern big data platform that provides a scalable, low-latency solution for indexing large volumes of historical data and generating live views and streaming geospatial analytics. CONSTANTIN STANCA, Solutions Engineer, Hortonworks and JAMES HUGHES, Mathematician, CCRi
Presentation and live demo performed at DataWorks 2018 Conference - San Jose: https://bit.ly/2xthAGD
Acquisition of Seismic, Hydroacoustic, and Infrasonic Data with Apache NiFi a... - DataWorks Summit
Hadoop Distributed File System (HDFS)-based architectures allow faster ingestion and processing of larger quantities of time-series data than is presently possible in current seismic, hydroacoustic, and infrasonic (SHI) analysis platforms. We have developed a data acquisition and signal analysis system using Hadoop, Accumulo, and NiFi. The data model allows individual waveform samples and their associated metadata to be stored in Accumulo. This is a significant departure from traditional storage practices, where continuous waveform segments are stored with their associated metadata as a single entity. Our design allows for rapid table scans of large data archives within Accumulo for locating, retrieving, and analyzing specific waveform segments directly. The scalability of Hadoop permits the system to accommodate the ingestion and analysis of new data as a sensor network grows. Our system is currently acquiring data from over 200 SHI sensors. Peak ingest rates are approaching 500k entries per second, while preserving constant sub-second access times to any range of entries. The average load produced by the data ingest process consumes less than 10 percent of available system resources. CHARLES HOUCHIN, Computer Scientist, Air Force Technical Applications Center (AFTAC) and JOHN HIGHCOCK, Systems Architect, Hortonworks
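The abstract's key idea is storing each waveform sample as its own Accumulo entry under a time-ordered key. As a rough illustration of why that enables fast range scans, here is a minimal Python sketch of a sortable row-key layout; the station/channel/epoch format is an assumption for illustration, not AFTAC's actual schema.

```python
from datetime import datetime, timezone

def sample_row_key(station: str, channel: str, ts: datetime) -> str:
    """Lexicographically sortable row key so a contiguous Accumulo range
    scan returns one channel's samples in time order (illustrative layout,
    not the talk's actual schema)."""
    epoch_ms = int(ts.timestamp() * 1000)
    # Zero-padding keeps string order identical to time order.
    return f"{station}/{channel}/{epoch_ms:013d}"

# A time-range query becomes a simple key range for the scanner.
start = sample_row_key("IS07", "BDF", datetime(2017, 3, 1, tzinfo=timezone.utc))
end   = sample_row_key("IS07", "BDF", datetime(2017, 3, 2, tzinfo=timezone.utc))
print(start, "->", end)
```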
ExxonMobil’s journey to unleash time-series data with open source technology - DataWorks Summit
At ExxonMobil, we’ve had hundreds of thousands of sensors collecting data at our refineries and chemical plants all over the world for decades. This data has always been critical to the operations and monitoring of our equipment and processes, but it is more valuable now than ever as we bring new life to our archived and current data at a global scale by leveraging a combination of tools in today’s big data ecosystem.
Regardless of the industry you are in, whether it be energy, manufacturing, transportation, telecom, healthcare, or financial services, only a fraction of the time-series data collected and stored in legacy systems is being utilized to its full potential to improve business performance.
In this session you’ll learn how to navigate some of the challenges associated with time-series data at a global scale, as well as pitfalls to avoid, illustrated by real-world use cases with an emphasis on the following:
•NiFi: collection and centralization of data from legacy systems scattered across the globe
•Spark: validation, aggregations, interpolations, and more using in-memory processing (see the sketch after this list)
•HBase/Hive: partitioning, file formats, and storage strategies relevant to time-series data
•Consumption APIs: empowering users to work with the analytical tool of their choice through rich APIs. KEVIN BROWN, Big Data Platform Engineer, ExxonMobil
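To make the Spark bullet concrete, here is a minimal PySpark sketch of the validate-then-aggregate pattern described above; the tag names, valid value range, and one-minute window are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-rollup").getOrCreate()

# Hypothetical readings: one row per (sensor tag, timestamp, value).
readings = spark.createDataFrame(
    [("plant1/T-101", "2018-06-01 00:00:03", 87.2),
     ("plant1/T-101", "2018-06-01 00:00:58", None),    # bad sample
     ("plant1/T-101", "2018-06-01 00:01:41", 88.0)],
    ["tag", "ts", "value"],
).withColumn("ts", F.to_timestamp("ts"))

# Validation: drop nulls and out-of-range values, then aggregate into
# one-minute buckets -- the rollup pattern the bullet above refers to.
clean = readings.where(F.col("value").isNotNull() &
                       F.col("value").between(-50.0, 500.0))
per_minute = (clean
              .groupBy("tag", F.window("ts", "1 minute").alias("w"))
              .agg(F.avg("value").alias("avg_value"),
                   F.count("*").alias("samples")))
per_minute.show(truncate=False)
```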
Bridging the gap: achieving fast data synchronization from SAP HANA by levera... - DataWorks Summit
American Water will share the success story of its production use case: leveraging Hadoop and streaming to ingest and supply de-normalized data from source transactional systems to end-user applications. It covers the end-to-end flow and the challenges faced.
The data is de-normalized into single subject views at the source to eliminate complex join logic during ingestion into the data lake. Within the views, only timestamps on highly volatile tables have been exposed to give visibility to updates and inserts that have occurred on a table. NiFi ingests the data with a custom processor and then stores it in ACID tables in Hive. The custom processor polls the timestamp columns and generates paginated queries that return only the delta.
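As a rough sketch of what such generated delta queries could look like, the following Python snippet builds paginated SQL from a watermark; the view name, column name, and paging style are hypothetical assumptions, and the real NiFi processor's SQL may differ.

```python
def delta_queries(view: str, ts_col: str, watermark: str,
                  pages: int = 3, page_size: int = 10000):
    """Yield paginated SQL pulling only rows changed since the watermark.
    A sketch of the queries the custom NiFi processor could generate;
    names and the LIMIT/OFFSET paging are illustrative assumptions."""
    for page in range(pages):
        yield (f"SELECT * FROM {view} "
               f"WHERE {ts_col} > '{watermark}' "
               f"ORDER BY {ts_col} "
               f"LIMIT {page_size} OFFSET {page * page_size}")

for q in delta_queries("CUSTOMER_VIEW", "CHANGED_AT", "2018-04-01 00:00:00"):
    print(q)
```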
American Water’s use case: our field employees are our front line with customers, and in the past they have felt unable to help customers effectively with older technologies. One of the largest initiatives is to equip field employees with accurate, up-to-date information via a new application so they can provide a great customer experience.
Speakers
John Kuchmek, American Water, Senior Technologist
Adam Michalsky, American Water, Senior Technologist
United Airlines is leveraging big data at the enterprise level to help drive revenue, improve the customer experience, optimize operations, and support our employees in their day-to-day activities. At the center of our big data stack is Apache Hadoop, supported by many other emerging open source frameworks that must be integrated with the myriad operational systems that support a 90-year-old transportation company with worldwide operations. In addition, learn how streaming data and streaming analytics are helping to drive operational decisions in real time, and how this is being architected to scale horizontally to take advantage of high availability and parallel processing. With the rapidly evolving Hadoop ecosystem and so many new open source technologies at our disposal, the options for solving long-standing industry problems such as modeling how customers make decisions, making timely and meaningful real-time offers, and optimizing logistical operations have never been better. JOE OLSON, Senior Manager, Big Data Analytics, United Airlines and JONATHAN INGALLS, Sr. Solutions Engineer, Hortonworks
Data Gloveboxes: A Philosophy of Data Science Data Security - DataWorks Summit
Data scientists often have access to very sensitive material: data! Today's data scientists need a way to interact with toxic data, where spilling even a little could be destructive to a company. Securing compute clusters like the nuclear gloveboxes of old is one technique to limit data exfiltration and ensure data production is regularized, reliable, and secure.
This talk will cover the philosophy and implementation of:
Data Dropbox: data goes in blindly but can be verified via checksums, and data directionality is enforced; HDFS is used as a model, and the state of HBase is discussed (see the checksum sketch after this list).
Data Glovebox: one can manipulate data as desired but cannot exfiltrate it except via very specific, controlled processes; the Oozie Git action is a step in this direction.
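A minimal sketch of the dropbox's checksum verification step, using Python's standard hashlib; the promote-on-match policy is an assumption about how such a dropbox could be wired, not the talk's exact implementation.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large drops verify in constant memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_drop(path: Path, expected_digest: str) -> bool:
    # The dropbox accepts files blindly; the producer publishes the digest
    # out of band, and only files whose digest matches are promoted.
    return sha256_of(path) == expected_digest
```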
Lessons learned processing 70 billion data points a day using the hybrid cloud - DataWorks Summit
NetApp receives 70 billion data points of telemetry each day from its customers’ storage systems. This telemetry contains configuration information, performance counters, and logs. All of this data is processed using multiple Hadoop clusters and feeds a machine learning pipeline and a data-serving infrastructure that produces insights for customers via an application called Active IQ. We describe the evolution of our Hadoop infrastructure from a traditional on-premises architecture to the hybrid cloud, and lessons learned.
We’ll discuss the insights we are able to produce for our customers, and the techniques used. Finally, we describe the data management challenges with our multi-petabyte Hadoop data lake. We solved these problems by building a unified data lake on-premises and using the NetApp Data Fabric to seamlessly connect to public clouds for data science and machine learning compute resources.
Architecting a truly hybrid cloud implementation allowed NetApp to free our data scientists to use any software on any cloud, kept customer log data safe on NetApp Private Storage in Equinix, enabled faster innovation and code releases, and provided the flexibility to use any public cloud while the data stays on NetApp storage in Equinix.
Speakers
Pranoop Erasani, NetApp, Senior Technical Director, ONTAP
Shankar Pasupathy, NetApp, Technical Director, ACE Engineering
MapR on Azure: Getting Value from Big Data in the Cloud - MapR Technologies
Public cloud adoption is exploding and big data technologies are rapidly becoming an important driver of this growth. According to Wikibon, big data public cloud revenue will grow from 4.4% in 2016 to 24% of all big data spend by 2026. Digital transformation initiatives are now a priority for most organizations, with data and advanced analytics at the heart of enabling this change. This is key to driving competitive advantage in every industry.
There is nothing better than a real-world customer use case to help you understand how to get value from big data in the cloud and apply the learnings to your business. Join Microsoft, MapR, and Sullexis on November 10th to:
Hear from Sullexis on the business use case and technical implementation details of one of their oil & gas customers
Understand the integration points of the MapR Platform with other Azure services and why they matter
Know how to deploy the MapR Platform on the Azure cloud and get started easily
You will also get to hear about customer use cases of the MapR Converged Data Platform on Azure in other verticals such as real estate and retail.
Speakers
Rafael Godinho
Technical Evangelist
Microsoft Azure
Tim Morgan
Managing Director
Sullexis
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results - DataWorks Summit
Apache Spark is increasingly adopted as an alternative processing framework to MapReduce, due to its ability to speed up batch, interactive, and streaming analytics. Spark enables new analytics use cases like machine learning and graph analysis with its rich and easy-to-use programming libraries. And it offers the flexibility to run analytics on data stored in Hadoop, across object stores, and within traditional databases. This makes Spark an ideal platform for accelerating cross-platform analytics on-premises and in the cloud. Building on the success of the Spark 1.x releases, Spark 2.x delivers major improvements in the areas of API, performance, and Structured Streaming. In this paper, we will cover a high-level view of the Apache Spark framework, and then focus on what we consider to be very important improvements made in Apache Spark 2.x. We will then share the results of a real-world benchmark effort, detail the Spark and environment configuration changes made to our lab, discuss the results of the benchmark, and provide a reference architecture example for those interested in taking Spark 2.x for their own test drive. This presentation stresses the value of refreshing Spark 1 deployments with Spark 2, as performance testing shows a 2.3x improvement with SparkSQL workloads similar to TPC Benchmark™ DS (TPC-DS). MARK LOCHBIHLER, Principal Architect, Hortonworks and VIPLAVA MADASU, Big Data Systems Engineer, Hewlett Packard Enterprise
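As a hedged illustration of the kind of Spark 2.x SQL knobs such a benchmark exercises, here is a small PySpark configuration and TPC-DS-style query sketch; the setting values and tiny stand-in tables are placeholders, not the talk's tuned configuration or results.

```python
from pyspark.sql import SparkSession

# Illustrative Spark 2.x SQL knobs only -- the talk reports its own tuned
# values and results; none of these numbers come from the benchmark.
spark = (SparkSession.builder
         .appName("tpcds-like-run")
         .config("spark.sql.shuffle.partitions", "400")
         .config("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Tiny stand-in tables so the TPC-DS-style join below actually runs.
spark.range(5).selectExpr(
    "id AS ss_sold_date_sk", "id * 10.0 AS ss_net_paid"
).createOrReplaceTempView("store_sales")
spark.range(5).selectExpr(
    "id AS d_date_sk", "cast(1998 + id AS int) AS d_year"
).createOrReplaceTempView("date_dim")

# The benchmark workloads are plain SparkSQL queries in this shape.
spark.sql("SELECT d_year, SUM(ss_net_paid) AS revenue "
          "FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk "
          "GROUP BY d_year").show()
```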
We’re living in an era of digital disruption, where the accessibility and adoption of emerging digital technologies are enabling enterprises to reimagine their businesses in exciting new ways. Data flows from the edge to the core to the cloud while performing analytics and gaining actionable intelligence at all steps along the way. This connected, automated, and data-driven future enables organizations to rapidly acquire, analyze, and take action on real-time data as well as curate flows for additional analysis at a later stage. New IoT use cases require enterprises to properly handle data in motion and create newer edge applications with data flow management, stream processing, and analytics while still being governed by existing enterprise services.
This session highlights the importance of an edge-to-core-to-cloud digital infrastructure that can adapt to your flexing business needs, capturing expanding data flows at the edge and aligning them to a core infrastructure that can drive insight.
Speakers
Bob Mumford, Hewlett Packard Enterprise, Big Data Solutions Architect
GDPR compliance application architecture and implementation using Hadoop and ... - DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by the owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018, with significant fines for non-compliance.
This session will provide insight on how to approach and implement a GDPR compliance solution using Hadoop and streaming for any enterprise with heavy volumes of data. It will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speakers
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption - DataWorks Summit
Fine-grained data protection at the column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads with integrated support for finding required rows quickly. In this talk, we will outline the progress made in the Apache community toward adding fine-grained column-level encryption natively into the ORC format, which will also provide capabilities to mask or redact data on write while protecting sensitive column metadata, such as statistics, to avoid information leakage. The column encryption capabilities will be fully compatible with the Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally. An end-to-end scenario showing how this capability can be leveraged will also be demonstrated.
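A sketch of what per-column ORC encryption could look like at the table-definition level, shown as DDL built in Python. The orc.encrypt and orc.mask property names follow the Apache ORC column-encryption design this talk covers, but exact syntax and availability depend on ORC/Hive versions, so treat this as illustrative.

```python
# Hedged sketch: a Hive table definition using the per-column encryption
# properties from the Apache ORC design discussed in the talk. Property
# names and syntax are assumptions to be checked against your versions.
# "pii" is a master-key name managed by the Hadoop KMS.
ddl = """
CREATE TABLE customers (
  name  STRING,
  ssn   STRING,
  email STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.encrypt' = 'pii:ssn,email',
  'orc.mask'    = 'nullify:ssn'
)
"""
# 'orc.encrypt' encrypts ssn and email under the "pii" key; 'orc.mask'
# controls what readers without key access see (here, NULLs for ssn).
print(ddl)
```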
A Study Review of Common Big Data Architecture for Small-Medium Enterprise - Ridwan Fadjar
This slide deck was created to present the results of my paper, "A Study Review of Common Big Data Architecture for Small-Medium Enterprise," at MSCEIS, FPMIPA Universitas Pendidikan Indonesia, 2019.
In cooperation with: https://www.linkedin.com/in/faijinali and https://www.linkedin.com/in/fajriabdillah
Kudu as Storage Layer to Digitize Credit Processes - DataWorks Summit
With HDFS and HBase, there are two different storage options available in the Hadoop ecosystem. Both have their strengths and weaknesses. However, neither HDFS nor HBase can be used universally for all kinds of workloads. Usually this leads to complex hybrid architectures. Kudu is a very versatile storage layer which fills this gap and simplifies the architecture of Big Data systems.
A large German bank is using Kudu as a storage layer to accelerate its credit processes. Within this system, financial transactions of millions of customers are analyzed by Spark jobs to categorize transactions and to calculate key figures. In addition to this analytical workload, several frontend applications use the Kudu Java API to perform random reads and writes in real time.
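For readers unfamiliar with Kudu's random-access API, here is a hedged sketch of the read/write pattern the abstract describes, using Kudu's Python client rather than the Java API the bank's applications use; the table and column names are invented, and the snippet assumes a reachable cluster that already has this table.

```python
# Hedged sketch of Kudu random reads/writes via the kudu-python client.
# "transactions", "txn_id", and "category" are illustrative names only.
import kudu

client = kudu.connect(host="kudu-master", port=7051)
table = client.table("transactions")
session = client.new_session()

# Random write: upsert one categorized transaction by primary key.
session.apply(table.new_upsert({"txn_id": 42, "category": "groceries"}))
session.flush()

# Random read: scan just that key back with a predicate.
scanner = table.scanner()
scanner.add_predicate(table["txn_id"] == 42)
print(scanner.open().read_all_tuples())
```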
The presentation will cover these topics:
- Business and technical requirements
- Data access patterns
- System architecture
- Kudu data modelling
- Kudu architecture for High Availability
- Experiences from development and operations
Speaker: Olaf Hein, Department Head & Principal Consultant
ORDIX AG
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
Speaker: Russ Savage, from Cask
Big Data Applications Meetup, 09/14/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
Link to talk: https://youtu.be/4j78g3WvC4Y
About the talk:
As data lake sizes grow, and more users begin exploring and including that data in their everyday analysis, keeping track of the sources for data becomes critical. Understanding how a dataset was generated and who is using it allows users and companies to ensure their analysis is leveraging the most accurate and up to date information. In this talk, we will explore the different techniques available to keep track of your data in your data lake and demonstrate how we at Cask approached and attempted to mitigate this issue.
Supporting Hadoop in containers takes much more than the very primitive support Docker provides via its storage plugin. A production-scale Hadoop deployment inside containers needs to honor affinity/anti-affinity, fault-domain, and data-locality policies. Kubernetes alone, with primitives such as StatefulSets and PersistentVolumeClaims, is not sufficient to support a complex, data-heavy application such as Hadoop. One needs to think about this problem more holistically across the container, networking, and storage stacks. Also, constructs around deployment, scaling, upgrades, etc. in traditional orchestration platforms are designed for applications that have adopted a microservices philosophy, which doesn't fit most big data applications across the ingest, store, process, serve, and visualization stages of the pipeline. Come to this technical session to learn how to run and manage the lifecycle of containerized Hadoop and other applications in the data analytics pipeline efficiently and effectively, far beyond simple container orchestration. #BigData, #NoSQL, #Hortonworks, #Cloudera, #Kafka, #Tensorflow, #Cassandra, #MongoDB, #Kudu, #Hive, #HBase. PARTHA SEETALA, CTO, Robin Systems.
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information about a vehicle, such as its location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context for understanding driving behaviors at a detailed level. However, the collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently, and the raw sensor and GPS data must be properly pre-processed and transformed to extract the signal relevant to downstream processes. In addition, driving behavior often depends on the spatial context, and the analysis of telematics must be contextualized using spatial and real-time traffic data.
Our talk covers the promises and challenges of telematics data. We present a framework for large-scale telematics data analysis using Apache big data tools (Hadoop, Hive, Spark, Kafka, etc.). We discuss common techniques to load and transform telematics data, then present how to use machine learning on telematics data to derive insights about driving safety.
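As one concrete example of turning a raw GPS trace into a driving-safety signal, here is a small PySpark sketch that flags harsh braking from speed deltas; the trace values and the 4 m/s^2 threshold are illustrative assumptions, not the framework from the talk.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("telematics-demo").getOrCreate()

# Hypothetical GPS trace: (trip_id, second offset, speed in m/s).
trace = spark.createDataFrame(
    [("t1", 0, 20.0), ("t1", 1, 19.5), ("t1", 2, 12.0), ("t1", 3, 11.8)],
    ["trip_id", "t", "speed"])

w = Window.partitionBy("trip_id").orderBy("t")
events = (trace
          .withColumn("accel", F.col("speed") - F.lag("speed").over(w))
          # Flag decelerations sharper than 4 m/s^2 as harsh braking.
          .withColumn("harsh_brake", F.col("accel") < -4.0))
events.show()
```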
Speakers
Yanwei Zhang, Senior Data Scientist II, Uber
Neil Parker, Senior Software Engineer, Uber
Improving business performance is never easy! The Natixis Pack is like rugby: working together is key to scrum success. Our data journey would undoubtedly have been much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security, and project management, as well as the industrial usage of Hive, HBase, Kafka, and Spark within a Corporate and Investment Bank environment, framed by regulatory constraints.
SkyWatch is all about making Earth-observation data digestible and accessible. They believe that creating a single place to bring together the planet’s observational datasets will make new waves in geospatial analytics. In this session, we'll take a look at how companies can take advantage of cloud-native workflows to enable access and analysis across planetary-scale datasets. You’ll hear how SkyWatch leveraged AWS serverless technologies to build a company that transforms petabytes of sensor data from space into useful information. You'll also learn how Sinergise is merging a variety of data streams through products like Sentinel Hub, and creating actionable intelligence for its users.
Operationalizing Machine Learning Using GPU-accelerated, In-database Analytics - Kinetica
Mate Radalj's presentation on how to operationalize machine learning using GPU-accelerated, in-database analytics, given at the Bay Area GPU-Accelerated Computing Meetup on October 19, 2017. Presentation includes use cases and links to demos.
Data systems in NASA's Earth Science Division are primarily focused on providing stewardship of the products of remote sensing and are manifested as digital active archive systems. Each instrument team has a related science team that defines the algorithms and monitors the processing of the instruments' output to produce the related data products and to ensure their format and standards compliance. These teams are also influenced by the research and applied-sciences components of the programs, but the primary focus is on proving the ongoing validity of the products. Across the distributed system, every product is different. However, this is not conducive to analytics. NASA's Advanced Information Systems Technology (AIST) program is developing an entirely new approach to creating Analytic Centers, which focus on the scientific investigation and harmonize the data, computing resources, and tools to enable and accelerate scientific discovery. Stay tuned to find out how. A major element of today's science interests is the comparison of multi-dimensional datasets; this warrants considerable experimentation in trying to understand how to do so meaningfully and quantitatively. Asked another way, "What do you mean by similar?" Uncertainty quantification has evolved considerably in the arenas of data reduction and full-physics models; however, the emerging demand for machine learning and other artificial intelligence techniques has failed to keep uncertainty quantification and error propagation in mind, and there is considerable work to be done.
Big data analytics and machine intelligence v5.0 - Amr Kamel Deklel
Why big data
What is big data
When big data is big data
Big data information system layers
Hadoop ecosystem
What is machine learning
Why machine learning with big data
This presentation covers architectural principles for software-defined "everything," microservices and their impact on Azure, a geospatial fleet analysis using Spark and HDFS on ESRI, and flow-based programming.
Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal P... - Accumulo Summit
LocationTech GeoMesa is a project that builds on open-source, distributed databases like Accumulo, HBase, and Cassandra to scale up indexing, querying, and analyzing billions of spatio-temporal data points. GeoMesa uses space-filling curves to index multi-dimensional data in Accumulo, and we'll discuss recent improvements for non-point geometries. Over the two and a half years GeoMesa has been an open-source project, GeoMesa's Accumulo schemas have evolved and our team has had a chance to work through creating and optimizing custom Accumulo iterators. These custom iterators allow for better query performance and interesting aggregations. GeoMesa provides support for distributed processing in Spark via MapReduce input and output formats that extend their Accumulo counterparts. We will discuss the performance benefit gained by reducing the number of default map/Spark tasks created for complex query patterns. The talk will conclude with updates about GeoMesa's integration with Jupyter notebook and improvements to GeoMesa's Spark integration.
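A toy Python version of the space-filling-curve idea: interleaving the bits of grid coordinates yields a Z-order (Morton) key, so spatially close points get lexicographically close keys. GeoMesa's production curves also fold in time and use finer resolutions; this sketch shows only the core trick.

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two grid coordinates into one Z-order
    (Morton) value -- the space-filling-curve idea GeoMesa builds on."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_key(lon: float, lat: float, bits: int = 16) -> int:
    # Scale lon/lat onto a 2^bits x 2^bits grid, then interleave.
    gx = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    gy = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    return interleave(gx, gy, bits)

# Nearby points get nearby keys, so a bounding box becomes a small set
# of contiguous key ranges that Accumulo can scan efficiently.
print(z_key(-78.47, 38.03))   # Charlottesville
print(z_key(-78.50, 38.00))   # close by -> close key
```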
– Speaker –
Dr. James Hughes
Mathematician, Commonwealth Computer Research, Inc (CCRi)
Dr. James Hughes is a mathematician at Commonwealth Computer Research, Inc. in Charlottesville, Virginia. He is a core committer for GeoMesa which leverages Accumulo and other distributed database systems to provide distributed computation and query engines. He is a LocationTech committer for GeoMesa, SFCurve, and GeoBench. He serves on the LocationTech Project Management Committee and Steering Committee. Through work with LocationTech and OSGeo projects like GeoTools and GeoServer, he works to build end-to-end solutions for big spatio-temporal problems. He holds a PhD in algebraic topology from the University of Virginia.
— More Information —
For more information see http://www.accumulosummit.com/
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
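A minimal version of the workshop's train-and-evaluate loop, using the same scikit-learn library on one of its bundled datasets; the model choice and split are arbitrary examples of the kind of steps the lab walks through.

```python
# Plain scikit-learn, runnable anywhere (including CDSW): train a
# classifier on a bundled dataset and report held-out accuracy.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```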
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability HBase requires for WALs. Apache Ratis (incubating) is a Java library implementation of the RAFT consensus protocol and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it with other log-based systems that exist today. Next, we'll cover how the Log Service fits into HBase and the necessary changes to HBase that enable it. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
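To make the durability argument concrete, here is a toy Python model of the RAFT rule the Log Service relies on: an appended WAL entry counts as durable once a majority of replicas have persisted it. This illustrates the protocol's acknowledgement logic only; it is not Ratis's actual API.

```python
class Replica:
    """Toy replica that 'persists' an entry; a real one would fsync to disk."""
    def __init__(self):
        self.log = []

    def persist(self, entry: bytes) -> bool:
        self.log.append(entry)
        return True

def append_durably(entry: bytes, replicas: list) -> bool:
    # RAFT's durability rule: an entry is committed once a majority of
    # replicas have persisted it. Illustration only -- not Ratis's API.
    acks = sum(1 for r in replicas if r.persist(entry))
    return acks >= len(replicas) // 2 + 1

replicas = [Replica() for _ in range(3)]
print(append_durably(b"wal-edit-1", replicas))   # True: 3 acks >= majority of 2
```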
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data with Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables are also a great option, since we can easily put microservices on top of them for application use. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
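Besides Spring Boot, the same Phoenix tables can be queried from Python through the Phoenix Query Server using the phoenixdb driver. A hedged sketch follows; the host, table, and column names are invented for illustration, not the talk's actual schema.

```python
# Hedged sketch: querying a Phoenix table through the Phoenix Query Server
# with the phoenixdb driver. PHILLY_CRIME and its columns are made up; the
# talk's own service does the equivalent from a Spring Boot app in Java.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "SELECT dc_dist, text_general_code, dispatch_date "
    "FROM PHILLY_CRIME WHERE dispatch_date = ? LIMIT 10",
    ["2016-09-01"])
for row in cursor.fetchall():
    print(row)
```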
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it is not trivial to design applications that make the most of it, nor the simplest system to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organizations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges OCLC has encountered scaling to support the world catalog, and how they have overcome them.
Many individuals and organizations have a desire to utilize NoSQL technology but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
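As a preview of the table-design idea, here is a small Python sketch in the spirit of the dirlist example: a zero-padded depth prefix on the row key keeps all entries of one directory contiguous, so a directory listing becomes a single range scan. This simplifies the real example, which also stores file metadata and an index for name search.

```python
import bisect

def dir_row(path: str) -> str:
    """Row key in the style of Accumulo's dirlist example: a zero-padded
    depth prefix groups each directory's children into one contiguous
    key range (simplified from the real example's schema)."""
    depth = path.rstrip("/").count("/")
    return f"{depth:03d}{path}"

rows = sorted(dir_row(p) for p in
              ["/", "/data", "/data/raw", "/data/clean", "/home", "/home/alice"])

# "List /data" = scan depth-2 rows whose path starts with /data/.
lo, hi = "002/data/", "002/data0"   # '0' sorts just after '/'
start, end = bisect.bisect_left(rows, lo), bisect.bisect_left(rows, hi)
print(rows[start:end])              # ['002/data/clean', '002/data/raw']
```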
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing data for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data - whether it is append-only or updatable.
At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. It needs strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, which is critical to scaling our jobs to that write volume. In this talk, we will discuss data at Uber, expound on why we built the global index on Apache HBase, and explain how it helps scale out our cluster usage. We'll detail why we chose HBase over other storage systems, how and why we devised a creative solution that loads HFiles directly into the backend (circumventing the normal write path) when bootstrapping our ingestion tables to avoid QPS constraints, and other lessons learned bringing this system into production at the scale of data Uber encounters daily.
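To make the indexing idea concrete, here is a minimal sketch (not Uber's actual implementation) of a record-key-to-file-location lookup backed by an HBase table: a hit means the incoming change is an update routed to the file that already holds the row, while a miss means an insert. The table, column, and key names are assumptions.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object GlobalIndexSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val index = conn.getTable(TableName.valueOf("trips_index")) // hypothetical table
    try {
      val key = Bytes.toBytes("trip-12345")
      val result = index.get(new Get(key))
      if (result.isEmpty) {
        // Miss: an insert. Record where the new row will land in HDFS.
        val put = new Put(key)
        put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("file"),
          Bytes.toBytes("hdfs://warehouse/trips/part-000123.parquet"))
        index.put(put)
      } else {
        // Hit: an update. Route the change to the file that already holds the row.
        val file = Bytes.toString(
          result.getValue(Bytes.toBytes("loc"), Bytes.toBytes("file")))
        println(s"update goes to $file")
      }
    } finally { index.close(); conn.close() }
  }
}
```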
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. Omid, in turn, has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, and cloud service logs. In addition, multiple data formats need to be transformed and conformed so they can be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced cost-based optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, discuss the best use cases for Presto across several industries, and present recent Presto advancements such as geospatial analytics at scale along with the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project, and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
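To give a sense of scale for "a few lines of code," here is a hedged sketch using MLflow's Java client from Scala (the Python API is the more common choice); the tracking URI, experiment name, and logged values are placeholders, not a recommended setup.

```scala
import java.io.File
import org.mlflow.api.proto.Service.RunStatus
import org.mlflow.tracking.MlflowClient

object TrainWithTracking {
  def main(args: Array[String]): Unit = {
    // Hypothetical tracking server; MLflow also supports local file-based tracking.
    val client = new MlflowClient("http://localhost:5000")
    val experimentId = client.createExperiment("vessel-classifier") // illustrative name
    val runId = client.createRun(experimentId).getRunId

    client.logParam(runId, "learning_rate", "0.01")       // a hyperparameter
    client.logMetric(runId, "accuracy", 0.94)             // an evaluation result
    client.logArtifact(runId, new File("plots/loss.png")) // any file artifact

    client.setTerminated(runId, RunStatus.FINISHED)
  }
}
```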
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users run both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive into deeply in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and to de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail and its implications for the broader Consumer Goods industry, and share the business drivers, use cases, and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways a retail store of the near future could operate: a deep learning system attached to a camera stream can identify various storefront situations, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the full inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having on the Consumer Goods industry, along with the key use cases, techniques, and considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Securing your Kubernetes cluster: a step-by-step guide to success!KatiaHIMEUR1
Today, after several years of existence, backed by an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard for container orchestration. Thanks to a wide range of managed services, it has never been easier to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy,” how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our beloved cloud-native principles to it? What benefits could the two technologies bring to each other?
Let me take these questions and lead you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud or on-premises strategy we may need in order to apply AI to our own infrastructure and make it work from an enterprise perspective. I will give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chains and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
1. High Performance and Scalable Geospatial Analytics on Cloud with Open Source
James Hughes – CCRI
Constantin Stanca – Hortonworks
2. Summary
• Loading geospatial data into the cloud and GeoTools datastores never seems as easy as it should be. There are sensor networks, GPS devices, Twitter streams, FTP servers, and all sorts of other data that you need to parse, convert to SimpleFeatures, and then ingest.
• GeoMesa, NiFi, and Spark provide a fully open source solution to ease the pain of ingesting and analyzing data using ANY GeoTools data store.
• DataPlane Services Cloud Manager (powered by Cloudbreak) helps you deploy ephemeral geospatial analytics clusters to support increased computation requirements, all decoupled from storage.
• We will show how real-time streaming data such as satellite AIS can be ingested and managed in real time with NiFi, and how geospatial data stored in S3, HDFS, or HBase, in ORC or Parquet, can be queried at scale using GeoMesa, Spark, and Zeppelin.
4. Data Movement & System Complexity with Added Pressure of Big Data
[Diagram: many separate Acquire Data → Store Data pipelines, stitched together by a data flow into a single Process and Analyze Data stage.]
5. If That Was Not Enough … Spatial Data Types
• Points: locations, events, instantaneous positions
• Lines: road networks, voyages, trips, trajectories
• Polygons: administrative regions, airspaces
6. If That Was Not Enough … Spatial Data Relationships
equals, disjoint, intersects, touches, crosses, within, contains, overlaps
7. If That Was Not Enough … Topology Operations
Algorithms: convex hull, buffer, validation, dissolve, polygonization, simplification, triangulation, Voronoi, linear referencing, and more...
9. Traditional Approach
• GIS, data crunching, and web serving were three very separate worlds.
• If a web app wanted access to the analysis, there was a long process of ETL, DB work, imports and exports, and bribing various network and storage people for the resources you needed.
10. Requirements for a High Performance Geospatial Analytics Platform
• IoT sensors present an opportunity to understand the world right now
• A map of the current state of the world enables faster reactions
• The variety of sensors and data sources presents data management challenges
• Adding new, varied data sources must be easy
• Big data requires distributed storage/computation and scalable infrastructure
• The data layer has to scale
• Analysis has to be easy
12. How Cloud Helps to Address Geospatial Big Data Challenges
• Challenges:
• Big data problem (derive insights from all data)
• Compute resources when they are needed (easy scale, easy access to data)
• Solution:
• Cloud elastically provides the needed compute resources, all decoupled from storage, whether that is an object store, file system, or NoSQL database.
13. Importance for Geospatial Analytics
• Spatial streaming visualizations and analytics can present near real-time insights
• Decision makers can respond more rapidly when they see live data feeds on a map
• Spatial batch analytics can fuse multiple data sources together to understand a region
• Patterns of life emerge
• Advertisers can plan their next campaigns
• Businesses can locate their new store sites
14. Cloudbreak
• Cloudbreak can be utilized to address geospatial computational capacity needs
• Easily spin up auto-scalable clusters for different workloads and purposes, whether that is a geospatial ingest cluster with NiFi and GeoMesa or a geospatial analytics cluster with Spark and GeoMesa.
• Data can reside in your object store or even in a persistent data store.
• These ephemeral clusters can be scheduled for a period of time or only until the job is done, so you pay for only what you use.
16. How GeoMesa Helps with Geospatial Data Type Challenges
• Challenges:
• Vector & raster data
• Geospatial data types
• Solution:
• GeoMesa tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
17. What Is GeoMesa?
A suite of tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
23. How Does HDP/HDF + GeoMesa Stream Data?
• The GeoMesa Kafka DataStore allows data producers to write CRUD messages to a Kafka topic.
• Consumers of that topic build up an in-memory representation of the current state of the world.
• This allows for live maps, real-time analytics, and complex event processing.
(a minimal producer sketch follows)
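Here is a minimal sketch of a producer writing one feature through the Kafka DataStore via the standard GeoTools API, against recent GeoTools/GeoMesa APIs. The broker and ZooKeeper addresses and the AIS-like schema are illustrative, and the exact parameter keys can vary by GeoMesa version.

```scala
import org.geotools.data.{DataStoreFinder, Transaction}
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import scala.collection.JavaConverters._

object KafkaProducerSketch {
  def main(args: Array[String]): Unit = {
    // Connection parameters for a local setup (assumed keys).
    val params = Map(
      "kafka.brokers"    -> "localhost:9092",
      "kafka.zookeepers" -> "localhost:2181"
    ).asJava
    val ds = DataStoreFinder.getDataStore(params)

    // Illustrative schema; '*' marks the default geometry attribute.
    val sft = SimpleFeatureTypes.createType(
      "vessels", "mmsi:String,dtg:Date,*geom:Point:srid=4326")
    ds.createSchema(sft)

    // Each append becomes a CRUD message on the Kafka topic for this feature type.
    val writer = ds.getFeatureWriterAppend("vessels", Transaction.AUTO_COMMIT)
    val sf = writer.next()
    sf.setAttribute("mmsi", "366999712")
    sf.setAttribute("dtg", new java.util.Date())
    sf.setAttribute("geom", new GeometryFactory().createPoint(new Coordinate(-90.1, 28.9)))
    writer.write()
    writer.close()
    ds.dispose()
  }
}
```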
24. How Does HDP/HDF + GeoMesa Persist Data?
GeoMesa integrates with HBase and Accumulo:
• Key structures use space-filling curves
• Complex geospatial filters and processing can be ‘pushed down’ using Filters, Coprocessors, and Iterators
GeoMesa’s File System DataStore provides the ability to store spatio-temporally indexed data on the S3 cloud object store, in storage formats like ORC or Parquet.
(a toy space-filling-curve sketch follows)
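To illustrate what a space-filling-curve key looks like, here is a toy Z-order (Morton) encoding in Scala: longitude and latitude are normalized to integer grid cells and their bits interleaved, so points near each other on the map tend to land near each other in the sorted key space. GeoMesa's actual curves (Z2/Z3/XZ) are more sophisticated; this only shows the core idea.

```scala
object ZCurveSketch {
  // Map a coordinate into a 16-bit grid cell (toy resolution).
  def normalize(v: Double, min: Double, max: Double): Long =
    math.min(((v - min) / (max - min) * 65535).toLong, 65535L)

  // Interleave the bits of x and y into a single Z-order value.
  def interleave(x: Long, y: Long): Long =
    (0 until 16).foldLeft(0L) { (acc, i) =>
      acc | ((x >> i) & 1L) << (2 * i) | ((y >> i) & 1L) << (2 * i + 1)
    }

  def zKey(lon: Double, lat: Double): Long =
    interleave(normalize(lon, -180, 180), normalize(lat, -90, 90))

  def main(args: Array[String]): Unit = {
    // Nearby points in the Gulf of Mexico yield nearby keys...
    println(zKey(-90.1, 28.9))
    println(zKey(-90.2, 28.8))
    // ...while a point across the world lands far away in key space.
    println(zKey(139.7, 35.7))
  }
}
```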
26. Geospatial Data Flow Transformation with NiFi and GeoMesa
[Diagram: satellite AIS spatial data and edge geo data move as geo data in motion, on-premises and in the cloud, into geo data at rest in both environments; edge analytics, closed-loop analytics, machine learning, and deep historical analysis consume the flows.]
27. GeoMesa NiFi
• GeoMesa-NiFi allows you to ingest data into GeoMesa straight from NiFi by leveraging custom processors.
• NiFi allows you to ingest data into GeoMesa from every source GeoMesa supports and more.
[Diagram: data plus a SimpleFeatureType schema flow through the GeoMesa NiFi processors into the enabled datastores.]
28. GeoMesa NiFi Processors
• PutGeoMesaAccumulo: Ingest data into a GeoMesa Accumulo datastore with a GeoMesa converter or from geoavro
• PutGeoMesaHBase: Ingest data into a GeoMesa HBase datastore with a GeoMesa converter or from geoavro
• PutGeoMesaFileSystem: Ingest data into a GeoMesa File System datastore with a GeoMesa converter or from geoavro
• PutGeoMesaKafka: Ingest data into a GeoMesa Kafka datastore with a GeoMesa converter or from geoavro
• PutGeoTools: Ingest data into an arbitrary GeoTools datastore using a GeoMesa converter or avro
• ConvertToGeoAvro: Use a GeoMesa converter to create geoavro
(a sketch of the converter’s role follows)
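Each Put processor needs a SimpleFeatureType and a converter that maps raw records onto it. The Scala sketch below performs, by hand, the transformation a converter definition describes for one CSV record; in NiFi this mapping is expressed declaratively in the processor's converter configuration rather than in code, and the schema and record layout here are assumptions.

```scala
import org.geotools.feature.simple.SimpleFeatureBuilder
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}

object ConverterSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative AIS-like schema; '*' marks the default geometry.
    val sft = SimpleFeatureTypes.createType(
      "ais", "mmsi:String,dtg:Date,*geom:Point:srid=4326")

    // One raw record as it might arrive in a flow file: mmsi,epoch-millis,lon,lat
    val Array(mmsi, millis, lon, lat) = "366999712,1530000000000,-90.1,28.9".split(',')

    // Build the SimpleFeature the way a converter definition would.
    val builder = new SimpleFeatureBuilder(sft)
    builder.set("mmsi", mmsi)
    builder.set("dtg", new java.util.Date(millis.toLong))
    builder.set("geom", new GeometryFactory()
      .createPoint(new Coordinate(lon.toDouble, lat.toDouble)))
    val feature = builder.buildFeature(mmsi) // feature ID

    println(feature)
  }
}
```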
30. How does HDP + GeoMesa analyze geospatial data?
• GeoMesa integrates deeply with Spark to:
• create spatial user-defined types and user-defined functions (based on LocationTech JTS, a geometry library)
• optimize spatial queries against GeoMesa DataSources
• persist output data back to GeoMesa
• leverage Zeppelin notebooks to allow for rapid innovation and creativity
• Zeppelin allows analysts to visualize results easily
(an example follows)
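Concretely, here is a minimal sketch of what that integration looks like from Spark. The st_contains, st_makeBBOX, and st_point functions follow the geomesa-spark-jts module and may vary by version; the inline DataFrame is a stand-in for one loaded from a GeoMesa DataSource.

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._ // provides spark.withJTS and the st_* UDFs

object SpatialSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("geo-sketch").getOrCreate()
    import spark.implicits._
    spark.withJTS // register JTS geometry UDTs and spatial UDFs on this session

    // Inline stand-in for a DataFrame loaded from a GeoMesa DataSource.
    Seq(("a", -90.1, 28.9), ("b", 139.7, 35.7)).toDF("id", "lon", "lat")
      .createOrReplaceTempView("ships")

    // Spatial predicates run as ordinary SQL once the UDFs are registered.
    spark.sql(
      """SELECT id FROM ships
        |WHERE st_contains(st_makeBBOX(-98.0, 18.0, -80.0, 31.0), st_point(lon, lat))
        |""".stripMargin).show()
  }
}
```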
36. Sub-select Data
● Create a rough sub-selection of the data
■ Bound by time
■ Bound by a bounding box roughly around the Gulf of Mexico
● Create a new temporary view from this sub-selection
● Cache the data (pull it into memory)
(a Spark sketch of these steps follows)
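A sketch of those steps, continuing the Spark session from the earlier example; the "ais" view, its column names, and the date range are assumptions about the demo dataset.

```scala
// Rough sub-selection: bound by time and by a bounding box around the Gulf of Mexico.
val gulf = spark.sql(
  """SELECT * FROM ais
    |WHERE dtg >= '2018-01-01' AND dtg < '2018-02-01'
    |  AND st_contains(st_makeBBOX(-98.0, 18.0, -80.0, 31.0), geom)
    |""".stripMargin)

gulf.createOrReplaceTempView("gulf_ais") // new temporary view from the sub-selection
spark.catalog.cacheTable("gulf_ais")     // pull the sub-selection into memory
```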
37. Data Exploration
● Query for tankers in the Gulf
● Get counts for each type of tanker
● Group the counts by day
● Graph counts to see trends
(a Spark sketch of these steps follows)
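Continuing the same session, a sketch of the count query; the vessel_type column name and "Tanker" prefix are assumptions about the AIS schema.

```scala
import org.apache.spark.sql.functions._

// Tankers in the Gulf, counted per tanker type per day.
val tankerCounts = spark.table("gulf_ais")
  .where(col("vessel_type").startsWith("Tanker"))
  .groupBy(to_date(col("dtg")).as("day"), col("vessel_type"))
  .count()
  .orderBy("day")

tankerCounts.createOrReplaceTempView("tanker_counts") // graph the daily trend in Zeppelin
```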
40. Extra Data
● Pull in gas price data
○ Acquired from EIA.gov
○ Two gas price indexes
■ NYH: New York Harbor
■ GC: Gulf Coast
● Create a temporary view so we can analyze it with SQL
(a Spark sketch of the load follows)
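A sketch of loading that series, continuing the same session; the file path and column layout are assumptions about the downloaded EIA.gov data.

```scala
// Gas price series downloaded from EIA.gov as CSV.
val gasPrices = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/eia_gas_prices.csv")
  .toDF("day", "nyh_price", "gc_price") // NYH = New York Harbor, GC = Gulf Coast

gasPrices.createOrReplaceTempView("gas_prices") // analyze alongside ship counts with SQL
```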
43. Data Exploration
● Backfill the price data with the last value to give us day-continuous data
● Min/max normalize gas prices and ship counts
● Graph gas prices and ship counts together
(a Spark sketch of the backfill and normalization follows)
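A sketch of the backfill and normalization, continuing the same session and shown for one price column; the same normalization would be applied to the ship counts.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Backfill: carry the last non-null price forward so the series is day-continuous.
val w = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filled = spark.table("gas_prices")
  .withColumn("nyh_filled", last(col("nyh_price"), ignoreNulls = true).over(w))

// Min/max normalize so gas prices and ship counts share a 0-1 scale on one graph.
val Row(lo: Double, hi: Double) = filled.agg(min("nyh_filled"), max("nyh_filled")).head
val normalized = filled.withColumn("nyh_norm", (col("nyh_filled") - lo) / (hi - lo))
```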