Harnessing Spark Catalyst for
Custom Data Payloads
GIS Raster Support in Spark DataFrames
Simeon H.K. Fitch
Co-Founder & VP of R&D, Astraea
Astraea
• Developing a machine learning platform to
make solving planetary problems easier
• With exploding population growth and finite
resources, we need to have tools to better plan
for sustainable growth
• We aim to bring earth science data to business
applications through machine learning
2
See the earth. As it was, as it is, as it could be.
Preface
• Assumptions:
– Basic knowledge of Spark, Resilient Distributed Datasets (RDDs), and the DataFrame
compute model
– Basic understanding of a typical ETL/ML pipeline
• Prior Art:
– The approach outlined here is derived from other work
– Fundamental raster support via Azavea’s GeoTrellis
– Spark integration cues taken from:
• CCRi’s GeoMesa
• Databricks’ Spark-Avro
• Caveat Emptor:
– As of Spark 2.1.0, approach is not officially sanctioned;
uses undocumented, private APIs
– Not for everyone, but for us, benefits outweigh the risks
3
PROBLEM STATEMENT
To efficiently and effectively build machine learning models with Earth observation data
4
Data Native Form
[Figure: a remote sensing data product (granule/scene/tile; e.g. GeoTIFF, HDF-EOS, GML-JPEG2000) drawn as a multiband tile (Band a, Band b, Band c, …) carrying a Temporal Projected Extent (TPE) and Granule Metadata (GM).]
Granule-wide properties (example):
Key            Value
TileID         51004010
long_name      Band 32 emissivity
scale_factor   0.002
add_offset     0.49
valid_range    1, 255
5
Canonical ML Functional Form
[Figure: each cell of a multiband tile (Band a, Band b, Band c, with its Temporal Projected Extent (TPE) and Granule Metadata (GM)) becomes one Spark DataFrame row, i.e. one ML observation. A row carries the granule metadata, the projected extent of the tile plus the cell row/column, and the band values at that single cell, for cells [0, 0] through [r, c].]
6
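A minimal sketch of the tile-to-observation reshaping described on this slide, in plain Scala without Spark. The `MultibandTile` and `Observation` types and the `explode` function are hypothetical names introduced for illustration; extent and granule metadata columns are left out to keep the example short.

```scala
object CellObservations {
  /** Hypothetical multiband tile: bands(b)(row * cols + col) is band b's value at (row, col). */
  final case class MultibandTile(cols: Int, rows: Int, bands: Vector[Array[Double]])

  /** One ML observation: the cell position within the tile plus one value per band. */
  final case class Observation(row: Int, col: Int, values: Vector[Double])

  /** Emit one observation per cell, in row-major order. */
  def explode(tile: MultibandTile): Seq[Observation] =
    for {
      row <- 0 until tile.rows
      col <- 0 until tile.cols
    } yield Observation(row, col, tile.bands.map(band => band(row * tile.cols + col)))
}
```

A tile with r rows, c columns, and k bands yields r × c observations of k values each, which is why the per-cell form is so much larger than the native form.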
Delivering Imagery to ML
[Figure: pipeline from world-wide data coverage to scalable machine learning. Scenes/granules (Scene 1 … Scene N, acquired at times t0 … tf, each with bands b1 … bn, laid out along time and wavelength axes) pass through SLAAW into a distributed DataFrame, the Base Analytics Functional Form (BAFF). Data Quality Check (DQC), Exploratory Data Analysis (EDA), and Feature Engineering lead to a second distributed DataFrame, the Analytics Base Table (ABT), which feeds scalable machine learning.]
7
Why This is Hard: Dimensionality
• Spatial (500m → 5m → 30cm)
• Temporal (Refresh rates: Weeks → Daily → Hourly)
• Spectral (4 bands → 200 bands)
• Sources pictured: Planet, DigitalGlobe, Landsat8, Planetary Resources
+ Metadata
– Coordinate Reference System
– Temporal/Spatial Extent
– QA Flags
– Calibration parameters
8
Why This is Hard: Data Footprint
As resolution scales, image size explodes
Data footprint for one football-field-size multiband raster (single point in time!)
Source                      Resolution       Bands                  Footprint
Landsat8 (NASA)             30 meters        8                      0.5 GB/image
Planet PlanetScope Ortho    3 meters         4                      16 GB/image
DigitalGlobe                30 centimeters   4                      1.0 TB/image
Planetary Resources         10 m             200 (hyper-spectral)   50 TB/image?
9
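The scaling behind "image size explodes" can be sketched with simple arithmetic: for a fixed ground area, the cell count grows with the inverse square of the cell size. The function below is an illustration only; the area, band count, and bytes per cell are assumed values, and it does not attempt to reproduce the per-provider figures above.

```scala
object RasterFootprint {
  /** Uncompressed size in bytes of a raster covering a square ground area.
    * Lengths are in centimeters so that the arithmetic stays in integers.
    * bytesPerCell is an assumed storage size per cell per band. */
  def rawBytes(sideCm: Long, cellCm: Long, bands: Int, bytesPerCell: Int): Long = {
    val cellsPerSide = (sideCm + cellCm - 1) / cellCm // round up to cover the whole side
    cellsPerSide * cellsPerSide * bands * bytesPerCell
  }

  def main(args: Array[String]): Unit = {
    val sideCm = 3000000L // hypothetical 30 km x 30 km area
    for (cellCm <- Seq(3000L, 300L, 30L)) { // 30 m, 3 m, 30 cm cells
      val gb = rawBytes(sideCm, cellCm, bands = 4, bytesPerCell = 2) / 1e9
      println(f"$cellCm%5d cm cells: $gb%.3f GB")
    }
  }
}
```

Each tenfold improvement in resolution multiplies the raw footprint by one hundred, before band count and bit depth are considered.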
CAPABILITY DEMONSTRATION
Prototyping Spark Catalyst raster integration
10
Domain-Specific Data Discretization
Swath ~ Granule ~ Scene ~ Raster: n × m cells, where n, m ≳ 1200 (e.g. Landsat 8: 7600 × 7600)
⇓
Tile ~ Chip: n × n cells, where n ≲ 512 (typical: 64 × 64 to 256 × 256)
⇓
Cell ~ Pixel: 1 × 1
Each of these has one or more “bands”
(e.g. Landsat 8: 11, MODIS: 36, Hyperion: 220)
11
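As a rough illustration of the scene-to-tile step, the sketch below computes the pixel windows that cover a scene with fixed-size tiles. `TileLayout`, `Window`, and `windows` are hypothetical names, and clipping edge tiles (rather than padding them) is an assumption; GeoTrellis has its own layout machinery, which this does not reproduce.

```scala
object TileLayout {
  /** Pixel window of one tile within a scene: column/row offsets plus size. */
  final case class Window(colMin: Int, rowMin: Int, cols: Int, rows: Int)

  /** Cover a scene with tileSize x tileSize windows in row-major order.
    * Windows on the right and bottom edges are clipped to the scene bounds. */
  def windows(sceneCols: Int, sceneRows: Int, tileSize: Int): Seq[Window] =
    for {
      rowMin <- 0 until sceneRows by tileSize
      colMin <- 0 until sceneCols by tileSize
    } yield Window(
      colMin,
      rowMin,
      math.min(tileSize, sceneCols - colMin),
      math.min(tileSize, sceneRows - rowMin))
}
```

For a 7600 × 7600 scene and 256 × 256 tiles this gives a 30 × 30 grid of 900 windows, with the last column and row of tiles only 176 cells wide or tall.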
TileUDT and Friends
• Using the approach covered in the next section we register TileUDT
with Spark
• With UDTs come User Defined Functions (UDFs)
• Some examples:
12
§ vectorizeTiles
§ explodeTiles
§ localMax
§ localMin
§ localStats
§ localAdd
§ localSubtract
§ tileHistogram
§ tileStatistics
§ tileMean
§ aggHistogram
§ aggStats
See work-in-progress code and examples/tests in:
https://github.com/s22s/geotrellis-spark-sql/
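The function names above suggest three families: "local" functions that combine tiles cell by cell, "tile" functions that reduce one tile to a summary, and "agg" functions that reduce a whole column of tiles. That reading is inferred from the names and from map algebra conventions, not taken from the linked repository. A plain-Scala sketch of the three shapes, with a hypothetical `Tile` alias standing in for the real tile type:

```scala
object TileOps {
  /** Hypothetical single-band tile: cell values in row-major order. */
  type Tile = Array[Double]

  /** Summary returned by the "agg" style function below. */
  final case class Stats(count: Long, min: Double, max: Double, mean: Double)

  /** "local" style: combine two same-sized tiles cell by cell, producing a tile. */
  def localAdd(a: Tile, b: Tile): Tile = {
    require(a.length == b.length, "tiles must have the same number of cells")
    a.zip(b).map { case (x, y) => x + y }
  }

  /** "tile" style: reduce a single tile to one value. */
  def tileMean(t: Tile): Double = t.sum / t.length

  /** "agg" style: reduce every cell of every tile in a column to one summary. */
  def aggStats(tiles: Seq[Tile]): Stats = {
    val cells = tiles.flatMap(_.toSeq)
    require(cells.nonEmpty, "need at least one cell")
    Stats(cells.size.toLong, cells.min, cells.max, cells.sum / cells.size)
  }
}
```

In Spark terms the first two shapes map naturally onto UDFs and the third onto a UDAF.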
TileUDT Notebook Demo
13
ZeppelinHub Version
14
IMPLEMENTATION
From GeoTiff to RDD[Tile] to Dataset[Tile] to DataFrame
Software Stack
• Scala
• Apache Spark
• GeoTrellis
• Accumulo
• Docker
• Apache Zeppelin
15
GeoTrellis
• GeoTrellis is an open source
Scala framework for efficiently
manipulating raster GIS data
• Provides facilities to ingest and
process tiles at scale
• Has powerful abstractions for
working with RDD[Tile]s.
– Mosaicing, stitching, pyramiding,
resampling, reprojecting, etc.
– Implements C. Dana Tomlin’s
“Map Algebra”
16
Getting From RDDs to DataFrames
• Goal: work with tiles via DataFrame APIs
– Better ergonomics
– More computationally efficient
– Required for SparkML
• Bonus: if a capability is available in
DataFrames, it’s also available in SQL!
17
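To illustrate the "also available in SQL" bonus, here is a small sketch that registers one function and calls it from both the DataFrame API and SQL. An array-of-doubles column stands in for a tile column, and the names `cellMean`, `cell_mean`, and `tiles` are made up for this example; it assumes a local Spark 2.x session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UdfInSql {
  /** Mean of the cells in one (stand-in) tile. */
  def cellMean(cells: Seq[Double]): Double = cells.sum / cells.size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-in-sql").getOrCreate()
    import spark.implicits._

    // Stand-in for a tile column: each row holds the cells of one tiny tile.
    val df = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 6.0)).toDF("cells")

    // Registering by name exposes the function to SQL; the returned
    // UserDefinedFunction can be applied directly in the DataFrame API.
    val cellMeanUdf = spark.udf.register("cell_mean", cellMean _)

    df.select(cellMeanUdf(col("cells")).as("mean")).show()

    df.createOrReplaceTempView("tiles")
    spark.sql("SELECT cell_mean(cells) AS mean FROM tiles").show()

    spark.stop()
  }
}
```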
Encoding Data with Spark Catalyst
• Catalyst is the engine behind Spark DataFrames & SQL
• Moving data from RDDs to DataFrames requires using one of two
Catalyst APIs:
– ExpressionEncoder[Tile] or
– UserDefinedType[Tile]
• Both are (currently) package private
• Both have steep learning curves
• Both are extremely powerful once harnessed
– ExpressionEncoder is ideal for simple structures
– UserDefinedType is more efficient for larger data payloads
• For our needs, UserDefinedType (UDT) is the best fit
18
Anatomy of a UDT
• Package: to access the private API, the UDT must be declared in a subpackage of org.apache.spark.sql
• Supertype: parameterized on the user type
• Type name: shown in the schema and query plan
• User class: runtime class descriptor of the user type
• SQL type: schema describing how the type will be encoded within Catalyst. You have lots of flexibility here, even using other UDTs. In this example we pack the tile into an opaque blob.
• Serialize: conversion from user data type to Catalyst encoding
• Deserialize: conversion from Catalyst encoding to user data type
19
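The parts of a UDT listed on this slide can be sketched as follows. This is not the TileUDT from the talk: `SimpleTile`, `SimpleTileUDT`, the package name, and the byte layout are all hypothetical, chosen so the example does not depend on GeoTrellis. It assumes Spark 2.x, where `UserDefinedType` is private to the `org.apache.spark` package tree.

```scala
package org.apache.spark.sql.example // hypothetical; only the org.apache.spark.sql prefix matters

import java.nio.ByteBuffer
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}

/** Hypothetical stand-in for a raster tile: cols x rows cells, one byte each. */
class SimpleTile(val cols: Int, val rows: Int, val cells: Array[Byte]) extends Serializable

/** Supertype parameterized on the user type. */
class SimpleTileUDT extends UserDefinedType[SimpleTile] {
  /** Name shown in schema and query plan. */
  override def typeName: String = "simple_tile"

  /** Runtime class descriptor of the user type. */
  override def userClass: Class[SimpleTile] = classOf[SimpleTile]

  /** Catalyst-side schema: a single opaque binary blob. */
  override def sqlType: DataType = BinaryType

  /** User type to Catalyst encoding: 8-byte header (cols, rows), then the cells. */
  override def serialize(tile: SimpleTile): Any = {
    val buf = ByteBuffer.allocate(8 + tile.cells.length)
    buf.putInt(tile.cols).putInt(tile.rows).put(tile.cells)
    buf.array()
  }

  /** Catalyst encoding back to the user type. */
  override def deserialize(datum: Any): SimpleTile = datum match {
    case bytes: Array[Byte] =>
      val buf = ByteBuffer.wrap(bytes)
      val cols = buf.getInt()
      val rows = buf.getInt()
      val cells = new Array[Byte](buf.remaining())
      buf.get(cells)
      new SimpleTile(cols, rows, cells)
    case other =>
      throw new IllegalArgumentException(s"Cannot decode a SimpleTile from $other")
  }
}
```

Packing everything into one `BinaryType` value keeps the schema trivial at the cost of hiding the tile's structure from Catalyst; a struct-based `sqlType` would expose more to the optimizer.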
UDT Registration
• User defined type is registered with
Catalyst by providing mapping between
native type and UDT
20
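A sketch of what the registration step might look like with Spark 2.x's `UDTRegistration` object, which maps a native class name to a UDT class name. The two class names below are hypothetical placeholders; both classes must be on the classpath by the time Catalyst resolves the mapping. Because `UDTRegistration` is private to the `org.apache.spark` package tree in this Spark line, the code again sits in a subpackage of `org.apache.spark.sql`.

```scala
package org.apache.spark.sql.example // hypothetical; UDTRegistration is private[spark]

import org.apache.spark.sql.types.UDTRegistration

object RegisterTileUDT {
  // Hypothetical fully qualified names of the native type and of its UDT.
  val userClass = "org.apache.spark.sql.example.SimpleTile"
  val udtClass = "org.apache.spark.sql.example.SimpleTileUDT"

  /** Register the native-type-to-UDT mapping once; later calls are no-ops. */
  def register(): Unit =
    if (!UDTRegistration.exists(userClass)) UDTRegistration.register(userClass, udtClass)

  /** True once the mapping is known to Catalyst. */
  def isRegistered: Boolean = UDTRegistration.exists(userClass)
}
```

Calling `RegisterTileUDT.register()` before building any Dataset that contains the native type lets Catalyst find the UDT without annotating the native class.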
Spark Catalyst Toolbox
• User Defined Type (UDT)
• User Defined Function (UDF, 2 forms)
• User Defined Aggregation Function (UDAF)
• User Defined Table Function (UDTF, a.k.a.
“Generator”)
• Data Source
• Query Plan
• Optimization Rule
21
Future Work
• GeoTrellis Layer Store as an integrated
Spark DataSource (in progress)
• Expanding standard GeoTrellis RDD
features into efficient UDFs
• GIS Vector primitives (a la GeoMesa)
• Becoming an official module of GeoTrellis
22
23
THANK YOU!
The End


Editor's Notes

  • #2 Approach is general, not limited to GIS/EO
  • #3 A little about who we are and what we’re up to
  • #4 Explain why it matters: you can't be a data scientist if you can't get the data into the form you need for modelling
  • #5 What we're working on at Astraea: platform to allow data scientists to efficiently build and deploy models based on EO data.
  • #8 SLAAW has to happen before you can even start your experimental design. Save the data scientists time by providing higher-level abstractions for doing the “science”. Make a really challenging data source more accessible to the data scientist. Two goals: address SLAAW; make data science steps more efficient. World-wide collections of data; need to be able to scale. Distinction between Python/R dataframes and Spark distributed ones.
  • #13 These functions can be applied globally to the distributed DataFrame. Allows for SLAAW, DQC, EDA, FE.
  • #15 Get rasters into Spark; manipulate rasters; move rasters into a DataFrame
  • #17 GeoTrellis gets the imagery into Spark. Map Algebra provides fundamental sets of primitives for performing analytics on GIS raster data
  • #18 GeoTrellis alone only gets us part of the way there