1. See the Earth as it could be.
Enabling Global-Scale Geospatial Machine Learning
FOSS4G NA 2018
Simeon Fitch
Co-Founder & VP of R&D
Astraea, Inc.
2.
Overview
• Context
• Problem Statement
• Introducing RasterFrames
• Example Problem
• Numerical and Performance Results
• Take-Aways
3.
With exploding population growth and finite resources,
we need tools to plan better for sustainable
growth.
By automating the processes around Remote Sensing,
High Performance Computing, and Machine Learning,
we empower individuals to ask complex questions of
the world.
HPC
ML
RS
4.
Think Locally, Compute Globally
• Model development is a creative, iterative, and
interactive process. How do we do this on a global
scale?
• Good tools minimize cognitive friction and are attentive to good
ergonomics
• At a minimum, we need:
Solve for local → Scale to global
• Global-scale remote sensing data pose particular
challenges
5.
Why This is Hard: Data Dimensionality
Temporal
Spatial
Spectral
Metadata
6.
Why This is Hard: Data Density
MODIS NBAR: 500 meter, 7 band
Landsat: 30 meter, 8 band
Planet: 3 meter, 4 band
NAIP: 1 meter, 4 band
Digital Globe: 0.3 meter, 8 band
[Chart: Football Field Multiband Image in Bytes, by sensor; log scale, axis from 0 to 1,000,000]
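The chart's point can be approximated with simple arithmetic. The sketch below estimates the bytes needed to image one football field at each sensor's resolution and band count. The field dimensions (an American football field of 109.7 m by 48.8 m) and the 2 bytes per sample are assumptions, not values from the slides, so the numbers will not necessarily match the original chart.

```python
# Rough size of a multiband image covering one football field.
# Assumptions (not from the slides): an American football field of
# 109.7 m x 48.8 m including end zones, and 2 bytes per sample.
FIELD_AREA_M2 = 109.7 * 48.8
BYTES_PER_SAMPLE = 2  # assumed sample width

# (sensor, ground resolution in meters, band count) as listed on the slide
SENSORS = [
    ("MODIS NBAR", 500.0, 7),
    ("Landsat", 30.0, 8),
    ("Planet", 3.0, 4),
    ("NAIP", 1.0, 4),
    ("Digital Globe", 0.3, 8),
]


def field_image_bytes(resolution_m, bands, bytes_per_sample=BYTES_PER_SAMPLE):
    """Approximate bytes needed to cover the field at a given resolution."""
    pixels = FIELD_AREA_M2 / (resolution_m ** 2)
    return pixels * bands * bytes_per_sample


if __name__ == "__main__":
    for name, res, bands in SENSORS:
        print(f"{name:>13}: {field_image_bytes(res, bands):,.1f} bytes")
```

Because pixel count grows with the inverse square of the resolution, the estimates differ by several orders of magnitude between the coarsest and finest sensors, which is why a log scale is needed to show them together.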
7.
Why This is Hard: Data Velocity
EOSDIS Holdings and Projected Growth
Source: Katie Baynes, NASA Goddard. “NASA’s EOSDIS Cumulus”. 2017. https://goo.gl/eQX9om
8.
Why This is Hard: Compute & Mental Model
• Traditional cluster computing (e.g. MPI) scales, but requires
special expertise
• Python Pandas & R DataFrames are very accessible, but not
scalable
• Spark DataFrames provide the best of both worlds, but aren’t
imagery friendly, until now
9.
• Incubating LocationTech project
• Provides ability to work with global-
scale remote sensing imagery in a
convenient yet scalable format
• Integrates with multiple data
sources and libraries, including
Spark ML, GeoTrellis Map Algebra
and GeoMesa Spark-JTS
• Python, Scala and SQL APIs
[Diagram labels: GeoTrellis Layers, TileLayerRDD, GeoTIFF, Spark DataSource, RasterFrame, Spark DataFrame, Map Algebra, Layer Operations, Statistical Analysis, Machine Learning, Visualization, spatial join, Geospatial Queries]
10.
RasterFrame Anatomy
12. Polyglot API
SQL:
SELECT spatial_key,
       rf_localAggMin(red) as red_min,
       rf_localAggMax(red) as red_max,
       rf_localAggMean(red) as red_mean
FROM df
GROUP BY spatial_key

Scala:
df.groupBy("spatial_key").agg(
  localAggMin($"red") as "red_min",
  localAggMax($"red") as "red_max",
  localAggMean($"red") as "red_mean")

Python:
df.groupBy(df.spatial_key).agg(
    localAggMin(df.red).alias('red_min'),
    localAggMax(df.red).alias('red_max'),
    localAggMean(df.red).alias('red_mean'))
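To make the idea of a "local" aggregate concrete, here is a minimal pure-Python sketch, assuming "local" means cell-wise across all tiles that share a spatial key. It is not the RasterFrames implementation; the helpers `local_agg` and `aggregate_by_key` and the sample tiles are hypothetical and only illustrate the shape of the result.

```python
from collections import defaultdict


def local_agg(tiles, reducer):
    """Cell-wise reduce a list of equally sized 2D tiles into one tile."""
    rows, cols = len(tiles[0]), len(tiles[0][0])
    return [
        [reducer([t[r][c] for t in tiles]) for c in range(cols)]
        for r in range(rows)
    ]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def aggregate_by_key(records):
    """Group (spatial_key, tile) records and compute per-cell min, max, mean."""
    groups = defaultdict(list)
    for key, tile in records:
        groups[key].append(tile)
    return {
        key: {
            "red_min": local_agg(tiles, min),
            "red_max": local_agg(tiles, max),
            "red_mean": local_agg(tiles, mean),
        }
        for key, tiles in groups.items()
    }


# Hypothetical data: two 2x2 "red" tiles sharing one spatial key.
records = [
    ((0, 0), [[1, 2], [3, 4]]),
    ((0, 0), [[5, 0], [1, 8]]),
]
result = aggregate_by_key(records)
```

Each output value is itself a tile of the same shape as the inputs, so the aggregate keeps the spatial layout rather than collapsing each tile to a single number.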
14.
Motivating Example: Global Ranking NDVI
On any given day, where in the world should we look for
high NDVI value(s)?
Real goal: Compute something on global imagery to
present RasterFrames and explore its scalability
Isn’t NDVI the
“Hello World”
of FOSS4G?
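For readers new to NDVI, it is computed per cell as (NIR - Red) / (NIR + Red). The following is a minimal sketch on plain Python lists, not the code used in the talk; the sample band values are hypothetical, and returning `None` where both bands are zero is an assumed no-data convention.

```python
def ndvi(red, nir):
    """Cell-wise NDVI = (NIR - Red) / (NIR + Red); None where the sum is zero."""
    out = []
    for red_row, nir_row in zip(red, nir):
        row = []
        for r, n in zip(red_row, nir_row):
            total = n + r
            # Avoid division by zero; treat such cells as no-data (assumption).
            row.append((n - r) / total if total != 0 else None)
        out.append(row)
    return out


# Hypothetical 2x2 red and near-infrared tiles.
red = [[10, 20], [0, 30]]
nir = [[50, 20], [0, 10]]
tile = ndvi(red, nir)
```

Values fall between -1 and 1, with higher values generally indicating denser green vegetation.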
24.
RasterFrame Take-Aways
• DataFrames lower cognitive friction when modeling. Good
Ergonomics!
• Rich set of raster processing primitives
• Support for descriptive and predictive analysis
• Interact with data and iterate on solutions via spark-shell,
Jupyter Notebook, Zeppelin, etc.
• It scales!
• Many more examples at http://rasterframes.io
25.
Getting Started
• Try it out via Jupyter Notebooks:
docker pull s22s/rasterframes-notebooks
• Documentation: http://rasterframes.io
• Code: https://github.com/locationtech/rasterframes
• Chat: https://gitter.im/s22s/raster-frames
• Social: @metasim on GitHub & Twitter
• Company: http://www.astraea.earth
26.
Shout Outs
• Thanks to LocationTech
• For Incubating RasterFrames; mentoring by Jim Hughes & Rob Emanuele
• The teams behind GeoTrellis, GeoMesa, JTS, & SFCurve
• Thanks to NASA, USGS, & NOAA
• Supporting public access to massive curated data sets is not easy!
• Upcoming Astraea Presentations
• Machine Learning, FOSS, and open data to map deforestation trends in the Brazilian Amazon
Courtney Whalen & Jason Brown
Tuesday, May 15, 2018 - 4:30 to 5:05 (right after this presentation)
Gateway 1
• Using Deep Learning to Derive 3D Cities from Satellite Imagery
Eric Culbertson
Wednesday, May 16, 2018 - 2:00 to 2:35
Gateway 2
• Please visit Astraea
• Booth #14
Editor's Notes
Unlock the wealth of information in global remote sensing data
Do we all agree that geospatial raster data has a wealth of potential information that can be gleaned from it?
My role at Astraea is to apply the art and discipline of software engineering to make data scientists efficient and effective in solving these problems
To empower, think about how people approach problems
Let’s think about the context for solving problems
We want our models to make a big impact; you must aim for global impact
Hard for reasons that are both obvious and not so obvious
The dreaded hockey stick
Thanks to Baynes and the EOSDIS team
The prior challenges are kind of obvious
This is what adds the friction
Need better ergonomics
Who likes DataFrames?
Who’s familiar with Spark?
Spark as a frontrunner in compute over industry data.
To effectively and efficiently deliver the power of high-performance computing, advanced machine learning, and remote sensing to our users
RasterFrames provides the ability to work with global EO data in a data frame format, familiar to most data scientists
Just a Spark DataFrame, but with special components.
“Tile” and “TileLayerMetadata” are types from the GeoTrellis library.
STK is “Space Time Key”
Conceptually you can also think of it as a map layer.
Regularly growing API
Explain NDVI?: Normalized Difference Vegetation Index
Somewhat contrived example for the purposes of highlighting some of RF features
I’m not a data scientist... The results haven’t been validated; this is just a computational proxy for real analysis
We are not specifying a region of interest. We are computing this for the whole world.
Code examples are in Scala (my native language). They look very similar in Python
This front-end section is currently data source (MODIS on PDS) specific
RasterFrames readers integrate directly with Spark DataSource API.
Aim: nice ergonomics
“Tile” and associated operations come from the GeoTrellis library
UDF == User Defined Function
Also a gateway to scoring by CNN
Another example of a UDF
Top 20 tiles with highest NDVI z-score
Not validated, but some interesting points of note for further investigation
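The "top 20 tiles with highest NDVI z-score" ranking mentioned in the notes can be sketched as follows. This is an illustration only: the notes do not say how the z-score was computed, so the sketch assumes each tile is summarized by its mean NDVI and standardized against the mean and population standard deviation over all tiles. The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def top_tiles_by_zscore(tile_means, n=20):
    """Rank tiles by the z-score of their mean NDVI, highest first."""
    values = list(tile_means.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all tiles identical; no meaningful ranking
    scores = {key: (v - mu) / sigma for key, v in tile_means.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical per-tile mean NDVI values keyed by (col, row).
tile_means = {(0, 0): 0.10, (0, 1): 0.80, (1, 0): 0.35, (1, 1): 0.55}
ranked = top_tiles_by_zscore(tile_means, n=2)
```

Here `ranked` holds the two tiles whose mean NDVI lies furthest above the overall mean, each paired with its z-score.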