See the Earth as it could be.
Enabling Global-Scale Geospatial Machine Learning
FOSS4G NA 2018
Simeon Fitch
Co-Founder & VP of R&D
Astraea, Inc.
Overview
• Context
• Problem Statement
• Introducing RasterFrames
• Example Problem
• Numerical and Performance Results
• Take-Aways
With exploding population growth and finite resources,
we need to have tools to better plan for sustainable
growth.
By automating the processes around Remote Sensing,
High Performance Computing, and Machine Learning,
we empower individuals to ask complex questions of
the world.
[Diagram: Remote Sensing (RS), High Performance Computing (HPC), and Machine Learning (ML)]
Think Locally, Compute Globally
• Model development is a creative, iterative, and
interactive process. How do we do this on a global
scale?
• Good tools minimize cognitive friction and pay attention to
ergonomics
• At a minimum, we need:
Solve for local → Scale to global
• Global-scale remote sensing data provide particular
challenges
Why This is Hard: Data Dimensionality
Temporal
Spatial
Spectral
Metadata
Why This is Hard: Data Density
• MODIS NBAR: 500 m, 7 bands
• Landsat: 30 m, 8 bands
• Planet: 3 m, 4 bands
• NAIP: 1 m, 4 bands
• Digital Globe: 0.3 m, 8 bands
[Chart: bytes in a multiband image covering one football field, per sensor, on a log scale up to 1,000,000]
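The arithmetic behind this comparison can be sketched roughly as follows. The field size (about 110 m × 49 m) and the 2 bytes per sample are assumptions made for illustration, not figures from the slide; only the resolutions and band counts come from the chart labels.

```python
# Rough estimate of multiband image size for a football-field footprint.
# FIELD_AREA_M2 and BYTES_PER_SAMPLE are illustrative assumptions.
FIELD_AREA_M2 = 110.0 * 49.0   # approximate field area in square meters
BYTES_PER_SAMPLE = 2           # assumed 16-bit samples

# (resolution in meters, band count) as labeled on the chart
SENSORS = {
    "MODIS NBAR": (500.0, 7),
    "Landsat": (30.0, 8),
    "Planet": (3.0, 4),
    "NAIP": (1.0, 4),
    "Digital Globe": (0.3, 8),
}

def field_bytes(resolution_m, bands):
    """Estimated bytes needed to cover the field at the given resolution."""
    pixels = FIELD_AREA_M2 / (resolution_m ** 2)
    return pixels * bands * BYTES_PER_SAMPLE

estimates = {name: field_bytes(res, bands) for name, (res, bands) in SENSORS.items()}

for name, size in estimates.items():
    print(f"{name:>14}: {size:,.1f} bytes")
```

Under these assumptions the estimates span more than six orders of magnitude, which is why the chart needs a log scale.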
Why This is Hard: Data Velocity
[Chart: EOSDIS Holdings and Projected Growth]
Source: Katie Baynes, NASA Goddard. “NASA’s EOSDIS Cumulus”. 2017. https://goo.gl/eQX9om
Why This is Hard: Compute & Mental Model
• Traditional cluster computing (e.g. MPI) scales, but requires
special expertise
• Python Pandas & R DataFrames are very accessible, but not
scalable
• Spark DataFrames provide the best of both worlds, but aren’t
imagery friendly, until now
Introducing RasterFrames
• Incubating LocationTech project
• Provides ability to work with global-scale remote sensing
imagery in a convenient yet scalable format
• Integrates with multiple data
sources and libraries, including
Spark ML, GeoTrellis Map Algebra
and GeoMesa Spark-JTS
• Python, Scala and SQL APIs
[Diagram: RasterFrame and its integrations: GeoTrellis Layers, Map Algebra, Layer Operations, Statistical Analysis, TileLayerRDD, Machine Learning, Visualization, GeoTIFF, Spark DataSource, Spark DataFrame, spatial join, Geospatial Queries]
RasterFrame Anatomy
Standard Tile Operations
• localAggStats
• localAggMax
• localAggMin
• localAggMean
• localAggDataCells
• localAggNoDataCells
• localAdd
• localSubtract
• localMultiply
• localDivide
• localAlgebra
• tileDimensions
• aggHistogram
• aggStats
• aggMean
• aggDataCells
• aggNoDataCells
• tileMean
• tileSum
• tileMin
• tileMax
• tileHistogram
• tileStats
• dataCells
• noDataCells
• box2D
• tileToArray
• arrayToTile
• assembleTile
• explodeTiles
• cellType
• convertCellType
• withNoData
• renderAscii
Polyglot API
SQL:
SELECT spatial_key,
       rf_localAggMin(red) AS red_min,
       rf_localAggMax(red) AS red_max,
       rf_localAggMean(red) AS red_mean
FROM df
GROUP BY spatial_key

Scala:
df.groupBy("spatial_key").agg(
  localAggMin($"red") as "red_min",
  localAggMax($"red") as "red_max",
  localAggMean($"red") as "red_mean")

Python:
df.groupBy(df.spatial_key).agg(
    localAggMin(df.red).alias('red_min'),
    localAggMax(df.red).alias('red_max'),
    localAggMean(df.red).alias('red_mean'))
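To make the semantics of the three queries concrete, here is a minimal pure-Python sketch of a cell-wise ("local") aggregate over tiles grouped by key. It is not the RasterFrames implementation: tiles are modeled as nested lists, `local_agg` and the sample rows are hypothetical, and NoData handling is left out.

```python
from collections import defaultdict

def local_agg(tiles, combine):
    """Combine equally sized tiles cell by cell using `combine`."""
    n_rows, n_cols = len(tiles[0]), len(tiles[0][0])
    return [
        [combine([t[r][c] for t in tiles]) for c in range(n_cols)]
        for r in range(n_rows)
    ]

def mean(values):
    return sum(values) / len(values)

# Hypothetical rows of (spatial_key, red tile); two tiles share key "a".
rows = [
    ("a", [[1, 2], [3, 4]]),
    ("a", [[5, 0], [1, 8]]),
    ("b", [[9, 9], [9, 9]]),
]

# GROUP BY spatial_key
groups = defaultdict(list)
for key, tile in rows:
    groups[key].append(tile)

# One output tile per aggregate and key
result = {
    key: {
        "red_min": local_agg(tiles, min),
        "red_max": local_agg(tiles, max),
        "red_mean": local_agg(tiles, mean),
    }
    for key, tiles in groups.items()
}
```

Each aggregate yields a tile of the same dimensions as its inputs, rather than a single scalar per group.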
User Manual With Examples
Motivating Example: Global Ranking NDVI
On any given day, where in the world should we look for
high NDVI value(s)?
Real goal: Compute something on global imagery to
present RasterFrames and explore its scalability
Isn’t NDVI the
“Hello World”
of FOSS4G?
Compute Pipeline
Implementation: Query & Ingest
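import java.time.LocalDate

// Open the MODIS catalog as a DataFrame
val catalog = spark.read
  .format("modis-catalog")
  .load()
// Keep only granules acquired on the day of interest
val granules = catalog
  .where($"acquisitionDate" === LocalDate.of(2017, 6, 7))
// Download the red (B01) and NIR (B02) bands as tiles, then pair them by location
val b01 = granules.select(download_tiles(modis_band_url("B01")))
val b02 = granules.select(download_tiles(modis_band_url("B02")))
val joined = b01.join(b02, "spatial_key")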
Implementation: Computing NDVI
// Per-tile NDVI: (NIR - red) / (NIR + red), computed in floating point
val ndvi = udf((b2: Tile, b1: Tile) ⇒ {
  val nir = b2.convert(FloatConstantNoDataCellType)
  val red = b1.convert(FloatConstantNoDataCellType)
  (nir - red) / (nir + red)
})
val withNDVI = joined
  .withColumn("ndvi", ndvi($"B02_tile", $"B01_tile"))
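For readers without a Spark session at hand, the per-cell arithmetic can be sketched in plain Python. This is an illustration only: tiles are nested lists, NaN stands in for NoData, and treating a zero denominator as NoData is an assumption of this sketch rather than documented behavior of the UDF above.

```python
import math

NODATA = float("nan")  # stand-in for a floating-point NoData value

def ndvi_cell(nir, red):
    """NDVI for one cell; NoData if an input is NoData or nir + red is zero."""
    if math.isnan(nir) or math.isnan(red):
        return NODATA
    total = nir + red
    if total == 0:
        return NODATA
    return (nir - red) / total

def ndvi_tile(nir_tile, red_tile):
    """Apply ndvi_cell to every pair of corresponding cells."""
    return [
        [ndvi_cell(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_tile, red_tile)
    ]

# Hypothetical 2 x 2 reflectance tiles
nir = [[3000.0, 0.0], [5000.0, 2500.0]]
red = [[1000.0, 0.0], [5000.0, 7500.0]]
ndvi = ndvi_tile(nir, red)
```

Values always fall between -1 and 1, with dense green vegetation toward the high end.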
Implementation: Computing Histograms
[Figure: global histograms for 2017-06-07 of the red band, the NIR band, and NDVI; y-axis: count (×100,000)]
// One aggregate histogram per column, across all tiles
val hist = withNDVI.select(
  aggHistogram($"B01_tile"),
  aggHistogram($"B02_tile"),
  aggHistogram($"ndvi")
)
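Conceptually, an aggregate histogram counts the cells of every tile in a column into one shared set of bins. The sketch below shows that idea with fixed-width bins in plain Python; the binning strategy, the `agg_histogram` name, and the sample tiles are assumptions for illustration and do not describe how `aggHistogram` is implemented.

```python
from collections import Counter

def agg_histogram(tiles, bin_width):
    """Count the cells of all tiles into bins of width `bin_width`."""
    counts = Counter()
    for tile in tiles:
        for row in tile:
            for cell in row:
                counts[int(cell // bin_width)] += 1
    return counts

# Two hypothetical 2 x 2 tiles from the same column
tiles = [
    [[0, 10], [25, 30]],
    [[5, 12], [7, 35]],
]
hist = agg_histogram(tiles, bin_width=10)

# Bin index i covers values in [i * bin_width, (i + 1) * bin_width)
for index in sorted(hist):
    print(f"[{index * 10}, {(index + 1) * 10}): {hist[index]}")
```

Because the counts from separate tiles simply add up, partial histograms can be built per partition and merged, which is what makes this kind of aggregate a good fit for a cluster.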
Implementation: Scoring Tiles
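// Global NDVI statistics, taken from the third (NDVI) histogram
val ndviStats = hist.first()._3.stats
// Min and max z-score within each tile, relative to the global statistics
val zscoreRange = udf((t: Tile) ⇒ {
  val mean = ndviStats.mean
  val stddev = math.sqrt(ndviStats.variance)
  t.mapDouble(c ⇒ (c - mean) / stddev).findMinMaxDouble
})
val scored = withNDVI
  .withColumn("zscores", zscoreRange($"ndvi"))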
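The scoring step standardizes each cell against the global mean and standard deviation, then keeps the minimum and maximum per tile. A small pure-Python sketch follows; it computes exact statistics directly from hypothetical tiles (instead of from a histogram) and uses the population variance, both of which are simplifying assumptions.

```python
import math

def tile_stats(tiles):
    """Mean and population variance over every cell of every tile."""
    cells = [c for tile in tiles for row in tile for c in row]
    mean = sum(cells) / len(cells)
    variance = sum((c - mean) ** 2 for c in cells) / len(cells)
    return mean, variance

def zscore_range(tile, mean, stddev):
    """(min, max) z-score found within one tile."""
    scores = [(c - mean) / stddev for row in tile for c in row]
    return min(scores), max(scores)

# Two hypothetical 2 x 2 tiles standing in for the NDVI column
tiles = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[8.0, 10.0], [12.0, 14.0]],
]
mean, variance = tile_stats(tiles)
stddev = math.sqrt(variance)
ranges = [zscore_range(t, mean, stddev) for t in tiles]
```

The second element of each pair plays the role of `zscores._2`: a tile ranks highly when at least one of its cells sits far above the global mean.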
Implementation: Results
// Rank tiles by their maximum NDVI z-score, highest first
val ordered = scored
  .select(
    $"B01_extent" as "extent",
    $"zscores._2" as "zscoreMax"
  )
  .orderBy(desc("zscoreMax"))
// Take the top 20 of the ranked tiles and convert each extent to a lat/long feature
val features = ordered
  .limit(20)
  .select($"extent", $"zscoreMax")
  .map { case (extent, zscoreMax) ⇒
    val geom = extent.toPolygon().reproject(Sinusoidal, LatLng)
    Feature(geom, Map("zscoreMax" -> zscoreMax))
  }
  .collect
val results = JsonFeatureCollection(features).toJson
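The same rank, limit, and export steps can be sketched with the Python standard library. The `top_features` helper and the sample extents are hypothetical, and the extents are assumed to already be in longitude/latitude, so the Sinusoidal to LatLng reprojection from the Scala code is not reproduced here.

```python
import json

def top_features(scored, n):
    """GeoJSON FeatureCollection for the n highest-scoring extents."""
    ranked = sorted(scored, key=lambda s: s["zscoreMax"], reverse=True)[:n]
    features = []
    for s in ranked:
        xmin, ymin, xmax, ymax = s["extent"]
        # Closed exterior ring, counterclockwise
        ring = [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax], [xmin, ymin]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"zscoreMax": s["zscoreMax"]},
        })
    return {"type": "FeatureCollection", "features": features}

# Hypothetical scored tiles: (xmin, ymin, xmax, ymax) plus the max z-score
scored = [
    {"extent": (0.0, 0.0, 1.0, 1.0), "zscoreMax": 1.2},
    {"extent": (1.0, 0.0, 2.0, 1.0), "zscoreMax": 3.4},
    {"extent": (2.0, 0.0, 3.0, 1.0), "zscoreMax": 2.1},
]
collection = top_features(scored, 2)
results = json.dumps(collection)
```

Sorting before limiting matters: taking the first rows of an unsorted set would return arbitrary tiles rather than the highest-scoring ones.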
Results: Histograms
[Figure: Global Red Band, NDVI, and NIR Band histograms for 2017-06-07; x-axis: cell value (0 to 12,000 for the bands, -1.5 to 1.5 for NDVI); y-axis: count (×100,000)]
Results: Top NDVI for 2017-06-07
Results: Benchmarks
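[Chart: run time in minutes versus CPU cores]
• 8 cores: 31.47 min
• 16 cores: 16.99 min
• 24 cores: 12.23 min
• 32 cores: 9.64 min
• 40 cores: 8.31 min
• 80 cores: 5.53 min
• 120 cores: 6.41 min
• 160 cores: 6.33 min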
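Speedup and parallel efficiency relative to the 8-core run can be derived from the chart values. Pairing the eight timings with the eight core counts in the order they appear is an assumption, since only the chart labels survive in this transcript.

```python
# (CPU cores, minutes) as read off the benchmark chart, paired in order.
TIMINGS = [
    (8, 31.47), (16, 16.99), (24, 12.23), (32, 9.64),
    (40, 8.31), (80, 5.53), (120, 6.41), (160, 6.33),
]

def scaling(timings):
    """Speedup and parallel efficiency relative to the first entry."""
    base_cores, base_minutes = timings[0]
    rows = []
    for cores, minutes in timings:
        speedup = base_minutes / minutes
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows

table = scaling(TIMINGS)
for cores, speedup, efficiency in table:
    print(f"{cores:>4} cores: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")
```

Read this way, run time falls steadily up to 80 cores and then levels off, so adding cores beyond that point brings little benefit for this workload.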
RasterFrame Take-Aways
• DataFrames lower cognitive friction when modeling. Good
Ergonomics!
• Rich set of raster processing primitives
• Support for descriptive and predictive analysis
• Via spark-shell, Jupyter Notebook, Zeppelin, etc., you can
interact with data and iterate on a solution
• It scales!
• Many more examples at http://rasterframes.io
Getting Started
• Try it out via Jupyter Notebooks:
docker pull s22s/rasterframes-notebooks
• Documentation: http://rasterframes.io
• Code: https://github.com/locationtech/rasterframes
• Chat: https://gitter.im/s22s/raster-frames
• Social: @metasim on GitHub & Twitter
• Company: http://www.astraea.earth
Shout Outs
• Thanks to LocationTech
• For Incubating RasterFrames; mentoring by Jim Hughes & Rob Emanuele
• The teams behind GeoTrellis, GeoMesa, JTS, & SFCurve
• Thanks to NASA, USGS, & NOAA
• Supporting public access to massive curated data sets is not easy!
• Upcoming Astraea Presentations
• Machine Learning, FOSS, and open data to map deforestation trends in the Brazilian Amazon
Courtney Whalen & Jason Brown
Tuesday, May 15, 2018 - 4:30 to 5:05 (right after this presentation)
Gateway 1
• Using Deep Learning to Derive 3D Cities from Satellite Imagery
Eric Culbertson
Wednesday, May 16, 2018 - 2:00 to 2:35
Gateway 2
• Please visit Astraea
• Booth #14

More Related Content

What's hot

Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013
 Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013 Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013
Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013Big Data Spain
 
ArcGIS - الدرس الأول
 ArcGIS - الدرس الأول ArcGIS - الدرس الأول
ArcGIS - الدرس الأولAhmad Harbash
 
CARTO BUILDER: from visualization to geospatial analysis
CARTO BUILDER: from visualization to geospatial analysisCARTO BUILDER: from visualization to geospatial analysis
CARTO BUILDER: from visualization to geospatial analysisJorge Sanz
 
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015Prakher Hajela Saxena
 
Supermap gis 10i(2020) ai gis technology v1.0
Supermap gis 10i(2020) ai gis technology v1.0Supermap gis 10i(2020) ai gis technology v1.0
Supermap gis 10i(2020) ai gis technology v1.0GeoMedeelel
 
DSD-INT 2018 Earth Science Through Datacubes - Merticariu
DSD-INT 2018 Earth Science Through Datacubes - MerticariuDSD-INT 2018 Earth Science Through Datacubes - Merticariu
DSD-INT 2018 Earth Science Through Datacubes - MerticariuDeltares
 
Free and Open Source GIS
Free and Open Source GISFree and Open Source GIS
Free and Open Source GISNico Elema
 
Introduction of super map gis 10i(2020) (1)
Introduction of super map gis 10i(2020) (1)Introduction of super map gis 10i(2020) (1)
Introduction of super map gis 10i(2020) (1)GeoMedeelel
 
Distributed system
Distributed systemDistributed system
Distributed systemMD Redaan
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1wang yaohui
 
Starfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticsStarfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticssai Pramoda
 
Esri Maps for MicroStrategy
Esri Maps for MicroStrategyEsri Maps for MicroStrategy
Esri Maps for MicroStrategyEsri
 
Trb 2017 annual_conference_visualization_lightning_talk_rst
Trb 2017 annual_conference_visualization_lightning_talk_rstTrb 2017 annual_conference_visualization_lightning_talk_rst
Trb 2017 annual_conference_visualization_lightning_talk_rstRobert Tung
 
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.Anita Graser
 
Cartaro Workshop at the Geosharing Conferenc in Bern
Cartaro Workshop at the Geosharing Conferenc in BernCartaro Workshop at the Geosharing Conferenc in Bern
Cartaro Workshop at the Geosharing Conferenc in BernUli Müller
 
How to empower community by using GIS lecture 2
How to empower community by using GIS lecture 2How to empower community by using GIS lecture 2
How to empower community by using GIS lecture 2wang yaohui
 
Dsm Presentation
Dsm PresentationDsm Presentation
Dsm Presentationrichoe
 

What's hot (20)

Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013
 Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013 Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013
Separating Hadoop Myths from Reality by ROB ANDERSON at Big Data Spain 2013
 
Rasdaman use case
Rasdaman use case Rasdaman use case
Rasdaman use case
 
ArcGIS - الدرس الأول
 ArcGIS - الدرس الأول ArcGIS - الدرس الأول
ArcGIS - الدرس الأول
 
CARTO BUILDER: from visualization to geospatial analysis
CARTO BUILDER: from visualization to geospatial analysisCARTO BUILDER: from visualization to geospatial analysis
CARTO BUILDER: from visualization to geospatial analysis
 
Gis Xke
Gis XkeGis Xke
Gis Xke
 
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015
Exploration and 3D GIS Software - MapInfo Professional Discover3D 2015
 
Supermap gis 10i(2020) ai gis technology v1.0
Supermap gis 10i(2020) ai gis technology v1.0Supermap gis 10i(2020) ai gis technology v1.0
Supermap gis 10i(2020) ai gis technology v1.0
 
DSD-INT 2018 Earth Science Through Datacubes - Merticariu
DSD-INT 2018 Earth Science Through Datacubes - MerticariuDSD-INT 2018 Earth Science Through Datacubes - Merticariu
DSD-INT 2018 Earth Science Through Datacubes - Merticariu
 
Free and Open Source GIS
Free and Open Source GISFree and Open Source GIS
Free and Open Source GIS
 
Introduction of super map gis 10i(2020) (1)
Introduction of super map gis 10i(2020) (1)Introduction of super map gis 10i(2020) (1)
Introduction of super map gis 10i(2020) (1)
 
HW3_Introduction_Mik
HW3_Introduction_MikHW3_Introduction_Mik
HW3_Introduction_Mik
 
Distributed system
Distributed systemDistributed system
Distributed system
 
How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1How to empower community by using GIS lecture 1
How to empower community by using GIS lecture 1
 
Starfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analyticsStarfish-A self tuning system for bigdata analytics
Starfish-A self tuning system for bigdata analytics
 
Esri Maps for MicroStrategy
Esri Maps for MicroStrategyEsri Maps for MicroStrategy
Esri Maps for MicroStrategy
 
Trb 2017 annual_conference_visualization_lightning_talk_rst
Trb 2017 annual_conference_visualization_lightning_talk_rstTrb 2017 annual_conference_visualization_lightning_talk_rst
Trb 2017 annual_conference_visualization_lightning_talk_rst
 
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
Big Spatial(!) Data Processing mit GeoMesa. AGIT 2019, Salzburg, Austria.
 
Cartaro Workshop at the Geosharing Conferenc in Bern
Cartaro Workshop at the Geosharing Conferenc in BernCartaro Workshop at the Geosharing Conferenc in Bern
Cartaro Workshop at the Geosharing Conferenc in Bern
 
How to empower community by using GIS lecture 2
How to empower community by using GIS lecture 2How to empower community by using GIS lecture 2
How to empower community by using GIS lecture 2
 
Dsm Presentation
Dsm PresentationDsm Presentation
Dsm Presentation
 

Similar to RasterFrames - FOSS4G NA 2018

Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Ram Sriharsha
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphsStanka Dalekova
 
Giving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityGiving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityMongoDB
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryStanka Dalekova
 
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기KwangJin So
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
RasterFrames + STAC
RasterFrames + STACRasterFrames + STAC
RasterFrames + STACSimeon Fitch
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkDatabricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
PyDX Presentation about Python, GeoData and Maps
PyDX Presentation about Python, GeoData and MapsPyDX Presentation about Python, GeoData and Maps
PyDX Presentation about Python, GeoData and MapsHannes Hapke
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2yannabraham
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5Keshav Murthy
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsAhmad Jawwad
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
Watershed Delineation in ArcGIS
Watershed Delineation in ArcGISWatershed Delineation in ArcGIS
Watershed Delineation in ArcGISArthur Green
 
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServer
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServerText Mining Applied to SQL Queries: a Case Study for SDSS SkyServer
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServerVitor Hirota Makiyama
 
Scaling Spatial Analytics with Google Cloud & CARTO
Scaling Spatial Analytics with Google Cloud & CARTOScaling Spatial Analytics with Google Cloud & CARTO
Scaling Spatial Analytics with Google Cloud & CARTOCARTO
 

Similar to RasterFrames - FOSS4G NA 2018 (20)

Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017Magellan FOSS4G Talk, Boston 2017
Magellan FOSS4G Talk, Boston 2017
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Follow the money with graphs
Follow the money with graphsFollow the money with graphs
Follow the money with graphs
 
Giving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityGiving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS Community
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Intro to Spatial data
Intro to Spatial data Intro to Spatial data
Intro to Spatial data
 
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기
03 인사이트를 줄 수 있는 Google Maps + CartoDB 활용사례 파헤치기
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
RasterFrames + STAC
RasterFrames + STACRasterFrames + STAC
RasterFrames + STAC
 
Challenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache SparkChallenging Web-Scale Graph Analytics with Apache Spark
Challenging Web-Scale Graph Analytics with Apache Spark
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
PyDX Presentation about Python, GeoData and Maps
PyDX Presentation about Python, GeoData and MapsPyDX Presentation about Python, GeoData and Maps
PyDX Presentation about Python, GeoData and Maps
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
 
Gis capabilities on Big Data Systems
Gis capabilities on Big Data SystemsGis capabilities on Big Data Systems
Gis capabilities on Big Data Systems
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Watershed Delineation in ArcGIS
Watershed Delineation in ArcGISWatershed Delineation in ArcGIS
Watershed Delineation in ArcGIS
 
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServer
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServerText Mining Applied to SQL Queries: a Case Study for SDSS SkyServer
Text Mining Applied to SQL Queries: a Case Study for SDSS SkyServer
 
Scaling Spatial Analytics with Google Cloud & CARTO
Scaling Spatial Analytics with Google Cloud & CARTOScaling Spatial Analytics with Google Cloud & CARTO
Scaling Spatial Analytics with Google Cloud & CARTO
 

Recently uploaded

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 

Recently uploaded (20)

Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 

RasterFrames - FOSS4G NA 2018

  • 1. See the Earth as it could be. Enabling Global-Scale Geospatial Machine Learning FOSS4G NA 2018 Simeon Fitch Co-Founder & VP of R&D Astraea, Inc.
  • 2. Overview • Context • Problem Statement • Introducing RasterFrames • Example Problem • Numerical and Performance Results • Take-Aways
  • 3. With exploding population growth and finite resources, we need to have tools to better plan for sustainable growth. By automating the processes around Remote Sensing, High Performance Computing, and Machine Learning, we empower individuals to ask complex questions of the world. [Diagram labels: HPC, ML, RS]
  • 4. Think Locally, Compute Globally • Model development is a creative, iterative, and interactive process. How do we do this on a global scale? • Good tools minimize cognitive friction; attentive to good ergonomics • At a minimum, we need: Solve for local → Scale to global • Global-scale remote sensing data provide particular challenges
  • 5. Why This is Hard: Data Dimensionality • Temporal • Spatial • Spectral • Metadata
  • 6. Why This is Hard: Data Density [Chart: Football Field Multiband Image in Bytes; y-axis "Multiband Bytes", log scale, up to 1,000,000] • MODIS NBAR: 500 Meter, 7 Band • Landsat: 30 Meter, 8 Band • Planet: 3 Meter, 4 Band • NAIP: 1 Meter, 4 Band • Digital Globe: 0.3 Meter, 8 Band
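The arithmetic behind the data density chart can be sketched roughly as follows. This is an illustration only, written in Python (one of the three API languages the deck shows). The field size (an American football field with end zones, about 109.73 m by 48.77 m) and the 2 bytes per sample are assumptions, not figures from the slides, and the pairing of sensors with resolutions follows the order of the chart labels.

```python
# Assumption: 120 yd x 53 1/3 yd field, converted to meters.
FIELD_AREA_M2 = 109.73 * 48.77
# Assumption: every sensor stores 16-bit (2-byte) samples.
BYTES_PER_SAMPLE = 2

# (sensor, ground resolution in meters, band count), as listed on the slide.
SENSORS = [
    ("MODIS NBAR", 500.0, 7),
    ("Landsat", 30.0, 8),
    ("Planet", 3.0, 4),
    ("NAIP", 1.0, 4),
    ("Digital Globe", 0.3, 8),
]


def field_bytes(resolution_m, bands, bytes_per_sample=BYTES_PER_SAMPLE):
    """Approximate bytes needed to cover one field at the given resolution."""
    pixels = FIELD_AREA_M2 / (resolution_m ** 2)
    return pixels * bands * bytes_per_sample


for name, resolution_m, bands in SENSORS:
    print(f"{name:14s} {field_bytes(resolution_m, bands):>14,.1f} bytes")
```

Because pixel count grows with the inverse square of the resolution, moving from 500 m to 0.3 m cells multiplies the data volume by several orders of magnitude, which is why the chart needs a log scale.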
  • 7. Why This is Hard: Data Velocity [Chart: EOSDIS Holdings and Projected Growth] Source: Katie Baynes, NASA Goddard. “NASA’s EOSDIS Cumulus”. 2017. https://goo.gl/eQX9om
  • 8. Why This is Hard: Compute & Mental Model • Traditional cluster computing (e.g. MPI) scales, but requires special expertise • Python Pandas & R DataFrames are very accessible, but not scalable • Spark DataFrames provide the best of both worlds, but aren’t imagery friendly, until now
  • 9. • Incubating LocationTech project • Provides ability to work with global-scale remote sensing imagery in a convenient yet scalable format • Integrates with multiple data sources and libraries, including Spark ML, GeoTrellis Map Algebra and GeoMesa Spark-JTS • Python, Scala and SQL APIs [Diagram labels: GeoTrellis Layers, Map Algebra, Layer Operations, Statistical Analysis, TileLayerRDD, Machine Learning, Visualization, GeoTIFF, RasterFrame, Spark DataSource, Spark DataFrame, spatial join, Geospatial Queries]
  • 10. RasterFrame Anatomy
  • 11. Standard Tile Operations • localAggStats • localAggMax • localAggMin • localAggMean • localAggDataCells • localAggNoDataCells • localAdd • localSubtract • localMultiply • localDivide • localAlgebra • tileDimensions • aggHistogram • aggStats • aggMean • aggDataCells • aggNoDataCells • tileMean • tileSum • tileMin • tileMax • tileHistogram • tileStats • dataCells • noDataCells • box2D • tileToArray • arrayToTile • assembleTile • explodeTiles • cellType • convertCellType • withNoData • renderAscii
  • 12. Polyglot API
      SQL:
        SELECT spatial_key,
          rf_localAggMin(red) as red_min,
          rf_localAggMax(red) as red_max,
          rf_localAggMean(red) as red_mean
        FROM df
        GROUP BY spatial_key
      Scala:
        df.groupBy("spatial_key").agg(
          localAggMin($"red") as "red_min",
          localAggMax($"red") as "red_max",
          localAggMean($"red") as "red_mean")
      Python:
        df.groupBy(df.spatial_key).agg(
          localAggMin(df.red).alias('red_min'),
          localAggMax(df.red).alias('red_max'),
          localAggMean(df.red).alias('red_mean'))
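To make the grouped `localAgg*` calls on the Polyglot API slide more concrete, here is a minimal plain-Python sketch of a cell-wise ("local") aggregate over tiles that share a key. It assumes that "local" means the cell at each position is combined across all tiles in the group, and it models tiles as nested lists. The helper `local_agg` is hypothetical and not part of RasterFrames, and NoData handling is left out.

```python
from collections import defaultdict
from statistics import mean


def local_agg(rows, combine):
    """Group (key, tile) rows by key, then combine tiles cell by cell."""
    grouped = defaultdict(list)
    for key, tile in rows:
        grouped[key].append(tile)
    result = {}
    for key, tiles in grouped.items():
        # zip(*tiles) pairs up matching rows; zip(*row_group) pairs up cells.
        result[key] = [
            [combine(cells) for cells in zip(*row_group)]
            for row_group in zip(*tiles)
        ]
    return result


# Two 2x2 "red" tiles share spatial key (0, 0); one tile has key (0, 1).
rows = [
    ((0, 0), [[1.0, 5.0], [3.0, 7.0]]),
    ((0, 0), [[4.0, 2.0], [6.0, 0.0]]),
    ((0, 1), [[9.0, 9.0], [9.0, 9.0]]),
]

red_min = local_agg(rows, min)
red_max = local_agg(rows, max)
red_mean = local_agg(rows, mean)
print(red_min[(0, 0)], red_max[(0, 0)], red_mean[(0, 0)])
```

The result for each key is itself a tile of the same dimensions, which is what lets the aggregate stay inside a DataFrame column.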
  • 13. User Manual With Examples
  • 14. Motivating Example: Global Ranking NDVI • On any given day, where in the world should we look for high NDVI value(s)? • Real goal: Compute something on global imagery to present RasterFrames and explore its scalability • Isn’t NDVI the “Hello World” of FOSS4G?
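  • 16. Implementation: Query & Ingest
      import java.time.LocalDate

      // Load the MODIS scene catalog as a DataFrame
      val catalog = spark.read
        .format("modis-catalog")
        .load()
      // Keep only granules acquired on the day of interest
      val granules = catalog
        .where($"acquisitionDate" === LocalDate.of(2017, 6, 7))
      // Fetch red (B01) and near-infrared (B02) tiles, then pair them by location
      val b01 = granules.select(download_tiles(modis_band_url("B01")))
      val b02 = granules.select(download_tiles(modis_band_url("B02")))
      val joined = b01.join(b02, "spatial_key")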
  • 17. Implementation: Computing NDVI
      val ndvi = udf((b2: Tile, b1: Tile) ⇒ {
        // Convert to floating point so the ratio is not truncated
        val nir = b2.convert(FloatConstantNoDataCellType)
        val red = b1.convert(FloatConstantNoDataCellType)
        (nir - red) / (nir + red)
      })
      val withNDVI = joined
        .withColumn("ndvi", ndvi($"B02_tile", $"B01_tile"))
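As a rough illustration of what the NDVI UDF computes per cell, the following plain-Python sketch applies (NIR - Red) / (NIR + Red) to two tiles modeled as nested lists. It is not the GeoTrellis implementation. Using NaN as the NoData marker, and mapping a zero denominator to NoData, are assumptions made for this example.

```python
import math

NODATA = float("nan")  # assumption: NaN stands in for float NoData


def ndvi_tile(nir, red):
    """Cell-wise NDVI for two equally sized tiles given as nested lists."""
    out = []
    for nir_row, red_row in zip(nir, red):
        row = []
        for n, r in zip(nir_row, red_row):
            denom = n + r
            if math.isnan(n) or math.isnan(r) or denom == 0:
                # Propagate NoData; a zero denominator is treated as NoData too.
                row.append(NODATA)
            else:
                row.append((n - r) / denom)
        out.append(row)
    return out


nir = [[0.5, 0.0], [0.3, NODATA]]
red = [[0.1, 0.0], [0.3, 0.2]]
print(ndvi_tile(nir, red))
```

Valid results always fall in the range -1 to 1, with higher values indicating denser green vegetation.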
  • 18. Implementation: Computing Histograms [Figures: Global NDVI Histogram for 2017-06-07; Global Red Band Histogram for 2017-06-07; Global NIR Band Histogram for 2017-06-07; y-axis: Count x100000]
      val hist = withNDVI.select(
        aggHistogram($"B01_tile"),
        aggHistogram($"B02_tile"),
        aggHistogram($"ndvi")
      )
  • 19. Implementation: Scoring Tiles
      // Global NDVI statistics, taken from the third (NDVI) histogram
      val ndviStats = hist.first()._3.stats
      val zscoreRange = udf((t: Tile) ⇒ {
        val mean = ndviStats.mean
        val stddev = math.sqrt(ndviStats.variance)
        // Standardize each cell, then keep the tile's (min, max) z-score
        t.mapDouble(c ⇒ (c - mean) / stddev).findMinMaxDouble
      })
      val scored = withNDVI
        .withColumn("zscores", zscoreRange($"ndvi"))
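The scoring step can be sketched in plain Python as follows: compute a global mean and standard deviation over all NDVI cells, then report each tile's minimum and maximum z-score. This is an illustration under assumptions. The slide derives its statistics from the aggregate histogram, while this sketch sums cells directly and uses the population variance. Both helper names are hypothetical.

```python
import math


def global_stats(tiles):
    """Mean and population standard deviation over all non-NaN cells."""
    count, total, total_sq = 0, 0.0, 0.0
    for tile in tiles:
        for row in tile:
            for cell in row:
                if not math.isnan(cell):
                    count += 1
                    total += cell
                    total_sq += cell * cell
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, math.sqrt(variance)


def zscore_range(tile, mean, stddev):
    """(min, max) z-score of one tile, skipping NaN cells."""
    zs = [(c - mean) / stddev for row in tile for c in row if not math.isnan(c)]
    if not zs:
        return (float("nan"), float("nan"))
    return (min(zs), max(zs))


tiles = [[[0.0, 0.2]], [[0.4, 0.6]]]
mean, stddev = global_stats(tiles)
for tile in tiles:
    print(zscore_range(tile, mean, stddev))
```

Scoring against global rather than per-tile statistics is what makes the maxima comparable across tiles from different parts of the world.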
  • 20. Implementation: Results
      val ordered = scored
        .select(
          $"B01_extent" as "extent",
          $"zscores._2" as "zscoreMax"
        )
        .orderBy(desc("zscoreMax"))
      // Take the top 20 rows from the ordered frame, which carries the
      // "extent" and "zscoreMax" columns selected above
      val features = ordered
        .limit(20)
        .select($"extent", $"zscoreMax")
        .map { case (extent, zscoreMax) ⇒
          val geom = extent.toPolygon().reproject(Sinusoidal, LatLng)
          Feature(geom, Map("zscoreMax" -> zscoreMax))
        }
        .collect
      val results = JsonFeatureCollection(features).toJson
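The ranking and export step might look like this in plain Python: sort rows by maximum z-score, keep the top N, and wrap each extent in a GeoJSON feature. This is a sketch under assumptions. Extents are taken to be `(xmin, ymin, xmax, ymax)` tuples already in longitude/latitude, so the Sinusoidal-to-LatLng reprojection from the slide is omitted. The function names are hypothetical.

```python
import json


def extent_to_polygon(extent):
    """GeoJSON Polygon for an (xmin, ymin, xmax, ymax) extent."""
    xmin, ymin, xmax, ymax = extent
    # Closed exterior ring, counterclockwise.
    ring = [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax], [xmin, ymin]]
    return {"type": "Polygon", "coordinates": [ring]}


def top_features(scored, n=20):
    """FeatureCollection of the n rows with the highest zscoreMax."""
    ordered = sorted(scored, key=lambda row: row["zscoreMax"], reverse=True)
    features = [
        {
            "type": "Feature",
            "geometry": extent_to_polygon(row["extent"]),
            "properties": {"zscoreMax": row["zscoreMax"]},
        }
        for row in ordered[:n]
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up rows for illustration only.
scored = [
    {"extent": (0.0, 0.0, 1.0, 1.0), "zscoreMax": 1.2},
    {"extent": (1.0, 0.0, 2.0, 1.0), "zscoreMax": 3.4},
    {"extent": (2.0, 0.0, 3.0, 1.0), "zscoreMax": 2.1},
]
print(json.dumps(top_features(scored, n=2)))
```

Sorting before limiting matters: taking the first N rows of an unordered frame would return an arbitrary sample rather than the highest-scoring tiles.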
  • 21. Results: Histograms [Figures: Global Red Band Histogram for 2017-06-07; Global NDVI Histogram for 2017-06-07; Global NIR Band Histogram for 2017-06-07; y-axis: Count x100000]
  • 22. Results: Top NDVI for 2017-06-07
  • 24. RasterFrame Take-Aways • DataFrames lower cognitive friction when modeling. Good Ergonomics! • Rich set of raster processing primitives • Support for descriptive and predictive analysis • Via spark-shell, Jupyter Notebook, Zeppelin, etc., you can interact with data and iterate on a solution • It scales! • Many more examples at http://rasterframes.io
  • 25. Getting Started • Try it out via Jupyter Notebooks: docker pull s22s/rasterframes-notebooks • Documentation: http://rasterframes.io • Code: https://github.com/locationtech/rasterframes • Chat: https://gitter.im/s22s/raster-frames • Social: @metasim on GitHub & Twitter • Company: http://www.astraea.earth
  • 26. Shout Outs • Thanks to LocationTech • For Incubating RasterFrames; mentoring by Jim Hughes & Rob Emanuele • The teams behind GeoTrellis, GeoMesa, JTS, & SFCurve • Thanks to NASA, USGS, & NOAA • Supporting public access to massive curated data sets is not easy! • Upcoming Astraea Presentations • Machine Learning, FOSS, and open data to map deforestation trends in the Brazilian Amazon, Courtney Whalen & Jason Brown, Tuesday, May 15, 2018 - 4:30 to 5:05 (right after this presentation), Gateway 1 • Using Deep Learning to Derive 3D Cities from Satellite Imagery, Eric Culbertson, Wednesday, May 16, 2018 - 2:00 to 2:35, Gateway 2 • Please visit Astraea • Booth #14

Editor's Notes

  1. Unlock the wealth of information in global remote sensing data. Do we all agree that geospatial raster data has a wealth of potential information that can be gleaned from it? My role at Astræa is to apply the art and discipline of software engineering to make data scientists efficient and effective in solving these problems.
  2. To empower, think about how people approach problems. Let’s think about the context for solving problems. We want our models to make a big impact; you must aim for global impact. Hard for reasons that are both obvious and not so obvious.
  3. Spatial: 500m, 30m, 1m, 0.3m. Temporal: Weeks, Days, Hours. Spectral: 4 bands, 7 bands, 34 bands, 200+ bands. Active sensors (SAR, LiDAR). Metadata: Coordinate Reference System, Temporal/Spatial Extent, QA Flags, Calibration parameters.
  4. The dreaded hockey stick. Thanks to Baynes and the EOSDIS team.
  5. The prior challenges are kind of obvious. This is what adds the friction. Need better ergonomics. Who likes DataFrames? Who’s familiar with Spark? Spark as a frontrunner in compute over industry data.
  6. To effectively and efficiently deliver the power of high-performance computing, advanced machine learning, and remote sensing to our users. RasterFrames provides the ability to work with global EO data in a data frame format, familiar to most data scientists.
  7. Just a Spark DataFrame, but with special components. “Tile” and “TileLayerMetadata” are types from the GeoTrellis library. STK is “Space Time Key”. Conceptually you can also think of it as a map layer.
  8. Regularly growing API
  9. Explain NDVI?: Normalized Difference Vegetation Index. Somewhat contrived example for the purposes of highlighting some of RF’s features. I’m not a data scientist... The results haven’t been validated; this is just a computational proxy for real analysis.
  10. We are not specifying a region of interest…. We are computing this for the whole world.
  11. Code examples are in Scala (my native language). They look very similar in Python. This front-end section is currently data source (MODIS on PDS) specific. RasterFrames readers integrate directly with the Spark DataSource API. Aim: nice ergonomics.
  12. “Tile” and associated operations come from the GeoTrellis library. UDF == User Defined Function. Also a gateway to scoring by CNN.
  13. Another example of a UDF
  14. Top 20 tiles with highest NDVI z-score. Not validated, but some interesting points of note for further investigation.
  15. r3.xlarge (8 cores, 30GB RAM)
  16. Please thank your civil servants