Toward STAC in RasterFrames
Satellite Data Interoperability Workshop
August 13-15, 2018
Simeon H.K. Fitch
VP of R&D, Astraea, Inc.
fitch@astraea.earth
GitHub: @metasim
• Data Science Platform Company
• Core product: RasterFlow AI
– On-demand supercomputer for the Earth
– Enables non-experts to leverage cutting-edge machine learning on big geospatial data.
– Global Scale. Deep Insight. For Everyone.
• Built around our open source core:
See the Earth. As it was, as it is, as it could be.℠
2
RasterFrames: Data Ecosystem
3
[Diagram labels: GeoTrellis Layers; GeoTIFF; STAC; Spark SQL DataSource API; EC2/EMR; Amazon PDS; (experimental)]
RasterFrames: Anatomy
4
Restructuring EO for Scaled ML
[Diagram: an EO data product (granule/scene: GeoTIFF, HDF-EOS, GML-JPEG2000, etc.) is read into a RasterFrame, a Spark DataFrame with layer-wide metadata. Each DataFrame row (i.e. ML observation) holds one multiband tile: a Temporal Projected Extent (TPE), tiled bands λa, λb, λc, and Scene Metadata (SM), indexed [0,0], [0,1], [1,0], [1,1], . . .]
Scene Metadata (SM) example:
Key            Value
long_name      Band 32 emissivity
valid_range    1, 255
scale_factor   0.002
add_offset     0.49
TileID         51004010
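The scale_factor, add_offset and valid_range entries in the scene metadata suggest how raw cell values become physical values. The sketch below is only an illustration: it assumes the slide's values pair up as scale_factor = 0.002, add_offset = 0.49, valid_range = 1, 255, and it assumes the common convention physical = raw * scale_factor + add_offset. `BandScaling` and `toPhysical` are hypothetical names, not part of RasterFrames.

```scala
// Hypothetical holder for the per-band scaling metadata shown on the slide.
final case class BandScaling(scaleFactor: Double, addOffset: Double, validRange: (Int, Int)) {

  /** Converts a raw cell value to a physical value.
    * Returns None when the raw value falls outside the valid range.
    * Assumed convention: physical = raw * scaleFactor + addOffset. */
  def toPhysical(raw: Int): Option[Double] = {
    val (lo, hi) = validRange
    if (raw < lo || raw > hi) None
    else Some(raw * scaleFactor + addOffset)
  }
}

object ScalingExample {
  // Values read off the slide's metadata table (pairing of keys and values is assumed).
  val emissivity: BandScaling =
    BandScaling(scaleFactor = 0.002, addOffset = 0.49, validRange = (1, 255))

  def main(args: Array[String]): Unit = {
    // 0 is outside the valid range; 1 and 255 are its two ends.
    Seq(0, 1, 128, 255).foreach { raw =>
      println(s"$raw -> ${emissivity.toPhysical(raw)}")
    }
  }
}
```

Under these assumptions the valid raw range maps to roughly 0.492 through 1.0, which is a plausible span for an emissivity band.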
RasterFrames: Canonical Query
5
SELECT geometry, datetime, b1, b2, b3
FROM stac_catalog
WHERE
  provider = 'NAIP' AND
  st_intersects(geometry, st_geomFromText('WKT...')) AND
  datetime > '2014-03-12' AND
  datetime <= '2018-01-09' AND
  `eo:cloud_cover` < 0.05
[Diagram: the RasterFrame tile grid from slide 4 (TPE, bands λa, λb, λc and SM per tile, plus layer-wide metadata)]
RasterFrames: Query Deconstruction
6
Questions & Concerns
7
Questions & Concerns
8
• Is STAC-API appropriate for this use case?
– Would GraphQL help? (thx Jason Gilman for idea!)
• Can we reduce # of HTTP requests?
• Are there limits on result set size?
– Is paging assumed? Can we stream?
• How to integrate COG range requests?
– STAC extension?
– Also: Header offsets.
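As one way to picture the paging and streaming question above, here is a minimal sketch of a lazy iterator that requests the next page only when the consumer runs out of items. Everything in it is hypothetical: `Page`, `nextToken` and the `fetch` function stand in for whatever a catalog service might return, and no real STAC endpoint or response format is implied.

```scala
// Hypothetical page of results: the items plus an optional token for the next page.
final case class Page[A](items: Seq[A], nextToken: Option[String])

object Paging {

  /** Lazily walks all pages, calling `fetch` only when more items are needed.
    * `fetch(None)` requests the first page; `fetch(Some(token))` requests a later one. */
  def stream[A](fetch: Option[String] => Page[A]): Iterator[A] = new Iterator[A] {
    private var buffer: Iterator[A] = Iterator.empty
    private var token: Option[String] = None
    private var exhausted = false

    // Pull pages until the buffer has an item or there are no pages left.
    private def fill(): Unit =
      while (!buffer.hasNext && !exhausted) {
        val page = fetch(token)
        buffer = page.items.iterator
        token = page.nextToken
        exhausted = page.nextToken.isEmpty
      }

    def hasNext: Boolean = { fill(); buffer.hasNext }
    def next(): A = { fill(); buffer.next() }
  }
}

object PagingDemo {
  def main(args: Array[String]): Unit = {
    // In-memory stand-in for a paged HTTP service.
    val pages = Map[Option[String], Page[String]](
      None       -> Page(Seq("item-1", "item-2"), Some("p2")),
      Some("p2") -> Page(Seq("item-3"), None)
    )
    var requests = 0
    val items = Paging.stream[String] { t => requests += 1; pages(t) }
    items.foreach(println)
    println(s"requests made: $requests")
  }
}
```

A consumer that stops early never triggers the remaining requests, which is the property a streaming reader would want from a paged API.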
9
http://rasterframes.io
fitch@astraea.earth
APPENDIX
10
DataSource Schema
11
import org.apache.spark.sql.types._

// One row per catalog item; BOUNDS is the only field explicitly marked nullable.
override def schema: StructType = StructType(Seq(
  StructField(C.ID, StringType, false),
  StructField(C.TIMESTAMP, TimestampType, false),
  StructField(C.PROPERTIES, DataTypes.createMapType(StringType, StringType, false)),
  StructField(C.BOUNDS, geomSchema, true),
  StructField(C.GEOMETRY, geomSchema, false),
  StructField(C.ASSETS, assetSchema, false)
))
DataSource Table Build
12
// Imports this excerpt appears to rely on (spray-json for JsArray/convertTo).
import java.sql.Timestamp
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import spray.json._

override def buildScan(requiredColumns: Array[String], attributeFilters: Array[Filter]): RDD[Row] = {
  // Timestamp predicates are excluded from the filters sent to the catalog.
  val activeFilters = filters ++ attributeFilters.filterNot(_.references.contains(C.TIMESTAMP))
  /** If no filters are provided, we're not going to return the whole catalog. */
  if (filters.isEmpty) sqlContext.sparkContext.emptyRDD[Row]
  else {
    val query = STACRelation.toUrl(base, activeFilters)
    val result = getJson(query)
    val featureRDD = result.asJsObject.fields("features") match {
      case JsArray(features) ⇒ sqlContext.sparkContext.makeRDD(features)
      case _ ⇒ throw new IllegalArgumentException("Unexpected JSON response received:\n" + result)
    }
    // Emit only the requested columns, in the requested order.
    featureRDD.map(_.convertTo[DOM.GeoJsonFeature]).map { feature ⇒
      val entries = requiredColumns.map {
        case C.ID ⇒ feature.properties(C.ID).convertTo[String]
        case C.TIMESTAMP ⇒ feature.properties("datetime").convertTo[Timestamp]
        case C.PROPERTIES ⇒ feature.properties.mapValues(_.toString)
        case C.BOUNDS ⇒ feature.bbox.map(_.jtsGeom).orNull
        case C.GEOMETRY ⇒ feature.geometry
        case C.ASSETS ⇒ feature.assets.mapValues(_.toString)
      }
      Row(entries: _*)
    }
  }
}
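The slide calls `STACRelation.toUrl(base, activeFilters)` without showing it. The following Spark-free sketch only illustrates the general idea of turning pushed-down predicates into a query string. The predicate types, the `UrlSketch` object and the parameter naming scheme (`field`, `field.gt`, `field.lte`, `field.lt`) are all invented for illustration and do not reflect the actual RasterFrames code or any STAC API parameter names.

```scala
import java.net.URLEncoder

// Hypothetical stand-ins for pushed-down filters (not Spark's Filter classes).
sealed trait Pred
final case class IsEqual(field: String, value: String) extends Pred
final case class GreaterThan(field: String, value: String) extends Pred
final case class LessOrEqual(field: String, value: String) extends Pred
final case class LessThan(field: String, value: Double) extends Pred

object UrlSketch {
  private def enc(s: String): String = URLEncoder.encode(s, "UTF-8")

  /** Renders each predicate as one query parameter and appends them to `base`.
    * The parameter naming scheme is an assumption made for this sketch. */
  def toUrl(base: String, preds: Seq[Pred]): String = {
    val params = preds.map {
      case IsEqual(f, v)     => s"${enc(f)}=${enc(v)}"
      case GreaterThan(f, v) => s"${enc(f)}.gt=${enc(v)}"
      case LessOrEqual(f, v) => s"${enc(f)}.lte=${enc(v)}"
      case LessThan(f, v)    => s"${enc(f)}.lt=$v"
    }
    if (params.isEmpty) base else base + "?" + params.mkString("&")
  }

  def main(args: Array[String]): Unit = {
    // Predicates mirroring the deck's example query; the base URL is a placeholder.
    val preds = Seq(
      IsEqual("provider", "NAIP"),
      GreaterThan("datetime", "2014-03-12"),
      LessOrEqual("datetime", "2018-01-09"),
      LessThan("eo:cloud_cover", 0.05)
    )
    println(toUrl("http://example.test/search", preds))
  }
}
```

Folding every predicate into a single URL keeps the scan to one HTTP request, which relates to the request-count concern raised on slide 8.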
SQL Example
13
SELECT geometry, datetime, b1, b2, b3
FROM stac_catalog
WHERE
  provider = 'NAIP' AND
  st_intersects(geometry, st_geomFromText('WKT...')) AND
  datetime > '2014-03-12' AND
  datetime <= '2018-01-09' AND
  `eo:cloud_cover` < 0.05

RasterFrames + STAC


Editor's Notes

  • #3 Personal background in delivering data-rich scientific and engineering applications
  • #5 Machine Learning is Observation Oriented