Enabling Efficient Cloud Workflows
Eugene Cheipesh
echeipesh@azavea.com
GitHub: @echeipesh
www.azavea.com
Cloud Native Workflows
“Hey, let's not download the whole file every time.”
Cloud Native
● Network is between storage and computation
● Storage is chunked and probably Key/Value
● Computation must scale
● Coordination is limited and risky
Cloud Optimized GeoTiff
● Defines an internal structure for GeoTiffs that is optimized for streaming reads from blob storage like S3 or Google Cloud Storage.
● That is, it is built to be read with HTTP GET Range Requests
https://trac.osgeo.org/gdal/wiki/CloudOptimizedGeoTIFF
http://www.cogeo.org/
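The mechanics behind this are plain HTTP. As a minimal sketch (the URL is a placeholder, not a real file), fetching only the first bytes of a remote GeoTiff with a range request looks like:

    import requests

    url = "https://example.com/scene.tif"  # hypothetical COG location

    # Ask for just the first 16 KB: enough for the header and, in a
    # well-laid-out COG, the IFDs as well.
    resp = requests.get(url, headers={"Range": "bytes=0-16383"})
    assert resp.status_code == 206  # 206 Partial Content: the range was honored

    # The first 4 bytes give the byte order and the TIFF magic number (42).
    print(resp.content[:4])  # b'II*\x00' (little-endian) or b'MM\x00*' (big-endian)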
“A GeoTIFF file is a TIFF 6.0 file, and
inherits the file structure as described
in the corresponding portion of the
TIFF spec. All GeoTIFF specific
information is encoded in several
additional reserved TIFF tags, and
contains no private Image File
Directories (IFD's), binary structures
or other private information invisible
to standard TIFF readers.”
http://geotiff.maptools.org/spec/geotiff2.4.html#2.4
GeoTiff
TIFF
● Tagged Image File Format
● Binary image data is separated into “segments”
● Contains metadata in a sequence of “tags”
● Can contain metadata for multiple images in different “image file directories” (IFDs)
GeoTiff Custom Tags
● GeoTiff adds custom GeoKey tags
○ non-TIFF keys for the CRS, map space, etc.
● This is how we know where the pixels are on
the surface of the earth (the geospatial bit).
http://geotiff.maptools.org/spec/geotiff6.html#6.1
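Any GeoTIFF-aware reader decodes the GeoKeys into a coordinate reference system and a geotransform. A small sketch with rasterio (the filename is a placeholder):

    import rasterio

    with rasterio.open("scene.tif") as src:
        print(src.crs)        # e.g. EPSG:32618, decoded from the GeoKey directory
        print(src.transform)  # affine map from pixel (col, row) to map coordinates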
GeoTiff Streaming Read
● Step 1: Fetch the header
● Step 2: Decide what segments you want
● Step 3: Read the segments
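A sketch of those three steps using rasterio, which rides on GDAL's range-request machinery; the URL and bounding box are placeholders:

    import rasterio
    from rasterio.windows import from_bounds

    url = "https://example.com/scene.tif"              # hypothetical COG
    bbox = (399960.0, 4090200.0, 409800.0, 4100040.0)  # left, bottom, right, top

    with rasterio.open(url) as src:                        # Step 1: header only
        win = from_bounds(*bbox, transform=src.transform)  # Step 2: pick segments
        data = src.read(1, window=win)                     # Step 3: read just those
    print("BBOX mean:", data.mean())                       # e.g. a "BBOX mean" request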
Request: BBOX Mean
GeoTIFF: All of the above
COG: IFDs First - Rule #1
TIFF Segments
● Chunks of an image are stored at different
locations. This allows for parts of the image to
be decompressed and loaded separately.
● There are two methods for segment layout:
Striped and Tiled
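A quick way to check which layout a given file uses, sketched with rasterio (the filename is a placeholder; remote paths work too):

    import rasterio

    with rasterio.open("scene.tif") as src:
        if src.is_tiled:
            print("tiled segments, block size:", src.block_shapes[0])
        else:
            print("striped segments, strip height:", src.block_shapes[0][0])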
TIFF: Strip Segments
COG: Strip Segments - Bad
TIFF: Tile Segments
COG: Use Tiled Segments - Rule #2
Request: BBOX Mean at 60m
GeoTIFF: Has Overviews
COG: Provide Overviews - Rule #3
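When the requested resolution is coarser than the base image, GDAL-based readers serve the read from the overview IFDs instead of full-resolution tiles. A hedged sketch with rasterio, assuming a 10 m base resolution and a placeholder URL:

    import rasterio
    from rasterio.enums import Resampling

    with rasterio.open("https://example.com/scene.tif") as src:
        # Reading at 1/6 of full size approximates 60 m from a 10 m base;
        # the data comes from the nearest overview, not the base tiles.
        data = src.read(
            1,
            out_shape=(src.height // 6, src.width // 6),
            resampling=Resampling.average,
        )
    print("BBOX mean at ~60 m:", data.mean())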
COG Rules
● Put IFDs first
● Sort IFDs by resolution
● Use Tiled segments
● Include overviews (in file)
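One possible recipe for producing a file that follows these rules, sketched with the GDAL Python bindings (filenames are placeholders; recent GDAL releases also ship a dedicated COG driver that applies the same rules in one step):

    from osgeo import gdal

    src = gdal.Open("input.tif")
    # Rule #3: build the overviews first so they exist to be copied.
    src.BuildOverviews("AVERAGE", [2, 4, 8, 16])
    # Rules #1 and #2: rewriting with these options yields tiled segments
    # with the IFDs, overviews included, gathered at the front of the file.
    gdal.Translate(
        "output_cog.tif",
        src,
        creationOptions=[
            "TILED=YES",               # tiled segments
            "COPY_SRC_OVERVIEWS=YES",  # carry the overviews into the file
            "COMPRESS=DEFLATE",        # common; LZW is the conservative choice
        ],
    )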
COG Rules Continued ?
● Provenance Metadata
● Include Summary Statistics
● File naming convention
● Catalog suggestion
COG “Profiles”: Slippy Map
● Restrict CRS as EPSG:3857
● Align GeoTiff segments with slippy tile grid
● Align GeoTiff files with tile boundaries
● Serve tile endpoints directly from COG
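A sketch of the grid math, using the mercantile package as an illustrative choice: the EPSG:3857 bounds of a slippy tile are exactly the extent a matching GeoTiff segment would cover.

    import mercantile

    tile = mercantile.Tile(x=1205, y=1539, z=12)  # an arbitrary example tile
    print(mercantile.xy_bounds(tile))             # Bbox in EPSG:3857 meters
    # A slippy-map COG would use 256- or 512-pixel segments whose corners
    # land exactly on these tile boundaries.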
COG World Coverage?
“You’re going to need more than one GeoTIFF.”
s3://[bucket]/[Shard]_[GeoHash]_[ISO_TIME]_[ID].tiff
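A hedged sketch of generating such a key; pygeohash and the md5 shard prefix are illustrative choices rather than a prescribed scheme:

    import hashlib
    import pygeohash  # assumed dependency: pip install pygeohash

    def cog_key(bucket, lat, lon, iso_time, scene_id):
        geohash = pygeohash.encode(lat, lon, precision=6)
        # A short hash prefix spreads keys across blob-store partitions,
        # which avoids hotspotting (see S3 best practices).
        shard = hashlib.md5(scene_id.encode()).hexdigest()[:2]
        return f"s3://{bucket}/{shard}_{geohash}_{iso_time}_{scene_id}.tiff"

    print(cog_key("my-bucket", 39.95, -75.16, "2018-06-01T00:00:00Z", "LC08_0001"))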
GeoTrellis & COGs
COG: Tile Request
COG Tiler
● Application specific catalog
● Reprojection on read
● Memcache gives responsiveness
● Raster metadata pushed into files
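To illustrate the caching point: the COG header is the hot data on every tile request, so a tiler can memoize the first bytes of each file. A sketch assuming pymemcache against a local memcached; the 16 KB header size is a guess:

    import requests
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def cog_header(url, nbytes=16384):
        # Return the leading bytes of a remote COG, caching them by URL.
        key = "hdr:" + url
        hdr = cache.get(key)
        if hdr is None:
            resp = requests.get(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
            hdr = resp.content
            cache.set(key, hdr)
        return hdr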
GeoTrellis Layers
GeoTrellis: Avro Layers
GeoTrellis Avro Layers
● Many small files, one per tile
○ PUT requests get expensive
● Data duplication hazard
○ Probably not authoritative
● Limited access
○ Requires GeoTrellis code to read
GeoTrellis: COG Layers
GeoTrellis Structured COG
● Base zoom tiles are in base GeoTiff layer
● Introduces “DATE_TIME” tag
● Includes VRT for layer files
● GeoTiff segments match tile size and layout
● Additional zoom levels stored in overviews
● Pyramid up when overviews “run out”
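Because DATE_TIME travels as ordinary TIFF metadata rather than a private structure, other GDAL-based tools can still open the tiles. A tiny sketch with rasterio (the path is hypothetical, and how custom tags surface varies by reader):

    import rasterio

    with rasterio.open("layer/0_0.tif") as src:
        print(src.tags())  # TIFF metadata the reader exposes, as a plain dict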
GeoTrellis: COG Pyramid
Unstructured COGs
Cloud Native thinking
GeoTrellis Unstructured COG
● Lots of good raster sources are already COG
● They don’t have a strict layout though
● Let’s “inline” the ingest process into the batch job
● Slower but reduces data duplication
COG: Tile Request (Again)
GeoTrellis Unstructured COG
● Requires a mechanism to query for raster names
○ WFS, STAC, PostgreSQL, Scan
● COG provides “gradual” wins over GeoTiff
● Reproject on read to match workflow
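A sketch of reproject-on-read with rasterio's WarpedVRT: the COG stays in its native CRS, and warping happens lazily per requested window, so range reads still touch only the needed segments. The URL is a placeholder:

    import rasterio
    from rasterio.enums import Resampling
    from rasterio.vrt import WarpedVRT

    with rasterio.open("https://example.com/scene.tif") as src:
        with WarpedVRT(src, crs="EPSG:3857", resampling=Resampling.bilinear) as vrt:
            data = vrt.read(1)  # reprojected on the fly at read time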
Coregistration Workflow
● Working with multiple raster layers with differing:
○ CRS, Resolution, Extent
● Must pick the working/target layout
● We can “push-down” this process into IO
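A sketch of what “push-down into IO” can look like: every source is read through a warped view that shares one target layout, so resampling happens per window at read time. Paths and grid parameters are placeholders, and a real pipeline would manage dataset lifetimes more carefully:

    import rasterio
    from rasterio.transform import from_origin
    from rasterio.vrt import WarpedVRT

    # The chosen working layout: one CRS, resolution, and extent for all inputs.
    DST_CRS = "EPSG:3857"
    DST_TRANSFORM = from_origin(-8_400_000, 4_900_000, 60, 60)  # 60 m pixels
    DST_WIDTH, DST_HEIGHT = 2048, 2048

    def read_coregistered(path, band=1):
        # Warp the source onto the shared grid as it is read.
        with rasterio.open(path) as src:
            with WarpedVRT(src, crs=DST_CRS, transform=DST_TRANSFORM,
                           width=DST_WIDTH, height=DST_HEIGHT) as vrt:
                return vrt.read(band)

    a = read_coregistered("layer_a.tif")
    b = read_coregistered("layer_b.tif")
    diff = a - b  # pixel-aligned, safe to combine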
Coregistration Workflow
COGs for GeoTrellis
● Treating GeoTiff as a Key/Value tile store
○ With a built-in notion of level of detail
● Turns an ingest workflow into query workflow
○ Use the metadata to plan the read
● Opens up the results
○ QGIS, GeoServer, GDAL
COGs in General
● The GeoJSON of raster formats
● Gradated spec
● Set of best practices
● Driven by implementations
THANKS
Q: Is this a good idea ?
Q: So is this a standard ?
Q: Why not use MRF ?

Cloud Optimized GeoTIFFs: enabling efficient cloud workflows

Editor's Notes

  • #9 This is the original formalization of the idea by Even Rouault; it is authoritative. It includes benchmarks using GDAL that show the difference in range read speeds.
  • #10 This is a site managed by Chris Holmes; it is a better overview and, critically, includes links to implementations and tools relevant to COG.
  • #18 TIFF header. Arrows are offsets in the TIFF file; no order is implied. The header is minimal and does not carry image information; the IFDs do.
  • #19 All of these are valid structures for TIFF. Left: easy to read, all the headers are at the top (IFDs should be seen as headers). Center: looks good, but it is not clear what the advantage of this organization is. Right: easy to write if you don’t know when the image stream will end.
  • #20 Only the left organization is good for us because we get all the relevant header information at the top.
  • #21 The COG spec is not firm about compression; Deflate is common but slow, and LZW is probably the most conservative choice. In any case, the choice must be something understood by GDAL across all installs to preserve interoperability.
  • #22 In reality strips can even be one or two pixels high for large files.
  • #27 Fetching a larger AOI will require a lot of IO time and network traffic.
  • #29 Using overviews when downsampled information is sufficient (it very often is) saves the IO.
  • #31 Provenance is a huge problem; the temptation is to use TIFF tags to keep track of it, but no standard exists yet. Histograms are super useful if you ever intend to view the file. A file naming convention seems tempting but is impossible to enforce, and thus to expect. Catalogs are the only other option.
  • #32 This COG profile optimizes the IO to 1 request = 1 segment. There is no reason to require this for all COGs; the conclusion is that there is room to optimize COG layout for each system or application.
  • #34 Static STAC files provide a reliable way to serve an index described with GeoJSON files. REST and WFS 3.0 services on top of the static index provide an integration point for applications.
  • #35 You can play games with the names of the GeoTIFF files to come up with a scheme for finding the file we want from request information alone: ID and geospatial bbox. You should consider sharding the scheme to avoid hotspotting your blob store, like S3 (see the S3 best practices). People from the GeoWave/GeoMesa/GeoTrellis projects have a lot of lessons about building a static index like this that they would be happy to share, and there are reusable components available from those projects.
  • #37 RasterFoundry is the raster management and manipulation system built on top of GeoTrellis.
  • #38 Level 0 support for COGs is serving tiles. RF does this using GeoTrellis.
  • #39 Tile request from RasterFoundry. The catalog is application specific. The COG header is hot information and must be cached first. Fetching COG segments may be too slow for quick screen refreshes.
  • #40 An alternative view of the tile request. The highlight here is that we’re reprojecting the COG on the fly, per tile request.
  • #43 GeoTrellis layers are the constructs that represent a distributed raster collection that we can work over in cluster memory.
  • #44 Avro layers work really well on sorted stores like Accumulo, but on S3 we end up with lots of small files.
  • #45 The COG becomes our meta-tile, which is great. As a bonus, the layers are now accessible from non-GeoTrellis applications. This has the potential to drastically reduce data duplication.
  • #46 Note we had to add the DATE_TIME tag to keep track of the actual time. VRT files tie them together and make them accessible.
  • #47 There is only so much you can resample a set of tiles before you no longer have enough information to save. At that point you need to collect the tops of the pyramid and reshuffle them into higher levels. GeoTrellis handles this process when it writes out pyramided COG layers.
  • #48 Snapshot from an OSM / RasterFrames / GeoTrellis machine learning project. The best part is that every layer written by GeoTrellis is now accessible through QGIS. This is a workflow that combines source data, a produced GeoTrellis layer, and a published OSM layer.
  • #49 Thinking about COGs as mini-image databases
  • #51 We can put the tile reading process in front of a distributed batch job. Instead of running it per request, we run it per record. This effectively inlines the ingest process into the batch job itself.
  • #52 The same concerns we discussed for making viewable COG catalogs apply to building distributed imagery pipelines that source COGs.
  • #53 Often we need more than one layer to do something interesting
  • #54 Fetch the metadata from the catalog to find what we want. Fetch the metadata from the COGs to find out what it is. Filter the potential IO using the COG headers. Fetch the segments we know we need, as tiles, in the way we want them.
  • #55 The proof is in the pudding. This GeoTrellis Landsat tiler coregisters scenes from multiple CRSs. The responsiveness here suggests that the cost in a batch process is actually minimal.