
Distributed georeferenced raster processing on Spark with GeoTrellis

GeoTrellis is a geographic data processing engine for high-performance applications. This presentation focuses on how the Spark RDD partitioning scheme can influence the behaviour of a whole Spark application.



  1. DISTRIBUTED GEOREFERENCED RASTER PROCESSING ON SPARK. Grigory Pomadchin (@daunnc / @pomadchin)
  2. GEO = Raster + Vector, w/ VectorTiles + PointClouds
  3. GEOTRELLIS ECOSYSTEM • Raster Foundry (Spark SQL & ML) • RasterFrames (Spark SQL & ML, Datasets query API) • GeoPySpark (Python bindings) • VectorPipe (vector tiles on Spark) • PDAL integration (point clouds on Spark)
  6. PARTITIONING SCHEME • RDD (the basic core Spark type; "from the past"? no): manual partitioning control • Dataset: query-planning optimizations, better suited to data that is already well partitioned and structured. SPECIAL "BROWN-COLORED" (SHUFFLE) FUNCTIONS • join • groupByKey • reduceByKey • combineByKey • repartition • every function that has no preservesPartitioning flag and cannot accept a partitioner as an argument (map, probably?)
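To make the partitioning point concrete, here is a plain-Scala sketch (not Spark itself) of how a HashPartitioner-style scheme assigns keys, and why two datasets partitioned with the same scheme can be joined partition-locally, with no shuffle; `hashPartition` and the sample data are illustrative:

```scala
// A plain-Scala model of Spark's HashPartitioner: a key's partition is a
// non-negative modulus of its hashCode. If both sides of a join were
// partitioned with the same scheme, every matching key already sits in
// the same partition, so the join needs no shuffle.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod // keep the index non-negative
}

val numPartitions = 8
val left  = Seq("a" -> 1, "b" -> 2, "c" -> 3)
val right = Seq("a" -> 10, "b" -> 20)

// A partition-local join: within each partition we only ever look at the
// records that the shared partitioner routed there.
val joined: Map[String, (Int, Int)] =
  (0 until numPartitions).flatMap { p =>
    val l = left.filter { case (k, _) => hashPartition(k, numPartitions) == p }.toMap
    val r = right.filter { case (k, _) => hashPartition(k, numPartitions) == p }.toMap
    l.keySet.intersect(r.keySet).map(k => k -> (l(k), r(k)))
  }.toMap
// joined contains exactly the keys present on both sides: "a" and "b"
```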
  8. join → reduceByKey; join → map → reduceByKey; join → mapValues → reduceByKey. IS MAP A FUNCTION OF A DIFFERENT KIND?
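The answer the slide hints at: `map` on a pair RDD may rewrite the key, so Spark must drop the partitioner, while `mapValues` cannot touch keys, so the partitioner is preserved. A plain-Scala model of the effect (the 8-partition hash scheme is illustrative):

```scala
// Why Spark keeps the partitioner for mapValues but not for map: map sees
// the whole (key, value) pair and may rewrite the key, invalidating the
// existing key -> partition assignment; mapValues cannot touch the keys.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

val pairs  = Seq(3 -> "a", 11 -> "b")
val before = pairs.map { case (k, _) => hashPartition(k, 8) }

// map may rewrite keys: the old partition assignment no longer holds.
val mapped   = pairs.map { case (k, v) => (k + 1, v) }
val afterMap = mapped.map { case (k, _) => hashPartition(k, 8) }

// a mapValues-style transform keeps keys: partitions are untouched.
val revalued       = pairs.map { case (k, v) => (k, v.toUpperCase) }
val afterMapValues = revalued.map { case (k, _) => hashPartition(k, 8) }
```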
  9. (diagram, inspired by Eugene Cheipesh's slides)
  10. DATA PREPARATION • {Hadoop | S3}GeoTiffRDD loads data from {HDFS / local FS | S3} into Spark • (I, V): I = {ProjectedExtent(extent, crs) | TemporalProjectedExtent(extent, crs, time)}, V = {Multiband | Singleband}Tile • K = {SpatialKey(col, row) | SpaceTimeKey(col, row, time)}
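A simplified, hypothetical model of the keying step (not the actual GeoTrellis API): tiling maps each record's extent onto a regular grid, turning (ProjectedExtent, Tile) records into (SpatialKey, Tile) records. The world bounds and grid size below are made up for illustration:

```scala
// Simplified model of keying a georeferenced record: the target layout is
// a regular grid over some world extent, and each input tile gets the
// SpatialKey(col, row) of the grid cell containing its extent's centre.
case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double)
case class SpatialKey(col: Int, row: Int)

// Hypothetical layout: a 10 x 10 grid over a 100 x 100 world extent.
val world      = Extent(0, 0, 100, 100)
val layoutCols = 10
val layoutRows = 10

// Rows count downward from the top edge, as in a typical tiling scheme.
def keyFor(e: Extent): SpatialKey = {
  val cellW = (world.xmax - world.xmin) / layoutCols
  val cellH = (world.ymax - world.ymin) / layoutRows
  val cx = (e.xmin + e.xmax) / 2 // centre of the record's extent
  val cy = (e.ymin + e.ymax) / 2
  val col = ((cx - world.xmin) / cellW).toInt
  val row = ((world.ymax - cy) / cellH).toInt
  SpatialKey(col, row)
}
```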
  11. (diagram, inspired by Eugene Cheipesh's slides)
  12. WAT?! • Load data into Spark memory according to some partitioning scheme • Ahead of the shuffle: smaller chunks are better for Spark (the max shuffle block size is only 2 GB) • Are we dependent on the input data type? (yes) • Window reading (what is the desired / perfect window size?)
  13. SPARK SHUFFLE BLOCK FEATURE • ~128 MB per partition (rule of thumb) • if partitionsNumber ≈ 2000, repartition(> 2000) (above 2000 partitions Spark switches to a compressed shuffle map status)
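The rule of thumb above can be sketched as a small helper; the 128 MB target is the slide's heuristic, not a Spark constant:

```scala
// Rule-of-thumb sketch: aim for roughly 128 MB per shuffle partition,
// which keeps every shuffle block far below the 2 GB block limit.
val targetPartitionBytes = 128L * 1024 * 1024 // ~128 MB

def suggestedPartitions(totalBytes: Long): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt)

// e.g. the 13 GB dataset from the later slides:
val partitionsFor13Gb = suggestedPartitions(13L * 1024 * 1024 * 1024) // 104
```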
  15. WINDOWED READS • Here we have a crop function applied by grid bounds to each element: tiff.crop(gridBounds) (this is what the rr.readWindows function does)
  16. WINDOWED READS • 13 GB does not load efficiently into the memory of three AWS m3.xlarge instances.
  17. WINDOWED READS • Instead of 13 GB, it fetches as much as 40 GB per partition…
  18. WINDOWED READS • The solution is to pack segments into the desired windows based on the input format's requirements • After all, the main idea is to leverage the gains of a good partitioning scheme
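A hypothetical sketch of the segment-packing idea: group consecutive compressed-segment sizes so that each read window lands near a target byte size, instead of letting fixed grid windows cut across (and therefore re-fetch) segments. `packSegments` is illustrative, not GeoTrellis code:

```scala
// Greedily pack consecutive segment sizes (in bytes) into read windows:
// a window is closed as soon as adding the next segment would push it
// past the target size. Returns the segment indices of each window.
def packSegments(segmentSizes: Seq[Long], targetWindowBytes: Long): Seq[Seq[Int]] = {
  val windows = scala.collection.mutable.ArrayBuffer(
    scala.collection.mutable.ArrayBuffer.empty[Int])
  var current = 0L // bytes accumulated in the open window
  segmentSizes.zipWithIndex.foreach { case (size, i) =>
    if (current + size > targetWindowBytes && windows.last.nonEmpty) {
      windows += scala.collection.mutable.ArrayBuffer.empty[Int]
      current = 0L
    }
    windows.last += i
    current += size
  }
  windows.map(_.toSeq).toSeq
}
```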
  19. READ / WRITE • SFC index and parallelism-level control • Cassandra range-query example (range queries, compared to the Spark Cassandra connector; query parallelism inside Spark partitions)
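A minimal sketch of one common space-filling curve, the Z-order (Morton) curve: interleaving the bits of col and row yields a 1-D index in which spatially close keys tend to be numerically close, which is what makes range queries against the backing store effective:

```scala
// Z-order (Morton) index: interleave the bits of col and row, with col
// bits on even positions and row bits on odd positions of the result.
def zIndex(col: Int, row: Int): Long = {
  var z = 0L
  var i = 0
  while (i < 31) { // 31 bits cover any non-negative Int coordinate
    z |= ((col.toLong >> i) & 1L) << (2 * i)
    z |= ((row.toLong >> i) & 1L) << (2 * i + 1)
    i += 1
  }
  z
}
// The four cells of a 2x2 block map to the contiguous range 0..3,
// so a spatial query becomes a small number of 1-D index ranges.
```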
  20. READ / WRITE
  21. API & SPARK PROBLEMS • Spark has its limitations • It is not required for small amounts of data (in the real-time case even milliseconds matter; otherwise we have to live somehow with Spark's slow responses) • Is a second API, in addition to the RDD API, the answer? (the Collections API; does it make any sense to abstract over RDDs and Collections?)
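One way such an abstraction could look (a sketch, not the actual GeoTrellis Collections API): a typeclass over the container, so a pipeline is written once and run on either backend. Only the in-memory Seq instance is shown here; an RDD instance would live in the Spark module:

```scala
import scala.language.higherKinds

// Typeclass over the container: the operations both RDDs and in-memory
// collections support. An RDD[A] instance would be provided separately.
trait TileOps[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
}

// The in-memory backend, for small data where Spark is not required.
implicit val seqOps: TileOps[Seq] = new TileOps[Seq] {
  def map[A, B](fa: Seq[A])(f: A => B): Seq[B] = fa.map(f)
  def filter[A](fa: Seq[A])(p: A => Boolean): Seq[A] = fa.filter(p)
}

// A pipeline written once against the typeclass (hypothetical example:
// drop NoData-like negative cells, then brighten the rest).
def brighten[F[_]](cells: F[Int])(implicit ops: TileOps[F]): F[Int] =
  ops.map(ops.filter(cells)(_ >= 0))(_ + 10)
```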