Distributed georeferenced raster processing on Spark with GeoTrellis
GeoTrellis is a geographic data processing engine for high-performance applications. This presentation focuses on how the Spark RDD partitioning scheme can influence the behaviour of the whole Spark application.
6.
• RDD
(the basic core Spark type; often
called legacy, but it is not)
• Manual partitioning control
• DATASET
• Query-planning optimizations,
more suited to data that is
already well partitioned
and structured
PARTITIONING SCHEME
SPECIAL “BROWN COLORED” FUNCTIONS (THE ONES THAT TRIGGER A SHUFFLE)
• join
• groupByKey
• reduceByKey
• combineByKey
• repartition
• In general, any function that
exposes a preservesPartitioning
flag or accepts a partitioner as
an argument; probably map, too
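The partitioner question above can be modelled without Spark at all. The sketch below (plain Scala, hypothetical names, no Spark dependency) mimics how a hash partitioner assigns keys, and shows why a plain `map` cannot preserve a partitioning while `mapValues` can: `map` may rewrite the key, so the element could belong to a different partition afterwards.

```scala
// Minimal model of hash partitioning; PartitioningSketch is an illustrative
// name, not a GeoTrellis or Spark API.
object PartitioningSketch {
  // Same scheme as Spark's HashPartitioner: non-negative modulus of hashCode.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // mapValues-style transform: the key is untouched, so the element's
  // partition is stable and Spark can keep the partitioner.
  def transformValue(kv: (String, Double)): (String, Double) =
    (kv._1, kv._2 * 2)

  // map-style transform: the key changes, so the partition may change too,
  // which is why Spark drops the partitioner after a plain `map`.
  def transformKey(kv: (String, Double)): (String, Double) =
    (kv._1 + "-derived", kv._2)
}
```

This is why `mapValues` and `mapPartitions(..., preservesPartitioning = true)` keep the partitioner, while an unconstrained `map` forces Spark to forget it.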
12.
WAT?!
• Load data into Spark memory according to some
partitioning scheme
• Ahead of a shuffle: smaller chunks are better for
Spark (the max shuffle block size is only 2 GB)
• Are we dependent on the input data type? (yes)
• Window reading (what’s the desired / perfect
window size?)
13.
SPARK SHUFFLE BLOCK FEATURE
• ~128 MB per partition (rule of thumb)
• if (partitionsNumber ≈ 2000) repartition(> 2000): don’t sit just below the 2000 threshold
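The two rules above can be combined into one small sizing helper. This is a sketch with illustrative names (`PartitionCount`, `numPartitions` are not GeoTrellis API); the 2000 boundary refers to Spark switching to a compressed representation of shuffle map statuses above 2000 partitions.

```scala
// Derive a partition count from the ~128 MB rule of thumb, and jump past the
// 2000-partition boundary rather than sit just below it.
object PartitionCount {
  val TargetPartitionBytes: Long = 128L * 1024 * 1024 // ~128 MB per partition

  def numPartitions(totalBytes: Long): Int = {
    val raw = math.ceil(totalBytes.toDouble / TargetPartitionBytes).toInt
    // Near the 2000 boundary? Go over it instead of staying just under.
    if (raw >= 1900 && raw <= 2000) 2001 else math.max(raw, 1)
  }
}
```

For example, the 13 GB dataset mentioned later in the deck would get 104 partitions under this rule.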
15.
WINDOWED READS
• Essentially a crop function applied
by grid bounds to each element:
tiff.crop(gridBounds) (this is what
the rr.readWindows function does)
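A windowed read boils down to splitting the raster’s pixel grid into window-sized grid bounds and cropping each one. The sketch below uses a hypothetical `GridBounds` case class as a stand-in for the GeoTrellis grid-bounds type; the real `rr.readWindows` / `tiff.crop(gridBounds)` operate on the same idea.

```scala
// Hypothetical stand-in for GeoTrellis' grid bounds (inclusive pixel bounds).
final case class GridBounds(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int)

object Windows {
  // Split a cols x rows raster into windows of at most windowSize pixels
  // per side; edge windows are clipped to the raster extent.
  def split(cols: Int, rows: Int, windowSize: Int): Seq[GridBounds] =
    for {
      r <- 0 until rows by windowSize
      c <- 0 until cols by windowSize
    } yield GridBounds(c, r,
                       math.min(c + windowSize, cols) - 1,
                       math.min(r + windowSize, rows) - 1)
}
```

The open question from the slide — what is the desired window size? — is exactly the `windowSize` parameter here.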
16.
WINDOWED READS
• 13 GB of input would not load
efficiently into the memory of three
AWS m3.xlarge instances
17.
WINDOWED READS
• Instead of 13 GB, it fetches as much
as 40 GB per partition…
18.
WINDOWED READS
• The solution is to pack segments into
the desired windows based on the input
format requirements
• After all, the main idea is to leverage
the gains of a good partitioning
scheme
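Packing segments into windows can be sketched as a greedy bin-packing over segment byte sizes: keep adding segments to the current window until the next one would exceed the window budget. `SegmentPacker` and its signature are illustrative, not the GeoTrellis implementation.

```scala
object SegmentPacker {
  // Greedily pack consecutive segment sizes (in bytes) into windows whose
  // total stays within windowBudget; oversized segments get their own window.
  def pack(segmentBytes: Seq[Long], windowBudget: Long): Seq[Seq[Long]] =
    segmentBytes.foldLeft(Vector.empty[Vector[Long]]) { (windows, seg) =>
      windows.lastOption match {
        case Some(last) if last.sum + seg <= windowBudget =>
          windows.init :+ (last :+ seg) // still fits in the current window
        case _ =>
          windows :+ Vector(seg)        // start a new window
      }
    }
}
```

Packing consecutive segments (rather than arbitrary ones) matters for formats like tiled GeoTIFF, where neighbouring segments can be read in one request.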
19.
READ / WRITE
• SFC (space-filling curve) index and parallelism level control
• Cassandra range-queries example
(range queries, compared to the Spark Cassandra connector; query
parallelism inside Spark partitions)
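The SFC index behind those range queries can be illustrated with a Z-order (Morton) curve: interleaving the bits of a tile’s (col, row) key gives a single long index in which spatially close tiles fall into nearby index ranges, which is what makes range scans over a Cassandra clustering key effective. This is a generic 16-bit-per-dimension sketch, not GeoTrellis’ own indexing code.

```scala
object ZCurve {
  // Spread the low 16 bits of x into the even bit positions of a Long.
  def interleave(x: Int): Long = {
    var v = x.toLong & 0xFFFFL
    v = (v | (v << 8)) & 0x00FF00FFL
    v = (v | (v << 4)) & 0x0F0F0F0FL
    v = (v | (v << 2)) & 0x33333333L
    v = (v | (v << 1)) & 0x55555555L
    v
  }

  // Z-order index: col bits on even positions, row bits on odd positions.
  def index(col: Int, row: Int): Long =
    interleave(col) | (interleave(row) << 1)
}
```

A 2x2 block of tiles maps to the contiguous index range 0..3, so one range query fetches the whole block.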
21.
API & SPARK PROBLEMS
• Spark has its limitations
• It is not needed for small amounts of data
(in the real-time case even milliseconds matter; otherwise we have to live
somehow with Spark’s slow responses)
• Is a second API, alongside the RDD API,
the answer?
(a Collections API; does it make sense to abstract over RDDs and Collections?)
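One way to abstract over RDDs and plain collections is a small typeclass that the high-level API is written against. The sketch below shows the shape with a `Seq` instance only; an RDD instance would need a Spark dependency and is omitted. `SeqLike` and `sumOfSquares` are illustrative names, not the GeoTrellis Collections API.

```scala
// Capability the high-level API needs from a "collection-like" container.
trait SeqLike[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def reduce[A](fa: F[A])(f: (A, A) => A): A
}

object SeqLike {
  // Local-collections instance; an RDD instance would delegate to
  // rdd.map / rdd.reduce in exactly the same way.
  implicit val seqInstance: SeqLike[Seq] = new SeqLike[Seq] {
    def map[A, B](fa: Seq[A])(f: A => B): Seq[B] = fa.map(f)
    def reduce[A](fa: Seq[A])(f: (A, A) => A): A = fa.reduce(f)
  }

  // Business logic written once: runs on a local Seq today, and on an RDD
  // as soon as an RDD instance of SeqLike is provided.
  def sumOfSquares[F[_]](fa: F[Int])(implicit F: SeqLike[F]): Int =
    F.reduce(F.map(fa)(x => x * x))(_ + _)
}
```

The trade-off is real, though: for millisecond-latency requests the local instance avoids Spark’s scheduling overhead entirely, which is the point the slide is making.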