Distributed georeferenced raster processing on Spark with GeoTrellis

DISTRIBUTED GEOREFERENCED RASTER PROCESSING ON SPARK
Grigory Pomadchin @daunnc / @pomadchin
GEO +
VectorTiles + PointClouds +
———————————————————————————————
Raster +Vector +
/w
GEOTRELLIS ECOSYSTEM
• Raster Foundry (Spark SQL & ML)
• RasterFrames (Spark SQL & ML, Datasets query API)
• GeoPySpark (Python bindings)
• VectorPipes (Vector tiles on Spark)
• PDAL integration (PointClouds on Spark)
WHAT'S UNDER THE COVERS?
SPACE FILLING CURVES
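A space-filling curve maps a two-dimensional tile key (col, row) to a single integer so that tiles close together on the grid tend to get nearby index values. The sketch below shows a Z-order (Morton) curve, one common choice. It is an illustration of the idea in plain Python, not GeoTrellis code; the function name and the 16-bit limit per coordinate are assumptions made for the example.

```python
def z_index(col: int, row: int, bits: int = 16) -> int:
    """Z-order (Morton) index: interleave the bits of col and row."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)      # col bits land on even positions
        index |= ((row >> i) & 1) << (2 * i + 1)  # row bits land on odd positions
    return index


# Sorting a 4x4 grid of keys by index walks the grid in a "Z" pattern.
keys = [(c, r) for r in range(4) for c in range(4)]
ordered = sorted(keys, key=lambda k: z_index(*k))
```

Sorting keys by this index keeps each 2x2 block of tiles contiguous, which is what makes range scans over the index useful for spatial queries.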
PARTITIONING SCHEME
• RDD (a basic core Spark type from the past? No.)
• Manual partitioning control
• Dataset
• Query planning optimizations; better suited to data that is already well partitioned and structured
SPECIAL BROWN COLORED FUNCTIONS
• join
• groupByKey
• reduceByKey
• combineByKey
• repartition
• Any function that has no preservesPartitioning flag, or that can accept a partitioner as an argument. Probably map?
MAP IS A FUNCTION OF A DIFFERENT KIND?
[Diagram: three pipelines compared: join → reduceByKey; join → map → reduceByKey; join → mapValues → reduceByKey]
• Inspired by Eugene Cheipesh's slides
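The reason map is "a function of a different kind" is that map may change keys, so Spark drops the known partitioner on the result, while mapValues leaves keys untouched and keeps it. A later reduceByKey on the map result then needs a shuffle that the mapValues version can avoid. The toy model below only mimics that bookkeeping in plain Python; the class and method names are hypothetical and nothing here is Spark's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyPairDataset:
    """Toy stand-in for a keyed dataset that remembers its partitioner."""
    records: Tuple[tuple, ...]   # (key, value) pairs
    partitioner: Optional[str]   # name of the known partitioner, or None

    def map(self, f: Callable) -> "ToyPairDataset":
        # f sees the whole record and may change the key,
        # so the old partitioning can no longer be trusted.
        return ToyPairDataset(tuple(f(kv) for kv in self.records), None)

    def map_values(self, f: Callable) -> "ToyPairDataset":
        # Keys are untouched, so the partitioner is still valid.
        return ToyPairDataset(
            tuple((k, f(v)) for k, v in self.records), self.partitioner
        )

    def needs_shuffle(self, target: str) -> bool:
        # A by-key operation can skip the shuffle only if the data
        # is already partitioned the way the operation wants.
        return self.partitioner != target


tiles = ToyPairDataset((((0, 0), 1), ((0, 1), 2)), "hash-8")
via_map = tiles.map(lambda kv: (kv[0], kv[1] * 2))
via_map_values = tiles.map_values(lambda v: v * 2)
```

Both results hold the same records, but only `via_map_values` still reports the "hash-8" partitioner.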
DATA PREPARATION
• {hadoop | s3}GeoTiffRDD loads data from {HDFS / LocalFS | S3} into Spark
• (I, V): the input key I is {ProjectedExtent(extent, crs) | TemporalProjectedExtent(extent, crs, time)}; the value V is a {Multiband | Singleband}Tile
• K: the key after tiling, {Spatial(col, row) | SpaceTime(col, row, time)}
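Going from the input key I (an extent) to the grid key K (col, row) amounts to asking which cells of a fixed layout grid the extent touches. The sketch below shows that arithmetic in plain Python. The `Extent` type, the function name, and the convention that row 0 is the top row are assumptions made for illustration; this is not the GeoTrellis tiling API.

```python
import math
from typing import List, NamedTuple, Tuple


class Extent(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def keys_for_extent(extent: Extent, layout: Extent,
                    layout_cols: int, layout_rows: int) -> List[Tuple[int, int]]:
    """Return the (col, row) keys of layout cells that the extent overlaps."""
    tile_w = (layout.xmax - layout.xmin) / layout_cols
    tile_h = (layout.ymax - layout.ymin) / layout_rows
    col_min = int((extent.xmin - layout.xmin) // tile_w)
    col_max = math.ceil((extent.xmax - layout.xmin) / tile_w) - 1
    # Assumption: rows are counted downwards from the top edge of the layout.
    row_min = int((layout.ymax - extent.ymax) // tile_h)
    row_max = math.ceil((layout.ymax - extent.ymin) / tile_h) - 1
    # Clamp to the layout so extents hanging over the edge stay in range.
    cols = range(max(col_min, 0), min(col_max, layout_cols - 1) + 1)
    rows = range(max(row_min, 0), min(row_max, layout_rows - 1) + 1)
    return [(c, r) for r in rows for c in cols]


layout = Extent(0.0, 0.0, 4.0, 4.0)                           # 4x4 grid of 1x1 cells
straddling = keys_for_extent(Extent(0.5, 2.5, 1.5, 3.5), layout, 4, 4)
aligned = keys_for_extent(Extent(1.0, 1.0, 2.0, 2.0), layout, 4, 4)
```

An input tile that straddles cell borders yields several keys, which is why one source image can contribute to many (K, V) records.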
• Inspired by Eugene Cheipesh's slides
WAT?!
• Load data into Spark memory according to some partitioning scheme
• Before a shuffle, smaller chunks suit Spark better (the maximum shuffle block size is only 2 GB)
• Are we dependent on the input data type? (yes)
• Windowed reading (what is the desired / perfect window size?)
SPARK SHUFFLE BLOCK FEATURE
• ~128 MB per partition (rule of thumb)
• If partitionsNumber is close to 2000, repartition to more than 2000
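The two rules of thumb above can be turned into a small calculation: divide the data size by the target partition size, and if the result lands just under 2000, bump it over. The 2000 figure relates to Spark tracking shuffle output sizes in a more compact form once a stage has more than 2000 partitions. The sketch below is plain Python; the function name and the 10% "close to 2000" margin are assumptions, not values from the talk.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024   # ~128 MB rule of thumb
COMPACT_STATUS_THRESHOLD = 2000              # partition count from the slide


def suggest_partitions(total_bytes: int, near: float = 0.1) -> int:
    """Suggest a partition count for a dataset of the given size.

    `near` is an assumed margin: counts within 10% below the threshold
    are pushed just above it.
    """
    n = max(1, math.ceil(total_bytes / TARGET_PARTITION_BYTES))
    lower = COMPACT_STATUS_THRESHOLD * (1 - near)
    if lower <= n <= COMPACT_STATUS_THRESHOLD:
        n = COMPACT_STATUS_THRESHOLD + 1
    return n


thirteen_gib = suggest_partitions(13 * 1024 ** 3)
just_under = suggest_partitions(1900 * TARGET_PARTITION_BYTES)
```

The returned number would then be passed to a repartition call in the actual job.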
WINDOWED READS
• Each element is cropped by grid bounds: tiff.crop(gridBounds) (this is what the rr.readWindows function does)
• Loading 13 GB into the memory of three AWS m3.xlarge instances is not efficient
• Instead of 13 GB, it fetches as much as 40 GB per partition…
• The solution is to pack segments into the desired windows based on the requirements of the input format
• After all, the main idea is to benefit from a good partitioning scheme
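One way to read "pack segments into desired windows" is to make every window edge a multiple of the TIFF segment size (tile or strip), so that no segment is shared between two windows. A segment has to be read and decoded whole, so windows that cut across segments can fetch the same bytes several times, which would be consistent with the read blow-up described above. The sketch below is plain Python with hypothetical names; it is not the GeoTrellis windowing code.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (col_min, row_min, col_max, row_max), max exclusive


def segment_aligned_windows(cols: int, rows: int,
                            seg_cols: int, seg_rows: int,
                            max_window: int) -> List[Window]:
    """Split a cols x rows image into windows whose edges fall on segment borders."""
    # Largest multiple of the segment edge that fits in max_window,
    # but never smaller than one whole segment.
    win_cols = max(seg_cols, (max_window // seg_cols) * seg_cols)
    win_rows = max(seg_rows, (max_window // seg_rows) * seg_rows)
    windows = []
    for row0 in range(0, rows, win_rows):
        for col0 in range(0, cols, win_cols):
            windows.append((col0, row0,
                            min(col0 + win_cols, cols),
                            min(row0 + win_rows, rows)))
    return windows


# A 1000x600 image stored in 256x256 tiles, with windows of at most 600 pixels per side.
tiled = segment_aligned_windows(1000, 600, 256, 256, 600)
# A striped image: each segment spans the full width and 16 rows.
striped = segment_aligned_windows(1000, 600, 1000, 16, 100)
```

For the striped layout the windows become full-width bands, which shows how the input format dictates the window shape.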
READ / WRITE
• SFC index and parallelism level control
• Cassandra and range queries example

(range queries compared with the Spark Cassandra connector; query parallelism inside Spark partitions)
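With an SFC index, a bounding-box query becomes a set of contiguous index ranges, and the number of groups those ranges are split into sets the level of read parallelism. The sketch below illustrates this in plain Python with a Z-order index. It enumerates every cell in the box, which is fine for a demonstration but not how a production index would compute ranges; all names are hypothetical and no Cassandra or GeoTrellis API is used.

```python
from typing import List, Tuple


def z_index(col: int, row: int, bits: int = 16) -> int:
    """Z-order index: col bits on even positions, row bits on odd positions."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)
        index |= ((row >> i) & 1) << (2 * i + 1)
    return index


def z_ranges(col_min: int, row_min: int,
             col_max: int, row_max: int) -> List[Tuple[int, int]]:
    """Inclusive index ranges covering every cell in the box (bounds inclusive)."""
    indices = sorted(z_index(c, r)
                     for r in range(row_min, row_max + 1)
                     for c in range(col_min, col_max + 1))
    ranges = []
    start = prev = indices[0]
    for i in indices[1:]:
        if i != prev + 1:          # gap found: close the current range
            ranges.append((start, prev))
            start = i
        prev = i
    ranges.append((start, prev))
    return ranges


def split_for_parallelism(ranges: List[Tuple[int, int]], n: int) -> List[list]:
    """Deal ranges round-robin into at most n groups, one group per reader."""
    groups = [ranges[i::n] for i in range(n)]
    return [g for g in groups if g]


aligned_box = z_ranges(0, 0, 1, 1)       # one 2x2 block aligned with the curve
shifted_box = z_ranges(1, 0, 2, 1)       # same size, shifted by one column
reader_groups = split_for_parallelism(shifted_box, 2)
```

Each group could then be issued as a batch of range queries by one task, so the group count controls how many queries run side by side.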
API & SPARK PROBLEMS
• Spark has its limitations
• It is not needed for small amounts of data
(In the real-time case even milliseconds matter; otherwise we have to live with Spark's slow responses)
• Is a second API, in addition to the RDD API, the answer?
(Collections API; does it make sense to abstract over RDDs and Collections?)
• https://github.com/locationtech/geotrellis
• https://geotrellis.io
• https://www.azavea.com
• https://www.azavea.com/blog
• https://yuns-stacy.github.io/geotrellis-angular-demo/dashboard
• http://rasterframes.io/ml/statistics.html
• https://github.com/pomadchin