Point in Polygon Convex Hull Does Shape X (intersect) Shape Y Contain cover touch Buffer Shape X by a certain amount to produce Y
Explain each component at high level
Apache con big data 2015 magellan
Magellan: Geospatial Analytics on Spark
Spark and Data Science Architect,
• Geospatial Analytics
• Geospatial Data Formats
• Spark SQL and Catalyst: An Intro
• How does Magellan use Spark SQL?
• Q & A
Geospatial Context is useful
Where do people go on weekends?
Does usage pattern change with time?
Predict the drop off point of a user?
Predict the location where next pick up can be expected?
Identify crime hotspots
How do these hotspots evolve with time?
Predict the likelihood of crime occurring at a given neighborhood
Predict climate at fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: Join climate dataset with Crimes
Obscure Data Formats
• ESRI Shapefile format
–DBX (Dbase File Format, very few legitimate parsers exist)
• Open Source Parsers exist (GIS 4 Hadoop)
Why do we need a proper framework?
• No standardized way of dealing with data and metadata
• Esoteric Data Formats and Coordinate Systems
• No optimizations for efficient joins
• APIs are too low level
• Language integration simplifies exploratory analytics
• Commonly used algorithms can be made available at scale
What do we want to support?
• Parse geospatial data and metadata into Shapes + Metadata Map
• Python and Scala support
• Geometric Queries
–simple and intuitive syntax!
• Scalable implementations of common algorithms
Where are we at?
• Magellan available on Github (https://github.com/harsha2010/magellan)
• Can parse and understand most widely used formats
• All geometries supported
• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)
• Broadcast join available for common scenarios
• Work in progress (targeted 1.0.4)
–Geohash Join optimization
–Map Matching algorithm using Markov Models
• Python and Scala support
• Please give it a try and give us feedback!
• Basic Abstraction in terms of Shape
–Point, PolyLine, Polygon
–Supports multiple implementations (currently uses ESRI-Java-API)
• SQL Data Type = Shape
–Efficient: Construct once and use
• Operations supported as SQL operations
–within, intersects, contains etc.
• Allows efficient Join implementations using Catalyst
–Broadcast join already available
–Geohash based join algorithm in progress
–Intuitive manipulation of distributed structured data
• Catalyst Optimizer
–Push predicates to Data Source, allows optimized filters
• Memory Optimized Execution Engine
The Spark ecosystem
Distributed Compute Engine
• Speed, ease of use and fast prototyping
• Open source
• Powerful abstractions
• Python, R , Scala, Java support
Spark DataFrames are intuitive
dept name age
Bio H Smith 48
CS A Turing 54
Bio B Jones 43
Phys E Witten 61
Rows and Data Types
• Standard SQL Data Types
–Date, Int, Long, String, etc
• Complex Data Types
–Array, Map, Struct, etc
• Custom Data Types
• Row = Collection of Data Types
–Represents a single row
• Arithmetic Expressions
–Not, and, in, case when
• String Expressions
–substring, like, startsWith
• Data Sources to read data into Data Frames
–Supports extending pushdowns to data store
• Optimized in memory layout
–ORC, Tungsten etc.
• Spark Strategies
–Convert logical plan -> physical plan
–Rules based on statistics
How does Magellan use Catalyst?
• Custom Data Source
–Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs
–Returns a DataFrame with columns (point, polygon, polyline, metadata)
–Outputs Shape instances
• Custom Data Type
–Point, Polygon, Polyline instances of Shape
–Each Shape has a python counterpart.
–Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
• Magellan Context
–Overrides Spark Planner allowing custom join implementations
• Python wrappers
• Adds Custom Python Data Types
–Point, PolyLine, Polygon wrappers around Scala Data Types
• Wraps coordinate transformations and expressions
• Custom Picklers and Unpicklers
–Serialize to and from Scala
• Geohash Indices
• Spatial Join Optimization
• Map Matching Algorithms
• Improving pyspark bindings
• Given sequence of points (representing a trip?) what was the road path taken?
–Error in GPS measurements
–Error in coordinate projections
–Time Gap between measurements, cannot just snap to nearest road