Page1
Magellan: Geospatial Analytics on Spark
Ram Sriharsha
Twitter: @halfabrane
Spark and Data Science Architect,
Hortonworks
Page2
What is geospatial context?
•Given a point = (-122.412651, 37.777748) which
city is it in?
•Does shape X intersect shape Y?
–Compute the intersection
•Given a sequence of points and a system of roads
–Compute best path representing points
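The point-in-polygon query above is the workhorse primitive. A minimal sketch of the classic ray-casting test in plain Python (Magellan itself delegates local computations to the ESRI Java API, which also handles degenerate cases such as points exactly on a boundary; the "city" polygon here is a made-up box, not real boundary data):

```python
def point_in_polygon(px, py, polygon):
    """polygon is a list of (x, y) vertices; returns True if (px, py) is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Cast a horizontal ray to the right of the point and count
        # how many polygon edges it crosses; odd count = inside.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# A crude box around San Francisco (illustrative coordinates only).
sf = [(-122.52, 37.70), (-122.35, 37.70), (-122.35, 37.83), (-122.52, 37.83)]
print(point_in_polygon(-122.412651, 37.777748, sf))  # → True
```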
Page3
Geospatial context is useful
What neighborhoods do people go to on weekends?
Predict the drop off neighborhood of a user?
Predict the location where next pick up can be expected?
How does usage pattern change with time?
Identify crime hotspot neighborhoods
How do these hotspots evolve with time?
Predict the likelihood of crime occurring at a given neighborhood
Predict climate at fairly granular level
Climate insurance: do I need to buy insurance for my crops?
Climate as a factor in crime: Join climate dataset with Crimes
Page4
Geospatial data is pervasive
Page5
Why geospatial now?
Vast mobile data + geospatial
= truly big data problem!
Page6
Do you think we need one more geospatial library?
Page7
Parsing!
•ESRI Shapefiles
–Spec for Shapes, no spec for metadata
–Worse, metadata = dBase format (really??)
•GeoJSON
–Verbose
–But at least parseable
–Unfortunately not common
•ESRI Format
–JSON but not GeoJSON!
Page8
Coordinate System Hell!
Mobile data = GPS coordinates
Map coordinate systems optimized for precision
Transform from one to another
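As a concrete example of one such transform: GPS devices emit WGS84 longitude/latitude in degrees, while web map tiles use Web Mercator (EPSG:3857) meters. A minimal Python sketch of that single projection (real pipelines use a library such as pyproj, which covers the thousands of other coordinate reference systems):

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis, in meters

def to_web_mercator(lon, lat):
    """WGS84 degrees -> Web Mercator (EPSG:3857) meters."""
    x = math.radians(lon) * EARTH_RADIUS
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * EARTH_RADIUS
    return x, y

x, y = to_web_mercator(-122.412651, 37.777748)
print(round(x), round(y))  # San Francisco, in meters from (0, 0)
```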
Page9
Scalability (or the lack thereof)
•ESRI Hive (runs on Hadoop but lacks spatial joins)
•JTS, GEOS, Shapely (single-machine, no scale-out support)
•Other proprietary engines = black boxes
Page10
Venn diagram of geospatial libraries?
[Three overlapping circles: "Simple, intuitive, handles common formats", "Scalable", "Feature rich but still extensible"]
Page11
Feature Extractors
Language integration simplifies exploratory analytics
[Pipeline diagram: logs on HDFS are parsed and cleaned; ad/query category mappings and spatial context feed Q-Q and Q-A similarity features (polynomial expansion of Q-A); a convex solver trains the model over a train/test split with validation metrics; the model is scored in real time at the ad server, with a feedback loop back into the batch data-prep stage.]
Page12
Not all is lost!
• Local computations w/ ESRI Java API
• Scale-out computation w/ Spark
• Python + R support without compromising
performance via PySpark, SparkR
• Catalyst + Data Sources + Data Frames
= Flexibility + Simplicity + Performance
• Stitch it all together + Allow extension points
=> Success!
Page13
Magellan: a complete story for geospatial?
Create geospatial analytics applications
faster:
• Use your favorite language (Python/Scala), even R
• Get best in class algorithms for common spatial analytics
• Write less code
• Read data efficiently
• Let the optimizer do the heavy lifting
Page14
How does it work?
Custom Data Types for Shapes:
• Point, Line, PolyLine, Polygon extend Shape
• Local Computations using ESRI Java API
• No need for Scala -> SQL serialization
Expressions for Operators:
• Literals, e.g. point(-122.4, 37.6)
• Boolean Expressions, e.g. Intersects, Contains
• Binary Expressions, e.g. Intersection
Custom Data Sources:
• Schema = [point, polyline, polygon, metadata]
• Metadata = Map[String, String]
• GeoJSON and Shapefile implementations
Custom Strategies for Spatial Join:
• Broadcast Cartesian Join
• Geohash Join (in progress)
• Plug into Catalyst as experimental strategies
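The idea behind the in-progress geohash join can be sketched standalone: shapes whose geohashes share a prefix fall in the same grid cell, so the join only compares shapes bucketed together instead of evaluating the full cartesian product. A minimal geohash encoder in plain Python (illustrative only, not Magellan code):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lon, lat, precision=5):
    """Encode lon/lat into a geohash string by bisecting the globe."""
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    bits, bit_count = 0, 0
    even = True  # geohash interleaves bits: longitude first
    chars = []
    while len(chars) < precision:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            if bit:
                lon_lo = mid
            else:
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            if bit:
                lat_lo = mid
            else:
                lat_hi = mid
        bits = bits * 2 + bit
        bit_count += 1
        even = not even
        if bit_count == 5:  # 5 bits per base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash(-122.412651, 37.777748))  # → "9q8yy" (downtown San Francisco cell)
```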
Page15
Magellan in a nutshell
• Read Shapefiles/GeoJSON as Data Sources:
–sqlContext.read.format("magellan").load("$path")
–sqlContext.read.format("magellan").option("type", "geojson").load("$path")
• Spatial Queries using Expressions
–point(-122.5, 37.6) = Shape Literal
–$"point" within $"polygon" = Boolean Expression
–$"polygon1" intersection $"polygon2" = Binary Expression
• Joins using Catalyst + Spatial Optimizations
–points.join(polygons).where($"point" within $"polygon")
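The broadcast cartesian join strategy, stripped down to plain Python lists: the small polygon relation is "broadcast" to every partition, and each point is tested against all of it. Here a bounding-box check stands in for the full `within` predicate; all names are illustrative, not Magellan's API:

```python
polygons = [  # (name, (min_lon, min_lat, max_lon, max_lat)) — made-up boxes
    ("SF",  (-122.52, 37.70, -122.35, 37.83)),
    ("NYC", ( -74.26, 40.49,  -73.70, 40.92)),
]

points = [(-122.412651, 37.777748), (-73.9857, 40.7484), (0.0, 0.0)]

def within_bbox(p, bbox):
    """Cheap stand-in predicate: is the point inside the bounding box?"""
    lon, lat = p
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# The "join": every point is checked against the broadcast polygon list.
matches = [(p, name) for p in points
           for name, bbox in polygons if within_bbox(p, bbox)]
print(matches)  # SF point matches "SF", midtown point matches "NYC", (0, 0) matches nothing
```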
Page16
Where are we at?
Magellan 1.0.3 is out on Spark Packages, go give it a try!:
• Scala support; Python support will be functional in 1.0.4 (needs Spark 1.5)
• GitHub: https://github.com/harsha2010/magellan
• Spark Packages: http://spark-packages.org/package/harsha2010/magellan
• Data Formats: ESRI Shapefile + metadata, GeoJSON
• Operators: Intersects, Contains, Within, Intersection
• Joins: Broadcast
• Blog: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
• Zeppelin Notebook Example: http://bit.ly/1GwLyrV
Page17
What is next?
Magellan 1.0.4, expected release in December:
• Python support
• MultiPolygon (Polygon Collection), MultiLineString (PolyLine Collection)
• Spark 1.5, 1.6
• Spatial Join Optimization
• Map Matching Algorithms
• More Operators based on requirements
• Support for other common geospatial data formats (WKT, others?)
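For a sense of what WKT support would entail, here is a hypothetical parser for the simplest case, a WKT `POINT` (not Magellan code; real WKT also covers LINESTRING, POLYGON, multi-geometries, and more):

```python
import re

def parse_wkt_point(wkt):
    """Parse 'POINT (x y)' into a (float, float) tuple."""
    m = re.match(r"\s*POINT\s*\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)\s*$", wkt, re.I)
    if not m:
        raise ValueError("not a WKT POINT: %r" % wkt)
    return float(m.group(1)), float(m.group(2))

print(parse_wkt_point("POINT (-122.412651 37.777748)"))
```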
Page18
Demo
Uber queries

Spark Summit Europe 2015: Magellan