ApacheCon Big Data 2015: Magellan


Introduction to Magellan, a geospatial analytics library written on top of Spark. Slides for a talk given at ApacheCon Big Data 2015.

Published in: Data & Analytics

  1. Magellan: Geospatial Analytics on Spark. Ram Sriharsha, Spark and Data Science Architect, Hortonworks
  2. Agenda
       • Geospatial Analytics
       • Geospatial Data Formats
       • Challenges
       • Magellan
       • Spark SQL and Catalyst: An Intro
       • How does Magellan use Spark SQL?
       • Demo
       • Q & A
  3. Geospatial Context is useful
       • Where do people go on weekends?
       • Does the usage pattern change with time?
       • Predict the drop-off point of a user
       • Predict the location where the next pick-up can be expected
       • Identify crime hotspots
       • How do these hotspots evolve with time?
       • Predict the likelihood of crime occurring in a given neighborhood
       • Predict climate at a fairly granular level
       • Climate insurance: do I need to buy insurance for my crops?
       • Climate as a factor in crime: join the climate dataset with the crimes dataset
  4. Geospatial Data is pervasive
  5. Obscure Data Formats
       • ESRI Shapefile format
         – SHP
         – SHX
         – DBF (dBase file format; very few legitimate parsers exist)
       • GeoJSON
       • Open source parsers exist (GIS 4 Hadoop)
  6. Why do we need a proper framework?
       • No standardized way of dealing with data and metadata
       • Esoteric data formats and coordinate systems
       • No optimizations for efficient joins
       • APIs are too low level
       • Language integration simplifies exploratory analytics
       • Commonly used algorithms can be made available at scale
         – Map matching
         – Geohash indices
         – Markov models
  7. What do we want to support?
       • Parse geospatial data and metadata into Shapes + Metadata Map
       • Python and Scala support
       • Geometric queries
         – Efficiently!
         – Simple and intuitive syntax!
       • Scalable implementations of common algorithms
         – Map matching
         – Geohash indexing
         – Spatial joins
  8. Where are we at?
       • Magellan available on GitHub
       • Can parse and understand the most widely used formats
         – GeoJSON, ESRI Shapefile
       • All geometries supported
       • 1.0.3 released
       • Broadcast join available for common scenarios
       • Work in progress (targeted for 1.0.4)
         – Geohash join optimization
         – Map matching algorithm using Markov models
       • Python and Scala support
       • Please give it a try and give us feedback!
  9. Geospatial Data Structures
  10. Operations
       • Intersection
       • Union
       • Symmetric difference
       • Difference
       • Convex hull
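
Of the operations listed above, the convex hull is compact enough to sketch in full. The following is a minimal illustration in plain Python using Andrew's monotone chain algorithm; it is not Magellan's implementation (the slides say Magellan delegates geometry to the ESRI Java API), and the function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    # Build the lower hull, dropping points that make a non-left turn.
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    # Build the upper hull the same way, walking right to left.
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]
```

For a square with one interior point, `convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])` drops the interior point and keeps the four corners.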
  11. Queries
       • contains (within)
       • covers (covered-by)
       • intersects
       • touches
  12. Geospatial Queries: is a triplet of points (A, B, C) ordered clockwise or counter-clockwise?
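
The orientation question can be answered with the sign of a 2D cross product. The sketch below is an illustration in plain Python, assuming planar coordinates with the y-axis pointing up; the names `orientation` and `segments_cross` are hypothetical and not part of Magellan. The second function shows one common use of the test: deciding whether two segments properly cross.

```python
def orientation(a, b, c):
    """Return 1 if a -> b -> c turns counter-clockwise, -1 if clockwise,
    and 0 if the three points are collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 and segment p3-p4 cross at a single interior
    point. Touching endpoints and collinear overlaps are not counted."""
    return (orientation(p1, p2, p3) * orientation(p1, p2, p4) < 0
            and orientation(p3, p4, p1) * orientation(p3, p4, p2) < 0)
```

Two segments cross properly when the endpoints of each lie on opposite sides of the other, which is what the two sign products check.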
  13. Geospatial Queries: ray casting (ray tracing) algorithm. Draw a ray from the point and count the number of times it intersects the polygon boundary; an odd count means the point lies inside the polygon.
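
A minimal sketch of this point-in-polygon test in plain Python, assuming the polygon is a list of (x, y) vertices without a repeated closing vertex. The ray is cast horizontally to the right of the point. The function name is hypothetical, and points lying exactly on the boundary are not handled consistently by this simple version.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count the edges crossed by a horizontal ray that
    starts at `point` and extends to the right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point
        # can be crossed by the ray (this also skips horizontal edges).
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # each crossing flips the parity
    return inside
```

Flipping a boolean on every crossing is equivalent to counting crossings and checking whether the total is odd.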
  14. Accelerating Spatial Queries
       • Bounding box
       • Geohashing
       • R-Tree (and other) indices
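
The bounding-box idea can be sketched as a cheap pre-filter that runs before any exact geometric test. This plain-Python illustration uses hypothetical helper names and is not taken from Magellan. A bounding-box hit only marks a candidate, so an exact test such as ray casting is still needed afterwards.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_contains(bbox, point):
    """True if the point lies inside or on the edge of the box."""
    xmin, ymin, xmax, ymax = bbox
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax


def candidate_polygons(point, polygons):
    """Indices of polygons whose bounding box contains the point.
    Polygons that fail this check cannot contain the point."""
    return [i for i, poly in enumerate(polygons)
            if bbox_contains(bounding_box(poly), point)]
```

In practice the boxes would be computed once and stored next to each shape rather than recomputed per query, which is the kind of saving an index such as an R-Tree builds on.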
  15. Magellan
       • Basic abstraction in terms of Shape
         – Point, PolyLine, Polygon
         – Supports multiple implementations (currently uses ESRI-Java-API)
       • SQL Data Type = Shape
         – Efficient: construct once and use
       • Operations supported as SQL operations
         – within, intersects, contains, etc.
       • Allows efficient join implementations using Catalyst
         – Broadcast join already available
         – Geohash-based join algorithm in progress
  16. load
  17. query metadata
  18. geometric query
  19. Python API
  20. Python API, join
  21. Why Spark?
       • DataFrames
         – Intuitive manipulation of distributed structured data
       • Catalyst Optimizer
         – Pushes predicates to the Data Source, allowing optimized filters
       • Memory-optimized execution engine
  22. The Spark ecosystem
       • Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
       • Distributed compute engine
       • Speed, ease of use and fast prototyping
       • Open source
       • Powerful abstractions
       • Python, R, Scala, Java support
  23. Spark DataFrames are intuitive (panels: RDD, DataFrame)

        dept   name       age
        Bio    H Smith    48
        CS     A Turing   54
        Bio    B Jones    43
        Phys   E Witten   61

  24. Spark DataFrames are fast!
  25. Spark SQL under the covers
  26. Catalyst
       • Rows, Data Types
       • Expressions
       • Operators
       • Rules
       • Optimization
  27. Rows and Data Types
       • Standard SQL Data Types
         – Date, Int, Long, String, etc.
       • Complex Data Types
         – Array, Map, Struct, etc.
       • Custom Data Types
       • Row = Collection of Data Types
         – Represents a single row
  28. Expressions
       • Literals
       • Arithmetic Expressions
         – maxOf, unaryMinus
       • Predicates
         – not, and, in, case when
       • Cast
       • String Expressions
         – substring, like, startsWith
  29. Optimizations
       • Constant Folding
       • Predicate Pushdown
         – Combine Filters
         – Push Predicate Through Join
       • Null Propagation
       • Boolean Simplification
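
To make two of these rules concrete, here is a toy sketch in plain Python of constant folding and boolean simplification over a tiny expression tree. It only illustrates the idea: the node classes (`Lit`, `Col`, `Add`, `And`) and the `optimize` function are hypothetical and do not reflect Catalyst's actual Scala classes or rule API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    value: object  # a literal constant


@dataclass(frozen=True)
class Col:
    name: str  # a column reference, unknown until execution


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def is_bool_lit(expr, value):
    """True if expr is the boolean literal `value` (identity check so that
    Lit(0) is not mistaken for Lit(False))."""
    return isinstance(expr, Lit) and expr.value is value


def optimize(expr):
    """Rewrite the tree bottom-up, applying both rules where they match."""
    if isinstance(expr, Add):
        left, right = optimize(expr.left), optimize(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)  # constant folding
        return Add(left, right)
    if isinstance(expr, And):
        left, right = optimize(expr.left), optimize(expr.right)
        if is_bool_lit(left, False) or is_bool_lit(right, False):
            return Lit(False)  # anything AND false is false
        if is_bool_lit(left, True):
            return right       # true AND p is p
        if is_bool_lit(right, True):
            return left        # p AND true is p
        return And(left, right)
    return expr
```

For example, `optimize(Add(Col("x"), Add(Lit(1), Lit(2))))` folds the constant subtree and returns `Add(Col("x"), Lit(3))`, so the addition of the two literals is done once at planning time instead of once per row.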
  30. Execution Engine
       • Data Sources to read data into DataFrames
         – Supports extending pushdowns to the data store
       • Optimized in-memory layout
         – ORC, Tungsten, etc.
       • Spark Strategies
         – Convert logical plan -> physical plan
         – Rules based on statistics
  31. How does Magellan use Catalyst?
       • Custom Data Source
         – Parses GeoJSON, ESRI Shapefile, etc. into (Shape, Metadata) pairs
         – Returns a DataFrame with columns (point, polygon, polyline, metadata)
         – Overrides PrunedFilteredScan
         – Outputs Shape instances
       • Custom Data Type
         – Point, Polygon, Polyline instances of Shape
         – Each Shape has a Python counterpart
         – Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)
       • Magellan Context
         – Overrides Spark Planner, allowing custom join implementations
       • Python wrappers
  32. Leveraging Data Sources
  33. Leveraging Catalyst: Binary Expression
  34. Leveraging Catalyst: Spatial Join + Predicate Pushdown
  35. Python Bindings
  36. Python Bindings (continued)
       • Adds custom Python data types
         – Point, PolyLine, Polygon wrappers around Scala data types
       • Wraps coordinate transformations and expressions
       • Custom picklers and unpicklers
         – Serialize to and from Scala
  37. Future Work
       • Geohash indices
       • Spatial join optimization
       • Map matching algorithms
       • Improving PySpark bindings
  38. Geohashing
  39. Base 32 encoding
  40. Geohashing
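
The geohashing slides are image-only in this transcript, so the following is a sketch of the standard public geohash encoding rather than code from the talk. It bisects the longitude and latitude ranges alternately, starting with longitude, and emits one character of the geohash base-32 alphabet for every five bits. The function name is hypothetical.

```python
# Geohash base-32 alphabet: digits plus lowercase letters without a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # bits collected for the current character
    bit_count = 0
    use_lon = True    # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Because each extra character only refines the same cell, a shorter geohash is always a prefix of a longer one for the same point: `geohash_encode(42.6, -5.6, 5)` gives `"ezs42"` and the 3-character hash is `"ezs"`. This prefix property is what makes geohashes usable as join and index keys.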
  41. Map Matching
       • Given a sequence of points (representing a trip?), what was the road path taken?
       • Challenges
         – Error in GPS measurements
         – Error in coordinate projections
         – Time gaps between measurements mean we cannot just snap each point to the nearest road
  42. Demo: Uber use case?