Geo Data Analytics
@dmarcous
● DBA (@IDF)
● Big Data Professional (@IDF)
● Data Wizard - Magic with Data (@Google - Waze)
● Pure professional
● Best practices
● Tools
● Tips & Tricks
● Free Advice!
Agenda
● Why?
● Common Language
● Problems at scale
● Solutions at scale
● Tips & Tricks for scientists (/Wizards)
● Art
● Keep an eye out for…
● Dog Pictures
Why Does Geo Data Matter?
● C/C++, GEOS: http://trac.osgeo.org/geos
● C#, NTS: http://code.google.com/p/nettopologysuite/
● Java, JTS:
○ http://tsusiatsoftware.net/jts/main.html
○ http://www.vividsolutions.com/jts/JTSHome.htm
● Python, shapely: https://github.com/Toblerity/Shapely
● Ruby, ffi-geos: https://github.com/dark-panda/ffi-geos
● Javascript, JSTS: http://github.com/bjornharrtell/jsts
Geometry Object Model
Geospatial Operations
Formats
● WKT / WKB - Well-Known Text / Well-Known Binary geometry encodings
○ POLYGON((34.807841777801514 32.164333053441936,34.81168270111084 32.164859820966136,34.81337785720825 32.1613540349589,34.80865716934204 32.16046394346568,34.807841777801514 32.164333053441936))
○ http://arthur-e.github.io/Wicket/sandbox-gmaps3.html
● GeoJSON
○ { "type": "FeatureCollection", "features": [{ "type": "Feature", "properties": { "Name": "Verint", "Guest":
"dmarcous", "Accommodations": "Beer; Pizza" }, "geometry": { "type": "Polygon", "coordinates": [ [
[ 34.807841777801514, 32.164333053441936 ], [ 34.81168270111084,
32.164859820966136 ], [ 34.81337785720825, 32.1613540349589 ], [
34.80865716934204, 32.16046394346568 ], [ 34.807841777801514,
32.164333053441936 ]]]}}]}
○ http://geojson.io/#map=17/32.16267/34.81061
● Shape Files - ESRI vector format
● GML - Geography Markup Language, an XML grammar for expressing geographical features
● Raster - Grid of cells (pixels) georeferenced to coordinates
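The WKT and GeoJSON samples on this slide describe the same ring in two notations. As a rough sketch of how one maps onto the other, the helper below parses a single-ring WKT polygon and renders it as a GeoJSON geometry. The object and method names are made up for illustration, and it assumes a polygon without holes; a real project would lean on a library such as JTS.

```scala
object WktSketch {
  // Parse a single-ring WKT polygon such as "POLYGON((x1 y1,x2 y2,...))"
  // into (longitude, latitude) pairs.
  // Assumption: exactly one ring, no holes, well-formed numbers.
  def parsePolygon(wkt: String): Vector[(Double, Double)] = {
    val body = wkt.trim.stripPrefix("POLYGON").trim
      .stripPrefix("((").stripSuffix("))")
    body.split(",").toVector.map { pair =>
      // Coordinates inside a pair are separated by any whitespace.
      val parts = pair.trim.split("\\s+")
      (parts(0).toDouble, parts(1).toDouble)
    }
  }

  // Render the ring as a GeoJSON Polygon geometry string.
  def toGeoJsonGeometry(ring: Seq[(Double, Double)]): String = {
    val coords = ring.map { case (x, y) => s"[$x, $y]" }.mkString(", ")
    s"""{ "type": "Polygon", "coordinates": [[$coords]] }"""
  }
}
```

Calling `WktSketch.toGeoJsonGeometry(WktSketch.parsePolygon(wkt))` on the polygon above should give the same coordinate list as the GeoJSON sample, minus the feature properties.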
Databases
● RDBMS
○ Postgres (PostGIS)
○ MS-SQL / DB2 / Oracle
● NoSQL
○ MongoDB
○ IBM Cloudant
○ Lucene spatial module (elastic/ solr)
● Pure Geospatial Database
○ CartoDB (OS / Hosted)
○ GeoMesa (Accumulo)
■ GeoTrellis - Scala framework for processing raster data
GIS Systems
List of most popular ones -
http://en.wikipedia.org/wiki/List_of_geographic_information_systems_software
QGIS / TileMill / GRASS
Problem?
● Non scalar data types
○ Aggregating
○ Sharding
○ Unordered
● Speed & Accuracy
○ The physical world is non-Euclidean
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard/
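To make the "non-Euclidean" point concrete, here is a minimal sketch of the haversine great-circle distance. It treats the Earth as a sphere with a mean radius, which is itself an approximation (the constant and the names are assumptions for illustration); the same one degree of longitude comes out at very different lengths depending on latitude.

```scala
object GreatCircle {
  // Mean Earth radius in kilometres; a spherical approximation.
  val EarthRadiusKm = 6371.0088

  // Haversine distance between two points given in decimal degrees.
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusKm * math.asin(math.sqrt(a))
  }
}
```

Comparing `haversineKm(0, 0, 0, 1)` with `haversineKm(60, 0, 60, 1)` shows roughly a factor of two between the equator and 60 degrees north, which is why plain Euclidean distance on raw latitude/longitude misleads.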
Solution
Data Structures
● R-Tree (PostGIS, actually R+Tree)
● Quad Tree (DB2)
● Hyperdimensional Hashing
● Space Filling Curves
○ Z Order Curve (MS-SQL)
○ Hilbert Curve
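A space filling curve turns a 2D grid position into a single sortable key. The sketch below shows the Z-order (Morton) variant by interleaving the bits of two grid coordinates; the `quantize` helper and all names are illustrative assumptions, not any database's actual implementation.

```scala
object ZOrder {
  // Map a value in [min, max] onto an integer grid with 2^bits cells.
  def quantize(value: Double, min: Double, max: Double, bits: Int): Int = {
    val cells = 1 << bits
    math.min(((value - min) / (max - min) * cells).toInt, cells - 1)
  }

  // Interleave the bits of x (even positions) and y (odd positions).
  def encode(x: Int, y: Int): Long = {
    require(x >= 0 && y >= 0, "grid coordinates must be non-negative")
    var code = 0L
    for (i <- 0 until 31) {
      code |= ((x >> i) & 1L) << (2 * i)
      code |= ((y >> i) & 1L) << (2 * i + 1)
    }
    code
  }
}
```

Cells that are close on the grid usually get close keys, so a range scan over the key covers a compact area; the jumps between quadrants are the weakness that the Hilbert curve reduces.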
The Curse of Dimensionality
Dimension Reduction
● GeoHash - The mainstream way
○ Linear (non-tangent) projection, up to 5x difference in cell area
○ Same Prefix - Close areas (sort of…)
○ http://geohash.org/
○ https://github.com/google/open-location-code/blob/master/docs/comparison.adoc
● S2 - The Google way
○ Quadratic projection, cells at the same level have similar area
○ Faces of a projected cube, each divided into levels by a quad-tree, with positions on a face ordered along a Hilbert curve
○ https://code.google.com/p/s2-geometry-library/
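The "same prefix, close areas" property of GeoHash falls out of how the hash is built: longitude and latitude ranges are halved alternately, and every five decisions become one base-32 character. A minimal encoder sketch follows (names and the default precision are assumptions; S2 is considerably more involved and is best taken from its library).

```scala
object GeoHashSketch {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def encode(lat: Double, lon: Double, precision: Int = 9): String = {
    var latLo = -90.0
    var latHi = 90.0
    var lonLo = -180.0
    var lonHi = 180.0
    val sb = new StringBuilder
    var refineLon = true // bits alternate: longitude first, then latitude
    var bits = 0
    var current = 0
    while (sb.length < precision) {
      if (refineLon) {
        val mid = (lonLo + lonHi) / 2
        if (lon >= mid) { current = (current << 1) | 1; lonLo = mid }
        else { current = current << 1; lonHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { current = (current << 1) | 1; latLo = mid }
        else { current = current << 1; latHi = mid }
      }
      refineLon = !refineLon
      bits += 1
      if (bits == 5) { // five bits make one base-32 character
        sb.append(Base32.charAt(current))
        bits = 0
        current = 0
      }
    }
    sb.toString
  }
}
```

A shorter hash of the same point is always a prefix of a longer one, which is what makes prefix scans in a key-value store work. The "sort of" caveat still applies: two points on opposite sides of a cell boundary can be neighbours and share no prefix.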
Near Real Time Answers
● MongoDB Geospatial Indexing
● elastic / solr spatial indexing
● GeoMesa
● Build your own - Store the bytes in a fast key-value store with reduced keys (HBase / Cassandra)
Big Processing - It’s a UDF World
● ESRI - Hive UDFs - https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation
● Pigeon - Pig UDFs - https://github.com/aseldawy/pigeon
● Spark -
○ SpatialSpark
○ GeoTrellis
Graph Representation
● Use Cases
○ Routing
○ Supply Chains
○ Users Networks
● Tools
○ GraphX (Spark!) / Giraph (MR)
○ Dato SGraph (formerly known as GraphLab)
○ Gephi (On small parts for exploration)
● Algorithms
○ Shortest Path - Dijkstra / A*
○ Communities - Triangle Counting
○ Importance - Centrality / PageRank
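For the routing use case, a road network can be held as an adjacency map with travel costs on the edges. The following is a small in-memory sketch of Dijkstra's algorithm under that assumption (node names, integer costs in minutes and the `Graph` alias are illustrative); at scale the same idea would run on GraphX or Giraph rather than a local map.

```scala
import scala.collection.mutable

object Routing {
  // Adjacency list: node -> outgoing edges as (neighbour, cost in minutes).
  type Graph = Map[String, List[(String, Int)]]

  // Dijkstra: cheapest cost from `source` to every reachable node.
  def shortestPaths(graph: Graph, source: String): Map[String, Int] = {
    val dist = mutable.Map(source -> 0)
    // Negate the cost so the max-heap hands back the cheapest entry first.
    val queue = mutable.PriorityQueue((0, source))(
      Ordering.by[(Int, String), Int](p => -p._1))
    while (queue.nonEmpty) {
      val (d, node) = queue.dequeue()
      // Skip stale queue entries that were superseded by a cheaper path.
      if (d <= dist.getOrElse(node, Int.MaxValue)) {
        for ((next, cost) <- graph.getOrElse(node, Nil)) {
          val candidate = d + cost
          if (candidate < dist.getOrElse(next, Int.MaxValue)) {
            dist(next) = candidate
            queue.enqueue((candidate, next))
          }
        }
      }
    }
    dist.toMap
  }
}
```

Unreachable nodes simply do not appear in the result. A* would add a heuristic (for example, great-circle distance to the target) to the queue priority.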
Tips & Tricks
Approximation
Timezones
● tz_world
○ http://efele.net/maps/tz/world/
○ What do we do with shapefiles?
● APIs
○ Geonames
○ http://www.earthtools.org/
○ Google Timezone API
● UDFs?
○ Hive - from_utc_timestamp(timestamp, string timezone)
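One possible answer to "what do we do with shapefiles?" is a point-in-polygon lookup: find the timezone polygon that contains the point, then convert the UTC timestamp with that zone id. The sketch below uses ray casting and `java.time`; the `zones` map is a hypothetical stand-in for polygons loaded from tz_world, and all names are assumptions for illustration.

```scala
import java.time.{Instant, ZoneId, ZonedDateTime}

object TimezoneSketch {
  type Ring = Seq[(Double, Double)] // (longitude, latitude) vertices

  // Ray casting: a point is inside if a horizontal ray from it
  // crosses the polygon boundary an odd number of times.
  def contains(ring: Ring, lon: Double, lat: Double): Boolean = {
    val edges = ring.zip(ring.tail :+ ring.head)
    val crossings = edges.count { case ((x1, y1), (x2, y2)) =>
      (y1 > lat) != (y2 > lat) &&
        lon < (x2 - x1) * (lat - y1) / (y2 - y1) + x1
    }
    crossings % 2 == 1
  }

  // First zone whose polygon contains the point, if any.
  def zoneFor(zones: Map[String, Ring], lon: Double, lat: Double): Option[String] =
    zones.collectFirst { case (name, ring) if contains(ring, lon, lat) => name }

  // Express a UTC instant in the given zone, similar in spirit to
  // Hive's from_utc_timestamp.
  def toLocal(epochMillis: Long, zoneId: String): ZonedDateTime =
    Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId))
}
```

A linear scan over all polygons is fine for a sketch; with many points, indexing the polygons first (for example by covering cells) avoids testing every zone for every point.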
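Good Old Word Count
// Word Count (`spark` is the SparkContext)
val textFile = spark.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// Modified Word Count: count points per S2 cell
// (assumes columns 1 and 2 of each CSV line hold longitude and latitude)
val pointsFile = spark.textFile("hdfs://...")
val cellCounts = pointsFile.map(line => line.split(","))
  .map(point => (coord2S2Cell(point(1).toDouble, point(2).toDouble), 1))
  .reduceByKey(_ + _)
cellCounts.saveAsTextFile("hdfs://...")
// Take that from a library! (assuming the S2 Java port, com.google.common.geometry)
import com.google.common.geometry.{S2CellId, S2LatLng}
// S2 cell ids are 64-bit, so the key is a Long
def coord2S2Cell(longitude: Double, latitude: Double, lvl: Int = 14): Long =
  S2CellId.fromLatLng(S2LatLng.fromDegrees(latitude, longitude)).parent(lvl).id()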
Advanced - Precision is of the Essence
● Density Based Clustering
○ DBSCAN
■ Minimum cluster size (> Noise)
■ Epsilon (Spatial Radius)
○ R - MASS - kde2d
■ RGoogleMaps for the map
■ http://www.everydayanalytics.ca/2014/04/heatmap-of-toronto-traffic-signals.html
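To show how the two DBSCAN parameters interact, here is a compact, quadratic-time sketch of the algorithm. It uses plain Euclidean distance on already-projected coordinates, which is an assumption; for raw latitude/longitude a great-circle distance would replace `dist`. Names are illustrative, and for real work a library implementation such as the one in R's fpc is the safer choice.

```scala
import scala.collection.mutable

object DbscanSketch {
  type Point = (Double, Double)

  private def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  // Returns one label per point: -1 for noise, otherwise a cluster id from 0.
  // minPts counts the point itself, as in the usual formulation.
  def cluster(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val unvisited = -2
    val noise = -1
    val labels = Array.fill(points.length)(unvisited)
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    var clusterId = -1
    for (i <- points.indices) {
      if (labels(i) == unvisited) {
        val seeds = neighbours(i)
        if (seeds.size < minPts) labels(i) = noise
        else {
          clusterId += 1
          labels(i) = clusterId
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (labels(j) == noise) labels(j) = clusterId // border point
            if (labels(j) == unvisited) {
              labels(j) = clusterId
              val next = neighbours(j)
              if (next.size >= minPts) queue ++= next // j is a core point
            }
          }
        }
      }
    }
    labels
  }
}
```

Raising `minPts` pushes sparse groups into noise, while raising `eps` merges nearby clusters, so the two are usually tuned together.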
rJava
● Wrap geospatial functions of your choice
● call them from R
● Use apply on an entire Dataframe!
● Use as features!
● Visualize??? (in 5 minutes)
R Packs for Geospatial Analysis
● geonames
○ Timezone
○ Weather
○ Nearby places
● RGoogleMaps
○ download+paint Maps
○ getGeoCode
● sp / maps / maptools
○ OGC object abstractions
○ Manipulate / display geo data
● rgdal - spTransform
○ Convert formats / coordinates systems
● geosphere - distances / circles / centroids
● fpc - DBSCAN
● Coverage -
○ http://cran.r-project.org/web/views/Spatial.html
Engineered Geo features
● LOCAL
○ time
○ is_early / is_late
○ day of week
○ is_workday / is_weekend
○ is_day_light (sunrise / sunset, via tz_world)
● Weather
○ Temperature
○ is_rain / is_fog / is_hail / is_snow
● Squared (s2cell/ geohash) statistics
○ Probability of users in square to predict X
● Address - is_residence / is_business
● News - GDELT
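The LOCAL features above can be derived from a UTC timestamp once the point's timezone is known. A minimal sketch with `java.time` follows; the hour thresholds for "early" and "late", the default weekend days and all names are assumptions for illustration and would need tuning per region.

```scala
import java.time.{DayOfWeek, Instant, ZoneId}

object GeoFeatures {
  final case class TimeFeatures(
    hour: Int,
    dayOfWeek: DayOfWeek,
    isWeekend: Boolean,
    isEarly: Boolean,
    isLate: Boolean
  )

  // Weekend days differ by country, so they are a parameter.
  def timeFeatures(
    epochMillis: Long,
    zoneId: String,
    weekend: Set[DayOfWeek] = Set(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY)
  ): TimeFeatures = {
    val local = Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId))
    val hour = local.getHour
    val day = local.getDayOfWeek
    // Hypothetical thresholds: before 07:00 is early, from 22:00 is late.
    TimeFeatures(hour, day, weekend.contains(day), isEarly = hour < 7, isLate = hour >= 22)
  }
}
```

Each field maps directly onto a model column, and the same function can be wrapped as a UDF or applied over a whole dataset.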
WOW!
Data Art
Google Sheets
Frontend = Javascript?
● Google Maps API
○ https://developers.google.com/maps/documentation/javascript/examples/layer-heatmap
● Leaflet
R for Visualisation
● ggplot2 + geospatial packs
○ http://uce.uniovi.es/mundor/howtoplotashapemap.html
○ http://stackoverflow.com/questions/9558040/ggplot-map-with-l
○ http://spatial.ly/2012/02/great-maps-ggplot2/
● RGoogleMaps
○ http://rforwork.info/tag/rgooglemaps/
R For Interactive
● Shiny
○ Leaflet
■ http://rstudio.github.io/leaflet/
■ http://shiny.rstudio.com/gallery/superzip-example.html
■ http://shiny.rstudio.com/gallery/bus-dashboard.html
○ Globe
■ https://github.com/trestletech/shinyGlobe
R Animation
● http://rmaps.github.io/blog/posts/animated-choropleths/
@aaronkoblin
Keep an Eye Out!
https://locationtech.org/list-of-projects
Contact
● Daniel Marcous
● dmarcous@gmail.com
