H-Hypermap Heatmap Analytics at Scale

October 11-14, 2016 • Boston, MA

H-Hypermap: Heatmap Analytics at Scale
David Smiley
Freelance Search Developer/Consultant

About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice

Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
– Kotlin, Dropwizard, Swagger
– Kafka
– Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
– Existing functionality
• demo
– New functionality

H-Hypermap / BOP
• Harvard University, CGA: Center for Geographic Analysis (http://gis.harvard.edu)
• Harvard Hypermap Project
– Managed by Ben Lewis
• BOP “Billion Object Platform”
– Funded by the Sloan Foundation

BOP Requirements Summary
• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
– Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.

Logical High-Level Architecture
[Diagram components: Harvesting, Enrichment, Archival, and Realtime (“BOP”), with various clients]
• Data flows via Apache Kafka
• Systems expose HTTP web services

The BOP
[Diagram components: Kafka Topic, Ingester, ZooKeeper, Solr shards (W51, W52, W53, W54, ..., RT), Web-Service]
• Ingester (Kafka Streams)
– Creates Solr doc
– Routes to shard
• Web-Service (REST/JSON API)
– Keyword search
– Faceting
– Heatmaps
– CSV export
– ...

BOP Solr Sharding Architecture
[Diagram: a lone Realtime collection/shard holding 1-25 hrs of data; primary collections G_North_America and G_Elsewhere, each with time shards T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … covering 4-5 mo.; data moves from Realtime by "copy then delete, at night"]
• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping tasks:
– Move data from RT to primary
– Create new shards; expire old
– Merge/optimize shards
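
The shard names on this slide step back in 14-day increments (T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08). A minimal sketch of how a document's date could be routed to such a time shard follows. The anchor date, the 14-day period constant, and the function name `shard_for` are assumptions made for illustration; the project's actual routing code may differ.

```python
from datetime import date, timedelta

# Assumption: shards start every 14 days, counted from an anchor date
# taken from one of the shard names shown on the slide.
ANCHOR = date(2016, 4, 8)
PERIOD_DAYS = 14


def shard_for(day: date) -> str:
    """Return the hypothetical time-shard name covering the given day."""
    # Floor division also handles days before the anchor correctly.
    periods = (day - ANCHOR).days // PERIOD_DAYS
    start = ANCHOR + timedelta(days=periods * PERIOD_DAYS)
    return f"T{start:%Y_%m_%d}"


if __name__ == "__main__":
    print(shard_for(date(2016, 5, 25)))
```

A fixed anchor keeps shard boundaries stable, so the nightly move from the realtime shard can compute its target without consulting any state.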

Building a Search Web-Service
• Kotlin language (JVM based)
– Nullity as first-class language feature
• Dropwizard framework
– Designed for web-services
• Swagger
– Dynamically generated dev UI for web-services

Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
– For storage; time partitioning
• Lots of benefits yet serious limitations

Docker
• Easy to find/try/use software
– No installation
– Simplified configuration (env variables)
– Common logging
– Isolated
• Ideal for:
– Continuous integration servers
– Trying new software
– Production advantages
• But “new”

Docker in Production
• I use “Kontena”
• Common logging, machine/proc stats, security
– VPN to secure network; access everything as local
• No longer need to care about:
– Ansible, Chef, Puppet, etc.
– Security at network or proxy; not service specific
• Challenges: state & big-data

Enrichment
[Diagram components: Kafka Topic, Enrich, Kafka Topic; Twitter Sentiment Classifier; Geo: Solr with regional polygons & metadata]
• Geo: query Solr via spatial point query; attach related metadata to tweet

Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
– Gazetteer lookup (point-in-polygon)
Data Set                      Features   Raw size   Index time   Index size
Admin2                          46,311     824 MB      510 min       892 MB
US States                       74,002     747 MB      4.9 min       840 MB
Massachusetts Census Blocks    154,621     152 MB      5.9 min       507 MB

Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
– precisionModel="floating_single"
– autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
– fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
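
To make the query shape concrete, here is a small sketch that fills in `$lon` and `$lat` and assembles the request parameters. The field names `WKT` and `GEOID10` come from the slide; the function name is hypothetical, and encoding the parameters for HTTP is an assumption made for illustration, since the slide recommends embedding Solr in-process.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float) -> dict:
    """Build Solr params for a point-in-polygon lookup (hypothetical helper)."""
    return {
        # WKT points are written as "POINT(x y)", i.e. longitude first.
        "q": f"{{!field cache=false f=WKT}}Intersects(POINT({lon} {lat}))",
        # Return the ID from docValues rather than from a stored field.
        "fl": "block:field(GEOID10)",
    }


if __name__ == "__main__":
    params = point_in_polygon_params(-71.06, 42.36)
    print(params["q"])
    print(urlencode(params))  # URL-encoded form, if sent over HTTP
```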

Lucene “LatLonPoint”
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless

Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
– Filter points by circle, rect, polygon
– Distance sort; but no boosting
Coming soon! Solr 6.4?

Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting,
also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html

How-to: Heatmaps
• On an RPT field
– geo="false"
– worldBounds="ENVELOPE(-180, 180, 180, -180)"
– prefixTree="packedQuad"
• Query:
/select?facet=true
&facet.heatmap=geo_rpt
&facet.heatmap.geom=["-180 -90" TO "180 90"]
&facet.heatmap.format=ints2D or png
• Response:
// Normal Solr response...
"facet_counts":{
  ... // facet response fields
  "facet_heatmaps":{
    "geo_rpt":[
      "gridLevel",2,
      "columns",32,
      "rows",32,
      "minX",-180.0,
      "maxX",180.0,
      "minY",-90.0,
      "maxY",90.0,
      "counts_ints2D",
      [null, null, [0, 1, ... ]]]}}
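
A client has to turn this response into cells it can draw. The sketch below decodes the flat key/value list shown above into bounding boxes with counts. It assumes that rows in counts_ints2D run top-down (starting at maxY), that columns run left to right, and that a null row means all zeros; the function names are hypothetical.

```python
def heatmap_to_dict(flat):
    """Convert Solr's alternating [key, value, key, value, ...] list to a dict."""
    return dict(zip(flat[::2], flat[1::2]))


def heatmap_cells(hm):
    """Return (min_x, min_y, max_x, max_y, count) for every non-empty cell."""
    cell_w = (hm["maxX"] - hm["minX"]) / hm["columns"]
    cell_h = (hm["maxY"] - hm["minY"]) / hm["rows"]
    cells = []
    for r, row in enumerate(hm["counts_ints2D"] or []):
        if row is None:  # assumed: a null row stands for all zeros
            continue
        for c, count in enumerate(row):
            if count:
                min_x = hm["minX"] + c * cell_w
                max_y = hm["maxY"] - r * cell_h  # assumed: row 0 is the top
                cells.append((min_x, max_y - cell_h, min_x + cell_w, max_y, count))
    return cells


if __name__ == "__main__":
    # Tiny 2x2 example in the same shape as the response above.
    flat = ["gridLevel", 1, "columns", 2, "rows", 2,
            "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
            "counts_ints2D", [None, [0, 3]]]
    print(heatmap_cells(heatmap_to_dict(flat)))
```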

New HeatmapSpatialField
• Why?
– With new BKD/PointValues, no “RPT” field to use
– Scalable for heatmaps; don’t worry about search
• Scalable at all resolutions; many millions of docs/shard
– Can be specific about grid resolutions

Heatmaps with Stats
• Instead of counting docs; calculate a metric
– Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts
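
As a conceptual illustration only (this is not the Solr or JSON Facet API), the sketch below computes an average per grid cell instead of a document count. The function name, the grid parameters, and the sample data are assumptions; points are assumed to lie within the given bounds.

```python
from collections import defaultdict


def avg_per_cell(points, min_x, max_x, min_y, max_y, columns, rows):
    """Average a per-point value within each grid cell.

    points: iterable of (lon, lat, value), e.g. value = minuteOfDay.
    Returns {(row, col): average}, with row 0 at the top of the grid.
    """
    cell_w = (max_x - min_x) / columns
    cell_h = (max_y - min_y) / rows
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lon, lat, value in points:
        # Clamp so points on the far edge land in the last cell.
        col = min(int((lon - min_x) / cell_w), columns - 1)
        row = min(int((max_y - lat) / cell_h), rows - 1)
        sums[(row, col)] += value
        counts[(row, col)] += 1
    return {cell: sums[cell] / counts[cell] for cell in counts}


if __name__ == "__main__":
    pts = [(-71.0, 42.0, 60), (-72.0, 43.0, 120), (10.0, 50.0, 600)]
    print(avg_per_cell(pts, -180.0, 180.0, -90.0, 90.0, columns=4, rows=2))
```

Each point contributes a value lookup in addition to a cell increment, which is consistent with the slide's note that this is inherently slower than plain doc counts.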

Final Remarks
• Open-Source
– https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.
