3. About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice
4. Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
– Kotlin, Dropwizard, Swagger
– Kafka
– Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
– Existing functionality
• demo
– New functionality
5. H-Hypermap / BOP
• Harvard University, CGA: Center for Geographic Analysis
http://gis.harvard.edu
• Harvard Hypermap Project
– Managed by Ben Lewis
• BOP “Billion Object Platform”
– Funded by the Sloan Foundation
6. BOP Requirements Summary
• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
– Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-of-concept platform designed to lower the barrier for researchers who
need to access big streaming spatio-temporal datasets.
9. BOP Solr Sharding Architecture
[Diagram: a lone Realtime collection/shard holds the most recent 1-25 hrs of data. At night its data is copied to the primary collections, G_North_America and G_Elsewhere, and then deleted from Realtime. Each primary collection is split into time shards: T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … covering 4-5 mo.]
• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping tasks:
– Move data from RT to primary
– Create new shards; expire old
– Merge/optimize shards
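As a rough illustration of the time-shard naming above, here is a minimal sketch that maps a date to a shard name. It assumes shards start every 14 days (inferred from the spacing of the shard names on the slide) and are anchored at 2016-04-08; `ANCHOR`, `PERIOD_DAYS`, and `shard_name` are hypothetical names, not taken from the project.

```python
from datetime import date, timedelta

# Assumptions, inferred from the shard names on the slide; not project code.
ANCHOR = date(2016, 4, 8)   # a known shard start date
PERIOD_DAYS = 14            # spacing between consecutive shard names


def shard_name(day: date) -> str:
    """Return the name of the time shard whose period contains `day`."""
    # Floor division also handles dates before the anchor correctly.
    periods = (day - ANCHOR).days // PERIOD_DAYS
    start = ANCHOR + timedelta(days=periods * PERIOD_DAYS)
    return f"T{start:%Y_%m_%d}"


if __name__ == "__main__":
    print(shard_name(date(2016, 5, 25)))
```

A housekeeping job could use the same function to decide which shard to create next and which to expire.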
10. Building a Search Web-Service
• Kotlin language (JVM based)
– Nullity as first-class language feature
• Dropwizard framework
– Designed for web-services
• Swagger
– Dynamically generated dev UI for web-services
11. Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
– For storage; time partitioning
• Lots of benefits yet serious limitations
12. Docker
• Easy to find/try/use software
– No installation
– Simplified configuration (env variables)
– Common logging
– Isolated
• Ideal for:
– Continuous integration servers
– Trying new software
– Production advantages
• But “new”
13. Docker in Production
• I use “Kontena”
• Common logging, machine/proc stats, security
– VPN to secure network; access everything as local
• No longer need to care about:
– Ansible, Chef, Puppet, etc.
– Security at network or proxy; not service specific
• Challenges: state & big-data
14. Enrichment
[Diagram: Twitter → Kafka topic → Enrich → Kafka topic. The Enrich step calls the Sentiment Classifier and, for geo, a Solr index of regional polygons & metadata.]
Geo: Query Solr via spatial point query; attach related metadata to tweet
15. Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
– Gazetteer lookup (point-in-polygon)
Data Set | Features | Raw size | Index time | Index size
Admin2 | 46,311 | 824 MB | 510 min | 892 MB
US States | 74,002 | 747 MB | 4.9 min | 840 MB
Massachusetts Census Blocks | 154,621 | 152 MB | 5.9 min | 507 MB
16. Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
– precisionModel="floating_single"
– autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
– fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
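To make the query template above concrete, here is a small sketch that fills in `$lon` and `$lat` and URL-encodes the parameters. The function name `point_in_polygon_params` is hypothetical; the field names `WKT` and `GEOID10` come from the slide. It only builds the request parameters and does not talk to Solr.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float) -> dict:
    """Build Solr params for the point-in-polygon lookup shown on the slide."""
    return {
        # WKT points are written "POINT(lon lat)": x (longitude) comes first.
        "q": f"{{!field cache=false f=WKT}}Intersects(POINT({lon} {lat}))",
        # Return the docValues field under the alias "block".
        "fl": "block:field(GEOID10)",
    }


if __name__ == "__main__":
    params = point_in_polygon_params(-71.06, 42.36)
    print(params["q"])
    print(urlencode(params))
```

With embedded (in-process) Solr the same parameters would be passed directly rather than URL-encoded.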
17. Lucene “LatLonPoint”
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless
18. Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
– Filter points by circle, rect, polygon
– Distance sort; but no boosting
Coming soon! Solr 6.4?
19. Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting,
also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html
20. How-to: Heatmaps
• On an RPT field
geo="false"
worldBounds="ENVELOPE(-180, 180, 180, -180)"
prefixTree="packedQuad"
• Query:
/select?facet=true
&facet.heatmap=geo_rpt
&facet.heatmap.geom=["-180 -90" TO "180 90"]
&facet.heatmap.format=ints2D (or png)
// Normal Solr response...
"facet_counts":{
... // facet response fields
"facet_heatmaps":{
"geo_rpt":[
"gridLevel",2,
"columns",32,
"rows",32,
"minX",-180.0,
"maxX",180.0,
"minY",-90.0,
"maxY",90.0,
"counts_ints2D",
[null, null, [0, 1, ... ]]
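The heatmap arrives as a flat list of alternating keys and values, and a `null` row in `counts_ints2D` stands for a row of all zeros. A minimal sketch of turning that into a dense grid follows; `parse_heatmap` and the added `grid` key are hypothetical names, and the input is assumed to be the already-decoded JSON list under `facet_heatmaps`.

```python
def parse_heatmap(flat: list) -> dict:
    """Convert Solr's flat [key, value, key, value, ...] heatmap into a dict
    and add a dense "grid" in which null rows are expanded to zeros."""
    heatmap = dict(zip(flat[0::2], flat[1::2]))
    cols, rows = heatmap["columns"], heatmap["rows"]
    counts = heatmap.get("counts_ints2D")
    if counts is None:
        # Solr returns null for the whole grid when no cell has a count.
        counts = [None] * rows
    heatmap["grid"] = [row if row is not None else [0] * cols for row in counts]
    return heatmap


if __name__ == "__main__":
    # A tiny 3x3 stand-in for the response shown on the slide.
    example = ["gridLevel", 2, "columns", 3, "rows", 3,
               "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
               "counts_ints2D", [None, None, [0, 1, 2]]]
    for row in parse_heatmap(example)["grid"]:
        print(row)
```

The first row of the grid corresponds to the top (maxY) edge of the requested region.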
21. New HeatmapSpatialField
• Why?
– With new BKD/PointValues, no “RPT” field to use
– Scalable for heatmaps; don’t worry about search
• Scalable at all resolutions; many millions of docs/shard
– Can be specific about grid resolutions
Coming soon! Solr 6.4?
22. Heatmaps with Stats
• Instead of counting docs; calculate a metric
– Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts
Coming soon! Solr 6.4?
25. Final Remarks
• Open-Source
– https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.
Editor's Notes
Two halves. 1st: “experiences with” for this project. 2nd: Solr geospatial.
The 1st half is not about Solr, or less so, but there’s usually a Solr tie-in, and I think it will be useful to many attendees.
I’m a Solr expert, but not an expert on these things on the left; nonetheless I learned a lot that I can share.
This project is still very much in-progress.
* “H-Hypermap is really a collection of related projects”
* Mention my relationship
The Harvard Center for Geographic Analysis has established the HHypermap (Harvard Hypermap) system, comprising multiple open-source projects aimed at searching vast amounts of spatial data. This talk centers on a system based on SolrCloud that can do realtime search on a billion tweets, with heatmap analytics and sentiment analysis. The open-source system is designed to be suitable for social media data sets or sensor data.
Harvard CGA commissioned Apache Lucene/Solr's heatmap faceting capability in 2015 and this work now continues in 2016. The first new part is computing numeric stats per cell (not just doc counts), which can be used for a variety of applications. The second part is improving Lucene's grid cell indexing scheme to cater to heatmaps, thus allowing heatmap generation to be very fast for large data sets.
The BOP is the realtime search part of the GeoTweet platform.
BOP is time-bound to roughly 4 months of data, whereas Archival holds everything.
To make the move of data from RT to the primary appear atomic, we use a filter query on the primary and an inverted one on the realtime shard, both using “date math” that gives an hour of buffer for the movement to happen.
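One way such a pair of filter queries could look is sketched below. This is an assumption, not the project's actual configuration: the field name `created_at` and the cutoff expression `NOW-1HOUR/DAY` are hypothetical. With that cutoff, the boundary between the two filters stays at the previous midnight until 01:00, which leaves one hour after midnight to copy the previous day's documents before the primary starts exposing them.

```python
def handoff_filters(field: str = "created_at",
                    cutoff: str = "NOW-1HOUR/DAY") -> tuple:
    """Return (primary_fq, realtime_fq): complementary Solr range filters.

    The cutoff is Solr date math: subtract one hour from now, then round
    down to the start of that day. Both names are illustrative assumptions.
    """
    # Primary: everything strictly before the cutoff ("}" = exclusive bound).
    primary_fq = f"{field}:[* TO {cutoff}}}"
    # Realtime: the inverse, everything at or after the cutoff.
    realtime_fq = f"{field}:[{cutoff} TO *]"
    return primary_fq, realtime_fq


if __name__ == "__main__":
    for fq in handoff_filters():
        print(fq)
```

Because the two ranges are complementary, a document that exists in both places during the copy is visible through only one of them at any moment.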
TODO details…
1st. Why build a search web-service in front of Solr
The enrichment code is written in Kotlin and uses the new “Kafka Streams” API. Many instances of all of this are deployed to do work in parallel.
The Twitter Sentiment Classifier has a CLI REPL, and I’ve exposed that with a “tcp-server” utility. Send the tweet text, get a 1/0 (happy/sad).
Solr is accessible over HTTP.
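The per-tweet enrichment step can be pictured as a pure function with the two external services passed in. The sketch below is in Python for brevity (the project's code is Kotlin on Kafka Streams) and makes several assumptions: the tweet is a dict with hypothetical `text` and `coord` keys, and `geo_lookup` and `sentiment` are stand-ins for the Solr point-in-polygon query and the classifier's TCP endpoint.

```python
from typing import Callable, Optional


def enrich(tweet: dict,
           geo_lookup: Callable[[float, float], Optional[dict]],
           sentiment: Callable[[str], int]) -> dict:
    """Return a copy of `tweet` with geo metadata and a sentiment flag added."""
    enriched = dict(tweet)  # leave the input record untouched
    coord = tweet.get("coord")
    if coord is not None:
        # Only tweets that carry a lat/lon can be geo-enriched.
        lon, lat = coord
        enriched["geo"] = geo_lookup(lon, lat)
    # 1 = happy, 0 = sad, as returned by the classifier.
    enriched["sentiment"] = sentiment(tweet["text"])
    return enriched


if __name__ == "__main__":
    fake_geo = lambda lon, lat: {"country": "US", "state": "MA"}
    fake_sentiment = lambda text: 1
    print(enrich({"text": "hello", "coord": (-71.06, 42.36)},
                 fake_geo, fake_sentiment))
```

Keeping the lookups as parameters makes the step easy to test without Solr or the classifier running.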
I am aware of the massive jump in indexing time… not sure yet why it’s that terrible
I considered embedding the enrichment within Solr using the new “Topic Stream” (streaming expressions), but that’d be less flexible.
There is a formula to calculate the quad tree level from a max cell count. It currently lives in the web-service.
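The web-service's formula is not shown here, so the following is only one plausible way to derive a level, under stated assumptions: a quad tree halves the cell size at each level, the world is the 360 x 360 square from the `worldBounds` setting, and cell alignment with the query rectangle is ignored. `grid_level` and its parameters are hypothetical names.

```python
import math


def grid_level(width: float, height: float, max_cells: int,
               world_w: float = 360.0, world_h: float = 360.0,
               max_level: int = 26) -> int:
    """Return the deepest quad tree level whose grid over a width x height
    region has at most `max_cells` cells (never less than level 1)."""
    best = 1
    for level in range(1, max_level + 1):
        # At each level every cell is half as wide and half as tall.
        cols = math.ceil(width / (world_w / 2 ** level))
        rows = math.ceil(height / (world_h / 2 ** level))
        if cols * rows > max_cells:
            break
        best = level
    return best


if __name__ == "__main__":
    # Whole world (360 x 180 degrees) with a budget of 2048 cells.
    print(grid_level(360.0, 180.0, 2048))
```

A region that is not aligned to cell boundaries can touch one extra row and column, so a real implementation may need a slightly more conservative estimate.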