October 11-14, 2016 • Boston, MA
H-Hypermap: Heatmap Analytics at Scale
David Smiley
Freelance Search Developer/Consultant
About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice
Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
– Kotlin, Dropwizard, Swagger
– Kafka
– Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
– Existing functionality
• demo
– New functionality
H-Hypermap / BOP
• Harvard University, CGA: Center for Geographic Analysis
http://gis.harvard.edu
• Harvard Hypermap Project
– Managed by Ben Lewis
• BOP “Billion Object Platform”
– Funded by the Sloan Foundation
BOP Requirements Summary
• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
– Including heatmaps!
• On the cheap: ~6 mediocre boxes
Provide a proof-of-concept platform designed to lower the barrier for researchers who
need to access big streaming spatio-temporal datasets.
Logical High-Level Architecture
[Diagram components: Harvesting, Enrichment, Archival, Realtime, “BOP”, various clients]
• Data flows via Apache Kafka
• Systems expose HTTP web services
The BOP
[Diagram: Kafka Topic → Ingester → Solr shards (W51, W52, W53, W54, ..., RT) → Web-Service; ZooKeeper alongside the shards]
Ingester (Kafka Streams)
• Create Solr doc
• Routes to shard
Web-Service (REST/JSON API)
• Keyword search
• Faceting
• Heatmaps
• CSV export
...
BOP Solr Sharding Architecture
[Diagram: one Realtime collection feeding two primary collections of time shards]
• Realtime: lone realtime collection/shard, holding 1-25 hrs of data
• Copy then delete, at night
• Primary collections: G_North_America and G_Elsewhere
– Time shards: T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … (4-5 mo.)
• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping Tasks:
– Move data from RT to primary
– Create new shards; expire old
– Merge/optimize shards
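A minimal sketch of how a tweet timestamp might be mapped to a time shard name. The 14-day window and the 2016-04-08 alignment are assumptions inferred from the shard names on this slide; `time_shard_name` is a hypothetical helper, not code from the project.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: each time shard covers a 14-day window aligned to 2016-04-08,
# inferred from the names T2016_04_08, T2016_04_22, T2016_05_06, T2016_05_20.
ANCHOR = date(2016, 4, 8)
PERIOD_DAYS = 14


def time_shard_name(ts: datetime) -> str:
    """Return the name of the time shard whose window contains ts (in UTC)."""
    day = ts.astimezone(timezone.utc).date()
    # Floor division also gives the right window for dates before ANCHOR.
    periods = (day - ANCHOR).days // PERIOD_DAYS
    start = ANCHOR + timedelta(days=periods * PERIOD_DAYS)
    return f"T{start:%Y_%m_%d}"


if __name__ == "__main__":
    print(time_shard_name(datetime(2016, 5, 25, 12, 0, tzinfo=timezone.utc)))
```

A nightly housekeeping job could use the same function to decide which primary shard receives documents copied out of the realtime shard.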
Building a Search Web-Service
• Kotlin language (JVM based)
– Nullability as a first-class language feature
• Dropwizard framework
– Designed for web-services
• Swagger
– Dynamically generated dev UI for web-services
Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
– For storage; time partitioning
• Lots of benefits yet serious limitations
Docker
• Easy to find/try/use software
– No installation
– Simplified configuration (env variables)
– Common logging
– Isolated
• Ideal for:
– Continuous integration servers
– Trying new software
– Production advantages
• But “new”
Docker in Production
• I use “Kontena”
• Common logging, machine/proc stats, security
– VPN to secure network; access everything as local
• No longer need to care about:
– Ansible, Chef, Puppet, etc.
– Security at network or proxy; not service specific
• Challenges: state & big-data
Enrichment
[Diagram: Kafka Topic → Enrich → Kafka Topic; other components: Twitter, Sentiment Classifier, Geo]
• Geo: Solr with regional polygons & metadata
– Query Solr via spatial point query; attach related metadata to tweet
Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
– Gazetteer lookup (point-in-polygon)
Data Set                    | Features | Raw size | Index time | Index size
Admin2                      | 46,311   | 824 MB   | 510 min    | 892 MB
US States                   | 74,002   | 747 MB   | 4.9 min    | 840 MB
Massachusetts Census Blocks | 154,621  | 152 MB   | 5.9 min    | 507 MB
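To make the gazetteer lookup concrete, here is a small sketch of point-in-polygon enrichment using plain ray casting. This is a swapped-in illustration of the idea, not how Solr's spatial index answers the query; the region names and the `lon`/`lat`/`region` keys are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon given as a list of (x, y)?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def enrich(tweet, regions):
    """Return a copy of the tweet with the first matching region name attached."""
    for name, ring in regions.items():
        if point_in_polygon(tweet["lon"], tweet["lat"], ring):
            return {**tweet, "region": name}
    return dict(tweet)


# Hypothetical rectangular regions, for illustration only.
REGIONS = {
    "Region A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Region B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

if __name__ == "__main__":
    print(enrich({"id": 1, "lon": 5, "lat": 5}, REGIONS))
```

A linear scan like this does not scale to hundreds of thousands of polygons, which is why the lookup is delegated to a spatial index.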
Fast Point-in-Polygon Tricks
Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
– precisionModel="floating_single"
– autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …
Search
• Embed Solr (in-process)
• Use docValues, not stored
– fl=block:field(GEOID10)
Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))
Sub-Millisecond!
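The query above can be assembled as request parameters like this. It is only a sketch: the slide embeds Solr in-process, whereas this builds an HTTP-style query string, and `point_lookup_params` is a hypothetical helper. The field names WKT and GEOID10 are taken from the slide.

```python
from urllib.parse import urlencode


def point_lookup_params(lon: float, lat: float) -> dict:
    """Build Solr parameters for an uncached point-in-polygon lookup."""
    return {
        # {!field} query parser on the WKT field, with the filter cache bypassed
        "q": f"{{!field cache=false f=WKT}}Intersects(POINT({lon} {lat}))",
        # Return the docValues-backed GEOID10 field under the alias "block"
        "fl": "block:field(GEOID10)",
    }


params = point_lookup_params(-71.06, 42.36)
query_string = urlencode(params)

if __name__ == "__main__":
    print(query_string)
```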
Lucene “LatLonPoint”
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless
Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
– Filter points by circle, rect, polygon
– Distance sort; but no boosting
Coming soon! Solr 6.4?
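For readers new to the terms, filtering by circle and sorting by distance amount to the following brute-force sketch. It uses the haversine formula on a spherical Earth and scans every point; it says nothing about how the BKD index does this efficiently. The function names and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_circle_sorted(points, center, radius_km):
    """Keep the (lat, lon) points inside the circle, nearest first."""
    clat, clon = center
    hits = [(haversine_km(clat, clon, lat, lon), (lat, lon)) for lat, lon in points]
    return [p for d, p in sorted(hits) if d <= radius_km]


if __name__ == "__main__":
    pts = [(51.51, -0.13), (40.71, -74.01), (42.36, -71.06)]
    print(within_circle_sorted(pts, (42.37, -71.11), 400))
```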
Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting, also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html
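Conceptually, grid faceting counts how many points fall into each cell of a regular grid. The sketch below does that by brute force in memory; Lucene computes the same kind of summary from its spatial index rather than by scanning points, and `grid_counts` is a hypothetical name. Placing the first row at the top of the extent is a choice made for this example.

```python
def grid_counts(points, min_x, max_x, min_y, max_y, columns, rows):
    """Count (lon, lat) points per cell; row 0 is the top of the extent."""
    cell_w = (max_x - min_x) / columns
    cell_h = (max_y - min_y) / rows
    counts = [[0] * columns for _ in range(rows)]
    for lon, lat in points:
        if not (min_x <= lon <= max_x and min_y <= lat <= max_y):
            continue  # outside the requested extent
        # Clamp so points exactly on the max edge land in the last cell.
        col = min(int((lon - min_x) / cell_w), columns - 1)
        row = min(int((max_y - lat) / cell_h), rows - 1)
        counts[row][col] += 1
    return counts


if __name__ == "__main__":
    pts = [(-71.06, 42.36), (-71.10, 42.40), (2.35, 48.86)]
    print(grid_counts(pts, -180, 180, -90, 90, columns=4, rows=2))
```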
How-to: Heatmaps
• On an RPT field
– geo="false"
– worldBounds="ENVELOPE(-180, 180, 180, -180)"
– prefixTree="packedQuad"
• Query:
/select?facet=true
&facet.heatmap=geo_rpt
&facet.heatmap.geom=["-180 -90" TO "180 90"]
&facet.heatmap.format=ints2D (or png)
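A sketch of building these heatmap parameters programmatically. The `q=*:*` and `rows=0` entries are assumptions added here so that only the facet is returned; they are not on the slide, and `heatmap_params` is a hypothetical helper.

```python
from urllib.parse import urlencode


def heatmap_params(field, min_x, min_y, max_x, max_y, fmt="ints2D"):
    """Build Solr request parameters for a heatmap facet over a rectangle."""
    if fmt not in ("ints2D", "png"):
        raise ValueError("format must be 'ints2D' or 'png'")
    return {
        "q": "*:*",   # assumption: match all documents
        "rows": 0,    # assumption: skip the document list, keep only facets
        "facet": "true",
        "facet.heatmap": field,
        "facet.heatmap.geom": f'["{min_x} {min_y}" TO "{max_x} {max_y}"]',
        "facet.heatmap.format": fmt,
    }


world = heatmap_params("geo_rpt", -180, -90, 180, 90)

if __name__ == "__main__":
    print("/select?" + urlencode(world))
```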
// Normal Solr response...
"facet_counts":{
... // facet response fields
"facet_heatmaps":{
"geo_rpt":[
"gridLevel",2,
"columns",32,
"rows",32,
"minX",-180.0,
"maxX",180.0,
"minY",-90.0,
"maxY",90.0,
"counts_ints2D",
[null, null, [0, 1, ... ]]
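A client typically turns this response into cell rectangles before rendering. The sketch below assumes the first row of `counts_ints2D` is the top (maxY) of the grid and that a `null` row means all zeros; `heatmap_cells` is a hypothetical helper that accepts either the flat name/value list shown above or a plain dict.

```python
def heatmap_cells(heatmap):
    """Yield (count, min_x, min_y, max_x, max_y) for every non-empty cell."""
    if isinstance(heatmap, list):
        # Flat ["name", value, "name", value, ...] form, as in the response above
        heatmap = dict(zip(heatmap[0::2], heatmap[1::2]))
    cell_w = (heatmap["maxX"] - heatmap["minX"]) / heatmap["columns"]
    cell_h = (heatmap["maxY"] - heatmap["minY"]) / heatmap["rows"]
    for r, row in enumerate(heatmap["counts_ints2D"] or []):
        if row is None:  # a null row stands for a row of zeros
            continue
        top = heatmap["maxY"] - r * cell_h  # assumption: row 0 is the top row
        for c, count in enumerate(row):
            if count:
                left = heatmap["minX"] + c * cell_w
                yield count, left, top - cell_h, left + cell_w, top


SAMPLE = [
    "gridLevel", 1, "columns", 4, "rows", 2,
    "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
    "counts_ints2D", [None, [0, 3, 0, 1]],
]

if __name__ == "__main__":
    for cell in heatmap_cells(SAMPLE):
        print(cell)
```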
New HeatmapSpatialField
• Why?
– With new BKD/PointValues, no “RPT” field to use
– Scalable for heatmaps; don’t worry about search
• Scalable at all resolutions; many millions of docs/shard
– Can be specific about grid resolutions
Coming soon! Solr 6.4?
Heatmaps with Stats
• Instead of counting docs; calculate a metric
– Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts
Coming soon! Solr 6.4?
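To illustrate what a per-cell metric means, here is a client-side sketch that averages minuteOfDay per grid cell instead of counting documents. It is not the planned Solr feature; the document keys (`lon`, `lat`, `created_at`) and the cell size are assumptions made for the example.

```python
import math
from collections import defaultdict
from datetime import datetime


def minute_of_day(ts: datetime) -> int:
    """Minutes elapsed since midnight for the given timestamp."""
    return ts.hour * 60 + ts.minute


def heatmap_avg(docs, cell_size_deg, metric=minute_of_day):
    """Return {(cell_x, cell_y): average metric} over square cells in degrees."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [running sum, doc count]
    for doc in docs:
        cell = (math.floor(doc["lon"] / cell_size_deg),
                math.floor(doc["lat"] / cell_size_deg))
        totals[cell][0] += metric(doc["created_at"])
        totals[cell][1] += 1
    return {cell: total / n for cell, (total, n) in totals.items()}


DOCS = [
    {"lon": -71.06, "lat": 42.36, "created_at": datetime(2016, 10, 11, 10, 0)},
    {"lon": -71.10, "lat": 42.40, "created_at": datetime(2016, 10, 11, 11, 0)},
    {"lon": 2.35, "lat": 48.86, "created_at": datetime(2016, 10, 11, 0, 30)},
]

if __name__ == "__main__":
    print(heatmap_avg(DOCS, 10))
```

Each document contributes a value as well as a hit, so every cell carries a running sum and a count rather than a single integer.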
Final Remarks
• Open-Source
– https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.

How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

H-Hypermap Heatmap Analytics at Scale

  • 1. OCTOBER 11-14, 2016 • BOSTON, MA
  • 2. H-Hypermap: Heatmap Analytics at Scale David Smiley Freelance Search Developer/Consultant
  • 3. About: David Smiley • Software Engineer (16 years) • Search (7 years) • Java (full-stack), Web, Spatial • Freelance search consultant / developer • Apache Lucene / Solr committer & PMC • Wrote first book on Solr, updated twice
  • 4. Agenda • About this project • Architecture • Solr & time sharding • Experiences with: – Kotlin, Dropwizard, Swagger – Kafka – Docker, Kontena • Solr for geo-enrichment • Solr adapter for Lucene BKD Lat-Lon point search & sort • Heatmaps – Existing functionality • demo – New functionality
  • 5. H-Hypermap / BOP • Harvard University, CGA: Center for Geographic Analysis http://gis.harvard.edu • Harvard Hypermap Project – Managed by Ben Lewis • BOP “Billion Object Platform” – Funded by the Sloan Foundation
  • 6. BOP Requirements Summary • Most recent ~billion geo-tweets • Realtime search (<5 sec latency) • Sub-second queries – Including heatmaps! • On the cheap: ~6 mediocre boxes Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.
  • 7. Logical High-Level Architecture Archival Realtime Harvesting Enrichment various clients... various clients... Data flows via Apache Kafka Systems expose HTTP web services “BOP”
  • 8. Shard: W51 The BOP Kafka Topic Ingester ZooKeeper Shard: W52 Shard: W53 Shard: W54 Shard: RT ... Web-Service Kafka Streams • Create Solr doc • Routes to shard REST/JSON API • Keyword search • Faceting • Heatmaps • CSV export ...
  • 9. BOP Solr Sharding Architecture Realtime T2016_05_20 T2016_05_06 T2016_04_22 T2016_04_08 … 4-5 mo. T2016_05_20 T2016_05_06 T2016_04_22 T2016_04_08 … 4-5 mo. G_North_America G_Elsewhere Lone Realtime Collection/Shard. 1-25 hrs Copy then delete, at night • Realtime shard is where realtime search happens. No caches, but small. • Primary collections have useful caches • Housekeeping Tasks: • Move data from RT to primary • Create new shards; expire old • Merge/optimize shards
  • 10. Building a Search Web-Service • Kotlin language (JVM based) – Nullity as first-class language feature • DropWizard framework – Designed for web-services • Swagger – Dynamically generated dev UI for web-services
  • 11. Apache Kafka • Kafka: a scalable message/queue platform • See new Kafka Streams & Kafka Connect APIs • No back-pressure; can be a challenge • Non-obvious use: – For storage; time partitioning • Lots of benefits yet serious limitations
  • 12. Docker • Easy to find/try/use software – No installation – Simplified configuration (env variables) – Common logging – Isolated • Ideal for: – Continuous Int. servers – Trying new software – Production advantages • But “new”
  • 13. Docker in Production • I use “Kontena” • Common logging, machine/proc stats, security – VPN to secure network; access everything as local • No longer need to care about: – Ansible, Chef, Puppet, etc. – Security at network or proxy; not service specific • Challenges: state & big-data
  • 14. Enrichment Geo: Query Solr via spatial point query; attach related metadata to tweet Kafka Topic Enrich Kafka Topic Twitter Sentiment Classifier Geo: Solr with regional polygons & metadata
  • 15. Solr for Geo Enrichment • Tweets (docs) can have a geo lat/lon • Enrich tweet with Country, State/Province, … – Gazetteer lookup (point-in-polygon)
        Data Set                     Features  Raw size  Index time  Index size
        Admin2                         46,311    824 MB     510 min      892 MB
        US States                      74,002    747 MB     4.9 min      840 MB
        Massachusetts Census Blocks   154,621    152 MB     5.9 min      507 MB
  • 16. Fast Point-in-Polygon Tricks Index/Config • Optimize to 1 segment • RptWithGeometrySpatialField – precisionModel="floating_single" – autoIndex="true" • <cache name="perSegSpatialFieldCache_WKT" … Search • Embed Solr (in-process) • Use docValues, not stored – fl=block:field(GEOID10) Query like this: • q={!field cache=false f=WKT}Intersects(POINT($lon $lat)) Sub-Millisecond!
  • 17. Lucene “LatLonPoint” • Uses new PointValues (BKD index) in Lucene 6 • Fastest: http://home.apache.org/~mikemccand/geobench.html • Presently in Lucene sandbox module • Some limitations: WGS84 points only • Credit to Rob Muir and Mike McCandless
  • 18. Solr Adapter For LatLonPoint • New Solr FieldType for Lucene LatLonPoint – Filter points by circle, rect, polygon – Distance sort; but no boosting Coming soon! Solr 6.4?
  • 19. Heatmaps: Spatial Grid Faceting • Spatial density summary grid faceting, also useful for point-plotting search results • Lucene & Solr APIs • Scalable & fast usually… • Usually rendered with a gradient radius -> • See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html
  • 20. How-to: Heatmaps • On an RPT field geo="false" worldBounds="ENVELOPE(-180, 180, 180, -180)" prefixTree="packedQuad" • Query: /select?facet=true&facet.heatmap=geo_rpt&facet.heatmap.geom=["-180 -90" TO "180 90"]&facet.heatmap.format=ints2D or png // Normal Solr response... "facet_counts":{ ... // facet response fields "facet_heatmaps":{ "geo_rpt":[ "gridLevel",2, "columns",32, "rows",32, "minX",-180.0, "maxX",180.0, "minY",-90.0, "maxY",90.0, "counts_ints2D", [null, null, [0, 1, ... ]]
  • 21. New HeatmapSpatialField • Why? – With new BKD/PointValues, no “RPT” field to use – Scalable for heatmaps; don’t worry about search • Scalable at all resolutions; many millions of docs/shard – Can be specific about grid resolutions Coming soon! Solr 6.4?
  • 22. Heatmaps with Stats • Instead of counting docs; calculate a metric – Ex: avg(minuteOfDay) • Will require JSON Facet API • Inherently slower than just doc counts Coming soon! Solr 6.4?
  • 23.
  • 24.
  • 25. Final Remarks • Open-Source – https://github.com/dsmiley/hhypermap-bop • In-progress • Improvements to Solr expected to be available before December; officially in Solr 6.4.
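
The point-in-polygon lookup from slides 14-16 can be sketched roughly as follows. This is a minimal illustration in Python (the project itself is written in Kotlin), not the project's code. The `WKT` and `GEOID10` field names and the query syntax come from slide 16; the function names, the `geo_block` key, and the tweet dictionary layout are assumptions made for the example.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon, lat, shape_field="WKT", id_field="GEOID10"):
    """Build Solr request params for a slide-16 style point-in-polygon lookup."""
    return {
        # cache=false asks Solr not to cache this one-off point query.
        "q": f"{{!field cache=false f={shape_field}}}Intersects(POINT({lon} {lat}))",
        # Read the ID through a function query (docValues) rather than a stored field.
        "fl": f"block:field({id_field})",
        "rows": 1,
    }


def enrich(tweet, solr_docs):
    """Return a copy of the tweet with the first matching region attached.

    The "geo_block" key is a hypothetical name chosen for this sketch.
    """
    enriched = dict(tweet)
    if solr_docs:
        enriched["geo_block"] = solr_docs[0].get("block")
    return enriched


params = point_in_polygon_params(-71.06, 42.36)
print(urlencode(params))
```

The encoded parameters could be sent to a `/select` handler over HTTP or, as the slide suggests for lowest latency, passed to an embedded (in-process) Solr.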

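To make the heatmap request and response on slide 20 more concrete, here is a small Python sketch that builds the facet parameters and turns the `counts_ints2D` grid into per-cell bounding boxes. The parameter names and response keys are those shown on the slide. The function names are hypothetical, and the sketch assumes that rows are ordered top-down (the first row touching `maxY`) and that an all-zero row is returned as `null`.

```python
from urllib.parse import urlencode


def heatmap_params(field, min_x, min_y, max_x, max_y, fmt="ints2D"):
    """Build the facet.heatmap request parameters for a rectangular region."""
    return {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.heatmap": field,
        "facet.heatmap.geom": f'["{min_x} {min_y}" TO "{max_x} {max_y}"]',
        "facet.heatmap.format": fmt,
    }


def heatmap_cells(flat):
    """Convert Solr's flat [name, value, ...] heatmap list into non-empty cells.

    Each cell is (min_x, min_y, max_x, max_y, count).
    """
    hm = dict(zip(flat[0::2], flat[1::2]))
    counts = hm["counts_ints2D"]
    if counts is None:  # no documents matched anywhere in the grid
        return []
    cell_w = (hm["maxX"] - hm["minX"]) / hm["columns"]
    cell_h = (hm["maxY"] - hm["minY"]) / hm["rows"]
    cells = []
    for r, row in enumerate(counts):
        if row is None:  # a null row stands for a row of zeros
            continue
        top = hm["maxY"] - r * cell_h  # assumption: row 0 is the top row
        for c, count in enumerate(row):
            if count:
                left = hm["minX"] + c * cell_w
                cells.append((left, top - cell_h, left + cell_w, top, count))
    return cells


sample = ["gridLevel", 2, "columns", 4, "rows", 2,
          "minX", -180.0, "maxX", 180.0, "minY", -90.0, "maxY", 90.0,
          "counts_ints2D", [None, [0, 1, 0, 3]]]
cells = heatmap_cells(sample)
query_string = urlencode(heatmap_params("geo_rpt", -180, -90, 180, 90))
```

A client such as a map layer would typically draw one rectangle or gradient blob per returned cell, weighted by its count.
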
Editor's Notes

  1. Two halves: 1st: “experiences with” for this project…. And 2nd: Solr geospatial. 1st half is not about Solr, or less so... but there’s usually a Solr tie-in; and I think it will be useful to many attendees. I’m a Solr expert, but not an expert on these things on the left… nonetheless learned a lot I can share. This project is still very much in-progress.
  2. * “H-Hypermap is really a collection of related projects” * Mention my relationship The Harvard Center for Geographic Analysis has established the HHypermap (Harvard Hypermap) system, comprised of multiple open-source projects aimed at searching vast amounts of spatial data. This talk centers on a system based on SolrCloud that can do realtime search on a billion Twitter tweets with heatmap analytics of sentiment analysis. The open-source system is designed to be suitable for social media data sets or sensor data. Harvard CGA commissioned Apache Lucene/Solr's heatmap faceting capability in 2015 and this work now continues in 2016. The first new part is computing numeric stats per cell (not just doc counts), which can be used for a variety of applications. The second part is improving Lucene's grid cell indexing scheme to cater to heatmaps, thus allowing heatmap generation to be very fast for large data sets.
  3. The GeoTweet platform, and the BOP being the realtime search part of that. BOP is time-bound to be roughly 4 months, whereas Archival is everything.
  4. To make the move of data from RT to the primary appear atomic, we use a filter query on the primary and an inverted one on the realtime shard, using “date math” that gives an hour buffer for the movement to happen. TODO details…
  5. 1st. Why build a search web-service in front of Solr
  6. The enrichment code is written in Kotlin and uses the new “Kafka Streams” API. Many instances of all of this are deployed to do work in parallel. The Twitter Sentiment Classifier has a CLI REPL, and I’ve exposed that with a “tcp-server” utility. Send the tweet text, get a 1/0 (happy/sad). Solr is accessible over HTTP.
  7. I am aware of the massive jump in indexing time… not sure yet why it’s that terrible
  8. Considered embedding the enrichment within Solr using the new “Topic Stream” (streaming expressions), but that’d be less flexible
  9. There is a formula to calculate the quad tree level from a max cell count…. Currently in the web-service.
  10. Early known limitations… just point data
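
One plausible reading of note 4 (the details are left as a TODO there, so this is an assumption rather than the project's actual configuration) is a pair of complementary filter queries built on the Solr date-math expression `NOW-1HOUR/DAY`. The boundary only advances to today's midnight one hour after midnight, which leaves that hour for the nightly copy to finish before the primary starts exposing the copied data. The Python sketch below emulates the boundary to show that the realtime shard then exposes between 1 and 25 hours of data, matching slide 9. The `created_at` field name and the function names are hypothetical, and UTC is assumed.

```python
from datetime import datetime, timedelta, timezone

BUFFER = timedelta(hours=1)  # time allowed for the nightly copy to finish


def handoff_boundary(now):
    """Emulate Solr date math NOW-1HOUR/DAY: subtract the buffer, round down to midnight."""
    shifted = now - BUFFER
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


def handoff_filters(field="created_at"):
    """Return (primary_fq, realtime_fq); together they cover every instant exactly once."""
    boundary = "NOW-1HOUR/DAY"
    primary_fq = f"{field}:[* TO {boundary}}}"   # exclusive upper bound
    realtime_fq = f"{field}:[{boundary} TO *]"   # inclusive lower bound
    return primary_fq, realtime_fq


just_before = datetime(2016, 5, 20, 0, 59, tzinfo=timezone.utc)
just_after = datetime(2016, 5, 20, 1, 0, tzinfo=timezone.utc)
window_before = just_before - handoff_boundary(just_before)
window_after = just_after - handoff_boundary(just_after)
print(handoff_filters())
print(window_before, window_after)
```

Under this scheme the copy would run between midnight and 01:00, and the delete from the realtime shard would run any time after the boundary has flipped, since the realtime filter already hides the moved documents.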