Query-time Nonparametric
Regression with Temporally
Bounded Models
Gus Heck & David Smiley
#Activate18 #ActivateSearch
Agenda
• Presenters
• Query-time Nonparametric Regression
• Demo – Suggesting tagged images
• Time Routed Aliases
• Demo – Creating a Time Routed Alias
Presenters
Patrick (Gus) Heck
• Solr Contributor since 2013
• Consulting since 2012
• Enterprise Search since 2010
• Apache Ant contributor 2003-2004
• Web Applications since 2003
David Smiley
• Lucene/Solr Committer (PMC)
• Consulting
• Author of first book on Solr
• Presentations & Training
In the beginning...
Dave Mackey - Vision for an online coaching
platform leveraging machine learning
technology to help companies help their
middle managers.
Engaged me first as Chief Architect and later
as CTO to bring this vision to reality.
It worked, but was not funded...
Brief Overview of Simply Coached
Simply Coached provided an online career coaching
system
The key goal was connecting users with relevant
curated content
Relevance was determined by
• Identifying topics the user is interested in by
observing and learning from click-throughs
• Several other customized metrics
Learning What Interests the User
Goals:
• Suggest things of interest, things the user is willing
to learn
• Make predictions based on the user’s behavior
• Unique prediction per user
• Predictions based on entire user base
• Avoid calcification of the predictive model: old data
needs to be sunset regularly.
Modeling Candidate - Neural Nets
Simply Coached started right at the dawn of
the “3rd wave” of neural networks
Limitations of Neural Networks
• They don’t adapt well to a changing
problem
• We would need to retrain the predictive
network regularly
• Require massive data for training
Modeling Candidate - Regression
Not as “cool” as deep learning, but more suitable
• Can converge faster with less data
• But most techniques assume that the
residuals are normally distributed
• Many of our features will be binary
• Normality is right out the window
I decided to look for a Non-Parametric Regression
technique
Non-Parametric Multiplicative Regression
• Can predict a continuous variable
• Good for sorting/ranking
• Kernel smoother, presence/absence
kernels already well known
• Non-parametric
Nice discrete sub-calculations...
Used in Habitat Ecology for predicting habitat suitability
https://ir.library.oregonstate.edu/concern/defaults/z029p910z
NPMR Equations
Performance across multiple factors (features)
Local Mean estimator
Example kernels continuous or
presence/absence:
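For reference, the local mean form of NPMR is usually written as below. The notation is an assumption (it follows the commonly cited formulation, not necessarily the symbols on the slide): n observations indexed by i, m features indexed by j, observed response y_i, target point v, and per-feature tolerance s_j. The presence/absence kernel shown is one common choice for binary features.

```latex
% Local mean estimator at target point v
\[
\hat{y}_v = \frac{\sum_{i=1}^{n} y_i \prod_{j=1}^{m} w_{ij}}
                 {\sum_{i=1}^{n} \prod_{j=1}^{m} w_{ij}}
\]

% Continuous (Gaussian) kernel for feature j
\[
w_{ij} = \exp\!\left[-\frac{1}{2}\left(\frac{x_{ij} - v_j}{s_j}\right)^{2}\right]
\]

% Presence/absence kernel for a binary feature j (one common choice)
\[
w_{ij} =
\begin{cases}
1 & \text{if } x_{ij} = v_j \\
0 & \text{otherwise}
\end{cases}
\]
```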
Key realization: the products and sums for i,j can
be precalculated:
I called these pre-calculated portions “Partials” and
added temporal metadata
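A minimal sketch of the "Partials" idea, assuming each partial holds the numerator and denominator sums of the local mean estimator for one target document and one time bucket. The names `Partial`, `build_partials`, and `predict`, the 60-second bucket, and the soft mismatch weight of 0.25 are all illustrative assumptions, not the original implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Partial:
    num: float = 0.0  # sum over i of y_i * prod_j w_ij
    den: float = 0.0  # sum over i of prod_j w_ij


def kernel(x, v, mismatch=0.25):
    """Presence/absence kernel; the soft mismatch weight is an assumed value."""
    return 1.0 if x == v else mismatch


def weight(obs_features, target_features):
    """Product of per-feature kernel weights (the multiplicative part of NPMR)."""
    w = 1.0
    for x, v in zip(obs_features, target_features):
        w *= kernel(x, v)
    return w


def build_partials(observations, targets, bucket_seconds=60):
    """observations: (timestamp_seconds, features, reaction_time) tuples.
    targets: {doc_id: features}. Returns {(doc_id, bucket): Partial}."""
    partials = defaultdict(Partial)
    for ts, features, y in observations:
        bucket = ts // bucket_seconds  # temporal metadata for the partial
        for doc_id, target_features in targets.items():
            w = weight(features, target_features)
            p = partials[(doc_id, bucket)]
            p.num += y * w
            p.den += w
    return partials


def predict(partials, doc_id, first_bucket, last_bucket):
    """Combine only the partials inside the time bounds, then divide."""
    num = den = 0.0
    for (d, b), p in partials.items():
        if d == doc_id and first_bucket <= b <= last_bucket:
            num += p.num
            den += p.den
    return num / den if den else None
```

Because the time bounds are applied when the partials are combined, narrowing or widening the window changes the model without recomputing anything per observation.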
But wait... What are we predicting?
Needed a continuous statistic to predict
Thesis: Given a good teaser etc., the user
will click more rapidly on more interesting
articles.
Suggest documents in order of predicted
reaction time
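As a small illustration of the ranking step, assuming predictions arrive as a mapping from document id to predicted reaction time in seconds (the function name and data shape are hypothetical): a shorter predicted reaction time is read as higher interest, so documents are sorted ascending.

```python
def rank_by_predicted_reaction_time(predictions):
    """Return doc ids ordered from fastest to slowest predicted click.

    Documents with no prediction (None) are left out of the ranking.
    """
    scored = [(doc, t) for doc, t in predictions.items() if t is not None]
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1])]
```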
Demo
Three users:
• Sporty Black likes black sports cars
• Cooper Blue likes blue coupes
• Racy Yeller likes yellow formula cars
System was trained by presenting ~20 sets of 9
randomly selected cars and clicking with
varying speed
Images copyright Andrew Mauldin, used with permission
Streaming Expression!
As mentioned, NPMR prediction was only part of the overall product.
It occupies the top line of Boxes A1 and B1.
The actual demo expression
Streaming Expressions
The final sorted list of documents can be calculated
The bold portion is the final summation
Indexing With JesterJ
Continuously running, recalculating sums/products once a minute
Included traditional document indexing (article scanner) and also streaming expression update() (send_per_user_sum_update)
By the end, it was running continuously for months at a time unattended.
See http://www.jesterj.org for more info about JesterJ
Adapting and Scaling
• Adapt at query time with filters, could also
filter in other ways with additional metadata
• Four Dimensions:
1. Content to recommend - constant set
2. Users - per user pricing to the rescue
3. Activity - price for average activity level
4. Time - data accumulates over time
Time Routed Aliases are a solution to #4
Time Routed Aliases
David Smiley
Have a Lot of Timestamped Data?
And you need keyword search and/or analytics
(faceting/aggregations) capabilities.
Examples:
• Logs
• Sensor data (IoT)
• Social media posts
Characteristics: tons of docs, continuously flowing in, limited
retention
Strategies
Hash Partitioned:
• One collection, hash routed shards (built-in)
Time Partitioned:
• One collection, time routed shards (DIY)
• Time partitioned collections (DIY)
• Time partitioned collections via TRAs (built-in)
“Partitioned” = “Routed” = “Organized”
One Collection, Hash Routed Shards
Hash on ID, even distribution
router.name=compositeId
+ Easy (default)
+ High write throughput
- Deleted dead-weight
- Poor realtime search
- Queries execute everywhere
- Inflexible sizing
- … thus Expensive (uniform
hardware requirements)
[Figure: shard1 through shard9, each holding a mix of live docs and deleted docs]
See DocExpirationUpdateProcessorFactory
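A rough sketch of why hash routing spreads every time slice across all shards. It uses an MD5-based modulo as a stand-in hash; Solr's compositeId router actually uses MurmurHash3 over hash ranges, so the function below is an assumption for illustration only.

```python
import hashlib
from collections import Counter


def shard_for(doc_id, num_shards):
    """Pick a shard from the document id alone (stand-in hash, not Solr's)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shards_hit(doc_ids, num_shards):
    """Count how many of the given documents land on each shard."""
    return Counter(shard_for(d, num_shards) for d in doc_ids)


# Documents from a single day still scatter over the shards, because the
# timestamp plays no part in routing. Expired docs therefore linger as
# deleted dead-weight in every shard, and every query must visit them all.
one_day = [f"2017-09-01-event-{n}" for n in range(100)]
print(shards_hit(one_day, 9))
```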
Time Partitioned Data
Generally better...
• New data always goes to most recent partition
• Variable or equal sized (depends on approach)
• Addresses all negatives of hash routing
• Write throughput can be addressed by sharding within partition
• Opportunities to “optimize” aged indexes
• Flexibility in assigning partitions to better or cheaper hardware
• … all leads to cost savings
How?...
[Figure: monthly partitions 2017-01 through 2017-09]
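A minimal sketch of time partitioning with monthly partitions named like the ones in the figure. The function names are hypothetical; the point is that a time-bounded query only needs the partitions overlapping its range.

```python
from datetime import date


def partition_for(day):
    """Name of the monthly partition that holds this date, e.g. '2017-09'."""
    return f"{day.year:04d}-{day.month:02d}"


def partitions_for_range(start, end):
    """All monthly partitions a query over [start, end] has to touch."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names
```

New data always maps to the latest name, and whole partitions can be dropped once they age out instead of deleting documents one by one.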
Time Partitioned Data
Implementation Strategies:
• One Collection, router.name=implicit
• See this sample code: (DIY) https://github.com/cga-harvard/hhypermap-bop/tree/master/bop-core/solr-plugins/src/main/java/edu/harvard/gis/hhypermap/bop/solrplugins by me
• Multiple Collections
• See this blog: (DIY) http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/ by Mark Miller
• TRAs… (built-in)
[Figure: monthly partitions 2017-01 through 2017-09]
About Collection Aliases
SolrCloud supports collection aliases
Aliases point to one or more collections
Ex: Alias “tra-demo” → 2017-09, 2017-08, 2017-07, ...
[Figure: monthly collections 2017-01 through 2017-09 behind one alias]
Time Routed Aliases
Aliases have new tricks up their sleeves…
• Aliases now have metadata (API to read & edit)
• Can create a “time routed alias” w/ first collection
• A TRA can:
• Route update requests to the correct collection
• Add/delete collections automatically (data driven; just-in-time or preemptively)
Mutable configuration, mostly
New in Solr 7.3
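The routing idea can be sketched in a few lines. This is a simplified model, not Solr's code: the collection naming pattern `alias_YYYY-MM-DD` and the deletion rule (drop collections that start more than `max_age` before the newest one) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def collection_for(alias, start, interval, ts):
    """Name of the collection whose time slot contains the timestamp ts."""
    if ts < start:
        raise ValueError("timestamp precedes the alias start")
    slots = (ts - start) // interval  # whole intervals since start
    slot_start = start + slots * interval
    return f"{alias}_{slot_start:%Y-%m-%d}"


def collections_to_delete(alias, slot_starts, max_age):
    """Slots that begin more than max_age before the newest slot (simplified rule)."""
    cutoff = max(slot_starts) - max_age
    return [f"{alias}_{s:%Y-%m-%d}" for s in sorted(slot_starts) if s < cutoff]


start = datetime(2018, 1, 1, tzinfo=timezone.utc)
day = timedelta(days=1)
print(collection_for("tra-demo", start, day,
                     datetime(2018, 1, 3, 15, 30, tzinfo=timezone.utc)))
```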
TRA Creation (using V2 API)
curl http://localhost:8983/api/c \
  -H 'Content-type:application/json' -d '{
  "create-alias": {
    "name": "tra-demo",
    "router": {
      "name": "time",
      "field": "evt_dt",
      "start": "2018-01-01T00:00:00Z",
      "interval": "+1DAY",
      "autoDeleteAge": "-2DAY"
    },
    "create-collection": {
      "config": "_default",
      "numShards": 2,
      "maxShardsPerNode": 2
    }
  }
}'
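Once the alias exists, documents are sent to the alias name and routed by the value of the router field. A hedged sketch of building such an update request with Python's standard library follows; the host, the document id, and the helper name `build_update_request` are assumptions, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_update_request(alias, docs, base_url="http://localhost:8983/solr"):
    """Build a JSON update request addressed to the alias, not to a collection."""
    for doc in docs:
        if "evt_dt" not in doc:
            # The router field from the alias definition must be present.
            raise ValueError("each document needs the router field evt_dt")
    return urllib.request.Request(
        url=f"{base_url}/{alias}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_request(
    "tra-demo", [{"id": "evt-1", "evt_dt": "2018-01-02T12:00:00Z"}]
)
# To actually send it against a running Solr: urllib.request.urlopen(req)
```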
TRAs, the fine print
TODOs
• Size capped (e.g. 5M docs / collection)
• Query routing to subset of collections
• Better “auto-scaling” tie-ins
• “optimize” of older collections
• TRA deletion, ease of use
TRAs may not be for everyone
• Partitioning is strict and must be adjacent (no gaps)
• Doesn’t work with CDCR
Thank you!
Gus Heck & David Smiley
#Activate18 #ActivateSearch

Query-time Nonparametric Regression with Temporally Bounded Models - Patrick Heck, Needham Software & David Smiley, D W Smiley LLC
