Where Search Meets Machine Learning
Diana Hu @sdianahu — Data Science Lead, Verizon
Joaquin Delgado @joaquind — Director of Engineering, Verizon
Disclaimer
The content of this presentation reflects the authors' personal statements and does not officially represent their employers' views in any way. Included content is especially not intended to convey the views of OnCue or Verizon.
Index
1.  Introduction
2.  Search and Information Retrieval
3.  ML problems as Search-based Systems
4.  ML Meets Search!
Introduction
Scaling learning systems is hard!
•  Millions of users, items
•  Billions of features
•  Imbalanced Datasets
•  Complex Distributed Systems
•  Many algorithms have not been tested at “Internet Scale”
Typical approaches
•  Distributed systems – Fault tolerance, throughput vs. latency
•  Parallelization Strategies – Hashing, trees
•  Processing – Map reduce variants, MPI, graph parallel
•  Databases – Key/Value Stores, NoSQL
Such a custom system requires TLC
Search and
Information Retrieval
Search
Search is about finding specific things that are either known or assumed to exist; Discovery is about helping the user encounter what he/she didn't even know exists.
•  Focused on Search: Search Engines, Database Systems
•  Focused on Discovery: Recommender Systems, Advertising
Predicate Logic and Declarative Languages Rock!
Search stack
(Diagram) Offline processing: documents → representation function → doc representation + index (*metadata engineering). Online processing: input query → representation function → query representation → similarity calculation against the index → retrieved documents, with optional (*) relevance feedback looping back into the query.
Relevance: Vector Space Model
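(The slide's diagram is omitted in this transcript. As a reminder of the underlying math: the vector space model scores a document d against a query q by the cosine similarity of their term vectors, typically TF-IDF weighted.)

\mathrm{sim}(q,d) = \frac{\vec{q}\cdot\vec{d}}{\lVert\vec{q}\rVert\,\lVert\vec{d}\rVert} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}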
Search Engines: the big hammer
•  Search engines are largely used to solve non-IR search problems, because they:
•  Are widely available
•  Are fast and scalable
•  Integrate well with existing data stores
But… Are we using the right tool?
•  Search engines were originally designed for IR.
•  Complex non-IR search tasks sometimes require a two-phase approach:
Phase 1) Filter; Phase 2) Rank (a sketch follows below)
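A minimal sketch of the two-phase pattern using SolrJ, assuming Solr 4.9+'s ReRank query parser is available; the field names and ranking function here are hypothetical, not from the talk:

import org.apache.solr.client.solrj.SolrQuery;

public class TwoPhaseQuery {
  public static SolrQuery build() {
    // Phase 1) Filter: cheap predicates narrow the candidate set.
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("category:shoes"); // hypothetical field

    // Phase 2) Rank: rescore only the top candidates with a costlier function.
    q.add("rq", "{!rerank reRankQuery=$rqq reRankDocs=100}");
    q.add("rqq", "{!func}product(popularity,0.5)"); // hypothetical ranking function
    return q;
  }
}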
Finding commonalities
(Diagram) Search (IR), Discovery (RecSys), and Advertising all overlap in one common core: relevance, aka ranking.
ML problems as
Search-based Systems
Machine Learning
Machine learning, in particular supervised learning, refers to techniques used to learn how to classify or score previously unseen objects based on a training dataset.
Inference and Generalization are the Key!
Supervised learning pipeline
Learning systems' stack
(Diagram) Online: Visualization / UI, Query Generation and Contextual Pre-filtering, Retrieval, Ranking, Contextual Post-filtering. Offline: Data/Events Collection, Data Analytics, Model Building, Index Building. Experimentation spans both.
Case study: Recommender Systems
•  Reduce information load by estimating relevance
•  Ranking (aka Relevance) Approaches:
•  Collaborative filtering
•  Content Based
•  Knowledge Based
•  Hybrid
•  Beyond rating prediction and ranking
•  Business filtering logic
•  Low latency and Scale
RecSys: Content based models
•  Rec Task: Given a user profile, find the best matching items by their attributes
•  Similarity calculation: based on keyword overlap between user/items
•  Neighborhood method (e.g. nearest neighbor)
•  Query-based retrieval (e.g. Rocchio's method)
•  Probabilistic methods (classical text classification)
•  Explicit decision models
•  Feature representation: based on content analysis
•  Vector space model
•  TF-IDF
•  Topic Modeling
RecSys: Collaborative Filtering
(Diagram) Offline processing: matrix factorization decomposes the rating dataset into user factors and item factors. Online processing: an input query is scored against the factors, a re-ranking model is applied, and recommendations are returned.
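Online scoring in this setup reduces to a dot product of the precomputed factors. A minimal sketch (the factor arrays are hypothetical lookups from the offline matrix factorization job):

public class FactorScorer {
  // Predicted rating/relevance for a (user, item) pair from latent factors.
  public static float score(float[] userFactors, float[] itemFactors) {
    float s = 0f;
    for (int k = 0; k < userFactors.length; k++) {
      s += userFactors[k] * itemFactors[k];
    }
    return s; // used by the re-ranking model online
  }
}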
ML Meets Search!
Remember the elephant?
(Diagram) The learning systems' stack from before: Visualization / UI, Query Generation and Contextual Pre-filtering, Retrieval, Ranking, Contextual Post-filtering (online); Data/Events Collection, Data Analytics, Model Building, Index Building (offline); Experimentation spanning both.
Simplifying the stack!
(Diagram) The same stack, but Retrieval, Contextual Post-filtering, and Ranking are pulled out as the pieces a search engine can take over; Visualization / UI, Query Generation and Contextual Pre-filtering, Model Building, Index Building, Data/Events Collection, Data Analytics, and Experimentation remain.
Search stack
(Diagram, repeated from earlier) Offline processing: documents → representation function → doc representation + index (*metadata engineering). Online processing: input query → representation function → query representation → similarity calculation against the index → retrieved documents, with optional (*) relevance feedback.
Simplifying the Search stack
(Diagram) The same search stack with the ML mapping overlaid: retrieval, contextual post-filtering, and ranking happen online, where an ML-Scoring plugin backed by a serialized ML model takes the place of the plain similarity calculation.
ML-Scoring architecture
(Diagram) Offline processing: a trainer + indexer consumes instances + labels and produces both a Lucene/Solr index of the instances and a serialized ML model. Online processing: the ML Scoring plugin inside Lucene/Solr loads the serialized model and rescores matches at query time.
ML-Scoring Options
•  Option A: Solr FunctionQuery
•  Pro: Model is just a query!
•  Cons: Limits expressiveness of models
•  Option B: Solr custom FunctionQuery
•  Pro: Can load any type of model (also PMML)
•  Cons: Memory limitations; also multiple model reloading
•  Option C: Lucene CustomScoreQuery
•  Pro: Can use PMML and tune how PMML gets loaded
•  Cons: No control over matches
•  Option D: Lucene low-level custom Query
•  (*Mahout vectors from Lucene text only support training, so not an option)
Real-life Problem
•  Census database that contains documents with the following fields:
1. Age: continuous
2. Workclass: 8 values
3. Fnlwgt: continuous
4. Education: 16 values
5. Education-num: continuous
6. Marital-status: 7 values
7. Occupation: 14 values
8. Relationship: 6 values
9. Race: 5 values
10. Sex: Male, Female
11. Capital-gain: continuous
12. Capital-loss: continuous
13. Hours-per-week: continuous
14. Native-country: 41 values
15. >50K Income: Yes, No
•  Task is to predict whether a person makes more than $50K a year based on their attributes
1) Learn from the (training) data
Naïve Bayes | SVM | Logistic Regression | Decision Trees
Train with your favorite ML framework
Option A: Just a Solr FunctionQuery
The trainer + indexer serializes a linear model, Y_prediction = C + XB, directly as a query:
q="sum(C,
       product(age,w1),
       product(Workclass,w2),
       product(Fnlwgt,w3),
       product(Education,w4),
       ….)"
May result in a crazy Solr FunctionQuery
See more at https://wiki.apache.org/solr/FunctionQuery
q=dismax&bf="ord(education-num)^0.5 recip(rord(age),1,1000,1000)^0.3"
What about models like this?
Option B: Custom Solr FunctionQuery
1.  Subclass org.apache.solr.search.ValueSourceParser.
public class MyValueSourceParser extends ValueSourceParser {
  public void init(NamedList namedList) {
    …
  }

  public ValueSource parse(FunctionQParser fqp) throws ParseException {
    return new MyValueSource();
  }
}
2.  In solrconfig.xml, register your new ValueSourceParser directly under the <config> tag:
<valueSourceParser name="myfunc" class="com.custom.MyValueSourceParser" />
3.  Subclass org.apache.solr.search.ValueSource and instantiate it in ValueSourceParser.parse() (a hedged sketch follows).
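A sketch of what MyValueSource could look like, assuming the Solr/Lucene 5.x-era API; wrapping a single feature source with a learned weight is an illustrative assumption, not the talk's implementation:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

public class MyValueSource extends ValueSource {
  private final ValueSource feature; // e.g. a numeric field's values
  private final float weight;        // learned model weight

  public MyValueSource(ValueSource feature, float weight) {
    this.feature = feature;
    this.weight = weight;
  }

  @Override
  public FunctionValues getValues(Map context, LeafReaderContext readerContext)
      throws IOException {
    final FunctionValues vals = feature.getValues(context, readerContext);
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        return weight * vals.floatVal(doc); // apply the model per document
      }
    };
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof MyValueSource
        && ((MyValueSource) o).feature.equals(feature)
        && ((MyValueSource) o).weight == weight;
  }

  @Override
  public int hashCode() {
    return feature.hashCode() * 31 + Float.floatToIntBits(weight);
  }

  @Override
  public String description() {
    return "myfunc(" + feature.description() + ")";
  }
}

In practice, ValueSourceParser.parse() would pull the feature sources and weights out of the FunctionQParser arguments rather than hard-coding them.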
Option C: Lucene CustomScoreQuery
2C) Serialize the model with PMML
•  Can use the JPMML library to read the serialized model in Lucene
•  In Lucene you will need to implement an extension with JPMML-evaluator to take vectors as expected
3C) In Lucene:
•  Override CustomScoreQuery: load the PMML
•  Create a CustomScoreProvider: do the model's PMML data marshaling
•  Rescoring: PMML evaluation
Predictive Model Markup Language
•  Why use PMML?
•  Allows users to build a model in one system
•  Export the model and deploy it in a different environment for prediction
•  Fast iteration: from research to deployment to production
•  Model is an XML document with:
•  Header: description of the model and where it was generated
•  DataDictionary: defines the fields used by the model
•  Model: structure and parameters of the model
•  http://dmg.org/pmml/v4-2-1/GeneralStructure.html
Example: Train in Spark to PMML
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Cluster the data into three classes using KMeans
val numIterations = 20
val numClusters = 3
val kmeansModel = KMeans.train(data, numClusters, numIterations)

// Export clustering model to PMML
kmeansModel.toPMML("/path/to/kmeans.xml")
PMML XML File
Overriding scores with CustomScoreQuery
(Diagram) The wrapped Lucene query finds the next match and computes its normal score; CustomScoreQuery hands each matched doc to a CustomScoreProvider, which rescores it and returns the new score.
*Credit to Doug Turnbull's "Hacking Lucene for Custom Search Results"
Overriding scores with CustomScoreQuery
•  Matching remains
•  Scoring is overridden
(Same diagram as above. *Credit to Doug Turnbull's "Hacking Lucene for Custom Search Results")
Implementing CustomScoreQuery
1.  Given a normal Lucene Query, use a CustomScoreQuery to wrap it:
TermQuery q = new TermQuery(term);
MyCustomScoreQuery mcsq = new MyCustomScoreQuery(q);
// Make sure the query has all the fields needed by PMML!
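A sketch of the wrapper class itself, assuming the Lucene 5.x API; MyCustomScoreProvider is taken to be the provider whose customScore method appears on the following slides:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

public class MyCustomScoreQuery extends CustomScoreQuery {
  public MyCustomScoreQuery(Query subQuery) {
    super(subQuery); // matching is delegated to the wrapped query
  }

  @Override
  protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context)
      throws IOException {
    // Scoring is overridden per index segment.
    return new MyCustomScoreProvider(context);
  }
}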
Implementing CustomScoreQuery
2.  Initialize PMML:
PMML pmml = ...;
ModelEvaluatorFactory modelEvaluatorFactory =
    ModelEvaluatorFactory.newInstance();
ModelEvaluator<?> modelEvaluator =
    modelEvaluatorFactory.newModelManager(pmml);
Evaluator evaluator = (Evaluator) modelEvaluator;
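The pmml = ...; elision is kept from the slides. One way to fill it, assuming the JPMML-Model library is on the classpath, is to unmarshal the exported XML (e.g. the kmeans.xml written by the Spark example; the path is hypothetical):

import java.io.FileInputStream;
import java.io.InputStream;
import org.dmg.pmml.PMML;
import org.jpmml.model.PMMLUtil;

// Load the model that the offline trainer serialized.
static PMML loadPmml(String path) throws Exception {
  try (InputStream is = new FileInputStream(path)) {
    return PMMLUtil.unmarshal(is);
  }
}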
Implementing CustomScoreQuery
3.  Rescore each doc with the IndexReader and docID:
public float customScore(int doc, float subQueryScore, float valSrcScores[])
    throws IOException {
  // Lucene reader
  IndexReader r = context.reader();
  Terms tv = r.getTermVector(doc, _field);
  TermsEnum tenum = null;
  tenum = tv.iterator(tenum);

  // Convert the iterator order to the fields needed by the model
  TermsEnum tenumPMML = tenum2PMML(tenum, evaluator.getActiveFields());
Implementing CustomScoreQuery
3.  (cont.) Marshal the data into PMML:
  Map<FieldName, FieldValue> arguments =
      new LinkedHashMap<FieldName, FieldValue>();
  List<FieldName> activeFields = evaluator.getActiveFields();
  for (FieldName activeField : activeFields) {
    // The raw values arrive ordered to match the fields the model needs
    Object rawValue = tenumPMML.next();
    FieldValue activeValue = evaluator.prepare(activeField, rawValue);
    arguments.put(activeField, activeValue);
  }
Implementing CustomScoreQuery
3.  (cont.) Rescore and evaluate with PMML:
  Map<FieldName, ?> results = evaluator.evaluate(arguments);
  FieldName targetName = evaluator.getTargetField();
  Object targetValue = results.get(targetName);
  return (float) targetValue;
}
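One caveat, as a hedge: for classification models JPMML's evaluate() typically returns a wrapped computation result rather than a raw number, so the bare (float) cast above can fail at runtime. A safer ending, assuming JPMML-Evaluator's EvaluatorUtil is available:

  // Unwrap JPMML's computed value before converting it to a float score.
  Object decoded = org.jpmml.evaluator.EvaluatorUtil.decode(targetValue);
  return ((Number) decoded).floatValue();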
Potential issues
•  Performance
•  If search space is very large
•  If model complexity explodes (e.g. kernel expansion)
•  Operations
•  Code is running on key infrastructure
•  Versioning
•  Binary Compatibility
Option D: Low-level Lucene
•  CustomScoreQuery or a custom FunctionQuery can't control matches
•  If you want custom matches and scoring…
•  Implement:
•  Custom Query class
•  Custom Weight class
•  Custom Scorer class
•  http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/
Conclusion
•  Importance of the full picture – see learning systems through the lens of the whole elephant
•  Reducing the time from science to production is complicated
•  Scalability is hard!
•  Why not have ML use search at its core during online evaluation?
•  Solr and Lucene are a start for customizing your learning system
We are Hiring!
Contact me at
diana.hu@verizon.com
@sdianahu
Q&A