This document discusses how machine learning problems can be framed as search tasks and how search and machine learning can be combined. It begins with an introduction to search engines and information retrieval, then discusses how machine learning problems like recommender systems can be viewed as search tasks involving ranking, retrieval, and relevance calculation. The document proposes simplifying the machine learning pipeline by integrating it with search systems and indexes, and provides examples of implementing machine learning scoring and models within search systems like Solr using standards like PMML. The goal is to leverage existing search infrastructure to scale machine learning models.
Where Search Meets ML: Using Search Systems for Machine Learning Problems
1. Where Search Meets Machine Learning
Diana Hu @sdianahu — Data Science Lead, Verizon
Joaquin Delgado @joaquind — Director of Engineering, Verizon
2. Disclaimer
The content of this presentation reflects the authors'
personal views and does not officially represent their
employer's views in any way. In particular, the included content
is not intended to convey the views of OnCue or Verizon
5. Scaling learning systems is hard!
• Millions of users, items
• Billions of features
• Imbalanced Datasets
• Complex Distributed Systems
• Many algorithms have not been tested at “Internet Scale”
6. Typical approaches
• Distributed systems – Fault tolerance, throughput vs. latency
• Parallelization strategies – Hashing, trees
• Processing – MapReduce variants, MPI, graph parallel
• Databases – Key/Value Stores, NoSQL
Such a custom system requires TLC
8. Search
Search is about finding specific things that are either known
or assumed to exist; Discovery is about helping the
user encounter what he/she didn't even know existed.
• Focused on Search: Search Engines, Database Systems
• Focused on Discovery: Recommender Systems, Advertising
Predicate Logic and Declarative Languages Rock!
11. Search Engines: the big hammer
• Search engines are largely used to solve non-IR
search problems, because they:
• Are widely available
• Are fast and scalable
• Integrate well with existing data stores
12. But… Are we using the right tool?
• Search Engines were originally designed for IR.
• Complex non-IR search tasks sometimes require a two-phase
approach, as sketched below:
Phase 1) Filter, Phase 2) Rank
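In Solr, for example, the two phases map directly onto filter queries and boost functions. A minimal sketch (the field names here are hypothetical, not from the talk):

q=laptop
&defType=edismax
&qf=title
&fq=inStock:true
&fq=price:[100 TO 2000]
&bf=recip(rord(popularity),1,1000,1000)

The fq clauses cheaply restrict the candidate set (Phase 1); the bf boost function then reorders the surviving matches (Phase 2).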
15. Machine Learning
Machine learning, in particular supervised learning, refers to
techniques used to learn how to classify or score previously
unseen objects based on a training dataset
Inference and Generalization are the Key!
17. Learning systems’ stack
Visualization / UI
Retrieval
Ranking
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Contextual Post Filtering
Online / Offline
Experimentation
18. Case study: Recommender Systems
• Reduce information load by estimating relevance
• Ranking (aka Relevance) Approaches:
• Collaborative filtering
• Content Based
• Knowledge Based
• Hybrid
• Beyond rating prediction and ranking
• Business filtering logic
• Low latency and Scale
19. RecSys: Content based models
• Rec Task: Given a user profile find the best matching items by their
attributes
• Similarity calculation: based on keyword overlap between user/items
(see the sketch after this list)
• Neighborhood methods (e.g., nearest neighbor)
• Query-based retrieval (e.g., Rocchio's method)
• Probabilistic methods (classical text classification)
• Explicit decision models
• Feature representation: based on content analysis
• Vector space model
• TF-IDF
• Topic Modeling
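As a concrete illustration of the keyword-overlap similarity above, a minimal sketch (not from the talk) of cosine similarity between a user profile and an item, each represented as a TF-IDF term-to-weight map:

import java.util.Map;

public final class TfIdfSimilarity {
  // Cosine similarity: dot(user, item) / (|user| * |item|)
  public static double cosine(Map<String, Double> user, Map<String, Double> item) {
    double dot = 0.0, normU = 0.0, normI = 0.0;
    for (Map.Entry<String, Double> e : user.entrySet()) {
      Double w = item.get(e.getKey());
      if (w != null) dot += e.getValue() * w; // only overlapping keywords contribute
      normU += e.getValue() * e.getValue();
    }
    for (double w : item.values()) normI += w * w;
    return dot == 0.0 ? 0.0 : dot / (Math.sqrt(normU) * Math.sqrt(normI));
  }
}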
23. Remember the elephant?
Visualization / UI
Retrieval
Ranking
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Contextual Post Filtering
Online / Offline
Experimentation
24. Simplifying the stack!
Visualization / UI
Query Generation and
Contextual Pre-filtering
Model Building
Index Building
Data/Events Collections
Data Analytics
Online / Offline
Experimentation
Retrieval
Contextual Post Filtering
Ranking
28. ML-Scoring Options
• Option A: Solr FunctionQuery
• Pro: Model is just a query!
• Cons: Limits expressiveness of models
• Option B: Solr Custom Function Query
• Pro: Loading any type of model (also PMML)
• Cons: Memory limitations, also multiple model reloading
• Option C: Lucene CustomScoreQuery
• Pro: Can use PMML and tune how PMML gets loaded
• Cons: No control over matching
• Option D: Lucene Low-Level Custom Query
• *Mahout vectors from Lucene text (only trains, so not an option)
29. Real-life Problem
• Census database that contains documents with the following
fields:
1. Age: continuous
2. Workclass: 8 values
3. Fnlwgt: continuous
4. Education: 16 values
5. Education-num: continuous
6. Marital-status: 7 values
7. Occupation: 14 values
8. Relationship: 6 values
9. Race: 5 values
10. Sex: Male, Female
11. Capital-gain: continuous
12. Capital-loss: continuous
13. Hours-per-week: continuous
14. Native-country: 41 values
15. >50K Income: Yes, No
• Task is to predict whether a person makes more than 50k a
year based on their attributes
30. 1) Learn from the (training) data
Train with your favorite ML framework:
Naïve Bayes, SVM, Logistic Regression, Decision Trees
31. Option A: Just a Solr Function Query
q="sum(C,
       product(age, w1),
       product(Workclass, w2),
       product(Fnlwgt, w3),
       product(Education, w4),
       ….)"

Serialized ML model as query (Trainer + Indexer)
Y_prediction = C + XB
32. May result in a crazy Solr functionQuery
See more at https://wiki.apache.org/solr/FunctionQuery
q=dismax&bf="ord(education-num)^0.5 recip(rord(age),1,1000,1000)^0.3"
34. Option B: Custom Solr FunctionQuery
1. Subclass org.apache.solr.search.ValueSourceParser:

public class MyValueSourceParser extends ValueSourceParser {
  public void init(NamedList namedList) {
    …
  }

  public ValueSource parse(FunctionQParser fqp) throws ParseException {
    return new MyValueSource();
  }
}

2. In solrconfig.xml, register the new ValueSourceParser directly under the <config> tag:

<valueSourceParser name="myfunc" class="com.custom.MyValueSourceParser"/>

3. Subclass org.apache.solr.search.ValueSource and instantiate it in
ValueSourceParser.parse()
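Step 3 is where the model actually lives. A minimal sketch of what MyValueSource might look like (assuming the Lucene/Solr 4.x function-query API; the linear-model features and weights are illustrative, not from the talk):

import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;

public class MyValueSource extends ValueSource {
  private final List<ValueSource> features; // one ValueSource per model feature field
  private final float[] weights;            // learned coefficients, aligned with features

  public MyValueSource(List<ValueSource> features, float[] weights) {
    this.features = features;
    this.weights = weights;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
      throws IOException {
    final FunctionValues[] vals = new FunctionValues[features.size()];
    for (int i = 0; i < vals.length; i++) {
      vals[i] = features.get(i).getValues(context, readerContext);
    }
    return new FloatDocValues(this) {
      @Override
      public float floatVal(int doc) {
        float score = 0f; // linear model: score = w · x over the feature fields
        for (int i = 0; i < vals.length; i++) {
          score += weights[i] * vals[i].floatVal(doc);
        }
        return score;
      }
    };
  }

  @Override
  public String description() { return "linear model over document fields"; }

  @Override
  public boolean equals(Object o) { return this == o; } // simplified for the sketch

  @Override
  public int hashCode() { return System.identityHashCode(this); }
}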
35. Option C: Lucene CustomScoreQuery
2C) Serialize the model with PMML
• Use the JPMML library to read the serialized model in Lucene
• In Lucene, implement an extension with JPMML-Evaluator that
builds the input vectors the model expects
3C) In Lucene:
• Override CustomScoreQuery: load the PMML model
• Create a CustomScoreProvider: marshal document data into PMML arguments
• Rescore: evaluate the PMML model
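For reference, loading a serialized model with JPMML might look like the following fragment (a sketch assuming the JPMML-evaluator 1.2.x-era API; the file path is hypothetical):

import java.io.FileInputStream;
import java.io.InputStream;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.model.PMMLUtil;

// Load the PMML document once, e.g. when the custom query is constructed
PMML pmml;
try (InputStream is = new FileInputStream("/path/to/model.pmml")) {
  pmml = PMMLUtil.unmarshal(is);
}
// Build an evaluator; its getActiveFields() lists the inputs the model expects
Evaluator evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);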
36. Predictive Model Markup Language
• Why use PMML?
• Allows users to build a model in one system
• Export the model and deploy it in a different environment for prediction
• Fast iteration: from research to deployment to production
• The model is an XML document with:
• Header: description of the model and where it was generated
• DataDictionary: defines the fields used by the model
• Model: structure and parameters of the model
• http://dmg.org/pmml/v4-2-1/GeneralStructure.html
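A minimal illustrative skeleton showing those three parts (a toy one-feature regression, not a model from the talk):

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Toy one-feature linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="age" coefficient="0.02"/>
    </RegressionTable>
  </RegressionModel>
</PMML>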
37. Example: Train in Spark to PMML
import
org.apache.spark.mllib.clustering.KMeans
import
org.apache.spark.mllib.linalg.Vectors
//
Load
and
parse
the
data
val
data
=
sc.textFile("/path/to/file")
.map(s
=>
Vectors.dense(s.split(',').map(_.toDouble)))
//
Cluster
the
data
into
three
classes
using
KMeans
val
numIterations
=
20
val
numClusters
=
3
val
kmeansModel
=
KMeans.train(data,
numClusters,
numIterations)
//
Export
clustering
model
to
PMML
kmeansModel.toPMML("/path/to/kmeans.xml")
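The exported kmeans.xml is exactly the kind of serialized model that Options B and C can then load and evaluate at query time.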
40. Overriding scores with CustomScoreQuery
• Matching remains the same
• Scoring is overridden
Flow: the wrapped Lucene query finds the next match and computes its
score; CustomScoreQuery's CustomScoreProvider then rescores the doc
and emits the new score.
*Credit to Doug Turnbull's
"Hacking Lucene for Custom Search Results"
41. Implementing CustomScoreQuery
1. Given a normal Lucene query, wrap it in a CustomScoreQuery:

TermQuery q = new TermQuery(term);
MyCustomScoreQuery mcsq = new MyCustomScoreQuery(q);
// Make sure the query matches all fields needed by PMML!
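The wrapper class itself is not shown on the slides. A minimal sketch (Lucene 4.x-era API assumed; slide 41 constructs it with just the wrapped query, while here the evaluator is passed explicitly for clarity):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;
import org.jpmml.evaluator.Evaluator;

public class MyCustomScoreQuery extends CustomScoreQuery {
  private final Evaluator evaluator; // JPMML evaluator, built once per query

  public MyCustomScoreQuery(Query subQuery, Evaluator evaluator) {
    super(subQuery);
    this.evaluator = evaluator;
  }

  @Override
  protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context)
      throws IOException {
    // The provider implements the customScore() shown on the next slides
    return new MyCustomScoreProvider(context, evaluator);
  }
}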
43. Implementing CustomScoreQuery
2. Rescore each doc with the IndexReader and docID:

public float customScore(int doc, float subQueryScore, float valSrcScores[])
    throws IOException {
  // Lucene reader for this segment
  IndexReader r = context.reader();
  Terms tv = r.getTermVector(doc, _field);
  TermsEnum tenum = null;
  tenum = tv.iterator(tenum);
  // Convert the iterator order to the fields needed by the model
  TermsEnum tenumPMML = tenum2PMML(tenum, evaluator.getActiveFields());
44. Implementing CustomScoreQuery (continued)

  // Marshal document data into PMML arguments
  Map<FieldName, FieldValue> arguments =
      new LinkedHashMap<FieldName, FieldValue>();
  List<FieldName> activeFields = evaluator.getActiveFields();
  for (FieldName activeField : activeFields) {
    // Raw values arrive in the field order the model needs (via tenum2PMML)
    Object rawValue = tenumPMML.next();
    FieldValue activeValue = evaluator.prepare(activeField, rawValue);
    arguments.put(activeField, activeValue);
  }
45. Implementing CustomScoreQuery (continued)

  // Rescore: evaluate the PMML model on the marshaled arguments
  Map<FieldName, ?> results = evaluator.evaluate(arguments);
  FieldName targetName = evaluator.getTargetField();
  Object targetValue = results.get(targetName);
  // Assumes a numeric target; classification outputs may need decoding first
  return (float) targetValue;
}
46. Potential issues
• Performance
• If search space is very large
• If model complexity explodes (e.g., kernel expansion)
• Operations
• Code is running on key infrastructure
• Versioning
• Binary Compatibility
47. Option D: Low-Level Lucene
• CustomScoreQuery or a custom FunctionQuery can't control
matching
• If you want custom matching and scoring, implement:
• Custom Query class
• Custom Weight class
• Custom Scorer class
• http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/
48. Conclusion
• Importance of the full picture – view learning systems through
the lens of the whole elephant
• Reducing the time from science to production is
complicated
• Scalability is hard!
• Why not have ML use search at its core for online evaluation?
• Solr and Lucene are a start for customizing your learning system