Fusepool Machine Learning Framework
June 25th, Brussels
Fusepool
Structured Content
Visualization
Enable personalized software
Outline
Introduction to adaptive interfaces
Source refinement
Document labeling
Link prediction
Adaptive layout
Simple Machine Learning: Listen-Update-Predict (LUP)
LUP in detail for document labelling
Predictive Query: Predictive queries
Adaptive interfaces
Guillaume Bouchard (Xerox)
Customization/Contextualization of interfaces
Known and accepted by big internet companies
Not easy to implement for SMEs
Annotation tools
●To manage large knowledge bases, there is a need for efficient interfaces for annotators
●Web 2.0 companies are investigating these tools
●Mixed initiative
o A learning algorithm + human interface
●Remark: a user can be an annotator for some time
Supervised automation
Introduction
Challenge
Linked Open Data (LOD) provides a huge amount of data
Hard to organize
Goal
Streamline KB cleaning and management through implicit and explicit feedback
Specifications
Easy tagging of documents
Near real-time prediction
Adaptive components in Fusepool
Document category
prediction
Entity labeling
Source refinement (re-ranking based on previous user clicks)
Adaptive Layout
Simple Machine Learning:
Listen-Update-Predict (LUP)
Guillaume Bouchard (Xerox)
Motivation
●Adaptive systems
●Many systems use machine learning algorithms as internal components
●The interaction between raw data, annotations, algorithms and predictions is not simple:
• Data: large and distributed (the 3 Vs: Velocity, Variety, Volume)
• Algorithms: multiple possible algorithms for the same task, slow training/inference
• Visualization: must carry the uncertainty about data, annotations and predictions
●Common problems:
• Confusion between predictions and data
• Models not automatically updated (models must be re-trained manually)
• No simple way to test new algorithms
• Annotations not shared across models in the same system
• Too few annotations in specific domains (no principled way to gather new annotations)
Prior art
• Patterns (and Anti-Patterns) for Developing Machine Learning Systems. SysML 2008
• https://www.usenix.org/legacy/event/sysml08/tech/rios_talk.pdf
• The Agent Learning Pattern: Implementing ML algorithms in multiagent systems
• http://www.cs.cmu.edu/~alberto/papers/LearningPatternSugarLoaf.pdf
• Gestalt, a general-purpose integrated development environment designed for the application of machine learning
• Kayur Patel (University of Washington)
• http://www.acm.org/uist/archive/adjunct/2010/pdf/doctoral_consortium/p355.pdf
• Scikit-learn. Three complementary interfaces: Estimator, Predictor, Transformer
• http://hal.inria.fr/docs/00/85/65/11/PDF/paper.pdf
• Infer.NET: Probabilistic programming. Compilation of machine learning code
• http://research.microsoft.com/en-us/um/people/cmbishop/downloads/bishop-mbml-2012.pdf
• Never-Ending Language Learning (NELL). The closest to our work but focused on language
• www.cs.cmu.edu/~acarlson/papers/carlson-aaai10.pdf
Never Ending Language Learning
●Intelligent computer agent
●Runs forever. Every day:
1. extract, or read, information from the web
2. learn to perform this task better
●Carlson, Betteridge, Kisiel, Settles, Hruschka and Mitchell (2010) give the design principles for such an agent
Machine learning process
LUPI Module overview
Listen
Gets notified when new annotations arrive
Update
Processes annotations & updates learning models
Predict
Exposes a prediction service available to other components
Investigate
Actively asks for new annotations
LUP modules are monitored by
Fusepool main platform
LUP Module Implementation
●LUPEngine is a Java interface
●Location: com.xerox.services.LUPEngine
o + getGraphListener(...);
o + graphChanged(...);
o + updateModels(...);
o + predict(...);
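The operations listed above could be arranged along the following lines. This is only a minimal sketch: the slide elides the real signatures of LUPEngine, so the interface name `LupSketch`, its parameter types and the toy `DictionaryLup` implementation are hypothetical and are not the Fusepool API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a LUP module; NOT the real com.xerox.services.LUPEngine.
interface LupSketch {
    void listen(Map<String, String> annotation);    // called when a new annotation arrives
    void updateModels(Map<String, String> params);  // refreshes the model
    String predict(Map<String, String> params);     // serves predictions to other components
}

// Toy module: the "model" is simply a set of known words.
class DictionaryLup implements LupSketch {
    private final Set<String> dictionary = new HashSet<>();

    @Override
    public void listen(Map<String, String> annotation) {
        // Listen: forward the annotation straight to the update step.
        updateModels(annotation);
    }

    @Override
    public void updateModels(Map<String, String> params) {
        // Update: add the annotated word to the model, if one was supplied.
        String word = params.get("newWord");
        if (word != null) {
            dictionary.add(word);
        }
    }

    @Override
    public String predict(Map<String, String> params) {
        // Predict: report whether the queried word is known to the model.
        return dictionary.contains(params.get("word")) ? "known" : "unknown";
    }
}

class LupSketchDemo {
    public static void main(String[] args) {
        LupSketch lup = new DictionaryLup();
        Map<String, String> annotation = new HashMap<>();
        annotation.put("newWord", "tissue");
        lup.listen(annotation);

        Map<String, String> query = new HashMap<>();
        query.put("word", "tissue");
        System.out.println(lup.predict(query)); // prints "known"
    }
}
```

Keeping the three steps behind one small interface is what would let the platform monitor modules uniformly and swap one learning algorithm for another.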
Guillaume Bouchard (Xerox)
Supervised automation
Follow the LUP
Listen
Users give labels to documents in the GUI
Labels stored in annotation store
Update
Optimize the model with latest annotations
Warm start machine learning algorithms
Predict
Real time prediction based on updated model
Visible in the GUI
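"Warm start" in the update step means that training resumes from the current parameters instead of starting from scratch. A minimal sketch, assuming a binary logistic-regression labeler trained by stochastic gradient descent; the class `WarmStartLabeler`, its feature encoding and its learning rate are illustrative assumptions, not the model used in Fusepool.

```java
// Hypothetical online labeler: the weights persist between updates (warm start).
class WarmStartLabeler {
    private final double[] weights;
    private final double learningRate;

    WarmStartLabeler(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // Predict: probability that the document carries the label.
    double predict(double[] features) {
        double z = 0.0;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Update: one SGD step on the logistic loss, starting from the current weights.
    void update(double[] features, boolean hasLabel) {
        double error = (hasLabel ? 1.0 : 0.0) - predict(features);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * features[i];
        }
    }
}

class WarmStartDemo {
    public static void main(String[] args) {
        WarmStartLabeler model = new WarmStartLabeler(2, 0.5);
        double[] doc = {1.0, 0.0};
        double before = model.predict(doc); // 0.5 with all-zero weights
        model.update(doc, true);            // a user confirms the label
        double after = model.predict(doc);  // higher than before
        System.out.println(before + " -> " + after);
    }
}
```

Because each annotation costs a single cheap step, the updated prediction can be shown in the GUI almost immediately.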
Supervised automation
Architecture
Components Process
Supervised automation
Xerox web services
Update and prediction using REST interface
Scaling up prediction to huge datasets
Listen
private class MyListener implements GraphListener {
    /**
     * Listener method: called when matching modifications are detected on
     * the Annostore. It triggers the learning process through the
     * updateModels(HashMap<String,String> params) method.
     */
    public void graphChanged(List<GraphEvent> list) {
        annostore = tcManager.getMGraph(ANNOTATION_GRAPH_NAME);
        for (GraphEvent e : list) {
            log.info("New #MyKindOfAnnotation !");
            HashMap<String,String> params = new HashMap<String, String>();
            // 1.) Accessing the target of the annotation
            Iterator<Triple> it = annostore.filter(e.getTriple().getSubject(),
                    new UriRef("http://www.w3.org/ns/oa#hasTarget"),
                    null);
            if (!it.hasNext()) {
                continue; // annotation without a target: nothing to learn from
            }
            // 2.) Accessing the content as text of the target
            //     e.g. the new word to insert into the dictionary
            Resource target = it.next().getObject();
            it = annostore.filter((NonLiteral) target,
                    new UriRef("http://www.w3.org/2011/content#chars"),
                    null);
            if (!it.hasNext()) {
                continue; // target without text content
            }
            String newWord = it.next().getObject().toString();
            params.put("newWord", newWord);
            updateModels(params);
        }
    }
}
Update
/**
 * This method updates the learning models.
 */
public void updateModels(HashMap<String, String> params) {
    String newWord = params.get("newWord");
    log.info("Adding " + newWord + " to dictionary");
    myDictionnary.add(newWord);
}
Predict
// Build the parameters passed to the LUP3.4 module via the predictionHub
HashMap<String,String> params = new HashMap<String,String>();
String docURI = "<http://fusepool.info/doc/pmc/2751467>";
params.put("docURI", docURI);
// Call the LUP34.predict(...) method via the predictionHub.predict(...) method
String predictedLabels = predictionHub.predict("LUP34", params);
// Dump the result of the prediction, e.g.
// "tissue__0.713##sodium__0.09135##English__0.016"
log.info(predictedLabels);
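The prediction comes back as a single string, so a consumer has to split it into labels and scores. A minimal sketch, assuming the format is exactly what the sample output suggests (entries separated by `##`, label and score separated by `__`); the class name `PredictionParser` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PredictionParser {
    // Parses "label__score##label__score..." into an insertion-ordered map.
    static Map<String, Double> parse(String predictedLabels) {
        Map<String, Double> scores = new LinkedHashMap<>();
        if (predictedLabels == null || predictedLabels.isEmpty()) {
            return scores;
        }
        for (String entry : predictedLabels.split("##")) {
            int sep = entry.lastIndexOf("__");
            if (sep < 0) {
                continue; // skip entries that have no label/score separator
            }
            String label = entry.substring(0, sep);
            double score = Double.parseDouble(entry.substring(sep + 2));
            scores.put(label, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Double> scores =
                parse("tissue__0.713##sodium__0.09135##English__0.016");
        System.out.println(scores); // prints {tissue=0.713, sodium=0.09135, English=0.016}
    }
}
```

A `LinkedHashMap` keeps the labels in the order in which the string lists them.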
Supervised automation
Multi-task learning services
● Better prediction based on a multi-task algorithm with label embedding
● Efficient learning algorithms
o Alternating optimization
o Stochastic Gradient Descent
● Efficient storage based on Cassandra
Supervised automation
Sequence diagram
1. The GUI inserts annotations
2. The Listener calls the LUP3.4 Module
3. The LUP calls the REST API
4. Then the information flows back when doing prediction
Supervised automation
Properly tested interface
Corpus       20 Newsgroups            WebKB                    Cade
Tolerance    1      2      3          1      2      3          1      2
Rank = 20    0.152  0.074  0.05       0.15   0.055  0.035      0.348  0.222
Rank = 50    0.16   0.072  0.052      0.2    0.085  0.04       0.386  0.266
Rank = 100   0.256  0.166  0.126      0.335  0.18   0.11       0.134  0.072
Predictive queries
Guillaume Bouchard (Xerox)
Motivation for predictive queries
Most prediction problems can be expressed as a query on “missing” information.
SELECT ?n WHERE
<?d, hasLabel, “WellWritten”>
<?p, isAuthor, ?d>
<?p, hasName, ?n>
Semantic Search API
Predictive SPARQL
Core idea: learn a model on KB
→ Now we can query missing data!
● SPARQL is a standard query language for semantic data
● Predictive SPARQL: generalization to probabilistic models
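The core idea can be illustrated with the author query from the motivation slide: observed triples are joined as usual, while the missing hasLabel triple is scored by a model. The sketch below is a toy under stated assumptions: the data, the class `PredictiveQuerySketch`, the lookup table standing in for a learned model and the probability threshold are all hypothetical, and this is not the Predictive SPARQL engine itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PredictiveQuerySketch {
    // Observed facts: <person, isAuthor, document> and <person, hasName, name>.
    static final Map<String, String> IS_AUTHOR = new LinkedHashMap<>();
    static final Map<String, String> HAS_NAME = new LinkedHashMap<>();
    // Stand-in for a learned model: probability that
    // <document, hasLabel, "WellWritten"> holds, although no such triple is stored.
    static final Map<String, Double> WELL_WRITTEN = new LinkedHashMap<>();

    static {
        IS_AUTHOR.put("p1", "d1");
        IS_AUTHOR.put("p2", "d2");
        HAS_NAME.put("p1", "Alice");
        HAS_NAME.put("p2", "Bob");
        WELL_WRITTEN.put("d1", 0.8);
        WELL_WRITTEN.put("d2", 0.3);
    }

    // Names of authors whose document is predicted "WellWritten"
    // with probability at least the given threshold.
    static List<String> wellWrittenAuthors(double threshold) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : IS_AUTHOR.entrySet()) {
            double p = WELL_WRITTEN.getOrDefault(e.getValue(), 0.0);
            if (p >= threshold && HAS_NAME.containsKey(e.getKey())) {
                names.add(HAS_NAME.get(e.getKey()));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(wellWrittenAuthors(0.5)); // prints [Alice]
    }
}
```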
Semantic Search API
Predictive SPARQL example
Semantic Search API
Predictive model
● Use of tensor factorization methods
● Tensor = generalization of matrices
● Scalable probabilistic models
● Based on the RESCAL approximation:
$T_{ikj} \approx e_i^\top R_k e_j$
where:
o $e_i$ and $e_j$ are the vectors representing entities i and j
o $R_k$ is the matrix of relation k
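The bilinear score in the approximation above is cheap to evaluate once the factors are known. A minimal sketch of that product; the class name `RescalScore` and the two-dimensional entity vectors and relation matrix are made-up illustrative values, not learned parameters.

```java
class RescalScore {
    // Bilinear score e_i^T R_k e_j for one (entity i, relation k, entity j) triple.
    static double score(double[] ei, double[][] rk, double[] ej) {
        double total = 0.0;
        for (int a = 0; a < ei.length; a++) {
            for (int b = 0; b < ej.length; b++) {
                total += ei[a] * rk[a][b] * ej[b];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] person = {1.0, 0.0};       // made-up vector for an entity
        double[] document = {0.0, 1.0};     // made-up vector for another entity
        double[][] isAuthor = {{0.0, 0.9},  // made-up relation matrix
                               {0.1, 0.0}};
        System.out.println(score(person, isAuthor, document)); // prints 0.9
        System.out.println(score(document, isAuthor, person)); // prints 0.1
    }
}
```

The relation matrix need not be symmetric, so the score of a triple and that of its reverse can differ, as the two printed values show.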
Predictive SPARQL example
Conclusion
Guillaume Bouchard (Xerox)
Main achievements
● LUP: Listen-Update-Predict is a design pattern that provides software engineering best practices
● Predictive SPARQL: a framework for predictive queries on RDF data
Future of Fusepool
Xerox is using Fusepool for exploring and organizing its customer KB
