SEARN is an algorithm for structured prediction that casts it as a sequence of cost-sensitive classification problems. It works by learning a policy to make incremental decisions that build up the full structured output. The policy is trained through an iterative process of generating cost-sensitive examples from sample outputs produced by the current policy, training a classifier on those examples, and interpolating the new policy with the previous one. This allows SEARN to learn the structured prediction task without requiring assumptions about the output structure, unlike approaches that make independence assumptions or rely on global prediction models.
2. Outline
● What is Structured Prediction
● Approaches to Structured Prediction
● Idea of Search-Based Structured Prediction
● Background Information for SEARN
● SEARN Algorithm
● Comparison with Other Approaches
4. What is Structured Prediction?
● Defined informally, structured prediction is the process of capturing the structure inside a given input.
● The main difference from other machine learning problems is that structured prediction problems usually have a complex output.
11. Different Approaches to SP
● Structured Perceptron
○ A direct adaptation of the Averaged Perceptron from binary classification to SP
● Incremental Perceptron
○ Also a search-based approach.
● Maximum Entropy Markov Models
○ Similar to logistic regression in binary classification
● Conditional Random Fields
○ Solves the label bias problem of Maximum Entropy models
● Maximum Margin Markov Networks
● SVM for Independent and Structured Outputs (SVMstruct)
13. Traditional Approach to SP
[Diagram: an input Xn is fed to a model with parameters W and features F, which generates all possible outputs]
● There is a model that can generate all the possible outputs for a given input.
● Based on the input features, the model parameters assign a score to each of those outputs.
15. Decoding
[Diagram: the input Xn is run through the model (parameters W), which generates all possible outputs; a search then selects the one with the highest score]
● In the decoding phase, the input is run through the model, and then all the outputs are searched to find the output with the highest score.
16. Role of Search
● Search gets the output with the highest score from the search space.
● Almost all SP approaches need a search component.
● In most cases, searching through the whole space is intractable.
○ Assumptions about the output are made so that dynamic programming can be applied.
○ Approximate methods such as beam search, greedy search, and other heuristic-based search methods are used.
● Search can be seen as a sequence of decisions taken to get the best
output.
17. Search-based SP
● The search phase and the model are combined.
● Rather than searching after the model, learn how to search.
● Each decision made during the search is treated as a classification problem.
● Each search decision made builds up the output incrementally.
● The goal is to train these classifiers to build an optimal output (a minimal sketch follows).
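As a rough illustration of the idea, the sketch below decodes a sequence greedily, one classifier decision at a time. The classifier `clf` and the feature encoder `features` are placeholders assumed for illustration; they are not part of the SEARN paper itself.

```python
# A minimal sketch of search-based prediction for sequence labeling.
# `clf` is any trained multiclass classifier with a predict() method;
# `features` is a hypothetical function encoding the input together
# with the partial output built so far. Both are assumptions.

def greedy_decode(clf, x, length, features):
    """Build the structured output one search decision at a time."""
    partial = []                       # the partial structure (the state)
    for t in range(length):
        phi = features(x, partial, t)  # features may span past decisions
        action = clf.predict([phi])[0] # one classification per decision
        partial.append(action)         # the decision extends the output
    return partial
```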
19. Learning Reductions
● Relating a hard and complex prediction problem to a simpler prediction
problem.
● Maps a harder problem to a simpler problem, obtains a solution for the simpler problem, and maps that solution back to the harder problem.
● A reduction has three components:
○ Sample mapping - Mapping the complex problem's dataset to the simpler problem
○ Hypothesis mapping - Mapping the simpler problem's solution back to the hard problem
○ Bounds - How well the reduction solves the larger problem
20. Importance Weighted Binary Classification
● A simple extension to binary classification.
● Each example (data item) has an associated weight that reflects the importance of that data item: (xᵢ, yᵢ, cᵢ).
● The solution should be a binary classifier that minimizes the expected weighted loss.
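In symbols, a standard way to write this objective (the exact formula is not in the extracted slides, so read this as a paraphrase):

```latex
% Importance-weighted binary classification: examples (x, y, c), c >= 0.
% The learner seeks the classifier h minimizing the expected weighted loss:
\[
  h^{*} = \arg\min_{h} \; \mathbb{E}_{(x,\,y,\,c) \sim D}
          \bigl[\, c \cdot \mathbf{1}\{h(x) \neq y\} \,\bigr]
\]
```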
21. Importance Weighted Binary Classification
● Solved by reducing the problem to C parallel binary classifiers.
● C datasets are generated by sampling from the original dataset with acceptance probability proportional to the importance weights.
● Using those different datasets, C binary classifiers are trained.
● Prediction is made by majority vote of those C parallel binary classifiers (a sketch follows).
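A minimal sketch of this reduction (known in the literature as cost-proportionate rejection sampling, or "Costing"). The base learner, the value of C, and the use of scikit-learn are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_costing(X, y, weights, C=10, seed=0):
    """Train C binary classifiers on rejection-sampled datasets.
    weights: array of nonnegative importance weights, one per example."""
    rng = np.random.default_rng(seed)
    accept = weights / weights.max()        # acceptance prob. proportional to importance
    classifiers = []
    for _ in range(C):
        keep = rng.random(len(X)) < accept  # one resampled dataset
        classifiers.append(LogisticRegression().fit(X[keep], y[keep]))
    return classifiers

def predict_majority(classifiers, X):
    """Majority vote over the C parallel binary classifiers (0/1 labels)."""
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return (votes.mean(axis=0) > 0.5).astype(int)
```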
22. Cost Sensitive Classification
● This is a natural extension of Importance Weighted Binary Classification to a multi-class scenario.
● For a K-class task, we have to find a hypothesis h that minimizes the expected cost of the predictions.
● C is a K-sized vector containing the cost of predicting each class.
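Written out (a standard formulation matching the description above):

```latex
% Cost-sensitive classification: each example carries a cost vector
% c = (c_1, ..., c_K); predicting class k on that example costs c_k.
\[
  h^{*} = \arg\min_{h} \; \mathbb{E}_{(x,\,c) \sim D} \bigl[\, c_{h(x)} \,\bigr]
\]
```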
23. Cost Sensitive Classification
● This is reduced to the Importance Weighted Binary Classification problem using the Weighted All Pairs (WAP) reduction (Beygelzimer et al., 2005).
● WAP generates k(k − 1)/2 importance-weighted binary classification problems, one per pair of classes.
● Importance weights are computed from the cost vector so that solving the binary problems well also solves the cost-sensitive problem (see the sketch below).
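A reconstruction of the WAP weighting, following Beygelzimer et al. (2005); the slide's formula is not in the extracted text, so treat the exact form here as an assumption:

```latex
% For a cost vector with sorted components c_1 <= ... <= c_k, the pair
% (i, j) becomes a binary example ("which of i, j is cheaper?") with
% importance |v_i - v_j|, where
\[
  v_i = \int_{c_1}^{c_i} \frac{1}{F(t)} \, dt,
  \qquad F(t) = \bigl|\{\, j : c_j \le t \,\}\bigr|
\]
```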
25. SEARN Algorithm
● SEARN is developed by casting structured prediction in the language of reductions.
● In particular, it reduces structured prediction to cost-sensitive classification.
● The cost-sensitive classification problem can in turn be reduced to binary classification by applying the Weighted All Pairs method.
● So structured prediction can be solved using binary classification.
26. SEARN Algorithm
● Removes the “search” from the prediction process by learning a classifier
to make incremental decisions.
27. Definition of Structured Prediction
We can define the structured prediction problem as a cost-sensitive classification problem as follows.
28. Definition of Structured Prediction
The goal of structured prediction is to find a hypothesis h : X → Y that minimizes the given loss.
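Reconstructed in symbols (the original slides showed the definition as an image; the formulation below follows the SEARN paper's style and should be read as a paraphrase):

```latex
% Structured prediction as cost-sensitive classification: D is a
% distribution over inputs x in X and cost vectors c, with one
% component c_y for every candidate output y in Y. The goal is
\[
  h^{*} = \arg\min_{h : X \to Y} \; \mathbb{E}_{(x,\,c) \sim D} \bigl[\, c_{h(x)} \,\bigr]
\]
```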
29. Policy
● We need to find an h such that, given a state s and the input x, h(x, s) gives the next action.
● We can consider the policy h as a classifier; the whole problem then becomes a classification problem.
● Now we need to train this classifier (a sketch of such a policy follows).
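A minimal sketch of a policy h(x, s) as a thin wrapper around a multiclass classifier; the classifier and the feature encoder are illustrative assumptions:

```python
# Turn a multiclass classifier into a "next action" function h(x, s).

class ClassifierPolicy:
    def __init__(self, clf, features):
        self.clf = clf            # any trained multiclass classifier
        self.features = features  # encodes (input, state) as a vector

    def __call__(self, x, state):
        """Return the next action for input x in state s, i.e. h(x, s)."""
        phi = self.features(x, state)
        return self.clf.predict([phi])[0]
```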
30. Training
● Training is an iterative process.
● Initialize with a known policy.
● Using that policy, create cost-sensitive examples.
● Create a new policy using the cost-sensitive examples.
● Interpolate the previous policy and the new policy (see the formula below).
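The interpolation step forms a stochastic mixture of policies (a standard way to write it; β is the interpolation parameter):

```latex
% At the end of each iteration, the new policy follows the freshly
% trained classifier h' with probability beta and otherwise falls back
% to the previous policy:
\[
  \pi_{\mathrm{new}} = \beta \, h' + (1 - \beta) \, \pi_{\mathrm{old}},
  \qquad 0 < \beta \le 1
\]
```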
31. Cost Sensitive Examples
● A policy generates one path per training example. (A path is a sequence of states; a state is a partial structure.)
● SEARN creates a single cost-sensitive example for each state on each path.
● The classes associated with each example are the available actions (next states), each with an associated cost.
● Now the difficulty lies in specifying these cost values.
32. Cost
● The cost of each action can be considered as regret, defined as follows (π is the policy; see the reconstruction after this list).
● The complexity of that equation is problem dependent.
● There are multiple ways to compute it (Monte Carlo sampling, single Monte Carlo sampling, etc.).
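A reconstruction of the regret definition (the slide's formula was an image; the form below paraphrases the SEARN paper, so treat the notation as an assumption):

```latex
% Cost of action a at state s: the expected loss of taking a and then
% following policy \pi, minus the best achievable over actions.
\[
  \ell_{\pi}(s, a) =
    \mathbb{E}\bigl[\, L\bigl(y_{\pi}(s, a)\bigr) \,\bigr]
    \;-\; \min_{a'} \, \mathbb{E}\bigl[\, L\bigl(y_{\pi}(s, a')\bigr) \,\bigr]
\]
% y_pi(s, a) is the full output reached by taking a in s and following
% \pi thereafter; L is the task loss.
```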
33. Optimal Policy
● The optimal policy is a policy that, for a given state, input, and output (structured prediction cost vector), always predicts the best action to take.
● For sequence labeling under Hamming loss, for example, the optimal policy simply predicts the true label at the current position.
34. Optimal Policy
● SEARN uses the optimal policy to initialize the iterative process and attempts to migrate toward a completely learned policy that will generalize well.
● SEARN assumes the existence of an optimal policy for the problem.
35. Algorithm
● π* is the optimal policy.
● Learn is a multi-class learner.
● The policy is initialized with the optimal policy (line 1).
● The algorithm then iterates for a number of iterations.
● It makes cost-sensitive examples using the current policy.
● It interpolates the previous policy with the current one (a sketch of the loop follows).
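A high-level sketch of the SEARN training loop as described above; `make_cost_sensitive_examples`, `Learn`, and the hyperparameter values stand in for problem-specific components and are assumptions of this sketch, not the paper's exact pseudocode:

```python
import random

def searn(data, optimal_policy, Learn, make_cost_sensitive_examples,
          iterations=5, beta=0.3):
    """Iteratively train a policy, starting from the optimal policy."""
    policy = optimal_policy                 # line 1: initialize
    for _ in range(iterations):
        examples = []
        for x, y in data:
            # run the current policy; one cost-sensitive example per
            # state on the generated path
            examples.extend(make_cost_sensitive_examples(policy, x, y))
        h_new = Learn(examples)             # train on the new examples
        policy = interpolate(h_new, policy, beta)
    return policy

def interpolate(h_new, h_old, beta):
    """Stochastic mixture: follow h_new with probability beta."""
    def mixed(x, state):
        return h_new(x, state) if random.random() < beta else h_old(x, state)
    return mixed
```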
37. Vs. Independent Classifiers
● Output structure is assumed to be decomposable and each part is
classified (predicted) individually.
● Cannot define features that span across output structure.
● Even if the previous results are taken into consideration, the result can be suboptimal.
● Limited to Hamming loss.
38. Vs. Perceptron algorithms
● Assumes a tractable argmax operation.
● Generalizes poorly. (This can be mitigated by averaging the weights.)
● Limited to only one loss function.
39. Vs. Global prediction algorithms
● Highly dependent on assumptions about the output structure (e.g., the Markov assumption).
● In comparison SEARN is more general, limited neither to linear chains nor
to Markov style features.
● SEARN requires far weaker assumptions.
41. The SEARN algorithm can solve structured prediction problems under any model, any feature functions, and any loss.
42. References
● Hal Daumé III, John Langford, and Daniel Marcu. Search-based Structured Prediction. Submitted to the Machine Learning Journal, 2006.
● Hal Daumé III. Practical Structured Learning Techniques for Natural Language Processing. PhD thesis, University of Southern California, 2006.