Supervised Prediction of Graph Summaries


Supervised machine learning addresses the problem of approximating a function given examples of inputs and outputs. The classical tasks of regression and classification deal with functions whose outputs are real numbers. Structured output prediction goes beyond one-dimensional outputs and allows predicting complex objects such as sequences, trees, and graphs. In this talk I will show how to apply structured output prediction to building informative summaries of topic graphs, a problem I encountered in my Ph.D. research. The focus of the talk will be on understanding the intuitions behind the machine learning algorithms. We will start from the basics and walk through the inner workings of DAgger, a state-of-the-art method for structured output prediction.

This talk was given at a seminar in Google Krakow.

  1. SUPERVISED PREDICTION OF GRAPH SUMMARIES
     Daniil Mirylenka, University of Trento, Italy
  2. Outline
     • Motivating example (from my Ph.D. research)
     • Supervised learning
       • Binary classification
       • Perceptron
       • Ranking, Multiclass
     • Structured output prediction
       • General approach, Structured Perceptron
       • “Easy” cases
     • Prediction as search
       • Searn, DAGGER
       • Back to the motivating example
  3. Motivating example
     Representing academic search results
     (figure: search results mapped to a graph summary)
  4. Motivating example
     Suppose we can do this:
     (figure: summarizing a large graph)
  5. Motivating example
     Then we only need to do this:
     (figure: summarizing a small graph)
  6. Motivating example
     What is a good graph summary? Let’s learn from examples!
  7. Supervised learning
  8. What is supervised learning?
     A bunch of examples (x1, y1), (x2, y2), …, (xn, yn) go into supervised learning; a function f : x → y comes out.
  9. Statistical learning theory
     Where do our examples come from?
     The examples (x1, y1), (x2, y2), …, (xn, yn) are samples drawn i.i.d. from a distribution of examples P(x, y).
  10. Statistical learning theory
      What functions do we consider? The hypothesis space: f ∈ H.
      Linear (H1)? Cubic (H2)? Piecewise-linear (H3)?
  11. Statistical learning theory
      The loss function: how bad is it to predict f(x) instead of the true y?  L(f(x), y)
      Example: zero-one loss
        L(y, ŷ) = 0 when y = ŷ, and 1 otherwise
  12. Statistical learning theory
      Goal: argmin_{f ∈ H} ∫_{X×Y} L(f(x), y) p(x, y) dx dy   (expected loss on new examples)
      Requirement: argmin_{f ∈ H} ∑_{i=1..n} L(f(xi), yi)   (total loss on training data)
  13. Linear models
      Inference (prediction): fw(x) = g(⟨w, ϕ(x)⟩), where ϕ(x) are the features of x and ⟨·,·⟩ is the scalar product (a linear combination, a weighted sum).
      Learning: w = argmin_w ∑_{i=1..n} L(fw(xi), yi), i.e. optimization with respect to w (e.g. gradient descent).
  14. Binary classification
      y ∈ {−1, 1}
      Prediction: fw(x) = sign(⟨w, x⟩): is x above or below the line (hyperplane)?
      (figure: the regions ⟨w, x⟩ > 0 and ⟨w, x⟩ < 0, separated by ⟨w, x⟩ = 0)
      Note that a correct prediction means yi⟨w, xi⟩ > 0.
  15. Perceptron
      Learning algorithm (optimizes one example at a time):
      Repeat: for every xi
        • if yi⟨w, xi⟩ ≤ 0 (if made a mistake)
        • w ← w + yi·xi (update the weights)
      If yi > 0 the update makes the new w more like xi; if yi < 0, more like −xi.
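The recap later claims the perceptron fits in a few lines of code, and it does. Below is a minimal NumPy sketch of the update loop above (not taken from the slides); the function names, the fixed number of epochs, and the data layout (rows of X paired with labels y in {−1, 1}) are my own choices:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Perceptron training: one example at a time, update only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                 # "Repeat"
        for xi, yi in zip(X, y):            # for every xi
            if yi * np.dot(w, xi) <= 0:     # made a mistake (or on the boundary)
                w += yi * xi                # w <- w + yi * xi
    return w

def predict(w, x):
    """fw(x) = sign(<w, x>)"""
    return np.sign(np.dot(w, x))
```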
  16. Perceptron
      Update rule: w ← w + yi·xi
      (figure: w_old and xi before the update)
  17. Perceptron
      Update rule: w ← w + yi·xi
      (figure: w_new after adding xi to w_old)
  18. Max-margin classification
      Idea: ensure some distance from the hyperplane.
      Require: yi⟨w, xi⟩ ≥ 1
  19. Preference learning
      Suppose we want to predict rankings: x → y = (v1, v2, …, vk), where (x, vi) ≻ (x, vj) ⇔ i < j.
      Use joint features ϕ(x, v) of x and v, and require ⟨w, ϕ(x, v) − ϕ(x, v′)⟩ ≥ 1 whenever v should rank above v′.
      Also works for:
        • selecting just the best one
        • multiclass classification
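As an illustration of the margin requirement ⟨w, ϕ(x, v) − ϕ(x, v′)⟩ ≥ 1, here is one way perceptron-style updates could enforce ranking preferences; the `phi` callback and the representation of a ranking as a list ordered from best to worst are assumptions of this sketch, not something specified in the talk:

```python
import numpy as np

def ranking_updates(w, phi, x, ranking):
    """One pass over all ordered pairs of a ranking: whenever the preferred item
    does not beat the less preferred one by a margin of 1, move w toward the
    difference of their joint feature vectors."""
    for i, better in enumerate(ranking):
        for worse in ranking[i + 1:]:
            diff = phi(x, better) - phi(x, worse)   # phi(x, v) - phi(x, v')
            if np.dot(w, diff) < 1:                 # margin violated
                w = w + diff
    return w
```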
  20. Structured prediction
  21. Structured prediction
      Examples: x = “Time flies like an arrow.”
        • part-of-speech tagging: y = (noun verb preposition determiner noun)
        • or a parse tree: y = (S (NP (NNP Time)) (VP (VBZ flies) (PP (IN like) (NP (DT an) (NN arrow)))))
  22. Structured prediction
      How can we approach this problem?
        • before we had: f(x) = g(⟨w, ϕ(x)⟩)
        • now f(x) must be a complex object: f(x) = argmax_y ⟨w, ψ(x, y)⟩, where ψ(x, y) are joint features of x and y (kind of like we had with ranking)
  23. Structured Perceptron
      Almost the same as the ordinary perceptron:
        • for every xi
        • predict: ŷi = argmax_y ⟨w, ψ(xi, y)⟩
        • if ŷi ≠ yi (if made a mistake), update the weights: w ← w + ψ(xi, yi) − ψ(xi, ŷi)
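A compact sketch of the structured Perceptron loop described above; `psi` (joint features) and `argmax_fn` (a solver for the argmax problem discussed next) are assumed callbacks, and the number of epochs is arbitrary:

```python
import numpy as np

def structured_perceptron(examples, psi, argmax_fn, n_features, epochs=5):
    """examples: list of (x, y) pairs; psi(x, y) -> feature vector;
    argmax_fn(w, x) -> the highest-scoring output under the current weights."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = argmax_fn(w, x)              # predict with the current w
            if y_hat != y:                       # made a mistake
                w += psi(x, y) - psi(x, y_hat)   # move toward the correct output
    return w
```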
  24. Argmax problem
      Prediction ŷ = argmax_y ⟨w, ψ(x, y)⟩ is often infeasible.
      Examples:
        • a sequence of length T, with d options for each label: d^T outputs
        • a subgraph of size T from a graph G: |G| choose T outputs
        • a 10-word sentence, 5 parts of speech: ~10 million outputs
        • a 10-node subgraph of a 300-node graph: 1,398,320,233,241,701,770 outputs (around 10^18)
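The two concrete counts on this slide are plain arithmetic and can be reproduced in a couple of lines (this check is mine, not part of the talk):

```python
from math import comb

print(5 ** 10)        # 9765625 label sequences for a 10-word sentence with 5 tags
print(comb(300, 10))  # 1398320233241701770 ten-node subsets of a 300-node graph
```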
  25. Argmax problem
      Prediction ŷ = argmax_y ⟨w, ψ(x, y)⟩ is often infeasible.
      Learning is even more difficult: it includes prediction as a subroutine.
  26. Argmax problem: easy cases
      Independent prediction
        • suppose y decomposes into (v1, v2, …, vT)
        • and ψ(x, y) decomposes into ψ(x, y) = ∑_{i=1..T} ψi(x, vi)
        • then predictions can be made independently:
          argmax_y ⟨w, ψ(x, y)⟩ = (argmax_{v1} ⟨w, ψ1(x, v1)⟩, …, argmax_{vT} ⟨w, ψT(x, vT)⟩)
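When the features decompose per position, the global argmax splits into T small ones. A sketch of that reduction; the `psis` list of per-position feature functions and the `label_sets` candidate lists are assumed interfaces:

```python
import numpy as np

def independent_argmax(w, psis, x, label_sets):
    """Pick each label independently: argmax_v <w, psi_i(x, v)> for every position i."""
    return tuple(
        max(labels, key=lambda v, i=i: np.dot(w, psis[i](x, v)))
        for i, labels in enumerate(label_sets)
    )
```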
  27. Argmax problem: easy cases
      Sequence labeling
        • suppose y decomposes into (v1, v2, …, vT)
        • and ψ(x, y) decomposes into ψ(x, y) = ∑_{i=1..T−1} ψi(x, vi, vi+1)
        • dynamic programming: O(T·d²)
        • with ternary features: O(T·d³), etc.
        • in general, tractable in graphs with bounded treewidth
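For the pairwise (chain) decomposition above, the argmax can be found by dynamic programming in O(T·d²). A minimal Viterbi sketch, assuming a feature callback `psi(x, i, v, v_next)` for the transition between positions i and i+1 (that interface is my own, not from the slides):

```python
import numpy as np

def viterbi(w, psi, x, labels, T):
    """Find the max-scoring label sequence under pairwise features
    psi(x, i, v, v_next); O(T * d^2) for d = len(labels)."""
    d = len(labels)
    score = np.zeros(d)          # best score of a partial sequence ending in each label
    back = []                    # back-pointers, one array per transition
    for i in range(T - 1):
        edge = np.array([[score[a] + np.dot(w, psi(x, i, labels[a], labels[b]))
                          for a in range(d)]
                         for b in range(d)])
        back.append(edge.argmax(axis=1))   # best predecessor a for each next label b
        score = edge.max(axis=1)
    seq = [int(score.argmax())]            # best final label
    for bp in reversed(back):              # follow back-pointers to the start
        seq.append(int(bp[seq[-1]]))
    return [labels[j] for j in reversed(seq)]
```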
  28. Approximate argmax
      General idea: search in the space of outputs.
      Natural generalization:
        • space of partial outputs
        • composing the solution sequentially
      How do we decide which moves to take? Let’s learn to make good moves!
      Most interesting/crazy idea of this talk. (And we don’t need the original argmax problem anymore.)
  29. Learning to search
  30. Learning to search
      Sequential prediction of structured outputs:
        • decompose the output y = (v1, v2, …, vT)
        • learn the policy π : (v1, v2, …, vt) → vt+1, mapping a state to an action
        • apply the policy sequentially: s0 → s1 → … → sT = y, choosing vt+1 at each state st
        • the policy can be trained on examples (st, vt+1), e.g. with preference learning
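Sequential prediction itself is just repeated application of the policy to a growing partial output. A tiny sketch, where `policy(state)` returning the next element is an assumed interface:

```python
def roll_out(policy, T):
    """Apply the policy sequentially: s0 -> s1 -> ... -> sT = y."""
    state = ()                      # s0: the empty partial output
    for _ in range(T):
        v_next = policy(state)      # the action v_{t+1} chosen in state s_t
        state = state + (v_next,)   # s_{t+1}: the partial output extended by v_{t+1}
    return state                    # sT is the complete output y
```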
  31. Learning to search
      The caveat of sequential prediction
        • states si: coordinates of the car; actions vi+1: steering (‘left’, ‘right’)
        (figure: the car drifting off course: left! left! right! right! left! Oops!)
      Problem:
        • errors accumulate
        • training data is not i.i.d.!
      Solution:
        • train on the states produced by our policy!
        • a chicken-and-egg problem (solution: iterate)
  32. Searn and DAGGER
      Searn = “search” + “learn” [1]
        • start from the optimal policy; gradually move away from it
        • generate new states with the current policy πi
        • generate actions based on regret
        • train a new policy π′i+1 (the policy learnt at iteration i) on the new state-action pairs
        • interpolate the current policy: πi+1 ← β·πi + (1 − β)·π′i+1
      [1] Hal Daumé III, John Langford, Daniel Marcu. Search-based Structured Prediction. Machine Learning Journal, 2006.
  33. Searn and DAGGER
      DAGGER = “dataset” + “aggregation” [2]
        • start from the ‘ground truth’ dataset and enrich it with new state-action pairs
        • train a policy on the current dataset
        • use the policy to generate new states
        • generate ‘expert actions’ for the new states
        • add the new state-action pairs to the dataset
      As in Searn, we’ll eventually be training on the states produced by our own policy.
      [2] Stephane Ross, Geoffrey Gordon, Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 2011.
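Putting the bullets above into code, a minimal sketch of the DAGGER loop; `train`, `run_policy`, and `expert_action` are assumed callbacks standing in for the learner, the roll-outs, and the expert of [2], not an actual API:

```python
def dagger(ground_truth_pairs, train, run_policy, expert_action, iterations=5):
    """ground_truth_pairs: initial (state, action) pairs; train(dataset) -> policy;
    run_policy(policy) -> states visited by the policy; expert_action(state) -> action."""
    dataset = list(ground_truth_pairs)      # D0: the 'ground truth' dataset
    policy = train(dataset)
    for _ in range(iterations):
        states = run_policy(policy)         # new states produced by our own policy
        dataset += [(s, expert_action(s)) for s in states]   # aggregate the dataset
        policy = train(dataset)             # retrain on the enriched dataset
    return policy
```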
  34. DAGGER for building the graph summaries
      Input: topic graph G = (V, E), search results S, relation R ⊆ V × S
      Output: topic summary GT = (VT, ET) of size T
      A few tricks:
        • predict only the vertices VT
        • require that the summaries be nested: ∅ = V0 ⊂ V1 ⊂ … ⊂ VT
        • which means Vi+1 = Vi + vi+1
        • hence, the task is to predict the sequence (v1, v2, …, vT)
  35. DAGGER for building the graph summaries
      • Provide the ‘ground truth’ topic sequences. A single ground-truth example is ((V, S, R), (v1, v2, …, vT)):
        topics (vertices), documents (search results), topic-document relations, and the topic sequence.
      • Create the dataset D0 = ∪ {(si, vi+1)}, i = 0, …, T−1
      • Train the policy πi on Di
      • Apply πi to the initial states s0 (empty summaries) to generate state sequences (s1, s2, …, sT) (intermediate summaries)
      • Produce the ‘expert action’ v* for every generated state s, and set Di+1 = Di ∪ {(s, v*)}
  36. DAGGER: producing the ‘expert action’
      • The expert’s action brings us closer to the ‘ground-truth’ trajectory.
      • Suppose the ‘ground-truth’ trajectory is (s1, s2, …, sT)
      • and the generated trajectory is (ŝ1, ŝ2, …, ŝT).
      • The expert’s action: v*i+1 = argmin_v Δ(ŝi ∪ {v}, si+1), where Δ is the dissimilarity between the states.
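The expert action on this slide is a one-line argmin once a state dissimilarity Δ is available. A sketch; the `candidates` set and the `delta` function are assumed inputs:

```python
def expert_action(s_hat, s_next, candidates, delta):
    """v*_{i+1} = argmin over v of delta(s_hat ∪ {v}, s_next): the topic whose
    addition brings the generated state closest to the next ground-truth state."""
    return min(candidates, key=lambda v: delta(set(s_hat) | {v}, s_next))
```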
  37. DAGGER: topic sequence dissimilarity
      Δ((v1, v2, …, vt), (v′1, v′2, …, v′t))
      • Set-based dissimilarity, e.g. Jaccard distance
        • similarity between topics?
        • encourages redundancy
      • Sequence-matching based dissimilarity
        • greedy approximation
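The set-based option mentioned first, the Jaccard distance between two topic sets, is straightforward (a sketch; note it ignores topic-to-topic similarity, which is exactly the limitation the slide points out):

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets of topics."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0                   # both empty: treat as identical
    return 1.0 - len(a & b) / len(a | b)
```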
  38. DAGGER: topic graph features
      ψ((V, S, R), (v1, v2, …, vt))
      • Coverage and diversity
        • [transitive] document coverage
        • [transitive] topic frequency, average and min
        • topic overlap, average and max
        • parent-child overlap, average and max
        • …
  39. Recap
      We’ve learnt:
        • … how to do binary classification, and implement it in 4 lines of code
        • … about more complex problems (ranking and structured prediction): the general approach, the structured Perceptron, the argmax problem
        • … that learning and search are two sides of the same coin
        • … how to predict complex structures by building them sequentially: Searn and DAGGER
  40. Questions?
      dmirylenka @ disi.unitn.it
  41. Extra slides
  42. Support Vector Machine
      Idea: a large margin between positive and negative examples.
      Loss function (hinge loss): L(y, f(x)) = [1 − y·f(x)]+
      (solved by constrained convex optimization)
      { yi⟨w, xi⟩ ≥ C, C → max }  ⇔  { yi⟨w, xi⟩ ≥ 1, ‖w‖ → min }
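The hinge loss on this slide is easy to state in code (a small sketch, using NumPy for the elementwise maximum):

```python
import numpy as np

def hinge_loss(y, fx):
    """[1 - y * f(x)]_+ : zero when the example is on the correct side with
    margin at least 1, and grows linearly with the violation otherwise."""
    return np.maximum(0.0, 1.0 - y * fx)

# hinge_loss(+1, 2.0) -> 0.0 (outside the margin), hinge_loss(+1, 0.3) -> 0.7
```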
  43. Structured SVM
      Correct outputs score higher by a margin:
        ‖w‖ → min,  subject to ⟨w, ψ(xi, yi)⟩ − ⟨w, ψ(xi, y)⟩ ≥ 1 for all y ≠ yi
      Taking into account the (dis)similarity between the outputs, the margin depends on the dissimilarity:
        ⟨w, ψ(xi, yi) − ψ(xi, y)⟩ ≥ Δ(yi, y) for all y ≠ yi
