Makoto P. Kato
(University of Tsukuba)
Wiradee Imrattanatrai (Kyoto University), Takehiro Yamamoto,
Hiroaki Ohshima (University of Hyogo), and Katsumi Tanaka (Kyoto University)
Context-guided Learning to Rank Entities
= 0.9(Context-guided Learning) + 0.1(Learning to Rank Entities)
Learn to rank entities from their numerical attributes
and a subset of already-ranked entities
Goal
Ranking of Popular Countries (rows = entities; columns = numerical attributes):

Rank | Country     | GDP ($) | Military | Land area | … | Min. Temp.
1st  | Sweden      |     493 |        5 |       450 | … |        −50
2nd  | Canada      |   1,550 |       15 |     9,987 | … |        −40
3rd  | Switzerland |     664 |        3 |        41 | … |        −30
4th  | Australia   |   1,225 |       21 |     7,692 | … |        −20
5th  | Norway      |     388 |        5 |       323 | … |        −10

Learn: Popularity = +0.5 (Happiness) − 0.3 (# Suicides)
Attractiveness of cities =
+ 0.035 (Avg. life expectancy of women) − 0.032 (# Traffic accidents)
− 0.031 (Population / # Households)
Popularity of countries =
+ 0.058 (Happiness) − 0.057 (# Refugees) − 0.045 (# Suicides)
Peacefulness of countries =
+ 0.170 (Grain harvest) + 0.166 (GDP growth rate) − 0.126 (# Suicides)
Usability of cameras =
− 0.240 (Weight) − 0.213 (Height) + 0.133 (Max. shutter speed)
Real Examples from Experiments
If the ranking of entities can be learned,
the following applications become possible
Motivation
Ranking entities in a specified order:
given the query "safe countries", return
1. Iceland  2. New Zealand  3. Portugal  4. Austria  5. Denmark

Understanding rankings:
given the "Safe country ranking 2020"
(1. Iceland  2. New Zealand  3. Portugal  4. Austria  5. Denmark),
learn Safety = +0.5 (Police budget) − 0.8 (Crime rate)
Too many attributes for a small amount of training data
(known as over-fitting)
Challenge
A spurious model fits the training data:
Popularity = −1.0 (Min. Temp.)
This should hold only for these five countries:

Rank | Country     | GDP ($) | Military | Land area | … | Min. Temp.
1st  | Sweden      |     493 |        5 |       450 | … |        −50
2nd  | Canada      |   1,550 |       15 |     9,987 | … |        −40
3rd  | Switzerland |     664 |        3 |        41 | … |        −30
4th  | Australia   |   1,225 |       21 |     7,692 | … |        −20
5th  | Norway      |     388 |        5 |       323 | … |        −10
• Over-fitting
  – The learned model is highly accurate on seen data (training data)
    but not on unseen data (test data)
  – In general, it occurs when the number of features is large
    compared to the number of training instances
• Is it a serious problem?
  – If the number of attributes for an entity class is fixed,
    the only solution is to increase the size of the training data
Over-fitting?
Can you increase the number of entities?
e.g. the number of countries (max. ~200)
Sometimes yes, and sometimes no
Why not help rankers understand attributes through their contexts?
→ Context-guided Learning (CGL)
Observation
• It seems obvious to us that
  Popularity = −1.0 (Min. Temp.)
  is a wrong model for the popularity ranking
• Why?
  – We know the meaning of the minimum temperature,
    and that it (probably) has nothing to do with a country's popularity
  – We probably learned this by reading/listening to many sentences about
    "popularity" and "minimum temperature"
Key Idea
1. Introduced the problem of learning to rank entities
   using attributes as features
   – For ranking entities by various criteria and precisely understanding
     ranking criteria
2. Proposed Context-guided Learning (CGL)
   – A general ML method that uses contexts of labeling criteria and
     features to prevent over-fitting
3. Conducted experiments with a wide variety of orders,
   and demonstrated the effectiveness of CGL
   in the task of learning to rank entities
Contributions
Learn the weights of a linear model from training instances,
as well as from contexts of the labeling criteria and attributes
Context-guided Learning (CGL)
– Labeling criterion: the language expression used to explain how labels
  are given (e.g. "popularity")
– Context of x: sentences mentioning x

A linear model for "popularity":
Popularity = w1 (GDP) + w2 (# Suicides) + w3 (Min. Temp.)

Contexts suggesting negative correlation → w2 estimated as non-zero and negative:
  "A large # suicides affects the popularity of countries."
  "# suicides may indicate low popularity of the country."
Contexts suggesting no correlation → w3 estimated as zero:
  "While the min. temp. is low, the country is popular."
  "The country is cold but popular."
• Suppose we try to learn a linear model f(x) = wᵀx + b
• One weight vector fitting the training data is
  w = (1, 0), meaning that "warm countries are rich"
  – (0, 1) is another candidate for w, but there is no evidence
    on which is better
Example: Learning without CGL
Attributes and Labels of Entities
(l = labeling criterion; a1, a2 = attributes):

Entity | Rich (l) | Temp. (a1) | GDP (a2)
x1     |       +1 |         14 |        9
x2     |       +1 |         13 |        4
x3     |       −1 |          3 |        1

[Figure: entities x1, x2, x3 plotted in the (Temp., GDP) plane with a
decision boundary defined by w, the weights of the linear function]
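The ambiguity above can be checked in a few lines with the slide's toy data; both candidate weight vectors separate the three training points perfectly (the bias values −8 and −2.5 are hand-picked for illustration):

```python
import numpy as np

# Toy data from the slide: columns are (Temp., GDP); labels are "rich?" (+1/-1).
X = np.array([[14.0, 9.0],   # x1: rich
              [13.0, 4.0],   # x2: rich
              [ 3.0, 1.0]])  # x3: not rich
y = np.array([+1, +1, -1])

def separates(w, b):
    """True if sign(w.x + b) matches every label."""
    return bool(np.all(np.sign(X @ w + b) == y))

# Two very different models fit the training data equally well:
print(separates(np.array([1.0, 0.0]), -8.0))   # "warm countries are rich"
print(separates(np.array([0.0, 1.0]), -2.5))   # "high-GDP countries are rich"
```

With only three instances, the data alone cannot distinguish the two hypotheses; CGL supplies the missing evidence through contexts.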
g is a weight vector "roughly" estimated from the contexts;
g is expected to be somewhat close to the ideal weight
Example: Learning with CGL 1/2
Contexts of l (usually derived from a Web corpus):
  For "temp.":
    "… The average temp. of the lobster-rich waters …"
    "… The effect of rich air/fuel ratios and temp. …"
    "… Culturally-rich country has moderate temp. …"
  For "GDP":
    "… GDP is a key factor for richness. …"
    "… Rich countries have high GDP. …"
    "… Rich regions, where GDP was above the EU-28 …"

[Figure: the contexts c1, c2 are used to predict g in the (Temp., GDP) plane,
alongside the labeled entities x1 (+1), x2 (+1), x3 (−1)]
CGL estimates w as w = g + v,
where the difference v is expected to be small
The contexts provide evidence supporting w = (0, 1),
meaning that "a high GDP indicates richness"
Example: Learning with CGL 2/2
[Figure: starting from the context-predicted g, the learned w = g + v defines
a decision boundary separating x1, x2 (+1) from x3 (−1) in the (Temp., GDP)
plane; the table of attributes and labels is the same as in the previous slide]
• Linear function f_k to rank entities in order k
  (we assume there are several orders to be learned):

  f_k(x_i) = Σ_{j=1}^{M} w_{k,j} x_{i,j}

  where x_{i,j} is the j-th attribute value of the i-th entity
  and w_{k,j} is the weight for the j-th attribute
• Weight model:

  w_{k,j} = uᵀ c_{k,j} + v_{k,j}

  where u is a weight vector for context vectors, c_{k,j} is the context
  vector for order k and the j-th attribute, and v_{k,j} is the weight
  that cannot be explained by contexts alone
Formalization
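The two formulas above can be sketched directly. A minimal example reusing the 3-dimensional context vectors and u = (0, 0.5, 1) from the Context Model slide (the residuals v_{k,j} are set to zero here for simplicity; the entity's attribute values are made up):

```python
import numpy as np

u   = np.array([0.0, 0.5, 1.0])      # shared weight vector for context vectors
C_k = np.array([[1.2, 0.0, 0.1],     # c_{k,1}: context vector of attribute 1
                [0.0, 2.2, 1.7]])    # c_{k,2}: context vector of attribute 2
v_k = np.array([0.0, 0.0])           # residual weights v_{k,j} (zero here)

def weights(u, C_k, v_k):
    # Weight model: w_{k,j} = u^T c_{k,j} + v_{k,j} for every attribute j
    return C_k @ u + v_k

def f_k(x, u, C_k, v_k):
    # Ranking function: f_k(x_i) = sum_j w_{k,j} x_{i,j}
    return float(weights(u, C_k, v_k) @ x)

x = np.array([3.0, 1.0])             # a (made-up) entity's attribute values
print(weights(u, C_k, v_k))          # per-attribute weights w_{k,j}
print(f_k(x, u, C_k, v_k))           # the entity's ranking score
```

Because u is shared across all orders while v_{k,j} is per order and attribute, contexts steer every order's weights while the residuals absorb what contexts cannot explain.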
Any model such as TF-IDF, doc2vec, or Sentence-BERT can be applied
to the contexts to generate context vectors
Context Model

Contexts of l (usually derived from a Web corpus):
  For "temp." (→ c_{1,1}):
    "… The average temp. of the lobster-rich waters …"
    "… The effect of rich air/fuel ratios and temp. …"
    "… Culturally-rich country has moderate temp. …"
  For "GDP" (→ c_{1,2}):
    "… GDP is a key factor for richness. …"
    "… Rich countries have high GDP. …"
    "… Rich regions, where GDP was above the EU-28 …"

Example: if c_{1,1} = (1.2, 0, 0.1), c_{1,2} = (0, 2.2, 1.7), and u = (0, 0.5, 1),
then uᵀc_{1,1} = 0.1 and uᵀc_{1,2} = 2.8;
u determines how to estimate the weight from the context vector
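One way such context vectors could be produced is TF-IDF over the pooled context sentences of each attribute. A from-scratch sketch (the tiny vocabulary, the toy "documents", and the smoothed-IDF variant are illustrative choices, not the paper's exact weighting scheme):

```python
import math
from collections import Counter

def tfidf(docs, vocab):
    """One TF-IDF vector per document over a fixed vocabulary (smoothed IDF)."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)                       # term frequencies for this doc
        vecs.append([tf[t] * idf[t] for t in vocab])
    return vecs

# One "document" per attribute: its context sentences, tokenized and pooled.
ctx_temp = "the average temp of the lobster rich waters".split()
ctx_gdp  = "rich countries have high gdp".split()
vocab = ["temp", "gdp", "rich"]

c_temp, c_gdp = tfidf([ctx_temp, ctx_gdp], vocab)
print(c_temp)   # nonzero on "temp", zero on "gdp"
print(c_gdp)    # nonzero on "gdp", zero on "temp"
```

Terms shared by many attributes' contexts (here "rich", the labeling criterion itself) get down-weighted by IDF, so the vectors emphasize what distinguishes each attribute.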
• Find the solution of this optimization problem:

  min_{u, v_k, ξ_{k,i}}  ‖u‖² + (c/K) Σ_{k=1}^{K} ‖v_k‖² + C Σ_{k=1}^{K} Σ_{i=1}^{N_k} ξ_{k,i}

  – subject, for k = 1, …, K and i = 1, …, N_k, to the constraints

    f_k(x_i^sup) − f_k(x_i^inf) ≥ 1 − ξ_{k,i}

  The first two terms are regularization terms similar to SVM, and the
  ξ_{k,i} are slack variables similar to SVM. Each constraint encodes that
  the rank of x_i^sup is higher than that of x_i^inf in the training data,
  similar to RankingSVM.
• Can be solved by SVM solvers with a special kernel
Learning of CGL
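A rough, self-contained analogue of this optimization replaces the slack variables with a hinge loss and minimizes it by subgradient descent (a simplified sketch on synthetic data, not the paper's SVM solver; all dimensions, constants, and the pair-generation step are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, D, N = 2, 3, 4, 30              # orders, attributes, context dims, entities
Cvec = rng.normal(size=(K, M, D))     # made-up context vectors c_{k,j}
X = rng.normal(size=(N, M))           # entity attribute values

# Ground-truth weights, used only to label training pairs (x_sup, x_inf).
w_true = rng.normal(size=(K, M))
pairs = [[(i, j) for i in range(N) for j in range(N)
          if X[i] @ w_true[k] > X[j] @ w_true[k] + 0.5][:50]
         for k in range(K)]

def weights(u, V):
    # w_{k,j} = u . c_{k,j} + v_{k,j}, computed for all k, j at once
    return Cvec @ u + V               # shape (K, M)

def grad(u, V, c_reg=0.1, C_hinge=1.0):
    # Subgradient of ||u||^2 + (c/K) sum_k ||v_k||^2 + C * sum hinge(margin)
    W = weights(u, V)
    gu, gV = 2.0 * u, (2.0 * c_reg / K) * V
    for k in range(K):
        for i, j in pairs[k]:
            if (X[i] - X[j]) @ W[k] < 1.0:      # hinge (slack) is active
                g_w = -C_hinge * (X[i] - X[j])  # gradient w.r.t. w_k
                gu += Cvec[k].T @ g_w           # chain rule through u
                gV[k] += g_w                    # chain rule through v_k
    return gu, gV

u, V = np.zeros(D), np.zeros((K, M))
for _ in range(300):
    gu, gV = grad(u, V)
    u -= 0.01 * gu
    V -= 0.01 * gV

W = weights(u, V)
acc = np.mean([(X[i] - X[j]) @ W[k] > 0 for k in range(K) for i, j in pairs[k]])
print(f"pairwise training accuracy: {acc:.2f}")
```

Minimizing the hinge on each (x_sup, x_inf) pair plays the role of the slack variables: a pair incurs loss exactly when its score margin falls below 1, mirroring the constraint f_k(x_i^sup) − f_k(x_i^inf) ≥ 1 − ξ_{k,i} with ξ_{k,i} ≥ 0.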
Experiments
Experiments were conducted with three entity classes:

                       | Cities          | Countries   | Cameras
# Entities             | 47              | 138         | 149
# Orders               | 64              | 40          | 54
# Attributes           | 137             | 83          | 16
Examples of Orders     | Attractiveness, | Livability, | Portability,
                       | Richness        | Safety      | Usability
Examples of Attributes | Population,     | # Visitors, | Resolution,
                       | Crime rate      | # Suicides  | Weight

Half of the ranked entities were used as training data, and we examined
whether the remaining entities could be ranked correctly.
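The evaluation protocol can be sketched as follows: split the ranked entities in half and measure pairwise accuracy on the held-out half (synthetic data; a scorer with the true weights is used only to exercise the metric, not as the learned model):

```python
import numpy as np

rng = np.random.default_rng(1)
entities = rng.normal(size=(20, 5))           # 20 entities, 5 attributes
true_w = np.array([1.0, -0.5, 0.0, 0.0, 0.2]) # hypothetical ranking criterion
order = np.argsort(-(entities @ true_w))      # ground-truth ranking (descending)

train, test = order[::2], order[1::2]         # half for training, half held out

def pairwise_accuracy(scores, ranking):
    """Fraction of held-out pairs ranked in the correct relative order."""
    ok = total = 0
    for a in range(len(ranking)):
        for b in range(a + 1, len(ranking)):
            hi, lo = ranking[a], ranking[b]   # hi is ranked above lo
            ok += scores[hi] > scores[lo]
            total += 1
    return ok / total

# A scorer that reproduces the true criterion gets every held-out pair right.
print(pairwise_accuracy(entities @ true_w, test))  # 1.0
```

A learned model would replace `entities @ true_w` with its own scores; accuracy below 1.0 then reflects exactly the pairs it orders incorrectly.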
• Baselines
  – RankNet
  – RankBoost
  – Linear-Feature
    (a linear feature-based model optimized by coordinate ascent)
  – LambdaMART
  – ListNet
• Proposed Methods
  – CGL (TF-IDF)
    • The TF-IDF weighting scheme was used as the context model
  – CGL (Distributed)
    • The paragraph vector was used as the context model
Comparative Methods
Context-guided Learning (CGL) worked well (+16%) for every class of entities;
there was no significant difference between the two context models
Experimental Results
[Bar chart: accuracy (0–0.9) of RankNet, RankBoost, Linear-Feature,
LambdaMART, ListNet, CGL (TF-IDF), and CGL (Distributed)
for City, Country, Camera, and Total]
Attractiveness of cities =
+ 0.035 (Avg. life expectancy of women) − 0.032 (# Traffic accidents)
− 0.031 (Population / # Households)
Popularity of countries =
+ 0.058 (Happiness) − 0.057 (# Refugees) − 0.045 (# Suicides)
Peacefulness of countries =
+ 0.170 (Grain harvest) + 0.166 (GDP growth rate) − 0.126 (# Suicides)
Usability of cameras =
− 0.240 (Weight) − 0.213 (Height) + 0.133 (Max. shutter speed)
Real Examples from Experiments
User Study
• Evaluated the learned models by crowdsourcing
  – "If you agree that there is a correlation between <labeling criterion>
    and <attribute>, please assign a score of +2. If you disagree, please
    assign a score of −2. If you can neither agree nor disagree, please
    assign a score of 0."
• Compared CGL and Linear-Feature
• CGL was slightly better
1. Introduced the problem of learning to rank entities
   using attributes as features
2. Proposed Context-guided Learning (CGL)
3. Conducted experiments with a wide variety of orders,
   and demonstrated the effectiveness of CGL
   in the task of learning to rank entities
Summary
Questions are welcome at
https://www.mpkato.net/