CUbRIK research at SIGMOD 2012

+

Top-k bounded diversification
Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi
Politecnico di Milano, Italy

Scottsdale, AZ, USA - May 24, 2012

+ 1

Motivation

 Diversification is useful in application domains where objects
can be described by
 a score
 a 2- or 3-dimensional feature vector

 Many examples from search (real estate, image search, …)
 Apartments distributed over a map
 Score (e.g., price) + 2D feature vector (geo-localization)
 Evolution over time of apartment prices over a map
 Score (e.g., price) + 3D feature vector (geo-localization + time)
 Properties of images (e.g., HSI color features)
 Score (e.g., relevance to a given keyword) + 3D feature vector
(e.g., average HSI components in the image)

+ 2

Diversified result set
Looking for good restaurants in Milan

+ 3


top 15

+ 4


top 15

top 15, diversified over the region

+ 5

Diversification

 We are given a set O of N objects
 Each object o has a vector-space representation
 Each object o has a relevance score

 Diversification problem

+ 6

Diversification

 We are given a set O of N objects
 Each object o has a vector-space representation
 Each object o has a relevance score

 Diversification problem: find the best diversified set of K objects within the set of objects
 The objective function combines relevance to the query (as score) and diversity (as distance)

+ 7

Greedy approach to diversification
MMR (Maximum Marginal Relevance)

 Diversification problems are NP-hard

 Approximate greedy algorithms are needed

 MMR is a well-known greedy algorithm with good quality of
result (i.e., value of the objective function)
 Find K objects that are both relevant and diverse
 At each step, pick the object with largest diversity-weighted score
 K steps in total
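
The greedy steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the diversity-weighted score of an object is (1 − λ) · score + λ · (minimum distance from the objects selected so far), with the diversity term taken as 0 while the selection is empty. The function name `mmr`, the `(score, point)` tuple layout, and the sample data are hypothetical.

```python
from math import dist

def mmr(objects, k, lam):
    """Greedy MMR sketch. objects: list of (score, point) tuples."""
    remaining, selected = list(objects), []

    def dws(obj):
        # Assumed diversity-weighted score: relevance plus distance
        # from the closest already-selected object (0 if none yet).
        score, p = obj
        div = min((dist(p, q) for _, q in selected), default=0.0)
        return (1 - lam) * score + lam * div

    # K steps in total: each step picks the best remaining object.
    while remaining and len(selected) < k:
        best = max(remaining, key=dws)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: (score, (x, y)) in the unit square.
objs = [(1.0, (0.0, 0.0)), (0.9, (0.1, 0.0)), (0.8, (1.0, 0.9)),
        (0.2, (0.5, 0.5)), (0.1, (0.0, 1.0))]
top3 = mmr(objs, k=3, lam=0.3)
```

Note that every call to `max` scans all remaining objects, which is why all objects must be available up front.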

+ 8




Relevance
Diversity
Balance between relevance and diversity
Diversity-weighted score

+ 9





 Corresponding objective function:

+ 10





 Main disadvantage:
 All objects must be available from the beginning

+ 11

Bounded diversification

 Objects are embedded in a bounded region of space
 E.g., a bounding rectangle

 Accessing objects is costly
 Objects are progressively accessed (not available at time 0)
 The number of accessed objects (sumDepths) should be
minimized

 Indexes for sorted access to objects are available
 Access by score (in descending order)
 Access by distance from a given point (in ascending order)
 Both are very common in services on the Web (e.g., apartments
search)
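
The two sorted access modes can be simulated in memory as below. This is only a sketch: a real service would expose them through an index or a Web API, and the generator names and the `(score, point)` layout are assumptions made for illustration.

```python
from math import dist

def access_by_score(objects):
    """Yield (score, point) objects one at a time, highest score first."""
    yield from sorted(objects, key=lambda o: o[0], reverse=True)

def access_by_distance(objects, q):
    """Yield objects one at a time, closest to the point q first."""
    yield from sorted(objects, key=lambda o: dist(o[1], q))

# Hypothetical data: (score, (x, y)).
restaurants = [(0.9, (0.1, 0.0)), (1.0, (0.0, 0.0)), (0.2, (0.5, 0.5))]
best = next(access_by_score(restaurants))
nearest = next(access_by_distance(restaurants, (0.6, 0.6)))
```

Each `next` call corresponds to one accessed object, which is the cost that sumDepths counts.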

+ 12

Distance-based access
Restaurants by distance from a given point q


Size of icon proportional to score

+ 13

Score-based access
Restaurants by score


Size of icon proportional to score

+ 14

Attacking bounded diversification
The Pull-Bound MMR (PBMMR) template

 Goal: achieve the same quality of result as MMR
 But minimizing the number of accessed objects

 K iterations; within each, repeat the following steps until an object is selected
 Pulling strategy: choose an access method (by score or by distance)
 If by distance, choose the point to probe from (probing location)
 Bounding scheme: compute an upper bound on the diversity-weighted score that can be achieved by unseen objects
 If a seen object exceeds the bound, select it and move to the next iteration

Credits to [Schnaitter&Polyzotis 2008] for their Pull-Bound Rank Join template
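
A deliberately simplified sketch of the pull/bound loop follows. It is not PBMMR itself: it assumes score-based access only, the same assumed diversity-weighted score (1 − λ) · score + λ · (minimum distance from the selection), and a loose bound that replaces the diversity term with the diameter of the bounding region rather than the tight bound discussed later. All names and data are hypothetical, and 0 ≤ λ < 1 is assumed.

```python
from math import dist, inf

def pull_bound_by_score(objects, k, lam, diameter):
    """objects: (score, point) tuples, already sorted by score descending."""
    stream = iter(objects)          # simulated sorted access by score
    seen, selected = [], []
    s_last, accessed, exhausted = inf, 0, False

    def dws(obj):
        score, p = obj
        div = min((dist(p, q) for _, q in selected), default=0.0)
        return (1 - lam) * score + lam * div

    while len(selected) < k:
        best = max(seen, key=dws, default=None)
        # Loose upper bound on the score of any unseen object:
        # its score is at most s_last, its diversity at most `diameter`.
        div_cap = lam * diameter if selected else 0.0
        bound = -inf if exhausted else (1 - lam) * s_last + div_cap
        if best is not None and dws(best) >= bound:
            selected.append(best)   # no unseen object can beat it
            seen.remove(best)
            continue
        nxt = next(stream, None)    # pull one more object
        if nxt is None:
            exhausted = True
            if best is None:
                break               # nothing left to select
        else:
            seen.append(nxt)
            s_last = nxt[0]
            accessed += 1
    return selected, accessed

objs = [(1.0, (0.0, 0.0)), (0.9, (0.1, 0.0)), (0.8, (1.0, 0.9)),
        (0.2, (0.5, 0.5)), (0.1, (0.0, 1.0)), (0.05, (1.0, 0.0)),
        (0.02, (0.5, 0.0))]
picked, accessed = pull_bound_by_score(objs, k=3, lam=0.3, diameter=2 ** 0.5)
```

On this sample the loop selects three objects after accessing four of the seven. A tighter bound and distance-based probing are what let the full template stop earlier.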

+ 15

Choosing probing locations

 Goal of distance-based access:
 Exploring the region of space in which the object with the best
diversity-weighted score is most likely to be found

 At each of the K iterations, we fix the probing locations at the
most promising points of the unexplored space
 Vertices of the bounded Voronoi diagram of the points selected at
the previous iterations

 Of these, the most promising ones are as far as possible from
all the objects of the current selection
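
For a rectangular bounding region in 2D, the candidate probing locations can be enumerated by brute force, as sketched below. This is an illustration under stated assumptions, not the authors' procedure: it tests corners of the rectangle, bisector/edge intersections, and circumcenters, keeping those points where no other site is closer. The function name and tolerances are hypothetical, and the cost is cubic in the number of selected objects.

```python
from itertools import combinations
from math import dist

def bounded_voronoi_vertices(sites, xmin, ymin, xmax, ymax, eps=1e-9):
    """Vertices of the Voronoi diagram of `sites` clipped to a rectangle."""
    def is_vertex(p, owners):
        inside = (xmin - eps <= p[0] <= xmax + eps and
                  ymin - eps <= p[1] <= ymax + eps)
        d = dist(p, owners[0])
        # No site may be strictly closer than the sites defining p.
        return inside and all(dist(p, s) >= d - eps for s in sites)

    found = []
    def add(p):
        if all(dist(p, q) > 1e-7 for q in found):  # skip duplicates
            found.append(p)

    # 1) Corners of the bounding rectangle.
    for c in [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]:
        add(c)
    # 2) Bisector of two sites crossing a rectangle edge.
    #    Bisector line: A*x + B*y = C.
    for a, b in combinations(sites, 2):
        A, B = 2 * (b[0] - a[0]), 2 * (b[1] - a[1])
        C = b[0] ** 2 + b[1] ** 2 - a[0] ** 2 - a[1] ** 2
        cands = []
        if abs(B) > eps:
            cands += [(x, (C - A * x) / B) for x in (xmin, xmax)]
        if abs(A) > eps:
            cands += [((C - B * y) / A, y) for y in (ymin, ymax)]
        for p in cands:
            if is_vertex(p, (a, b)):
                add(p)
    # 3) Circumcenters of three sites.
    for a, b, c in combinations(sites, 3):
        det = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1])
                   + c[0] * (a[1] - b[1]))
        if abs(det) < eps:
            continue  # collinear sites have no circumcenter
        sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
        ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / det
        uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / det
        if is_vertex((ux, uy), (a, b, c)):
            add((ux, uy))
    return found

square = (0.0, 0.0, 1.0, 1.0)
sites = [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]
vertices = bounded_voronoi_vertices(sites, *square)
# Most promising vertex: as far as possible from all selected objects.
farthest = max(vertices, key=lambda v: min(dist(v, s) for s in sites))
```

With one, two, and three selected objects in a square, this yields 4, 6, and 8 vertices respectively.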

+ 16

Example
Voronoi diagram of selected objects

 4 objects x1, …, x4 selected during the first 4 iterations

 Bounding region is a square

+ 17

Example

 4 objects x1, …, x4 selected during the first 4 iterations

 Bounding region is a square
Probing
locations

+ 18

Example

 A new object is selected

+ 19

Example
Bounded Voronoi diagram of selected objects

 Probing locations: v1, …, v4 (vertices of the bounding region)

 Shading: distance from closest points (brightest in vertices)

+ 20

Example

 Probing locations: v1, …, v6 (vertices of bounded Voronoi diagram)


 The local maxima of the function “distance from the closest point
between x1 and x2” are among v1, …, v6

+ 21

Example

 Probing locations: v1, …, v8


 The local maxima of the function “distance from the closest
point among x1, …, x3” are among v1, …, v8

+ 22

Example

 Probing locations: v1, …, v10



+ 23

Example

 Probing locations: v1, …, v12 (no other intersection in region)



+ 24

Example
A running state

 Inside red circumferences: explored region

 Pink discs: objects retrieved by distance-based access

+ 29

Bounding scheme
Computing a tight upper bound

 The unseen objects retrievable with the next distance-based access belong to the set Z, which leaves out the explored hyperspheres centered at the probing locations
 A bound is tight if it can be achieved in some hypothetical continuation of the instance explored
 A tight upper bound can be found as follows:

$$\tau = (1 - \lambda)\, S_q^{\mathrm{last}} + \lambda \max_{x \in Z} \min_{y \in X} \lVert x - y \rVert \qquad (11)$$

 Theorem 5.1 provides an effective computation procedure for (11): the point x* ∈ Z that maximizes the minimum distance from all the points in X is a vertex of the convex hull of the unexplored part of P_i, where P_i is one of the cells of Vor(X, U)

+ 30

Bounding scheme
Computing a tight upper bound

$$\tau = (1 - \lambda)\, S_q^{\mathrm{last}} + \lambda \max_{x \in Z} \min_{y \in X} \lVert x - y \rVert \qquad (11)$$

 S_q^last: highest score possible (last seen by score-based access)
 max min ‖x − y‖: maximal minimal distance from the selected objects
 Z: unexplored region of space
 X: set of selected objects

+ 31

Bounding scheme
Computing a tight upper bound

$$\tau = (1 - \lambda)\, S_q^{\mathrm{last}} + \lambda \max_{x \in Z} \min_{y \in X} \lVert x - y \rVert \qquad (11)$$

 Theorem: the point x* that maximizes the minimal distance from all the selected objects is a vertex of the convex hull of the unexplored part of a cell of the bounded Voronoi diagram
 Theorem: the bound obtained in this way is tight
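
Given a finite set of candidate points, such as the vertices named by the theorem, evaluating bound (11) reduces to a max-min computation. The sketch below assumes the caller has already restricted the candidates to the unexplored region; the function and parameter names are hypothetical.

```python
from math import dist

def upper_bound(lam, s_last, candidates, selected):
    """tau = (1 - lam) * s_last + lam * max_x min_y ||x - y||.

    candidates: finite set of points x assumed to lie in the unexplored region.
    selected:   points y of the objects selected so far (non-empty).
    """
    max_min = max(min(dist(x, y) for y in selected) for x in candidates)
    return (1 - lam) * s_last + lam * max_min

# Hypothetical numbers: one selected object at the origin, two candidates.
tau = upper_bound(0.5, 0.8, [(3.0, 4.0), (1.0, 0.0)], [(0.0, 0.0)])
```

Here the farthest candidate is at distance 5, so tau is 0.5 · 0.8 + 0.5 · 5 = 2.9.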

+ 32

Selecting the next probing location

 In 2D, the point maximizing the
minimal distance can only be
 A vertex of the bounded
Voronoi diagram
 An intersection between an
edge and a circumference
 An intersection between two
circumferences

 The corresponding vertex is
selected as the next probing
location
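
Of the three candidate types listed above, the intersection between two circumferences can be computed with standard geometry, as in the sketch below. It assumes each explored region is a disc given by a center (a probing location) and a radius; the function name is hypothetical.

```python
from math import hypot, sqrt

def circle_intersections(c0, r0, c1, r1):
    """Intersection points of two circles, or [] if they do not cross."""
    dx, dy = c1[0] - c0[0], c1[1] - c0[1]
    d = hypot(dx, dy)
    # Concentric, too far apart, or one circle strictly inside the other.
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    a = (r0 * r0 - r1 * r1 + d * d) / (2 * d)  # distance from c0 to the chord
    h = sqrt(max(r0 * r0 - a * a, 0.0))        # half-length of the chord
    mx, my = c0[0] + a * dx / d, c0[1] + a * dy / d
    # Tangent circles (h == 0) return the same point twice.
    return [(mx + h * dy / d, my - h * dx / d),
            (mx - h * dy / d, my + h * dx / d)]

points = circle_intersections((0.0, 0.0), 5.0, (6.0, 0.0), 5.0)
```

For two circles of radius 5 centered at (0, 0) and (6, 0), the intersections are (3, −4) and (3, 4).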

+ 33


Point maximizing the minimal distance
Vertex selected as next probing location


+ 35

Pulling strategy

 Round robin: select, in alternation, each probing location
 Some loose form of instance optimality can already be achieved
with a tight bounding scheme and round robin

 Potential adaptive:
 Choose the probing location that is most likely to reduce the
upper bound
 Potential adaptive is never worse than round robin
 Choice between access by score and access by distance
 Based on how much each reduces the upper bound with respect to the number of accessed objects

+ 36

Batched access

 In the model so far, objects are accessed one by one
 Not practical for many scenarios
 “Batched access” modes available in many practical systems:
 Give a point and a radius, and receive all objects that fall within that radius

 Strategy with batched access:
 Perform exactly one request per probing location with an optimal
choice of the radius
 This amounts to solving an optimization problem that
 Minimizes the threshold by appropriately choosing the radii
 Is subject to a budget constraint (how many objects one is willing to retrieve)

+ 37

Experiments
Synthetic data, uniform distribution

+ 38

Experiments
Synthetic data, exponential distribution

+ 39

Experiments
Real data

+ 40

Conclusion

 Diversification revisited
 Sorted access modes to avoid accessing all objects
 Same quality as MMR
 A structured template with bounding scheme and pulling strategy

 Optimality guarantees with one-by-one access to objects
 Tight bound
 Instance optimality (in a loose sense)

 Extreme practical efficiency with batched access mode

 Future work:
 Adaptation to other diversification algorithms

+ 41

Acknowledgments:
CUbRIK Project
 CUbRIK is a research project
financed by the European Union

 Goals:
 Advance the architecture of
multimedia search
 Exploit the human
contribution in multimedia
search
 Use open-source components
provided by the community
 Start up a search business
ecosystem

 http://www.cubrikproject.eu/
