1 KYOTO UNIVERSITY
Fast Unbalanced Optimal Transport on a Tree
Ryoma Sato
Kyoto University / RIKEN AIP
2 / 13 KYOTO UNIVERSITY
Self Introduction

I am a second-year master's student at
Kyoto University

I’m interested in algorithmic aspects of machine
learning and data mining for structured data, including
Graph neural networks:

Ryoma Sato, Makoto Yamada, Hisashi Kashima. Approximation Ratios of
Graph Neural Networks for Combinatorial Problems. NeurIPS 2019.

Ryoma Sato, Makoto Yamada, Hisashi Kashima. Random Features
Strengthen Graph Neural Networks. SDM 2021.
Optimal transport:

Ryoma Sato, Makoto Yamada, Hisashi Kashima. Fast Unbalanced Optimal
Transport on a Tree. NeurIPS 2020. (Today's topic)
3 / 13 KYOTO UNIVERSITY
Background: optimal transport is useful

The optimal transport (OT) distance measures
the discrepancy between two distributions.
We consider discrete distributions in this presentation.

The OT distance is the minimum total distance
that the masses must travel to transform one
distribution into the other

In generative modeling, a mass is a sample.
OT measures the discrepancy between the model distribution and the sample distribution

In text classification, a mass is a word.
OT does not require the two distributions to share the same support, unlike the KL divergence
OT exploits the underlying geometry of the space
Figure source: From Word Embeddings To
Document Distances, ICML 2015
4 / 13 KYOTO UNIVERSITY
Background: sliced OT is computationally cheap

OT is formulated as a linear program → cubic cost
C: distance matrix (input), P: matching matrix (variable)
a: 1st mass vector (input), b: 2nd mass vector (input)

Sliced OT projects distributions to random
1D spaces and computes OT there

Greedy matching solves 1D OT exactly → linear cost
The leftmost mass should be matched to
the leftmost mass
The second leftmost mass should be matched
to the second leftmost mass ...
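The leftmost-to-leftmost rule can be sketched as a two-pointer sweep over sorted points. This is a minimal illustration, not code from the paper: the function name `ot_1d` and the example values are hypothetical, masses are assumed to be integers with equal totals, and the sort adds an O(n log n) step before the linear sweep.

```python
def ot_1d(xs, a, ys, b):
    """Greedy 1D OT: xs/ys are positions, a/b are integer masses with equal totals."""
    src = sorted(zip(xs, a))  # source points, left to right
    dst = sorted(zip(ys, b))  # target points, left to right
    i = j = 0
    left_a, left_b = src[0][1], dst[0][1]  # mass still unmatched at the current points
    cost = 0
    while True:
        # Match as much mass as possible between the two current leftmost points.
        moved = min(left_a, left_b)
        cost += moved * abs(src[i][0] - dst[j][0])
        left_a -= moved
        left_b -= moved
        if left_a == 0:  # current source point is exhausted
            i += 1
            if i == len(src):
                break
            left_a = src[i][1]
        if left_b == 0:  # current target point is exhausted
            j += 1
            if j == len(dst):
                break
            left_b = dst[j][1]
    return cost


print(ot_1d([0, 2], [1, 1], [1, 3], [1, 1]))  # 0 -> 1 and 2 -> 3
```

Sliced OT would call such a routine once per random projection and average the results.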
https://www.programmersought.com/article/67174999352/
https://analyticsindiamag.com/how-to-establish-domain-transferability-in-neural-models/
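The linear program referred to on this slide can be written as follows. The symbols are assumptions chosen for illustration (the slide's own formula did not survive extraction): $C$ is the distance matrix, $P$ the matching matrix, and $a$, $b$ the two mass vectors.

```latex
\min_{P \ge 0} \; \sum_{i,j} C_{ij} P_{ij}
\quad \text{s.t.} \quad
\sum_{j} P_{ij} = a_i \;\; \forall i,
\qquad
\sum_{i} P_{ij} = b_j \;\; \forall j .
```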
5 / 13 KYOTO UNIVERSITY
Background: unbalanced OT is robust

OT is sensitive to outliers because the cost of
transporting outliers dominates the total cost

Unbalanced OT (UOT) allows masses to be discarded and
created at the cost of some penalties
→ We can discard outliers → robust to outliers

UOT is also formulated as a linear program
→ cubic cost
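One common way to write such a linear program is sketched below. It is an assumption for illustration, not the exact formulation from the slides: it uses a single penalty $\lambda$ for both discarding and creating mass, with $C$ the distance matrix, $P$ the matching matrix, and $a$, $b$ the two mass vectors.

```latex
\min_{P \ge 0} \;
\sum_{i,j} C_{ij} P_{ij}
+ \lambda \sum_{i} \Bigl( a_i - \sum_{j} P_{ij} \Bigr)
+ \lambda \sum_{j} \Bigl( b_j - \sum_{i} P_{ij} \Bigr)
\quad \text{s.t.} \quad
\sum_{j} P_{ij} \le a_i \;\; \forall i,
\qquad
\sum_{i} P_{ij} \le b_j \;\; \forall j .
```

The second term charges for source mass that is discarded and the third for target mass that has to be created.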
6 / 13 KYOTO UNIVERSITY
Background: UOT is difficult even in 1D spaces

We want a cheap alternative to UOT, as 1D OT is for OT

But greedy matching fails to solve 1D UOT

Let’s consider the following instance with discard cost λ

The following plan costs 3λ.

The following plan costs 2λ + 2ε. Thus this is better.
7 / 13 KYOTO UNIVERSITY
Background: UOT is difficult even in 1D spaces

Let’s consider the following instance with discard cost λ

The following plan costs λ + 2ε.

The following plan costs 2λ + 2ε. Thus this is worse.
8 / 13 KYOTO UNIVERSITY
Background: UOT is difficult even in 1D spaces

Although these two instances share the leftmost part,
the leftmost mass in the first instance should be
discarded while that in the second instance should not

The optimal UOT plan cannot be determined locally,
whereas the optimal OT plan is determined locally

Thus the greedy algorithm fails to solve 1D UOT

We proposed a method that solves 1D UOT efficiently
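The non-locality can be checked on a small example with a naive exact solver. The sketch below is not the proposed algorithm: it is a quadratic-time, edit-distance-style dynamic program for unit masses, assuming a cost of `lam` for each discarded source mass and each created target mass. The function name `uot_1d_unit` and the two instances are hypothetical and differ from the instances drawn on the slides.

```python
def uot_1d_unit(xs, ys, lam):
    """Exact 1D UOT for unit masses by a naive O(n*m) DP (illustration only)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    # cost[i][j]: cheapest way to handle the first i sources and first j targets.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * lam  # discard every source seen so far
    for j in range(1, m + 1):
        cost[0][j] = j * lam  # create every target seen so far
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + lam,  # discard source i
                cost[i][j - 1] + lam,  # create target j
                cost[i - 1][j - 1] + abs(xs[i - 1] - ys[j - 1]),  # match them
            )
    # Backtrack to recover which pairs are matched.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + abs(xs[i - 1] - ys[j - 1]):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + lam:
            i -= 1
        else:
            j -= 1
    return cost[n][m], pairs[::-1]


# Both instances share the leftmost part: a source at 0 and a target at 18.
print(uot_1d_unit([0], [18], 10))      # the source at 0 is matched
print(uot_1d_unit([0, 20], [18], 10))  # the source at 0 is discarded
```

Adding one source to the right changes the decision for the leftmost source, which is why a left-to-right greedy rule cannot commit early.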
9 / 13 KYOTO UNIVERSITY
Algorithm: prune redundant plans

Our proposed method determines assignments from
left to right (like the greedy algorithm)

Although there are exponentially many plans, most of
them are redundant.
We proved that only O(n) plans are non-redundant
In standard OT, only one plan is non-redundant (thus greedy is valid)
Figure legend: not yet, non-redundant, redundant
10 / 13 KYOTO UNIVERSITY
Algorithm: we solve 1D UOT in O(n log² n) time

A naive algorithm requires cubic time even with this
(non-redundant plan) observation

More algorithmic techniques are required for further
speedup (skipped in this presentation)

Dynamic programming

Fast convex min-sum convolution

Efficient data structure (BBST)

Weighted union heuristics

Finally, we derived a quasi-linear time algorithm
which runs in O(n log² n) time in the worst case
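One of the listed ingredients, min-sum (min-plus) convolution of convex sequences, has a well-known linear-time form: merge the two sequences of successive differences in sorted order. The sketch below illustrates that generic technique only. The function name `convex_min_plus` is hypothetical, and this is not claimed to be the exact routine used in the paper.

```python
def convex_min_plus(a, b):
    """Min-plus convolution c[k] = min over i + j = k of a[i] + b[j].

    Assumes a and b are non-empty convex sequences, i.e. their successive
    differences are non-decreasing. Runs in O(len(a) + len(b)).
    """
    da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    db = [b[j + 1] - b[j] for j in range(len(b) - 1)]
    out = [a[0] + b[0]]
    i = j = 0
    # Merge the two sorted difference lists, always taking the smaller slope.
    while i < len(da) or j < len(db):
        if j == len(db) or (i < len(da) and da[i] <= db[j]):
            out.append(out[-1] + da[i])
            i += 1
        else:
            out.append(out[-1] + db[j])
            j += 1
    return out


print(convex_min_plus([0, 1, 3, 6], [0, 2, 5]))  # [0, 1, 3, 5, 8, 11]
```

A direct evaluation of the definition would take quadratic time, so this merge is what makes the convex case cheap.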
11 / 13 KYOTO UNIVERSITY
Algorithm: tree UOT generalizes 1D UOT

Our method can be extended to tree spaces
A 1D space (path) is a special case of tree spaces

In text classification, the word
space can be represented by a
word tree. Each mass (word)
travels on the word tree to a
nearby (semantically similar) word.

We can “tree-slice” high-dimensional
spaces instead of 1D-slicing them,
which captures richer structures
http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/117-hcpc-hierarchical-clustering-on-principal-components-essentials/
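To make "traveling on a tree" concrete, the sketch below computes the balanced (not unbalanced) OT distance on a tree. It uses the standard closed form: each edge contributes its length times the absolute net mass that must cross it. This is an illustration under stated assumptions, not the UOT algorithm of the talk. The name `tree_ot` and the array layout are hypothetical: node 0 is the root, `parent[v] < v` for every other node, and `weight[v]` is the length of the edge between `v` and `parent[v]`.

```python
def tree_ot(parent, weight, a, b):
    """Balanced OT on a tree; a and b are per-node masses with equal totals."""
    n = len(parent)
    # diff[v] becomes (mass of a) - (mass of b) inside the subtree rooted at v.
    diff = [a[v] - b[v] for v in range(n)]
    total = 0
    # Visit children before parents; valid because parent[v] < v.
    for v in range(n - 1, 0, -1):
        total += weight[v] * abs(diff[v])  # net mass crossing edge (v, parent[v])
        diff[parent[v]] += diff[v]
    return total


# Path 0 - 1 - 2 with edge lengths 1 and 2: moving one unit from node 0 to node 2.
print(tree_ot([-1, 0, 1], [0, 1, 2], [1, 0, 0], [0, 0, 1]))  # 3
```

A path is the special case where every node has at most one child, which recovers 1D OT.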
12 / 13 KYOTO UNIVERSITY
Experiments: our algorithm is empirically fast

We confirmed that our algorithm can compute tree
UOT with one million masses within one second

We also confirmed that tree-slicing high-dimensional
spaces can approximate the original UOT problem
13 / 13 KYOTO UNIVERSITY
Conclusion: fast computation of tree UOT

Sliced OT is a fast alternative to OT

UOT is a robust variant of OT

1D UOT is more difficult than 1D OT

We proposed the first efficient algorithm for 1D UOT

Our method can be extended to tree spaces

Our method is empirically fast (1M masses in 1 sec)
