2. Overview
① Embedding Symbolic and Hierarchical Data
② Introduction to Hyperbolic Space
③ Optimization over Hyperbolic Space
④ Toy Experiments
4. Symbolic and Hierarchical Data
Symbolic data with implicit hierarchy.
Downstream tasks: link prediction, node classification, community detection, visualization.
[Figure: example datasets, the WordNet hierarchy and a Twitter social graph, annotated with a candidate link ("link?") and a community]
5. Good Hierarchical Embedding
For downstream tasks, symbolic and hierarchical data needs to be embedded into some space.
Good Embedding?
Embeddings of similar symbols should aggregate in some sense.
Symbolic arithmetic should hold, e.g. v(king) − v(man) + v(woman) ≈ v(queen) (a small sketch follows this list).
Hierarchy can be restored from embedded data.
The space should have low dimension.
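As a toy illustration of this arithmetic, here is a minimal sketch; the 2-D vectors are made up purely for this example and are not taken from any trained embedding:

```python
import numpy as np

# Purely hypothetical 2-D embeddings, chosen only to illustrate the arithmetic.
emb = {
    "king":  np.array([0.9, 0.8]),
    "man":   np.array([0.5, 0.1]),
    "woman": np.array([0.4, 0.3]),
    "queen": np.array([0.8, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "Symbolic arithmetic": king - man + woman should land near queen.
candidate = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(candidate, emb[w]))
print(best)  # with these made-up vectors, the nearest word is "queen"
```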
7. Limitation of Euclidean Embedding
Goal: embed the graph structure while preserving distances.
Thm.) Trees cannot be embedded into Euclidean space with arbitrarily low distortion, for any number of dimensions.
Distances between nodes a, b, c, d (a numerical sketch follows this slide):
            Graph   Euclidean   ??
  D(a,b)    2       0.1         1.889
  D(a,c)    2       1           1.902
  D(a,d)    2       1.8         1.962
[Figure: the graph with nodes a, b, c, d, its Euclidean embedding, and its embedding into the mysterious space "??"]
Representation Tradeoffs for Hyperbolic Embeddings (ICML 2018)
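As a hedged numerical illustration of why such distortion is unavoidable, here is a small sketch; the star tree (a hub joined to leaves a, b, c, d) and its circular embedding are my own assumed example, not the exact graph or embedding from the slide:

```python
import numpy as np
from itertools import combinations

# A hub connected to leaves a, b, c, d: every leaf-to-leaf graph distance is 2.
leaves = ["a", "b", "c", "d"]

# A 2-D Euclidean embedding that keeps every hub-to-leaf edge at length 1:
# hub at the origin, leaves spread evenly on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, num=len(leaves), endpoint=False)
emb = {name: np.array([np.cos(t), np.sin(t)]) for name, t in zip(leaves, angles)}

for u, v in combinations(leaves, 2):
    d_euclid = np.linalg.norm(emb[u] - emb[v])
    print(f"D({u},{v}): graph = 2, Euclidean = {d_euclid:.3f}")
# Adjacent leaves end up only ~1.414 apart instead of 2, and adding more leaves
# makes the crowding (and hence the distortion) worse in any fixed dimension.
```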
12. Suggested loss function
An example of a loss function over hyperbolic space.
Fundamentally, the gradient of the loss tells which direction the points should move. (A sketch of such a loss follows this slide.)
Poincaré Embeddings for Learning Hierarchical Representations (NIPS 2017)
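The exact loss on the slide is not reproduced in this export. As a hedged sketch in the spirit of the cited paper, a softmax over negative samples built from hyperbolic (Poincaré) distances could look like this; the sample points and helper names are my own:

```python
import numpy as np

def poincare_dist(u, v):
    """Hyperbolic distance between two points inside the open unit (Poincare) ball."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def pair_loss(u, v, negatives):
    """Softmax-style loss over hyperbolic distances: pull the related pair (u, v)
    together and push the sampled negatives away from u."""
    pos = np.exp(-poincare_dist(u, v))
    neg = sum(np.exp(-poincare_dist(u, n)) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

# Hypothetical points inside the unit disk.
u = np.array([0.10, 0.20])
v = np.array([0.15, 0.25])
negatives = [np.array([-0.70, 0.60]), np.array([0.80, -0.10])]
print(pair_loss(u, v, negatives))
```

Minimizing a loss of this shape aggregates related points and disperses unrelated ones, which matches the behaviour described for the slide's loss.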
13. Gradient Descent Algorithm
Input: f : L² → ℝ,  p₀ ∈ L²,  k = 0
repeat
    choose a descent direction v_k ∈ T_{p_k} L²
    choose a retraction R_{p_k} : T_{p_k} L² → L²
    choose a step length α_k ∈ ℝ
    set p_{k+1} = R_{p_k}(α_k v_k)
    k ← k + 1
until p_{k+1} sufficiently minimizes f
Nothing differs from the usual gradient descent except for two ingredients:
the gradient direction
the retraction
(A Python sketch of this loop follows this slide.)
Optimization methods on Riemannian manifolds and their application to shape space (SIAM 2012)
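As a minimal, generic sketch of this loop (the function names are my own; concrete choices of the descent direction and retraction for the Lorentz model appear on the following slides):

```python
def riemannian_gradient_descent(loss_grad, descent_direction, retraction,
                                p0, step=0.1, iters=100):
    """The loop from the slide, written generically.
    loss_grad(p): plain (ambient) gradient of the loss at p, e.g. from autodiff;
    descent_direction(p, g): turns it into a descent direction in the tangent space at p;
    retraction(p, v): maps the tangent step back onto the manifold."""
    p = p0
    for _ in range(iters):
        g = loss_grad(p)              # gradient of f at the current point
        v = descent_direction(p, g)   # choose a descent direction v_k
        p = retraction(p, step * v)   # set p_{k+1} = R_{p_k}(alpha_k * v_k)
    # a fixed iteration count stands in for "until f is sufficiently minimized"
    return p
```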
14. Gradient Descent Algorithm
(Riemannian gradient descent loop, as on slide 13.)
What is the gradient on Hyperbolic space?
For f : (ℝ³, −dx₀² + dx₁² + dx₂²) → ℝ,
∇f = ?
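Anticipating the recipe spelled out later in the speaker notes (flip the sign of the time-like coordinate, then project onto the tangent space of the hyperboloid), a minimal sketch might look as follows; the helper names are my own, and the sign conventions should be checked against the slides:

```python
import numpy as np

def minkowski_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u0*v0 + u1*v1 + ... + un*vn."""
    return float(-u[0] * v[0] + np.dot(u[1:], v[1:]))

def riemannian_gradient(p, plain_grad):
    """Riemannian gradient of f at p on the hyperboloid (Lorentz) model.
    Step 1: apply the inverse metric diag(-1, 1, ..., 1), i.e. flip the sign
            of the first component of the plain gradient.
    Step 2: project onto the tangent space T_p L = {v : <p, v>_L = 0}."""
    h = np.array(plain_grad, dtype=float)
    h[0] = -h[0]
    return h + minkowski_inner(p, h) * p
```

The negative of this vector then serves as the descent direction in the loop above.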
16. Gradient Descent Algorithm
(Riemannian gradient descent loop, as on slide 13.)
What is the retraction on Hyperbolic space?
17. Hyperboloid model
Retraction tells how the end point of a tangent vector corresponds to a point on the manifold.
We choose the (affine) geodesic as the retraction: at p ∈ L² with direction v ∈ T_p L²,
  γ(t) = cosh(‖v‖_L t) p + sinh(‖v‖_L t) v / ‖v‖_L
The tip of the tangent vector itself, p′, need not lie on L², but the retracted point R(p′) ∈ L². (A sketch follows this slide.)
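A minimal sketch of this geodesic retraction, assuming the 2-D Lorentz model sits in ℝ³ with the Minkowski inner product; the example point and tangent vector in the usage lines are my own choices:

```python
import numpy as np

def minkowski_inner(u, v):
    """Lorentzian inner product used by the hyperboloid model."""
    return float(-u[0] * v[0] + np.dot(u[1:], v[1:]))

def exp_map(p, v):
    """Geodesic retraction: follow the geodesic from p with velocity v for unit time,
       gamma(1) = cosh(||v||_L) p + sinh(||v||_L) v / ||v||_L."""
    norm_v = np.sqrt(max(minkowski_inner(v, v), 0.0))  # ||v||_L for a tangent vector v
    if norm_v < 1e-12:
        return p                                       # zero step: stay where we are
    return np.cosh(norm_v) * p + np.sinh(norm_v) * (v / norm_v)

# Hypothetical usage: a point of L^2 and a tangent vector at that point.
p = np.array([1.0, 0.0, 0.0])    # satisfies <p, p>_L = -1
v = np.array([0.0, 0.3, 0.1])    # satisfies <p, v>_L = 0, so v is tangent at p
q = exp_map(p, v)
print(q, minkowski_inner(q, q))  # the second value should stay (numerically) at -1
```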
23. Takeaways
Hyperbolic space is promising for representing symbolic and hierarchical datasets.
Geometry determines the path toward the optimal points.
Regardless of the optimization technique, the optimal point depends only on the loss function.
Interpretation: can the path itself carry semantics?
A loss function over hyperbolic space should be chosen with care.
Is it suitable for the given geometry? Is it differentiable? Which operations does it rely on?
Unfortunately, we lose the simple arithmetic.
Editor's Notes
Good evening. I am Segwang Kim from the Machine Intelligence Lab. My topic is hierarchical representation with hyperbolic geometry.
This is the topic I am currently working on, but I have gotten nothing meaningful yet. I find it intriguing in that it suggests alternative ways to represent symbolic and hierarchical datasets, which in turn helps with downstream tasks in Natural Language Processing or Social Network Analysis.
This is an overview.
The main goal of this talk is to get you acquainted with hyperbolic representation.
First, I will introduce the data of interest to be represented and the conventional way to embed such datasets.
Second, I will go over the shortcomings of conventional embedding and introduce the gist of hyperbolic space.
Third, I will show optimization techniques over hyperbolic space.
Finally, toy experiments follow.
Recent papers are included in this presentation.
The datasets I am dealing with, such as WordNet or social networks, are symbolic and hierarchical. They are symbolic because words or users have no meaningful numeric values; they are just symbols. On top of that, they are hierarchical, since there exist partial orderings between data points: dogs belong to mammals and mammals belong to animals, or, when a Twitter user follows another, we get an ordering between them.
The typical machine learning problems on those datasets are link prediction, node classification, community detection, and visualization. To be specific, someone would ask: are sprinkler and birdcage linked? Or: what community does a particular user belong to?
To tackle those problems, we need to parametrize symbolic and hierarchical datasets into numeric forms. We call this process embedding. Once data points are embedded into some space, we can apply a machine learning model that works on that space.
Even if symbolic data points are represented in numerical form, it is natural to expect that the embedding should agree with our intuition.
For instance, two words with similar meanings should be represented as two points that are close to each other.
This two-dimensional figure seems to capture semantic relations.
Like this, we expect certain properties from a good embedding.
Down the ages, we have embedded symbolic data into the most familiar space, Euclidean space.
However, there are some limitations of Euclidean Embedding.
To illustrate, assume that we want to solve a machine learning problem on this bushy-structured dataset. An edge between two nodes means they have something in common.
Therefore, we would want to find the embedding that preserves distances among nodes measured in the graph.
Unfortunately, the second you embed the data points into two-dimensional Euclidean space, you realize that huge distortions have been made. While the graph distance between nodes a and b is 2, the Euclidean distance between the corresponding points is far less than 2.
To remedy this problem, researchers have increased the dimensionality of the Euclidean space. However, by doing that, we lose the opportunity to analyze the data in low dimensions.
On top of that, trying to embed trees into Euclidean space is wrong from the beginning.
More formally, there is a theorem that trees cannot be embedded into Euclidean space with arbitrarily low distortion, for any number of dimensions.
So, the main question is: what if we had a space that preserves the graph structure well, like this one? What is this mysterious space? Now, it's time to introduce hyperbolic space.
Time for a series of math slides.
The best analogy I can use for introducing hyperbolic space is Euclidean space.
We can define the geometry of a given space, or manifold, by looking at its domain and the inner product structure on its tangent spaces.
Before elaborating on why the inner product structure matters, let's formally define hyperbolic space.
Hyperbolic space is a manifold with constant sectional curvature −1, and five different models are commonly used to describe it. Actually, they are all the same because there exist isometries among them.
Anyhow, I pick one of them: the Poincaré disk model.
The domain of the N-dimensional Poincaré disk model is the open N-dimensional unit ball. The inner product on a tangent space is defined like this.
Unlike Euclidean space, which has the same inner product rule for every tangent space, hyperbolic space has a different inner product structure depending on the point where the given tangent space is attached. In mathematical terms, this is called a Riemannian metric.
To compare these two spaces, let's compute an inner product.
First, you attach a tangent plane to a given point p in Euclidean or hyperbolic space, and then you pick two arbitrary tangent vectors from that tangent plane. In the Euclidean case, you take the component-wise product and sum; note that the point p has nothing to do with computing the inner product.
However, in the case of hyperbolic space, this highlighted term is multiplied on top of the usual inner product, and it depends on the point p. Because of this term, strange things happen (a small sketch follows).
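A small sketch of this dependence on the base point, assuming the standard curvature −1 convention λ_p = 2 / (1 − ‖p‖²) for the conformal factor (the constant on the slide may differ):

```python
import numpy as np

def euclidean_inner(p, u, v):
    """Euclidean inner product: the base point p plays no role."""
    return float(np.dot(u, v))

def poincare_inner(p, u, v):
    """Inner product on the tangent space at p of the Poincare ball:
    the Euclidean inner product rescaled by a factor that depends on p."""
    lam = 2.0 / (1.0 - float(np.dot(p, p)))   # conformal factor lambda_p
    return lam ** 2 * float(np.dot(u, v))

u = np.array([1.0, 0.0])
v = np.array([0.5, 0.5])
for p in (np.array([0.0, 0.0]), np.array([0.9, 0.0])):
    print(p, euclidean_inner(p, u, v), poincare_inner(p, u, v))
# The Euclidean value is identical at both base points, while the hyperbolic
# value blows up as p approaches the boundary of the unit ball.
```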
As I said, the inner product on the tangent spaces governs the geometry of the space, because it defines length, angle, and the "line" of the given space.
From calculus 101, we know that the length of a given path is defined as the line integral of the norm of the instantaneous velocity, which is a tangent vector. Since the norm is only defined once an inner product is given, the Riemannian metric comes into play (see the formula below).
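For reference, the length functional described here can be written as:

```latex
% Length of a path gamma on a Riemannian manifold: integrate the norm of the
% instantaneous velocity, measured with the metric g at the current point gamma(t).
L(\gamma) = \int_0^1 \sqrt{ g_{\gamma(t)}\left( \dot{\gamma}(t), \dot{\gamma}(t) \right) } \, dt
```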
Also, the angle between two tangent vectors is governed by the inner product structure, because computing an angle requires inner products.
Finally, if we keep in mind that a "line" is defined not as a straight path but as the shortest path connecting the start and end points, the shape of a line in hyperbolic space must be different.
The shortest path is the optimal solution of this variational problem, which seems almost impossible to solve.
But mathematicians conclude that a line in hyperbolic space is either a circular arc that intersects the boundary of the ball perpendicularly, or a straight line through the center.
Considering that the norm of a tangent vector increases as the base point approaches the boundary, the shortest path tends to pass through the region around the center rather than near the boundary. So it must be tilted toward the center.
One interesting fact about hyperbolic space is that we can choose one model among the five depending on the situation. Fundamentally, they are all the same because isometries exist between them.
The paper "…" suggests that the Poincaré ball model is more adequate for visualization than the Lorentz model, which is defined like this. This is because the Lorentz model is defined in an ambient space with constraints. But the Lorentz model guarantees more computational stability of the gradients than the Poincaré ball model.
In the following optimization section, I will explain the optimization technique on the Lorentz model, not the Poincaré model.
This is one example of a loss function over hyperbolic space.
As you can see, this loss function has hyperbolic distance terms.
Details are omitted, but basically, this disperses irrelevant datapoints and aggregates relevant ones.
Because the gradient of the loss tells which direction the data points should move, we need to know how to compute the derivative of the given loss function.
This is the Riemannian gradient descent algorithm.
There are only two parts you need to focus on: first, choosing a descent direction; second, choosing a retraction.
Choosing a descent direction needs a little more effort than the usual gradient.
Let's assume that we want to minimize a loss function over the two-dimensional Lorentz model.
Basically, we want to find the gradient of f.
It takes two steps.
Basically, we need to map the naïve gradient to a tangent vector.
First, once we get a gradient from TensorFlow or any other API, as shown in the blue box, this value is unique no matter which metric tensor you have chosen.
If we interpret the gradient as a linear map from the tangent space to the real numbers, the Riesz representation theorem implies that there is a corresponding vector such that taking the inner product with that vector realizes the gradient map.
To find that vector, the inverse of the metric tensor needs to be multiplied with the usual derivatives, in order to compensate for the extra terms in the hyperbolic inner product.
It is complicated, but the bottom line is: just flip the sign of the first element of the usual gradient.
The second step is projection.
Because the Lorentz model is defined in an ambient space, we need to project the vector resulting from the first step onto the tangent space of the model. It only takes some multiplications and additions.
Therefore, we can get the Riemannian descent direction by flipping the signs of all components of the hyperbolic gradient of the loss.
Retraction tells how a point can be moved in a given direction.
When the point is moved to the tip of the direction vector, it escapes the manifold. This is sad.
However, if the point is moved to the tip of the geodesic, it stays on the manifold and we are happy.
The geodesic is the hyperbolic version of a line, and this simple formula is all you need.
The last step is trivial. We just need to iterate the previous steps until we get sufficiently small errors.