The document presents approximate tree kernels as a faster alternative to parse tree kernels for machine learning with tree-structured data. Parse tree kernels have quadratic computational complexity that makes them impractical for large trees. Approximate tree kernels speed up computation by selectively ignoring subtrees based on a selection function. Experimental results on synthetic and real-world datasets show approximate tree kernels achieve similar performance to parse tree kernels while reducing runtime by up to three orders of magnitude and memory usage from gigabytes to kilobytes.
1. Background | Parse Tree Kernels | Approximate Tree Kernels | Results | Conclusion
Approximate Tree Kernels
Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller
Presented By
Niharjyoti Sarangi
Indian Institute of Technology Madras
April 21, 2012
Outline of the Presentation
1 Background
  Learning from tree-structured data
  Application Domains
2 Parse Tree Kernels
  Computing PTK
  Computational constraints
3 Approximate Tree Kernels
  Computing ATK
  Validity of ATK
  Types of learning
4 Results
  Performance
  Time
  Memory
5 Conclusion
Tree-structured Data
Trees: carry hierarchical information
Flat feature vectors: fail to capture the underlying dependency structure

Parse Tree
An ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar.
A tree X is called a parse tree of a grammar G = (S, P, s) if X is derived by assembling productions p ∈ P such that every node x ∈ X is labeled with a symbol l(x) ∈ S.
Examples
Figure: Parse trees for natural language text and the HTTP network protocol.
Learning from Trees
Kernel functions for structured data
Convolution of local kernels
Parse tree kernel proposed by Collins and Duffy (2002)

Kernel Functions
k : X × X → R is a symmetric and positive semi-definite function, which implicitly computes an inner product in a reproducing kernel Hilbert space.
Application Domains
Natural Language Processing
Web Spam Detection
Network Intrusion Detection
Information Retrieval from structured documents
...
Computing PTK
A generic technique for defining kernel functions over structured data is the convolution of local kernels defined over sub-structures.

Parse tree kernel
k(X, Z) = Σ_{x∈X} Σ_{z∈Z} c(x, z), where X and Z are two parse trees.

Notation
x_i: i-th child of a node x
|x|: number of children of a node x
|X|: number of nodes in X
χ: set of all possible trees
Illustration
Figure: Shared subtrees in two parse trees.
Counting Function
c(x, z) is known as the counting function, which recursively determines the number of shared subtrees rooted in the tree nodes x and z.

Defining c(x, z)
c(x, z) =
  0                                     if x, z are not derived from the same production
  λ                                     if x, z are leaf nodes
  λ · ∏_{i=1}^{|x|} (1 + c(x_i, z_i))   otherwise

0 ≤ λ ≤ 1 balances the contribution of subtrees, such that small values of λ decay the contribution of deeper nodes in large subtrees.
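The recursion above can be sketched in a few lines of Python. The tree encoding (a node is a `(label, children)` tuple), the production test, and the decay value are illustrative assumptions for this sketch, not the authors' implementation:

```python
LAM = 0.5  # decay factor lambda, 0 <= lambda <= 1 (illustrative value)

def label(x):
    return x[0]

def children(x):
    return x[1]

def same_production(x, z):
    # Two nodes are derived from the same production if their labels and
    # their children's label sequences coincide.
    return (label(x) == label(z) and
            [label(c) for c in children(x)] == [label(c) for c in children(z)])

def c(x, z):
    # Counting function: weighted number of shared subtrees rooted at x and z.
    if not same_production(x, z):
        return 0.0
    if not children(x):  # x and z are leaves with the same label
        return LAM
    prod = LAM
    for xi, zi in zip(children(x), children(z)):
        prod *= 1.0 + c(xi, zi)
    return prod

def all_nodes(x):
    # Collect every node of the tree rooted at x.
    return [x] + [n for ch in children(x) for n in all_nodes(ch)]

def ptk(X, Z):
    # Parse tree kernel: sum c(x, z) over all node pairs -- O(|X| * |Z|).
    return sum(c(x, z) for x in all_nodes(X) for z in all_nodes(Z))
```

For a tiny tree S → A B compared with itself, the double sum visits nine node pairs, but only the three matching pairs contribute.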
Computational Complexity
The complexity is O(n²), where n is the number of nodes in each parse tree.

Experimental data
Computing a parse tree kernel for two HTML documents comprising 10,000 nodes each requires about 1 gigabyte of memory and takes over 100 seconds on a recent computer system.

In practice we need to compare a large number of parse trees. At this cost, PTKs are of little practical use because of the computing resources required.
Attempted Improvements
A feature selection procedure based on statistical tests (Suzuki et al.)
Limiting computation to node pairs with matching grammar symbols (Moschitti)
Computing ATK
Approximation of tree kernels is based on the observation that trees often contain redundant parts that are not only irrelevant for the learning task but also slow down the kernel computation unnecessarily.

Approximate tree kernel
k̂(X, Z) = Σ_{s∈S} w(s) Σ_{x∈X: l(x)=s} Σ_{z∈Z: l(z)=s} c̃(x, z), where X and Z are two parse trees.

Selection function: w : S → {0, 1}
Controls whether subtrees rooted in nodes with the symbol s ∈ S contribute to the convolution (w(s) = 1) or are skipped (w(s) = 0).
Approximate Counting Function
c̃(x, z) is the approximate counting function.

Defining c̃(x, z)
c̃(x, z) =
  0                                      if x, z are not derived from the same production
  0                                      if x or z is not selected (w(l(x)) = 0 or w(l(z)) = 0)
  λ                                      if x, z are leaf nodes
  λ · ∏_{i=1}^{|x|} (1 + c̃(x_i, z_i))   otherwise

The selection function w(s) is decided based on the domain and the data. The exact parse tree kernel is obtained as a special case of the ATK when w(s) = 1 for all symbols s ∈ S.
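Under the same illustrative `(label, children)` tree encoding as before, the approximate counting function and kernel might be sketched as follows; the selection function is a plain dict from symbol to 0/1:

```python
LAM = 0.5  # decay factor lambda (illustrative value)

def label(x):
    return x[0]

def children(x):
    return x[1]

def same_production(x, z):
    return (label(x) == label(z) and
            [label(c) for c in children(x)] == [label(c) for c in children(z)])

def c_tilde(x, z, w):
    # Approximate counting: like c(x, z), but subtrees rooted at
    # deselected symbols (w(s) = 0) are skipped entirely.
    if not same_production(x, z):
        return 0.0
    if w.get(label(x), 0) == 0:  # x (and z, which has the same label) not selected
        return 0.0
    if not children(x):
        return LAM
    prod = LAM
    for xi, zi in zip(children(x), children(z)):
        prod *= 1.0 + c_tilde(xi, zi, w)
    return prod

def all_nodes(x):
    return [x] + [n for ch in children(x) for n in all_nodes(ch)]

def atk(X, Z, w):
    # Outer sums run only over node pairs sharing a selected symbol.
    return sum(c_tilde(x, z, w)
               for x in all_nodes(X) if w.get(label(x), 0) == 1
               for z in all_nodes(Z) if label(z) == label(x))
```

With w(s) = 1 for every symbol, `atk` reproduces the exact kernel, matching the special case noted above.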
ATK Is a Valid Kernel
Proof
Let Φ(X) be the vector of frequencies of all subtrees occurring in X. Then, by definition, K̂_w can always be written as
K̂_w(X, Z) = ⟨P_w Φ(X), P_w Φ(Z)⟩.
For any w, the projection P_w is independent of the actual X and Z, and hence K̂_w is a valid kernel.
ATK Is Faster than PTK
Speed-up factor q_w
q_w = (Σ_{s∈S} #_s(X) #_s(Z)) / (Σ_{s∈S} w(s) #_s(X) #_s(Z))
where #_s(X) denotes the number of nodes x ∈ X labeled with the symbol s.
From this equation, q_w ≥ 1 always holds, and rejecting even a single symbol that occurs in the trees already yields a strict speed-up q_w > 1.
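A quick numeric check of the speed-up factor, with invented per-symbol node counts (the #_s values below are purely illustrative):

```python
# Hypothetical per-symbol node counts #s(X) and #s(Z) for three symbols.
counts_X = {"S": 1, "A": 40, "B": 60}
counts_Z = {"S": 1, "A": 50, "B": 50}
w = {"S": 1, "A": 1, "B": 0}  # selection that rejects symbol B

total = sum(counts_X[s] * counts_Z[s] for s in counts_X)        # all node comparisons
kept = sum(w[s] * counts_X[s] * counts_Z[s] for s in counts_X)  # comparisons after selection
q_w = total / kept  # speed-up factor
```

Rejecting B, which dominates the comparisons in this example, gives q_w of roughly 2.5: the approximate kernel skips more than half of the node comparisons.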
Supervised Setting
Given n labeled parse trees (X_1, y_1), …, (X_n, y_n), where the y_i are the class labels.
An ideal kernel matrix Y is given as follows:
Y_ij = [[y_i = y_j]] − [[y_i ≠ y_j]]

Kernel target alignment
⟨Y, K̂_w⟩_F = Σ_{i,j: y_i=y_j} (K̂_w)_ij − Σ_{i,j: y_i≠y_j} (K̂_w)_ij
Our target now is to maximize this term with respect to w.
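The ideal kernel matrix and the alignment follow directly from the two definitions above; a minimal sketch with plain Python lists (no numerics library assumed):

```python
def ideal_kernel(y):
    # Y_ij = +1 when y_i == y_j, -1 otherwise.
    n = len(y)
    return [[1 if y[i] == y[j] else -1 for j in range(n)] for i in range(n)]

def alignment(Y, K):
    # Frobenius inner product <Y, K>_F: same-class entries of K count
    # positively, different-class entries negatively.
    n = len(Y)
    return sum(Y[i][j] * K[i][j] for i in range(n) for j in range(n))
```

A selection w that inflates within-class kernel values and suppresses between-class values increases this score.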
Supervised Setting (contd.)
Optimization Problem
w* = argmax_{w∈[0,1]^{|S|}} Σ_{i,j=1, i≠j}^{n} Y_ij Σ_{s∈S} w(s) Σ_{x∈X_i: l(x)=s} Σ_{z∈X_j: l(z)=s} c̃(x, z)
subject to
Σ_{s∈S} w(s) ≤ N, N ∈ ℕ
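One simple way to respect the budget constraint Σ_{s∈S} w(s) ≤ N is a greedy heuristic over per-symbol scores. This is only an illustrative stand-in, not the authors' optimization procedure, and the scores below are invented:

```python
def greedy_select(scores, N):
    # Keep the N symbols with the highest scores (w(s) = 1), deselect the rest.
    chosen = set(sorted(scores, key=scores.get, reverse=True)[:N])
    return {s: (1 if s in chosen else 0) for s in scores}

# Hypothetical per-symbol contributions to the alignment objective.
w = greedy_select({"S": 5.0, "NP": 3.2, "VP": 2.9, "DT": 0.1}, N=2)
```

The greedy result is feasible by construction, since exactly min(N, |S|) symbols receive w(s) = 1.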
Unsupervised Setting
Average frequency of node comparisons
f(s) = (1/n²) Σ_{i,j=1}^{n} #_s(X_i) #_s(X_j)

Comparison ratio
ρ = (expected node comparisons) / (actual number of comparisons in the PTK)
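Both quantities follow directly from the per-symbol node counts; a small sketch (the counts and the selection used below are invented for illustration):

```python
def f_s(counts):
    # f(s) for one symbol s: counts[i] = #s(X_i) over n trees,
    # averaged over all n^2 tree pairs.
    n = len(counts)
    return sum(ci * cj for ci in counts for cj in counts) / n ** 2

def selection_ratio(w, f):
    # Fraction of expected node comparisons kept by the selection w;
    # the unsupervised constraint requires this fraction to stay below rho.
    return sum(w[s] * f[s] for s in f) / sum(f.values())
```

For example, a symbol appearing 2 and 4 times in two trees yields f(s) = 9, the square of its mean count.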
Unsupervised Setting (contd.)
Optimization Problem
w* = argmax_{w∈[0,1]^{|S|}} Σ_{i,j=1, i≠j}^{n} Σ_{s∈S} w(s) Σ_{x∈X_i: l(x)=s} Σ_{z∈X_j: l(z)=s} c̃(x, z)
subject to
(Σ_{s∈S} w(s) f(s)) / (Σ_{s∈S} f(s)) ≤ ρ
Synthetic Data
Figure: Classification performance for the supervised synthetic data.
Figure: Detection performance for the unsupervised synthetic data.
Real Data
Figure: Classification performance for the question classification task.
Figure: Detection performance for the intrusion detection task (FTP).
Time
Figure: Training and testing time of SVMs using the exact and the approximate tree kernel.
Time (contd.)
Figure: Run-times for web spam (WS) and intrusion detection (ID).
Memory
Figure: Memory requirements for web spam (WS) and intrusion detection (ID).
Conclusion
Approximate tree kernels give us a fast and efficient way to work with parse trees.
Improvements in terms of run-time and memory requirements: for large trees, the approximation reduces the memory for a single kernel computation from 1 gigabyte to less than 800 kilobytes, accompanied by run-time improvements of up to three orders of magnitude.
Best results were obtained for network intrusion detection.
Questions
Any questions?