An introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL 2013)
03/10/12 kensuke-mi
Background
✦ In NLP tasks, machine-learning models easily become huge, because the weight vector grows along with the large number of features.
✦ Adding an L1 regularization term shrinks the model size. But is that really the right choice?
✦ In general, for today's NLP tasks:
   L1 regularizer: small model size, fast decoding possible.
   L2 regularizer: more accurate than L1, but the model size becomes huge.
Recap
(The slide quotes a textbook figure on regularization: the penalty is the sum of the absolute values, or the sum of the squares, of the parameter-vector elements.)
・L1 regularization: Ω(w) = C Σ_k |w_k|
・L2 regularization: Ω(w) = C Σ_k |w_k|²
With L2 regularization, the penalty becomes almost 0 as w_k approaches 0.
With L1 regularization, the penalty is 0 only when the value is exactly 0.
So the L1 norm pushes harder to reduce the number of non-zero elements in the parameter vector.
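As a quick illustration of this difference (my own toy example, not from the paper or the quoted textbook), the two update rules below show why L1 drives small weights to exactly zero while L2 only shrinks them; the function names and weight values are made up.

```python
import numpy as np

def l2_shrink(w, lam, lr=0.1):
    # Gradient step on (lam/2)*||w||^2: weights shrink toward 0 but rarely hit it exactly.
    return w - lr * lam * w

def l1_soft_threshold(w, lam, lr=0.1):
    # Proximal step for lam*||w||_1: weights inside the threshold become exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([0.8, 0.05, -0.3, 0.01])
print(l2_shrink(w, lam=1.0))          # all entries remain non-zero
print(l1_soft_threshold(w, lam=1.0))  # the small entries become exactly 0
```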
The idea of this paper
✦ Group the entries of the weight vector to reduce complexity.
✦ This brings the following benefits:
   The model size can be kept down (= lower model complexity).
   Over-fitting can be avoided.
   Selecting the non-zero features becomes more stable.

So how do we form the groups?
(Excerpt from the paper:)

1 Introduction: This paper focuses on supervised model learning, typically represented as the optimization problem

  ŵ = argmin_w O(w; D),  O(w; D) = L(w; D) + Ω(w)   (1)

where D is the supervised training data consisting of input/output pairs (x, y) ∈ D, w is an N-dimensional vector of optimization variables (interpreted as feature weights), and L(w; D) and Ω(w) are a loss function and a regularization term, respectively. Supervised learning of this form is used to estimate feature weights for many NLP tasks, such as text classification, POS tagging, named entity recognition, dependency parsing, and semantic role labeling. The goal is a learning framework that reduces model complexity beyond what is possible by simply applying L1 regularizers, building on the recently developed concept of automatic feature grouping (Tibshirani et al., 2005; Bondell and Reich, 2008) and realizing it by incorporating a discrete constraint during model learning.

2 Feature Grouping Concept: Going beyond L1-regularized sparse modeling, the idea of 'automatic feature grouping' has recently been developed; examples are fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008). The concept is to find accurate models that have fewer degrees of freedom, which is equivalent to forcing the optimization variables to be equal as much as possible. For example, ŵ1 = (0.1, 0.5, 0.1, 0.5, 0.1) is preferred over ŵ2 = (0.1, 0.3, 0.2, 0.5, 0.3) since ŵ1 and ŵ2 have two and four unique values, respectively.

3 Modeling with Feature Grouping: Let S be a finite set of discrete values, e.g., integers from -4 to 4, S = {-4, ..., -1, 0, 1, ..., 4} (how S is defined is discussed in the experiments section since it deeply depends on the training data). The objective that simultaneously achieves feature grouping and model learning is

  O(w; D) = L(w; D) + Ω(w)   s.t. w ∈ S^N   (2)

where S^N is the Cartesian power of S. The only difference from Eq. 1 is the additional discrete constraint w ∈ S^N, which means every feature weight of the trained model must take a value in S. The standard loss minimization of Eq. 1 and the discrete-constraint regularizer are separated by the dual decomposition technique, and the optimization is solved with the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011), which iteratively updates the three sets of optimization variables w, u, and α; α represents the dual variables for the equivalence constraint w = u, and the augmented Lagrangian term (ρ/2)||w - u||² with ρ > 0 ensures strict convexity and increases robustness. This yields a compact model representation, which is especially useful in actual use.
The original learning objective: L is the loss function, Ω the regularization term.
Define a set S of weight values; for example, for the range -4 to 4, S = {-4, -3, -2, ..., 3, 4}.
Each weight is then chosen from the Cartesian power S^N.
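As a toy illustration of what "choosing each weight from S" means (not code from the paper; the weights are made up), the nearest-value projection below maps a continuous weight vector onto S; weights that land on the same value are effectively grouped.

```python
import numpy as np

S = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)  # the finite value set
w = np.array([0.12, 1.7, -2.4, 2.3, 1.9])                   # toy continuous weights

# Round each weight to its nearest value in S; the constraint w in S^N forces this.
u = S[np.abs(w[:, None] - S[None, :]).argmin(axis=1)]
print(u)                  # [ 0.  2. -2.  2.  2.]
print(len(np.unique(u)))  # 3 distinct values -> the five weights fall into 3 groups
```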
So how do we form the groups? (cont.)
In other words: build a set of candidate weight values and pick every weight from that set. This act is the grouping.

However, this problem cannot be solved in the usual way.
(Excerpt from the paper:) The constraint in Eq. 2 means that each variable (feature weight) of the trained model must take a value in S, that is, ŵ_n ∈ S, where ŵ_n is the n-th factor of ŵ and n ∈ {1, ..., N}; as a result, the feature weights of trained models are automatically grouped. ADMM solves the converted problem by iterating over the three sets of optimization variables while holding the others fixed, for t = 1, 2, ... until convergence, starting from Step 1 (the w-update).
Because the Cartesian power of S becomes huge, the optimization suffers a combinatorial explosion; in other words, it is an NP-hard problem.
So dual decomposition is introduced to solve it.
Recap: dual decomposition

Dual decomposition is, in short, "a way to solve an NP-hard problem by splitting it apart":
1. argmax_y ( g(y) + h(y) ) is NP-hard and cannot be solved directly.
2. So split the problem into argmax_{z,y} ( g(z) + h(y) ) s.t. z = y.
3. Apply the Lagrangian method to the problem in step 2 and define a new function L; let L* be its optimum. By the duality theorem, L* = min_u L(u).
4. L(u) is convex, so its optimum can be found by a gradient method, with the update u := u - µ(y* - z*).
5. When y* = z*, the problem in step 2 is solved.
For details see http://research.preferred.jp/2010/11/dual-decomposition/
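A minimal numeric sketch of this recipe (my own toy, not from the paper or the linked post): two simple objectives coupled by z = y, each with a closed-form argmax, and a sub-gradient update on the multiplier u.

```python
# Toy dual decomposition: maximize g(z) + h(y) s.t. z = y,
# with g(z) = -(z - 3)^2 and h(y) = -(y - 1)^2.
mu, u = 0.5, 0.0
for t in range(100):
    z = 3.0 - u / 2.0       # argmax_z  g(z) - u*z   (closed form)
    y = 1.0 + u / 2.0       # argmax_y  h(y) + u*y   (closed form)
    if abs(y - z) < 1e-9:   # agreement reached: the constrained problem is solved
        break
    u -= mu * (y - z)       # sub-gradient step on the dual variable
print(z, y, u)              # z and y both converge to 2.0, the constrained optimum
```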
Applying dual decomposition in this paper

Υ is apparently a term similar to Ω (see Sec. 3.1).
Υ could be anything, but this paper considers only the case given in the excerpt below.
(Excerpt from the paper, Sec. 3.1, dual decomposition of the original objective:)

The specific definitions of L(w; D) and Ω(w) are ignored in this section (typical cases are given in the experiments section). Eq. 2 is reformulated with the dual decomposition technique (Everett, 1963):

  O(w, u; D) = L(w; D) + Ω(w) + Υ(u)   s.t. w = u, and u ∈ S^N   (3)

The difference from Eq. 2 is the additional term Υ(u), which is similar to the regularizer Ω(w) and whose optimization variables w and u are tied together by the equality constraint w = u. The paper only considers the case Υ(u) = (λ₂/2)||u||²₂ + λ₁||u||₁ with λ₂ ≥ 0 and λ₁ ≥ 0. This objective can also be viewed as a decomposition of the standard loss minimization problem of Eq. 1 and the additional discrete-constraint regularizer. Merging features whose weight values are equal into a single feature can dramatically reduce the model, which also helps against the over-fitting and the unstable selection of non-zero features among highly correlated features that have been reported for standard L1 regularization.
Parameter updates and optimization
✦ The decomposed formulation is solved with the ADMM algorithm (apparently a standard algorithm for dual decomposition); a minimal sketch follows below.
✦ See Sec. 3.1 for the detailed parameter updates (presumably gradient-style updates).
✦ The computational cost is kept to O(N log |S|).
✦ Online learning inside ADMM can be used to speed it up (Sec. 3.3).
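A minimal sketch of the three-step ADMM loop described in the paper excerpts above, assuming a simple squared loss so the w-update has a closed form; the projection used in the u-update and every name in this snippet are my own illustration, not the paper's actual updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
X = rng.normal(size=(200, N))
w_true = rng.choice([-2.0, 0.0, 0.5, 2.0], size=N)    # true weights already lie in S
y = X @ w_true + 0.01 * rng.normal(size=200)

S = np.array([-2.0, -0.8, -0.5, 0.0, 0.5, 0.8, 2.0])  # finite value set (cf. the Sec. 4.1 example)
rho = 1.0
w = np.zeros(N); u = np.zeros(N); alpha = np.zeros(N)

def project_to_S(v):
    # round each coordinate to its nearest value in S (enforces u in S^N)
    return S[np.abs(v[:, None] - S[None, :]).argmin(axis=1)]

for t in range(100):
    # Step 1 (w-update): squared loss plus the 'biased' L2 term (rho/2)||w - u + alpha/rho||^2
    w = np.linalg.solve(X.T @ X + rho * np.eye(N),
                        X.T @ y + rho * (u - alpha / rho))
    # Step 2 (u-update): here simply a projection of w + alpha/rho onto S^N
    u = project_to_S(w + alpha / rho)
    # Step 3 (dual update) for the equality constraint w = u
    alpha += rho * (w - u)

print(u)        # grouped weights, each drawn from S
print(w_true)   # should match u on this easy toy problem
```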
Evaluation on two tasks, along two axes
✦ Evaluated on two NLP tasks:
   Named Entity Recognition (NER)
   Dependency Parsing (DEPAR)
✦ Accuracy of each method:
   Complete Sentence Accuracy (COMP): exact match of the whole sentence(?)
   F-score (F-sc) for the NER task
   UAS (accuracy of unlabeled edges) for the DEPAR task
✦ Model complexity (see the small sketch below):
   #nzF: number of features whose corresponding weight is non-zero
   #DoF: number of unique non-zero weight values
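A tiny sketch (toy numbers, my own) of how the two complexity measures could be computed from a trained weight vector.

```python
import numpy as np

w = np.array([0.0, 0.5, -0.5, 0.5, 2.0, 0.0, -0.5])   # toy trained weights
nzF = np.count_nonzero(w)                  # #nzF: features with a non-zero weight
dof = len(np.unique(w[w != 0.0]))          # #DoF: unique non-zero weight values
print(nzF, dof)                            # 5 3
```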
Definition of the weight set S (Sec. 4.1)

S may be defined freely, but in general the following template works best:
(Excerpt from the paper, Sec. 4.1:) Any finite set can be used for S, but it has to be selected carefully since it deeply affects performance; this is actually the most important point of the method. After preliminary investigation of several settings, the paper introduces an example of a template suitable for large feature sets. Let η, δ, and κ be non-negative real-valued constants, ζ a positive integer, and f_{η,δ,κ}(x, y) = y(η·κ^x + δ). A finite set of values S is then defined as

  S_{η,δ,κ,ζ} = { f_{η,δ,κ}(x, y) | (x, y) ∈ S_ζ × {-1, 1} } ∪ {0},

where S_ζ is the set of non-negative integers from zero to ζ - 1, that is, S_ζ = {m}_{m=0}^{ζ-1}. For example, with η = 0.1, δ = 0.4, κ = 4, and ζ = 3, S_{η,δ,κ,ζ} = {-2.0, -0.8, -0.5, 0, 0.5, 0.8, 2.0}. The intuition of this template is that the distribution of feature weights in a trained model often takes a form similar to a 'power law' when the feature set is large; therefore, an exponential function with a scale and bias seems appropriate for fitting them.
Here η, κ, and δ are non-negative reals, ζ is a positive integer, and S_ζ is the set of integers from 0 to ζ - 1.
Rationale for the template: the distribution of weights generally tends to follow a power law, so an exponential function is used for the fitting.
Note that #DoF can be controlled through ζ; a small sketch follows below.
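A quick sketch that builds S from this template and reproduces the example in Sec. 4.1; the exponential form y(η·κ^x + δ) is my reading of the garbled formula, chosen because it matches the stated example values.

```python
# Build S_{eta,delta,kappa,zeta} = { y * (eta * kappa**x + delta) } for
# x in {0, ..., zeta-1}, y in {-1, +1}, plus {0}.
def build_S(eta, delta, kappa, zeta):
    values = {y * (eta * kappa ** x + delta) for x in range(zeta) for y in (-1, 1)}
    return sorted(values | {0.0})

print(build_S(eta=0.1, delta=0.4, kappa=4, zeta=3))
# -> [-2.0, -0.8, -0.5, 0.0, 0.5, 0.8, 2.0]  (the example from Sec. 4.1)
```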
Experimental results
[Figure 3: Performance vs. degrees of freedom (#DoF, log scale) in the trained model on the development data. (a) NER and (b) DEPAR (complete sentence accuracy), comparing DC-ADMM with L2CRF, L1CRF, and L1CRF (w/ QT) for NER, and with L2PA, L1RDA, and L1RDA (w/ QT) for DEPAR.]

(Excerpt from the paper:) Note that the upper bound of #DoF in the trained model can be controlled by ζ; namely, if ζ = 4 then the upper bound of #DoF is 8 (doubled by the positive and negative sides). ρ = 1, ξ = 1, κ = 4 (or 2 if ζ ≥ 5), and δ = η/2 were fixed in all experiments.
Table 1: Comparison results of the methods on test data (K: thousand, M: million)

NER (test)           COMP   F-sc   #nzF   #DoF
L2CRF                84.88  89.97  61.6M  38.6M
L1CRF                84.85  89.99  614K   321K
  (w/ QT ζ=4)        78.39  85.33  568K   8
  (w/ QT ζ=2)        73.40  81.45  454K   4
  (w/ QT ζ=1)        65.53  75.87  454K   2
DC-ADMM (ζ=4)        84.96  89.92  643K   8
  (ζ=2)              84.04  89.35  455K   4
  (ζ=1)              83.06  88.62  364K   2

DEPAR (test)         COMP   UAS    #nzF   #DoF
L2PA                 49.67  93.51  15.5M  5.59M
L1RDA                49.54  93.48  7.76M  3.56M
  (w/ QT ζ=4)        38.58  90.85  6.32M  8
  (w/ QT ζ=2)        34.19  89.42  3.08M  4
  (w/ QT ζ=1)        30.42  88.67  3.08M  2
DC-ADMM (ζ=4)        49.83  93.55  5.81M  8
  (ζ=2)              48.97  93.18  4.11M  4
  (ζ=1)              46.56  92.86  6.37M  2
On both NER and DEPAR, the method keeps model complexity down while achieving accuracy comparable to the baselines.
Summary
✦ Machine learning for NLP tasks tends toward high model complexity because of the sheer number of weights.
✦ To keep the complexity down, the weights were grouped.
✦ The NP-hard problem that arises from the grouping is solved with dual decomposition.
✦ On the Named Entity Recognition and Dependency Parsing tasks, complexity was reduced while accuracy was maintained.