An introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL 2013)
03/10/12 kensuke-mi
Background
✦ In NLP tasks, machine-learning models easily become huge, because the weight vector grows along with the large number of features.
✦ Adding an L1 regularization term shrinks the model size. But is that really the right choice?
✦ In general, for today's NLP tasks:
   L1 regularizer: small model size, fast decoding possible.
   L2 regularizer: more accurate than L1, but the model size becomes huge.
Recap
(The slide quotes a textbook figure on regularization: the penalty is the sum of the absolute values, or the sum of the squares, of the parameter-vector elements.)
・L1 regularization: Ω(w) = C Σ_k |w_k|
・L2 regularization: Ω(w) = C Σ_k |w_k|²
With L2 regularization, the penalty becomes almost 0 as w_k approaches 0.
With L1 regularization, the penalty is 0 only when the value is exactly 0.
So the L1 norm pushes harder to reduce the number of non-zero elements in the parameter vector.
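As a quick illustration of this difference (my own toy example, not from the paper or the quoted textbook), the two update rules below show why L1 drives small weights to exactly zero while L2 only shrinks them; the function names and weight values are made up.

```python
import numpy as np

def l2_shrink(w, lam, lr=0.1):
    # Gradient step on (lam/2)*||w||^2: weights shrink toward 0 but rarely hit it exactly.
    return w - lr * lam * w

def l1_soft_threshold(w, lam, lr=0.1):
    # Proximal step for lam*||w||_1: weights inside the threshold become exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([0.8, 0.05, -0.3, 0.01])
print(l2_shrink(w, lam=1.0))          # all entries remain non-zero
print(l1_soft_threshold(w, lam=1.0))  # the small entries become exactly 0
```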
The idea of this paper
✦ Group the entries of the weight vector to reduce complexity.
✦ This brings the following benefits:
   The model size can be kept down (= lower model complexity).
   Over-fitting can be avoided.
   Selecting the non-zero features becomes more stable.

So how do we form the groups?
(Excerpt from the paper:)

1 Introduction: This paper focuses on supervised model learning, typically represented as the optimization problem

  ŵ = argmin_w O(w; D),  O(w; D) = L(w; D) + Ω(w)   (1)

where D is the supervised training data consisting of input/output pairs (x, y) ∈ D, w is an N-dimensional vector of optimization variables (interpreted as feature weights), and L(w; D) and Ω(w) are a loss function and a regularization term, respectively. Supervised learning of this form is used to estimate feature weights for many NLP tasks, such as text classification, POS tagging, named entity recognition, dependency parsing, and semantic role labeling. The goal is a learning framework that reduces model complexity beyond what is possible by simply applying L1 regularizers, building on the recently developed concept of automatic feature grouping (Tibshirani et al., 2005; Bondell and Reich, 2008) and realizing it by incorporating a discrete constraint during model learning.

2 Feature Grouping Concept: Going beyond L1-regularized sparse modeling, the idea of 'automatic feature grouping' has recently been developed; examples are fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008). The concept is to find accurate models that have fewer degrees of freedom, which is equivalent to forcing the optimization variables to be equal as much as possible. For example, ŵ1 = (0.1, 0.5, 0.1, 0.5, 0.1) is preferred over ŵ2 = (0.1, 0.3, 0.2, 0.5, 0.3) since ŵ1 and ŵ2 have two and four unique values, respectively.

3 Modeling with Feature Grouping: Let S be a finite set of discrete values, e.g., integers from -4 to 4, S = {-4, ..., -1, 0, 1, ..., 4} (how S is defined is discussed in the experiments section since it deeply depends on the training data). The objective that simultaneously achieves feature grouping and model learning is

  O(w; D) = L(w; D) + Ω(w)   s.t. w ∈ S^N   (2)

where S^N is the Cartesian power of S. The only difference from Eq. 1 is the additional discrete constraint w ∈ S^N, which means every feature weight of the trained model must take a value in S. The standard loss minimization of Eq. 1 and the discrete-constraint regularizer are separated by the dual decomposition technique, and the optimization is solved with the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011), which iteratively updates the three sets of optimization variables w, u, and α; α represents the dual variables for the equivalence constraint w = u, and the augmented Lagrangian term (ρ/2)||w - u||² with ρ > 0 ensures strict convexity and increases robustness. This yields a compact model representation, which is especially useful in actual use.
The original learning objective: L is the loss function, Ω the regularization term.
Define a set S of weight values; for example, for the range -4 to 4, S = {-4, -3, -2, ..., 3, 4}.
Each weight is then chosen from the Cartesian power S^N.
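As a toy illustration of what "choosing each weight from S" means (not code from the paper; the weights are made up), the nearest-value projection below maps a continuous weight vector onto S; weights that land on the same value are effectively grouped.

```python
import numpy as np

S = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)  # the finite value set
w = np.array([0.12, 1.7, -2.4, 2.3, 1.9])                   # toy continuous weights

# Round each weight to its nearest value in S; the constraint w in S^N forces this.
u = S[np.abs(w[:, None] - S[None, :]).argmin(axis=1)]
print(u)                  # [ 0.  2. -2.  2.  2.]
print(len(np.unique(u)))  # 3 distinct values -> the five weights fall into 3 groups
```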
So how do we form the groups? (cont.)
In other words: build a set of candidate weight values and pick every weight from that set. This act is the grouping.

However, this problem cannot be solved in the usual way.
(Excerpt from the paper:) The constraint in Eq. 2 means that each variable (feature weight) of the trained model must take a value in S, that is, ŵ_n ∈ S, where ŵ_n is the n-th factor of ŵ and n ∈ {1, ..., N}; as a result, the feature weights of trained models are automatically grouped. ADMM solves the converted problem by iterating over the three sets of optimization variables while holding the others fixed, for t = 1, 2, ... until convergence, starting from Step 1 (the w-update).
Because the Cartesian power of S becomes huge, the optimization suffers a combinatorial explosion; in other words, it is an NP-hard problem.
So dual decomposition is introduced to solve it.
Recap: dual decomposition

Dual decomposition is, in short, "a way to solve an NP-hard problem by splitting it apart":
1. argmax_y ( g(y) + h(y) ) is NP-hard and cannot be solved directly.
2. So split the problem into argmax_{z,y} ( g(z) + h(y) ) s.t. z = y.
3. Apply the Lagrangian method to the problem in step 2 and define a new function L; let L* be its optimum. By the duality theorem, L* = min_u L(u).
4. L(u) is convex, so its optimum can be found by a gradient method, with the update u := u - µ(y* - z*).
5. When y* = z*, the problem in step 2 is solved.
For details see http://research.preferred.jp/2010/11/dual-decomposition/
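A minimal numeric sketch of this recipe (my own toy, not from the paper or the linked post): two simple objectives coupled by z = y, each with a closed-form argmax, and a sub-gradient update on the multiplier u.

```python
# Toy dual decomposition: maximize g(z) + h(y) s.t. z = y,
# with g(z) = -(z - 3)^2 and h(y) = -(y - 1)^2.
mu, u = 0.5, 0.0
for t in range(100):
    z = 3.0 - u / 2.0       # argmax_z  g(z) - u*z   (closed form)
    y = 1.0 + u / 2.0       # argmax_y  h(y) + u*y   (closed form)
    if abs(y - z) < 1e-9:   # agreement reached: the constrained problem is solved
        break
    u -= mu * (y - z)       # sub-gradient step on the dual variable
print(z, y, u)              # z and y both converge to 2.0, the constrained optimum
```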
Applying dual decomposition in this paper

Υ is apparently a term similar to Ω (see Sec. 3.1).
Υ could be anything, but this paper considers only the case given in the excerpt below.
(Excerpt from the paper, Sec. 3.1, dual decomposition of the original objective:)

The specific definitions of L(w; D) and Ω(w) are ignored in this section (typical cases are given in the experiments section). Eq. 2 is reformulated with the dual decomposition technique (Everett, 1963):

  O(w, u; D) = L(w; D) + Ω(w) + Υ(u)   s.t. w = u, and u ∈ S^N   (3)

The difference from Eq. 2 is the additional term Υ(u), which is similar to the regularizer Ω(w) and whose optimization variables w and u are tied together by the equality constraint w = u. The paper only considers the case Υ(u) = (λ₂/2)||u||²₂ + λ₁||u||₁ with λ₂ ≥ 0 and λ₁ ≥ 0. This objective can also be viewed as a decomposition of the standard loss minimization problem of Eq. 1 and the additional discrete-constraint regularizer. Merging features whose weight values are equal into a single feature can dramatically reduce the model, which also helps against the over-fitting and the unstable selection of non-zero features among highly correlated features that have been reported for standard L1 regularization.
Parameter updates and optimization
✦ The decomposed formulation is solved with the ADMM algorithm (apparently a standard algorithm for dual decomposition); a minimal sketch follows below.
✦ See Sec. 3.1 for the detailed parameter updates (presumably gradient-style updates).
✦ The computational cost is kept to O(N log |S|).
✦ Online learning inside ADMM can be used to speed it up (Sec. 3.3).
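A minimal sketch of the three-step ADMM loop described in the paper excerpts above, assuming a simple squared loss so the w-update has a closed form; the projection used in the u-update and every name in this snippet are my own illustration, not the paper's actual updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
X = rng.normal(size=(200, N))
w_true = rng.choice([-2.0, 0.0, 0.5, 2.0], size=N)    # true weights already lie in S
y = X @ w_true + 0.01 * rng.normal(size=200)

S = np.array([-2.0, -0.8, -0.5, 0.0, 0.5, 0.8, 2.0])  # finite value set (cf. the Sec. 4.1 example)
rho = 1.0
w = np.zeros(N); u = np.zeros(N); alpha = np.zeros(N)

def project_to_S(v):
    # round each coordinate to its nearest value in S (enforces u in S^N)
    return S[np.abs(v[:, None] - S[None, :]).argmin(axis=1)]

for t in range(100):
    # Step 1 (w-update): squared loss plus the 'biased' L2 term (rho/2)||w - u + alpha/rho||^2
    w = np.linalg.solve(X.T @ X + rho * np.eye(N),
                        X.T @ y + rho * (u - alpha / rho))
    # Step 2 (u-update): here simply a projection of w + alpha/rho onto S^N
    u = project_to_S(w + alpha / rho)
    # Step 3 (dual update) for the equality constraint w = u
    alpha += rho * (w - u)

print(u)        # grouped weights, each drawn from S
print(w_true)   # should match u on this easy toy problem
```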
Evaluation on two tasks, along two axes
✦ Evaluated on two NLP tasks:
   Named Entity Recognition (NER)
   Dependency Parsing (DEPAR)
✦ Accuracy of each method:
   Complete Sentence Accuracy (COMP): exact match of the whole sentence(?)
   F-score (F-sc) for the NER task
   UAS (accuracy of unlabeled edges) for the DEPAR task
✦ Model complexity (see the small sketch below):
   #nzF: number of features whose corresponding weight is non-zero
   #DoF: number of unique non-zero weight values
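A tiny sketch (toy numbers, my own) of how the two complexity measures could be computed from a trained weight vector.

```python
import numpy as np

w = np.array([0.0, 0.5, -0.5, 0.5, 2.0, 0.0, -0.5])   # toy trained weights
nzF = np.count_nonzero(w)                  # #nzF: features with a non-zero weight
dof = len(np.unique(w[w != 0.0]))          # #DoF: unique non-zero weight values
print(nzF, dof)                            # 5 3
```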
Definition of the weight set S (Sec. 4.1)

S may be defined freely, but in general the following template works best:
(Excerpt from the paper, Sec. 4.1:) Any finite set can be used for S, but it has to be selected carefully since it deeply affects performance; this is actually the most important point of the method. After preliminary investigation of several settings, the paper introduces an example of a template suitable for large feature sets. Let η, δ, and κ be non-negative real-valued constants, ζ a positive integer, and f_{η,δ,κ}(x, y) = y(η·κ^x + δ). A finite set of values S is then defined as

  S_{η,δ,κ,ζ} = { f_{η,δ,κ}(x, y) | (x, y) ∈ S_ζ × {-1, 1} } ∪ {0},

where S_ζ is the set of non-negative integers from zero to ζ - 1, that is, S_ζ = {m}_{m=0}^{ζ-1}. For example, with η = 0.1, δ = 0.4, κ = 4, and ζ = 3, S_{η,δ,κ,ζ} = {-2.0, -0.8, -0.5, 0, 0.5, 0.8, 2.0}. The intuition of this template is that the distribution of feature weights in a trained model often takes a form similar to a 'power law' when the feature set is large; therefore, an exponential function with a scale and bias seems appropriate for fitting them.
Here η, κ, and δ are non-negative reals, ζ is a positive integer, and S_ζ is the set of integers from 0 to ζ - 1.
Rationale for the template: the distribution of weights generally tends to follow a power law, so an exponential function is used for the fitting.
Note that #DoF can be controlled through ζ; a small sketch follows below.
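A quick sketch that builds S from this template and reproduces the example in Sec. 4.1; the exponential form y(η·κ^x + δ) is my reading of the garbled formula, chosen because it matches the stated example values.

```python
# Build S_{eta,delta,kappa,zeta} = { y * (eta * kappa**x + delta) } for
# x in {0, ..., zeta-1}, y in {-1, +1}, plus {0}.
def build_S(eta, delta, kappa, zeta):
    values = {y * (eta * kappa ** x + delta) for x in range(zeta) for y in (-1, 1)}
    return sorted(values | {0.0})

print(build_S(eta=0.1, delta=0.4, kappa=4, zeta=3))
# -> [-2.0, -0.8, -0.5, 0.0, 0.5, 0.8, 2.0]  (the example from Sec. 4.1)
```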
Experimental results
[Figure 3: Performance vs. degrees of freedom (#DoF, log scale) in the trained model on the development data. (a) NER and (b) DEPAR (complete sentence accuracy), comparing DC-ADMM with L2CRF, L1CRF, and L1CRF (w/ QT) for NER, and with L2PA, L1RDA, and L1RDA (w/ QT) for DEPAR.]

(Excerpt from the paper:) Note that the upper bound of #DoF in the trained model can be controlled by ζ; namely, if ζ = 4 then the upper bound of #DoF is 8 (doubled by the positive and negative sides). ρ = 1, ξ = 1, κ = 4 (or 2 if ζ ≥ 5), and δ = η/2 were fixed in all experiments.
Table 1: Comparison results of the methods on test data (K: thousand, M: million)

NER (test)           COMP   F-sc   #nzF   #DoF
L2CRF                84.88  89.97  61.6M  38.6M
L1CRF                84.85  89.99  614K   321K
  (w/ QT ζ=4)        78.39  85.33  568K   8
  (w/ QT ζ=2)        73.40  81.45  454K   4
  (w/ QT ζ=1)        65.53  75.87  454K   2
DC-ADMM (ζ=4)        84.96  89.92  643K   8
  (ζ=2)              84.04  89.35  455K   4
  (ζ=1)              83.06  88.62  364K   2

DEPAR (test)         COMP   UAS    #nzF   #DoF
L2PA                 49.67  93.51  15.5M  5.59M
L1RDA                49.54  93.48  7.76M  3.56M
  (w/ QT ζ=4)        38.58  90.85  6.32M  8
  (w/ QT ζ=2)        34.19  89.42  3.08M  4
  (w/ QT ζ=1)        30.42  88.67  3.08M  2
DC-ADMM (ζ=4)        49.83  93.55  5.81M  8
  (ζ=2)              48.97  93.18  4.11M  4
  (ζ=1)              46.56  92.86  6.37M  2
On both NER and DEPAR, the method keeps model complexity down while achieving accuracy comparable to the baselines.
Summary
✦ Machine learning for NLP tasks tends toward high model complexity because of the sheer number of weights.
✦ To keep the complexity down, the weights were grouped.
✦ The NP-hard problem that arises from the grouping is solved with dual decomposition.
✦ On the Named Entity Recognition and Dependency Parsing tasks, complexity was reduced while accuracy was maintained.