Introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL, 2013)
03/10/12 kensuke-mi
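✦ In NLP tasks, machine learning models easily become huge.
✦ Introducing an L1 regularization term makes the model size smaller.
✦ In general, for today's NLP tasks:
L1 regularization term: the model size becomes small, and fast decoding is possible
L2 regularization term: better accuracy than L1, but the model size becomes huge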
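With L2 regularization, the penalty becomes almost 0 as the value of w_k approaches 0. L1 therefore tends to reduce the number of non-zero elements of the parameter vector more strongly. Put this way, L1 regularization looks like the better choice, but [...]
L1 penalty: $C \sum_k |w_k|$
L2 penalty: $C \sum_k |w_k|^2$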
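The contrast between the two penalties can be made concrete with a small sketch. It is not taken from the slides: it uses the standard proximal steps for an absolute-value penalty (soft-thresholding) and for a squared penalty (uniform shrinkage), and the weight values and the step size 0.1 are hypothetical.

```python
def prox_l1(weights, step):
    """Soft-thresholding: proximal step for the penalty step * |w_k|."""
    result = []
    for v in weights:
        if abs(v) > step:
            result.append((abs(v) - step) * (1 if v > 0 else -1))
        else:
            result.append(0.0)  # small weights are set exactly to zero
    return result


def prox_l2(weights, step):
    """Proximal step for the penalty (step / 2) * w_k ** 2: uniform shrinkage."""
    return [v / (1.0 + step) for v in weights]


# Hypothetical weight vector, chosen only for illustration.
weights = [0.05, -0.3, 1.2, -0.02, 0.0]
after_l1 = prox_l1(weights, 0.1)
after_l2 = prox_l2(weights, 0.1)

nonzero_l1 = sum(1 for v in after_l1 if v != 0.0)
nonzero_l2 = sum(1 for v in after_l2 if v != 0.0)
print("non-zero after L1 step:", nonzero_l1)
print("non-zero after L2 step:", nonzero_l2)
```

The L1 step removes the small weights outright, while the L2 step only scales every weight down, which is one way to see why L1-regularized models end up smaller.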
✦ Group the weight vector to lower the computational cost
✦ This brings the following benefit
(= it lowers the model complexity)
[...] cluster without any loss.

Modeling with Feature Grouping
This section describes our proposal for obtaining a feature grouping solution.

Integration of a Discrete Constraint
Let S be a finite set of discrete values, i.e., a set of integers from −4 to 4, that is, $S = \{-4, \dots, -1, 0, 1, \dots, 4\}$. The detailed discussion how we define S can be found in our experiments section since it deeply depends on training data. Then, we define the objective that can simultaneously achieve feature grouping and model learning as follows:
$O(w; D) = L(w; D) + \Omega(w)$ s.t. $w \in S^N$, (2)
where $S^N$ is the cartesian power of a set S. The difference with Eq. 1 is the additional discrete constraint, namely, $w \in S^N$. This constraint means that each variable (feature weight) [...]

[...] the standard loss minimization problem shown in Eq. 1 and the additional discrete constraint regularizer by the dual decomposition technique.
To solve the optimization in Eq. 3, we leverage the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011). ADMM provides a very efficient optimization framework for the problem in the dual decomposition form. Here, α represents dual variables for the equivalence constraint w = u. ADMM introduces the augmented Lagrangian term $\frac{\rho}{2}\|w - u\|_2^2$ with ρ > 0, which ensures strict convexity and increases robustness.
Finally, the optimization problem in [...] can be converted into a series of iterative optimization problems. Detailed derivation in the general case can be found in (Boyd et al., 2011). [...] shows the entire model learning framework of our proposed method. The remarkable point is that ADMM works by iteratively computing one of the three optimization variable sets w, u, and α [...]
[...] to provide compact model representation, which is especially useful in actual use.

This paper focuses on the topic of supervised model learning, which is typically represented as the following form of the optimization problem:
$\hat{w} = \arg\min_{w} O(w; D)$,
$O(w; D) = L(w; D) + \Omega(w)$, (1)
where D is supervised training data that consists of the corresponding input x and output y pairs, that is, $(x, y) \in D$. w is an N-dimensional vector representation of a set of optimization variables, which are also interpreted as feature weights. $L(w; D)$ and $\Omega(w)$ represent a loss function and a regularization term, respectively. Nowadays, we, in most cases, utilize a supervised learning method expressed as the above optimization problem to estimate the feature weights of many natural language processing (NLP) tasks, such as text classification, POS-tagging, named entity recognition, dependency parsing, and semantic role labeling.
In the last decade, the L1-regularization technique [...]

[...] a model learning framework that can reduce the model complexity beyond that possible by simply applying L1-regularizers. To achieve our goal, we focus on the recently developed concept of automatic feature grouping (Tibshirani et al., 2005; Bondell and Reich, 2008). We introduce a model learning framework that achieves feature grouping by incorporating a discrete constraint during model learning.
2 Feature Grouping Concept
Going beyond L1-regularized sparse modeling, the idea of 'automatic feature grouping' has recently been developed. Examples are fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008). The concept of automatic feature grouping is to find accurate models that have fewer degrees of freedom. This is equivalent to enforce every optimization variables to be equal as much as possible. A simple example is that $\hat{w}_1 = (0.1, 0.5, 0.1, 0.5, 0.1)$ is preferred over $\hat{w}_2 = (0.1, 0.3, 0.2, 0.5, 0.3)$ since $\hat{w}_1$ and $\hat{w}_2$ have two and four unique values, respectively.
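The comparison of the two example vectors can be checked with a few lines. This is only an illustrative sketch: the function name `degrees_of_freedom` is made up here, and it simply counts distinct values, which is how the excerpt above compares the two vectors.

```python
def degrees_of_freedom(weights):
    """Count the distinct values that appear in a weight vector."""
    return len(set(weights))


# The two example vectors from the excerpt above.
w1 = (0.1, 0.5, 0.1, 0.5, 0.1)
w2 = (0.1, 0.3, 0.2, 0.5, 0.3)

print(degrees_of_freedom(w1), degrees_of_freedom(w2))
```

A model whose weights share only a few distinct values can be stored as those values plus, for each feature, a reference to one of them.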
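1. argmax_y ( g(y) + h(y) ) is NP-hard and cannot be solved directly.
2. The problem is therefore decomposed as argmax_{z,y} ( g(z) + h(y) ) s.t. z = y.
3. Apply the method of Lagrange multipliers to the formula in 2 and define a new function L. Let L* be the optimum of L; then, by the duality theorem, L* = min_u L(u).
4. L(u) is convex, so the optimum can be found with a gradient method. The update is u := u − µ(y* − z*).
5. When y* = z*, the problem in 2 is solved.
Υ seems to be a term similar to Ω (from Sec. 3.1).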
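The five steps above can be traced on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's procedure: y and z each pick one of three labels, g and h are hypothetical score tables, and the multiplier u has one entry per label, so that y* and z* act as one-hot indicator vectors in the update u := u − µ(y* − z*). A real use case would have subproblems that are hard to solve jointly; this toy one is trivially solvable and only shows the mechanics.

```python
def argmax(scores):
    """Index of the largest score (the first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def dual_decomposition(g, h, step=0.3, max_iter=100):
    """Maximise g[k] + h[k] over k by making two subproblems agree."""
    u = [0.0] * len(g)  # one Lagrange multiplier per candidate label
    for _ in range(max_iter):
        # Solve each subproblem separately under the current multipliers.
        z = argmax([g[k] - u[k] for k in range(len(g))])
        y = argmax([h[k] + u[k] for k in range(len(h))])
        if y == z:
            return y  # agreement: the constrained problem is solved
        # Subgradient step u := u - step * (y* - z*) on one-hot vectors.
        u[y] -= step
        u[z] += step
    return None  # no agreement reached within max_iter


# Hypothetical score tables, chosen only for illustration.
g = [3.0, 1.0, 2.0]
h = [0.0, 2.5, 2.0]
best = dual_decomposition(g, h)
print("agreed label:", best)
```

When the two subproblems return the same label, that label also maximises the joint score, which is why step 5 can stop at y* = z*.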
[...] The detailed discussion how we define S can be found in our experiments section since it deeply depends on training data. Then, we define the objective that can simultaneously achieve feature grouping and model learning as follows:
$O(w; D) = L(w; D) + \Omega(w)$ s.t. $w \in S^N$, (2)
where $S^N$ is the cartesian power of a set S. The difference with Eq. 1 is the additional discrete constraint, namely, $w \in S^N$. This constraint means that each variable (feature weight) in trained models must take a value in S, that is, $\hat{w}_n \in S$, where $\hat{w}_n$ is the n-th factor of $\hat{w}$, and $n \in \{1, \dots, N\}$. As a result, feature weights in trained models are automatically grouped in terms of the basis of model learning. This is the basic idea of feature grouping proposed in this paper.
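To illustrate how a weight vector constrained to S^N is grouped automatically, here is a small sketch. The vector `w_hat` is hypothetical, S is the integer set from −4 to 4 used as the running example in the paper excerpt, and `group_features` is a helper name invented for this illustration.

```python
from collections import defaultdict


def group_features(weights):
    """Group feature indices by their shared non-zero weight value."""
    groups = defaultdict(list)
    for index, value in enumerate(weights):
        if value != 0:  # zero-weight features drop out of the model
            groups[value].append(index)
    return dict(groups)


S = set(range(-4, 5))               # S = {-4, ..., 4}
w_hat = [2, 0, -1, 2, 2, -1, 0, 4]  # hypothetical trained weights

# The discrete constraint: every weight must be an element of S.
assert all(value in S for value in w_hat)

groups = group_features(w_hat)
print(groups)
```

Each key of `groups` is one shared weight value, so the model can be stored as a handful of values plus the feature indices attached to each.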
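[...] dual decomposition form. Here, α represents dual variables for the equivalence constraint w = u. ADMM introduces the augmented Lagrangian term $\frac{\rho}{2}\|w - u\|_2^2$ with ρ > 0, which ensures strict convexity and increases robustness.
Finally, the optimization problem in [...] can be converted into a series of iterative optimization problems. Detailed derivation in the general case can be found in (Boyd et al., 2011). [...] shows the entire model learning framework of our proposed method. The remarkable point is that ADMM works by iteratively computing one of the three optimization variable sets w, u, and α while holding the other variables fixed in the iterations t = 1, 2, ... until convergence.
Step1 (w-update): This part of the optimization problem shown in Eq. [...] with a 'biased' L2-regularizer [...] that the direction of regularization [...]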
[...] of $L(w; D)$ and $\Omega(w)$. Thus, we ignore their specific definition in this section. Typical cases can be found in the experiments section. Then, we reformulate Eq. 2 by using the dual decomposition technique (Everett, 1963):
$O(w, u; D) = L(w; D) + \Omega(w) + \Upsilon(u)$ s.t. $w = u$, and $u \in S^N$. (3)
Difference from Eq. 2, Eq. 3 has an additional term $\Upsilon(u)$, which is similar to the regularizer $\Omega(w)$, whose optimization variables w and u are tightened with equality constraint w = u. Here, this paper only considers the case $\Upsilon(u) = \frac{\lambda_2}{2}\|u\|_2^2 + \lambda_1\|u\|_1$, and $\lambda_2 \ge 0$ and $\lambda_1 \ge 0$. This objective can also be viewed as the decomposition of the standard loss minimization problem shown in Eq. 1 and the additional discrete constraint regularizer by the dual decomposition technique.
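✦ The decomposed formulation is solved with an algorithm called ADMM
✦ See Sec. 3.1 for the detailed parameter updates.
✦ The computational cost is held to O(N log |S|).
✦ Online learning can be used inside ADMM for a speed-up (Sec. 3.3)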
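The alternating updates can be sketched on a toy problem. Everything below is an assumption made for illustration rather than the paper's implementation: the loss is a separable quadratic 0.5 * (w_n − c_n)^2 so that the w-update has a closed form, Ω and Υ are taken to be zero, the dual variable `alpha` is used in the scaled form described by Boyd et al. (2011), and the names `project_onto_S` and `admm_discrete` are invented here. The projection looks up each weight in a sorted S by binary search, which is one way to arrive at N log |S| work per pass.

```python
from bisect import bisect_left


def project_onto_S(values, sorted_S):
    """Map each value to its nearest element of sorted_S via binary search."""
    projected = []
    for v in values:
        i = bisect_left(sorted_S, v)
        candidates = sorted_S[max(i - 1, 0):i + 1]  # neighbours around v
        projected.append(min(candidates, key=lambda s: abs(s - v)))
    return projected


def admm_discrete(c, sorted_S, rho=1.0, iterations=50):
    """Toy ADMM: minimise 0.5 * sum((w - c)^2) s.t. w = u, u in S^N."""
    n = len(c)
    u = [0.0] * n
    alpha = [0.0] * n  # scaled dual variable for the constraint w = u
    for _ in range(iterations):
        # w-update: quadratic loss plus an L2 term pulled toward u - alpha.
        w = [(c[k] + rho * (u[k] - alpha[k])) / (1.0 + rho) for k in range(n)]
        # u-update: nearest point of S^N to w + alpha.
        u = project_onto_S([w[k] + alpha[k] for k in range(n)], sorted_S)
        # Dual update: accumulate the remaining disagreement w - u.
        alpha = [alpha[k] + w[k] - u[k] for k in range(n)]
    return u


sorted_S = list(range(-4, 5))  # S = {-4, ..., 4}
c = [0.9, -2.2, 0.1, 3.6]      # hypothetical unconstrained optimum
u_final = admm_discrete(c, sorted_S)
print(u_final)
```

In this sketch the w-update is an ordinary loss minimisation with an L2 term centred on `u - alpha` instead of on zero, and the u-update is the only place where the discrete set S enters.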
✦ Evaluation was carried out on two NLP tasks
Named Entity Recognition task (NER)
Dependency Parsing task (DEPAR)
✦ Accuracy evaluation of the method
Complete Sentence Accuracy (COMP): exact match?
✦ Evaluation of model complexity
[...] finite set for S. However, we have to carefully select it since it deeply affects the performance. Actually, this is the most considerable point of our method. We preliminarily investigated the several settings. Here, we introduce an example of template which is suitable for large feature set. Let η, δ, and κ represent non-negative real-value constants, ζ be a positive integer, σ = {−1, 1}, and a function $f_{\eta,\delta,\kappa}(x, y) = y(\eta\kappa^{x} + \delta)$. Then, we define a finite set of values S as follows:
$S_{\eta,\delta,\kappa,\zeta} = \{ f_{\eta,\delta,\kappa}(x, y) \mid (x, y) \in S_\zeta \times \sigma \} \cup \{0\}$,
where $S_\zeta$ is a set of non-negative integers from zero to ζ − 1, that is, $S_\zeta = \{m\}_{m=0}^{\zeta-1}$. For example, if we set η = 0.1, δ = 0.4, κ = 4, and ζ = 3, then $S_{\eta,\delta,\kappa,\zeta} = \{-2.0, -0.8, -0.5, 0, 0.5, 0.8, 2.0\}$.
The intuition of this template is that the distribution of the feature weights in trained model often takes a form a similar to that of the 'power law' in the case of the large feature sets. Therefore, using an exponential function with a scale and bias seems to be appropriate for fitting them.
where η, κ, δ are non-negative real numbers
The distribution of weights generally tends to follow a power law
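The template for S can be reproduced with a short sketch. It reads the template as f(x, y) = y(η κ^x + δ) with x in {0, ..., ζ − 1} and y in {−1, 1}, plus the value 0; this reading is an interpretation of the excerpt, chosen because it reproduces the example values given there. The function name `discrete_value_set` is invented for this illustration.

```python
def discrete_value_set(eta, delta, kappa, zeta):
    """Build S = {y * (eta * kappa**x + delta)} for x < zeta, y = +/-1, plus 0."""
    positives = {eta * kappa ** x + delta for x in range(zeta)}
    negatives = {-p for p in positives}
    return sorted(negatives | {0.0} | positives)


# Parameter values from the example in the excerpt.
S = discrete_value_set(eta=0.1, delta=0.4, kappa=4, zeta=3)
print([round(v, 6) for v in S])

# The number of non-zero values is at most 2 * zeta (positive and negative sides).
print(len(S) - 1)
```

Because the positive values grow as powers of κ, most elements of S sit close to zero and a few are large, which matches the power-law intuition stated for the template.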
[Figure 3: Performance vs. degree of freedom in trained model for the development data. Panels: (a) NER, (b) DEPAR. x-axis: # of degrees of freedom (#DoF) [log-scale]. Legend entries include L1CRF (w/ QT) and L1RDA (w/ QT).]
Note that we can control the upper bound of #DoF in trained model by ζ, namely if ζ = 4 then the upper bound of #DoF is 8 (doubled by positive and negative sides). We fixed ρ = 1, ξ = 1, [...] = 0, κ = 4 (or 2 if ζ ≥ 5), δ = η/2 in all experiments. Thus the only tunable parameter in our [...]
Table 1: Comparison results of the methods on test data (K: thousand, M: million)

                     Test              Model complex.
NER                  COMP     F-sc     #nzF     #DoF
L2CRF                84.88    89.97    61.6M    38.6M
L1CRF                84.85    89.99    614K     321K
  (w/ QT ζ=4)        78.39    85.33    568K     8
  (w/ QT ζ=2)        73.40    81.45    454K     4
  (w/ QT ζ=1)        65.53    75.87    454K     2
DC-ADMM (ζ=4)        84.96    89.92    643K     8
        (ζ=2)        84.04    89.35    455K     4
        (ζ=1)        83.06    88.62    364K     2

                     Test              Model complex.
L2PA                 49.67    93.51    15.5M    5.59M
L1RDA                49.54    93.48    7.76M    3.56M
  (w/ QT ζ=4)        38.58    90.85    6.32M    8
  (w/ QT ζ=2)        34.19    89.42    3.08M    4
  (w/ QT ζ=1)        30.42    88.67    3.08M    2
DC-ADMM (ζ=4)        49.83    93.55    5.81M    8
        (ζ=2)        48.97    93.18    4.11M    4
        (ζ=1)        46.56    92.86    6.37M    2
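✦ Machine learning for NLP tasks has high model complexity because of the large number of weights
✦ To hold down the complexity, the weights were grouped.
✦ The NP-hard problem that arises from grouping was solved with dual decomposition
✦ Named Entity Recognition task and Dependency [...]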

Normal Labour/ Stages of Labour/ Mechanism of Labour
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
JEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questionsJEE1_This_section_contains_FOUR_ questions
JEE1_This_section_contains_FOUR_ questions
Azure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHatAzure Interview Questions and Answers PDF By ScholarHat
Azure Interview Questions and Answers PDF By ScholarHat
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network

Slides for "Supervised Model Learning with Feature Grouping based on a Discrete Constraint"

  • 1. An introduction to "Supervised Model Learning with Feature Grouping based on a Discrete Constraint" (Suzuki+, ACL, 2013) 03/10/12 kensuke-mi
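  • 3. Review. L1 regularization penalizes the sum of the absolute values of the parameter vector's elements (a penalty proportional to $\sum_k |w_k|$); L2 regularization penalizes the sum of their squares (proportional to $\sum_k |w_k|^2$). With L2 regularization, the penalty is already almost 0 once w_k gets close to 0. With L1 regularization, the penalty becomes 0 only when the value is exactly 0. The L1 norm therefore pushes harder to reduce the number of non-zero elements in the parameter vector.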
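A minimal Python sketch of the point on slide 3, not taken from the slides or the paper. It compares the two penalties and their standard proximal (shrinkage) steps; the function names, the constant `c`, and the threshold `t` are illustrative assumptions.

```python
import math

def l1_penalty(w, c=1.0):
    """c * sum_k |w_k|"""
    return c * sum(abs(x) for x in w)

def l2_penalty(w, c=1.0):
    """c * sum_k |w_k|^2"""
    return c * sum(x * x for x in w)

def prox_l1(w, t):
    # Soft-thresholding: argmin_x t*|x| + 0.5*(x - v)^2.
    # Any weight with |v| <= t becomes exactly zero.
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in w]

def prox_l2(w, t):
    # argmin_x t*x^2 + 0.5*(x - v)^2 = v / (1 + 2t).
    # Weights shrink but never reach zero unless they started there.
    return [v / (1.0 + 2.0 * t) for v in w]

weights = [0.05, -0.3, 1.2]
after_l1 = prox_l1(weights, t=0.1)
after_l2 = prox_l2(weights, t=0.1)
zeros_l1 = sum(1 for x in after_l1 if x == 0.0)
zeros_l2 = sum(1 for x in after_l2 if x == 0.0)
```

Here the small weight 0.05 is removed by the L1 step but only scaled down by the L2 step, which is the behavior the slide describes.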
  • 5. So, how do we group the features? (The slide shows cropped excerpts from Sections 1 to 3 of the paper; only the recoverable content is given here.) The original learning problem is $\hat{w} = \arg\min_w O(w; D)$ with $O(w; D) = L(w; D) + \Omega(w)$ (Eq. 1), where D is supervised training data made of input/output pairs $(x, y) \in D$, w is an N-dimensional vector of feature weights, L is the loss function, and Ω is the regularization term. Automatic feature grouping, e.g. fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008), looks for accurate models with fewer degrees of freedom: $\hat{w}_1 = (0.1, 0.5, 0.1, 0.5, 0.1)$ is preferred over $\hat{w}_2 = (0.1, 0.3, 0.2, 0.5, 0.3)$ because they have two and four unique values, respectively. To get there, define a finite set S of weight values; for the range from -4 to 4, for example, S = {-4, -3, -2, ..., 3, 4}. The weights are then chosen from $S^N$, the Cartesian power of S (not the power set): $O(w; D) = L(w; D) + \Omega(w)$ s.t. $w \in S^N$ (Eq. 2).
  • 6. So, how do we group the features? (Same paper excerpts and notes as slide 5.) In other words: build a set of weight values and choose every weight from that set. This act is the grouping.
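As a small illustration of what "grouping" means here, the sketch below groups feature indices by their shared weight value, using the two example vectors quoted from the paper. The helper name `group_by_value` is hypothetical.

```python
from collections import defaultdict

def group_by_value(weights):
    """Map each distinct weight value to the indices of the features that share it."""
    groups = defaultdict(list)
    for index, value in enumerate(weights):
        groups[value].append(index)
    return dict(groups)

w1 = (0.1, 0.5, 0.1, 0.5, 0.1)
w2 = (0.1, 0.3, 0.2, 0.5, 0.3)

groups_w1 = group_by_value(w1)  # two groups: features {0, 2, 4} and {1, 3}
groups_w2 = group_by_value(w2)  # four groups
```

Fewer groups means fewer distinct values to store, which is why the first vector is the preferred one.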
  • 7. However, this problem cannot be solved directly. (Cropped paper excerpt, Eq. 2: $O(w; D) = L(w; D) + \Omega(w)$ s.t. $w \in S^N$, where $S^N$ is the Cartesian power of S. Each feature weight of a trained model must take a value in S, that is, $\hat{w}_n \in S$ for $n \in \{1, \dots, N\}$, so the weights are grouped automatically.) Because $S^N$ grows enormously, the optimization runs into a combinatorial explosion; in other words, the problem is NP-hard. The paper therefore introduces dual decomposition to solve it.
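  • 8. Review of dual decomposition. Dual decomposition is, in short, a way of solving an NP-hard problem by splitting it into parts. 1. argmax_y ( g(y) + h(y) ) is NP-hard, so it cannot be solved directly. 2. Instead, decompose the problem as argmax_{z,y} ( g(z) + h(y) ) s.t. z = y. 3. Apply Lagrangian relaxation to the problem in step 2 to define a new function L(u). If L* is the optimal value of the problem in step 2, then by duality L* <= min_u L(u), with equality when the two subproblem solutions agree. 4. L(u) is convex, so it can be minimized with a (sub)gradient method, using the update u := u - µ(y* - z*). 5. When y* = z*, the problem in step 2 is solved. For details, see …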
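The five steps can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's setup: y and z range over five labels, g and h are made-up score tables, the multiplier u has one entry per label (so y* and z* are treated as indicator vectors), and the Lagrangian is taken as L(u) = max_z [g(z) - u_z] + max_y [h(y) + u_y]. A constant step size is used for simplicity; in general it does not guarantee convergence.

```python
def dual_decompose(g, h, step=0.5, max_iter=100):
    """Subgradient descent on L(u) = max_z (g[z] - u[z]) + max_y (h[y] + u[y])."""
    n = len(g)
    u = [0.0] * n
    for _ in range(max_iter):
        # Solve the two subproblems independently.
        z_star = max(range(n), key=lambda i: g[i] - u[i])
        y_star = max(range(n), key=lambda i: h[i] + u[i])
        if y_star == z_star:
            # The subproblems agree, so the constraint z = y holds.
            return y_star
        # u := u - step * (indicator(y*) - indicator(z*))
        u[y_star] -= step
        u[z_star] += step
    return None  # no agreement within max_iter

g = [0, 3, 1, 4, 2]   # hypothetical scores
h = [2, 0, 3, 1, 0]   # hypothetical scores

solution = dual_decompose(g, h)
brute_force = max(range(len(g)), key=lambda i: g[i] + h[i])
```

On this toy instance the two subproblems come to agree on the same label that exhaustive search finds. The toy is small enough to brute-force; the technique matters when the joint problem is not.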
  • 9. Applying dual decomposition in this paper. The original objective (Eq. 2) is reformulated with the dual decomposition technique (Everett, 1963): $O(w, u; D) = L(w; D) + \Omega(w) + \Upsilon(u)$ s.t. $w = u$ and $u \in S^N$ (Eq. 3). Compared with Eq. 2, Eq. 3 has an additional term Υ(u), which appears to be similar to the regularizer Ω(w) (from Sec. 3.1); w and u are tied together by the equality constraint w = u. Υ could be anything, but this paper considers only $\Upsilon(u) = \frac{\lambda_2}{2}\|u\|_2^2 + \lambda_1\|u\|_1$ with $\lambda_2 \ge 0$ and $\lambda_1 \ge 0$. The objective can also be viewed as separating the standard loss minimization problem of Eq. 1 from the additional discrete constraint. (Further cropped excerpts on the slide: Eq. 3 is solved with the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011); α denotes the dual variables for the constraint w = u; ADMM adds an augmented Lagrangian term on $\|w - u\|_2^2$ with ρ > 0; and it iteratively updates one of the three variable sets w, u, and α while holding the others fixed, for t = 1, 2, ... until convergence.)
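  • 10. Parameter updates and optimization ✦ The decomposed problem is solved with an algorithm called ADMM (apparently a commonly used algorithm for dual decomposition). ✦ See Sec. 3.1 for the details of the parameter updates (they are probably gradient-style updates). ✦ The computational cost is kept to O(N log |S|). ✦ Online learning can be used inside ADMM for a speed-up (Sec. 3.3).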
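A toy sketch of the w / u / α alternation, under loud assumptions: this is not the paper's algorithm. It uses a made-up quadratic loss L(w) = 0.5 * ||w - target||^2, sets Ω and Υ to zero, and uses the scaled-dual form of ADMM (variable `a`). The u-update is a projection onto S done by binary search, which is one way an O(N log |S|) cost per pass can arise. Because the constraint is non-convex, this is a heuristic and may not converge.

```python
from bisect import bisect_left

def project_to_set(x, sorted_s):
    """Return the element of sorted_s nearest to x, in O(log |S|)."""
    i = bisect_left(sorted_s, x)
    candidates = sorted_s[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s - x))

def dc_admm_toy(target, sorted_s, rho=1.0, iters=50):
    n = len(target)
    u = [0.0] * n   # discrete copy of the weights
    a = [0.0] * n   # scaled dual variable for the constraint w = u
    w = list(target)
    for _ in range(iters):
        # w-update: argmin_w 0.5*(w - t)^2 + (rho/2)*(w - u + a)^2 (closed form)
        w = [(t + rho * (ui - ai)) / (1.0 + rho)
             for t, ui, ai in zip(target, u, a)]
        # u-update: project w + a onto S, element by element
        u = [project_to_set(wi + ai, sorted_s) for wi, ai in zip(w, a)]
        # dual update
        a = [ai + wi - ui for ai, wi, ui in zip(a, w, u)]
    return w, u

S = [-2.0, -0.8, -0.5, 0.0, 0.5, 0.8, 2.0]
w_final, u_final = dc_admm_toy([0.45, -1.9, 0.03, 0.82], S)
```

Whatever the continuous weights `w_final` end up as, every entry of `u_final` is an element of S, so the discrete model can be read off from u.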
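  • 11. Evaluation experiments on two tasks, along two axes ✦ Two NLP tasks were evaluated: Named Entity Recognition (NER) and Dependency Parsing (DEPAR). ✦ Accuracy metrics: Complete Sentence Accuracy (COMP), presumably exact match of the whole sentence; F-score (F-sc) for the NER task; UAS (unlabeled attachment score, i.e. accuracy of unlabeled edges) for the DEPAR task. ✦ Model complexity metrics: #nzF, the number of features whose weight is non-zero; #DoF, the number of unique non-zero weight values.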
  • 12. Defining the weight set S (Sec. 4.1). S can be any finite set, but it has to be chosen carefully because it strongly affects performance. The paper introduces the following template, which it describes as suitable for large feature sets. Let η, δ, and κ be non-negative real constants and ζ a positive integer, and define $f_{\eta,\delta,\kappa}(x, y) = y(\eta\kappa^x + \delta)$. Then $S_{\eta,\delta,\kappa,\zeta} = \{ f_{\eta,\delta,\kappa}(x, y) \mid (x, y) \in S_\zeta \times \{-1, 1\} \} \cup \{0\}$, where $S_\zeta = \{m\}_{m=0}^{\zeta-1}$ is the set of non-negative integers from 0 to ζ-1. For example, with η = 0.1, δ = 0.4, κ = 4, and ζ = 3, $S_{\eta,\delta,\kappa,\zeta} = \{-2.0, -0.8, -0.5, 0, 0.5, 0.8, 2.0\}$. Rationale for the template: with large feature sets, the distribution of feature weights in a trained model tends to follow something like a power law, so an exponential function with a scale and a bias is used to fit it. Incidentally, #DoF can be controlled through ζ.
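The template can be written out directly. A minimal sketch, assuming the reconstruction f(x, y) = y(η κ^x + δ), which is consistent with the worked example on the slide; the function name `build_value_set` is hypothetical.

```python
def build_value_set(eta, delta, kappa, zeta):
    """Build S = { y * (eta * kappa**x + delta) : x in 0..zeta-1, y in {-1, 1} } plus {0}."""
    positives = [eta * kappa ** x + delta for x in range(zeta)]
    values = {0.0}
    for p in positives:
        values.add(p)
        values.add(-p)
    return sorted(values)

# The example from the paper excerpt: eta = 0.1, delta = 0.4, kappa = 4, zeta = 3.
S = build_value_set(eta=0.1, delta=0.4, kappa=4, zeta=3)
non_zero_values = [s for s in S if s != 0.0]
```

With these parameters S has 2ζ = 6 non-zero values plus zero, which matches the remark that ζ bounds #DoF.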
  • 13. Experimental results
    Figure 3: Performance vs. degrees of freedom in the trained model for the development data. Panels: (a) NER, (b) DEPAR. Axes: Complete Sentence Accuracy against quantized # of degrees of freedom (#DoF) [log-scale]. Curves: DC-ADMM, L1CRF (w/ QT), L1CRF, L2CRF for NER; DC-ADMM, L1RDA (w/ QT), L1RDA, L2PA for DEPAR.
    Paper excerpt (cropped): the upper bound of #DoF in the trained model can be controlled by ζ; if ζ = 4, the upper bound of #DoF is 8 (doubled by the positive and negative sides). ρ = 1 and ξ = 1 were among the values fixed in all experiments.
    Table 1: Comparison results of the methods on test data (K: thousand, M: million)

    NER                    COMP   F-sc   #nzF   #DoF
    L2CRF                  84.88  89.97  61.6M  38.6M
    L1CRF                  84.85  89.99  614K   321K
    L1CRF (w/ QT, ζ=4)     78.39  85.33  568K   8
    L1CRF (w/ QT, ζ=2)     73.40  81.45  454K   4
    L1CRF (w/ QT, ζ=1)     65.53  75.87  454K   2
    DC-ADMM (ζ=4)          84.96  89.92  643K   8
    DC-ADMM (ζ=2)          84.04  89.35  455K   4
    DC-ADMM (ζ=1)          83.06  88.62  364K   2

    DEPAR                  COMP   UAS    #nzF   #DoF
    L2PA                   49.67  93.51  15.5M  5.59M
    L1RDA                  49.54  93.48  7.76M  3.56M
    L1RDA (w/ QT, ζ=4)     38.58  90.85  6.32M  8
    L1RDA (w/ QT, ζ=2)     34.19  89.42  3.08M  4
    L1RDA (w/ QT, ζ=1)     30.42  88.67  3.08M  2
    DC-ADMM (ζ=4)          49.83  93.55  5.81M  8
    DC-ADMM (ζ=2)          48.97  93.18  4.11M  4
    DC-ADMM (ζ=1)          46.56  92.86  6.37M  2

    On both NER and DEPAR, the method reached accuracy comparable to the baselines while keeping model complexity down.
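  • 14. Summary ✦ Machine learning for NLP tasks tends toward high model complexity because of the sheer number of weights. ✦ To keep complexity down, the weights were grouped. ✦ The NP-hard problem that grouping introduces was handled with dual decomposition. ✦ On the Named Entity Recognition and Dependency Parsing tasks, complexity was reduced while accuracy was maintained.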